Complete guide to scatter plot using ggplot2


R is well suited to data visualization, and ggplot2 is one of its most popular visualization packages. Hadley Wickham created ggplot2 in 2005 as an implementation of Leland Wilkinson’s Grammar of Graphics. It breaks a graph into semantic components such as scales and layers.

After reading this post, you’ll be able to create beautiful scatter plots like the one below.

Getting Started

Install the package using install.packages("ggplot2").

If it is already installed on your computer, load it -

library("ggplot2")

Load the data set, converting strings to factors so the categorical columns are ready to use -

college <- read.csv('Data/college.csv', stringsAsFactors = TRUE)

You can get the data set here
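
If the CSV file is not at hand, a small stand-in data frame is enough to try the examples in this post. The sketch below is only an illustration: the name college_demo is hypothetical, the column names follow the ones used here, and every value is invented -

```r
# Hypothetical stand-in for college.csv: column names follow this post,
# but every value below is made up for illustration.
college_demo <- data.frame(
  name       = c("School A", "School B", "School C", "School D"),
  control    = c("Public", "Private", "Public", "Private"),
  tuition    = c(9500, 42000, 12000, 51000),
  sat_avg    = c(1020, 1310, 1105, 1420),
  undergrads = c(21000, 4300, 15500, 6100),
  stringsAsFactors = TRUE  # as in read.csv(): strings become factors
)

# Check that the categorical column came through as a factor
str(college_demo)
levels(college_demo$control)
```

Once the real file is loaded, use college in place of college_demo.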


Calling ggplot() function alone just creates a blank canvas -

ggplot()

To create a scatter plot, add a point layer to the ggplot object. One way is the layer() function with the argument geom = 'point' -

ggplot(data = college) +
  layer(geom = 'point', stat = "identity", position = "identity",
        mapping = aes(x = tuition, y = sat_avg))

The easier and more widely used way of adding a layer is a geom_*() function -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg))
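
The aesthetic mapping can also be given once in ggplot(), in which case the layers inherit it. A minimal sketch, assuming a made-up data frame (college_demo is hypothetical and its values are invented) -

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration
college_demo <- data.frame(
  tuition = c(9500, 42000, 12000, 51000),
  sat_avg = c(1020, 1310, 1105, 1420)
)

# Mapping given once in ggplot(); geom_point() inherits it
p_global <- ggplot(data = college_demo,
                   mapping = aes(x = tuition, y = sat_avg)) +
  geom_point()

# Same plot with the mapping given in the layer, as in this post
p_layer <- ggplot(data = college_demo) +
  geom_point(mapping = aes(x = tuition, y = sat_avg))

p_global
```

The global form saves repetition when several layers share the same x and y. This post keeps the mapping inside geom_point() throughout.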

Shape

You can change the shape of the points from the default black dot to something else. For example -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg),
             shape = 1)

You can use different shapes for different values/levels. For example, our college data has a column named control that records whether a school is public or private.

So if you want to differentiate public vs. private schools by shape, you can do that using the shape argument inside the aesthetic mapping -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, shape = control))
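
When shape is mapped to a variable, the shapes themselves can be picked with scale_shape_manual(). The sketch below assumes a made-up data frame and assumes the levels of control are "Private" and "Public"; the shape codes 1 and 17 are an arbitrary choice -

```r
library(ggplot2)

# Hypothetical toy data; values and level names are assumptions
college_demo <- data.frame(
  tuition = c(9500, 42000, 12000, 51000),
  sat_avg = c(1020, 1310, 1105, 1420),
  control = factor(c("Public", "Private", "Public", "Private"))
)

p_shape <- ggplot(data = college_demo) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, shape = control)) +
  # 1 = open circle, 17 = filled triangle
  scale_shape_manual(values = c(Private = 1, Public = 17))

p_shape
```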

Color

You can change the color of the points from the default black to something else. For example -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg),
             color = 'darkorchid1')

You can list all the built-in color names by running colors().
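
Since colors() returns a plain character vector, it can be searched like any other. A short sketch (the search term "orchid" is only an example) -

```r
# colors() is part of base R (grDevices), so no extra package is needed
all_colors <- colors()

length(all_colors)   # how many color names are available
head(all_colors)     # the first few names

# Names containing "orchid"
grep("orchid", all_colors, value = TRUE)
```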

Similar to changing shape based on the levels of a variable, you can also change color. For this you need to pass the color argument inside the aesthetic mapping specifying the variable name -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, color = control))

Now you can clearly see how the private and public schools are performing.

Manually Changing Color

You can assign colors of your choice to the plot using the function scale_color_manual() -

manu_colors <- c("#FF8C32", "#06113C")
ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, color = control))+
  scale_color_manual(values = manu_colors)
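
With an unnamed vector, the colors are matched to the factor levels by position. Passing a named vector ties each color to a level explicitly. The sketch below assumes the levels of control are "Private" and "Public" and uses a made-up data frame; manu_colors_named is a hypothetical name -

```r
library(ggplot2)

# Hypothetical toy data; values and level names are assumptions
college_demo <- data.frame(
  tuition = c(9500, 42000, 12000, 51000),
  sat_avg = c(1020, 1310, 1105, 1420),
  control = factor(c("Public", "Private", "Public", "Private"))
)

# Names must match the factor levels exactly
manu_colors_named <- c(Private = "#FF8C32", Public = "#06113C")

p_named <- ggplot(data = college_demo) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, color = control)) +
  scale_color_manual(values = manu_colors_named)

p_named
```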

You can hide the legend using the argument show.legend outside of the aesthetic mapping -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, color = control),
             show.legend = FALSE)+
  scale_color_manual(values = manu_colors)

colourpicker Addin for choosing color

View this link for details on how to install and use this.

CPCOLS <- c("#8B0A50", "#9A32CD")

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, color = control))+
  scale_color_manual(values=CPCOLS)

Point Size

You can change the size of the points from the default to something else. For example -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg), size = 2)

Let’s scale the size of each point according to the number of undergraduates at that school -

ggplot(data = college) +
  geom_point(aes(x = tuition, y = sat_avg, size = undergrads))
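
When size is mapped to a variable, the smallest and largest point sizes can be tuned with scale_size(). A minimal sketch, assuming a made-up data frame and an arbitrary range of 1 to 8 -

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration
college_demo <- data.frame(
  tuition    = c(9500, 42000, 12000, 51000),
  sat_avg    = c(1020, 1310, 1105, 1420),
  undergrads = c(21000, 4300, 15500, 6100)
)

p_size <- ggplot(data = college_demo) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, size = undergrads)) +
  # smallest school is drawn at size 1, largest at size 8
  scale_size(range = c(1, 8))

p_size
```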

Adding Transparency/Alpha

The transparency of the points can be controlled using the argument alpha outside of the aesthetic mapping -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, size = undergrads),
             alpha = 0.35)

alpha takes values from 0 to 1.

Notice how the transparency of the points in the legend also changes. To remove the transparency from the legend, we can use the guides() function to override the aesthetic of the legend keys -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                           size = undergrads), 
             alpha = 0.35) +
  guides(size = guide_legend(override.aes = list(alpha = 1)))

Title & Subtitle

Add a title and subtitle using ggtitle() -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  ggtitle("SAT Average score VS Tuition Fee",
          subtitle = "A comparison study")

I prefer using labs() because it leaves more room for customization, for example changing the legend titles -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study")

Alignment of Title & Subtitle

To align the title and subtitle in the middle you can customize the theme manually -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study",
       color = "Control", size = "No. of Undergrads") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

By default the title and subtitle are aligned relative to the panel. To align them relative to the whole plot instead, set the plot.title.position argument to "plot" inside theme() -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study",
       color = "Control", size = "No. of Undergrads") +
  theme(plot.title.position = "plot", 
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

Axis Labels

Like ggtitle(), there are functions called xlab() and ylab() that can be used to change axis labels -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  xlab("Tuition Fees") +
  ylab("SAT Average Score")

But the labs() function with arguments x and y may seem more convenient -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study",
       color = "Control", size = "No. of Undergrads",
       x = "Tuition Fees", y = "SAT Average Score")

Axis Limits

Using xlim() and ylim() you can specify the axis limits -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study",
       color = "Control", size = "No. of Undergrads",
       x = "Tuition Fees", y = "SAT Average Score") +
  xlim(0, 60000) + ylim(700, 1500)

When the axes are restricted this way, points that fall outside the limits are removed from the plot.

The expand_limits() function also sets axis limits, but it never removes points: it only widens the limits so that they include the given values along with all the data -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study",
       color = "Control", size = "No. of Undergrads",
       x = "Tuition Fees", y = "SAT Average Score") +
  expand_limits(x = c(0, 60000), y = c(700, 1500))
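
Another option, not used elsewhere in this post, is coord_cartesian(), which zooms the view to the given limits without dropping any rows from the data. A minimal sketch, assuming a made-up data frame in which one school has tuition above 60000 -

```r
library(ggplot2)

# Hypothetical toy data; one tuition value lies above 60000 on purpose
college_demo <- data.frame(
  tuition = c(9500, 42000, 12000, 65000),
  sat_avg = c(1020, 1310, 1105, 1420)
)

# Rows that xlim(0, 60000) would drop because they fall outside the limits
outside <- college_demo$tuition < 0 | college_demo$tuition > 60000
sum(outside)

# coord_cartesian() only zooms the view; every row stays in the layer data
p_zoom <- ggplot(data = college_demo) +
  geom_point(mapping = aes(x = tuition, y = sat_avg)) +
  coord_cartesian(xlim = c(0, 60000), ylim = c(700, 1500))

p_zoom
```

The difference matters mostly when a layer computes something from the data, such as a smoothing line: with xlim() the dropped points no longer contribute, with coord_cartesian() they still do.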

scale_*_continuous()

This function can be used to manually set the axis labels, breaks, limits and many more. For example -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study",
       color = "Control", size = "No. of Undergrads") +
  scale_x_continuous(name = "Tuition Fees", 
                     limits = c(0, 56000),  # to change limit
                     breaks = seq(0, 56000, by = 8000), # to specify breaks
                     labels = scales::dollar  
                     ) +
  scale_y_continuous(name = "SAT Average Score")

To know more run ?scale_x_continuous.
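
The labels argument accepts a function that turns break values into strings, which is what scales::dollar does in the example above. The sketch below calls it directly and adds a hypothetical helper, in_thousands(), to show that a hand-written function should work the same way -

```r
library(scales)

# The same breaks as in the scale_x_continuous() call above
breaks <- seq(0, 56000, by = 8000)

# Formatted labels as a character vector
dollar(breaks)

# Hypothetical helper (not part of any package): show values in thousands
in_thousands <- function(x) paste0(x / 1000, "k")
in_thousands(breaks)
```

Passing labels = in_thousands to scale_x_continuous() would then label the axis in thousands instead of dollars.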


Caption

The caption appears in the bottom-right, and is often used for sources, notes or copyright -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study",
       color = "Control", size = "No. of Undergrads",
       x = "Tuition Fees", y = "SAT Average Score",
       caption = "Source: U.S. Department of Education") +
  theme(plot.caption.position = "plot")

Tag

The plot tag appears at the top-left, and is typically used for labelling a subplot -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study",
       color = "Control", size = "No. of Undergrads",
       x = "Tuition Fees", y = "SAT Average Score",
       tag = "A")

Legend

You can also customize the legends!

Legend Title

Using the labs() function to change the titles of the legends -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study",
       # changing legend titles
       color = "Control", size = "No. of Undergrads"
       )

Another way to change the titles is using guides() function -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study") +
  guides(color = guide_legend(title = "Control"),
         size = guide_legend(title = "No. of Undergrads"))

To hide the legend titles use element_blank() -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       x = "Tuition Fees", y = "SAT Average Score") +
  theme(legend.title = element_blank())

Legend Position & Box Direction

To place the legends at the bottom, stacked one under another, with the margin reduced -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       x = "Tuition Fees", y = "SAT Average Score") +
  theme(legend.position = "bottom",
        legend.box = "vertical",
        legend.margin=margin())

The legend.position argument takes the values: right, left, bottom, top, none.

Another example -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       x = "Tuition Fees", y = "SAT Average Score") +
  theme(legend.position = "right",
        legend.box = "horizontal",
        legend.margin=margin())

To justify the contents of a legend’s box use legend.box.just argument -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study",
       color = "Control", size = "No. of Undergrads",
       x = "Tuition Fees", y = "SAT Average Score") +
  theme(legend.position = "bottom",
        legend.box = "vertical",
        legend.margin = margin(),
        legend.box.just = "left")

To hide the legend -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       x = "Tuition Fees", y = "SAT Average Score") +
  theme(legend.position = "none")

Legend Order

Using the guides() function, you can set the order in which the legends are shown.

For example -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       x = "Tuition Fees", y = "SAT Average Score") +
  guides(colour = guide_legend(order = 1),
         size = guide_legend(order = 2))

With the order values swapped, the legends appear in the opposite order -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(title = "SAT Average score VS Tuition Fee",
       x = "Tuition Fees", y = "SAT Average Score") +
  guides(colour = guide_legend(order = 2),
         size = guide_legend(order = 1))

Customizing Theme

Use element_rect() with the fill argument to change the colors of different parts of the plot -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(x = "Tuition Fees", y = "SAT Average Score",
       title = "SAT Average score VS Tuition Fee",
       subtitle = "A comparison study",
       caption = "Source: U.S. Department of Education",
       tag = "A"
       ) +
  theme(plot.caption.position = "plot",
        plot.background = element_rect(fill='#E2D784'),
        panel.background = element_rect(fill = '#E5EFC1'),
        legend.background = element_rect(fill = '#E2D784'),
        legend.key = element_rect(fill = "#E2D784")
        ) +
  guides(color = guide_legend(override.aes = list(alpha = 1, size = 4)),
         size = guide_legend(override.aes = list(alpha = 1)))

Use element_blank() to remove the panel background; the default grid lines are white, so they disappear along with it -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(x = "Tuition Fees", y = "SAT Average Score",
       title = "SAT Average score VS Tuition Fee")+
  theme(panel.background = element_blank())

Axis grids

Showing both grids in a single color using panel.grid.major -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(x = "Tuition Fees", y = "SAT Average Score",
       title = "SAT Average score VS Tuition Fee")+
  theme(panel.background = element_blank(),
        panel.grid.major = element_line("grey"))

Showing only the X axis grid using panel.grid.major.x -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(x = "Tuition Fees", y = "SAT Average Score",
       title = "SAT Average score VS Tuition Fee")+
  theme(panel.background = element_blank(),
        panel.grid.major.x = element_line("grey"))

Similarly Y axis grid -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(x = "Tuition Fees", y = "SAT Average Score",
       title = "SAT Average score VS Tuition Fee")+
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line("grey"))

To hide grids use element_blank() -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(x = "Tuition Fees", y = "SAT Average Score",
       title = "SAT Average score VS Tuition Fee")+
  theme(panel.background = element_blank(),
        panel.grid.major = element_blank())
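
Major and minor grid lines can be styled separately, and element_line() takes a linetype in addition to a colour. A minimal sketch, assuming a made-up data frame; the dashed grey style is an arbitrary choice -

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration
college_demo <- data.frame(
  tuition = c(9500, 42000, 12000, 51000),
  sat_avg = c(1020, 1310, 1105, 1420)
)

p_grid <- ggplot(data = college_demo) +
  geom_point(mapping = aes(x = tuition, y = sat_avg)) +
  theme(panel.background = element_blank(),
        # dashed grey lines at the major breaks
        panel.grid.major = element_line(colour = "grey", linetype = "dashed"),
        # no lines between the major breaks
        panel.grid.minor = element_blank())

p_grid
```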

Using Predefined Themes

Built-in themes available in ggplot2, besides the default theme_gray():

  • theme_bw()
  • theme_minimal()
  • theme_linedraw()
  • theme_light()
  • theme_dark()
  • theme_classic()
  • theme_void()
  • theme_test()

Using classic theme -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(x = "Tuition Fees", y = "SAT Average Score",
       title = "SAT Average score VS Tuition Fee") +
  theme_classic()

Using minimal theme -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(x = "Tuition Fees", y = "SAT Average Score",
       title = "SAT Average score VS Tuition Fee") +
  theme_minimal()

Using black-and-white theme -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(x = "Tuition Fees", y = "SAT Average Score",
       title = "SAT Average score VS Tuition Fee") +
  theme_bw()
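
Instead of adding a theme to every plot, theme_set() can make one the default for the rest of the session. The built-in themes also accept a base_size argument for the base font size. A minimal sketch, assuming a made-up data frame; the size 14 is an arbitrary choice -

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration
college_demo <- data.frame(
  tuition = c(9500, 42000, 12000, 51000),
  sat_avg = c(1020, 1310, 1105, 1420)
)

# theme_set() returns the previous default so it can be restored later
old_theme <- theme_set(theme_minimal(base_size = 14))

# No theme is added here; the plot picks up the session default
p_default <- ggplot(data = college_demo) +
  geom_point(mapping = aes(x = tuition, y = sat_avg))
p_default

# Restore the earlier default
theme_set(old_theme)
```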

More themes can be found in the ggthemes package. Install it with install.packages("ggthemes") if needed, then load it -

library(ggthemes)

Details on the themes can be found here.

Using the theme solarized -

ggplot(data = college) +
  geom_point(mapping = aes(x = tuition, y = sat_avg, 
                         color = control, size = undergrads), 
             alpha = 0.35) +
  labs(x = "Tuition Fees", y = "SAT Average Score",
       title = "SAT Average score VS Tuition Fee") +
  theme_solarized()

Md Ahsanul Islam
Freelance Data Analysis and R Programmer

Statistics graduate student currently researching on econometrics