Introduction to ggplot2

Let’s face it, I am addicted to R, and this is partly (mostly) due to ggplot2 and RStudio. If you need an introduction to R, Coursera is a good place to start, as well as swirl. Of course, this introduction will be small compared to the documentation or the book, but I’ll make it enough to get you started.

Generating data

Thanks to what I learnt on Stack Overflow, I’m making sure that what I’m doing is reproducible: it’s always annoying to see a beautiful example that doesn’t work when you try to reproduce it. So we’ll generate our own data for this (who said mtcars?).

set.seed(123)
dataset <- data.frame(A = sample(x = 1:10, size = 1000, replace = TRUE),
                      B = sample(x = LETTERS[1:5], size = 1000, replace = TRUE),
                      C = rnorm(n = 1000, mean = 0, sd = 1))
dataset$D <- rnorm(n = 1000, mean = dataset$C, sd = 1)

We thus have a data frame with four columns: A contains integers, B contains letters, and C and D contain numeric values.
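A quick way to check this is to inspect the data frame right after creating it. The sketch below rebuilds the same data so that it runs on its own; `column_types` is just a name chosen here for illustration. Note that B is stored as character in R 4.0 and later, and as a factor in older versions where `stringsAsFactors` defaulted to `TRUE`.

```r
# Rebuild the example data so that this block is self-contained.
set.seed(123)
dataset <- data.frame(
  A = sample(x = 1:10, size = 1000, replace = TRUE),
  B = sample(x = LETTERS[1:5], size = 1000, replace = TRUE),
  C = rnorm(n = 1000, mean = 0, sd = 1)
)
dataset$D <- rnorm(n = 1000, mean = dataset$C, sd = 1)

str(dataset)          # column types and a preview of the values
head(dataset, n = 3)  # first three rows

# Class of each column, as a named character vector.
column_types <- sapply(dataset, class)
print(column_types)
```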

plot vs qplot vs ggplot: non-standard evaluation

What is non-standard evaluation? It means that the arguments passed to the aesthetics are not evaluated right away; instead, they are used as names of columns in the data. For people used to the plot syntax, it can be disturbing at first, as you should not give vectors as input (we’ll see a workaround for that later). Let’s make this clear with an example of plot, qplot and ggplot:

plot(x= dataset$A, y = dataset$C)


qplot(x= dataset$C, y= dataset$D)
ggplot(data = dataset, aes(x=C,y=D))+geom_point()

Notice:

1. We never explicitly provided the vectors in the `ggplot` syntax, yet we have the same output as with `qplot`. 
The same syntax is used in `plyr` or `dplyr` (we'll talk about these another time). 
2. `ggplot` creates an object where characteristics are linked to values; you then have to use `+` to combine this object with a representation (`geom_point`
in our case) to get a plot! Think of it as an instruction of what you would like to plot.

This is what you should see: a scatter plot of D against C.

It doesn’t look too good but we’ll make it better.
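As for passing column names that are only known at run time, one possible workaround is the `.data` pronoun. This is a minimal sketch, assuming ggplot2 3.0.0 or later; `scatter_by_name()` is a hypothetical helper written for this example and is not part of ggplot2.

```r
library(ggplot2)

# Rebuild the two numeric columns so that this block is self-contained.
set.seed(123)
dataset <- data.frame(C = rnorm(n = 1000, mean = 0, sd = 1))
dataset$D <- rnorm(n = 1000, mean = dataset$C, sd = 1)

# Hypothetical helper: takes column names as strings and looks them up
# in the data with the .data pronoun instead of relying on bare names.
scatter_by_name <- function(data, x_name, y_name) {
  ggplot(data = data, aes(x = .data[[x_name]], y = .data[[y_name]])) +
    geom_point()
}

p <- scatter_by_name(dataset, "C", "D")
print(p)
```

The column names can now come from a variable, a loop, or user input, which is not possible when the names are typed directly inside aes().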

Editing your plot: basic operations

Changing the theme

By default, the background of a ggplot is grey, which gives a better experience on screens. However, for printing, or out of habit, it can be useful to have a black and white theme, or to edit the axes. This is feasible with theme(), but some complete themes are already available in ggplot2, such as theme_bw, theme_classic, or theme_minimal.

ggplot(data = dataset, aes(x=C,y=D))+geom_point()+theme_bw()


ggplot(data = dataset, aes(x=C,y=D))+geom_point()+theme_classic()


ggplot(data = dataset, aes(x=C,y=D))+geom_point()+theme_minimal()


Often, when the labels on the axes are quite long, it can be useful to rotate them. This is also handled in theme. We’ll keep the black & white theme, rotate the labels of the x axis by 45°, and make them bigger:

ggplot(data = dataset, aes(x=C,y=D))+geom_point()+theme_bw()+
theme(axis.text.x = element_text(angle = 45,size = 16))+
  xlab("X axis")+ylab("Y axis")


Notice how you can also change the labels of the axes with xlab and ylab. You don’t remember if it is x.ticks.axis or x.axis.ticks to change the ticks in your theme? Don’t worry, everything is listed in the documentation of theme() (it is axis.ticks.x).
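To make the element names concrete, here is a small sketch that changes both axis.text.x and axis.ticks.x in the same theme() call. The red ticks and the `hjust = 1` alignment are choices made for this illustration only.

```r
library(ggplot2)

# Rebuild the two numeric columns so that this block is self-contained.
set.seed(123)
dataset <- data.frame(C = rnorm(n = 1000, mean = 0, sd = 1))
dataset$D <- rnorm(n = 1000, mean = dataset$C, sd = 1)

p <- ggplot(data = dataset, aes(x = C, y = D)) +
  geom_point() +
  theme_bw() +
  theme(
    # hjust = 1 right-aligns the rotated labels under their ticks.
    axis.text.x = element_text(angle = 45, hjust = 1, size = 16),
    # The colour is only there to make the change easy to spot.
    axis.ticks.x = element_line(colour = "red")
  ) +
  xlab("X axis") + ylab("Y axis")
print(p)
```

Calling theme() after theme_bw() matters: a complete theme added later would overwrite these settings.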

Adding color and editing legend

Now that we have a theme, we will color the points based on values from column A.

ggplot(data = dataset, aes(x=C,y=D,color = A))+geom_point()+theme_bw()


We have a gradient of colors, but our data are integers? ggplot assumes that numerical data are continuous, so the best way to represent them is a gradient. To change that, you need to make this column a factor or a character. If you keep it numeric and add a discrete scale such as scale_color_discrete, you will get an error message:

Error: Continuous value supplied to discrete scale

dataset$A<-as.character(dataset$A)
ggplot(data = dataset, aes(x=C,y=D,color = A))+geom_point()+theme_bw()+
			scale_color_discrete("This is \n a legend")


I also provided the name of the legend for colors in scale_color_discrete, but this legend is sorted alphabetically… It would be better in numerical order:

dataset$A<-factor(dataset$A,levels  = 
				unique(dataset$A)[order(as.numeric(unique(dataset$A)))])
ggplot(data = dataset, aes(x=C,y=D,color = A))+geom_point()+theme_bw()+
			scale_color_discrete("This is \n a legend")

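The difference between the two orderings is easier to see on a tiny vector. This is a sketch in base R only; the names `values`, `alphabetical` and `ordered_values` are made up for the example, and the levels are built with sort() rather than order(), which gives the same result.

```r
values <- as.character(c(1, 2, 10, 3))

# Character values sort alphabetically, so "10" comes before "2".
alphabetical <- sort(values)
print(alphabetical)

# Giving the levels explicitly keeps the numerical order.
ordered_values <- factor(values, levels = sort(unique(as.numeric(values))))
print(levels(ordered_values))
```

ggplot uses the order of the factor levels for the legend, which is why reordering the levels reorders the legend.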

We’ll also add shapes to this plot:

ggplot(data = dataset, aes(x=C,y=D,color = A,shape = B))+geom_point()+theme_bw()+
			scale_color_discrete("This is \n a legend")+scale_shape_discrete("Shape legend")


Adding lines

What happens if we try to connect the previous points?

ggplot(data = dataset, aes(x=C,y=D,color = A))+geom_line()+theme_bw()+
			scale_color_discrete("This is \n a legend")


Let’s make it a bit more transparent so that we can see better:

ggplot(data = dataset, aes(x=C,y=D,color = A))+geom_line(alpha = I(0.5))+theme_bw()+
			scale_color_discrete("This is \n a legend")


Because alpha is set outside aes(), ggplot treats 0.5 as a constant transparency and not as a variable used to group different lines; the I function, which qplot needs for this purpose, is optional here. You can now see that each color has its own line! This is because ggplot has grouped the points by the values of color, shape, etc. It is useful to remember this when you want to have a single line with multiple colors.
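One way to get that single line with multiple colors is to override the implicit grouping with the group aesthetic. This is a minimal sketch that rebuilds the data with A already converted to a factor.

```r
library(ggplot2)

# Rebuild the data so that this block is self-contained.
set.seed(123)
dataset <- data.frame(
  A = factor(sample(x = 1:10, size = 1000, replace = TRUE)),
  C = rnorm(n = 1000, mean = 0, sd = 1)
)
dataset$D <- rnorm(n = 1000, mean = dataset$C, sd = 1)

# group = 1 puts every point in the same group, so all points are joined
# by one line whose segments take the color of their starting point.
p <- ggplot(data = dataset, aes(x = C, y = D, color = A, group = 1)) +
  geom_line(alpha = 0.5) +
  theme_bw()
print(p)
```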

Adding a trend

Now we know how to make a line, but we’d like a trend for all points and one trend for each color. We’ll try using geom_smooth:

ggplot(data = dataset, aes(x=C,y=D,color = A))+geom_smooth(alpha = I(0.3))+geom_point()+theme_bw()+
  scale_color_discrete("This is \n a legend")


Not very useful to have a trend for each class, is it? Remember, ggplot has grouped everything because of the color, so what happens if I remove it from the main aes() and only map it in geom_point?

ggplot(data = dataset, aes(x=C,y=D))+geom_smooth(alpha = I(0.3))+geom_point(aes(color = A))+theme_bw()+
  scale_color_discrete("This is \n a legend")


The trend has been calculated on the whole dataset, and we still have our points shown with different colors!
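Both kinds of trend can also live in the same plot, since each layer can have its own aesthetics. The sketch below swaps the default smoother for a straight linear fit (`method = "lm"`) to keep the picture readable; that choice, and the black color of the overall trend, are assumptions made for this example.

```r
library(ggplot2)

# Rebuild the data so that this block is self-contained.
set.seed(123)
dataset <- data.frame(
  A = factor(sample(x = 1:10, size = 1000, replace = TRUE)),
  C = rnorm(n = 1000, mean = 0, sd = 1)
)
dataset$D <- rnorm(n = 1000, mean = dataset$C, sd = 1)

p <- ggplot(data = dataset, aes(x = C, y = D)) +
  geom_point(aes(color = A), alpha = 0.3) +
  # One linear trend per color: the color mapping creates the groups.
  geom_smooth(aes(color = A), method = "lm", formula = y ~ x, se = FALSE) +
  # One linear trend for all points: no grouping aesthetic in this layer.
  geom_smooth(method = "lm", formula = y ~ x, color = "black") +
  theme_bw()
print(p)
```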

Showing density

We have shown the trend of the points, but sometimes it can be more useful to show their density, either one- or two-dimensional. Here we show a 2D density, but the same approach works in one dimension with geom_density. We also make the points more transparent:

ggplot(data = dataset, aes(x=C,y=D))+geom_density_2d()+geom_point(aes(color = A),alpha = I(0.3))+theme_bw()+
  scale_color_discrete("This is \n a legend")


You now know a lot about making plots with ggplot2. One last useful thing is to apply what we have seen to histograms.

Making a histogram

A histogram shows the number of values that fall into each bin; for example, in the sequence {0,1,5,7,2,10}, 3 values fall in [0;2].
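The count for this small sequence can be checked in base R. This is a sketch that assumes bins of width 2 starting at 0; the bin width is a choice made for the example.

```r
x <- c(0, 1, 5, 7, 2, 10)

# Bins are [0,2], (2,4], (4,6], (6,8] and (8,10]; plot = FALSE returns
# the computed histogram without drawing it.
bins <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)
print(bins$counts)  # 3 0 1 1 1
```

The first count is the 3 values (0, 1 and 2) that fall in [0;2].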

We are going to make a histogram of our data with 50 bins:

ggplot(data = dataset, aes(x=C))+geom_histogram(bins = 50)+theme_bw()


Editing the histogram

How do we add color to this plot? With the color argument again?

ggplot(data = dataset, aes(x=C,color = B))+geom_histogram(bins = 50)+theme_bw()


As you can see, only the edges of the bars have been colored. This is because areas are colored with the fill aesthetic.

ggplot(data = dataset, aes(x=C,fill = B))+geom_histogram(bins = 50)+theme_bw()+
  scale_fill_discrete("This is \n a fill ")


And now we will plot a density instead, just to remember what a Gaussian distribution looks like!

ggplot(data = dataset, aes(x=C,fill = B))+geom_density(alpha = 0.3)+theme_bw()+
  scale_fill_discrete("This is \n a fill ")


The smoothing here is a mixture of Gaussians, which explains the multiple peaks, but increasing the number of observations would make all the distributions converge to the same Gaussian distribution: we sampled every group from the same distribution at the beginning.
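A rough numerical check of this convergence is to compare the mean and standard deviation of each group for a small and a large sample. This is only a sketch: `group_summary()` is a hypothetical helper, and the sample sizes are arbitrary.

```r
set.seed(123)

# Hypothetical helper: draw n values from N(0, 1), assign each one to a
# random group A to E, and return the mean and sd of every group.
group_summary <- function(n) {
  groups <- sample(x = LETTERS[1:5], size = n, replace = TRUE)
  values <- rnorm(n = n, mean = 0, sd = 1)
  cbind(mean = tapply(values, groups, mean),
        sd = tapply(values, groups, sd))
}

small <- group_summary(1000)
large <- group_summary(100000)
print(round(small, 3))
print(round(large, 3))
```

With the larger sample, every group should end up much closer to a mean of 0 and a standard deviation of 1.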

Saving your file

You can save your file with ggsave now:

ggsave("gaussian.png")

or

my_ggplot<-ggplot(data = dataset, aes(x=C,fill = B))+geom_histogram(bins = 50)+theme_bw()+
  scale_fill_discrete("This is \n a fill ")
ggsave(filename = "gaussian.pdf", plot = my_ggplot)

Note that ggsave automatically saves the last plot unless you specify otherwise, and produces an image whose format depends on the extension of the file name (“.png”, “.pdf”, “.jpg”).
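ggsave also lets you control the size of the output. Here is a minimal sketch that writes a PDF of a fixed size to a temporary directory; the dimensions and the `out_file` location are assumptions made for this example.

```r
library(ggplot2)

# Rebuild the two columns needed so that this block is self-contained.
set.seed(123)
dataset <- data.frame(
  B = sample(x = LETTERS[1:5], size = 1000, replace = TRUE),
  C = rnorm(n = 1000, mean = 0, sd = 1)
)

my_ggplot <- ggplot(data = dataset, aes(x = C, fill = B)) +
  geom_histogram(bins = 50) +
  theme_bw()

# Write to a temporary directory to avoid cluttering the working directory.
out_file <- file.path(tempdir(), "gaussian.pdf")
ggsave(filename = out_file, plot = my_ggplot,
       width = 15, height = 10, units = "cm")
print(file.exists(out_file))
```

Passing the plot explicitly avoids depending on whichever plot happened to be displayed last.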

What’s next?

This closes this introduction to ggplot2. Of course there are many more things to cover, such as scales, long vs wide data formatting and transitions with reshape2, heatmaps and multiple plots (with gridExtra or facets), so I’ll make more posts on these.

Written on August 29, 2016