flowchart LR
  A{Data type} -->|Continuous| B{Purpose}
  A -->|Discrete| C{Purpose}
  B -->|Exploration| D((Histogram/Boxplot))
  B -->|Association| E((Scatter plot))
  B -->|Association+time| T((Line plot))
  C -->|Exploration| F((Bar chart))
  C -->|Association| G((Tree map))
Introduction to ggplot2
In this chapter, we will introduce ggplot2, a powerful plotting library in R for creating elegant and complex visualizations. We will guide you through the basics of ggplot2, including its grammar of graphics approach, and demonstrate how to create various types of plots.
What is ggplot2?
ggplot2 is part of the tidyverse, a collection of R packages developed by Hadley Wickham (Wickham 2010), who received the COPSS Presidents’ Award for his contributions to the tidyverse. ggplot2 is based on the grammar of graphics (Wilkinson 2011). Like the other tidyverse packages, it follows a consistent framework, and in particular it lets users build plots layer by layer. This approach makes it highly flexible and intuitive once you understand its core concepts.
Basic Concepts of the Grammar of Graphics
Based on the data we collected in the first lecture:
- What kind of message can be expressed through visualization?
- What kind of graph will you use?
- Why will you choose this graph to express such an idea?
The concept of the grammar of graphics was proposed by Wilkinson (2011) (I cite a later edition, but the idea was already described in the first edition, published in 1999). It describes the seven basic elements of a statistical graph:
- Data:
  - The information to visualize.
- Mapping:
  - How data variables connect to aesthetic attributes.
  - Displayed as x, y, color, shape, etc.
- Layer:
  - Combines geometric elements (geoms: points, lines, polygons) and statistical transformations (stats: e.g., binning for histograms, fitting models).
  - Represents what is visually displayed in the plot.
- Scales:
  - Map data values to aesthetic values (e.g., color, size).
  - Generate legends and axes for reading original data values.
- Coordinate System (Coord):
  - Defines how data is mapped to the plot plane.
  - Provides axes and gridlines (e.g., Cartesian, polar, map projections).
- Facet:
  - Breaks data into subsets for small multiple plots (also called conditioning or trellising).
- Theme:
  - Adjusts visual elements like fonts and colors.
  - Default settings in ggplot2 are carefully chosen, but customization may require references like Tufte (1990, 1997, 2001).
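As a rough illustration, the seven elements can be lined up with the parts of a single ggplot2 call. This is only a sketch: the use of the built-in `mtcars` data and of the variables `wt`, `mpg`, `cyl`, and `am` is an assumption made for illustration, and the object name `p_elements` is hypothetical.

```r
library(ggplot2)

p_elements <- ggplot(
  data = mtcars,                                         # Data
  mapping = aes(x = wt, y = mpg, colour = factor(cyl))   # Mapping
) +
  geom_point() +                                 # Layer: point geom, identity stat
  scale_colour_discrete(name = "Cylinders") +    # Scale: controls the colour legend
  coord_cartesian() +                            # Coordinate system
  facet_wrap(~ am) +                             # Facet: one panel per value of am
  theme_bw()                                     # Theme

p_elements
```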
Although this approach identifies the individual elements of a statistical graph, it has drawn several critiques:
- It does not tell you which graph to use.
- The framework does not translate well into a programming language, and Wickham (2010) later implicitly modified these layers.
- It does not describe interactive graphs.
Choosing the visualization
Deciding which plot to use is sometimes ambiguous for the user. The following questions, together with the decision flow chart, can help you sort out which kind of graph you need (at least within the scope of this course):
- What is the purpose of displaying the graph?
- What types of data are you going to present?
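The two questions can also be written down as a small lookup. The sketch below encodes the decision flow chart of this chapter in base R; `choose_plot()` is a hypothetical helper written for illustration, not a ggplot2 function.

```r
# Hypothetical helper: returns the chart suggested by the flow chart
# for a given data type and purpose.
choose_plot <- function(data_type, purpose) {
  chart <- list(
    continuous = c(
      exploration = "Histogram/Boxplot",
      association = "Scatter plot",
      "association+time" = "Line plot"
    ),
    discrete = c(
      exploration = "Bar chart",
      association = "Tree map"
    )
  )
  if (!data_type %in% names(chart)) {
    stop("unknown data type: ", data_type)
  }
  plots <- chart[[data_type]]
  if (!purpose %in% names(plots)) {
    stop("no suggestion for purpose '", purpose, "' with ", data_type, " data")
  }
  unname(plots[purpose])
}

choose_plot("continuous", "association+time")
choose_plot("discrete", "exploration")
```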
Getting Started
To use ggplot2, you first need to install and load the package in R:
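A minimal sketch of the usual commands follows. The `requireNamespace()` guard is an addition here, so that the installation step is skipped when the package is already present.

```r
# Install once per machine (skipped if ggplot2 is already available)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package in every new R session
library(ggplot2)
```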
Basic Plot Example
Data and mapping layer
Let’s create a simple scatter plot using the `mtcars` dataset, which is built into R:
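A sketch of such a plot is given here. It assumes that `mtcars2` is a copy of `mtcars` in which `am` has been converted to a labelled factor; how `mtcars2` was actually built is not shown in this chapter, so treat that step as an assumption.

```r
library(ggplot2)

# Assumption: mtcars2 is mtcars with `am` recoded as a factor
# (in mtcars, 0 = automatic and 1 = manual transmission)
mtcars2 <- transform(
  mtcars,
  am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
)

# First line: data and mapping; second line: the layer that draws points
p <- ggplot(data = mtcars2, mapping = aes(x = mpg, y = cyl, colour = am)) +
  geom_point()

p
```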
From the code, you can see that the data and the mapping are both specified in the first line, inside the `ggplot()` function:
\[ \text{ggplot}(\text{data=}\underbrace{\text{mtcars2}}_{\text{data}}, \text{mapping=}\underbrace{\text{aes(x = mpg, y = cyl, colour=am)}}_{\text{mapping}}) \]
Occasionally (actually, very frequently), you will see people omit the names on the left-hand side (LHS) of the equals sign for `data`, `mapping`, `x`, and `y`, because these are the standard leading arguments in ggplot2 and can be matched by position.
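The two styles can be compared side by side. In this sketch, `mtcars2` is again assumed to be `mtcars` with `am` turned into a factor, and the object names `p_named` and `p_short` are hypothetical.

```r
library(ggplot2)

# Assumption: mtcars2 is mtcars with `am` recoded as a factor
mtcars2 <- transform(mtcars, am = factor(am, labels = c("Automatic", "Manual")))

# Fully named arguments
p_named <- ggplot(data = mtcars2, mapping = aes(x = mpg, y = cyl, colour = am)) +
  geom_point()

# Positional shorthand: data, mapping, x, and y are matched by position;
# colour still has to be named
p_short <- ggplot(mtcars2, aes(mpg, cyl, colour = am)) +
  geom_point()
```

Both objects describe the same plot; the shorthand only saves typing.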
There are also mapping options other than color:
- Size
- Line
  - linetype
  - lineend
  - linejoin
- Dot
  - Shape
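As a brief sketch of these options, size and shape can be mapped for points and linetype for lines. The variables chosen here (`hp` and `cyl` from `mtcars`, and the `economics_long` data shipped with ggplot2) are assumptions made for illustration.

```r
library(ggplot2)

# Points: size follows horsepower, shape follows the number of cylinders
p_points <- ggplot(mtcars, aes(x = wt, y = mpg, size = hp, shape = factor(cyl))) +
  geom_point()

# Lines: one linetype per economic indicator
p_lines <- ggplot(economics_long, aes(x = date, y = value01, linetype = variable)) +
  geom_line()
```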
Layers
Layer functions are named in the format `geom_XXXX()`:
Chart type | Function name |
---|---|
Histogram | geom_histogram() |
Box plot | geom_boxplot() |
Bar chart | geom_bar() |
Scatter plot | geom_point() |
Line chart | geom_line() |
One variable
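For a single variable, only one of `x` or `y` needs to be mapped. The following sketch uses `mtcars` as an assumed example dataset; the object names are hypothetical.

```r
library(ggplot2)

# Continuous variable: distribution as a histogram
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Continuous variable: distribution as a box plot
p_box <- ggplot(mtcars, aes(y = mpg)) +
  geom_boxplot()

# Discrete variable: counts per category as a bar chart
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()
```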
Two variables
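With two variables, both `x` and `y` are mapped. This sketch assumes `mtcars` for the scatter plot and the `economics` data shipped with ggplot2 for the line plot; the object names are hypothetical.

```r
library(ggplot2)

# Association between two continuous variables: scatter plot
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Association with time: line plot of unemployment over time
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()
```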
Scales
Scales are functions that transform data values into visual values for the graph. ggplot2 usually picks a suitable scale automatically from the type of layer and the data, so what people most often need is to modify how the axis or legend is displayed. These functions are named in the format `scale_(AES)_(DATATYPE)()`.
Aesthetic | Discrete | Continuous |
---|---|---|
X | scale_x_discrete() | scale_x_continuous() |
Y | scale_y_discrete() | scale_y_continuous() |
color | scale_color_discrete() | scale_color_continuous() |
The arguments and the axis or legend components they control are listed in the table and figure below.
Argument name | Axis | Legend |
---|---|---|
name | Label | Title |
breaks | Ticks | Key |
labels | Tick label | Key Label |
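The three arguments can be tried on both an axis and a legend. This is a sketch under assumptions: `mtcars2` is taken to be `mtcars` with `am` as a factor, and the titles, tick positions, and labels are made up for illustration.

```r
library(ggplot2)

# Assumption: mtcars2 is mtcars with `am` recoded as a factor
mtcars2 <- transform(mtcars, am = factor(am, labels = c("Automatic", "Manual")))

# Axis: name = label, breaks = tick positions, labels = tick labels
x_scale <- scale_x_continuous(
  name = "Miles per gallon",
  breaks = c(10, 20, 30),
  labels = c("10 mpg", "20 mpg", "30 mpg")
)

# Legend: name = title, labels = key labels (in factor-level order)
colour_scale <- scale_colour_discrete(
  name = "Transmission",
  labels = c("Auto", "Manual")
)

p_scaled <- ggplot(mtcars2, aes(x = mpg, y = hp, colour = am)) +
  geom_point() +
  x_scale +
  colour_scale
```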
Coordinates and facet
Coordinates refer to the coordinate system of the graph, and they can help you adjust your plot. With most of the data we will encounter, you will sometimes need to adjust the range of the x and y axes. This can be done in `coord_cartesian()` through the arguments `xlim = c(LOWER_BOUND, UPPER_BOUND)` for the x-axis and `ylim = c(LOWER_BOUND, UPPER_BOUND)` for the y-axis.
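A short sketch of adjusting the axis ranges with `coord_cartesian()` follows. The bounds and the `mtcars` variables are assumptions chosen for illustration.

```r
library(ggplot2)

# Zoom the view to 15-30 mpg and 50-250 hp
zoom <- coord_cartesian(xlim = c(15, 30), ylim = c(50, 250))

p_zoom <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  zoom
```

Note that `coord_cartesian()` only changes the visible window: points outside the limits are hidden from view but are not removed from the data.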
Facets break the data into several subplots, one per subgroup. You specify the subgroups with formula notation:
1. One-factor scenario: `~ FACTOR_A` generates one subplot for each level of `FACTOR_A` (in `facet_wrap()` you can additionally specify `nrow = x` or `ncol = x` to control how the subplots spread over rows or columns).
2. Two-factor scenario: `FACTOR_A ~ FACTOR_B` (in `facet_grid()`) spreads `FACTOR_A` over rows and `FACTOR_B` over columns.
From time to time, you might want the subplots to share (or not share) the same axis scales. You can specify this through the argument `scales = XXX` in the `facet_wrap()` function. The options for `scales` are listed below.
scales | free x axis | fixed x axis |
---|---|---|
free y axis | free | free_y |
fixed y axis | free_x | fixed |
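The facet notation and the `scales` option can be sketched as follows, assuming `mtcars` with `cyl` and `am` as the grouping factors (an illustrative choice). The object names are hypothetical.

```r
library(ggplot2)

p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# One factor: one panel per value of cyl, laid out in two columns,
# with a separate y axis per panel and a shared x axis
p_wrap <- p_base + facet_wrap(~ cyl, ncol = 2, scales = "free_y")

# Two factors: am over rows, cyl over columns
p_grid <- p_base + facet_grid(am ~ cyl)
```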
Theme
Several themes are available for your plot. There are also third-party packages that provide additional themes, such as `ggthemes`.
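A minimal sketch of applying a theme: `theme_minimal()` and `theme_bw()` ship with ggplot2, and `theme()` adjusts individual elements. The specific adjustments shown are illustrative assumptions.

```r
library(ggplot2)

p_base <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# Swap the whole look with a built-in theme
p_minimal <- p_base + theme_minimal()

# Start from a built-in theme, then tweak single elements
p_custom <- p_base +
  theme_bw() +
  theme(
    legend.position = "bottom",           # move the legend under the plot
    axis.title = element_text(size = 14)  # larger axis titles
  )
```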
Reading R documentation
It is by no means possible to introduce every function and argument in this library, even though `ggplot2` is a relatively stable library. The key to surviving in the coding world is to understand the mechanisms and concepts behind the code. For everything else, you can check the official documentation from the library authors.
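A few ways to look things up from the R console are sketched below. The choice of `facet_wrap()` and `geom_point()` as the functions to inspect is arbitrary.

```r
library(ggplot2)

# List the arguments a function accepts without leaving the console
names(formals(facet_wrap))

# Help pages open in a viewer, so only request them in an interactive session
if (interactive()) {
  ?geom_point                 # help page for a single function
  help(package = "ggplot2")   # index of all documented functions
}
```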
Wrap-up
As a wrap-up, your code is usually in the following format:
\[ \small \begin{aligned} \text{ggplot()}+ &\\ \underbrace{\text{geom\_XXXX(data=DATA,mapping=aes(x,y,color,...))}}_{\text{Plotting data}} + &\\ \underbrace{\text{scale\_AES\_TYPE(name="TITLE",breaks="TICK LOC",labels="TICK LAB")}}_{\text{Handling axis, legend, etc.}} + & \\ \underbrace{ \text{coord\_cartesian(xlim=c(MIN,MAX), ylim=c(MIN,MAX))}}_{\text{Adjusting the coordinate system}} + &\\ \underbrace{\text{ggtitle("CHART TITLE")}}_{\text{Plotting title}} \end{aligned} \]
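Filled in with concrete values, the template might look like the sketch below. The dataset, variables, tick positions, limits, and title are all assumptions chosen for illustration.

```r
library(ggplot2)

p_template <- ggplot() +
  # Plotting data
  geom_point(
    data = mtcars,
    mapping = aes(x = wt, y = mpg, colour = factor(cyl))
  ) +
  # Handling axis, legend, etc.
  scale_x_continuous(
    name = "Weight (1000 lbs)",
    breaks = 2:5,
    labels = c("2", "3", "4", "5")
  ) +
  # Adjusting the coordinate system
  coord_cartesian(xlim = c(1, 6), ylim = c(10, 35)) +
  # Plotting title
  ggtitle("Fuel economy versus weight")

p_template
```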
Having seen this code, it is worth reflecting on the following:
- How does the way the code is constructed differ from the way you would naturally go about drawing a plot?
- How do the seven elements of the grammar of graphics differ from the ggplot2 syntax?
- If you could make `ggplot` easier to use, how would you design it?
Final remarks and acknowledgement
The material is mostly adapted from Wickham (2016a). You can get the latest edition from the book website, which also covers more advanced topics. For details on how to use the code and each function, you can search the documentation of the `ggplot2` library with `??ggplot2`.