Visualization makes data decipherable. Have you ever tried to study a table of raw data? You can examine values one at a time, but you cannot attend to many values at once. The data overloads your attention span, which makes it hard to spot patterns in the data. See this for yourself; can you spot the striking relationship between $X$ and $Y$ in the table below?
Raw data is difficult to comprehend, but visualized data is easy to understand. Once you plot your data, you can see the relationships between data points---instantly. For example, the graph below shows the same data as above. Here, the relationship between the points is obvious.
This chapter will teach you how to visualize your data with R and the `ggplot2` package. R contains several systems for making graphs, but the `ggplot2` system is one of the most beautiful and most versatile. `ggplot2` implements the *grammar of graphics*, a coherent system for describing and building graphs. With `ggplot2`, you can do more, faster, by learning one system and applying it in many places.
*Section 1* will get you started making graphs right away. You'll learn how to make several common types of plots, and you will explore `ggplot2`'s syntax.
*Section 2* will teach you the _grammar of graphics_, a versatile system for building plots. You'll learn to assemble any plot you like with _layers_, _geoms_, _stats_, _aesthetic mappings_, _position adjustments_, and _coordinate systems_.
You probably have an intuitive answer to this question. Now try to make your answer precise: What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear? Strong? Weak?
You can test your hypothesis with the `mpg` data set in the `ggplot2` package. The data set contains observations collected by the EPA on 38 models of car. Among the variables in `mpg` are `displ`, a car's engine size in litres, and `hwy`, a car's fuel efficiency on the highway in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
*Tip*: If you have trouble loading `mpg`, its help page, or any of the functions in this chapter, you may need to load the `ggplot2` package with the command `library(ggplot2)`.
The graph shows a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). In other words, cars with big engines use more fuel. But the graph shows us something else as well.
One group of points seems to fall outside of the linear trend. These cars have a higher mileage than you might expect. Can you tell why? Before we examine these cars, let's review the code that made our graph.
With `ggplot2`, you begin a plot with the function `ggplot()`. `ggplot()` doesn't create a plot by itself; instead it initializes a new plot that you can add layers to.
You complete your graph by adding one or more layers to `ggplot()`. Here, the function `geom_point()` adds a layer of points to the plot, which creates a scatterplot. `ggplot2` comes with other `geom_` functions that you can use as well. Each function creates a different type of layer, and each function takes a mapping argument.
The mapping argument explains where your points should go. You must set mapping to a call to `aes()`. The `x` and `y` arguments of `aes()` explain which variables to map to the x and y axes of the graph. `ggplot()` will look for those variables in your data set, `mpg`.
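The call described above can be reconstructed from the pieces named in the text. The sketch below assumes `ggplot2` is installed; the variable name `scatter` is chosen only for illustration.

```{r}
library(ggplot2)

# Initialize a plot with the mpg data, then add a layer of points.
# x is mapped to engine size (displ) and y to highway mileage (hwy).
scatter <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
scatter
```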
You can use this code as a template to make many graphs with `ggplot2`. To make a graph, replace the bracketed sections in the code below with a new data set, a new geom function, or a new set of mappings.
The next few subsections will introduce several arguments (and functions) that you can add to the template. Each argument will come with a new set of options---and likely a new set of questions. Hold those questions for now. We will catalogue your options in Section 2. Use this section to become familiar with the `ggplot2` syntax. Once you do, the low level details of `ggplot2` will be easier to understand.
Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the `class` value for each car. The `class` variable of the `mpg` data set classifies cars into groups such as compact, midsize, and suv. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and suvs became popular).
An aesthetic is a visual property of the points in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing its aesthetic properties.
You can convey information about your data by mapping the aesthetics in your plot to the variables in your data set. For example, we can map the colors of our points to the `class` variable. Then the color of each point will reveal its class affiliation.
`ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable. `ggplot2` will also add a legend that explains which levels correspond to which values.
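As a minimal sketch of this mapping, the code below adds `color = class` inside `aes()`. The variable name `by_class` is an assumption made for illustration.

```{r}
library(ggplot2)

# Map the color aesthetic to class. ggplot2 assigns one color
# to each class and adds a legend that explains the assignment.
by_class <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
by_class
```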
The colors reveal that many of the unusual points are two seater cars. These cars don't seem like hybrids. In fact, they seem like sports cars---and that's what they are. Sports cars have large engines like suvs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have such large engines.
In the above example, we mapped `class` to the color aesthetic, but we could have mapped `class` to the size aesthetic in the same way. In this case, the exact size of each point reveals its class affiliation.
Or we could have mapped `class` to the _alpha_ aesthetic, which controls the transparency of the points. Now the transparency of each point corresponds with its class affiliation.
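Both alternatives can be sketched by swapping the aesthetic name inside `aes()`. The variable names `by_size` and `by_alpha` are illustrative, and recent versions of `ggplot2` may warn that size and alpha are not well suited to a discrete variable.

```{r}
library(ggplot2)

# Show class through the size of each point.
by_size <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))
by_size

# Show class through the transparency of each point.
by_alpha <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
by_alpha
```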
In each case, you set the name of the aesthetic to the variable to display, and you do this within the `aes()` function. The syntax highlights a useful insight because you also set `x` and `y` to variables within `aes()`. The insight is that the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.
Once you set an aesthetic, `ggplot2` takes care of the rest. It selects a pleasing set of levels to use for the aesthetic, and it constructs a legend that explains the mapping. For x and y aesthetics, `ggplot2` does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts in the same way as a legend; it explains the mapping between locations and values.
See the help page for `geom_point()` (`?geom_point`) to learn which aesthetics are available to use in a scatterplot. See the help page for the `mpg` data set (`?mpg`) to learn which variables are in the data set.
Did you notice that there is another riddle hidden in our scatterplot? The plot displays 126 points, but there are 234 observations in the `mpg` data set. Also, the points appear to fall on a grid. Why should this be?
The points appear in a grid because the `hwy` and `displ` measurements in `mpg` are rounded to the nearest integer and tenths values. This also explains why our graph appears to contain 126 points. Many points overlap each other because they have been rounded to the same values of `hwy` and `displ`. 108 points are hidden on top of other points located at the same value.
You can avoid this overplotting problem by setting the position argument of `geom_point()` to "jitter". `position = "jitter"` adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.
But isn't random noise, you know, bad? It *is* true that jittering your data will make it less accurate at the local level, but jittering may make your data _more_ accurate at the global level. Occasionally, jittering will reveal a pattern that was hidden within the grid.
The chart above displays the total number of diamonds in the `diamonds` data set, grouped by `cut`. The `diamonds` data set comes in `ggplot2` and contains information about 53940 diamonds, including the `price`, `carat`, `color`, `clarity`, and `cut` of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
A bar has different visual properties than a point, which can create some surprises. For example, how would you create this simple chart? If you have an R session open, give it a try.
```{r echo=FALSE}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))
```
It may be tempting to call the color aesthetic, but for bars the color aesthetic controls the _outline_ of the bar, e.g.
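A short sketch of the outline behavior, assuming the same `diamonds` data; the variable name `outlined` is illustrative.

```{r}
library(ggplot2)

# For bars, color controls the outline only.
# The interior of each bar keeps the default fill.
outlined <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))
outlined
```

Compare this with the `fill = cut` version to see the difference between the two aesthetics.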
Bar charts also use different position adjustments than scatterplots. Every geom function in `ggplot2` accepts a position argument, but it wouldn't make sense to set `position = "jitter"` for a bar chart. However, you could set `position = "dodge"` to create an unstacked bar chart.
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
```
On the x axis it displays `cut`, a variable in the `diamonds` data set. On the y axis, it displays count. But count is not a variable in the diamonds data set:
```{r}
head(diamonds)
```
Nor did we tell `ggplot2` in our code where to find count values.
Some graphs, like scatterplots, plot the raw values of your data set. Other graphs, like bar charts, do not plot raw values at all. These graphs apply an algorithm to your data and then plot the results of the algorithm. Consider how often graphs do this.
`ggplot2` calls the algorithm that a graph uses to transform raw data a _stat_, which is short for statistical transformation. Each geom in `ggplot2` is associated with a stat that it uses to plot your data. `geom_bar()` uses a counting stat (named "count" in `ggplot2` 2.0.0 and later, "bin" in earlier releases), which groups raw data and computes the number of observations in each group. `geom_point()` uses the "identity" stat, which applies the identity transformation, i.e. no transformation.
You can change the stat that your geom uses. For example, you can ask `geom_bar()` to use the "identity" stat. This is a useful way to plot data that already lists the heights for each bar, like the data set below.
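The sketch below illustrates the idea with a small made-up data frame. The data frame `demo`, its columns `fruit` and `total`, and the variable name `heights` are all hypothetical and exist only for this example.

```{r}
library(ggplot2)

# Hypothetical data: one row per bar, with the height already computed.
demo <- data.frame(
  fruit = c("apple", "banana", "cherry"),
  total = c(20, 30, 40)
)

# stat = "identity" tells geom_bar() to use total as the bar height
# instead of counting the rows that fall into each group.
heights <- ggplot(data = demo) +
  geom_bar(mapping = aes(x = fruit, y = total), stat = "identity")
heights
```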
To plot your data in polar coordinates, add `coord_polar()` to your plot call. Polar bar charts will look better if you also set the width parameter of `geom_bar()` to 1.
You can add `coord_polar()` to any plot in `ggplot2` to draw the plot in polar coordinates. `ggplot2` will map the y variable to $r$ and your x variable to $\theta$.
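A minimal sketch of a coxcomb plot built this way, using the `diamonds` bar chart from earlier; the variable name `coxcomb` is illustrative.

```{r}
library(ggplot2)

# width = 1 removes the gaps between bars, so the wedges touch.
# coord_polar() then bends the x axis around a circle.
coxcomb <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
  coord_polar()
coxcomb
```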
Coxcomb plots make a useful glyph that you can use to compare subgroups of data. _Facetting_ provides a quick way to do this.
You can create a separate polar chart for each level of a third variable by _facetting_ your plot. For example, you can create a separate subplot for each level of the `clarity` variable.
Here, the first subplot displays the group of points that have the `clarity` value `I1`. The second subplot displays the group of points that have the `clarity` value `SI2`. And so on.
To facet your plot on a single discrete variable, add `facet_wrap()` to your plot call. The first argument of `facet_wrap()` is a formula, always a `~` followed by a variable name.
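As a sketch, the code below facets the polar bar chart on `clarity`. The variable name `by_clarity` is an assumption for illustration.

```{r}
library(ggplot2)

# One coxcomb plot per level of clarity, wrapped into a grid of subplots.
by_clarity <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
  coord_polar() +
  facet_wrap(~ clarity)
by_clarity
```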
To facet your plot on the combinations of two variables, add `facet_grid()` to your plot call. The first argument of `facet_grid()` is a formula, always two variable names separated by a `~`.
Here the first subplot displays all of the points that have an `I1` code for `clarity` _and_ a `D` code for `color`. Don't be confused; `color` is a variable name in the `diamonds` data set; `facet_grid(color ~ clarity)` is not invoking the color aesthetic.
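A sketch of the two-variable case, assuming the same polar bar chart as the starting point; the variable name `by_color_clarity` is illustrative.

```{r}
library(ggplot2)

# Rows are split by the diamonds variable named color,
# columns are split by clarity. No color aesthetic is involved.
by_color_clarity <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 1) +
  coord_polar() +
  facet_grid(color ~ clarity)
by_color_clarity
```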
Facetting works on more than just polar charts. You can add `facet_wrap()` or `facet_grid()` to any plot in `ggplot2`. For example, you could facet our original scatterplot.
In this section, you learned more than how to make scatterplots, bar charts, and coxcomb plots; you learned a foundation that you can use to make _any_ type of plot with `ggplot2`.
To see this, let's add position adjustments, stats, coordinate systems, and facetting to our code template. In `ggplot2`, each of these parameters will work with every plot and every geom.
The template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters because `ggplot2` will provide useful defaults for everything except the data, mappings, and geom function.
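To make the seven parameters concrete, here is one possible instance that spells out every one of them, including the values that `ggplot2` would otherwise supply by default. The choice of `class` as the facetting variable and the name `full` are assumptions for illustration.

```{r}
library(ggplot2)

full <- ggplot(data = mpg) +                 # 1. data set
  geom_point(                                # 2. geom function
    mapping = aes(x = displ, y = hwy),       # 3. mappings
    stat = "identity",                       # 4. stat
    position = "identity"                    # 5. position adjustment
  ) +
  coord_cartesian() +                        # 6. coordinate system
  facet_wrap(~ class)                        # 7. facetting scheme
full
```

Dropping the `stat`, `position`, and `coord_cartesian()` pieces should give the same plot, since those are the defaults for a point layer.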
The seven parameters in the template are connected by a powerful idea known as the _Grammar of Graphics_, a system for describing plots. The grammar shows that you can uniquely describe _any_ plot as a combination of---you guessed it: a data set, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a facetting scheme.
In other words, you can use the template above to make any graph that you can imagine---at least in theory. Section 2 will examine how this works in practice. The section explains how the grammar of graphics works and how `ggplot2` implements the grammar to build real graphs. It also catalogues all of the options that `ggplot2` puts at your fingertips for geoms, mappings, stats, position adjustments, and coordinate systems.
The "gg" of `ggplot2` stands for the grammar of graphics, a system for describing plots. According to the grammar, a plot is a combination of seven elements:
This may not be an obvious way to think about plots, so let's explore the formula above with a thought exercise. You can build any plot in the following manner.
To build the plot, you begin with a data set to visualize and a coordinate system to visualize it in. We'll visualize an abbreviated version of the `mpg` data set in the cartesian coordinate system.
You then choose whether to visualize the data itself, or whether to summarize the data with a transformation and then visualize the summary. Let's visualize our data as it is. This would be the same as applying an identity transformation to the data, since an identity transformation returns the data as it is.
Next, you need to choose some sort of visual object to represent the observations in your data set. This object will be what you actually draw in the coordinate system.
One pair of mappings is particularly important. To place your points into your coordinate system, you map the x location aesthetic to a variable, here `displ`.
You can reuse this process to make any graph. If you change any of the elements involved, you will end up with a new graph. For example, we can change our geom to a line to make a line graph, or to a bar to make a bar chart. Or we can change the position to "jitter" to make a jittered plot.
A layer is a collection of a data set, a stat, a geom, and a position adjustment. You can add a layer to a coordinate system and facetting scheme to make a complete graph, or you can add a layer to an existing graph to make a layered graph.
Let's build a layer that uses the same data set as our previous graph. In this layer, we will apply a "smooth" stat to the data. The stat fits a model to the data and then returns a transformed data set with three new columns:
In this layer, we will represent the observations with a line geom. We map the x values of the line to `displ` and we map the y values to our new `y` variable. We won't use a position adjustment.
For completeness, let's add one more layer. This layer will begin with the same data set as the previous layer. It will also use the same stat. However, we will use the ribbon geom to visualize the data points. A ribbon is similar to a shaded region contained by two lines.
We map the top of the ribbon to `ymax` and the bottom of the ribbon to `ymin`. We map the x position of the ribbon to `displ`. We will not use a position adjustment.
The thought experiment shows that the elements of the grammar of graphics work together to build a graph. You can describe any graph with these elements, and each unique combination of elements makes a single, unique graph. You can also extend a graph by adding layers of new data, stats, geoms, mappings, and positions.
In other words, you can extend the grammar of graphics formula indefinitely to make layered plots:
`ggplot2` syntax matches this formulation almost exactly. The basic low level function of `ggplot2` is `layer()` which combines data, stats, geoms, mappings, and positions into a single layer to plot.
If you have time on your hands, you can use `layer()` to create a multi-layer plot like the one above. Initialize your plot with `ggplot()`. Then add as many calls to `layer()` as you like. Give each layer its own `data`, `stat`, `geom`, `mapping`, and `position` arguments.
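A rough sketch of this approach follows. It assumes `ggplot2` 2.0.0 or later, where `layer()` accepts a `params` list. The choice of a loess model and the variable name `layered` are assumptions made for this example, not something the text prescribes.

```{r}
library(ggplot2)

layered <- ggplot() +
  # Layer 1: raw observations drawn as points.
  layer(
    data = mpg, mapping = aes(x = displ, y = hwy),
    geom = "point", stat = "identity", position = "identity"
  ) +
  # Layer 2: the same data passed through the "smooth" stat
  # and drawn as a line.
  layer(
    data = mpg, mapping = aes(x = displ, y = hwy),
    geom = "line", stat = "smooth", position = "identity",
    params = list(method = "loess", formula = y ~ x)
  )
layered
```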
For example, you will find in practice that you usually pair the same geoms with the same stats and position adjustments. You'll almost always use the point geom with the "identity" stat and the "identity" position. You'll almost always use the bar geom with a counting stat ("bin" in early releases of `ggplot2`, "count" from version 2.0.0 onward) and the "stack" position.
The `geom_` functions in `ggplot2` take advantage of these common combinations. Like `layer()`, each geom function builds a layer, but the geom functions preset the geom, stat, and position values of the layer to useful defaults. The geom becomes the geom that appears in the function name. The stat and position become the stat and position most commonly associated with the geom.
`ggplot2` even provides geom functions for less common, but still useful combinations of geoms, stats, and positions. For example, the function `geom_jitter()` builds a layer that has a point geom, an "identity" stat and a "jitter" position. The function `geom_smooth()` builds two layers: a ribbon layer that is combined with a line layer as in the plot above. Together these layers display a model line with its standard error band.
This system lets you build sophisticated graphs geom by geom, but it also makes it possible to write repetitive code. For example, the code above repeats the arguments `data = mpg, mapping = aes(x = displ, y = hwy)`. Repetition makes your code harder to read and write, and it also increases the chance of errors and typos.
You can avoid repetition by passing `ggplot()` a set of global mappings to apply to each layer. For example, we can eliminate the duplication of `mapping = aes(x = displ, y = hwy)` in our previous code with a global mapping argument:
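A minimal sketch of a global mapping, assuming a point layer plus a smooth layer. The loess method is spelled out only to keep the call explicit, and the name `shared` is illustrative.

```{r}
library(ggplot2)

# data and mapping are given once in ggplot();
# both layers inherit them.
shared <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)
shared
```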
* If a local aesthetic mapping conflicts with a global aesthetic mapping, `ggplot2` will use the local mapping. This is arbitrated on an aesthetic by aesthetic basis.
For example, the smooth line above is a single line with a single color. This does not occur if you add the color aesthetic to the global mappings. Smooth will draw a different colored line for each class of cars.
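The difference can be sketched with two plots. This example swaps in a linear model (`method = "lm"`) with `se = FALSE`, an assumption made here because some classes contain only a handful of cars. The names `global_color` and `local_color` are illustrative.

```{r}
library(ggplot2)

# color in the global mappings: both layers inherit it,
# so the smooth layer draws one line per class.
global_color <- ggplot(data = mpg,
                       mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
global_color

# color only in the point layer: the smooth layer sees no color
# mapping and draws a single line.
local_color <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
local_color
```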
Have you experimented with aesthetics? Great! Here are some things that you may have noticed.
#### Continuous data
A continuous variable can contain an infinite number of values that can be put in order, like numbers or date-times. If your variable is continuous, `ggplot2` will treat it in a special way. `ggplot2` will
* use a gradient of colors from dark blue to light blue for the color aesthetic
* display a colorbar in the legend for the color aesthetic
* not use the shape aesthetic
`ggplot2` will not use the shape aesthetic to display continuous information because the human eye cannot easily interpolate between shapes. Can you tell whether a shape is three-quarters of the way between a triangle and a circle? How about five-eighths of the way?
`ggplot2` will treat your variable as continuous if it is a numeric, integer, or a recognizable date-time structure (but not a factor, see `?factor`).
#### Discrete data
A discrete variable can only contain a finite (or countably infinite) set of values. Character strings and boolean values are examples of discrete data. `ggplot2` will treat your variable as discrete if it is not a numeric, integer, or recognizable date-time structure.
If your data is discrete, `ggplot2` will:
* use a set of colors that span the hues of the rainbow. The exact colors will depend on how many hues appear in your graph. `ggplot2` selects the colors in a way that ensures that one color does not visually dominate the others.
* use equally spaced values of size and alpha
* display up to six shapes for the shape aesthetic.
If your data requires more than six unique shapes, `ggplot2` will print a warning message and only display the first six shapes. You may have noticed this in the graph above (and below): `ggplot2` did not display the suv values, which were the seventh unique class.
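A short sketch that should reproduce the shape limit, assuming the `mpg` scatterplot from earlier; the name `by_shape` is illustrative. Expect a warning when the plot is drawn.

```{r}
library(ggplot2)

# class has seven values but the default shape palette has six,
# so ggplot2 warns and leaves the seventh class undrawn.
by_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
by_shape
```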
Here, `ggplot2` treats `color = "blue"` as a mapping because it appears in the mapping argument. `ggplot2` assumes that "blue" is a value in the data space. It uses R's recycling rules to pair the single value "blue" with each row of data in `mpg`. Then `ggplot2` creates a mapping from the value "blue" in the data space to the pinkish color that we see in the visual space. `ggplot2` even creates a legend to let you know that the color pink represents the value "blue." The choice of pink is a coincidence; `ggplot2` defaults to pink whenever a single discrete value is mapped to the color aesthetic.
* define an aesthetic _within_ the `aes()` function to map levels of the aesthetic to values of data. You would expect a legend after this operation.
* define an aesthetic _outside of_ the `aes()` function to manually set the aesthetic to a specific level. You would not expect a legend after this operation.
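The two cases can be compared side by side. In this sketch the names `mapped_blue` and `set_blue` are assumptions chosen for illustration.

```{r}
library(ggplot2)

# Inside aes(): "blue" is treated as a data value, so ggplot2 maps it
# to the first color of its default palette and draws a legend.
mapped_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
mapped_blue

# Outside aes(): the color of every point is set to blue, with no legend.
set_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
set_blue
```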
By default, `geom_smooth()` draws a single smoothed line for the entire data set. To draw a separate line for each group of points, set the group aesthetic to a grouping variable or expression.
`ggplot2` will automatically infer a group aesthetic when you map an aesthetic of a monolithic geom to a discrete variable. Below `ggplot2` infers a group aesthetic from the `linetype = drv` aesthetic. It is useful to combine group aesthetics with secondary aesthetics because `ggplot2` cannot build a legend for a group aesthetic.
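Both routes to a grouped smooth can be sketched as follows. The loess method is written out explicitly as an assumption, and the names `grouped` and `by_linetype` are illustrative.

```{r}
library(ggplot2)

# Explicit group aesthetic: one line per drv value, but no legend.
grouped <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv),
              method = "loess", formula = y ~ x)
grouped

# Inferred grouping: mapping linetype to drv also splits the data
# into one line per drv value, and this time a legend appears.
by_linetype <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv),
              method = "loess", formula = y ~ x)
by_linetype
```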
You can add new data to your scatterplot with aesthetics and facets, but how can you summarize the data that is already there, for example with a trend line?
But the plots are not identical. Each uses a different _geom_, or geometrical object, to represent the data. The first plot uses a set of points to represent the data. The second plot uses a single, smoothed line.
`ggplot2` comes with 37 `geom_` functions that you can use to visualize your data. Each function will represent the data with a different type of geom, like a bar, a line, a boxplot, a histogram, etc. You select the type of plot you wish to make by calling the `geom_` function that draws the geom you have in mind.
Each `geom_` function takes a `mapping` argument. However, the aesthetics that you pass to the argument will change from geom to geom. For example, you can set the shape of points, but it would not make sense to set the shape of a line.
To see which aesthetics your geom uses, visit its help page. To see a list of all available geoms, open the `ggplot2` package help page with `help(package = ggplot2)`.
You can map the heights of bars in a bar chart to data values---not counts---by changing the stat of the bar chart. This works best if your data set contains one observation per bar, e.g.
To find a list of available stats, run `help(package = "ggplot2")`. Each stat is listed as a function that begins with `stat_`. Set a geom's stat argument to the part of the function name that follows the underscore, surrounded in quotes, as above.
Many stats take _parameters_, arguments that fine-tune the statistical transformation. For example, the bin stat takes the parameter `width`, which controls the width of the bars in a bar chart.
Many stats in `ggplot2` create more data than they display. For example, the `?stat_bin` help page explains that the `stat_bin()` transformation creates four new variables: `count`, `density`, `ncount`, and `ndensity`. `geom_bar()` uses only one of these variables. It maps the `count` variable to the y axis of your plot.
You can use any of the variables created by a stat in an aesthetic mapping. To use a variable created by a stat, surround its name with a pair of dots, `..`.
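The sketch below plots the `density` variable computed by the bin stat instead of `count`. It assumes `ggplot2` 3.3.0 or later, where `after_stat()` is the documented way to refer to a computed variable; with the double-dot syntax described above, the same mapping would be written `y = ..density..`. The bin width of 0.1 and the name `dens` are arbitrary choices for illustration.

```{r}
library(ggplot2)

# geom_histogram() uses the bin stat. Mapping y to the computed
# density variable rescales the bars so that their areas sum to one.
dens <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat, y = after_stat(density)),
                 binwidth = 0.1)
dens
```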
However, the position of the bars within the two charts is different. In the stacked bar chart, `ggplot2` stacked the bars on top of each other if they had the same cut. In the second plot, `ggplot2` placed the bars beside each other if they had the same cut.
You can control this behavior by adding a _position adjustment_ to your call. A position adjustment tells `ggplot2` what to do when two or more objects overlap.
To set a position adjustment, set the `position` argument of your geom function to one of `"identity"`, `"stack"`, `"dodge"`, `"fill"`, or `"jitter"`.
#### Position = "identity"
For many geoms, the default position value is "identity". When `position = "identity"`, `ggplot2` will place each object exactly where it falls in the context of the graph.
This would make little sense for our bar chart. Each bar would start at `y = 0` and would appear directly above the `cut` value that it describes. Since there are eight bars for each value of `cut` (one per level of `clarity`), many bars would overlap. The plot would look suspiciously like a stacked bar chart, but the stacked heights would be inaccurate, as each bar actually extends to `y = 0`. Some bars would not appear at all because they would be completely overlapped by other bars.
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity") +
  ggtitle('Position = "identity"')
```
#### Position = "stack"
To avoid confusion, `ggplot2` uses a default "stack" position adjustment for bar charts. When `position = "stack"` `ggplot2` places overlapping objects directly _above_ one another.
Here each bar begins exactly where the bar below it ends.
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack") +
  ggtitle('Position = "stack"')
```
#### Position = "dodge"
When `position = "dodge"`, `ggplot2` places overlapping objects directly _beside_ one another. This is how I created the graph at the start of the section.
```{r}
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge") +
  ggtitle('Position = "dodge"')
```
#### Position = "fill"
When `position = "fill"`, `ggplot2` uses all of the available space to display overlapping objects. Within that space, `ggplot2` scales each object in proportion to the other objects. `position = "fill"` is the most unusual of the position adjustments, but it creates an easy way to compare relative frequencies across groups.
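A sketch of the fill adjustment, following the pattern of the other position examples; the variable name `filled` is illustrative.

```{r}
library(ggplot2)

# Each stack is rescaled to the same total height,
# so the y axis shows proportions instead of counts.
filled <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") +
  ggtitle('Position = "fill"')
filled
```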
Why does the plot appear to display only 126 points? There are 234 observations in the data set. Also, why do the points appear to be arranged on a grid?
The points appear in a grid because the `hwy` and `displ` measurements were rounded to the nearest integer and tenths values. As a result, many points overlap each other because they've been rounded to the same values of `hwy` and `displ`. This also explains why our graph appears to contain only 126 points. 108 points are hidden on top of other points located at the same value.
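You can check these counts yourself. The sketch below compares the number of observations with the number of distinct (`displ`, `hwy`) locations; the variable names are illustrative, and the result should agree with the figures of 126 visible and 108 hidden points quoted above.

```{r}
library(ggplot2)

n_obs <- nrow(mpg)                                      # all observations
n_locations <- nrow(unique(mpg[, c("displ", "hwy")]))   # distinct plotted positions
n_hidden <- n_obs - n_locations                         # points drawn on top of others

c(observations = n_obs, locations = n_locations, hidden = n_hidden)
```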
This arrangement can cause problems because it makes it hard to see where the mass of the data is. Is there one special combination of `hwy` and `displ` that contains 109 values? Or are the data points more or less equally spread throughout the graph?
You can avoid this overplotting problem by setting the position adjustment to "jitter". `position = "jitter"` adds a small amount of random noise to each point, as we see above. This spreads the points out because no two points are likely to receive the same amount of random noise.
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
```
But isn't this, you know, bad? It *is* true that jittering your data will make it less accurate at the local level, but jittering may make your data _more_ accurate at the global level. By jittering your data, you can see where the mass of your data falls on an overplotted grid. Occasionally, jittering will reveal a pattern that was hidden within the grid.
`ggplot2` recognizes `position = "jitter"` as shorthand for `position = position_jitter()`. This is true for the other values of position as well:
* `position = "identity"` is shorthand for `position = position_identity()`
* `position = "stack"` is shorthand for `position = position_stack()`
* `position = "dodge"` is shorthand for `position = position_dodge()`
* `position = "fill"` is shorthand for `position = position_fill()`
You can use the expanded syntax to specify details of the position process. You can also use the expanded syntax to open a help page for each position process (which you will need to do if you wish to learn more).
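For instance, `position_jitter()` takes `width` and `height` arguments that bound the noise added in each direction. The values 0.05 and 0.5 below, and the name `gentle`, are arbitrary choices for illustration.

```{r}
library(ggplot2)

# Shift each point by at most 0.05 litres horizontally
# and at most 0.5 mpg vertically.
gentle <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             position = position_jitter(width = 0.05, height = 0.5))
gentle
```

Run `?position_jitter` to see the full list of arguments.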
You can make your bar charts even more versatile by changing the coordinate system of your plot. For example, you could flip the x and y axes of your plot, or you could plot your bar chart on polar coordinates to make a coxcomb plot or a polar clock chart.
To change the coordinate system of your plot, add a `coord_` function to your plot call. `ggplot2` comes with seven coordinate functions that each implement a different coordinate system.
`coord_cartesian()` generates a cartesian coordinate system for your plot. `ggplot2` adds a call to `coord_cartesian()` to your plot by default, but you can also manually add this call. Why would you want to do this?
You can set the `xlim` and `ylim` arguments of `coord_cartesian()` to zoom in on a region of your plot. Set each argument to a vector of length 2. `ggplot2` will use the first value as the minimum value on the x or y axis. It will use the second value as the maximum value.
However, `xlim()` and `ylim()` do not provide a true zoom. Instead, they plot the subset of data that appears within the limits. This may change the appearance of elements that rely on unseen data points, such as a smooth line.
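The difference shows up when a layer depends on data outside the visible window. This sketch uses a linear fit (`method = "lm"`) as an assumption to keep the example simple; the limits of 5 and 7 and the names `base`, `zoomed`, and `subsetted` are illustrative.

```{r}
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# True zoom: the line is fit to all of the data,
# then the view is restricted to 5 <= displ <= 7.
zoomed <- base + coord_cartesian(xlim = c(5, 7))
zoomed

# Not a true zoom: cars outside the limits are dropped (with a warning)
# before the line is fit, so the line can change.
subsetted <- base + xlim(5, 7)
subsetted
```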
`coord_fixed()` also generates a cartesian coordinate system for your plot. However, you can use `coord_fixed()` to set the visual ratio between units on the x axis and units on the y axis. To do this, set the `ratio` argument to the desired ratio in length between y units and x units, e.g.
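A minimal sketch, with a ratio of 1/5 picked arbitrarily for illustration and stored under the illustrative name `fixed`:

```{r}
library(ggplot2)

# ratio = 1/5: one unit of hwy is drawn one fifth as long as one
# unit of displ, so 5 mpg spans the same length as 1 litre.
fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  coord_fixed(ratio = 1/5)
fixed
```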
You can also use the `start` argument to control where in the plot your data starts, given as an offset in radians from 12 o'clock, and the `direction` argument to control the orientation of the plot (1 for clockwise, -1 for anti-clockwise).
*Tip*: `ggplot2` does not come with a pie chart geom, but you can make a pie chart by plotting a stacked bar chart in polar coordinates. To do this, ensure that:
* your x axis only has one value, e.g. `x = factor(1)`
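One way to put this tip into practice is sketched below. Beyond the single x value mentioned above, it assumes two further settings that are commonly used for pie charts: `width = 1` and `theta = "y"`, which maps the stacked bar heights to the angle. The name `pie` is illustrative.

```{r}
library(ggplot2)

# A single stacked bar (x = factor(1)) whose segments are sized by count.
# theta = "y" turns the stacked heights into angles, which yields a pie.
pie <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = factor(1), fill = cut), width = 1) +
  coord_polar(theta = "y")
pie
```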
Add `coord_trans()` to plot your data on cartesian coordinates that have been transformed in some way. To use `coord_trans()`, set the `xtrans` and/or `ytrans` argument (named `x` and `y` in later releases of `ggplot2`) to the name of a transformation that you would like to apply to the x and/or y values.
Facets provide a second way to add variables to a two dimensional graph. When you facet a graph, you divide your data into subgroups and then plot a separate graph, or _facet_, for each subgroup.
For example, we can divide our data set into four subgroups based on the `cyl` variable:
1. all of the cars that have four cylinder engines
2. all of the cars that have five cylinder engines (there are some)
3. all of the cars that have six cylinder engines, and
4. all of the cars that have eight cylinder engines
Or we could divide our data into three groups based on the `drv` variable:
The graphs below show what a faceted graph looks like. They also show how you can build a faceted graph with `facet_grid()`. I'm not going to tell you how `facet_grid()` works---well at least not yet. That would be too easy. Instead, I would like you to try to induce the syntax of `facet_grid()` from the code below. Consider:
Make an honest effort at answering these questions, and then read on past the graphs.
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
```
Ready for the answers?
To facet your graph, add `facet_grid()` to your code. The first argument of `facet_grid()` is always a formula, two variable names separated by a `~`.
`facet_grid()` will use the first variable in the formula to split the graph into rows. Each row will contain data points that have the same value of the variable.
`facet_grid()` will use the second variable in the formula to split the graph into columns. Each column will contain data points that have the same value of the second variable.
This syntax mirrors the rows first, columns second convention of R.
If you prefer to facet your plot on only one dimension, add a `.` to your formula as a place holder. If you place a `.` before the `~`, `facet_grid()` will not facet on the rows dimension. If you place a `.` after the `~`, `facet_grid()` will not facet on the columns dimension.
Facets let you quickly compare subgroups by glancing down rows and across columns. Each facet will use the same limits on the x and y axes, but you can change this behavior across rows or columns by adding a scales argument. Set scales to one of
* `"free_y"` - to let y limits vary across rows
* `"free_x"` - to let x limits vary across columns
* `"free"` - to let both x and y limits vary
For example, the code below lets the limits of the x axes vary across columns.
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl, scales = "free_x")
```
#### `facet_wrap()`
What if you want to facet on a variable that has too many values to display nicely?
For example, if we facet on `class`, `ggplot2` must display narrow subplots to fit each subplot into the same row. This makes it difficult to compare x values with precision.
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ class)
```
`facet_wrap()` provides a more pleasant way to facet a plot across many values. It wraps the subplots into a multi-row, roughly square result.
```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class)
```
The results of `facet_wrap()` can be easier to study than the results of `facet_grid()`. However, `facet_wrap()` can only facet by one variable at a time.