Two Wrongs

Blogging With R and ggplot2 in Org

I want to include more graphics in my writing. I’m not very good at it, so I will use this place to practise. One type of graphic – that is very efficient at communicating things – is the data plot.

Plots Matter

I used to read The Economist regularly, and they are very good at plotting data. When I wanted to share information from their articles, it was often sufficient to just share a single key plot. That is how powerful plots are. Edward Tufte also hammers this concept home in his fantastic book about presenting data visually (Edward Tufte; The Visual Display of Quantitative Information; Graphics Press; 1986). A well-designed plot can mean the difference between something important going unnoticed and being the focus of an article.

At some point, I would like to learn more about making statistical calculations with computers. I also want nice plots. It appears the R language fulfills both those criteria, so I’ll start by using it to draw plots.

Setup: Org with R and ggplot2

The setup requires two steps. The first is installing R and ggplot2, which is easy enough in Debian:

$ sudo apt-get install r-base r-cran-ggplot2

To test whether the installation worked, we should be able to start R in the terminal by running the R command, then enter a string like "hello" and see it echoed back. (Do people not do this anymore? It was surprisingly hard to find out how to just start R. And yes, that’s an upper-case R: a relief for those of us who have an alias r='fc -e -' to repeat the last used command.)
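As a small sketch of that check, the lines below can be typed at the R prompt once it has started. The requireNamespace() call is my own addition rather than part of the setup described here; it only reports whether this R installation can find ggplot2.

```r
# Typing a string at the R prompt should echo it back.
greeting <- "hello"
print(greeting)

# Assumption: we also want to know whether ggplot2 is visible to this R.
# requireNamespace() returns TRUE or FALSE without attaching the package.
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
print(has_ggplot2)
```

If the second value is FALSE, the r-cran-ggplot2 package probably did not install where this R looks for libraries.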

We then want Org to be allowed to execute R code. We can specify that by setting the org-babel-load-languages variable to include the languages we want executed.

(org-babel-do-load-languages 'org-babel-load-languages
    '((emacs-lisp . t) (R . t)))

We also want an Emacs mode for editing R code, since it makes the process of creating graphs much easier. ESS (Emacs Speaks Statistics) is a collection of extensions that gives support for, among other things, R.

(use-package ess :init (require 'ess-site))

And, finally, we may want to set the following variables. They reduce the number of questions Org and ESS ask when publishing. (This is a requirement for me, because I publish posts in a non-interactive git post-receive hook.)

(setq org-confirm-babel-evaluate nil)
(setq ess-directory "/tmp")
(setq ess-ask-for-ess-directory nil)

Vector Graphics From R to Org

Since baby steps are a good idea in general, that’s what we’ll start with. We want our Org document to include an R source block that can generate the plot we want. (Okay, in this guide I’ll have to embed Org code in an Org document – this is going to be tricky.) We add some special parameters for publishing, which are explained below.

#+HEADER: :file myplot.svg
#+HEADER: :R-dev-args bg="transparent"
#+BEGIN_SRC R :exports results :session :results graphics
  # R code to draw plot
#+END_SRC

The svg export is the tricky bit, because information is scarce. To get proper SVG exports from R in Org, we need a #+BEGIN_SRC R block, of course. (And it’s worth mentioning for your debugging ability that this stuff is handled by the Org Babel extension.) The parameters we need are as follows.

Obviously, :file myplot.svg specifies a filename for the graphics. If it ends in .svg, we get svg graphics. We want svg graphics because, like anything else on the web these days, they need to scale to arbitrary pixel densities. (Hey, this is actually somewhat of a realisation. Vector graphics are no longer about scaling to arbitrary sizes, they are about scaling to arbitrary pixel densities. Neat.) This file needs to be exported along with the html.

We tell Org to replace the source code block with the results of its evaluation using the :exports results argument. In our case, that will be the embedded svg file. Then :results graphics makes sure everyone is on the same page with regard to R being used to generate graphics, not to print text.

For the longest time, I could not get a transparent svg file. For some reason R just wanted to put a huge <rect> with a white fill colour at the base of the image, regardless of what my code said. Eventually, I figured out that I needed the :R-dev-args bg="transparent" argument to preserve image transparency.
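To see what that argument does outside Org, here is a minimal sketch in plain R. It assumes that Babel ends up opening R's svg() graphics device and passing the :R-dev-args text along as extra arguments; I have not verified the exact call Org generates, and the temporary file path is only for illustration.

```r
# Hypothetical output path; Org would use the name given in :file instead.
out_file <- file.path(tempdir(), "myplot.svg")

# bg = "transparent" is the same argument passed through :R-dev-args.
# The default for this device is a white background.
svg(out_file, bg = "transparent")
plot(c(8, 7, 5, 6), c(4, 2, 2, 1), xlab = "sleep", ylab = "happiness")
invisible(dev.off())  # close the device so the file is written

file.exists(out_file)
```

Note that svg() relies on R having been built with cairo support, which the Debian packages normally provide.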

Multiple online manuals are very assertive about the :session argument being required, so I’ve included it. I think it also implies that variables will be shared across R code blocks in the document, which may or may not be what you want.

Editing R code embedded in Org

You can, of course, edit the code block straight in Org as any text. However, there is a better way. If you press C-c ' with the cursor positioned inside the source block, Org will open up a new Emacs window with only the R source code in it. This new window is synchronised with the Org file, meaning that if you edit and save in the R window, the Org file will also be updated and saved.

If you want to test run the code in the window, press C-c C-b. To exit the new window, press C-c ' again.

R and the Grammar of Graphics Paradigm

This is only a whirlwind tour. There is so much more to learn.

The ggplot2 API is based on something called the Grammar of Graphics, which is a standardised way to talk about plots. This grammar (and, consequently, ggplot2) has three key concepts we need to introduce right away.

The first concept is probably the most intuitive. It’s called data, and it is the information we want to plot, plain and simple. Each data point consists of two or more variables, and some of these variables will be represented in the plot. (So, for example, a data point can be (1) the amount of money I have (2) at a particular point in time. Or it can be (1) the amount of money (2) a person in a group has. Or maybe (1) the gas mileage of (2) a car model at (3) a particular point in time for (4) a particular air density. You can cram a lot of variables into a data point, and plotting them all sensibly can be a challenge.)

How the data appears in the plot is based on a mapping from data variables to either spatial dimensions (say, money on the Y axis and time on the X axis) or some other dimension that is intuitively understood visually (like thickness of lines, darkness of shading or colours of a heatmap; these all give a quick sense of how large the value they represent is). The mapping between data variables and plot dimensions is called an aesthetic.

Finally, the physical shape used to represent the data points (for example lines, points, violins and so on) is called a geometry. (Violins sound funnier than they are. There’s actually a thing called a violin plot.) The same data points can be plotted using many geometries, and all will be rendered on the same plot.

To begin with, we can use this R code, representing two weeks’ worth of sleep and mood tracking data.

library(ggplot2)

simple_data = data.frame(
  sleep     = c(8, 7, 5, 6, 7, 8, 9, 7, 2, 8, 5, 5, 5, 8),
  happiness = c(4, 2, 2, 1, 5, 4, 3, 2, 1, 5, 4, 4, 4, 5)
)

ggplot(simple_data, aes(x=sleep, y=happiness)) +
    geom_point()

The top vector represents how many hours of sleep someone has had the night before. The bottom vector is a happiness rating from one to five. We pack it into a dataframe and then we start plotting.
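Before plotting, it can help to check that the data frame really looks the way we think it does. This is a quick sketch using only base R; none of these calls are needed for the plot itself.

```r
simple_data <- data.frame(
  sleep     = c(8, 7, 5, 6, 7, 8, 9, 7, 2, 8, 5, 5, 5, 8),
  happiness = c(4, 2, 2, 1, 5, 4, 3, 2, 1, 5, 4, 4, 4, 5)
)

str(simple_data)              # column names and types
nrow(simple_data)             # one row per night of tracking
range(simple_data$happiness)  # lowest and highest rating recorded
```

Each row of the data frame is one data point, and each column is one of its variables.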

We configure an aesthetic which maps the amount of sleep to the X axis and the happiness rating to the Y axis. We don’t specify any colours or other such dimensions, because our data points consist of only two variables. (Generally, do not plot data using more dimensions than the data has to begin with. Do not plot two-dimensional data with three dimensions, for example. The opposite is okay. We can reduce the complexity of data by plotting only some of its dimensions.) After the ggplot(simple_data, aes(x=sleep, y=happiness)) call, we have data, and we have an aesthetic that maps the data to plot coordinates. But we still don’t have any geometrical shape to represent the data entries. If we showed the plot at that point, it would essentially be an empty grid.

So we add a geometry to our data with the + operator: we tell ggplot2 that, “After you have drawn the empty plot with grid lines and stuff, please add a point geometry for each entry in the data.”
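The layering can be made visible by building the plot in steps. This is only a sketch: peeking at the layers field of the plot object is my own way of illustrating the idea, and the extra geom_line() is there purely to show that several geometries stack on the same plot.

```r
library(ggplot2)

simple_data <- data.frame(
  sleep     = c(8, 7, 5, 6, 7, 8, 9, 7, 2, 8, 5, 5, 5, 8),
  happiness = c(4, 2, 2, 1, 5, 4, 3, 2, 1, 5, 4, 4, 4, 5)
)

# Data and aesthetic only: no geometry yet, so the plot has no layers.
base_plot <- ggplot(simple_data, aes(x = sleep, y = happiness))
length(base_plot$layers)

# Each geometry added with + becomes one more layer on the same plot.
with_points <- base_plot + geom_point()
with_both   <- with_points + geom_line()
length(with_points$layers)
length(with_both$layers)
```

Because + returns a new plot object, base_plot stays untouched and can be reused with other geometries.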

With that, we get

[Figure: scatter plot of happiness against sleep in the default ggplot2 style]

which is probably the most basic plot you’ll create in ggplot2.

Styling the Plot

The previous plot uses the default ggplot2 style, which is decent but I want something slightly more modern and optimised for the web. A good start is to apply the classic theme that ships with ggplot2. This removes a lot of visual noise.

ggplot(simple_data, aes(x=sleep, y=happiness)) +
    geom_point() +
    theme_classic()

[Figure: the same scatter plot with the classic theme applied]

We may also want to remove the tickmarks on the axes, as well as the solid white background.

ggplot(simple_data, aes(x=sleep, y=happiness)) +
    geom_point() +
    theme_classic() +
    theme(axis.ticks=element_blank(),
          panel.background=element_blank(),
          plot.background=element_blank())

[Figure: the scatter plot without tick marks or background]

Making the Plot Easier to Read

The distinction between this section and the previous one is a bit arbitrary, because good style makes the plot easy to read, and something that makes the plot easy to read is good style.

There are several issues with the X axis at the moment. Just off the top of my head: the label does not convey a lot of information, and the scale is cut off on the left-hand side, so it doesn’t extend all the way to the origin. All configuration of the X scale is done in the example below. The ticks are placed at the values indicated by breaks, which is set to the vector c(0,2,4,…,8,10).

While we’re at it, we also configure the Y axis to extend all the way down to zero.

ggplot(simple_data, aes(x=sleep, y=happiness)) +
    geom_point() +
    scale_x_continuous("Sleep (hours)",
                       limits=c(0, 10),
                       breaks=seq(0, 10, 2)) +
    scale_y_continuous("Happiness (subjective rating)",
                       limits=c(0, 5),
                       breaks=0:5) +
    theme_classic() +
    theme(axis.ticks=element_blank(),
          panel.background=element_blank(),
          plot.background=element_blank())

[Figure: the scatter plot with labelled axes extending to zero]
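The two break vectors used in the scales can be checked on their own at the R prompt. This is a small base R sketch; the variable names are mine.

```r
x_breaks <- seq(0, 10, 2)  # from 0 to 10 in steps of 2
y_breaks <- 0:5            # every integer from 0 to 5

print(x_breaks)
print(y_breaks)
```

seq() is handy when the step is not 1, while the colon form is the shortest way to get consecutive integers.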

Actually, the Y axis label is a bit hard to read. We can move that to the title of the plot instead, making it horizontal without robbing too much space.

ggplot(simple_data, aes(x=sleep, y=happiness)) +
    geom_point() +
    scale_x_continuous("Sleep (hours)",
                       limits=c(0, 10),
                       breaks=seq(0, 10, 2)) +
    scale_y_continuous("",
                       limits=c(0, 5),
                       breaks=0:5) +
    labs(title="Happiness (subjective rating 1–5) as a function of sleep") +
    theme_classic() +
    theme(axis.ticks=element_blank(),
          panel.background=element_blank(),
          plot.background=element_blank())

[Figure: the scatter plot with the Y axis label moved to the title]

The graphic is square at the moment, which is often not ideal. Tufte talks about having a width about 1.2–1.8 times the height, but I’m going to pick something slightly more extreme for this example, primarily because my X axis is covering a larger range than my Y axis. Note that we cannot set the height inside the R code itself, so this is something we set in the Org #+HEADER:. I’m now running with

#+HEADER: :R-dev-args bg="transparent" :width 7 :height 3.5
ggplot(simple_data, aes(x=sleep, y=happiness)) +
    geom_point() +
    scale_x_continuous("Sleep (hours)",
                       limits=c(0, 10),
                       breaks=seq(0, 10, 2)) +
    scale_y_continuous("",
                       limits=c(0, 5),
                       breaks=0:5) +
    labs(title="Happiness (subjective rating 1–5) as a function of sleep") +
    theme_classic() +
    theme(axis.ticks=element_blank(),
          panel.background=element_blank(),
          plot.background=element_blank())

[Figure: the same plot at a width of 7 and a height of 3.5]

Adding Annotations

If we’re curious about the potential relationship between sleep and happiness, it could be interesting to overlay a linear regression.

[Figure: the plot with a linear regression line overlaid]
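If we want the numbers behind such a line, the same kind of fit can be computed directly with base R's lm(). This is a sketch for inspection only; the plot does not need it, and the variable names are mine.

```r
simple_data <- data.frame(
  sleep     = c(8, 7, 5, 6, 7, 8, 9, 7, 2, 8, 5, 5, 5, 8),
  happiness = c(4, 2, 2, 1, 5, 4, 3, 2, 1, 5, 4, 4, 4, 5)
)

# Ordinary least squares: happiness modelled as a straight line in sleep.
fit <- lm(happiness ~ sleep, data = simple_data)

coef(fit)                            # intercept and slope of the line
slope <- unname(coef(fit)["sleep"])  # change in rating per extra hour
slope > 0                            # TRUE: more sleep, higher rating here
```

With only fourteen data points this says very little about sleep and happiness in general, but it shows where the drawn line comes from.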

It’s also useful to be able to annotate the plot – but beware that this is not a place where ggplot2 shines. If you’re doing serious publishing, you probably want to annotate the plot in something that’s better for it, like Inkscape. But for simple notes, go ahead and embed it in the code!

ggplot(simple_data, aes(x=sleep, y=happiness)) +
    geom_point() +
    geom_smooth(method="lm", formula=y~x, fullrange=TRUE) +
    scale_x_continuous("Sleep (hours)",
                       limits=c(0, 10),
                       breaks=seq(0, 10, 2)) +
    scale_y_continuous("",
                       limits=c(0, 5),
                       breaks=0:5) +
    labs(title="Happiness (subjective rating 1–5) as a function of sleep") +
    theme_classic() +
    theme(axis.ticks=element_blank(),
          panel.background=element_blank(),
          plot.background=element_blank()) +
    annotate("segment", x=7, xend=7.5, y=2, yend=1.5, colour="grey50") +
    annotate("text", label="Good sleep, bad day",
             x=7.2, y=1.3, hjust=0, colour="grey50")

[Figure: the plot with the regression line and a text annotation]

The End

I’m going to stop here, because I don’t think I have many more useful things to say in what little space remains on the page. I could write a lot more about R, and I could write a lot more about plotting well, but those things are better suited for a different article. I hope this is enough to get you started publishing with R, because it’s a great tool to have in your arsenal!