Pipes in R

I started using R long before RStudio was a thing. In fact, the reason I started using Emacs was ESS (Emacs Speaks Statistics), which made Emacs a kind of IDE for R. If I had started with RStudio, I might never have started using Emacs. When I started using R, dplyr and the tidyverse were not a thing either. I have therefore not used dplyr or any of the other tidyverse packages, only base R. I have to confess that I thought it unnecessary to have packages that do the same thing you can already do with base R.

A couple of months ago I found this blog post about working with data using dplyr, and I now fully understand why people use it: the code becomes easier to read. I have also seen many people say that it is easier to learn R by starting with the tidyverse. I am sceptical about that, but I may be biased, since I use Emacs, and Emacs Lisp is far more complicated than base R in how the code is structured. I will give some examples from the blog post. You can follow along, since the data is available through an R package.

First of all we need to load dplyr. If you have not installed dplyr, do that first:

install.packages("dplyr")
library(dplyr)

You also need to install and load the data used in the blog-post:

install.packages("gapminder")
library(gapminder)

We can now look at the data:

head(gapminder)
names(gapminder)

Table 1: head()
	country	continent	year	lifeExp	pop	gdpPercap
	<fct>	<fct>	<int>	<dbl>	<int>	<dbl>
1	Afghanistan	Asia	1952	28.8	8425333	779.
2	Afghanistan	Asia	1957	30.3	9240934	821.
3	Afghanistan	Asia	1962	32.0	10267083	853.
4	Afghanistan	Asia	1967	34.0	11537966	836.
5	Afghanistan	Asia	1972	36.1	13079460	740.
6	Afghanistan	Asia	1977	38.4	14880372	786.

Table 2: the columns country, year and pop (first 10 rows)
	country	year	pop
	<fct>	<int>	<int>
1	Afghanistan	1952	8425333
2	Afghanistan	1957	9240934
3	Afghanistan	1962	10267083
4	Afghanistan	1967	11537966
5	Afghanistan	1972	13079460
6	Afghanistan	1977	14880372
7	Afghanistan	1982	12881816
8	Afghanistan	1987	13867957
9	Afghanistan	1992	16317921
10	Afghanistan	1997	22227415

The first thing done in the blog post is to select columns. In dplyr this is done as follows:

gapminder %>%
    select(country, year, pop)

In base R this can be done as follows:

gapminder[c("country", "year", "pop")]

I think you can see the difference here. In base R we use [ and (, while dplyr uses %>% as a pipe (the same idea as | in bash). A lot of people say that pipes make the code easier to read. In this example I don't think so. But when the code becomes more complicated, as will be shown soon, I agree. Some also say that it is easier to solve problems (and coding is to a great extent about solving problems) by using pipes. Again, I tend to agree.

The next step is to select country, year and lifeExp, but only for the year 2007 (so selecting year is not strictly necessary, but we do it anyway). In dplyr you do as follows:

gapminder %>%
    select(country, year, lifeExp) %>%
    filter(year == 2007)

I think you understand the logic of pipes here. You have the data set (gapminder), which you pipe into select, whose result is in turn piped into filter. In base R I would have done the following to accomplish the same result:

subset(gapminder[c("country", "year", "lifeExp")], year == 2007)

As you can see, the logic of the code runs in the other direction. We start with the filtering function (subset). Then comes the object to be subsetted, which is gapminder restricted to the three variables. Lastly we have the condition for the subset. I think we can all agree that in this example the base R version is somewhat harder to understand, and also to write.

In the blog post the author goes even further and wants to filter on one country: Poland. The code looks as follows:

gapminder %>%
    select(country, year, lifeExp) %>%
    filter(year == 2007, country == "Poland")

In base R I would have written the code as follows:

subset(gapminder[c("country", "year", "lifeExp")], year == 2007 & country == "Poland")

Now, what about selecting two countries? In dplyr:

gapminder %>%
    select(country, year, lifeExp) %>%
    filter(year == 2007, country %in% c("Poland", "Croatia"))

… in base R:

subset(gapminder[c("country", "year", "lifeExp")], year == 2007 & (country == "Poland" | country == "Croatia"))
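The %in% operator used in the dplyr version is itself part of base R, so the same shorthand works inside subset. A minimal sketch, using the built-in mtcars data as a stand-in for gapminder (an assumption, so that it runs without extra packages); for gapminder the condition would read country %in% c("Poland", "Croatia"):

```r
# mtcars ships with R and stands in for gapminder here (an assumption).
# Version with explicit "or", as in the base R example above
with_or <- subset(mtcars, gear == 4 & (cyl == 4 | cyl == 6))

# Same filter written with %in%, which is a base R operator
with_in <- subset(mtcars, gear == 4 & cyl %in% c(4, 6))

# Both calls should select exactly the same rows
same_rows <- identical(with_or, with_in)
print(same_rows)
```

With more than two values to match, the %in% form stays short while the chain of | comparisons keeps growing.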

Now let's do some simple data analysis. To calculate the mean life expectancy:

gapminder %>%
  summarize(avgLifeExp = mean(lifeExp))

In base R I would have done:

mean(gapminder$lifeExp)

Well, this time base R is easier and more readable.

To make it a bit more complicated, we want to look at the mean life expectancy for Europe in 2007. In dplyr:

gapminder %>%
  filter(year == 2007, continent == "Europe") %>%
  summarize(avgLifeExp = mean(lifeExp))

In base R I would most likely do this in two steps: first create a new data set based on the filter/subset, then calculate the mean:

d1 <- subset(gapminder, continent == "Europe" & year == 2007)

mean(d1$lifeExp)

Which is easier is now more a matter of taste and habit.
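For what it is worth, base R can also do this in one expression with with(), which evaluates an expression inside a data frame. A minimal sketch, again using the built-in mtcars data as a stand-in for gapminder (an assumption, so that it runs without extra packages); for gapminder the call would be with(subset(gapminder, continent == "Europe" & year == 2007), mean(lifeExp)):

```r
# mtcars ships with R and stands in for gapminder here (an assumption).
# Two steps, as in the example above
d1 <- subset(mtcars, cyl == 4 & am == 1)
two_step <- mean(d1$mpg)

# One expression: with() evaluates mean(mpg) inside the subsetted data frame
one_step <- with(subset(mtcars, cyl == 4 & am == 1), mean(mpg))

print(all.equal(two_step, one_step))
```

The one-expression form saves the temporary object, but it reads inside-out again, which is exactly the thing pipes are meant to avoid.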

To summarize so far, the main advantage of dplyr is the pipe. But as you can also see, dplyr uses different names for the same operations, and sometimes its functions are slightly different and more comprehensive. You can see more about code and commands in dplyr/tidyverse vs base R here. To some extent I find it annoying to use other commands than base R's when they mainly do the same thing. However, using pipes makes sense. It is another way of thinking when writing code, but the result is certainly cleaner. A good thing, though, is that from R version 4.1.0 there is a pipe in base R as well (written |>). I therefore don't really see any major benefit of dplyr or the tidyverse over base R. But I don't think you gain anything by being a fundamentalist about this. The filter command (shown above) is certainly better (or maybe more agile is a better word) than the subset command.

So how do pipes in base R work? Well, the same as every pipe. You start with an object, which you then pipe through commands. You can also pipe the result of one command into other commands. The code above, rewritten with pipes in base R, would look something like:

With dplyr

gapminder %>%
    select(country, year, lifeExp) %>%
    filter(year == 2007)

With base R without pipes

subset(gapminder[c("country", "year", "lifeExp")], year == 2007)

With base R using pipes

gapminder[c("country", "year", "lifeExp")] |>
    subset(year == 2007)
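One way to see what the native pipe does: the parser rewrites it into an ordinary nested call, so x |> f(y) is simply f(x, y). A small sketch (it requires R 4.1.0 or later; nothing is evaluated, so gapminder does not even need to be loaded):

```r
# Requires R >= 4.1.0 for the native pipe |>
# quote() captures the code without running it
piped  <- quote(gapminder |> subset(year == 2007))
nested <- quote(subset(gapminder, year == 2007))

# Printing the piped version shows the nested call the parser produced
print(piped)

# The two captured calls should be the same object
same_call <- identical(piped, nested)
print(same_call)
```

So the piped and the nested versions are the same code to R; the pipe only changes the order in which we read and write it.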

For a somewhat more complex example:

gapminder %>%
    select(country, year, lifeExp) %>%
    filter(year == 2007, country %in% c("Poland", "Croatia"))

subset(gapminder[c("country", "year", "lifeExp")], year == 2007 & (country == "Poland" | country == "Croatia"))

gapminder[c("country", "year", "lifeExp")] |>
    subset(year == 2007) |>
    subset(country == "Poland" | country == "Croatia")
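As a side note, subset() also has a select argument, so the column selection can move into the same call instead of using [ first. A minimal sketch, using the built-in mtcars data as a stand-in for gapminder (an assumption, so that it runs without extra packages); for gapminder it would read gapminder |> subset(year == 2007, select = c(country, year, lifeExp)):

```r
# Requires R >= 4.1.0 for the native pipe |>
# mtcars ships with R and stands in for gapminder here (an assumption).
# Select columns with [ first, then filter rows
bracket_first <- mtcars[c("mpg", "cyl", "gear")] |>
    subset(gear == 4)

# Filter rows and select columns in a single subset() call
one_call <- mtcars |>
    subset(gear == 4, select = c(mpg, cyl, gear))

print(all.equal(bracket_first, one_call))
```

This comes fairly close to the dplyr version, with the data set first and the unquoted column names after it.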

However, I think the new pipe in base R makes even more sense when conducting analyses. For example, a table of proportions without pipes looks something like:

prop.table(table(gapminder$continent)) *100

… and if you want only one decimal and the categories in a column:

cbind(round(prop.table(table(gapminder$continent)) *100, 1))

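If we use pipes it would be (the multiplication and rounding are wrapped in an anonymous function, since the right-hand side of |> must be a function call):

gapminder$continent |>
    table() |>
    prop.table() |>
    (\(p) round(p * 100, 1))() |>
    cbind()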

… which is much nicer. To be honest, though, I usually store the intermediate results in objects (here t1 and t2) and keep the analyses separate.

t1 <- table(gapminder$continent)
t2 <- prop.table(t1) *100
cbind(round(t2, 1))

Here we have much nicer code that is quite similar to the version using pipes. An advantage (and a disadvantage of pipes) is that you get the results separately: here we get both the frequency table and the table of proportions. With pipes we only get the table of proportions, and have to write another block of code to get the frequency table.
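Such a pair of piped blocks could be sketched as follows. The built-in mtcars$cyl is used as a stand-in for gapminder$continent (an assumption, so that it runs without extra packages), and the object names freq and prop are only for illustration:

```r
# Requires R >= 4.1.0 for the native pipe |> and the \(p) shorthand
# mtcars$cyl stands in for gapminder$continent here (an assumption).
# First block: the frequency table as a one-column matrix
freq <- mtcars$cyl |>
    table() |>
    cbind()

# Second block: the proportions in percent, rounded to one decimal
prop <- mtcars$cyl |>
    table() |>
    prop.table() |>
    (\(p) round(p * 100, 1))() |>
    cbind()

print(freq)
print(prop)
```

The table() step is written twice here, which is the price of piping; storing it in an object such as t1 avoids the repetition.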
