Pipes in R

I started using R long before RStudio was a thing. In fact, the reason I started to use
Emacs was ESS (Emacs Speaks Statistics), which made Emacs an IDE of sorts for R. If I had
started with RStudio, I might not be using Emacs at all now. When I started using R, dplyr
and the tidyverse were not a thing either. I have therefore never used dplyr or any other
tidyverse package, only R-base. I have to confess that I have thought it unnecessary to
have packages that do the same thing you can already do in R-base. A couple of months ago
I found this blog post about working with data using dplyr. I totally understand why
people use dplyr: the code becomes easier to read. I have also seen many people say that
it is easier to learn R by starting with the tidyverse. I am sceptical about that, but I
may be biased, since I use Emacs and Emacs Lisp is far more complicated than R-base when
it comes to the structure of the code. I will give some examples from the blog post. You
can follow along, since the data used is available through an R package. First install and
load dplyr:

install.packages("dplyr")
library(dplyr)

You also need to install and load the data used in the blog post:

install.packages("gapminder")
library(gapminder)

We can now look at the data:

head(gapminder)
names(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"

The first thing done in the blog post is to select columns. In dplyr this is done as
follows:

gapminder %>%
    select(country, year, pop)
## # A tibble: 1,704 × 3
##    country      year      pop
##    <fct>       <int>    <int>
##  1 Afghanistan  1952  8425333
##  2 Afghanistan  1957  9240934
##  3 Afghanistan  1962 10267083
##  4 Afghanistan  1967 11537966
##  5 Afghanistan  1972 13079460
##  6 Afghanistan  1977 14880372
##  7 Afghanistan  1982 12881816
##  8 Afghanistan  1987 13867957
##  9 Afghanistan  1992 16317921
## 10 Afghanistan  1997 22227415
## # ℹ 1,694 more rows

In R-base this can be done as follows:

gapminder[c("country", "year", "pop")]

I think you can see the difference here. In R-base we use [ and (, while in dplyr %>% is
used as a pipe operator (the same idea as | in bash). A lot of people say that pipes make
the code easier to read. In this example I don’t think so. But when the code becomes more
complicated, as will be shown soon, I agree. Some also say that it is easier to solve
problems (and coding is to a great extent about solving problems) by using pipes. I tend
to agree with that too.

The next step is to select country, year and lifeExp, but only for the year 2007 (so
selecting year is not strictly necessary, but we do it anyway). In dplyr you do it as
follows:

gapminder %>%
    select(country, year, lifeExp) %>%
    filter(year == 2007)

I think you understand the logic with pipes here. You have the data set (gapminder), which you pipe into select, which in turn is piped into filter. In R-base I would have done the following to accomplish the same result:

subset(gapminder[c("country", "year", "lifeExp")], year == 2007)
## # A tibble: 142 × 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  2007    43.8
##  2 Albania      2007    76.4
##  3 Algeria      2007    72.3
##  4 Angola       2007    42.7
##  5 Argentina    2007    75.3
##  6 Australia    2007    81.2
##  7 Austria      2007    79.8
##  8 Bahrain      2007    75.6
##  9 Bangladesh   2007    64.1
## 10 Belgium      2007    79.4
## # ℹ 132 more rows

As you can see, the logic of the code goes in the other direction. We start with the
filtering (using subset). Then comes the object to be subsetted, which is gapminder
restricted to the three variables. Last come the criteria for the subset command. I think
we can all agree that in this example R-base is somewhat harder to read, and also to
write.
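
The column selection and the row filter can also be handed to subset in one call, since subset for data frames takes a select argument. Below is a minimal sketch on a toy data frame; the name d is made up for illustration, and the few rows are copied from the gapminder output shown earlier so that the package is not needed to run it:

```r
# Toy data frame: a handful of rows taken from the output shown above.
d <- data.frame(
    country   = c("Afghanistan", "Afghanistan", "Albania", "Algeria"),
    continent = c("Asia", "Asia", "Europe", "Africa"),
    year      = c(1977L, 2007L, 2007L, 2007L),
    lifeExp   = c(38.4, 43.8, 76.4, 72.3)
)

# Second argument: which rows to keep. `select`: which columns to keep.
res <- subset(d, year == 2007, select = c(country, year, lifeExp))
print(res)
```

With the real data the same call would presumably be subset(gapminder, year == 2007, select = c(country, year, lifeExp)). The reading order is still "inside out" compared with the piped version, but the nesting of [ inside subset goes away.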

In the blog post the author goes even further and filters on one country: Poland. The code looks as follows:

gapminder %>%
    select(country, year, lifeExp) %>%
    filter(year == 2007, country == "Poland")
## # A tibble: 1 × 3
##   country  year lifeExp
##   <fct>   <int>   <dbl>
## 1 Poland   2007    75.6

In R-base I would have written the code as follows:

subset(gapminder[c("country", "year", "lifeExp")], year == 2007 & country == "Poland")
## # A tibble: 1 × 3
##   country  year lifeExp
##   <fct>   <int>   <dbl>
## 1 Poland   2007    75.6

Now what about selecting two countries? In dplyr:

gapminder %>%
    select(country, year, lifeExp) %>%
    filter(year == 2007, country %in% c("Poland", "Croatia"))
## # A tibble: 2 × 3
##   country  year lifeExp
##   <fct>   <int>   <dbl>
## 1 Croatia  2007    75.7
## 2 Poland   2007    75.6

In R-base:

subset(gapminder[c("country", "year", "lifeExp")],
       year == 2007 & (country == "Poland" | country == "Croatia"))
## # A tibble: 2 × 3
##   country  year lifeExp
##   <fct>   <int>   <dbl>
## 1 Croatia  2007    75.7
## 2 Poland   2007    75.6
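
The %in% operator used in the dplyr version is itself part of R-base, so the condition can be shortened there as well. A small sketch on a toy data frame (the name d is made up, and the rows are copied from output shown in this post):

```r
# Toy data frame with a few rows from the output shown in this post.
d <- data.frame(
    country = c("Croatia", "Poland", "Austria", "Afghanistan"),
    year    = c(2007L, 2007L, 2007L, 1977L),
    lifeExp = c(75.7, 75.6, 79.8, 38.4)
)

# %in% replaces the chain of == comparisons joined with |
res <- subset(d, year == 2007 & country %in% c("Poland", "Croatia"))
print(res)
```

This scales better than writing one == comparison per country when the list of countries grows.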

Now let’s do some simple data analyses. To calculate the mean life expectancy:

gapminder %>%
    summarize(avgLifeExp = mean(lifeExp))
## # A tibble: 1 × 1
##   avgLifeExp
##        <dbl>
## 1       59.5

In R-base I would have done:

mean(gapminder$lifeExp)
## [1] 59.47444

Well, this time R-base is easier and more readable.

To make it a bit more complicated, we want to look at the mean life expectancy for Europe
in 2007. In dplyr:

gapminder %>%
    filter(year == 2007, continent == "Europe") %>%
    summarize(avgLifeExp = mean(lifeExp))
## # A tibble: 1 × 1
##   avgLifeExp
##        <dbl>
## 1       77.6

In R-base I would most likely do this in two steps. First create a new data set based on
the filter/subset. Then calculate the mean:

d1 <- subset(gapminder, continent == "Europe" & year == 2007)
mean(d1$lifeExp)
## [1] 77.6486

Which one is easier is now more a matter of taste and habit.
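
If the intermediate data set is not needed, the two steps can also be collapsed into one expression in R-base, for example with with(). This is only a sketch on a toy data frame (the name d is made up; the rows are 2007 values from the output above, with the continents filled in by hand):

```r
# Toy data frame: four 2007 rows taken from the output shown above.
d <- data.frame(
    country   = c("Albania", "Austria", "Belgium", "Afghanistan"),
    continent = c("Europe", "Europe", "Europe", "Asia"),
    year      = c(2007L, 2007L, 2007L, 2007L),
    lifeExp   = c(76.4, 79.8, 79.4, 43.8)
)

# with() evaluates mean(lifeExp) inside the subsetted data frame,
# so no intermediate object such as d1 has to be created.
avg <- with(subset(d, continent == "Europe" & year == 2007), mean(lifeExp))
print(avg)

# The same thing with plain indexing:
avg2 <- mean(d$lifeExp[d$continent == "Europe" & d$year == 2007])
```

Whether this is more readable than the two-step version is, again, a matter of taste.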

To summarize so far, the main advantage of using dplyr is the pipe operator. But as you
can also see, dplyr uses different names for the same commands, and sometimes its
functions are slightly different and more comprehensive. You can see more about code and
commands in dplyr/tidyverse vs R-base here. To some extent I would argue that it is
annoying to use commands other than the R-base ones when they mainly do the same thing.
However, using pipes makes sense. It is another way of thinking when writing code, but the
result is certainly cleaner. A good thing, though, is that from R version 4.1.0 there is a
pipe operator in R-base as well (written |>). I therefore really don’t see any major
benefit of using dplyr or the tidyverse over R-base. But I don’t think you gain anything
from being a fundamentalist about this. The filter command (shown above) is certainly
better (or maybe more agile is a better word) than the subset command.

So how do pipes in R-base work? Well, the same as every other pipe. You define an object
which you then pipe through commands. You can also write a command whose result you then
pipe into other commands. The code above, rewritten with pipes in R-base, would look
something like this:

With dplyr:

gapminder %>%
    select(country, year, lifeExp) %>%
    filter(year == 2007)
## # A tibble: 142 × 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  2007    43.8
##  2 Albania      2007    76.4
##  3 Algeria      2007    72.3
##  4 Angola       2007    42.7
##  5 Argentina    2007    75.3
##  6 Australia    2007    81.2
##  7 Austria      2007    79.8
##  8 Bahrain      2007    75.6
##  9 Bangladesh   2007    64.1
## 10 Belgium      2007    79.4
## # ℹ 132 more rows

With R-base without pipes:

subset(gapminder[c("country", "year", "lifeExp")], year == 2007)
## # A tibble: 142 × 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  2007    43.8
##  2 Albania      2007    76.4
##  3 Algeria      2007    72.3
##  4 Angola       2007    42.7
##  5 Argentina    2007    75.3
##  6 Australia    2007    81.2
##  7 Austria      2007    79.8
##  8 Bahrain      2007    75.6
##  9 Bangladesh   2007    64.1
## 10 Belgium      2007    79.4
## # ℹ 132 more rows

With R-base using pipes:

gapminder[c("country", "year", "lifeExp")] |>
    subset(year==2007)
## # A tibble: 142 × 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  2007    43.8
##  2 Albania      2007    76.4
##  3 Algeria      2007    72.3
##  4 Angola       2007    42.7
##  5 Argentina    2007    75.3
##  6 Australia    2007    81.2
##  7 Austria      2007    79.8
##  8 Bahrain      2007    75.6
##  9 Bangladesh   2007    64.1
## 10 Belgium      2007    79.4
## # ℹ 132 more rows
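
One way to convince yourself that the piped and the nested versions really are the same code is to quote them. As far as I understand it, the |> pipe is rewritten into an ordinary nested call when the code is parsed, so nothing needs to be evaluated (and gapminder does not even have to be loaded) to compare the two. A small sketch, which needs R 4.1.0 or later; the names piped and nested are made up:

```r
# quote() captures the expression without evaluating it.
piped  <- quote(gapminder[c("country", "year", "lifeExp")] |> subset(year == 2007))
nested <- quote(subset(gapminder[c("country", "year", "lifeExp")], year == 2007))

print(piped)                       # shows the call the pipe was turned into
same <- identical(piped, nested)   # compares the two parsed expressions
print(same)
```

So the left-hand side simply becomes the first argument of the call on the right-hand side.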

For a somewhat more complex example:

With dplyr:

gapminder %>%
    select(country, year, lifeExp) %>%
    filter(year == 2007, country %in% c("Poland", "Croatia"))
## # A tibble: 2 × 3
##   country  year lifeExp
##   <fct>   <int>   <dbl>
## 1 Croatia  2007    75.7
## 2 Poland   2007    75.6

With R-base without pipes:

subset(gapminder[c("country", "year", "lifeExp")],
       year == 2007 & (country == "Poland" | country == "Croatia"))
## # A tibble: 2 × 3
##   country  year lifeExp
##   <fct>   <int>   <dbl>
## 1 Croatia  2007    75.7
## 2 Poland   2007    75.6

With R-base using pipes:

gapminder[c("country", "year", "lifeExp")] |>
    subset(year == 2007) |>
    subset(country == "Poland" | country == "Croatia")
## # A tibble: 2 × 3
##   country  year lifeExp
##   <fct>   <int>   <dbl>
## 1 Croatia  2007    75.7
## 2 Poland   2007    75.6

However, I think the new pipe operator in R-base makes even more sense when conducting analyses. For example, making a table of proportions without pipes looks something like:

prop.table(table(gapminder$continent)) * 100
## 
##    Africa  Americas      Asia    Europe   Oceania 
## 36.619718 17.605634 23.239437 21.126761  1.408451 

…and if you want only one decimal and the categories in a column:

cbind(round(prop.table(table(gapminder$continent)) * 100, 1))
##          [,1]
## Africa   36.6
## Americas 17.6
## Asia     23.2
## Europe   21.1
## Oceania   1.4

If we use pipes it would be:

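gapminder$continent |>
    table() |>
    prop.table() |>
    (\(p) round(p * 100, 1))() |>   # the arithmetic step, wrapped in an anonymous function
    cbind()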
##          [,1]
## Africa   36.6
## Americas 17.6
## Asia     23.2
## Europe   21.1
## Oceania   1.4

… which is much nicer. To be honest though, I usually put the intermediate results into
objects (here t1 and t2) and separate the steps of the analysis.

t1 <- table(gapminder$continent)
t2 <- prop.table(t1) *100
cbind(round(t2, 1))
##          [,1]
## Africa   36.6
## Americas 17.6
## Asia     23.2
## Europe   21.1
## Oceania   1.4

Here we have much nicer code that is quite similar to the version using pipes. An
advantage (and a disadvantage of using pipes) is that you get the results separately. Here
we get both the frequency table and the table of proportions. If we use pipes we only get
the table of proportions, and we have to write another block of code to get the frequency
table.
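
A possible middle ground is to store only the result you want to keep and pipe the rest. The sketch below uses a small made-up factor instead of gapminder$continent, and the anonymous function (\(p) ...)() is just one way I know of to fit the "multiply by 100 and round" step into a |> chain; it needs R 4.1.0 or later:

```r
# Made-up toy factor standing in for gapminder$continent.
continent <- factor(c("Africa", "Africa", "Asia", "Europe"))

# Keep the frequency table as an object ...
t1 <- table(continent)

# ... and pipe only the remaining steps.
t2 <- t1 |>
    prop.table() |>
    (\(p) round(p * 100, 1))() |>   # anonymous function for the arithmetic step
    cbind()

print(t1)   # frequencies
print(t2)   # percentages in one column
```

This way both tables are available afterwards, while the steps from frequencies to percentages still read from top to bottom.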

