I started using R long before RStudio was a thing. In fact, the reason I started to use
Emacs was ESS (Emacs Speaks Statistics), which turned Emacs into a kind of IDE for R.
If I had started with RStudio, I might not be using Emacs at all now.
When I started using R, dplyr and the tidyverse were not a thing either. I have therefore
not used dplyr or any of the other tidyverse packages; I have only used R-base. I have to
confess that I have thought it unnecessary to have packages that do the same thing that
you can do with R-base. A couple of months ago I found this blog post
about working with data using dplyr. I totally understand why people use dplyr: the code
becomes easier to read. I have also seen many people say that it is easier to learn R by
starting with the tidyverse. I am sceptical about that, but I may be biased, since I use
Emacs and Emacs Lisp is far more complicated than R-base when it comes to the structure of
the code. I will give some examples from the blog post. You can follow along, since the data used is available through an R package. First install and load dplyr:
install.packages("dplyr")
library(dplyr)
You also need to install and load the data used in the blog post:
install.packages("gapminder")
library(gapminder)
We can now look at the data:
head(gapminder)
names(gapminder)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
The first thing done in the blog post is to select columns. In dplyr this is done as
follows:
gapminder %>%
select(country, year, pop)
## # A tibble: 1,704 × 3
## country year pop
## <fct> <int> <int>
## 1 Afghanistan 1952 8425333
## 2 Afghanistan 1957 9240934
## 3 Afghanistan 1962 10267083
## 4 Afghanistan 1967 11537966
## 5 Afghanistan 1972 13079460
## 6 Afghanistan 1977 14880372
## 7 Afghanistan 1982 12881816
## 8 Afghanistan 1987 13867957
## 9 Afghanistan 1992 16317921
## 10 Afghanistan 1997 22227415
## # ℹ 1,694 more rows
In R-base this can be done as follows:
gapminder[c("country", "year", "pop")]
I think you can see the difference here. In R-base we use [ and (, while dplyr uses %>%
as a pipe (much like | in bash). A lot of people say that pipes make the code easier to
read. In this example I don’t think so, but when the code becomes more complicated, as
will be shown soon, I agree. Some also say that pipes make it easier to solve problems
(and coding is to a great extent about solving problems). Again, I tend to agree.
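As an aside, R-base has one more way to pick columns that sits closer to dplyr’s select(): the select argument of subset() accepts unquoted column names. A minimal sketch, assuming a small toy data frame (called toy here, built from a few rows of the output above) instead of the full gapminder data:

```r
# Toy stand-in for gapminder: a few rows copied from the printed output above
toy <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Afghanistan"),
  year    = c(1952L, 1957L, 1962L),
  lifeExp = c(28.8, 30.3, 32.0),
  pop     = c(8425333L, 9240934L, 10267083L)
)

# Column selection with [ and quoted names
by_bracket <- toy[c("country", "year", "pop")]

# Column selection with subset() and unquoted names
by_subset <- subset(toy, select = c(country, year, pop))

# Both approaches should give the same data frame
all.equal(by_bracket, by_subset)
```

With the real data, toy would simply be replaced by gapminder.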
The next step is to select country, year and lifeExp, but only for the year 2007 (so the
year column is not strictly needed in the result, but we keep it anyway). In dplyr you do as follows:
gapminder %>%
select(country, year, lifeExp) %>%
filter(year == 2007)
I think you can see the logic of pipes here. You start with the data set (gapminder), which is piped into select, whose result is in turn piped into filter. In R-base I would have done the following to accomplish the same result:
subset(gapminder[c("country", "year", "lifeExp")], year == 2007)
## # A tibble: 142 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 2007 43.8
## 2 Albania 2007 76.4
## 3 Algeria 2007 72.3
## 4 Angola 2007 42.7
## 5 Argentina 2007 75.3
## 6 Australia 2007 81.2
## 7 Austria 2007 79.8
## 8 Bahrain 2007 75.6
## 9 Bangladesh 2007 64.1
## 10 Belgium 2007 79.4
## # ℹ 132 more rows
As you can see, the logic of the code goes in the other direction. We start with the
filtering function (subset). Then comes the object to be subsetted, which is gapminder
restricted to the three variables. Last comes the criterion for the subset. I think we
can all agree that in this example R-base is somewhat harder both to read and to write.
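If the inside-out reading order bothers you, one option that needs no pipe at all is to split the nested call into named steps. This is only a sketch on a toy data frame; toy and the object names cols and res are made up for illustration:

```r
# Toy stand-in for gapminder: rows copied from the printed output above
toy <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Albania", "Algeria"),
  year    = c(1952L, 2007L, 2007L, 2007L),
  lifeExp = c(28.8, 43.8, 76.4, 72.3)
)

# Nested version: read from the inside out
nested <- subset(toy[c("country", "year", "lifeExp")], year == 2007)

# Stepwise version: read from top to bottom
cols <- toy[c("country", "year", "lifeExp")]  # step 1: pick the columns
res  <- subset(cols, year == 2007)            # step 2: keep the 2007 rows

# Both versions give the same rows
identical(nested, res)
```

The price is one extra object in the workspace per step.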
In the blog post the author goes even further and filters on one country: Poland. The code looks as follows:
gapminder %>%
select(country, year, lifeExp) %>%
filter(year == 2007, country == "Poland")
## # A tibble: 1 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Poland 2007 75.6
In R-base I would have written the code as follows:
subset(gapminder[c("country", "year", "lifeExp")], year == 2007 & country == "Poland")
## # A tibble: 1 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Poland 2007 75.6
Now what about selecting two countries? In dplyr:
gapminder %>%
select(country, year, lifeExp) %>%
filter(year == 2007, country %in% c("Poland", "Croatia"))
## # A tibble: 2 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Croatia 2007 75.7
## 2 Poland 2007 75.6
In R-base:
subset(gapminder[c("country", "year", "lifeExp")], year == 2007 & (country == "Poland" | country == "Croatia"))
## # A tibble: 2 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Croatia 2007 75.7
## 2 Poland 2007 75.6
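Worth noting: %in% is itself an R-base operator, not something dplyr adds, so the R-base condition can be written the same way as in the dplyr version. A minimal sketch on a toy data frame (toy is a made-up stand-in built from rows of the output above):

```r
# Toy stand-in for gapminder: rounded values copied from the printed output above
toy <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Albania", "Croatia", "Poland"),
  year    = c(1952L, 2007L, 2007L, 2007L, 2007L),
  lifeExp = c(28.8, 43.8, 76.4, 75.7, 75.6)
)

# Condition spelled out with | as in the text
with_or <- subset(toy, year == 2007 & (country == "Poland" | country == "Croatia"))

# Same condition with %in%, which scales better to many countries
with_in <- subset(toy, year == 2007 & country %in% c("Poland", "Croatia"))

identical(with_or, with_in)
```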
Now let’s do some simple data analyses. To calculate the mean life expectancy:
gapminder %>%
summarize(avgLifeExp = mean(lifeExp))
## # A tibble: 1 × 1
## avgLifeExp
## <dbl>
## 1 59.5
In R-base I would have done:
mean(gapminder$lifeExp)
## [1] 59.47444
Well, this time R-base is easier and more readable.
To make it a bit more complicated, we want the mean life expectancy for Europe
in 2007. In dplyr:
gapminder %>%
filter(year == 2007, continent == "Europe") %>%
summarize(avgLifeExp = mean(lifeExp))
## # A tibble: 1 × 1
## avgLifeExp
## <dbl>
## 1 77.6
In R-base I would most likely do this in two steps. First create a new data set based on
the filter/subset. Then calculate the mean:
d1 <- subset(gapminder, continent == "Europe" & year == 2007)
mean(d1$lifeExp)
## [1] 77.6486
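If you would rather not create the intermediate object d1, one possible R-base alternative is with(), which evaluates an expression inside a data frame. A sketch under the assumption of a small toy data frame (toy, with a handful of rows taken from the output above, so the mean differs from the 77.6 for the full data):

```r
# Toy stand-in for gapminder: rounded values copied from the printed output above
toy <- data.frame(
  country   = c("Afghanistan", "Afghanistan", "Albania", "Austria", "Belgium"),
  continent = c("Asia", "Asia", "Europe", "Europe", "Europe"),
  year      = c(1952L, 2007L, 2007L, 2007L, 2007L),
  lifeExp   = c(28.8, 43.8, 76.4, 79.8, 79.4)
)

# Two steps, as in the text
d1 <- subset(toy, continent == "Europe" & year == 2007)
two_steps <- mean(d1$lifeExp)

# One expression: with() evaluates mean(lifeExp) inside the subsetted data
one_step <- with(subset(toy, continent == "Europe" & year == 2007), mean(lifeExp))

all.equal(two_steps, one_step)
```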
Which one is easier is now more a matter of taste and habit.
To summarize so far, the main advantage of dplyr is the pipe. But as you can also see,
dplyr uses different names for the same operations, and its functions are sometimes
slightly different and more comprehensive. You can see more about code and
commands in dplyr/tidyverse vs R-base here. To
some extent I would argue that it is annoying to use other commands than R-base when they
mainly do the same thing. However, using pipes makes sense. It is another way of thinking
when writing code, but the resulting code is certainly cleaner. A good thing, though, is
that from R version 4.1.0 there is a pipe in R-base as well (written |>). I therefore
don’t really see any major benefit of using dplyr or the tidyverse over R-base. But I
don’t think you gain anything from being a fundamentalist about this. The filter command
(shown above) is certainly better (or maybe more agile is a better word) than the subset
command. So how do pipes in R-base work? Well, the same as every other pipe. You define
an object which you then pipe through commands. You can also write a command whose result
you then pipe into other commands. Below is the code from above in three versions, the last one using pipes in R-base.
With dplyr:
gapminder %>%
select(country, year, lifeExp) %>%
filter(year == 2007)
## # A tibble: 142 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 2007 43.8
## 2 Albania 2007 76.4
## 3 Algeria 2007 72.3
## 4 Angola 2007 42.7
## 5 Argentina 2007 75.3
## 6 Australia 2007 81.2
## 7 Austria 2007 79.8
## 8 Bahrain 2007 75.6
## 9 Bangladesh 2007 64.1
## 10 Belgium 2007 79.4
## # ℹ 132 more rows
With R-base without pipes:
subset(gapminder[c("country", "year", "lifeExp")], year == 2007)
## # A tibble: 142 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 2007 43.8
## 2 Albania 2007 76.4
## 3 Algeria 2007 72.3
## 4 Angola 2007 42.7
## 5 Argentina 2007 75.3
## 6 Australia 2007 81.2
## 7 Austria 2007 79.8
## 8 Bahrain 2007 75.6
## 9 Bangladesh 2007 64.1
## 10 Belgium 2007 79.4
## # ℹ 132 more rows
With R-base using pipes:
gapminder[c("country", "year", "lifeExp")] |>
subset(year == 2007)
## # A tibble: 142 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 2007 43.8
## 2 Albania 2007 76.4
## 3 Algeria 2007 72.3
## 4 Angola 2007 42.7
## 5 Argentina 2007 75.3
## 6 Australia 2007 81.2
## 7 Austria 2007 79.8
## 8 Bahrain 2007 75.6
## 9 Bangladesh 2007 64.1
## 10 Belgium 2007 79.4
## # ℹ 132 more rows
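To make the mechanics a bit more concrete: the R-base pipe is pure syntax. The parser rewrites x |> f(y) into f(x, y) before anything is evaluated. A small sketch (requires R 4.1.0 or later; x, f and y are just placeholder names):

```r
# quote() returns the parsed expression without evaluating it,
# so x, f and y do not need to exist
piped_call <- quote(x |> f(y))
print(piped_call)

# A piped chain and the nested call compute the same thing
nested <- sum(sqrt(c(1, 4, 9)))
piped  <- c(1, 4, 9) |> sqrt() |> sum()
identical(nested, piped)
```

So the pipe changes the order in which the code is read, not what is computed.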
For a somewhat more complex example, with dplyr:
gapminder %>%
select(country, year, lifeExp) %>%
filter(year == 2007, country %in% c("Poland", "Croatia"))
## # A tibble: 2 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Croatia 2007 75.7
## 2 Poland 2007 75.6
With R-base without pipes:
subset(gapminder[c("country", "year", "lifeExp")], year == 2007 & (country == "Poland" | country == "Croatia"))
## # A tibble: 2 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Croatia 2007 75.7
## 2 Poland 2007 75.6
With R-base using pipes:
gapminder[c("country", "year", "lifeExp")] |>
subset(year == 2007) |>
subset(country == "Poland" | country == "Croatia")
## # A tibble: 2 × 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Croatia 2007 75.7
## 2 Poland 2007 75.6
However, I think the new pipe in R-base makes even more sense when conducting analyses. For example, a table of proportions without pipes looks something like:
prop.table(table(gapminder$continent)) * 100
##
## Africa Americas Asia Europe Oceania
## 36.619718 17.605634 23.239437 21.126761 1.408451
…and if you want only one decimal and the categories in a column:
cbind(round(prop.table(table(gapminder$continent)) * 100, 1))
## [,1]
## Africa 36.6
## Americas 17.6
## Asia 23.2
## Europe 21.1
## Oceania 1.4
If we use pipes it would be:
gapminder$continent |>
table() |>
prop.table() |>
(\(p) round(p * 100, 1))() |>
cbind()
## [,1]
## Africa 36.6
## Americas 17.6
## Asia 23.2
## Europe 21.1
## Oceania 1.4
… which is much nicer. To be honest though, I usually store the intermediate results in
objects (here t1 and t2) and keep the steps of the analysis separate.
t1 <- table(gapminder$continent)
t2 <- prop.table(t1) * 100
cbind(round(t2, 1))
## [,1]
## Africa 36.6
## Americas 17.6
## Asia 23.2
## Europe 21.1
## Oceania 1.4
Here we have much nicer code that is quite similar to the version using pipes. An
advantage (and a corresponding disadvantage of pipes) is that you get the results
separately. Here we get both the frequency table and the table of proportions. With pipes
we only get the table of proportions, and we have to write another block of code to get the frequency table.
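One possible middle way, sketched here on the assumption that you are happy to name just the first step: store the frequency table in an object and pipe only the rest. The continent vector below is a made-up toy example standing in for gapminder$continent, and the anonymous-function step needs R 4.1.0 or later:

```r
# Made-up toy vector; with the real data this would be gapminder$continent
continent <- c("Africa", "Africa", "Americas", "Asia",
               "Europe", "Europe", "Europe", "Oceania")

# Keep the frequency table as its own object ...
t1 <- table(continent)

# ... and pipe it on to get the percentages in one column
t2 <- t1 |>
  prop.table() |>
  (\(p) round(p * 100, 1))() |>
  cbind()

t1  # frequency table
t2  # percentages, one decimal
```

This keeps the frequency table available while the rest still reads from top to bottom.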