The ave command in R

Imagine you have some data on unemployment:

unemployment.rate <- c(0.01, 0.17, 0.19, NA, 0.21, 0.14, 0.02,NA, 0.26, 0.27, 0.21, 0.28, 0.23, 0.16, 0.1, NA, 0.23, 0.03, 0.11)
cntry <- c ("SE", "NO", "DK", "SE", "NO", "SE", "DK", "DK", "NO", "DK", "SE", "DK", "DK", "SE", "DK", "SE", "SE", "DK", "NO")
size <- c("Big","Medium","Big","Big","Medium","Small","Big","Medium","Medium","Big","Small","Medium","Medium","Big","Medium","Big","Big","Big","Small")
df <- data.frame(unemployment.rate, cntry, size)

The data contain some NA values. We want to replace each of them with the mean for its country (cntry). This can be done using the ave command from base R:

df$unemployment_rate <- ave(df$unemployment.rate, list(df$cntry), FUN=function(x) {x[is.na(x)] <- mean(x,na.rm=TRUE); x; })
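
To see what ave does here, it may help to run the same idea on a tiny vector. This is a minimal sketch with made-up numbers; the names x, g and impute_mean are only for illustration.

```r
# Made-up values in two groups; each group has one missing value
x <- c(1, NA, 3, 10, NA)
g <- c("a", "a", "a", "b", "b")

# Replace the NAs in a vector with the mean of its non-missing values
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# ave() applies the function within each group and returns a vector
# of the same length and order as x
filled <- ave(x, g, FUN = impute_mean)
filled
```

The NA in group a becomes 2 (the mean of 1 and 3) and the NA in group b becomes 10, while the observed values are left as they were.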

We can also base the replacement on the mean within each combination of cntry and size:

df$unemployment_rate <- ave(df$unemployment.rate, list(df$cntry, df$size), FUN=function(x) {x[is.na(x)] <- mean(x,na.rm=TRUE); x; })
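
One thing to watch with two grouping variables is that a combination may contain only NA, in which case the group mean is NaN. The sketch below, on made-up data, falls back to the country mean for such cells. The data frame d and the helper fill_mean are hypothetical names, and the two-pass approach is just one possible way to handle it.

```r
# Made-up data: the NO/Big cell contains only a missing value
d <- data.frame(
  rate  = c(0.10, NA, 0.30, NA, 0.20),
  cntry = c("SE", "SE", "SE", "NO", "NO"),
  size  = c("Big", "Big", "Small", "Big", "Small")
)

fill_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# First pass: mean within each cntry/size combination
d$rate_filled <- ave(d$rate, d$cntry, d$size, FUN = fill_mean)

# Cells with no observed values are now NaN; is.na() is TRUE for NaN,
# so a second pass by country alone fills them
d$rate_filled <- ave(d$rate_filled, d$cntry, FUN = fill_mean)
d
```

Here the missing SE/Big value is filled from the other SE/Big observation, while the NO/Big value has to borrow the overall mean for NO.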

Amazing!

Family Orientation in eight countries — a moment with R

In the last post I looked at the development of individualism in several countries, with the aim of investigating whether individualism is something recent in Sweden. I used an indicator which I am rather sceptical about: the importance of friends relative to the importance of family. The more important friends are relative to the family, the more individualistic. This is one of the indicators used to measure individualism. Maybe it is a good one, but used on its own I think the variable rather measures family orientation, which of course can be related to individualism. But it is not the same thing.

Anyway, I looked through the code I used and made it more efficient, and then I found another pattern which I found really interesting. Family orientation is rather stable in Sweden but actually decreasing in most countries. The USA is an exception. However, the USA is not alone: the other Anglo-Saxon countries show the same pattern. Another pattern is that family orientation in the eastern European countries has decreased rather fast.

First I made the code shorter and more efficient. As in the last post I will only show the code for one of the waves, since the code for the other waves is about the same. As in the former post, the data come from the World Values Survey.

I use the countrycode package to transform the numeric country codes into country names.

library(countrycode)
wd6$cntry <- countrycode (wd6$v2, origin = "wvs", destination = "un.name.en")

Individualism/family orientation is created from two variables measuring the importance of family and of friends. I first set negative values (the missing-value codes) to NA and reverse the scales, and then take the importance of friends minus the importance of family. The higher the value, the more individualistic and the weaker the family orientation.

wd6$family <- ifelse (wd6$v4 < 0, NA, wd6$v4)
wd6$friends <- ifelse (wd6$v5 < 0, NA, wd6$v5)

wd6$family <- (wd6$family - (max(wd6$family, na.rm = TRUE))) *-1
wd6$friends <- (wd6$friends - (max(wd6$friends, na.rm = TRUE))) *-1

wd6$individualism <- wd6$friends - wd6$family
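
The recoding can be checked by hand on a few made-up answers. This is only a sketch: the vectors v4 and v5 below are invented, and it assumes the usual 1 to 4 coding of these items (1 = very important, 4 = not at all important) with negative numbers as missing-value codes.

```r
# Made-up answers; negative values stand in for missing-value codes
v4 <- c(1, 1, 2, -2, 4)  # importance of family
v5 <- c(2, 1, 1, 4, -1)  # importance of friends

family  <- ifelse(v4 < 0, NA, v4)
friends <- ifelse(v5 < 0, NA, v5)

# Reverse the scales so that a higher number means more important
# (4 becomes 0, 1 becomes 3)
family  <- (family  - max(family,  na.rm = TRUE)) * -1
friends <- (friends - max(friends, na.rm = TRUE)) * -1

# Positive values: friends rated as more important than family
individualism <- friends - family
individualism
```

Note that the reversal uses the largest value observed in the data, so it only gives the intended 0 to 3 scale if at least one respondent chose the highest category on each item.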

The next step is to select the countries to use. I use the car package to create a variable from which to select a subset.

library(car)

wd6$cntry_select <- recode (wd6$cntry, '
"Australia" = 1;
"Estonia" = 1;
"Germany" = 1;
"Netherlands" = 1;
"New Zealand" = 1;
"Poland" = 1;
"Romania" = 1;
"Spain" = 1;
"Sweden" = 1;
"United States of America" = 1;
else = 0')

wd6 <- subset (wd6, cntry_select==1)

The next step is to create a data frame with the mean value of individualism/family orientation in each country. To do this I use the aggregate command. The new data frame is in need of some cleaning.

t6b <- aggregate (wd6$individualism, list(wd6$cntry), FUN = mean, na.rm=TRUE)
t6b$Country <- t6b$Group.1
t6b$Individualism <- t6b$x
t6b <- t6b[c(-1,-2)]
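
As an alternative to renaming and dropping columns afterwards, the column names can be set directly on the result of aggregate. A minimal sketch on made-up data (toy and t_toy are hypothetical names):

```r
# Made-up respondent-level data with one missing value
toy <- data.frame(
  cntry         = c("Sweden", "Sweden", "Spain", "Spain", "Spain"),
  individualism = c(0, -1, -1, NA, -2)
)

# aggregate() names its output columns Group.1 and x;
# setNames() replaces them in one step
t_toy <- setNames(
  aggregate(toy$individualism, list(toy$cntry), FUN = mean, na.rm = TRUE),
  c("Country", "Individualism")
)
t_toy
```

The result has one row per country, sorted alphabetically, with the NA ignored when the mean for Spain is computed.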

This can be plotted using ggplot2.

library(ggplot2)

p6 <- ggplot (data=t6b, aes(x= reorder (Country, -Individualism), y=Individualism)) +
    geom_bar (stat="identity", position=position_dodge()) +
    labs (x="Country", y="Individualism") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

p6

After doing this for all the waves I end up with five data frames, named t2b, t3b, t4b, t5b, and t6b. In each of these data frames I create a variable named Wave. Then I bind them together. Since the second wave does not include Sweden, it is left out of the combined data frame.

t2b$Wave <- 2
t3b$Wave <- 3
t4b$Wave <- 4
t5b$Wave <- 5
t6b$Wave <- 6

df7 <- rbind (t3b,t4b,t5b, t6b)
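
If the per-wave data frames are kept in a named list, the Wave variable can be added and the frames stacked in one go. This is a sketch with made-up numbers; the list waves and the helper add_wave are hypothetical and not part of the code above.

```r
# Made-up per-wave summaries, named by wave number
waves <- list(
  "3" = data.frame(Country = c("A", "B"), Individualism = c(-0.2, -0.5)),
  "4" = data.frame(Country = c("A", "B"), Individualism = c(-0.3, -0.4))
)

# Add the wave number as a column to one data frame
add_wave <- function(d, w) {
  d$Wave <- as.integer(w)
  d
}

# Apply add_wave to each element together with its name, then stack the rows
stacked <- do.call(rbind, Map(add_wave, waves, names(waves)))
rownames(stacked) <- NULL
stacked
```

This avoids one assignment per wave and makes it harder to mislabel a wave by accident.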

To make the visualisation easier to read I choose eight countries:

df7$country_select <- recode (df7$Country, '
c("Sweden", "United States of America", "Spain", "Poland", "Estonia", "Germany", "Australia", "New Zealand") = 1;
else=0')

df7 <- subset(df7, country_select == 1)
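
The same kind of selection can also be written without a helper variable, using %in%. A minimal sketch on a small data frame (toy and keep are hypothetical names; the four values are the wave 6 means reported in the earlier post, rounded as there):

```r
toy <- data.frame(
  Country       = c("Sweden", "Romania", "Spain", "Netherlands"),
  Individualism = c(-0.197, -0.996, -0.423, -0.375),
  Wave          = 6
)

# Countries to keep
keep <- c("Sweden", "Spain")

# %in% is TRUE for rows whose Country appears in keep
toy_sel <- subset(toy, Country %in% keep)
toy_sel
```

This does the recode and the subset in a single step and does not need the car package.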

I then use ggplot to produce the graph.

p7 <- ggplot (data=df7, aes(x=Wave, y=Individualism, group=Country)) +
    geom_line (aes(color=Country), size=1.2) +
    geom_point(aes(color=Country), size = 3.1) + 
    scale_color_brewer(palette="Dark2")

As can be seen, individualism/family orientation is rather stable in Sweden, while individualism is decreasing in the USA (as we saw in the previous post). More interesting, though, is that individualism is decreasing not only in the USA but also in Australia and New Zealand, two other Anglo-Saxon countries. Other interesting results are the increase in individualism in the eastern European countries. The increase is very strong in Poland and Estonia. But as can be seen, individualism is also increasing in Germany and Spain. Even though we have to be cautious with conclusions, it seems as if the Anglo-Saxon countries and Sweden are outliers. Everywhere else individualism is on the increase. Furthermore, maybe this is not primarily about individualism but about family orientation. Even though the two are probably related, they are not the same thing.

The development of individualism — a moment with R

I recently read a book about the education system in Sweden ("Glädjeparadoxen" [The paradox of Happiness]). The book is indeed interesting, dealing with the question of why Swedish pupils have fallen behind in the big international tests such as PISA. However, it was another thing I found interesting. It is well known that Swedish society is more individualistic than most other countries (however, I have some objections to what is usually meant by individualism in general, which I may write about in another post). But in the book the authors claim that values in Sweden were more collective than in most other countries in the beginning of the 1980's, and that it is only lately that Sweden has become more individualistic (with reference to Santos et al. (2017)).

In this post I will make a brief test with data from the World Values Survey (WVS). The measure I will use is the difference in how important friends are relative to the family. According to Santos et al. (2017) this is a well-known indicator of individualism (I am somewhat sceptical though, but maybe more about this in another post). Of course, to be able to really answer the question we would most likely need more variables. But this is just a small test.

Data

The data used in the study come from the World Values Survey (WVS), which is available from the WVS website. Sweden is not part of all the waves, and the questions we need are only included in the later data sets. I will therefore use four waves (waves 3, 4, 5, and 6). In years, these four waves cover 1995 to 2014 (wave 2, which is shown below without Sweden, covers 1990 to 1994). This is not bad at all, even though we do not go back to the early 1980's. On the other hand, how likely is it that all changes from the early 1980's until today happened during the 1980's?

The code

After reading the data into R I did the following with each of the waves (only wave 6 is shown here).

First I converted all variable names to lower case:

names (wd6) <- tolower (names(wd6))

Then I transformed the country codes into the names of the countries using the countrycode package.

library(countrycode)
wd6$cntry <- countrycode (wd6$v2, origin = "wvs", destination = "un.name.en")

The next step is to construct the variable measuring individualism, which is the importance of friends minus the importance of the family. The higher the number, the more individualistic.

wd6$family <- ifelse (wd6$v4 < 0, NA, wd6$v4)
wd6$friends <- ifelse (wd6$v5 < 0, NA, wd6$v5)

wd6$family <- (wd6$family - (max(wd6$family, na.rm = TRUE))) *-1
wd6$friends <- (wd6$friends - (max(wd6$friends, na.rm = TRUE))) *-1

wd6$individualism <- wd6$friends - wd6$family

Now I have the variables I need. But to make the results easier on the eyes I selected only some of the countries, using recode from the car package together with the subset command:

library(car)

wd6$cntry_select <- recode (wd6$cntry, '
"Australia" = 1;
"Estonia" = 1;
"Germany" = 1;
"Netherlands" = 1;
"New Zealand" = 1;
"Poland" = 1;
"Romania" = 1;
"Spain" = 1;
"Sweden" = 1;
"United States of America" = 1;
else = 0')

wd6 <- subset (wd6, cntry_select==1)

Now I have to construct a new data frame with the data to use for the graphs. To do this I first calculated the mean value of the variable individualism in each country using tapply:

t6 <- tapply (wd6$individualism, list(wd6$cntry), FUN = mean, na.rm=TRUE)
round (cbind (t6), 3)

From the result I created two variables, country and individualism, and put them in a new data frame.

country <- c (
"Australia",
"Estonia",                 
"Germany",
"Netherlands",             
"New Zealand",             
"Poland",
"Romania",
"Spain",
"Sweden",
"United States of America")

individualism <- c(
-0.397,
-0.463,
-0.308,
-0.375,
-0.403,
-0.602,
-0.996,
-0.423,
-0.197,
-0.403)

df6 <- data.frame (country, individualism)
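
Typing the numbers in by hand works, but the same data frame can also be built directly from the tapply result, which avoids copying errors. A minimal sketch on made-up data (toy, t_toy and df_toy are hypothetical names):

```r
# Made-up respondent-level data
toy <- data.frame(
  cntry         = c("Sweden", "Sweden", "Spain", "Spain"),
  individualism = c(0, -1, -1, -2)
)

# Named one-dimensional array of country means
t_toy <- tapply(toy$individualism, list(toy$cntry), FUN = mean, na.rm = TRUE)

# The names become the country column and the values the means
df_toy <- data.frame(
  country       = names(t_toy),
  individualism = round(as.vector(t_toy), 3)
)
df_toy
```

The countries come out in alphabetical order, the same order in which tapply prints them.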

The last step is to produce the graph. This is done with the ggplot2-package.

library (ggplot2)

p6 <- ggplot (data=df6, aes(x= reorder (country, -individualism), y=individualism)) +
    geom_bar (stat="identity", position=position_dodge()) +
    labs (x="Country", y="Individualism") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

The results from the waves can be seen below.

Results/Figures

Figure 1: Wave 2: 1990-94

As can be seen, Sweden was not part of the second wave (1990-1994). Somewhat surprisingly, the level of individualism was greatest in Turkey, greater than in countries such as Spain and Brazil. But to be fair, none of the countries presented in Figure 1 are known to have high levels of individualism (as far as I know at least).

Figure 2: Wave 3: 1995-1998

Figure 3: Wave 4: 1999-2002

Figure 4: Wave 5: 2005-2006

Figure 5: Wave 6: 2010-2014

From Figure 2 to Figure 5 Sweden is included. As can be seen, Sweden has the highest level of individualism among the included countries in all of these waves. For Sweden the results seem to be stable, while for example the USA seems to be falling behind. Let's look closer into this. In the graph below the development of individualism is investigated in Sweden, the USA, Germany and Spain.

To do this I created a data set with the values for the four countries, looking like this (Wave 1 to Wave 4 here are the four waves used, that is WVS waves 3 to 6):

   Country Individualism   Wave
1   Sweden        -0.207 Wave 1
2   Sweden        -0.187 Wave 2
3   Sweden        -0.219 Wave 3
4   Sweden        -0.197 Wave 4
5      USA        -0.291 Wave 1
6      USA        -0.343 Wave 2
7      USA        -0.387 Wave 3
8      USA        -0.403 Wave 4
9    Spain        -0.508 Wave 1
10   Spain        -0.467 Wave 2
11   Spain        -0.450 Wave 3
12   Spain        -0.423 Wave 4
13 Germany        -0.381 Wave 1
14 Germany            NA Wave 2
15 Germany        -0.307 Wave 3
16 Germany        -0.308 Wave 4
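
The post does not show how this data set was put together. One way to build it, as a sketch that simply types in the values from the table above, could be:

```r
# Values copied from the table above; Germany has no value for Wave 2
df7 <- data.frame(
  Country = rep(c("Sweden", "USA", "Spain", "Germany"), each = 4),
  Individualism = c(-0.207, -0.187, -0.219, -0.197,
                    -0.291, -0.343, -0.387, -0.403,
                    -0.508, -0.467, -0.450, -0.423,
                    -0.381,     NA, -0.307, -0.308),
  Wave = rep(paste("Wave", 1:4), times = 4)
)
df7
```

Here rep with each = 4 repeats every country four times in a row, while rep with times = 4 cycles through the four wave labels once per country.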

Once again using ggplot2:

p7 <- ggplot (data=df7, aes(x=Wave, y=Individualism, group=Country)) +
    geom_line (aes(color=Country), size=1.2) +
    geom_point(aes(color=Country), size = 3.1) +
    scale_color_brewer(palette="Dark2")

And the result…

Figure 6: Development of individualism in Sweden, USA, Germany and Spain

As can be seen in Figure 6, the level of individualism is rather stable in both Sweden and Spain, although at very different levels. Individualism is also very stable in Germany. In the USA, on the other hand, individualism seems to be decreasing substantially. The results here do not support the claim that individualism is a rather late thing in Sweden. However, more research needs to be done, of course!

Literature

Heller-Sahlgren, G., & Sanandaji, N. (2019). Glädjeparadoxen: Historien om skolans uppgång, fall och möjliga upprättelse. Stockholm: Dialogos förlag.

Santos, H. C., Varnum, M. E. W., & Grossmann, I. (2017). Global Increases in Individualism. Psychological Science, 28(9), 1228–1239. https://doi.org/10.1177/0956797617700622

How to get R up and running in Fedora Linux

It is trivial to install R in Fedora. Just type:

sudo dnf install R

After that you can run R in the terminal. If you want an environment to work in you can use RStudio. I use Emacs and ESS. Everything works nicely up until the point where you want to install a package, say for example car. It won't work out of the box. You need to install an additional system library first. In the terminal:

sudo dnf install libcurl-devel 

libcurl is a client-side URL transfer library. Its development files are needed to build R packages that link against it.

It is also a good idea to install:

sudo dnf install NLopt

NLopt is a library for nonlinear optimization, callable from R.

Reading rather big data into R

Reading big data into R can take some time, since R reads the data directly into RAM. If the data are big, R may even crash. Things have become better, but this is still a problem. I have 16 GB of RAM and seldom have data so big that it does not fit. But when using R with rather big data, it may not be a good idea to run other heavy programs at the same time. It is for example not recommended to run virtual machines at the same time, if you happen to do that (I have Windows as a virtual machine).

But even though the data may fit, it can still take some time to read it into R. A trick around this is to use the data.table package. Its fread function reads the data into a data.table instead of a plain data.frame, which takes a lot less time.

Install the package:

install.packages("data.table")

Read the data:

library(data.table)
wd <- fread("dataname.csv")
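
To try this out without a real data file, the sketch below writes a tiny CSV to a temporary file and reads it back. The file and the object names are made up for the example, and base R's read.csv is used as a fallback if data.table is not installed.

```r
# Write a small CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(0.1, 0.2, 0.3)), tmp, row.names = FALSE)

if (requireNamespace("data.table", quietly = TRUE)) {
  wd_dt <- data.table::fread(tmp)                      # a data.table
  wd_df <- data.table::fread(tmp, data.table = FALSE)  # a plain data.frame
} else {
  wd_df <- read.csv(tmp)                               # base R fallback
}

n_rows <- nrow(wd_df)
unlink(tmp)  # remove the temporary file
```

The data.table = FALSE argument is handy when the rest of the code expects an ordinary data.frame, since it keeps the fast reader without changing the class of the result.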

 

Problems with GCC/gfortran in R statistics

I updated my system recently and ran into a problem when loading some of the packages that depend on GCC. The error message I got when, for example, running library(psych) was:

’/home/daniel/R/x86_64-redhat-linux-gnu-library/3.4/mnormt/libs/mnormt.so’:
libgfortran.so.4: cannot open shared object file: No such file or directory

The reason for the problem is that my GCC was updated to version 8, and libgfortran to version 5. To fix this issue I went to the directory above, searched for libgfortran.so.4, and changed it to libgfortran.so.5.
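
After a system update like this it can be useful to find out which packages no longer load. The sketch below is not what I did at the time; find_broken is a hypothetical helper that simply tries to load each package and reports the ones that fail.

```r
# Return the names of the packages in pkgs that cannot be loaded
find_broken <- function(pkgs) {
  ok <- vapply(pkgs, function(p) {
    tryCatch({
      loadNamespace(p)
      TRUE
    }, error = function(e) FALSE)
  }, logical(1))
  names(ok)[!ok]
}

# "stats" ships with R; the second name is deliberately not a real package
broken <- find_broken(c("stats", "notARealPackage123"))
broken
```

Running it over rownames(installed.packages()) would check the whole library. Reinstalling the packages it reports, for example with install.packages("mnormt"), should rebuild them against the libgfortran version that is now on the system, which is probably a cleaner fix than editing files by hand.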

Reading SPSS data into R with 'haven'

I still often need to handle SPSS files, that is .sav files, when working with R. The approach I believe is most common is to use the package 'foreign' and the code data <- read.spss("filen.sav", to.data.frame = TRUE). Some time ago I more or less abandoned that package in favour of 'haven'. 'haven' gives no warning messages, which 'foreign' practically always does. This may not be terribly important. More important is that 'haven' loads data faster. I have not measured it, but that is my experience. In addition, 'haven' contains a couple of functions that I like. For those who wonder, it can read .dta files (Stata) as well.

Some code:

## Read SPSS data

library(haven)
data <- read_sav("datafilen.sav")

## To see a variable's label

attr(data$variabelnamn, "label")

## To see the categories of a categorical variable (that is, not the numbers but the categories themselves)

attributes(data$variabelnamn)

## To save data to a .sav file

write_sav(data, "datafilen.sav")
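
Variable labels and value labels are stored as ordinary R attributes named "label" and "labels", so they can be explored with base R alone. The sketch below builds a small vector by hand that mimics this layout; the variable and its labels are made up, and no SPSS file is read.

```r
# A hand-made vector carrying a variable label and value labels
x <- structure(
  c(1, 2, 2, 1),
  label  = "Sex of respondent",
  labels = c(Male = 1, Female = 2)
)

# The variable label
var_label <- attr(x, "label")

# The value labels: a named vector mapping category names to codes
value_labels <- attr(x, "labels")

# Turn the codes into a factor that shows the category names
x_factor <- factor(x, levels = value_labels, labels = names(value_labels))
x_factor
```

For data actually read with read_sav, haven's own as_factor function does this conversion directly.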

Introductory videos on R

I have made two introductory videos about R and put them on YouTube. I originally made these videos for my own teaching, but there is no reason not to publish them openly. R has the advantage of being free, which means anyone can download the program. Unfortunately, for most people it is probably not just a matter of getting started. R has a fairly steep learning curve. But it is not so steep that you cannot do relatively simple analyses with quite little knowledge. I will probably upload more videos.