Tidying up and visualizing the Alien Employment Permit dataset

This is a cleanup-and-visualize exercise using R’s reshape, dplyr and ggplot2.

Our dataset will be the Alien Employment Permits by Nationality, obtained from data.gov.ph. This dataset contains the number of Alien Employment Permits (AEP) issued to foreign nationals who wanted to work in the Philippines from 1978 to 2012 dissagregated by Nationality sourced from the Bureau of Local Employment (BLE).

We’ll be using functions from the ddply and ggplot2 packages, so let’s load them first.

library(dplyr)
library(ggplot2)

Loading the data

The data is provided in CSV format, so we can try read.csv which usually does the right thing. Let’s also figure out the dimensions of the dataset and take a peek at the first few rows.

aliens_wide <- read.csv("../files/data/dolealienemppermsbynatlity1978-2012.csv")
dim(aliens_wide)

## [1] 44 36

36 columns is a pretty wide dataset. Let’s take a look at the first few rows to determine what variables the columns represent.

head(aliens_wide)

##   nationality X1978 X1979 X1980 X1981 X1982 X1983 X1984 X1985 X1986 X1987
## 1       Total   807   855   854   784   675  1600   399   378  1224  1240
## 2    American    96    79    84    79    75   124    38    54   122   110
## 3  Australian    41    58    41    17    12    36     9    15    24    35
## 4    Austrian     6     -    10    10    18    13     -     7    13     6
## 5     Belgian     -    11     -     -     -     -     -     -     -     -
## 6   Brazilian     -     -     7     -     -     -     -     -     -     -
##   X1988 X1989 X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998 X1999
## 1  1634  1516  1614  2016  2144  2451  3094  3701  4333  6055  5335  5956
## 2   169   152   140   177   180   246   310   395   422   631   518   493
## 3    37    42    39    58    79    78   110   170   155   303   267   335
## 4     -    14    17    15    17    17     -     -    10    16     8    18
## 5     -     -     -     -     -     -    40    49    14    43    25    38
## 6     -     -     -     -     -     -     -     -     2     6     5     7
##   X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009 X2010 X2011
## 1  5576  6858 10739  9168  9408 10823 12335 13592 12599 12218 14325 17142
## 2   518   552   515   498   618   562   665   605   657   644   766  1001
## 3   237   200   226   216   225   249   211   230   229   244   317   339
## 4    23    18    26    14    10    18    15    14    17    22    28    18
## 5    17     8    28    32    24    25    30    32    28    31    37    13
## 6     7     3    11    16    13     4     6    10    11    10    15    34
##   X2012
## 1 20966
## 2   946
## 3   380
## 4    40
## 5    25
## 6    56

read.csv was able to load the file into a data frame and correctly identify header rows. However from our initial peek at the data, we see a couple of potential problems.

The first is that instead of zeros, a dash (-) is used to indicate no permits were issued for a nationality that year. This causes read.csv to interpret any column with a dash as containing factors rather than numeric data, as can be seen if we use str to display the structure of the data frame:

str(aliens_wide)

## 'data.frame':    44 obs. of  36 variables:
##  $ nationality: Factor w/ 44 levels "American","Australian",..: 44 1 2 3 4 5 6 7 8 9 ...
##  $ X1978      : Factor w/ 18 levels "-","10","14",..: 16 18 10 13 1 1 17 4 15 13 ...
##  $ X1979      : Factor w/ 18 levels "-","104","11",..: 18 16 14 1 3 1 2 8 5 1 ...
##  $ X1980      : Factor w/ 16 levels "-","10","108",..: 16 15 9 2 1 13 3 12 6 1 ...
##  $ X1981      : Factor w/ 17 levels "-","10","11",..: 15 16 7 2 1 1 14 3 8 1 ...
##  $ X1982      : Factor w/ 19 levels "-","12","14",..: 13 15 2 6 1 1 16 18 19 1 ...
##  $ X1983      : Factor w/ 22 levels "-","110","124",..: 6 3 16 4 1 1 7 11 14 1 ...
##  $ X1984      : Factor w/ 15 levels "-","10","14",..: 8 7 15 1 1 1 7 11 12 1 ...
##  $ X1985      : Factor w/ 17 levels "-","1","12","14",..: 10 14 5 15 1 1 9 8 11 1 ...
##  $ X1986      : Factor w/ 16 levels "-","120","121",..: 5 4 10 7 1 1 6 11 2 1 ...
##  $ X1987      : Factor w/ 21 levels "-","10","11",..: 6 4 12 16 1 1 7 8 21 1 ...
##  $ X1988      : Factor w/ 21 levels "-","10","11",..: 7 8 15 1 1 1 12 14 13 1 ...
##  $ X1989      : Factor w/ 19 levels "-","11","116",..: 7 8 14 5 1 1 10 9 11 1 ...
##  $ X1990      : Factor w/ 19 levels "-","104","13",..: 6 4 16 7 1 1 9 8 10 1 ...
##  $ X1991      : Factor w/ 22 levels "-","103","11",..: 8 7 19 5 1 1 9 16 12 1 ...
##  $ X1992      : Factor w/ 23 levels "-","128","14",..: 9 6 22 5 1 1 7 14 13 1 ...
##  $ X1993      : Factor w/ 20 levels "-","142","17",..: 7 8 19 3 1 1 10 14 12 1 ...
##  $ X1994      : Factor w/ 20 levels "-","110","126",..: 8 9 2 1 12 1 10 14 13 1 ...
##  $ X1995      : Factor w/ 23 levels "-","128","130",..: 12 13 5 1 15 1 14 16 6 1 ...
##  $ X1996      : Factor w/ 36 levels "-","1","10","13",..: 25 24 8 3 6 15 19 29 18 1 ...
##  $ X1997      : int  6055 631 303 16 43 6 632 113 364 2 ...
##  $ X1998      : int  5335 518 267 8 25 5 608 96 357 56 ...
##  $ X1999      : int  5956 493 335 18 38 7 582 133 349 51 ...
##  $ X2000      : int  5576 518 237 23 17 7 523 100 362 18 ...
##  $ X2001      : int  6858 552 200 18 8 3 738 48 367 20 ...
##  $ X2002      : int  10739 515 226 26 28 11 584 106 1098 6 ...
##  $ X2003      : int  9168 498 216 14 32 16 320 94 928 10 ...
##  $ X2004      : int  9408 618 225 10 24 13 451 75 984 2 ...
##  $ X2005      : int  10823 562 249 18 25 4 403 70 1370 2 ...
##  $ X2006      : int  12335 665 211 15 30 6 391 100 1592 2 ...
##  $ X2007      : int  13592 605 230 14 32 10 384 95 1754 3 ...
##  $ X2008      : int  12599 657 229 17 28 11 375 108 1929 3 ...
##  $ X2009      : int  12218 644 244 22 31 10 302 122 1776 7 ...
##  $ X2010      : int  14325 766 317 28 37 15 423 144 2287 5 ...
##  $ X2011      : Factor w/ 41 levels "-","10","1001",..: 12 3 24 13 7 25 18 5 28 1 ...
##  $ X2012      : int  20966 946 380 40 25 56 419 182 4934 5 ...

Note how some of the X columns (those with dashes in the data) are interpreted as factors while others are interpreted as int.

This is trivial to fix. We can specify column classes and markers for NA values n read.csv. The first column, nationality, can be read in as factors while the remaining columns are numeric.

aliens_wide <- read.csv("../files/data/dolealienemppermsbynatlity1978-2012.csv", 
                        colClasses = c("factor", rep("numeric",35)), 
                        na.strings = "-")

The second issue is that the data comes in a wide format, with the year varying across columns and the year column names are in the form X<YYYY>. We want to remove the X. Lastly, there is a Total row, which is redundant and should be removed. We can eliminate the Total row first.

aliens_wide <- filter(aliens_wide, nationality != "Total")

And use reshape to convert it to a “long” table.

aliens <- reshape(aliens_wide, idvar="nationality", direction="long", sep="", varying = c(2:36))
names(aliens)[3] <- "permits"
head(aliens)

##                 nationality time permits
## American.1978      American 1978      96
## Australian.1978  Australian 1978      41
## Austrian.1978      Austrian 1978       6
## Belgian.1978        Belgian 1978      NA
## Brazilian.1978    Brazilian 1978      NA
## British.1978        British 1978      86

Note that we still have NA values. We can filter these out as well.

aliens <- filter(aliens, ! is.na(permits))
head(aliens)

##   nationality time permits
## 1    American 1978      96
## 2  Australian 1978      41
## 3    Austrian 1978       6
## 4     British 1978      86
## 5    Canadian 1978      18
## 6     Chinese 1978      72

We now have a nice tidy data frame and can apply visualization.

Visualization

Our initial plot will be of a simple bar graph.

Here’s a stacked bar graph with colors for each nationality, but with 43 nationalities it will be difficult to discern individual nationalities by bar color.

ggplot(aliens, aes(x=time, y=permits, fill=nationality)) + geom_bar(stat="identity") + guides(fill=FALSE)

Note the dip in permits issued during 1984-1985. This was a time of political turmoil that culminated in the 1986 EDSA revolution.

It might is easier to view them individually, which we can do by filtering the dataframe before sending it to ggplot.

aliens %>% filter(nationality == "Japanese") %>% ggplot( aes(x=time, y=permits, fill=nationality)) + geom_bar(stat="identity")

We can filter to two or more nationalities as well, but use a line graph rather than box graph.

aliens %>% filter(nationality == "Japanese" | 
                          nationality =="Korean" | 
                          nationality == "Chinese"|
                          nationality == "American") %>% 
        ggplot( aes(x=time, y=permits, color=nationality)) + geom_line()

Finally, we can take a high level view of all the nationalities (except “Other Nationalities”) separately using a facet_wrap(). Since all the graphs are going to be tiny, we can remove the axes and tickmarks.

aliens %>% filter(! nationality == "Other Nationalities") %>%
        ggplot( aes(x=time, y=permits)) + geom_line() + 
        facet_wrap(~ nationality) + 
        theme(axis.title.x=element_blank()) +
        theme(axis.title.y=element_blank()) + 
        theme(axis.text.y = element_blank()) +
        theme(axis.ticks.y=element_blank()) + 
        theme(axis.text.x = element_blank()) + 
        theme(axis.ticks.x = element_blank())