Tidying up and visualizing the Alien Employment Permit dataset
Tidying up and visualizing the Alien Employment Permit dataset
This is a cleanup-and-visualize exercise using R’s reshape, dplyr and ggplot2.
Our dataset will be the Alien Employment Permits by Nationality, obtained from data.gov.ph. This dataset contains the number of Alien Employment Permits (AEP) issued to foreign nationals who wanted to work in the Philippines from 1978 to 2012 dissagregated by Nationality sourced from the Bureau of Local Employment (BLE).
We’ll be using functions from the ddply
and ggplot2
packages, so let’s load them first.
library(dplyr)
library(ggplot2)
Loading the data
The data is provided in CSV format, so we can try read.csv
which usually does the right thing. Let’s also figure out the dimensions of the dataset and take a peek at the first few rows.
aliens_wide <- read.csv("../files/data/dolealienemppermsbynatlity1978-2012.csv")
dim(aliens_wide)
## [1] 44 36
36 columns is a pretty wide dataset. Let’s take a look at the first few rows to determine what variables the columns represent.
head(aliens_wide)
## nationality X1978 X1979 X1980 X1981 X1982 X1983 X1984 X1985 X1986 X1987
## 1 Total 807 855 854 784 675 1600 399 378 1224 1240
## 2 American 96 79 84 79 75 124 38 54 122 110
## 3 Australian 41 58 41 17 12 36 9 15 24 35
## 4 Austrian 6 - 10 10 18 13 - 7 13 6
## 5 Belgian - 11 - - - - - - - -
## 6 Brazilian - - 7 - - - - - - -
## X1988 X1989 X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998 X1999
## 1 1634 1516 1614 2016 2144 2451 3094 3701 4333 6055 5335 5956
## 2 169 152 140 177 180 246 310 395 422 631 518 493
## 3 37 42 39 58 79 78 110 170 155 303 267 335
## 4 - 14 17 15 17 17 - - 10 16 8 18
## 5 - - - - - - 40 49 14 43 25 38
## 6 - - - - - - - - 2 6 5 7
## X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009 X2010 X2011
## 1 5576 6858 10739 9168 9408 10823 12335 13592 12599 12218 14325 17142
## 2 518 552 515 498 618 562 665 605 657 644 766 1001
## 3 237 200 226 216 225 249 211 230 229 244 317 339
## 4 23 18 26 14 10 18 15 14 17 22 28 18
## 5 17 8 28 32 24 25 30 32 28 31 37 13
## 6 7 3 11 16 13 4 6 10 11 10 15 34
## X2012
## 1 20966
## 2 946
## 3 380
## 4 40
## 5 25
## 6 56
read.csv
was able to load the file into a data frame and correctly identify header rows. However from our initial peek at the data, we see a couple of potential problems.
The first is that instead of zeros, a dash (-
) is used to indicate no permits were issued for a nationality that year. This causes read.csv to interpret any column with a dash as containing factors rather than numeric data, as can be seen if we use str
to display the structure of the data frame:
str(aliens_wide)
## 'data.frame': 44 obs. of 36 variables:
## $ nationality: Factor w/ 44 levels "American","Australian",..: 44 1 2 3 4 5 6 7 8 9 ...
## $ X1978 : Factor w/ 18 levels "-","10","14",..: 16 18 10 13 1 1 17 4 15 13 ...
## $ X1979 : Factor w/ 18 levels "-","104","11",..: 18 16 14 1 3 1 2 8 5 1 ...
## $ X1980 : Factor w/ 16 levels "-","10","108",..: 16 15 9 2 1 13 3 12 6 1 ...
## $ X1981 : Factor w/ 17 levels "-","10","11",..: 15 16 7 2 1 1 14 3 8 1 ...
## $ X1982 : Factor w/ 19 levels "-","12","14",..: 13 15 2 6 1 1 16 18 19 1 ...
## $ X1983 : Factor w/ 22 levels "-","110","124",..: 6 3 16 4 1 1 7 11 14 1 ...
## $ X1984 : Factor w/ 15 levels "-","10","14",..: 8 7 15 1 1 1 7 11 12 1 ...
## $ X1985 : Factor w/ 17 levels "-","1","12","14",..: 10 14 5 15 1 1 9 8 11 1 ...
## $ X1986 : Factor w/ 16 levels "-","120","121",..: 5 4 10 7 1 1 6 11 2 1 ...
## $ X1987 : Factor w/ 21 levels "-","10","11",..: 6 4 12 16 1 1 7 8 21 1 ...
## $ X1988 : Factor w/ 21 levels "-","10","11",..: 7 8 15 1 1 1 12 14 13 1 ...
## $ X1989 : Factor w/ 19 levels "-","11","116",..: 7 8 14 5 1 1 10 9 11 1 ...
## $ X1990 : Factor w/ 19 levels "-","104","13",..: 6 4 16 7 1 1 9 8 10 1 ...
## $ X1991 : Factor w/ 22 levels "-","103","11",..: 8 7 19 5 1 1 9 16 12 1 ...
## $ X1992 : Factor w/ 23 levels "-","128","14",..: 9 6 22 5 1 1 7 14 13 1 ...
## $ X1993 : Factor w/ 20 levels "-","142","17",..: 7 8 19 3 1 1 10 14 12 1 ...
## $ X1994 : Factor w/ 20 levels "-","110","126",..: 8 9 2 1 12 1 10 14 13 1 ...
## $ X1995 : Factor w/ 23 levels "-","128","130",..: 12 13 5 1 15 1 14 16 6 1 ...
## $ X1996 : Factor w/ 36 levels "-","1","10","13",..: 25 24 8 3 6 15 19 29 18 1 ...
## $ X1997 : int 6055 631 303 16 43 6 632 113 364 2 ...
## $ X1998 : int 5335 518 267 8 25 5 608 96 357 56 ...
## $ X1999 : int 5956 493 335 18 38 7 582 133 349 51 ...
## $ X2000 : int 5576 518 237 23 17 7 523 100 362 18 ...
## $ X2001 : int 6858 552 200 18 8 3 738 48 367 20 ...
## $ X2002 : int 10739 515 226 26 28 11 584 106 1098 6 ...
## $ X2003 : int 9168 498 216 14 32 16 320 94 928 10 ...
## $ X2004 : int 9408 618 225 10 24 13 451 75 984 2 ...
## $ X2005 : int 10823 562 249 18 25 4 403 70 1370 2 ...
## $ X2006 : int 12335 665 211 15 30 6 391 100 1592 2 ...
## $ X2007 : int 13592 605 230 14 32 10 384 95 1754 3 ...
## $ X2008 : int 12599 657 229 17 28 11 375 108 1929 3 ...
## $ X2009 : int 12218 644 244 22 31 10 302 122 1776 7 ...
## $ X2010 : int 14325 766 317 28 37 15 423 144 2287 5 ...
## $ X2011 : Factor w/ 41 levels "-","10","1001",..: 12 3 24 13 7 25 18 5 28 1 ...
## $ X2012 : int 20966 946 380 40 25 56 419 182 4934 5 ...
Note how some of the X
This is trivial to fix. We can specify column classes and markers for NA values n read.csv
. The first column, nationality, can be read in as factors while the remaining columns are numeric.
aliens_wide <- read.csv("../files/data/dolealienemppermsbynatlity1978-2012.csv",
colClasses = c("factor", rep("numeric",35)),
na.strings = "-")
The second issue is that the data comes in a wide format, with the year varying across columns and the year column names are in the form X<YYYY>
. We want to remove the X. Lastly, there is a Total
row, which is redundant and should be removed. We can eliminate the Total
row first.
aliens_wide <- filter(aliens_wide, nationality != "Total")
And use reshape to convert it to a “long” table.
aliens <- reshape(aliens_wide, idvar="nationality", direction="long", sep="", varying = c(2:36))
names(aliens)[3] <- "permits"
head(aliens)
## nationality time permits
## American.1978 American 1978 96
## Australian.1978 Australian 1978 41
## Austrian.1978 Austrian 1978 6
## Belgian.1978 Belgian 1978 NA
## Brazilian.1978 Brazilian 1978 NA
## British.1978 British 1978 86
Note that we still have NA values. We can filter these out as well.
aliens <- filter(aliens, ! is.na(permits))
head(aliens)
## nationality time permits
## 1 American 1978 96
## 2 Australian 1978 41
## 3 Austrian 1978 6
## 4 British 1978 86
## 5 Canadian 1978 18
## 6 Chinese 1978 72
We now have a nice tidy data frame and can apply visualization.
Visualization
Our initial plot will be of a simple bar graph.
Here’s a stacked bar graph with colors for each nationality, but with 43 nationalities it will be difficult to discern individual nationalities by bar color.
ggplot(aliens, aes(x=time, y=permits, fill=nationality)) + geom_bar(stat="identity") + guides(fill=FALSE)
Note the dip in permits issued during 1984-1985. This was a time of political turmoil that culminated in the 1986 EDSA revolution.
It might is easier to view them individually, which we can do by filtering the dataframe before sending it to ggplot
.
aliens %>% filter(nationality == "Japanese") %>% ggplot( aes(x=time, y=permits, fill=nationality)) + geom_bar(stat="identity")
We can filter to two or more nationalities as well, but use a line graph rather than box graph.
aliens %>% filter(nationality == "Japanese" |
nationality =="Korean" |
nationality == "Chinese"|
nationality == "American") %>%
ggplot( aes(x=time, y=permits, color=nationality)) + geom_line()
Finally, we can take a high level view of all the nationalities (except “Other Nationalities”) separately using a facet_wrap()
. Since all the graphs are going to be tiny, we can remove the axes and tickmarks.
aliens %>% filter(! nationality == "Other Nationalities") %>%
ggplot( aes(x=time, y=permits)) + geom_line() +
facet_wrap(~ nationality) +
theme(axis.title.x=element_blank()) +
theme(axis.title.y=element_blank()) +
theme(axis.text.y = element_blank()) +
theme(axis.ticks.y=element_blank()) +
theme(axis.text.x = element_blank()) +
theme(axis.ticks.x = element_blank())