A whirlwind tour of working with data in R

# A whirlwind tour of working with data in R
### Paul Campbell<br>Culture of Insight
### <br>October 2018

---

# Why R?

---

# R

- R is a programming language built for data analysis and visualisation

- It's 'open-source' which means it is completely free to use

- Developed and maintained by an active not-for-profit team of software engineers and developers

- This means it is always growing and improving with over 11,000 R 'packages' freely available to use

- Highly flexible - adapt to suit your own needs without the constraints of proprietary software

- Community - welcoming community of R users online (#rstats) always happy to help and make it fun for you to learn

]

# R vs Excel

But I'm already an Excel wizard, why do I need R?

- As the importance and abundance of data increases, so does the complexity of the data tasks we're asked to fulfill

- Doing complex tasks with large data volumes in Excel is...Difficult!

- Learning to programme these tasks with R will:
  * reduce human error
  * allow you to work with larger datasets
  * make your analysis more transparent and reproducible
  * automate the boring repetitive parts
  * save you **so much time**

]

.pull-right[
![r v excel](http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb097c37a7970d-pi)
Source: [Gordon Shotwell](https://blog.shotwell.ca/2017/02/02/r-for-excel-users/)
]

???

Excel has long been the industry-standard tool for working with data but it's limitations are becoming more and more apparent as data volumes increase as well as the demand for accuracy and reproducibility in data analysis

---

# R vs Python

But isn't python the best for this type of stuff?

- Python is also a great language for data analysis, but it's scope extends way beyond this and is generally better suited to people with computer science backgrounds

- If you're interested in learning to code with the end goal of harnessing AI and deploying machine learning models in large-scale production environments - choose python!

- If your primary objective is to get better at manipulating and visualisation data, and communicating the results in reports and interactive dashboards - R is your new best friend (examples to come)

]

---

# The R Workflow

![r4ds](http://r4ds.had.co.nz/diagrams/data-science.png)

Source: [R for Data Science](http://r4ds.had.co.nz)

---

# Importing

---

# Importing Data

R can import data from just about every form imaginable:

- Flat files like plain text and delimited data: .txt, .csv, .tsv

- Data files from proprietary software like MS Excel, SPSS, SAS

- Directly from Databases: MySQL, MS SQL Server, PostgreSQL

- Directly from the Web: APIs, JSON, Webscrape HTML

]

---

# Importing Data - CSV

```r
library(tidyverse) # load full suite of packages for data analysis

london_crime <- read_csv("data/london_crime.csv") # import london violent crime data

london_crime # preview the data
```

```
## # A tibble: 32 x 19
##    Code    Borough       `1999-00` `2000-01` `2001-02` `2002-03` `2003-04`
##    <chr>   <chr>             <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 E09000~ Barking and ~      19.8      22.0      23.8      26.4      29.5
##  2 E09000~ Barnet             13.5      14.5      15.2      16.4      18.1
##  3 E09000~ Bexley             14.5      14.7      15.3      16.2      18.4
##  4 E09000~ Brent              20.8      21.8      21.1      23.3      27.5
##  5 E09000~ Bromley            10.9      12.9      13.0      15.9      16.8
##  6 E09000~ Camden             28.1      27.2      27.0      28.5      31.4
##  7 E09000~ Croydon            17.7      18.9      18.5      22.8      23.6
##  8 E09000~ Ealing             22.0      21.1      22.3      23.1      23.2
##  9 E09000~ Enfield            14.8      17.4      18.3      19.9      18.2
## 10 E09000~ Greenwich          29.4      29.4      28.8      31.1      30.6
## # ... with 22 more rows, and 12 more variables: `2004-05` <dbl>,
## #   `2005-06` <dbl>, `2006-07` <dbl>, `2007-08` <dbl>, `2008-09` <dbl>,
## #   `2009-10` <dbl>, `2010-11` <dbl>, `2011-12` <dbl>, `2012-13` <dbl>,
## #   `2013-14` <dbl>, `2014-15` <dbl>, `2015-16` <dbl>
```

---

# Importing Data - Multiple Excel Sheets to 1 Data Frame

We can quickly combine all this data into one data table with a small amount of R code...
]
---

# Importing Data - Multiple Excel Sheets to 1 Data Frame

```r
library(readxl) # for reading excel files

path <- "data/gapminder_messy.xlsx"

combined_data <- excel_sheets(path) %>% 
  map_df(~ {
    read_excel(path, sheet = .x, 
               skip = 4, trim_ws = TRUE) %>% 
      mutate(year = as.numeric(.x))
  }) %>% 
  select(country, year, everything())

# show top and bottom 6 rows of data frame
head(combined_data)
tail(combined_data)
```
]

```
## # A tibble: 6 x 5
##   country      year gdpPercap lifeExp      pop
##   <chr>       <dbl>     <dbl>   <dbl>    <dbl>
## 1 Afghanistan  2007       975    43.8 31889923
## 2 Albania      2007      5937    76.4  3600523
## 3 Algeria      2007      6223    72.3 33333216
## 4 Angola       2007      4797    42.7 12420476
## 5 Argentina    2007     12779    75.3 40301927
## 6 Australia    2007     34435    81.2 20434176
```

```
## # A tibble: 6 x 5
##   country             year gdpPercap lifeExp      pop
##   <chr>              <dbl>     <dbl>   <dbl>    <dbl>
## 1 Venezuela           1952      7690    55.1  5439568
## 2 Vietnam             1952       605    40.4 26246839
## 3 West Bank and Gaza  1952      1516    43.2  1030585
## 4 Yemen, Rep.         1952       782    32.5  4963829
## 5 Zambia              1952      1147    42.0  2672000
## 6 Zimbabwe            1952       407    48.5  3080907
```
]

---

# Importing Data - SPSS

You don't have to pay expensive SPSS licenses to analyse your survey data!

```r
library(haven) # for reading SPSS, SAS, STATA files

survey_data <- read_sav("data/sample_spss_data.sav")

head(survey_data)
```

```
## # A tibble: 6 x 682
##   SubsID  responseid respid dCOUNTRY fQ1bGender foQ1aAge_1 dAge   Q2Kids_1
##   <chr>        <dbl>  <dbl> <dbl+lb> <dbl+lbl>       <dbl> <dbl+>    <dbl>
## 1 218884~        345    347 1        1                68.0 3          2.00
## 2 235864~        185    187 1        1                51.0 2          2.00
## 3 237393~        108    110 1        2                57.0 3          2.00
## 4 250156~        524    526 1        2                53.0 2          2.00
## 5 254053~        349    351 1        2                25.0 1          2.00
## 6 200042~        823    825 1        2                54.0 2          1.00
## # ... with 674 more variables: Q2Kids_2 <dbl>, Q2Kids_3 <dbl>,
## #   dKIDS <dbl+lbl>, Q3 <dbl+lbl>, dNET <dbl+lbl>, fQ2UKRegion <dbl+lbl>,
## #   dRegionUK <dbl+lbl>, fQ2FRRegion <dbl+lbl>, dRegionFR <dbl+lbl>,
## #   fQ2DERegion <dbl+lbl>, dRegionDE <dbl+lbl>, Q2TURRegion <dbl+lbl>,
## #   Q2IncomeUK <dbl+lbl>, dIncomeUK <dbl+lbl>, Q2IncomeFR <dbl+lbl>,
## #   dIncomeFR <dbl+lbl>, Q2IncomeDE <dbl+lbl>, dIncomeDE <dbl+lbl>,
## #   Q2IncomeTUR <dbl+lbl>, dIncomeTUR <dbl+lbl>, Q4_1 <dbl+lbl>,
## #   Q4_2 <dbl+lbl>, Q4_3 <dbl+lbl>, Q4_4 <dbl+lbl>, Q4_5 <dbl+lbl>,
## #   Q4_6 <dbl+lbl>, Q4_99 <dbl+lbl>, Q6_1 <dbl+lbl>, Q6_2 <dbl+lbl>,
## #   Q6_3 <dbl+lbl>, Q6_4 <dbl+lbl>, Q6_5 <dbl+lbl>, Q6_98 <dbl+lbl>,
## #   Q6_99 <dbl+lbl>, dBiscLMAll_1 <dbl+lbl>, dBiscLMAll_2 <dbl+lbl>,
## #   dBiscLMAll_3 <dbl+lbl>, dBiscLMAll_4 <dbl+lbl>,
## #   dBiscLMAll_5 <dbl+lbl>, dBiscLMAll_6 <dbl+lbl>, dTypLMAll_1 <dbl+lbl>,
## #   dTypLMAll_2 <dbl+lbl>, dTypLMAll_3 <dbl+lbl>, dTypLMAll_4 <dbl+lbl>,
## #   dNumber <dbl+lbl>, fQ13_1 <dbl+lbl>, fQ13_2 <dbl+lbl>,
## #   fQ13_3 <dbl+lbl>, fQ13_4 <dbl+lbl>, fQ13_5 <dbl+lbl>,
## #   fQ13_6 <dbl+lbl>, fQ13_7 <dbl+lbl>, fQ13_8 <dbl+lbl>,
## #   fQ13_9 <dbl+lbl>, fQ13_10 <dbl+lbl>, fQ13_11 <dbl+lbl>,
## #   fQ13_12 <dbl+lbl>, fQ13_13 <dbl+lbl>, fQ13_14 <dbl+lbl>,
## #   fQ13_15 <dbl+lbl>, fQ13_16 <dbl+lbl>, fQ13_17 <dbl+lbl>,
## #   fQ13_18 <dbl+lbl>, fQ13_19 <dbl+lbl>, fQ13_20 <dbl+lbl>,
## #   fQ13_21 <dbl+lbl>, fQ13_22 <dbl+lbl>, fQ13_23 <dbl+lbl>,
## #   fQ13_24 <dbl+lbl>, fQ13_25 <dbl+lbl>, fQ13_26 <dbl+lbl>,
## #   fQ13_27 <dbl+lbl>, fQ13_28 <dbl+lbl>, fQ13_29 <dbl+lbl>,
## #   fQ13_30 <dbl+lbl>, fQ13_31 <dbl+lbl>, fQ13_32 <dbl+lbl>,
## #   fQ13_33 <dbl+lbl>, fQ13_34 <dbl+lbl>, fQ13_35 <dbl+lbl>,
## #   fQ13_36 <dbl+lbl>, fQ13_37 <dbl+lbl>, fQ13_38 <dbl+lbl>,
## #   Q25A_1 <dbl+lbl>, Q25A_2 <dbl+lbl>, Q25A_3 <dbl+lbl>,
## #   Q25A_4 <dbl+lbl>, Q25A_5 <dbl+lbl>, Q25A_6 <dbl+lbl>,
## #   Q25A_7 <dbl+lbl>, Q25A_8 <dbl+lbl>, Q25A_9 <dbl+lbl>,
## #   Q25A_10 <dbl+lbl>, Q25A_11 <dbl+lbl>, Q25A_12 <dbl+lbl>,
## #   Q25A_13 <dbl+lbl>, Q25A_14 <dbl+lbl>, Q25A_15 <dbl+lbl>,
## #   Q25A_16 <dbl+lbl>, Q25A_95 <dbl+lbl>, ...
```

---

# Importing Data - SPSS

```r
# convert spss label codes into labels
labelled_survey_data <- as_factor(survey_data)

head(labelled_survey_data)
```

```
## # A tibble: 6 x 682
##   SubsID   responseid respid dCOUNTRY fQ1bGender foQ1aAge_1 dAge  Q2Kids_1
##   <chr>         <dbl>  <dbl> <fct>    <fct>           <dbl> <fct>    <dbl>
## 1 21888465        345    347 UK       Male             68.0 55+       2.00
## 2 23586407        185    187 UK       Male             51.0 35-54     2.00
## 3 23739306        108    110 UK       Female           57.0 55+       2.00
## 4 25015655        524    526 UK       Female           53.0 35-54     2.00
## 5 25405304        349    351 UK       Female           25.0 16-34     2.00
## 6 20004255        823    825 UK       Female           54.0 35-54     1.00
## # ... with 674 more variables: Q2Kids_2 <dbl>, Q2Kids_3 <dbl>,
## #   dKIDS <fct>, Q3 <fct>, dNET <fct>, fQ2UKRegion <fct>, dRegionUK <fct>,
## #   fQ2FRRegion <fct>, dRegionFR <fct>, fQ2DERegion <fct>,
## #   dRegionDE <fct>, Q2TURRegion <fct>, Q2IncomeUK <fct>, dIncomeUK <fct>,
## #   Q2IncomeFR <fct>, dIncomeFR <fct>, Q2IncomeDE <fct>, dIncomeDE <fct>,
## #   Q2IncomeTUR <fct>, dIncomeTUR <fct>, Q4_1 <fct>, Q4_2 <fct>,
## #   Q4_3 <fct>, Q4_4 <fct>, Q4_5 <fct>, Q4_6 <fct>, Q4_99 <fct>,
## #   Q6_1 <fct>, Q6_2 <fct>, Q6_3 <fct>, Q6_4 <fct>, Q6_5 <fct>,
## #   Q6_98 <fct>, Q6_99 <fct>, dBiscLMAll_1 <fct>, dBiscLMAll_2 <fct>,
## #   dBiscLMAll_3 <fct>, dBiscLMAll_4 <fct>, dBiscLMAll_5 <fct>,
## #   dBiscLMAll_6 <fct>, dTypLMAll_1 <fct>, dTypLMAll_2 <fct>,
## #   dTypLMAll_3 <fct>, dTypLMAll_4 <fct>, dNumber <fct>, fQ13_1 <fct>,
## #   fQ13_2 <fct>, fQ13_3 <fct>, fQ13_4 <fct>, fQ13_5 <fct>, fQ13_6 <fct>,
## #   fQ13_7 <fct>, fQ13_8 <fct>, fQ13_9 <fct>, fQ13_10 <fct>,
## #   fQ13_11 <fct>, fQ13_12 <fct>, fQ13_13 <fct>, fQ13_14 <fct>,
## #   fQ13_15 <fct>, fQ13_16 <fct>, fQ13_17 <fct>, fQ13_18 <fct>,
## #   fQ13_19 <fct>, fQ13_20 <fct>, fQ13_21 <fct>, fQ13_22 <fct>,
## #   fQ13_23 <fct>, fQ13_24 <fct>, fQ13_25 <fct>, fQ13_26 <fct>,
## #   fQ13_27 <fct>, fQ13_28 <fct>, fQ13_29 <fct>, fQ13_30 <fct>,
## #   fQ13_31 <fct>, fQ13_32 <fct>, fQ13_33 <fct>, fQ13_34 <fct>,
## #   fQ13_35 <fct>, fQ13_36 <fct>, fQ13_37 <fct>, fQ13_38 <fct>,
## #   Q25A_1 <fct>, Q25A_2 <fct>, Q25A_3 <fct>, Q25A_4 <fct>, Q25A_5 <fct>,
## #   Q25A_6 <fct>, Q25A_7 <fct>, Q25A_8 <fct>, Q25A_9 <fct>, Q25A_10 <fct>,
## #   Q25A_11 <fct>, Q25A_12 <fct>, Q25A_13 <fct>, Q25A_14 <fct>,
## #   Q25A_15 <fct>, Q25A_16 <fct>, Q25A_95 <fct>, ...
```

---

# Importing Data - SPSS

Quickly get some summary stats about your respondents...

```r
labelled_survey_data %>% 
  count(fQ1bGender, dAge) %>% 
  spread(fQ1bGender, n) %>% 
  knitr::kable(format = "html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> dAge </th>
   <th style="text-align:right;"> Male </th>
   <th style="text-align:right;"> Female </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 16-34 </td>
   <td style="text-align:right;"> 510 </td>
   <td style="text-align:right;"> 596 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 35-54 </td>
   <td style="text-align:right;"> 664 </td>
   <td style="text-align:right;"> 612 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 55+ </td>
   <td style="text-align:right;"> 362 </td>
   <td style="text-align:right;"> 480 </td>
  </tr>
</tbody>
</table>

---

# Importing Data - Databases

```r
library(DBI) # for connecting to databases

con <- dbConnect(
  drv = RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)

dbListTables(con)
```

```
## [1] "City"            "Country"         "CountryLanguage"
```

---

# Importing Data - Databases

```r
cities <- tbl(con, "City")

cities
```

```
## # Source:   table<City> [?? x 5]
## # Database: mysql 10.0.17-MariaDB
## #   [guest@shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com:/shinydemo]
##       ID Name           CountryCode District      Population
##    <dbl> <chr>          <chr>       <chr>              <dbl>
##  1  1.00 Kabul          AFG         Kabol            1780000
##  2  2.00 Qandahar       AFG         Qandahar          237500
##  3  3.00 Herat          AFG         Herat             186800
##  4  4.00 Mazar-e-Sharif AFG         Balkh             127800
##  5  5.00 Amsterdam      NLD         Noord-Holland     731200
##  6  6.00 Rotterdam      NLD         Zuid-Holland      593321
##  7  7.00 Haag           NLD         Zuid-Holland      440900
##  8  8.00 Utrecht        NLD         Utrecht           234323
##  9  9.00 Eindhoven      NLD         Noord-Brabant     201843
## 10 10.0  Tilburg        NLD         Noord-Brabant     193238
## # ... with more rows
```

---

# Web APIs

Many R packages make it easy for you to pull publicly available data via an API (application programming interface).

The code below retrieves data from the UK police database within 1 mile of the coordinates you provide for the latest available month.

```r
library(ukpolice) # R wrapper to the UK Police API

crime_data <- ukp_crime(lat = 51.533802, lng = -0.092715)

head(crime_data)
```

```
## # A tibble: 6 x 12
##   category persistent_id date    lat    long street_id street_name context
##   <chr>    <chr>         <chr> <dbl>   <dbl> <chr>     <chr>       <chr>  
## 1 anti-so~ ""            2018~  51.5 -0.114  960854    On or near~ ""     
## 2 anti-so~ ""            2018~  51.5 -0.0826 961467    On or near~ ""     
## 3 anti-so~ ""            2018~  51.5 -0.0826 961467    On or near~ ""     
## 4 anti-so~ ""            2018~  51.5 -0.0820 968664    On or near~ ""     
## 5 anti-so~ ""            2018~  51.5 -0.101  961239    On or near~ ""     
## 6 anti-so~ ""            2018~  51.5 -0.0938 964580    On or near~ ""     
## # ... with 4 more variables: id <chr>, location_type <chr>,
## #   location_subtype <chr>, outcome_status <chr>
```

---

# Manipulation

---

# Transforming Data

https://github.com/gadenbuie/tidy-animated-verbs

---

# Transforming Data

```r
# data set with details of every tree in paris (1 row = 1 tree)
trees_in_paris <- read_csv2("data/les-arbres.csv")

colnames(trees_in_paris) # how's your french?
```

```
##  [1] "IDBASE"             "TYPEEMPLACEMENT"    "DOMANIALITE"       
##  [4] "ARRONDISSEMENT"     "COMPLEMENTADRESSE"  "NUMERO"            
##  [7] "LIEU / ADRESSE"     "IDEMPLACEMENT"      "CIRCONFERENCEENCM" 
## [10] "HAUTEUR (m)"        "STADEDEVELOPPEMENT" "PEPINIERE"         
## [13] "ESPECE"             "VARIETEOUCULTIVAR"  "GENRE"             
## [16] "DATEPLANTATION"     "REMARQUABLE"        "LIBELLEFRANCAIS"   
## [19] "OBJECTID"           "geo_point_2d"
```

Can we answer the question...

> How many trees does each Arrondissement have, and what century were they planted in?

We only need 2 columns from the raw data to answer this...

---

# Transforming Data

```r
transformed_trees <- trees_in_paris %>% 
  select(ARRONDISSEMENT, DATEPLANTATION) %>%                # select only Arr. and Date Planted cols
  mutate(year = lubridate::year(DATEPLANTATION),            # pull year number from full date
         century_planted = case_when(                       #
           year < 1700 ~ "17th",                            #
           year < 1800 ~ "18th",                            #
           year < 1900 ~ "19th",                            # code a century planted column
           year < 2000 ~ "20th",                            # based on year number
           year < 2018 ~ "21st",                            #
           is.na(year) ~ "Unknown"                          #
           )                                                #
         ) %>%                                              #
  count(ARRONDISSEMENT, century_planted) %>%                # count the number of trees by Arr. & Century
  group_by(ARRONDISSEMENT) %>%                              # group the data by Arr.
  add_tally(n) %>%                                          # add column with total trees for each Arr.
  ungroup() %>%                                             # ungroup the data
  spread(century_planted, n) %>%                            # spread centuries from rows to columns
  select(1, 3:8, Total = nn) %>%                            # reorder and rename columns
  arrange(desc(Total))                                      # sort data by total column
```

---

# Results...

The 16th Arrondissement has the most trees in Paris! Most of which were planted in the 18th Century.

```r
transformed_trees
```

```
## # A tibble: 25 x 8
##    ARRONDISSEMENT    `17th` `18th` `19th` `20th` `21st` Unknown Total
##    <chr>              <int>  <int>  <int>  <int>  <int>   <int> <int>
##  1 PARIS 16E ARRDT       NA  13970     21   2154   2059    1301 19505
##  2 PARIS 15E ARRDT       NA  11910     25   3869   2198     974 18976
##  3 PARIS 13E ARRDT       NA  10528      1   2998   3972    1331 18830
##  4 PARIS 20E ARRDT       NA  12309     NA   1646   2316    1004 17275
##  5 PARIS 19E ARRDT       NA  10421     NA   1452   2353     990 15216
##  6 PARIS 14E ARRDT       NA  10752      2   1280   1858     560 14452
##  7 PARIS 12E ARRDT       NA   8602     17   2648   2231     672 14170
##  8 PARIS 17E ARRDT       NA   6414      3   1984   2557    1157 12115
##  9 SEINE-SAINT-DENIS     NA  11230     NA     NA    285     254 11769
## 10 PARIS 18E ARRDT       NA   6116     NA   1941   2556     862 11475
## # ... with 15 more rows
```

---

# Joining Data

```r
co2 <- read.table("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_annmean_mlo.txt") %>% 
  as_tibble() %>% 
  select(Year = V1, CO2 = V2)

temp <- read.table("https://go.nasa.gov/2r8ryH1") %>% 
  as_tibble() %>% 
  select(Year = V1, `Global Land-Ocean Temperature Index` = V2)
```

```
## # A tibble: 59 x 2
##     Year   CO2
##    <int> <dbl>
##  1  1959   316
##  2  1960   317
##  3  1961   318
##  4  1962   318
##  5  1963   319
##  6  1964   320
##  7  1965   320
##  8  1966   321
##  9  1967   322
## 10  1968   323
## # ... with 49 more rows
```
]

```
## # A tibble: 138 x 2
##     Year `Global Land-Ocean Temperature Index`
##    <int>                                 <dbl>
##  1  1880                                -0.190
##  2  1881                                -0.100
##  3  1882                                -0.100
##  4  1883                                -0.190
##  5  1884                                -0.280
##  6  1885                                -0.310
##  7  1886                                -0.320
##  8  1887                                -0.350
##  9  1888                                -0.180
## 10  1889                                -0.110
## # ... with 128 more rows
```
]

---

# Joining Data

To look for a relationship between each dataset we want to join them but only keep years for which there is both CO2 and Global Temperature data.

```r
joined_data <- inner_join(co2, temp, by = "Year")

joined_data
```

```
## # A tibble: 59 x 3
##     Year   CO2 `Global Land-Ocean Temperature Index`
##    <int> <dbl>                                 <dbl>
##  1  1959   316                                0.0300
##  2  1960   317                               -0.0200
##  3  1961   318                                0.0600
##  4  1962   318                                0.0400
##  5  1963   319                                0.0700
##  6  1964   320                               -0.200 
##  7  1965   320                               -0.100 
##  8  1966   321                               -0.0500
##  9  1967   322                               -0.0200
## 10  1968   323                               -0.0700
## # ... with 49 more rows
```
]

---

# Plot the Relationship...

```r
ggplot(joined_data, aes(x = CO2, y = `Global Land-Ocean Temperature Index`)) +
  geom_point() +
  geom_smooth() +
  labs(caption = "Source: NOAA, NASA")
```

---

# Visualisation

---

# ggplot2

.pull-left[
- Combine your data wrangling and chart making into one single process
  * no more pasting data from excel into powerpoint to make or change a chart! 🙌

- Makes it easy to quickly iterate over different ways of mapping your data to shapes, space and colour

- Programming a chart with code forces you to think about what you are doing and why

- Allows you to create completley unique charts by adding several different 'layers' of data

- Not constrained by the chart types available in other software packages

- Produce publication quality graphics just like...
]

.pull-right[
<img src="https://d33wubrfki0l68.cloudfront.net/0ab849ed51b0b866ef6895c253d3899f4926d397/dbf0f/images/hex-ggplot2.png" alt="ggplot2 hex" width="60%" />
]

---

# Financial Times

.pull-left[
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Fun fact for the <a href="https://twitter.com/hashtag/dataviz?src=hash&amp;ref_src=twsrc%5Etfw">#dataviz</a> crowd:<br><br>This chart (and the ones in the story) are the first we&#39;ve done 100% in ggplot, right down to the custom <a href="https://twitter.com/FT?ref_src=twsrc%5Etfw">@FT</a> font and the white bar in the top left. <a href="https://t.co/BVFmoYX2WL">https://t.co/BVFmoYX2WL</a></p>&mdash; John Burn-Murdoch (@jburnmurdoch) <a href="https://twitter.com/jburnmurdoch/status/1006783615022391297?ref_src=twsrc%5Etfw">June 13, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
]

---

# BBC

<blockquote class="twitter-tweet" data-conversation="none" data-cards="hidden" data-lang="en"><p lang="en" dir="ltr">At the beeb we’re now doing a lot of charts solely in ggplot. The charts here for example <a href="https://t.co/bokQHQK6pj">https://t.co/bokQHQK6pj</a> .  We’re doing maps too.</p>&mdash; Wesley Stephenson (@WesStephenson) <a href="https://twitter.com/WesStephenson/status/1006797669648470016?ref_src=twsrc%5Etfw">June 13, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

---

# ggplot2 - Example

```r
london_crime <- london_crime %>% # use london crime data previously imported
  gather(year, rate, 3:ncol(.))  # gather the year columns to key + value columns (wide to long data)

london_crime
```

```
## # A tibble: 544 x 4
##    Code      Borough              year     rate
##    <chr>     <chr>                <chr>   <dbl>
##  1 E09000002 Barking and Dagenham 1999-00  19.8
##  2 E09000003 Barnet               1999-00  13.5
##  3 E09000004 Bexley               1999-00  14.5
##  4 E09000005 Brent                1999-00  20.8
##  5 E09000006 Bromley              1999-00  10.9
##  6 E09000007 Camden               1999-00  28.1
##  7 E09000008 Croydon              1999-00  17.7
##  8 E09000009 Ealing               1999-00  22.0
##  9 E09000010 Enfield              1999-00  14.8
## 10 E09000011 Greenwich            1999-00  29.4
## # ... with 534 more rows
```
---

# ggplot2 - Example

```r
plot_years <- unique(london_crime$year)
p <- ggplot(london_crime, aes(year, rate, group = Borough)) +
  geom_line(colour = "SteelBlue") +
  scale_x_discrete(breaks = plot_years[seq(1, length(plot_years), by = 2)]) +
  labs(x = "Year", y = "Rate per Thousand Population", title = "Violent Crime Trends by London Boroughs",
       caption = "Source: data.gov.uk")
p
```

---

# Facets

```r
p + 
  facet_wrap(~ Borough) + 
  scale_x_discrete(breaks = NULL) +
  scale_y_continuous(breaks = c(10, 30, 50)) +
  labs(x = NULL, subtitle = "1999-00 - 2015-16") +
  theme(strip.text.x = element_text(size = 6))
```
]

.right-column[
<img src="index_files/figure-html/london_facet2_out-1.svg" style="display: block; margin: auto;" />
]

---

# Geo Facets

Switch `facet_wrap` to `facet_geo` with the `geofacet` package to arrange facets in a geographically ordered grid.

```r
library(geofacet) # for geographically arranged facets
p +
  facet_geo(~ Borough, grid = "london_boroughs_grid") +
  scale_x_discrete(breaks = NULL) +
  scale_y_continuous(breaks = c(10, 30, 50)) +
  labs(x = NULL, subtitle = "1999-00 - 2015-16") +
  theme(strip.text.x = element_text(size = 6))
```
]

.right-column[
<img src="index_files/figure-html/london_facet3_out-1.svg" style="display: block; margin: auto;" />
]

---

# Joy Plots

```r
library(ggridges) # for creating ridgeline plots of 'joy plots' a la joy division unknown pleasures

ggplot(london_crime, aes(x = rate, y = factor(year) %>% fct_rev(), fill = ..x..)) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01, colour = "white") +
  viridis::scale_fill_viridis(name = "Rate", option = "C") +
  labs(y = NULL, title = 'Distribution of Violent Crime Rate in London Boroughs',
       subtitle = "Rate per Thousand Population", caption = "Source: London Data Store") +
  theme_ft_rc()
```

---

# Joy Plots

---

# Maps

```r
library(sf) # for spatial analysis in a 'tidy' framework

# join london borough shapefile to data and filter to latest year
ldn_shape <- st_read("data/ldnb/London_Borough_Excluding_MHW.shp", quiet = TRUE) %>%
  st_transform(4326) %>% 
  inner_join(london_crime, by = c("GSS_CODE" = "Code")) %>% 
  filter(year == "2015-16")

ldn_map <- ggplot(ldn_shape) +
  geom_sf(aes(fill = rate), colour = "#FAFAFA") +
  coord_sf(datum = NA) +
  scale_fill_viridis_c(name = "Rate per Thousand Population", 
                       guide = guide_colourbar(title.position = "top",
                                               barwidth = 13, 
                                               barheight = 0.5)) +
  theme(legend.position = c(0.8, 1.05), legend.direction = "horizontal") +
  labs(title = "Violent Crime in London", subtitle = "2015-16", caption = "Source: London Data Store")
```

---

# Maps

---

# Interactive Graphics

As well as static graphics with `ggplot2`, [`htmlwidgets`](https://www.htmlwidgets.org/) bring the best of JavaScript data visualisation to R.

A few lines of R code can produce rich interactive charts and maps that you can embed in reports and dashboards.

For example, Plotly allows you to convert a static ggplot to an interactive graphics with one line!

```r
# make the same london crime ggplot as before with added html tooltip aesthetic
library(plotly) # R package for the plotly javascript library
library(glue) # for glueing strings to data

p <- ggplot(london_crime, aes(year, rate, group = Borough, 
                              text = glue("<b>{Borough}</b><br>Year: {year}<br>Rate: {round(rate, 2)}"))) +
  geom_line(colour = "SteelBlue") +
  scale_x_discrete(breaks = plot_years[seq(1, length(plot_years), by = 2)]) +
  labs(x = "Year", y = "Rate per Thousand Population", title = "Violent Crime Trends by London Boroughs")

p_i <- ggplotly(p, tooltip = "text") # then convert to plotly js chart
```

Chart on the next slide!

---

# ggplot2 -> Ploty JS

---

# Leaflet

We can use the previously merged london borough shapefile and violent crime data to make an interactive leaflet map like so...

```r
library(leaflet) # for interactive mapping

pal <- colorNumeric("viridis", NULL)

ldn_map_i <- leaflet(ldn_shape) %>% 
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(
    fill = ~rate, fillColor = ~pal(rate), fillOpacity = .6,
    label = ~glue("<b>{Borough}</b><br>Year: {year}<br>Rate: {round(rate, 2)}") %>% lapply(htmltools::HTML), 
    stroke = TRUE, color = "#FAFAFA", weight = 1, smoothFactor = 0
  ) %>% 
  addLegend(pal = pal, values = ~rate, opacity = 1.0, title = "Crime<br>Rate")
```

---

# Leaflet

---

# Highcharts

```r
library(highcharter) # R package for the highcharts javascript library
library(treemap) # to load GNI data below
data(GNI2014) # world gross national income data

# 2-tiered treemap with one line of code
treemap <- hctreemap2(GNI2014, c("continent", "iso3"), size_var = "population", color_var = "GNI")
```

---

# Communication

---

# Communication

All your coding is only going to be useful if you can effectively communicate the findings to a broader audience of stakeholders, in formats they are comfortable with.

This is where R really excels above other programming languages:

- [`rmarkdown`](https://rmarkdown.rstudio.com/) is an authoring format that enables the easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of markdown (an easy to write plain text format) with embedded R code chunks that are run so their output can be included in the final document

- [`Shiny`](https://shiny.rstudio.com/) makes it possible to build fully interactive web-apps and dashboards that puts the analytical and graphical power of R at the fingertips of your users in elegant, browser-based user interfaces

]

![shiny](static/imgs/shiny.png)
]

---

# Rmarkdown

Combining narrative and commentary with code in a single document is a very powerful concept. It makes your report fully **reproducible**. That is, a script you can simply re-run whenever your data or code changes and all data-driven R outputs will update automatically in your report. No more copy & pasting data around various MS Office programs!*

The output options enables you to share your analysis in the format your audience is most comfortable with, be it:

- HTML Report (enables the use of interactive graphics)
- HTML Slideshow (this entire slideshow is written in rmarkdown!)
- PDF Report
- MS Word Document
- MS PowerPoint Slides
- Dashboards
- Websites - [cultureofinsight.com](https://cultureofinsight.com) is written in rmarkdown

Having your report as an executable R script also allows you to paramaterise some variables and automate the generation of several different reports - one for each subset of a data set, for example.

.footnote[
*[Check out this video](https://www.youtube.com/watch?v=s3JldKoA0zw) for a better illustration of the problem and solution to traditional data reporting methods.
]

---

# Rmarkdown

---

# Shiny

Shiny goes a step further and allows end-users to interact with a live R session through a user-interface, giving them the power of R without any required knowledge of coding.

Shiny dashboards are significantly more powerful than expensive off-the-shelf dashboard software that are limited to visualising the data you supply to them with their own charting libraries. Shiny can draw on the vast eco-system of open-source technology, dynamically run any type of R code you want it to, and feed the results back to the end-user.

#### Example Use Case

- fetch some live google analytics data
- merge it with some internal market research data
- run a model to calculate whether a recent marketing campaign drove a significant increase in web traffic
- visualise the results to the user in interactive charts
- give them the option of downloading an auto-generated rmarkdown PDF report

And there's no limitations on the look and feel of your app. There's lots of great CSS themes to choose from, or if you know some HTML + CSS you can build a complety bespoke user interace interface from the ground-up.

Check out some of the shiny apps we've made over on the Culture of Insight [portfolio page.](https://cultureofinsight.com/portfolio)

---

# Shiny

---

# Learning

---

# Learning

We hope this has inspired you to start learning some R!

We're currently offering a 1-day "Introduction to R" course to get people started on their journey to data programming nirvana.

The course will teach you how to import, collate, join, and transform datasets with the tidyverse packages in the RStudio IDE.

This will provide the base knowledge you need to:

- Liberate you from the constraints of Excel
- Elimate human error in your work by automating your data tasks with R scripts
- Free-up time to focus on insight, not data wrangling

You can extend your training to focus on topics like:

- Data visualisation with `ggplot2`
- Automated reporting with `rmarkdown`
- Web-application development with `shiny`

Head over to our [eventbrite page](https://www.eventbrite.co.uk/e/master-data-with-code-tickets-50351709318) to sign up, or [get in touch here](https://cultureofinsight.com/training) if you'd like to know more!

---

# Thank you!

[cultureofinsight.com](https://cultureofinsight.com/)

twitter: [@paulcampbell91](https:/twitter.com/paulcampbell91)

github: [paulc91](https://github.com/PaulC91)

linkedin: [paul campbell](https://www.linkedin.com/in/paul-campbell-792425123/)