R - the tool

Ilya Kashnitsky

17 October 2017

What is R?

  • R is a language and environment for statistical computing and graphics
  • Open source
  • One of the most widely used statistical software environments in the world
  • Amazing number of contributors

Who uses R?

  • Researchers
  • Data analysts
  • Programmers
  • Journalists
  • Web designers
  • In academia, R is the de facto standard tool

Illustration

https://www.r-users.com

http://blog.revolutionanalytics.com/2014/05/companies-using-r-in-2014.html

Why R?

“R offers a breadth and depth in statistical computing beyond what is available in commercial closed source products. Yet R remains, primarily, a programming language for the highly skilled statistician, and out of the reach of many.”

Williams, G. J. (2009). Rattle: A Data Mining GUI for R. The R Journal, 1/2, 45-55.

PRO

  • Open source
  • Reproducible
  • Easy and fast to work with large datasets
  • Most importantly, R can easily replace most of the tools one uses daily for research

CONTRA

  • Steep learning curve
  • Not perfect for handling survey data (metadata)

Examples of data visualization

https://ourworldindata.org

https://jschoeley.shinyapps.io/hmdexp

http://bancdadesced.uab.es/population_change/

http://www.global-migration.info

Election 2016: Exit Polls

http://www.nytimes.com/interactive/2016/11/08/us/politics/election-exit-polls.html

A Day in the Life of Americans

http://flowingdata.com/2015/12/15/a-day-in-the-life-of-americans

Where People Run in Major Cities

http://flowingdata.com/2014/02/05/where-people-run/

Reproducible research

Reproducibility

  • Reproducibility of scientific results is a major challenge for modern academia
  • Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
  • Of the 100 studies the project tried to replicate, only 39 were judged to have replicated

Literate programming

  • The idea of literate programming dates back to the mid-1980s
  • The core idea: code, comments, and results should appear together in one document.

R notebooks

  • The recently released RStudio v1.0 introduces R Notebooks as one of its main features.
  • This type of document implements the principles of literate programming.
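
A minimal sketch of the literate programming idea, assuming the knitr package is installed (knitr is the engine behind R Markdown documents; the tiny document below is made up for illustration). Text and a code chunk go in, and one document with text, code, and results comes out:

```r
library(knitr)

# Three backticks: the chunk delimiter used in R Markdown documents
fence <- strrep("`", 3)

# A tiny illustrative document: one line of prose and one R code chunk
doc <- c(
  "The mean stopping distance in the built-in cars dataset:",
  "",
  paste0(fence, "{r}"),
  "mean(cars$dist)",
  fence
)

# Run the chunk and weave the code and its result back into the text
out <- knit(text = doc, quiet = TRUE)
cat(out)
```

In RStudio the same document would normally live in an .Rmd file and be rendered with the Knit or Preview button rather than by calling knit() by hand.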

http://www.danielwells.me/human-lifespan-limit/

Practice

Swirl

  • install.packages('swirl')
  • library(swirl)
  • swirl()

Tidyverse

The most influential R developer

Hadley Wickham

tidyverse

https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/

tidy data

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10). Retrieved from http://www.jstatsoft.org/v59/i10

Tidy data is a standard way of mapping the meaning of a dataset to its structure.

A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.

In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
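
The three rules above can be sketched on a toy table. The values below are invented purely for illustration, and the example assumes tidyr 1.0.0 or later, which provides pivot_longer() (older versions used gather() for the same job). In the messy table the variable "year" is hidden in the column names; in the tidy table it becomes a column of its own:

```r
library(tidyr)

# Messy: one row per country, one column per year (toy values, not real data)
messy <- data.frame(
  country = c("A", "B"),
  `2015`  = c(10, 20),
  `2016`  = c(11, 22),
  check.names = FALSE
)

# Tidy: each variable (country, year, population) forms a column,
# each observation (one country in one year) forms a row
tidy <- pivot_longer(
  messy,
  cols      = c(`2015`, `2016`),
  names_to  = "year",
  values_to = "population"
)

print(tidy)
```

The tidy version has four rows instead of two, which is the usual trade-off: longer tables, but every tidyverse function can work with them without special cases.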

Examples and exercises

  • Please follow along with the R script "examples-tidyverse.R"
  • Then proceed to "exercises.R"

Demographic data acquisition

https://ikashnitsky.github.io/2017/data-acquisition-one/

Visualizing data with ggplot2

A bit more motivation

http://qz.com/316906/the-dude-map-how-american-men-refer-to-their-bros

Visualizing life tables

http://flowingdata.com/2016/01/19/how-you-will-die

https://www.r-bloggers.com/pisa-2015-how-to-readprocessplot-the-data-with-r

American schools

http://www.nytimes.com/interactive/2016/04/29/upshot/money-race-and-success-how-your-school-district-compares.html

Plotting systems in R

  • base
  • lattice
  • ggplot2
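
For comparison, here is a rough sketch of the same scatterplot in each of the three systems, using the built-in mtcars dataset. It assumes the lattice and ggplot2 packages are installed; the choice of variables is only an example:

```r
library(lattice)
library(ggplot2)

# Draw to a null PDF device so the sketch also runs without a display
grDevices::pdf(NULL)

# base: draws immediately, step by step, onto the current device
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# lattice: a formula interface that returns a "trellis" object
p_lattice <- xyplot(mpg ~ wt, data = mtcars,
                    xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
print(p_lattice)

# ggplot2: the plot is built from layers and drawn when printed
p_gg <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
print(p_gg)

invisible(grDevices::dev.off())
```

Note that lattice and ggplot2 return plot objects that can be stored and modified, while base plotting has no such object.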

“The winner takes it all”

Strengths of the base plotting system

  • Usually, base knows how to plot an object
  • Extremely easy to use if you are happy with the default settings
  • BUT ggplot2 now has the autoplot() function, which aims to do the same
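
A small sketch of what "base knows how to plot an object" means in practice: the same plot() call picks a different default display depending on the class of its argument. The numbers in x are arbitrary illustration values; cars is a built-in dataset:

```r
# Draw to a null PDF device so the sketch also runs without a display
grDevices::pdf(NULL)

x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7)  # arbitrary illustration values

plot(x)                    # numeric vector: values against their index

dens <- density(x)
plot(dens)                 # "density" object: smoothed density curve

fit <- lm(dist ~ speed, data = cars)
plot(fit, which = 1)       # "lm" object: residuals vs fitted values

invisible(grDevices::dev.off())

# The class of each object decides which plotting method is used
class(dens)
class(fit)
```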

The only case where ggplot2 failed me

http://stackoverflow.com/questions/17753502

https://github.com/tidyverse/ggplot2/issues/1720

What makes ggplot2 special?

“gg” stands for “Grammar of Graphics”

http://www.springer.com/us/book/9780387245447
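
A hedged sketch of what the grammar looks like in code, assuming ggplot2 is installed and using the built-in mtcars dataset (the variables and styling are chosen only for illustration). Each component of the plot is a separate piece added with +:

```r
library(ggplot2)

p <- ggplot(
  data = mtcars,                                          # data
  mapping = aes(x = wt, y = mpg, colour = factor(cyl))    # aesthetic mappings
) +
  geom_point() +                                          # layer 1: points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # layer 2: linear fits
  scale_colour_brewer(palette = "Dark2") +                # scale: colour palette
  facet_wrap(~ am) +                                      # facets: one panel per transmission type
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders") +                            # labels
  theme_minimal()                                         # theme: non-data appearance

# Draw to a null PDF device so the sketch also runs without a display
grDevices::pdf(NULL)
print(p)
invisible(grDevices::dev.off())
```

Because every component is independent, swapping geom_point() for another geom or removing facet_wrap() changes one aspect of the plot without touching the rest.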

Extremely large and helpful community

  • Help
  • Examples
  • Rapid development
  • Extensions

http://www.ggplot2-exts.org/gallery/

Amazing documentation

http://docs.ggplot2.org/current

ggplot2 show

Please follow along with “examples-ggplot2.R”