R - the tool

Ilya Kashnitsky

06 December 2016

What is R?


  • R is a language and environment for statistical computing and graphics
  • Open source
  • One of the most widely used statistical software environments in the world
  • Amazing number of contributors

Who uses R?

  • Researchers
  • Data analysts
  • Programmers
  • Journalists
  • Web designers
  • In academia, R is the de facto standard tool

Illustration

https://www.r-users.com

https://www.demogr.mpg.de/en/education_career/jobs_fellowships_1910/statistical_analyst_4828.htm

http://blog.revolutionanalytics.com/2014/05/companies-using-r-in-2014.html

Why R?

“R offers a breadth and depth in statistical computing beyond what is available in commercial closed source products. Yet R remains, primarily, a programming language for the highly skilled statistician, and out of the reach of many.”

Williams, G. J. (2009). Rattle: A Data Mining GUI for R. The R Journal, 1/2, 45-55.

PRO

  • Open source
  • Reproducible
  • Easy and fast to work with large datasets
  • Most importantly, R can easily replace most of the tools one uses daily for research

CONTRA

  • Steep learning curve
  • Not perfect for handling survey data (metadata)

Examples of data visualization

https://ourworldindata.org

https://jschoeley.shinyapps.io/hmdexp

Election 2016: Exit Polls

http://www.nytimes.com/interactive/2016/11/08/us/politics/election-exit-polls.html

A Day in the Life of Americans

http://flowingdata.com/2015/12/15/a-day-in-the-life-of-americans

Where People Run in Major Cities

http://flowingdata.com/2014/02/05/where-people-run/

https://youtu.be/aOtQyfbRMQY

Reproducible research

Reproducibility

  • Reproducibility of scientific results is a major challenge for modern academia
  • Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
  • Out of 100 studies, only 39 were judged to have replicated

Literate programming

  • The idea of literate programming dates back to the mid-1980s
  • The core idea: code, comments, and results should appear together in one document.
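A minimal sketch of the idea in plain R, assuming the knitr package's "spin" convention: lines that start with #' are treated as prose, everything else is code, and the results are placed next to the code when the script is compiled into a report. The variable name and the numbers are made up for illustration.

```r
#' # A tiny report
#'
#' Lines starting with #' become text in the compiled document.
#' The code below is shown together with its output.

# A hypothetical sample of three measurements
x <- c(2, 4, 6)

# The result of this call appears right under the code in the report
sample_mean <- mean(x)
sample_mean
```

If this script were saved as, say, report.R (a hypothetical file name), calling knitr::spin("report.R") should produce a single document that holds the text, the code, and the output together.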

R notebooks

  • The recently released version 1.0 of RStudio brings R Notebooks as one of its main features.
  • This type of document implements the principles of literate programming.

http://www.danielwells.me/human-lifespan-limit/

Practice

Swirl

  • install.packages('swirl')
  • library(swirl)
  • swirl()

Tidyverse

The most influential R developer

Hadley Wickham


tidyverse

https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/

tidy data

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10). Retrieved from http://www.jstatsoft.org/v59/i10

Tidy data is a standard way of mapping the meaning of a dataset to its structure.

A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types.

In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.
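The three rules can be illustrated with a small sketch. It assumes the tidyr package (version 1.0.0 or later, which provides pivot_longer()); the table, the country codes, and the values are invented for illustration and do not come from the paper.

```r
library(tidyr)

# Hypothetical messy table: the variable "year" is spread across column names
messy <- data.frame(
  country = c("A", "B"),
  `2015` = c(10, 20),
  `2016` = c(11, 21),
  check.names = FALSE  # keep "2015" and "2016" as column names
)

# Reshape so that each variable is a column and each observation is a row
tidy <- pivot_longer(
  messy,
  cols = -country,      # every column except country holds values
  names_to = "year",    # former column names become the year variable
  values_to = "value"   # cell contents become the value variable
)

# Column names arrive as text, so convert year to a number
tidy$year <- as.integer(tidy$year)

tidy
```

In the reshaped table every row is one country-year observation, and country, year, and value each have their own column.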

Examples and exercises

  • Please follow along with me in the R script "examples-tidyverse.R"
  • Then proceed to "exercises.R"