You are currently browsing hadleywickham’s articles.

I’ve released four new data packages to CRAN: babynames, fueleconomy, nasaweather and nycflights13. The goal of these packages is to provide some interesting, and relatively large, datasets to demonstrate various data analysis challenges in R. The package source code (on github, linked above) is fully reproducible so that you can see some data tidying in action, or make your own modifications to the data.

Below, I’ve listed the primary dataset found in each package. Most packages also include a number of supplementary datasets that provide additional information. Check out the docs for more details.

  • babynames::babynames: US baby name data for each year from 1880 to 2013, the number of children of each sex given each name. All names used 5 or more times are included. 1,792,091 rows, 5 columns (year, sex, name, n, prop). (Source: Social security administration).
  • fueleconomy::vehicles: Fuel economy data for all cars sold in the US from 1984 to 2015. 33,442 rows, 12 variables. (Source: Environmental protection agency)
  • nasaweather::atmos: Data from the 2006 ASA data expo. Contains monthly atmospheric measurements from Jan 1995 to Dec 2000 on 24 x 24 grid over Central America. 41,472 observations, 11 variables. (Source: ASA data expo)
  • nycflights13::flights: This package contains information about all flights that departed from NYC (i.e., EWR, JFK and LGA) in 2013: 336,776 flights with 16 variables. To help understand what causes delays, it also includes a number of other useful datasets: weather, planes, airports, airlines. (Source: Bureau of transportation statistics)

NB: since the datasets are large, I’ve tagged each data frame with the tbl_df class. If you don’t use dplyr, this has no effect. If you do use dplyr, this ensures that you won’t accidentally print thousands of rows of data. Instead, you’ll just see the first 10 rows and as many columns as will fit on screen. This makes interactive exploration much easier.

library(dplyr)
library(nycflights13)
flights
#> Source: local data frame [336,776 x 16]
#> 
#>    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1  2013     1   1      517         2      830        11      UA  N14228
#> 2  2013     1   1      533         4      850        20      UA  N24211
#> 3  2013     1   1      542         2      923        33      AA  N619AA
#> 4  2013     1   1      544        -1     1004       -18      B6  N804JB
#> 5  2013     1   1      554        -6      812       -25      DL  N668DN
#> 6  2013     1   1      554        -4      740        12      UA  N39463
#> 7  2013     1   1      555        -5      913        19      B6  N516JB
#> 8  2013     1   1      557        -3      709       -14      EV  N829AS
#> 9  2013     1   1      557        -3      838        -8      B6  N593JB
#> 10 2013     1   1      558        -2      753         8      AA  N3ALAA
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...
#> Variables not shown: flight (int), origin (chr), dest (chr), air_time
#>   (dbl), distance (dbl), hour (dbl), minute (dbl)

tidyr is new package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages). The two most important properties of tidy data are:

  • Each column is a variable.
  • Each row is an observation.

Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When use tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.

To tidy messy data, you first identify the variables in your dataset, then use the tools provided by tidyr to move them into columns. tidyr provides three main functions for tidying your messy data: gather(), separate() and spread().

gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer. Other names for gather include melt (reshape2), pivot (spreadsheets) and fold (databases). Here’s an example how you might use gather() on a made-up dataset. In this experiment we’ve given three people two different drugs and recorded their heart rate:

library(tidyr)
library(dplyr)

messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)
messy
#>      name  a  b
#> 1  Wilbur 67 56
#> 2 Petunia 80 90
#> 3 Gregory 64 50

We have three variables (name, drug and heartrate), but only name is currently in a column. We use gather() to gather the a and b columns into key-value pairs of drug and heartrate:

messy %>%
  gather(drug, heartrate, a:b)
#>      name drug heartrate
#> 1  Wilbur    a        67
#> 2 Petunia    a        80
#> 3 Gregory    a        64
#> 4  Wilbur    b        56
#> 5 Petunia    b        90
#> 6 Gregory    b        50

Sometimes two variables are clumped together in one column. separate() allows you to tease them apart (extract() works similarly but uses regexp groups instead of a splitting pattern or position). Take this example from stackoverflow (modified slightly for brevity). We have some measurements of how much time people spend on their phones, measured at two locations (work and home), at two times. Each person has been randomly assigned to either treatment or control.

set.seed(10)
messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c('control', 'treatment'), each = 2)),
  work.T1 = runif(4),
  home.T1 = runif(4),
  work.T2 = runif(4),
  home.T2 = runif(4)
)

To tidy this data, we first use gather() to turn columns work.T1, home.T1, work.T2 and home.T2 into a key-value pair of key and time. (Only the first eight rows are shown to save space.)

tidier <- messy %>%
  gather(key, time, -id, -trt)
tidier %>% head(8)
#>   id       trt     key    time
#> 1  1 treatment work.T1 0.08514
#> 2  2   control work.T1 0.22544
#> 3  3 treatment work.T1 0.27453
#> 4  4   control work.T1 0.27231
#> 5  1 treatment home.T1 0.61583
#> 6  2   control home.T1 0.42967
#> 7  3 treatment home.T1 0.65166
#> 8  4   control home.T1 0.56774

Next we use separate() to split the key into location and time, using a regular expression to describe the character that separates them.

tidy <- tidier %>%
  separate(key, into = c("location", "time"), sep = "\\.") 
tidy %>% head(8)
#>   id       trt location time    time
#> 1  1 treatment     work   T1 0.08514
#> 2  2   control     work   T1 0.22544
#> 3  3 treatment     work   T1 0.27453
#> 4  4   control     work   T1 0.27231
#> 5  1 treatment     home   T1 0.61583
#> 6  2   control     home   T1 0.42967
#> 7  3 treatment     home   T1 0.65166
#> 8  4   control     home   T1 0.56774

The last tool, spread(), takes two columns (a key-value pair) and spreads them in to multiple columns, making “long” data wider. Spread is known by other names in other places: it’s cast in reshape2, unpivot in spreadsheets and unfold in databases. spread() is used when you have variables that form rows instead of columns. You need spread() less frequently than gather() or separate() so to learn more, check out the documentation and the demos.

Just as reshape2 did less than reshape, tidyr does less than reshape2. It’s designed specifically for tidying data, not general reshaping. In particular, existing methods only work for data frames, and tidyr never aggregates. This makes each function in tidyr simpler: each function does one thing well. For more complicated operations you can string together multiple simple tidyr and dplyr functions with %>%.

You can learn more about the underlying principles in my tidy data paper. To see more examples of data tidying, read the vignette, vignette("tidy-data"), or check out the demos, demo(package = "tidyr"). Alternatively, check out some of the great stackoverflow answers that use tidyr. Keep up-to-date with development at http://github.com/hadley/tidyr, report bugs at http://github.com/hadley/tidyr/issues and get help with data manipulation challenges at https://groups.google.com/group/manipulatr. If you ask a question specifically about tidyr on stackoverflow, please tag it with tidyr and I’ll make sure to read it.

I’m very excited to announce dplyr 0.2. It has three big features:

  • improved piping courtesy of the magrittr package

  • a vastly more useful implementation of do()

  • five new verbs: sample_n(), sample_frac(), summarise_each(), mutate_each and glimpse().

These features are described in more detail below. To learn more about the 35 new minor improvements and bug fixes, please read the full release notes.

Improved piping

dplyr now imports %>% from the magrittr package by Stefan Milton Bache. I recommend that you use this instead of %.% because it is easier to type (since you can hold down the shift key) and is more flexible. With you %>%, you can control which argument on the RHS receives the LHS with the pronoun .. This makes %>% more useful with base R functions because they don’t always take the data frame as the first argument. For example you could pipe mtcars to xtabs() with:

mtcars %>% xtabs( ~ cyl + vs, data = .)

dplyr only exports %>% from magrittr, but magrittr contains many other useful functions. To use them, load magrittr explicitly with library(magrittr). For more details, see vignette("magrittr").
%.% will be deprecated in a future version of dplyr, but it won’t happen for a while. I’ve deprecated chain() to encourage a single style of dplyr usage: please use %>% instead.

Do

do() has been completely overhauled, and group_by() + do() is now equivalent in power to plyr::dlply(). There are two ways to use do(), either with multiple named arguments or a single unnamed arguments. If you use named arguments, each argument becomes a list-variable in the output. A list-variable can contain any arbitrary R object which makes this form of do() useful for storing models:

library(dplyr)
models %>% group_by(cyl) %>% do(model = lm(mpg ~ wt, data = .))
models %>% summarise(rsq = summary(model)$r.squared)

If you use an unnamed argument, the result should be a data frame. This allows you to apply arbitrary functions to each group.

mtcars %>% group_by(cyl) %>% do(head(., 1))

Note the use of the pronoun . to refer to the data in the current group.
do() also has an automatic progress bar. It appears if the computation takes longer than 2 seconds and estimates how long the job will take to complete.

New verbs

sample_n() randomly samples a fixed number of rows from a tbl; sample_frac() randomly samples a fixed fraction of rows. They currently only work for local data frames and data tables.
summarise_each() and mutate_each() make it easy to apply one or more functions to multiple columns in a tbl. These works for all srcs that summarise() and mutate() work for.
glimpse() makes it possible to see all the columns in a tbl, displaying as much data for each variable as can be fit on a single line.

We’re pleased to announce a new version of roxygen2. Roxygen2 allows you to write documentation comments that are automatically converted to R’s standard Rd format, saving you time and reducing duplication. This release is a major update that provides enhanced error handling and considerably safer default behaviour. Roxygen2 now adds a comment to all generated files so that you know they shouldn’t be edited by hand. This also ensures that roxygen2 will never overwrite a file that it did not create, and can automatically remove files that are no longer needed.

I’ve also written some vignettes to help you understand how to use roxygen2. Six new vignettes provide a comprehensive overview of using roxygen2 in practice. Run browseVignettes("roxygen2") to read them. In an effort to make roxygen2 easier to use and more consistent between package authors, I’ve made parsing considerably stricter, and made sure that all errors give you the line number of the associated roxygen block. Every input is now checked to make sure that it has (e.g. every { has a matching }). This should prevent frustrating errors that require careful reading of .Rd files. Similarly, @section titles and @export tags can now only span a single line as this prevents a number of common bugs.

Other features include two new tags @describeIn and @field, and you can document objects (like datasets) by documenting their name as a string. For example, to document a dataset called mydata, you can do:

#' Mydata set
#'
#' Some data I collected about myself
"mydata"

To see a complete list of all bug fixes and improvements, please see the release notes for roxygen2 4.0.0 for details. Roxygen2 4.0.1 fixed a couple of minor bugs and majorly improved the upgrade process.

reshape2 1.4 is now available on CRAN. This version adds a number of useful arguments and messages, but mostly importantly it gains a C++ implementation of melt.data.frame(). This new method should be much much faster (>10x) and does a better job of preserving existing attributes. For full details, see the release notes on github.

The C++ implementation of melt was contributed by Kevin Ushey, who we’re very pleased to announce has joined RStudio. You may be familiar with Kevin from his contributions to Rcpp, or his CRAN packages Kmisc and timeit.

devtools 1.5 is now available on CRAN. It includes four new functions to make it easier to add useful infrastructure to packages:

  • add_test_infrastructure() will create testthat infrastructure when needed.

  • add_rstudio_project() adds an Rstudio project file to your package.

  • add_travis() adds a basic template for travis-ci.

  • add_build_ignore() makes it easy to add files to .Rbuildignore,
    escaping special characters as needed.

We’ve also bumped two dependencies: devtools now requires R 3.0.2 and roxygen2 3.0.0. We’ve also included many minor improvements and bug fixes, particularly for package installation. For example install_github() now prefers the safer github personal access token, and does a better job of installing the dependencies that you actually need. We also provide versions of help(), ? and system.file() that work with all packages, regardless of how they’re loaded. See a complete list of changes in the full release notes.

We’re very pleased to announce the release of httr 0.3. httr makes it
easy to work with modern web apis so that you can work with web data
almost as easily as local data. For example, this code shows how might
find the most recently asked question about R on stackoverflow:

# install.packages("httr")
library(httr)

# Find the most recent R questions on stackoverflow
r <- GET(
  "http://api.stackexchange.com",
  path = "questions",
  query = list(
    site = "stackoverflow.com",
    tagged = "r"
  )
)

# Check the request succeeded
stop_for_status(r)

# Automatically parse the json output
questions <- content(r)
questions$items[[1]]$title
#> [1] "Remove NAs from data frame without deleting entire rows/columns"

httr 0.3 recieved a major overhaul to OAuth support. OAuth is a modern
standard for authentication used when you want to allow a service (i.e R
package) access to your account on a website. This version of httr
provides an improved initial authentication experience and supports
caching so that you only need to authenticate once per project. A big
thanks goes to Craig Citro (Google) who contributed a lot of code and
ideas to make this possible.

httr 0.3 also includes many other bug fixes and minor improvements. You
can read about these in the github release notes.

dplyr 0.1.3 is now on CRAN. It fixes an incompatibility with the latest version of Rcpp, and a number of other bugs that were causing dplyr to crash R. See the full details in the release notes.

We’re pleased to announce a new major version of testthat. Version 0.8 comes with a new recommended structure for storing your tests. To better meet CRAN recommended practices, we now recommend that tests live in tests/testthat, instead of inst/tests. This makes it possible for users to choose whether or not to install tests. With this new structure, you’ll need to use test_check() instead of test_packages() in the test file (usually tests/testthat.R) that runs all testthat unit tests.

Another big improvement comes from Karl Forner. He contributed code which provides line numbers in test errors so you can see exactly where the problems are. There are also four new expectations (expect_null(), expected_named(), expect_more_than(), expect_less_than()) and many other minor improvements and bug fixes. For a complete list of changes, please see the github release. After release of 0.8 to CRAN, we discovered two small bugs. These were fixed in 0.8.1.

As always, you can install the latest version with install.packages("testthat").

We’re pleased to announce a new minor version of dplyr. This fixes a number of bugs that crashed R, and considerably improves the functionality of select(). You can now use named arguments to rename existing variables, and use new functions starts_with(), ends_with()contains(),  matches() and num_range() to select variables based on their names. Finally, select() now makes a shallow copy, substantially reducing its memory impact. I’ve also added the summarize() alias for people from countries who don’t spell correctly ;)

For a complete list of changes, please see the github release, and as always, you can install the latest version with install.packages("dplyr").

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 600 other followers

RStudio is an affiliated project of the Foundation for Open Access Statistics

Follow

Get every new post delivered to your Inbox.

Join 600 other followers