You are currently browsing hadleywickham’s articles.

I’m very pleased to announce that dplyr 0.3 is now available from CRAN. Get the latest version by running:


There are four major new features:

  • Four new high-level verbs: distinct(), slice(), rename(), and transmute().
  • Three new helper functions between, count(), and data_frame().
  • More flexible join specifications.
  • Support for row-based set operations.

There are two new features of interest to developers. They make it easier to write packages that use dplyr:

  • It’s now much easier to program with dplyr (using standard evaluation).
  • Improved database backends.

I describe each of these in turn below.

New verbs

distinct() returns distinct (unique) rows of a table:

# Find all origin-destination pairs
flights %>% 
  select(origin, dest) %>%
#> Source: local data frame [224 x 2]
#>    origin dest
#> 1     EWR  IAH
#> 2     LGA  IAH
#> 3     JFK  MIA
#> 4     JFK  BQN
#> 5     LGA  ATL
#> ..    ...  ...

slice() allows you to select rows by position. It includes positive integers and drops negative integers:

# Get the first flight to each destination
flights %>% 
  group_by(dest) %>%
#> Source: local data frame [105 x 16]
#> Groups: dest
#>    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1  2013    10   1     1955        -6     2213       -35      B6  N554JB
#> 2  2013    10   1     1149       -10     1245       -14      B6  N346JB
#> 3  2013     1   1     1315        -2     1413       -10      EV  N13538
#> 4  2013     7   6     1629        14     1954         1      UA  N587UA
#> 5  2013     1   1      554        -6      812       -25      DL  N668DN
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...
#> Variables not shown: flight (int), origin (chr), dest (chr), air_time
#>   (dbl), distance (dbl), hour (dbl), minute (dbl)

transmute() and rename() are variants of mutate() and select(). Transmute drops all columns that you didn’t specifically mention, rename() keeps all columns that you didn’t specifically mention. They complete this table:

Drop others Keep others
Rename & reorder variables select() rename()
Compute new variables transmute() mutate()

New helpers

data_frame(), contributed by Kevin Ushey, is a nice way to create data frames:

  • It never changes the type of its inputs (i.e. no more stringsAsFactors = FALSE!)
    data.frame(x = letters) %>% sapply(class)
    #>        x 
    #> "factor"
    data_frame(x = letters) %>% sapply(class)
    #>           x 
    #> "character"
  • Or the names of variables:
    data.frame(`crazy name` = 1) %>% names()
    #> [1] ""
    data_frame(`crazy name` = 1) %>% names()
    #> [1] "crazy name"
  • It evaluates its arguments lazyily and in order:
    data_frame(x = 1:5, y = x ^ 2)
    #> Source: local data frame [5 x 2]
    #>   x  y
    #> 1 1  1
    #> 2 2  4
    #> 3 3  9
    #> 4 4 16
    #> 5 5 25
  • It adds tbl_df() class to output, never adds row.names(), and only recycles vectors of length 1 (recycling is a frequent source of bugs in my experience).

The count() function wraps up the common combination of group_by() and summarise():

# How many flights to each destination?
flights %>% count(dest)
#> Source: local data frame [105 x 2]
#>    dest     n
#> 1   ABQ   254
#> 2   ACK   265
#> 3   ALB   439
#> 4   ANC     8
#> 5   ATL 17215
#> ..  ...   ...

# Which planes flew the most?
flights %>% count(tailnum, sort = TRUE)
#> Source: local data frame [4,044 x 2]
#>    tailnum    n
#> 1          2512
#> 2   N725MQ  575
#> 3   N722MQ  513
#> 4   N723MQ  507
#> 5   N711MQ  486
#> ..     ...  ...

# What's the total carrying capacity of planes by year of purchase
planes %>% count(year, wt = seats)
#> Source: local data frame [47 x 2]
#>    year   n
#> 1  1956 102
#> 2  1959  18
#> 3  1963  10
#> 4  1965 149
#> 5  1967   9
#> ..  ... ...

Better joins

You can now join by different variables in each table:

narrow <- flights %>% select(origin, dest, year:day)

# Add destination airport metadata
narrow %>% left_join(airports, c("dest" = "faa"))
#> Source: local data frame [336,776 x 11]
#>    dest origin year month day                            name   lat    lon
#> 1   IAH    EWR 2013     1   1    George Bush Intercontinental 29.98 -95.34
#> 2   IAH    LGA 2013     1   1    George Bush Intercontinental 29.98 -95.34
#> 3   MIA    JFK 2013     1   1                      Miami Intl 25.79 -80.29
#> 4   BQN    JFK 2013     1   1                              NA    NA     NA
#> 5   ATL    LGA 2013     1   1 Hartsfield Jackson Atlanta Intl 33.64 -84.43
#> ..  ...    ...  ...   ... ...                             ...   ...    ...
#> Variables not shown: alt (int), tz (dbl), dst (chr)

# Add origin airport metadata
narrow %>% left_join(airports, c("origin" = "faa"))
#> Source: local data frame [336,776 x 11]
#>    origin dest year month day                name   lat    lon alt tz dst
#> 1     EWR  IAH 2013     1   1 Newark Liberty Intl 40.69 -74.17  18 -5   A
#> 2     LGA  IAH 2013     1   1          La Guardia 40.78 -73.87  22 -5   A
#> 3     JFK  MIA 2013     1   1 John F Kennedy Intl 40.64 -73.78  13 -5   A
#> 4     JFK  BQN 2013     1   1 John F Kennedy Intl 40.64 -73.78  13 -5   A
#> 5     LGA  ATL 2013     1   1          La Guardia 40.78 -73.87  22 -5   A
#> ..    ...  ...  ...   ... ...                 ...   ...    ... ... .. ...

(right_join() and outer_join() implementations are planned for dplyr 0.4.)

Set operations

You can use intersect(), union() and setdiff() with data frames, data tables and databases:

jfk_planes <- flights %>% 
  filter(origin == "JFK") %>% 
  select(tailnum) %>% 
lga_planes <- flights %>% 
  filter(origin == "LGA") %>% 
  select(tailnum) %>% 

# Planes that fly out of either JGK or LGA
nrow(union(jfk_planes, lga_planes))
#> [1] 3592

# Planes that fly out of both JFK and LGA
nrow(intersect(jfk_planes, lga_planes))
#> [1] 1311

# Planes that fly out JGK but not LGA
nrow(setdiff(jfk_planes, lga_planes))
#> [1] 647

Programming with dplyr

You can now program with dplyr – every function that uses non-standard evaluation (NSE) also has a standard evaluation (SE) twin that ends in _. For example, the SE version of filter() is called filter_(). The SE version of each function has similar arguments, but they must be explicitly “quoted”. Usually the best way to do this is to use ~:

airport <- "ANC"
# NSE version
filter(flights, dest == airport)
#> Source: local data frame [8 x 16]
#>    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1  2013     7   6     1629        14     1954         1      UA  N587UA
#> 2  2013     7  13     1618         3     1955         2      UA  N572UA
#> 3  2013     7  20     1618         3     2003        10      UA  N567UA
#> 4  2013     7  27     1617         2     1906       -47      UA  N559UA
#> 5  2013     8   3     1615         0     2003        10      UA  N572UA
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...
#> Variables not shown: flight (int), origin (chr), dest (chr), air_time
#>   (dbl), distance (dbl), hour (dbl), minute (dbl)

# Equivalent SE code:
criteria <- ~dest == airport
filter_(flights, criteria)
#> Source: local data frame [8 x 16]
#>    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1  2013     7   6     1629        14     1954         1      UA  N587UA
#> 2  2013     7  13     1618         3     1955         2      UA  N572UA
#> 3  2013     7  20     1618         3     2003        10      UA  N567UA
#> 4  2013     7  27     1617         2     1906       -47      UA  N559UA
#> 5  2013     8   3     1615         0     2003        10      UA  N572UA
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...
#> Variables not shown: flight (int), origin (chr), dest (chr), air_time
#>   (dbl), distance (dbl), hour (dbl), minute (dbl)

To learn more, read the Non-standard evaluation vignette. This new approach is powered by the lazyeval package which provides all the tools needed to implement NSE consistently and correctly. I now understand how to implement NSE consistently and correctly, and I’ll be using the same approach everywhere.

Database backends

The database backend system has been completely overhauled in order to make it possible to add backends in other packages, and to support a much wider range of databases. If you’re interested in implementing a new dplyr backend, please check out vignette("new-sql-backend") – it’s really not that much work.

The first package to take advantage of this system is MonetDB.R, which now provides the MonetDB backend for dplyr.

Other changes

As well as the big new features described here, dplyr 0.3 also fixes many bugs and makes numerous minor improvements. See the release notes for a complete list of the changes.

Devtools 1.6 is now available on CRAN. Devtools makes it so easy to build a package that it becomes your default way to organise code, data and documentation. Learn more at You can get the latest version with:


We’ve made a lot of improvements to the install and release process:

  • Installation functions now default to build_vignettes = FALSE, and only install required dependencies (not suggested). They also store a lot of useful metadata.
  • install_github() got a lot of love. install_github("user/repo") is now the preferred way to install a package from github (older forms with explicit username parameter are now deprecated). You can supply the host argument to install packages from a local github enterprise installation. You can get the latest release with user/repo@*release.
  • session_info() uses package installation metdata to show you exactly how every package was installed (locally, from CRAN, from github, …)
  • release() uses new webform-based submission process for CRAN, as implemented in submit_cran().
  • You can add arbitrary extra questions to release() by defining a function release_questions() in your package. It should return a character vector of questions to ask.

We’ve also added a number of functions to make it easy to get started with various aspects of the package development:

  • use_data() adds data to a package, either in data/ (external data) or in R/sysdata.rda (internal data). use_data_raw() sets up data-raw/ for your reproducible data generation scripts.
  • use_package() sets dependencies and reminds you how to use them.
  • use_rcpp() gets you ready to use Rcpp.
  • use_testthat() sets up testing infrastructure with testthat.
  • use_travis() adds a .travis.yml file and tells you how to get started with travis ci.
  • use_vignette() creates a draft vignette using Rmarkdown.

There were many other minor improvements and bug fixes. See the release notes for complete list of changes.

testthat 0.9 is now available on CRAN. Testthat makes it easy to turn the informal testing that you’re already doing into formal automated tests. Learn more at

This version of testthat has four important new features that bring testthat up to speed with unit testing frameworks in other languages:

  • You can skip() tests with an informative message, if their prerequisites are not available. This is particularly use for CRAN packages, since tests only have a limited amount of time to run. Use skip_on_cran() skip selected tests when run on CRAN.
    test_that("a complicated simulation takes a long time", {
  • Experiment with behaviour driven development with the new describe() function contributed by Dirk Schumacher:
    describe("matrix()", {
      it("can be multiplied by a scalar", {
        m1 <- matrix(1:4, 2, 2)
        m2 <- m1 * 2
        expect_equivalent(matrix(1:4 * 2, 2, 2), m2)
  • Use with_mock() to “mock” functions, replacing slow, resource intensive or inconsistent functions with your own quick approximations. This is particularly useful when you want to test functions that call web APIs without being connected to the internet. Contributed by Kirill Müller.
  • Sometimes it’s difficult to figure out exactly what a function should return and instead you just want to make sure that it returned the same thing as the last time you ran it. A new expectation, expect_equal_to_reference(), makes this easy to do. Contributed by Jon Clayden.

Other changes of note: auto_test_package() is working again (and uses devtools::load_all() to load the code), random praise has been re-enabled (after being accidentally disabled), and expect_identical() works better with R-devel. See the release notes for complete list of changes.

httr 0.5 is now available on CRAN. The httr packages makes it easy to talk to web APIs from R. Learn more in the quick start vignette.

This release is mostly bug fixes and minor improvements, but there is one major new feature: you can now save response bodies directly to disk.

# Download the latest version of rstudio for windows
url <- ""
GET(url, write_disk(basename(url)), progress())

There is also some preliminary support for HTTP caching (see cache_info() and rerequest()). See the release notes for complete details.

httr 0.4 is now available on CRAN. The httr packages makes it easy to talk to web APIs from R.

The most important new features are two new vignettes to help you get started and to help you make wrappers for web APIs. Other important improvements include:

  • New headers() and cookies() functions to extract headers and cookies from responses. status_code() returns HTTP status codes.
  • POST() (and PUT(), and PATCH()) now have an encode argument that determine how the body is encoded. Valid values are “multipart”, “form” or “json”, and the multipart argument is now deprecated.
  • GET(..., progress()) will display a progress bar, useful if you’re doing large uploads or downloads.
  • verbose() gives you considerably more control over degree of verbosity, and defaults have been selected to be more helpful for the most common cases.
  • NULL query parameters are now dropped automatically.

There are number of other minor improvements and bug fixes, as described by the release notes.

I’ve released four new data packages to CRAN: babynames, fueleconomy, nasaweather and nycflights13. The goal of these packages is to provide some interesting, and relatively large, datasets to demonstrate various data analysis challenges in R. The package source code (on github, linked above) is fully reproducible so that you can see some data tidying in action, or make your own modifications to the data.

Below, I’ve listed the primary dataset found in each package. Most packages also include a number of supplementary datasets that provide additional information. Check out the docs for more details.

  • babynames::babynames: US baby name data for each year from 1880 to 2013, the number of children of each sex given each name. All names used 5 or more times are included. 1,792,091 rows, 5 columns (year, sex, name, n, prop). (Source: Social security administration).
  • fueleconomy::vehicles: Fuel economy data for all cars sold in the US from 1984 to 2015. 33,442 rows, 12 variables. (Source: Environmental protection agency)
  • nasaweather::atmos: Data from the 2006 ASA data expo. Contains monthly atmospheric measurements from Jan 1995 to Dec 2000 on 24 x 24 grid over Central America. 41,472 observations, 11 variables. (Source: ASA data expo)
  • nycflights13::flights: This package contains information about all flights that departed from NYC (i.e., EWR, JFK and LGA) in 2013: 336,776 flights with 16 variables. To help understand what causes delays, it also includes a number of other useful datasets: weather, planes, airports, airlines. (Source: Bureau of transportation statistics)

NB: since the datasets are large, I’ve tagged each data frame with the tbl_df class. If you don’t use dplyr, this has no effect. If you do use dplyr, this ensures that you won’t accidentally print thousands of rows of data. Instead, you’ll just see the first 10 rows and as many columns as will fit on screen. This makes interactive exploration much easier.

#> Source: local data frame [336,776 x 16]
#>    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1  2013     1   1      517         2      830        11      UA  N14228
#> 2  2013     1   1      533         4      850        20      UA  N24211
#> 3  2013     1   1      542         2      923        33      AA  N619AA
#> 4  2013     1   1      544        -1     1004       -18      B6  N804JB
#> 5  2013     1   1      554        -6      812       -25      DL  N668DN
#> 6  2013     1   1      554        -4      740        12      UA  N39463
#> 7  2013     1   1      555        -5      913        19      B6  N516JB
#> 8  2013     1   1      557        -3      709       -14      EV  N829AS
#> 9  2013     1   1      557        -3      838        -8      B6  N593JB
#> 10 2013     1   1      558        -2      753         8      AA  N3ALAA
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...
#> Variables not shown: flight (int), origin (chr), dest (chr), air_time
#>   (dbl), distance (dbl), hour (dbl), minute (dbl)

tidyr is new package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages). The two most important properties of tidy data are:

  • Each column is a variable.
  • Each row is an observation.

Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When use tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.

To tidy messy data, you first identify the variables in your dataset, then use the tools provided by tidyr to move them into columns. tidyr provides three main functions for tidying your messy data: gather(), separate() and spread().

gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer. Other names for gather include melt (reshape2), pivot (spreadsheets) and fold (databases). Here’s an example how you might use gather() on a made-up dataset. In this experiment we’ve given three people two different drugs and recorded their heart rate:


messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
#>      name  a  b
#> 1  Wilbur 67 56
#> 2 Petunia 80 90
#> 3 Gregory 64 50

We have three variables (name, drug and heartrate), but only name is currently in a column. We use gather() to gather the a and b columns into key-value pairs of drug and heartrate:

messy %>%
  gather(drug, heartrate, a:b)
#>      name drug heartrate
#> 1  Wilbur    a        67
#> 2 Petunia    a        80
#> 3 Gregory    a        64
#> 4  Wilbur    b        56
#> 5 Petunia    b        90
#> 6 Gregory    b        50

Sometimes two variables are clumped together in one column. separate() allows you to tease them apart (extract() works similarly but uses regexp groups instead of a splitting pattern or position). Take this example from stackoverflow (modified slightly for brevity). We have some measurements of how much time people spend on their phones, measured at two locations (work and home), at two times. Each person has been randomly assigned to either treatment or control.

messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c('control', 'treatment'), each = 2)),
  work.T1 = runif(4),
  home.T1 = runif(4),
  work.T2 = runif(4),
  home.T2 = runif(4)

To tidy this data, we first use gather() to turn columns work.T1, home.T1, work.T2 and home.T2 into a key-value pair of key and time. (Only the first eight rows are shown to save space.)

tidier <- messy %>%
  gather(key, time, -id, -trt)
tidier %>% head(8)
#>   id       trt     key    time
#> 1  1 treatment work.T1 0.08514
#> 2  2   control work.T1 0.22544
#> 3  3 treatment work.T1 0.27453
#> 4  4   control work.T1 0.27231
#> 5  1 treatment home.T1 0.61583
#> 6  2   control home.T1 0.42967
#> 7  3 treatment home.T1 0.65166
#> 8  4   control home.T1 0.56774

Next we use separate() to split the key into location and time, using a regular expression to describe the character that separates them.

tidy <- tidier %>%
  separate(key, into = c("location", "time"), sep = "\\.") 
tidy %>% head(8)
#>   id       trt location time    time
#> 1  1 treatment     work   T1 0.08514
#> 2  2   control     work   T1 0.22544
#> 3  3 treatment     work   T1 0.27453
#> 4  4   control     work   T1 0.27231
#> 5  1 treatment     home   T1 0.61583
#> 6  2   control     home   T1 0.42967
#> 7  3 treatment     home   T1 0.65166
#> 8  4   control     home   T1 0.56774

The last tool, spread(), takes two columns (a key-value pair) and spreads them in to multiple columns, making “long” data wider. Spread is known by other names in other places: it’s cast in reshape2, unpivot in spreadsheets and unfold in databases. spread() is used when you have variables that form rows instead of columns. You need spread() less frequently than gather() or separate() so to learn more, check out the documentation and the demos.

Just as reshape2 did less than reshape, tidyr does less than reshape2. It’s designed specifically for tidying data, not general reshaping. In particular, existing methods only work for data frames, and tidyr never aggregates. This makes each function in tidyr simpler: each function does one thing well. For more complicated operations you can string together multiple simple tidyr and dplyr functions with %>%.

You can learn more about the underlying principles in my tidy data paper. To see more examples of data tidying, read the vignette, vignette("tidy-data"), or check out the demos, demo(package = "tidyr"). Alternatively, check out some of the great stackoverflow answers that use tidyr. Keep up-to-date with development at, report bugs at and get help with data manipulation challenges at If you ask a question specifically about tidyr on stackoverflow, please tag it with tidyr and I’ll make sure to read it.

I’m very excited to announce dplyr 0.2. It has three big features:

  • improved piping courtesy of the magrittr package

  • a vastly more useful implementation of do()

  • five new verbs: sample_n(), sample_frac(), summarise_each(), mutate_each and glimpse().

These features are described in more detail below. To learn more about the 35 new minor improvements and bug fixes, please read the full release notes.

Improved piping

dplyr now imports %>% from the magrittr package by Stefan Milton Bache. I recommend that you use this instead of %.% because it is easier to type (since you can hold down the shift key) and is more flexible. With you %>%, you can control which argument on the RHS receives the LHS with the pronoun .. This makes %>% more useful with base R functions because they don’t always take the data frame as the first argument. For example you could pipe mtcars to xtabs() with:

mtcars %>% xtabs( ~ cyl + vs, data = .)

dplyr only exports %>% from magrittr, but magrittr contains many other useful functions. To use them, load magrittr explicitly with library(magrittr). For more details, see vignette("magrittr").
%.% will be deprecated in a future version of dplyr, but it won’t happen for a while. I’ve deprecated chain() to encourage a single style of dplyr usage: please use %>% instead.


do() has been completely overhauled, and group_by() + do() is now equivalent in power to plyr::dlply(). There are two ways to use do(), either with multiple named arguments or a single unnamed arguments. If you use named arguments, each argument becomes a list-variable in the output. A list-variable can contain any arbitrary R object which makes this form of do() useful for storing models:

models %>% group_by(cyl) %>% do(model = lm(mpg ~ wt, data = .))
models %>% summarise(rsq = summary(model)$r.squared)

If you use an unnamed argument, the result should be a data frame. This allows you to apply arbitrary functions to each group.

mtcars %>% group_by(cyl) %>% do(head(., 1))

Note the use of the pronoun . to refer to the data in the current group.
do() also has an automatic progress bar. It appears if the computation takes longer than 2 seconds and estimates how long the job will take to complete.

New verbs

sample_n() randomly samples a fixed number of rows from a tbl; sample_frac() randomly samples a fixed fraction of rows. They currently only work for local data frames and data tables.
summarise_each() and mutate_each() make it easy to apply one or more functions to multiple columns in a tbl. These works for all srcs that summarise() and mutate() work for.
glimpse() makes it possible to see all the columns in a tbl, displaying as much data for each variable as can be fit on a single line.

We’re pleased to announce a new version of roxygen2. Roxygen2 allows you to write documentation comments that are automatically converted to R’s standard Rd format, saving you time and reducing duplication. This release is a major update that provides enhanced error handling and considerably safer default behaviour. Roxygen2 now adds a comment to all generated files so that you know they shouldn’t be edited by hand. This also ensures that roxygen2 will never overwrite a file that it did not create, and can automatically remove files that are no longer needed.

I’ve also written some vignettes to help you understand how to use roxygen2. Six new vignettes provide a comprehensive overview of using roxygen2 in practice. Run browseVignettes("roxygen2") to read them. In an effort to make roxygen2 easier to use and more consistent between package authors, I’ve made parsing considerably stricter, and made sure that all errors give you the line number of the associated roxygen block. Every input is now checked to make sure that it has (e.g. every { has a matching }). This should prevent frustrating errors that require careful reading of .Rd files. Similarly, @section titles and @export tags can now only span a single line as this prevents a number of common bugs.

Other features include two new tags @describeIn and @field, and you can document objects (like datasets) by documenting their name as a string. For example, to document a dataset called mydata, you can do:

#' Mydata set
#' Some data I collected about myself

To see a complete list of all bug fixes and improvements, please see the release notes for roxygen2 4.0.0 for details. Roxygen2 4.0.1 fixed a couple of minor bugs and majorly improved the upgrade process.

reshape2 1.4 is now available on CRAN. This version adds a number of useful arguments and messages, but mostly importantly it gains a C++ implementation of This new method should be much much faster (>10x) and does a better job of preserving existing attributes. For full details, see the release notes on github.

The C++ implementation of melt was contributed by Kevin Ushey, who we’re very pleased to announce has joined RStudio. You may be familiar with Kevin from his contributions to Rcpp, or his CRAN packages Kmisc and timeit.


Get every new post delivered to your Inbox.

Join 690 other followers