
roxygen2 5.0.0 is now available on CRAN. roxygen2 helps you document your packages by turning specially formatted inline comments into R’s standard Rd format.
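
Install the latest version from CRAN with:

install.packages("roxygen2")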

In this release:

  • Roxygen records its version in a single place: the RoxygenNote field in your DESCRIPTION. This should make it easier to see what’s changed when you upgrade roxygen2, because only files with differences will be modified. Previously every Rd file was modified to update the version number.
  • You can now easily document functions that you’ve imported from another package:
    #' @importFrom magrittr %>%
    #' @export

    All imported-and-re-exported functions will be documented in the same file (reexports.Rd), with a brief description and links to the original documentation.

  • You can more easily generate package documentation by documenting the special string "_PACKAGE":
    #' @details Details
    "_PACKAGE"

    The title and description will be automatically filled in from the DESCRIPTION.

  • New tags @rawRd and @rawNamespace allow you to insert raw (unescaped) text in Rd and the NAMESPACE. @evalRd() is similar, but instead of literal Rd, you give it R code that produces literal Rd code when run. This should make it easier to experiment with new types of output.
  • Roxygen2 now parses the source code files in the order specified in the Collate field in DESCRIPTION. This improves the ordering of the generated documentation when using @describeIn and/or @rdname split across several .R files, as often happens when working with S4.
  • The parser has been completely rewritten in C++. This gives a nice performance boost and improves the error messages: you now get the line number of the tag, not just the start of the block.
  • @family now cross-links each manual page only once, instead of linking to all aliases.

There were many other minor improvements and bug fixes; please see the release notes for a complete list. A big thanks goes to all the contributors who made this release possible.

readr 0.2.0 is now available on CRAN. readr makes it easy to read many types of tabular data, including csv, tsv and fixed width. Compared to base equivalents like read.csv(), readr is much faster and gives more convenient output: it never converts strings to factors, can parse date/times, and it doesn’t munge the column names.
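
Install the latest version from CRAN with:

install.packages("readr")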

This is a big release, so below I describe the new features divided into four main categories:

  • Improved support for international data.
  • Column parsing improvements.
  • File parsing improvements, including support for comments.
  • Improved writers.

There were too many minor improvements and bug fixes to describe in detail here. See the release notes for a complete list.


Locales

readr now has a strategy for dealing with settings that vary across languages and localities: locales. A locale, created with locale(), includes:

  • The names of months and days, used when parsing dates.
  • The default time zone, used when parsing datetimes.
  • The character encoding, used when reading non-ASCII strings.
  • Default date format, used when guessing column types.
  • The decimal and grouping marks, used when reading numbers.

I’ll cover the most important of these parameters below. For more details, see vignette("locales").
To override the default US-centric locale, you pass a custom locale to read_csv(), read_tsv(), or read_fwf(). Rather than showing those functions here, I’ll use the parse_*() functions because they work with character vectors instead of files, but are otherwise identical.

Names of months and days

The first argument to locale() is date_names, which controls what values are used for month and day names. The easiest way to specify them is with an ISO 639 language code:

locale("ko") # Korean
#> <locale>
#> Numbers:  123,456.78
#> Formats:  %Y%.%m%.%d / %H:%M
#> Timezone: UTC
#> Encoding: UTF-8
#> <date_names>
#> Days:   일요일 (일), 월요일 (월), 화요일 (화), 수요일 (수), 목요일 (목),
#>         금요일 (금), 토요일 (토)
#> Months: 1월, 2월, 3월, 4월, 5월, 6월, 7월, 8월, 9월, 10월, 11월, 12월
#> AM/PM:  오전/오후
locale("fr") # French
#> <locale>
#> Numbers:  123,456.78
#> Formats:  %Y%.%m%.%d / %H:%M
#> Timezone: UTC
#> Encoding: UTF-8
#> <date_names>
#> Days:   dimanche (dim.), lundi (lun.), mardi (mar.), mercredi (mer.),
#>         jeudi (jeu.), vendredi (ven.), samedi (sam.)
#> Months: janvier (janv.), février (févr.), mars (mars), avril (avr.), mai
#>         (mai), juin (juin), juillet (juil.), août (août),
#>         septembre (sept.), octobre (oct.), novembre (nov.),
#>         décembre (déc.)
#> AM/PM:  AM/PM

This allows you to parse dates in other languages:

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
#> [1] "2015-01-01"
parse_date("14 oct. 1979", "%d %b %Y", locale = locale("fr"))
#> [1] "1979-10-14"


Time zones

readr assumes that times are in Coordinated Universal Time, aka UTC. UTC is the best timezone for data because it doesn’t have daylight saving time. If your data isn’t already in UTC, you’ll need to supply a tz in the locale:

parse_datetime("2001-10-10 20:10")
#> [1] "2001-10-10 20:10:00 UTC"
parse_datetime("2001-10-10 20:10", 
  locale = locale(tz = "Pacific/Auckland"))
#> [1] "2001-10-10 20:10:00 NZDT"
parse_datetime("2001-10-10 20:10", 
  locale = locale(tz = "Europe/Dublin"))
#> [1] "2001-10-10 20:10:00 IST"

List all available time zones with OlsonNames(). If you’re American, note that “EST” is not Eastern Standard Time – it’s a Canadian time zone that doesn’t have DST! Instead of relying on ambiguous abbreviations, use:

  • PST/PDT = “US/Pacific”
  • CST/CDT = “US/Central”
  • MST/MDT = “US/Mountain”
  • EST/EDT = “US/Eastern”

Default formats

Locales also provide default date and time formats. The time format isn’t currently used for anything, but the date format is used when guessing column types. The default date format is %Y-%m-%d because that’s unambiguous:

str(parse_guess("2010-10-10"))
#>  Date[1:1], format: "2010-10-10"

If you’re an American, you might want to use your illogical date system:

str(parse_guess("01/02/2013"))
#>  chr "01/02/2013"
str(parse_guess("01/02/2013", 
  locale = locale(date_format = "%d/%m/%Y")))
#>  Date[1:1], format: "2013-02-01"

Character encoding

All readr functions yield strings encoded in UTF-8. This encoding is the most likely to give good results in the widest variety of settings. By default, readr assumes that your input is also in UTF-8, which is less likely to be the case, especially when you’re working with older datasets. To parse a dataset that’s not in UTF-8, you need to supply an encoding.
The following code creates a string encoded with latin1 (aka ISO-8859-1), and shows how it’s different from the string encoded as UTF-8, and how to parse it with readr:

x <- "Émigré cause célèbre déjà vu.\n"
y <- stringi::stri_conv(x, "UTF-8", "Latin1")

# These strings look like they're identical:
x
#> [1] "Émigré cause célèbre déjà vu.\n"
y
#> [1] "Émigré cause célèbre déjà vu.\n"
identical(x, y)
#> [1] TRUE

# But they have different encodings:
Encoding(x)
#> [1] "UTF-8"
Encoding(y)
#> [1] "latin1"

# That means while they print the same, their raw (binary)
# representation is actually rather different:
charToRaw(x)
#>  [1] c3 89 6d 69 67 72 c3 a9 20 63 61 75 73 65 20 63 c3 a9 6c c3 a8 62 72
#> [24] 65 20 64 c3 a9 6a c3 a0 20 76 75 2e 0a
charToRaw(y)
#>  [1] c9 6d 69 67 72 e9 20 63 61 75 73 65 20 63 e9 6c e8 62 72 65 20 64 e9
#> [24] 6a e0 20 76 75 2e 0a

# readr expects strings to be encoded as UTF-8. If they're
# not, you'll get weird characters:
parse_character(x)
#> [1] "Émigré cause célèbre déjà vu.\n"
parse_character(y)
#> [1] "\xc9migr\xe9 cause c\xe9l\xe8bre d\xe9j\xe0 vu.\n"

# If you know the encoding, supply it:
parse_character(y, locale = locale(encoding = "latin1"))
#> [1] "Émigré cause célèbre déjà vu.\n"

If you don’t know what encoding the file uses, try guess_encoding(). It’s not 100% perfect (as it’s fundamentally a heuristic), but should at least get you pointed in the right direction:

#>     encoding confidence
#> 1 ISO-8859-2        0.4
#> 2 ISO-8859-1        0.3

# Note that the first guess produces a valid string, 
# but isn't correct:
parse_character(y, locale = locale(encoding = "ISO-8859-2"))
#> [1] "Émigré cause célčbre déjŕ vu.\n"
# But ISO-8859-1 is another name for latin1
parse_character(y, locale = locale(encoding = "ISO-8859-1"))
#> [1] "Émigré cause célèbre déjà vu.\n"


Numbers

Some countries use the decimal point, while others use the decimal comma. The decimal_mark option controls which readr uses when parsing doubles:

parse_double("1,23", locale = locale(decimal_mark = ","))
#> [1] 1.23

The grouping_mark option specifies which character is used to separate groups of digits. Do you write 1,000,000, 1.000.000, 1 000 000, or 1'000'000? Specifying the grouping mark allows parse_number() to parse large numbers as they’re commonly written:

parse_number("1,234.56")
#> [1] 1234.56

# readr is smart enough to guess that if you're using , for
# decimals then you're probably using . for grouping:
parse_number("1.234,56", locale = locale(decimal_mark = ","))
#> [1] 1234.56

Column parsing improvements

One of the most useful parts of readr is its column parsers: the tools that turn character input into usefully typed data frame columns. This process is now described more fully in a new vignette: vignette("column-types").
By default, column types are guessed by looking at the data. I’ve made a number of tweaks to make it more likely that your data will load correctly the first time:

  • readr now looks at the first 1000 rows (instead of just the first 100) when guessing column types: this only takes a fraction more time, but should hopefully yield better guesses for more inputs.

  • col_date() and col_datetime() no longer recognise partial dates like 19, 1900, 1900-01. These triggered many false positives and after re-reading the ISO8601 spec, I believe they actually refer to periods of time, so should not be parsed into a specific instant.

  • col_integer() no longer recognises values starting with zeros (e.g. 0001), as these are often used as identifiers.

  • col_number() will automatically recognise numbers containing the grouping mark (see below for more details).

But you can override these defaults with the col_types argument. In this version, col_types gains some much needed flexibility:

  • New cols() function takes care of assembling the list of column types, and with its .default argument, allows you to control the default column type:
    read_csv("x,y\n1,2", col_types = cols(.default = "c"))
    #> Source: local data frame [1 x 2]
    #>       x     y
    #>   (chr) (chr)
    #> 1     1     2

    You can refer to parsers with their full name (e.g. col_character()) or their one letter abbreviation (e.g. c). The default value of .default is “?”: guess the type of column from the data.

  • cols_only() allows you to load only the specified columns:

    read_csv("a,b,c\n1,2,3", col_types = cols_only("b" = "?"))
    #> Source: local data frame [1 x 1]
    #>       b
    #>   (int)
    #> 1     2

Many of the individual parsers have also been improved:

  • col_integer() and col_double() no longer silently ignore trailing characters after the number.

  • New col_number()/parse_number() replace the old col_numeric()/parse_numeric(). This parser is less flexible, so it’s less likely to silently ignore bad input. It’s designed specifically to read currencies and percentages. It only reads the first number from a string, ignoring the grouping mark defined by the locale:

    #> [1] 1234566
    #> [1] 1234
    #> [1] 27
  • New parse_time() and col_time() allow you to parse times. They have an optional format argument that uses the same components as parse_datetime(). If format is omitted, they use a flexible parser that looks for hours, then an optional colon, then minutes, then an optional colon, then optional seconds, then optional am/pm.
    parse_time(c("1:45 PM", "1345", "13:45:00"))
    #> [1] 13:45:00 13:45:00 13:45:00

    parse_time() returns the number of seconds since midnight as an integer with class “time”. readr includes a basic print method.

  • parse_date()/col_date() and parse_datetime()/col_datetime() gain two new format strings: “%+” skips one or more non-digits, and %p reads in AM/PM (and am/pm).
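
    For example, combined with %I (hours on a 12-hour clock), %p lets you read am/pm times (a quick sketch, not from the original post):
    parse_datetime("2015-10-10 03:14 pm", "%Y-%m-%d %I:%M %p")
    #> [1] "2015-10-10 15:14:00 UTC"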

File parsing improvements

read_csv(), read_tsv(), and read_delim() gain extra arguments that allow you to parse more files:

  • Multiple NA values can be specified by passing a character vector to na. The default has been changed to na = c("", "NA").
    read_csv("a,b\n.,NA\n1,3", na = c(".", "NA"))
    #> Source: local data frame [2 x 2]
    #>       a     b
    #>   (int) (int)
    #> 1    NA    NA
    #> 2     1     3
  • New comment argument allows you to ignore all text after a string:
    "#This is a comment
    #This is another comment
    2,20", comment = "#")
    #> Source: local data frame [2 x 2]
    #>       a     b
    #>   (int) (int)
    #> 1     1    10
    #> 2     2    20
  • trim_ws argument controls whether leading and trailing whitespace is removed. It defaults to TRUE.
    read_csv("a,b\n     1,     2")
    #> Source: local data frame [1 x 2]
    #>       a     b
    #>   (int) (int)
    #> 1     1     2
    read_csv("a,b\n     1,     2", trim_ws = FALSE)
    #> Source: local data frame [1 x 2]
    #>        a      b
    #>    (chr)  (chr)
    #> 1      1      2

Specifying the wrong number of column names, or having rows with an unexpected number of columns, now gives a warning, rather than an error:

read_csv("a,b,c\n1,2\n1,2,3,4")
#> Warning: 2 parsing failures.
#> row col  expected    actual
#>   1  -- 3 columns 2 columns
#>   2  -- 3 columns 4 columns
#> Source: local data frame [2 x 3]
#>       a     b     c
#>   (int) (int) (int)
#> 1     1     2    NA
#> 2     1     2     3

Note that the warning message now also shows you the first five problems. I hope this will often allow you to iterate immediately, rather than having to look at the full problems().


Improved writers

Despite the name, readr also provides some tools for writing data frames to disk. In this version there are three output functions:

  • write_csv() and write_tsv() write comma and tab delimited files, and write_delim() writes files with a user-specified delimiter.

  • write_rds() and read_rds() wrap around saveRDS() and readRDS(), defaulting to no compression, because you’re usually more interested in saving time (expensive) than disk space (cheap).

All these functions invisibly return their output so you can use them as part of a pipeline:

my_df %>%
  some_manipulation() %>%
  write_csv("interim-a.csv") %>%
  some_more_manipulation() %>%
  write_csv("interim-b.csv") %>%
  even_more_manipulation() %>%
  write_csv("final.csv")

You can now control how missing values are written with the na argument, and the quoting algorithm has been further refined to only add quotes when needed: when the string contains a quote, the delimiter, a new line, or the same text as the missing value.
Output for doubles now uses the same precision as R, and POSIXt vectors are saved in an ISO 8601 compatible format.
For testing, you can use format_csv(), format_tsv(), and format_delim() to write csv to a string:

mtcars %>%
  head(4) %>%
  format_csv() %>%
  cat()
#> mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
#> 21,6,160,110,3.9,2.62,16.46,0,1,4,4
#> 21,6,160,110,3.9,2.875,17.02,0,1,4,4
#> 22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
#> 21.4,6,258,110,3.08,3.215,19.44,1,0,3,1

This is particularly useful for generating reprexes.

testthat 0.11.0 is now available on CRAN. Testthat makes it easy to turn your existing informal tests into formal automated tests that you can rerun quickly and easily. Install the latest version with:
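
install.packages("testthat")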


In this version:

  • New expect_silent() ensures that code produces no output, messages, or warnings. expect_output(), expect_message(), expect_warning(), and expect_error() now accept NA as the second argument to indicate that there shouldn’t be any output, messages, warnings, or errors (i.e. they should be missing)
    f <- function() {
      print("Hi!")
      message("Hello")
      warning("How are you?")
    }
    expect_silent(f())
    #> Error: f() produced output, warnings, messages
    expect_warning(log(-1), NA)
    #> Error: log(-1) expected no warnings:
    #> *  NaNs produced
  • Praise gets more diverse thanks to Gabor Csardi’s praise package, and you now also get random encouragement if your tests don’t pass.
  • testthat no longer muffles warning messages. This was a bug in the previous version, as warning messages are usually important and should be dealt with explicitly, either by resolving the problem or explicitly capturing them with expect_warning().
  • Two new skip functions make it easier to skip tests that don’t work in certain environments: skip_on_os() skips tests on the specified operating system, and skip_on_appveyor() skips tests on Appveyor.
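
    For example, a small sketch of skipping a test on Windows:
    test_that("paths are POSIX-style", {
      skip_on_os("windows")
      expect_equal(.Platform$OS.type, "unix")
    })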

There were a number of other minor improvements and bug fixes. See the release notes for a complete list.

A big thanks goes out to all the contributors who made this release happen. There’s no way I could be as productive without the fantastic community of R developers who come up with thoughtful new features, and who discover and fix my bugs!

Purrr is a new package that fills in the missing pieces in R’s functional programming tools: it’s designed to make your pure functions purrr. Like many of my recent packages, it works with magrittr to allow you to express complex operations by combining simple pieces in a standard way.

Install it with:
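
install.packages("purrr")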


Purrr wouldn’t be possible without Lionel Henry. He wrote a lot of the package and his insightful comments helped me rapidly iterate towards a stable, useful, and understandable package.

Map functions

The core of purrr is a set of functions for manipulating vectors (atomic vectors, lists, and data frames). The goal is similar to dplyr: help you tackle the most common 90% of data manipulation challenges. But where dplyr focusses on data frames, purrr focusses on vectors. For example, the following code splits the built-in mtcars dataset up by number of cylinders (using the base split() function), fits a linear model to each piece, summarises each model, then extracts the R²:

mtcars %>%
  split(.$cyl) %>%
  map(~lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
#>     4     6     8 
#> 0.509 0.465 0.423

The first argument to all map functions is the vector to operate on. The second argument, .f, specifies what to do with each piece. It can be:

  • A function, like summary().
  • A formula, which is converted to an anonymous function, so that ~ lm(mpg ~ wt, data = .) is shorthand for function(x) lm(mpg ~ wt, data = x).
  • A string or number, which is used to extract components, i.e. "r.squared" is shorthand for function(x) x[["r.squared"]] and 1 is shorthand for function(x) x[[1]].

Map functions come in a few different variations based on their inputs and output:

  • map() takes a vector (list or atomic vector) and returns a list. map_lgl(), map_int(), map_dbl(), and map_chr() take a vector and return an atomic vector. flatmap() works similarly, but allows the function to return arbitrary length vectors.
  • map_if() only applies .f to those elements of the list where .p is true. For example, the following snippet converts factors into characters:
    iris %>% map_if(is.factor, as.character) %>% str()
    #> 'data.frame':    150 obs. of  5 variables:
    #>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    #>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    #>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    #>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    #>  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...

    map_at() works similarly, but instead of working with a logical vector or predicate function, it works with an integer vector of element positions.

  • map2() takes a pair of lists and iterates through them in parallel:
    map2(1:3, 2:4, c)
    #> [[1]]
    #> [1] 1 2
    #> [[2]]
    #> [1] 2 3
    #> [[3]]
    #> [1] 3 4
    map2(1:3, 2:4, ~ .x * (.y - 1))
    #> [[1]]
    #> [1] 1
    #> [[2]]
    #> [1] 4
    #> [[3]]
    #> [1] 9

    map3() does the same thing for three lists, and map_n() does it in general.

  • invoke(), invoke_lgl(), invoke_int(), invoke_dbl(), and invoke_chr() take a list of functions, and call each one with the supplied arguments:
    list(m1 = mean, m2 = median) %>%
      invoke_dbl(rcauchy(100))
    #>    m1    m2 
    #> 9.765 0.117
  • walk() takes a vector, calls a function on each piece, and returns its original input. It’s useful for functions called for their side-effects; it returns the input so you can use it in a pipe.
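    For example, a small sketch:
    1:3 %>% walk(print) %>% sum()
    #> [1] 1
    #> [1] 2
    #> [1] 3
    #> [1] 6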

Purrr and dplyr

I’m becoming increasingly enamoured with the list-columns in data frames. The following example combines purrr and dplyr to generate 100 random test-training splits in order to compute an unbiased estimate of prediction quality. These tools are still experimental (and currently need quite a bit of extra scaffolding), but I think the basic approach is really appealing.

random_group <- function(n, probs) {
  probs <- probs / sum(probs)
  g <- findInterval(seq(0, 1, length = n), c(0, cumsum(probs)),
    rightmost.closed = TRUE)
  names(probs)[sample(g)]
}
partition <- function(df, n, probs) {
  n %>% 
    replicate(split(df, random_group(nrow(df), probs)), FALSE) %>%
    zip_n() %>%
    dplyr::as_data_frame()
}

msd <- function(x, y) sqrt(mean((x - y) ^ 2))

# Generate 100 random test-training splits
cv <- mtcars %>%
  partition(100, c(training = 0.8, test = 0.2)) %>% 
  dplyr::mutate(
    # Fit the model
    model = map(training, ~ lm(mpg ~ wt, data = .)),
    # Make predictions on test data
    pred = map2(model, test, predict),
    # Calculate mean squared difference
    diff = map2(pred, test %>% map("mpg"), msd) %>% flatten()
  )
cv
#> Source: local data frame [100 x 5]
#>                   test             training   model     pred  diff
#>                 (list)               (list)  (list)   (list) (dbl)
#> 1  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.70
#> 2  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  2.03
#> 3  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  2.29
#> 4  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  4.88
#> 5  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.20
#> 6  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  4.68
#> 7  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.39
#> 8  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.82
#> 9  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  2.56
#> 10 <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.40
#> ..                 ...                  ...     ...      ...   ...
mean(cv$diff)
#> [1] 3.22

Other functions

There are too many other pieces of purrr to describe in detail here. A few of the most useful functions are noted below:

  • zip_n() allows you to turn a list of lists “inside-out”:
    x <- list(list(a = 1, b = 2), list(a = 2, b = 1))
    x %>% str()
    #> List of 2
    #>  $ :List of 2
    #>   ..$ a: num 1
    #>   ..$ b: num 2
    #>  $ :List of 2
    #>   ..$ a: num 2
    #>   ..$ b: num 1
    x %>%
      zip_n() %>%
      str()
    #> List of 2
    #>  $ a:List of 2
    #>   ..$ : num 1
    #>   ..$ : num 2
    #>  $ b:List of 2
    #>   ..$ : num 2
    #>   ..$ : num 1
    x %>%
      zip_n(.simplify = TRUE) %>%
      str()
    #> List of 2
    #>  $ a: num [1:2] 1 2
    #>  $ b: num [1:2] 2 1
  • keep() and discard() allow you to filter a vector based on a predicate function. compact() is a helpful wrapper that throws away empty elements of a list.
    1:10 %>% keep(~. %% 2 == 0)
    #> [1]  2  4  6  8 10
    1:10 %>% discard(~. %% 2 == 0)
    #> [1] 1 3 5 7 9
    list(list(x = TRUE, y = 10), list(x = FALSE, y = 20)) %>%
      keep("x") %>% 
    #> List of 1
    #>  $ :List of 2
    #>   ..$ x: logi TRUE
    #>   ..$ y: num 10
    list(NULL, 1:3, NULL, 7) %>% 
      compact() %>%
      str()
    #> List of 2
    #>  $ : int [1:3] 1 2 3
    #>  $ : num 7
  • lift() (and friends) allow you to convert a function that takes multiple arguments into a function that takes a list. It helps you compose functions by lifting their domain from one kind of input to another. The domain can be changed to and from a list (l), a vector (v) and dots (d).
  • cross2(), cross3() and cross_n() allow you to create the Cartesian product of the inputs (with optional filtering).
  • A number of functions let you manipulate functions: negate(), compose(), partial().
  • A complete set of predicate functions provides predictable versions of the is.* functions: is_logical(), is_list(), is_bare_double(), is_scalar_character(), etc.
  • Other functions wrap existing base R functions into the consistent design of purrr: replicate() -> rerun(), Reduce() -> reduce(), Find() -> detect(), Position() -> detect_index().

Design philosophy

The goal of purrr is not to try and turn R into Haskell: it does not implement currying, or destructuring binds, or pattern matching. The goal is to give you similar expressiveness to a classical FP language, while allowing you to write code that looks and feels like R.

  • Anonymous functions are verbose in R, so we provide two convenient shorthands. For predicate functions, ~ .x + 1 is equivalent to function(.x) .x + 1. For chains of transformation functions, . %>% f() %>% g() is equivalent to function(.) . %>% f() %>% g().
  • R is weakly typed, so we can implement general zip_n(), rather than having to specialise on the number of arguments. That said, we still provide map2() and map3() since it’s useful to clearly separate which arguments are vectorised over. Functions are designed to be output type-stable (respecting Postel’s law) so you can rely on the output being as you expect.
  • R has named arguments, so instead of providing different functions for minor variations (e.g. detect() and detectLast()) we use named arguments.
  • Instead of currying, we use ... to pass in extra arguments. Arguments of purrr functions always start with . to avoid matching to the arguments of .f passed in via ....
  • Instead of point-free style, we use the pipe, %>%, to write code that can be read from left to right.

I’m pleased to announce rvest 0.3.0 is now available on CRAN. Rvest makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with pipes so that you can express complex operations by composing simple pieces. Install it with:
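
install.packages("rvest")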


What’s new

The biggest change in this version is that rvest now uses the xml2 package instead of XML. This makes rvest much simpler, eliminates memory leaks, and should improve performance a little.

A number of functions have changed names to improve consistency with other packages: most importantly html() is now read_html(), and html_tag() is now html_name(). The old versions still work, but are deprecated and will be removed in rvest 0.4.0.

html_node() now throws an error if there are no matches, and a warning if there’s more than one match. I think this should make it more likely to fail clearly when the structure of the page changes. If you don’t want this behaviour, use html_nodes().

There were a number of other bug fixes and minor improvements as described in the release notes.

Devtools 1.9.1 is now available on CRAN. Devtools makes package building so easy that a package can become your default way to organise code, data, and documentation. You can learn more about developing packages in R packages, my book about package development that’s freely available online.

Get the latest version of devtools with:
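
install.packages("devtools")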


There are three major improvements that I contributed:

  • check() is now much closer to what CRAN does – it passes on --as-cran to R CMD check, using an env var to turn off the incoming CRAN checks. These are turned off because they’re slow (they have to retrieve data from CRAN), and are not necessary except just prior to release (so release() turns them back on again).
  • install_deps() now automatically upgrades out of date dependencies. This is typically what you want when you’re working on a development version of a package: otherwise you can get an unpleasant surprise when you go to submit your package to CRAN and discover it doesn’t work with the latest version of its dependencies. To suppress this behaviour, set upgrade_dependencies = FALSE.
  • revdep_check() received a number of tweaks that I’ve found helpful when preparing my packages for CRAN:
    • Suggested dependencies of the revdeps are installed by default.
    • The NOT_CRAN env var is set to false so tests that are skipped on CRAN are also skipped for you.
    • The RGL_USE_NULL env var is set to true to stop rgl windows from popping up during testing.
    • All revdep sources are downloaded at the start of the checks. This makes life a bit easier if you’re on a flaky internet connection.

But like many recent devtools releases, most of the coolest new features have been contributed by the community:

  • Jim Hester implemented experimental remote dependencies for install(). You can now tell devtools where to find dependencies with a Remotes field in your DESCRIPTION; there’s a small example after this list.

    The default allows you to refer to GitHub repos, but you can easily add deps from any of the other sources that devtools supports: see vignette("dependencies") for more details.

    Support for installing development dependencies is still experimental so we appreciate any feedback.

  • Jenny Bryan considerably improved the existing GitHub integration. use_github() now pushes to the newly created GitHub repo, and sets a remote tracking branch. It also populates the URL and BugReports fields of your DESCRIPTION.
  • Kirill Müller contributed many bug fixes, minor improvements and test cases.
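
For example, a Remotes field in a package DESCRIPTION might look like this (the repo shown here is just an illustration):

Remotes: hadley/testthat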

See the release notes for complete bug fixes and other minor changes.

tidyr 0.3.0 is now available on CRAN. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has variables in columns and observations in rows, and is described in more detail in the tidy data vignette. Install tidyr with:
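
install.packages("tidyr")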


tidyr contains four new verbs: fill(), replace_na(), complete(), and unnest(), plus lots of smaller bug fixes and improvements.


fill()

The new fill() function fills in missing observations from the last non-missing value. This is useful if you’re getting data from Excel users who haven’t read Karl Broman’s excellent data organisation guide and leave cells blank to indicate that the previous value should be carried forward:

df <- dplyr::data_frame(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
#> Source: local data frame [4 x 2]
#>    year   trt
#>   (dbl) (chr)
#> 1  2015     A
#> 2    NA    NA
#> 3    NA     B
#> 4    NA    NA
df %>% fill(year, trt)
#> Source: local data frame [4 x 2]
#>    year   trt
#>   (dbl) (chr)
#> 1  2015     A
#> 2  2015     A
#> 3  2015     B
#> 4  2015     B

replace_na() and complete()

replace_na() makes it easy to replace missing values on a column-by-column basis:

df <- dplyr::data_frame(
  x = c(1, 2, NA), 
  y = c("a", NA, "b")
df %>% replace_na(list(x = 0, y = "unknown"))
#> Source: local data frame [3 x 2]
#>       x       y
#>   (dbl)   (chr)
#> 1     1       a
#> 2     2 unknown
#> 3     0       b

It is particularly useful when called from complete(), which makes it easy to fill in missing combinations of your data:

df <- dplyr::data_frame(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)
df
#> Source: local data frame [3 x 5]
#>   group item_id item_name value1 value2
#>   (dbl)   (dbl)     (chr)  (int)  (int)
#> 1     1       1         a      1      4
#> 2     2       2         b      2      5
#> 3     1       2         b      3      6

df %>% complete(group, c(item_id, item_name))
#> Source: local data frame [4 x 5]
#>   group item_id item_name value1 value2
#>   (dbl)   (dbl)     (chr)  (int)  (int)
#> 1     1       1         a      1      4
#> 2     1       2         b      3      6
#> 3     2       1         a     NA     NA
#> 4     2       2         b      2      5

df %>% complete(
  group, c(item_id, item_name), 
  fill = list(value1 = 0)
)
#> Source: local data frame [4 x 5]
#>   group item_id item_name value1 value2
#>   (dbl)   (dbl)     (chr)  (dbl)  (int)
#> 1     1       1         a      1      4
#> 2     1       2         b      3      6
#> 3     2       1         a      0     NA
#> 4     2       2         b      2      5

Note how I’ve grouped item_id and item_name together with c(item_id, item_name). This treats them as nested, not crossed, so we don’t get every combination of group, item_id and item_name, as we would otherwise:

df %>% complete(group, item_id, item_name)
#> Source: local data frame [8 x 5]
#>    group item_id item_name value1 value2
#>    (dbl)   (dbl)     (chr)  (int)  (int)
#> 1      1       1         a      1      4
#> 2      1       1         b     NA     NA
#> 3      1       2         a     NA     NA
#> 4      1       2         b      3      6
#> 5      2       1         a     NA     NA
#> ..   ...     ...       ...    ...    ...

Read more about this behaviour in ?expand.


unnest()

unnest() is out of beta, and is now ready to help you unnest columns that are lists of vectors. This can occur when you have hierarchical data that’s been collapsed into a string:

df <- dplyr::data_frame(x = 1:2, y = c("1,2", "3,4,5,6,7"))
df
#> Source: local data frame [2 x 2]
#>       x         y
#>   (int)     (chr)
#> 1     1       1,2
#> 2     2 3,4,5,6,7

df %>% 
  dplyr::mutate(y = strsplit(y, ","))
#> Source: local data frame [2 x 2]
#>       x        y
#>   (int)   (list)
#> 1     1 <chr[2]>
#> 2     2 <chr[5]>

df %>% 
  dplyr::mutate(y = strsplit(y, ",")) %>%
  unnest(y)
#> Source: local data frame [7 x 2]
#>        x     y
#>    (int) (chr)
#> 1      1     1
#> 2      1     2
#> 3      2     3
#> 4      2     4
#> 5      2     5
#> ..   ...   ...

unnest() also works on columns that are lists of data frames. This is admittedly esoteric, but I think it might be useful when you’re generating pairs of test-training splits. I’m still thinking about this idea, so look for more examples and better support across my packages in the future.

Minor improvements

There were 13 minor improvements and bug fixes. The most important are listed below. To read about the rest, please consult the release notes.

  • %>% is re-exported from magrittr: this means that you no longer need to load dplyr or magrittr if you want to use the pipe.
  • extract() and separate() now return multiple NA columns for NA inputs:
    df <- dplyr::data_frame(x = c("a-b", NA, "c-d"))
    df %>% separate(x, c("x", "y"), "-")
    #> Source: local data frame [3 x 2]
    #>       x     y
    #>   (chr) (chr)
    #> 1     a     b
    #> 2    NA    NA
    #> 3     c     d
  • separate() gains finer control if there are too few matches:
    df <- dplyr::data_frame(x = c("a-b-c", "a-c"))
    df %>% separate(x, c("x", "y", "z"), "-")
    #> Warning: Too few values at 1 locations: 2
    #> Source: local data frame [2 x 3]
    #>       x     y     z
    #>   (chr) (chr) (chr)
    #> 1     a     b     c
    #> 2     a     c    NA
    df %>% separate(x, c("x", "y", "z"), "-", fill = "right")
    #> Source: local data frame [2 x 3]
    #>       x     y     z
    #>   (chr) (chr) (chr)
    #> 1     a     b     c
    #> 2     a     c    NA
    df %>% separate(x, c("x", "y", "z"), "-", fill = "left")
    #> Source: local data frame [2 x 3]
    #>       x     y     z
    #>   (chr) (chr) (chr)
    #> 1     a     b     c
    #> 2    NA     a     c

    This complements the support for too many matches:

    df <- dplyr::data_frame(x = c("a-b-c", "a-c"))
    df %>% separate(x, c("x", "y"), "-")
    #> Warning: Too many values at 1 locations: 1
    #> Source: local data frame [2 x 2]
    #>       x     y
    #>   (chr) (chr)
    #> 1     a     b
    #> 2     a     c
    df %>% separate(x, c("x", "y"), "-", extra = "merge")
    #> Source: local data frame [2 x 2]
    #>       x     y
    #>   (chr) (chr)
    #> 1     a   b-c
    #> 2     a     c
    df %>% separate(x, c("x", "y"), "-", extra = "drop")
    #> Source: local data frame [2 x 2]
    #>       x     y
    #>   (chr) (chr)
    #> 1     a     b
    #> 2     a     c
  • tidyr no longer depends on reshape2. This should fix issues when you load reshape and tidyr at the same time. It also frees tidyr to evolve in a different direction to the more general reshape2.

dplyr 0.4.3 includes over 30 minor improvements and bug fixes, which are described in detail in the release notes. Here I wanted to draw your attention to five small, but important, changes:

  • mutate() no longer randomly crashes! (Sorry it took us so long to fix this – I know it’s been causing a lot of pain.)
  • dplyr now has much better support for non-ASCII column names. It’s probably not perfect, but should be a lot better than previous versions.
  • When printing a tbl_df, you now see the types of all columns, not just those that don’t fit on the screen:
    data_frame(x = 1:3, y = letters[x], z = factor(y))
    #> Source: local data frame [3 x 3]
    #>       x     y      z
    #>   (int) (chr) (fctr)
    #> 1     1     a      a
    #> 2     2     b      b
    #> 3     3     c      c
  • bind_rows() gains a .id argument. When supplied, it creates a new column that gives the name of each data frame:
    a <- data_frame(x = 1, y = "a")
    b <- data_frame(x = 2, y = "c")
    bind_rows(a = a, b = b)
    #> Source: local data frame [2 x 2]
    #>       x     y
    #>   (dbl) (chr)
    #> 1     1     a
    #> 2     2     c
    bind_rows(a = a, b = b, .id = "source")
    #> Source: local data frame [2 x 3]
    #>   source     x     y
    #>    (chr) (dbl) (chr)
    #> 1      a     1     a
    #> 2      b     2     c
    # Or equivalently
    bind_rows(list(a = a, b = b), .id = "source")
    #> Source: local data frame [2 x 3]
    #>   source     x     y
    #>    (chr) (dbl) (chr)
    #> 1      a     1     a
    #> 2      b     2     c
  • dplyr is now more forgiving of unknown attributes. All functions should now copy column attributes from the input to the output, instead of complaining. Additionally arrange(), filter(), slice(), and summarise() preserve attributes of the data frame itself.

testthat 0.10.0 is now available on CRAN. Testthat makes it easy to turn the informal testing that you’re already doing into formal automated tests. Install the latest version with:
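
install.packages("testthat")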


There are four big changes in this release:

  • test_check() uses a new reporter specifically designed for R CMD check. It displays a summary at the end of the tests, designed to be under 13 lines long, so that test failures in the R CMD check output are as useful as possible.
  • New skip_if_not_installed() skips tests if a package isn’t installed: this is useful if you want tests to skip if a suggested package isn’t installed.
  • The expect_that(a, equals(b)) style of testing has been soft-deprecated in favour of expect_equal(a, b). It will keep working, but it’s no longer demonstrated in the documentation, and new expectations will only be available in expect_equal(a, b) style.
  • compare() is now documented and exported: compare() is used to display test failures for expect_equal(), and is designed to help you spot exactly where the failure occurred. It currently has methods for character and numeric vectors.

There were a number of other minor improvements and bug fixes. See the release notes for a complete list.

Devtools 1.8 is now available on CRAN. Devtools makes it so easy to build a package that it becomes your default way to organise code, data and documentation. You can learn more about developing packages in R packages, my freely available book about package development.

Get the latest version of devtools with:
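
install.packages("devtools")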


There are three main improvements:

  • More helpers to get you up and running with package development as quickly as possible.

  • Better tools for package installation (including checking that all dependencies are up to date).

  • Improved reverse dependency checking for CRAN packages.

There were many other minor improvements and bug fixes. See the release notes for a complete list of changes. The last release announcement was for devtools 1.6 since there weren’t many big changes in devtools 1.7. I’ve included the most important points in this announcement labelled with [1.7].

Helpers

The number of functions designed to get you up and going with package development continues to grow. This version sees the addition of:

  • dr_devtools(), which runs some common diagnostics: are you using the latest version of R and devtools? Similarly, dr_github() checks for common git/github configuration problems.

  • lint() runs lintr::lint_package() to check the style of package code [1.7].

  • use_code_of_conduct() adds a contributor code of conduct (the Contributor Covenant).

  • use_cran_badge() adds a CRAN status badge that you can copy into a README file. Green indicates that the package is on CRAN; packages not yet submitted or accepted to CRAN get a red badge.

  • use_cran_comments() creates a cran-comments.md template and adds it to .Rbuildignore to help with CRAN submissions. [1.7]

  • use_coveralls() allows you to easily add test coverage with coveralls.

  • use_git() sets up a package to use git, initialising the repo and checking the existing files.

  • use_test() adds a new test file in tests/testthat.

  • use_readme_rmd() sets up a template to generate a README.md from a README.Rmd with knitr. [1.7]

Package installation and info

When developing packages, it’s common to run into problems because you’ve updated a package, but you’ve forgotten to update its dependencies (install.packages() doesn’t do this automatically). The new package_deps() solves this problem by finding all recursive dependencies of a package and determining if they’re out of date:

# Find out which dependencies are out of date
# (using devtools itself as an example package)
deps <- package_deps("devtools")
deps

# Update them
update(deps)

This code is used in install_deps() and revdep_check() – devtools is now aggressive about updating packages, which should avoid potential problems in CRAN submissions.
New update_packages() uses these tools to install a package (and its dependencies) only if they’re not already installed and current.

Reverse dependency checking

Devtools 1.7 included considerable improvements to reverse dependency checking. This sort of checking is important if your package gets popular, and is used by other CRAN packages. Before submitting updates to CRAN, you need to make sure that you have not broken the CRAN packages that use your package. Read more about it in the R packages book. To get started, run use_revdep(), then run the code in revdep/check.R.

