I’m pleased to announce tidyr 0.4.0. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has a simple convention: put variables in the columns and observations in the rows. You can learn more about it in the tidy data vignette. Install it with:

install.packages("tidyr")

There are two big features in this release: support for nested data frames, and improved tools for turning implicit missing values into explicit missing values. These are described in detail below. As well as these big features, all tidyr verbs now handle grouped_df objects created by dplyr, gather() makes a character key column (instead of a factor), and there are lots of other minor fixes and improvements. Please see the release notes for a complete list of changes.

Nested data frames

nest() and unnest() have been overhauled to support a new way of structuring your data: the nested data frame. In a grouped data frame, you have one row per observation, and additional metadata define the groups. In a nested data frame, you have one row per group, and the individual observations are stored in a column that is a list of data frames. This is a useful structure when you have lists of other objects (like models) with one element per group.

For example, take the gapminder dataset:

library(gapminder)
library(dplyr)

gapminder
#> Source: local data frame [1,704 x 6]
#> 
#>        country continent  year lifeExp      pop gdpPercap
#>         (fctr)    (fctr) (int)   (dbl)    (int)     (dbl)
#> 1  Afghanistan      Asia  1952    28.8  8425333       779
#> 2  Afghanistan      Asia  1957    30.3  9240934       821
#> 3  Afghanistan      Asia  1962    32.0 10267083       853
#> 4  Afghanistan      Asia  1967    34.0 11537966       836
#> 5  Afghanistan      Asia  1972    36.1 13079460       740
#> 6  Afghanistan      Asia  1977    38.4 14880372       786
#> 7  Afghanistan      Asia  1982    39.9 12881816       978
#> 8  Afghanistan      Asia  1987    40.8 13867957       852
#> ..         ...       ...   ...     ...      ...       ...

We can plot the trend in life expectancy for each country:

library(ggplot2)

ggplot(gapminder, aes(year, lifeExp)) +
  geom_line(aes(group = country))

(Figure: life expectancy over time, one line per country.)

But it’s hard to see what’s going on because of all the overplotting. One interesting solution is to summarise each country with a linear model. To do that most naturally, you want one data frame for each country. nest() creates this structure:

by_country <- gapminder %>% 
  group_by(continent, country) %>% 
  nest()

by_country
#> Source: local data frame [142 x 3]
#> 
#>    continent     country            data
#>       (fctr)      (fctr)          (list)
#> 1       Asia Afghanistan <tbl_df [12,4]>
#> 2     Europe     Albania <tbl_df [12,4]>
#> 3     Africa     Algeria <tbl_df [12,4]>
#> 4     Africa      Angola <tbl_df [12,4]>
#> 5   Americas   Argentina <tbl_df [12,4]>
#> 6    Oceania   Australia <tbl_df [12,4]>
#> 7     Europe     Austria <tbl_df [12,4]>
#> 8       Asia     Bahrain <tbl_df [12,4]>
#> ..       ...         ...             ...

The intriguing thing about this data frame is that it now contains one row per group, and the original data is stored in a new data column: a list of data frames. If we look at the first one, we can see that it contains the complete data for Afghanistan (sans grouping columns):

by_country$data[[1]]
#> Source: local data frame [12 x 4]
#> 
#>     year lifeExp      pop gdpPercap
#>    (int)   (dbl)    (int)     (dbl)
#> 1   1952    43.1  9279525      2449
#> 2   1957    45.7 10270856      3014
#> 3   1962    48.3 11000948      2551
#> 4   1967    51.4 12760499      3247
#> 5   1972    54.5 14760787      4183
#> 6   1977    58.0 17152804      4910
#> 7   1982    61.4 20033753      5745
#> 8   1987    65.8 23254956      5681
#> ..   ...     ...      ...       ...

This form is natural because there are other vectors where you’ll have one value per country. For example, we could fit a linear model to each country with purrr:

by_country <- by_country %>% 
  mutate(model = purrr::map(data, ~ lm(lifeExp ~ year, data = .)))
by_country
#> Source: local data frame [142 x 4]
#> 
#>    continent     country            data   model
#>       (fctr)      (fctr)          (list)  (list)
#> 1       Asia Afghanistan <tbl_df [12,4]> <S3:lm>
#> 2     Europe     Albania <tbl_df [12,4]> <S3:lm>
#> 3     Africa     Algeria <tbl_df [12,4]> <S3:lm>
#> 4     Africa      Angola <tbl_df [12,4]> <S3:lm>
#> 5   Americas   Argentina <tbl_df [12,4]> <S3:lm>
#> 6    Oceania   Australia <tbl_df [12,4]> <S3:lm>
#> 7     Europe     Austria <tbl_df [12,4]> <S3:lm>
#> 8       Asia     Bahrain <tbl_df [12,4]> <S3:lm>
#> ..       ...         ...             ...     ...

Because we used mutate(), we get an extra column containing one linear model per country.

It might seem unnatural to store a list of linear models in a data frame. However, I think it is actually a really convenient and powerful strategy because it keeps related vectors together. If you filter or arrange the data frame, there's no way for the models to get out of sync with the data they describe.
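
Any row-wise dplyr operation keeps the pieces matched up. A quick sketch, continuing the example above:

# Subsetting rows keeps each country's data and model together
by_country %>% filter(continent == "Oceania")

# Reordering rows reorders data and models in lockstep
by_country %>% arrange(desc(country))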

nest() got us into this form; unnest() gets us out. You give it the list-columns that you want unnested, and tidyr will automatically repeat the grouping columns. Unnesting data gets us back to the original form:

by_country %>% unnest(data)
#> Source: local data frame [1,704 x 6]
#> 
#>    continent     country  year lifeExp      pop gdpPercap
#>       (fctr)      (fctr) (int)   (dbl)    (int)     (dbl)
#> 1       Asia Afghanistan  1952    43.1  9279525      2449
#> 2       Asia Afghanistan  1957    45.7 10270856      3014
#> 3       Asia Afghanistan  1962    48.3 11000948      2551
#> 4       Asia Afghanistan  1967    51.4 12760499      3247
#> 5       Asia Afghanistan  1972    54.5 14760787      4183
#> 6       Asia Afghanistan  1977    58.0 17152804      4910
#> 7       Asia Afghanistan  1982    61.4 20033753      5745
#> 8       Asia Afghanistan  1987    65.8 23254956      5681
#> ..       ...         ...   ...     ...      ...       ...

When working with models, unnesting is particularly useful when you combine it with broom to extract model summaries:

# Extract model summaries:
by_country %>% unnest(model %>% purrr::map(broom::glance))
#> Source: local data frame [142 x 15]
#> 
#>    continent     country            data   model r.squared
#>       (fctr)      (fctr)          (list)  (list)     (dbl)
#> 1       Asia Afghanistan <tbl_df [12,4]> <S3:lm>     0.985
#> 2     Europe     Albania <tbl_df [12,4]> <S3:lm>     0.888
#> 3     Africa     Algeria <tbl_df [12,4]> <S3:lm>     0.967
#> 4     Africa      Angola <tbl_df [12,4]> <S3:lm>     0.034
#> 5   Americas   Argentina <tbl_df [12,4]> <S3:lm>     0.919
#> 6    Oceania   Australia <tbl_df [12,4]> <S3:lm>     0.766
#> 7     Europe     Austria <tbl_df [12,4]> <S3:lm>     0.680
#> 8       Asia     Bahrain <tbl_df [12,4]> <S3:lm>     0.493
#> ..       ...         ...             ...     ...       ...
#> Variables not shown: adj.r.squared (dbl), sigma (dbl),
#>   statistic (dbl), p.value (dbl), df (int), logLik (dbl),
#>   AIC (dbl), BIC (dbl), deviance (dbl), df.residual (int).

# Extract coefficients:
by_country %>% unnest(model %>% purrr::map(broom::tidy))
#> Source: local data frame [284 x 7]
#> 
#>    continent     country        term  estimate std.error
#>       (fctr)      (fctr)       (chr)     (dbl)     (dbl)
#> 1       Asia Afghanistan (Intercept) -1.07e+03   43.8022
#> 2       Asia Afghanistan        year  5.69e-01    0.0221
#> 3     Europe     Albania (Intercept) -3.77e+02   46.5834
#> 4     Europe     Albania        year  2.09e-01    0.0235
#> 5     Africa     Algeria (Intercept) -6.13e+02   38.8918
#> 6     Africa     Algeria        year  3.34e-01    0.0196
#> 7     Africa      Angola (Intercept) -6.55e+01  202.3625
#> 8     Africa      Angola        year  6.07e-02    0.1022
#> ..       ...         ...         ...       ...       ...
#> Variables not shown: statistic (dbl), p.value (dbl).

# Extract residuals etc:
by_country %>% unnest(model %>% purrr::map(broom::augment))
#> Source: local data frame [1,704 x 11]
#> 
#>    continent     country lifeExp  year .fitted .se.fit
#>       (fctr)      (fctr)   (dbl) (int)   (dbl)   (dbl)
#> 1       Asia Afghanistan    43.1  1952    43.4   0.718
#> 2       Asia Afghanistan    45.7  1957    46.2   0.627
#> 3       Asia Afghanistan    48.3  1962    49.1   0.544
#> 4       Asia Afghanistan    51.4  1967    51.9   0.472
#> 5       Asia Afghanistan    54.5  1972    54.8   0.416
#> 6       Asia Afghanistan    58.0  1977    57.6   0.386
#> 7       Asia Afghanistan    61.4  1982    60.5   0.386
#> 8       Asia Afghanistan    65.8  1987    63.3   0.416
#> ..       ...         ...     ...   ...     ...     ...
#> Variables not shown: .resid (dbl), .hat (dbl), .sigma
#>   (dbl), .cooksd (dbl), .std.resid (dbl).

I think storing multiple models in a data frame is a powerful and convenient technique, and I plan to write more about it in the future.

Expanding

The complete() function allows you to turn implicit missing values into explicit missing values. For example, imagine you've collected some data on a yearly basis, but unfortunately some of your data has gone missing:

resources <- frame_data(
  ~year, ~metric, ~value,
  1999, "coal", 100,
  2001, "coal", 50,
  2001, "steel", 200
)
resources
#> Source: local data frame [3 x 3]
#> 
#>    year metric value
#>   (dbl)  (chr) (dbl)
#> 1  1999   coal   100
#> 2  2001   coal    50
#> 3  2001  steel   200

Here the value for steel in 1999 is implicitly missing: it’s simply absent from the data frame. We can use complete() to make this missing row explicit, adding that combination of the variables and inserting a placeholder NA:

resources %>% complete(year, metric)
#> Source: local data frame [4 x 3]
#> 
#>    year metric value
#>   (dbl)  (chr) (dbl)
#> 1  1999   coal   100
#> 2  1999  steel    NA
#> 3  2001   coal    50
#> 4  2001  steel   200

With complete() you're not limited to just combinations that exist in the data. For example, here we know that there should be data for every year, so we can use the full_seq() function to generate every year over the range of the data:

resources %>% complete(year = full_seq(year, 1L), metric)
#> Source: local data frame [6 x 3]
#> 
#>    year metric value
#>   (dbl)  (chr) (dbl)
#> 1  1999   coal   100
#> 2  1999  steel    NA
#> 3  2000   coal    NA
#> 4  2000  steel    NA
#> 5  2001   coal    50
#> 6  2001  steel   200

In other scenarios, you may not want to generate the full set of combinations. For example, imagine you have an experiment where each person is assigned one treatment. You don’t want to expand the combinations of person and treatment, but you do want to make sure every person has all replicates. You can use nesting() to prevent the full Cartesian product from being generated:

experiment <- data_frame(
  person = rep(c("Alex", "Robert", "Sam"), c(3, 2, 1)),
  trt  = rep(c("a", "b", "a"), c(3, 2, 1)),
  rep = c(1, 2, 3, 1, 2, 1),
  measurment_1 = runif(6),
  measurment_2 = runif(6)
)
experiment
#> Source: local data frame [6 x 5]
#> 
#>   person   trt   rep measurment_1 measurment_2
#>    (chr) (chr) (dbl)        (dbl)        (dbl)
#> 1   Alex     a     1       0.7161        0.927
#> 2   Alex     a     2       0.3231        0.942
#> 3   Alex     a     3       0.4548        0.668
#> 4 Robert     b     1       0.0356        0.667
#> 5 Robert     b     2       0.5081        0.143
#> 6    Sam     a     1       0.6917        0.753

experiment %>% complete(nesting(person, trt), rep)
#> Source: local data frame [9 x 5]
#> 
#>    person   trt   rep measurment_1 measurment_2
#>     (chr) (chr) (dbl)        (dbl)        (dbl)
#> 1    Alex     a     1       0.7161        0.927
#> 2    Alex     a     2       0.3231        0.942
#> 3    Alex     a     3       0.4548        0.668
#> 4  Robert     b     1       0.0356        0.667
#> 5  Robert     b     2       0.5081        0.143
#> 6  Robert     b     3           NA           NA
#> 7     Sam     a     1       0.6917        0.753
#> 8     Sam     a     2           NA           NA
#> ..    ...   ...   ...          ...          ...

httr 1.1.0 is now available on CRAN. The httr package makes it easy to talk to web APIs from R. Learn more in the quick start vignette.

Install the latest version with:

install.packages("httr")

When writing this blog post I discovered that I forgot to announce httr 1.0.0. This was a major release marking the transition from the RCurl package to the curl package, a modern binding to libcurl written by Jeroen Ooms. This makes httr more reliable, less likely to leak memory, and prevents the diabolical “easy handle already used in multi handle” error.

httr 1.1.0 includes a couple of new features:

  • stop_for_status(), warn_for_status() and (new) message_for_status() replace the old message argument with a new task argument that optionally describes the current task. This allows API wrappers to provide more informative error messages on failure.

  • http_error() replaces url_ok() and url_successful(). http_error() more clearly conveys intent and works with urls, responses and status codes.
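
Here's a minimal sketch of how these pieces fit together (httpbin.org is just an illustrative endpoint, and the exact error wording may differ):

library(httr)

r <- GET("https://httpbin.org/status/404")

http_error(r)     # works on a response object...
http_error(404L)  # ...or directly on a status code

# The task description is woven into the error thrown on failure
stop_for_status(r, task = "download user profile")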

Otherwise, OAuth support continues to improve thanks to support from the community:

  • Nathan Goulding added RSA-SHA1 signature support to oauth1.0_token(). He also fixed bugs in oauth_service_token() and improved the caching behaviour of refresh_oauth2.0(). This makes httr easier to use with Google’s service accounts.

  • Graham Parsons added support for HTTP basic authentication to oauth2.0_token() with the use_basic_auth argument. This is now the default method used when retrieving a token.

  • Daniel Lockau implemented user_params which allows you to pass arbitrary additional parameters to the token access endpoint when acquiring or refreshing a token. This allows you to use httr with Microsoft Azure. He also wrote a demo so you can see exactly how this works.

To see the full list of changes, please read the release notes for 1.0.0 and 1.1.0.

Devtools 1.10.0 is now available on CRAN. Devtools makes package building so easy that a package can become your default way to organise code, data, documentation, and tests. You can learn more about creating your own package in R packages. Install devtools with:

install.packages("devtools")

This version is mostly a collection of bug fixes and minor improvements. For example:

  • Devtools employs a new strategy for detecting Rtools on Windows: we now only check for Rtools if you need to load_all() or build() a package with compiled code. This should make life easier for most Windows users.
  • Package installation received a lot of tweaks from the community. Devtools now makes use of the Additional_repositories field, which is useful if you’re using drat for non-CRAN packages. install_github() is now lazy and won’t reinstall if the currently installed version is the same as the one on github. Local installs now add git and github metadata, if available.
  • use_news_md() adds a (very) basic NEWS.md template. CRAN now accepts NEWS.md files, so release() warns if you’ve previously added it to .Rbuildignore.
  • use_mit_license() writes the necessary infrastructure to declare that your package is MIT licensed (in a CRAN-compliant way).
  • check(cran = TRUE) automatically adds --run-donttest as this is a de facto CRAN standard.
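
A quick sketch of the new helpers in action (run from inside a package directory; they modify files in your package):

library(devtools)

use_news_md()       # drops in a minimal NEWS.md template
use_mit_license()   # adds CRAN-compliant MIT license infrastructure
check(cran = TRUE)  # now also passes --run-donttest to R CMD check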

To see the full list of changes, please read the release notes.

I’m pleased to announce purrr 0.2.0. Purrr fills in the missing pieces in R’s functional programming tools, and is designed to make your pure (and now type-stable) functions purrr.

I’m still working out exactly what purrr should do, and how it compares to existing functions in base R, dplyr, and tidyr. One main insight that has affected much of the current version is that functions designed for programming should be type-stable. Type-stability is an idea brought to my attention by Julia. Even though functions in R and Julia can return different types of output, by and large, you should strive to make functions that always return the same type of data structure. This makes functions more robust to varying input, and makes them easier to reason about (and in Julia, to optimise). (But not every function can be type-stable – how could $ work?)

Purrr 0.2.0 adds type-stable alternatives for maps, flattens, and try(), as described below. There were a lot of other minor improvements, bug fixes, and a number of deprecations. Please see the release notes for a complete list of changes.

Type-stable maps

A map is a function that calls another function on each element of a vector. Map functions in base R are the “applys”: lapply(), sapply(), vapply(), etc. lapply() is type-stable: no matter what the inputs are, the output is always a list. sapply() is not type-stable: it can return different types of output depending on the input. The following code shows a simple (if somewhat contrived) example of sapply() returning either a vector, a matrix, or a list, depending on its inputs:

df <- data.frame(
  a = 1L,
  b = 1.5,
  y = Sys.time(),
  z = ordered(1)
)

df[1:4] %>% sapply(class) %>% str()
#> List of 4
#>  $ a: chr "integer"
#>  $ b: chr "numeric"
#>  $ y: chr [1:2] "POSIXct" "POSIXt"
#>  $ z: chr [1:2] "ordered" "factor"
df[1:2] %>% sapply(class) %>% str()
#>  Named chr [1:2] "integer" "numeric"
#>  - attr(*, "names")= chr [1:2] "a" "b"
df[3:4] %>% sapply(class) %>% str()
#>  chr [1:2, 1:2] "POSIXct" "POSIXt" "ordered" "factor"
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:2] "y" "z"

This behaviour makes sapply() appropriate for interactive use, since it usually guesses correctly and gives a useful data structure. It’s not appropriate for use in package or production code because if the input isn’t what you expect, it won’t fail, and will instead return an unexpected data structure. This typically causes an error further along the process, so you get a confusing error message and it’s difficult to isolate the root cause.

Base R has a type-stable version of sapply() called vapply(). It takes an additional argument that determines what the output will be. purrr takes a different approach. Instead of one function that does it all, purrr has multiple functions, one for each common type of output: map_lgl(), map_int(), map_dbl(), map_chr(), and map_df(). These either produce the specified type of output or throw an error. This forces you to deal with the problem right away:

df[1:4] %>% map_chr(class)
#> Error: Result 3 is not a length 1 atomic vector
df[1:4] %>% map_chr(~ paste(class(.), collapse = "/"))
#>                a                b                y                z 
#>        "integer"        "numeric" "POSIXct/POSIXt" "ordered/factor"

Other variants of map() have similar suffixes. For example, map2() allows you to iterate over two vectors in parallel:

x <- list(1, 3, 5)
y <- list(2, 4, 6)
map2(x, y, c)
#> [[1]]
#> [1] 1 2
#> 
#> [[2]]
#> [1] 3 4
#> 
#> [[3]]
#> [1] 5 6

map2() always returns a list. If you want to add together the corresponding values and store the result as a double vector, you can use map2_dbl():

map2_dbl(x, y, `+`)
#> [1]  3  7 11

Another map variant is invoke_map(), which takes a list of functions and a list of arguments. It also has type-stable suffixes:

spread <- list(sd = sd, iqr = IQR, mad = mad)
x <- rnorm(100)

invoke_map_dbl(spread, x = x)
#>        sd       iqr       mad 
#> 0.9121309 1.2515807 0.9774154

Type-stable flatten

Another situation when type-stability is important is flattening a nested list into a simpler data structure. Base R has unlist(), but it’s dangerous because it always succeeds. As an alternative, purrr provides flatten_lgl(), flatten_int(), flatten_dbl(), and flatten_chr():

x <- list(1L, 2:3, 4L)
x %>% str()
#> List of 3
#>  $ : int 1
#>  $ : int [1:2] 2 3
#>  $ : int 4
x %>% flatten() %>% str()
#> List of 4
#>  $ : int 1
#>  $ : int 2
#>  $ : int 3
#>  $ : int 4
x %>% flatten_int() %>% str()
#>  int [1:4] 1 2 3 4

Type-stable try()

Another function in base R that is not type-stable is try(). try() ensures that an expression always succeeds, either returning the original value or the error message:

str(try(log(10)))
#>  num 2.3
str(try(log("a"), silent = TRUE))
#> Class 'try-error'  atomic [1:1] Error in log("a") : non-numeric argument to mathematical function
#> 
#>   ..- attr(*, "condition")=List of 2
#>   .. ..$ message: chr "non-numeric argument to mathematical function"
#>   .. ..$ call   : language log("a")
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

safely() is a type-stable version of try(). It always returns a list of two elements, the result and the error, and one will always be NULL.

safely(log)(10)
#> $result
#> [1] 2.302585
#> 
#> $error
#> NULL
safely(log)("a")
#> $result
#> NULL
#> 
#> $error
#> <simpleError in .f(...): non-numeric argument to mathematical function>

Notice that safely() takes a function as input and returns a “safe” function, a function that never throws an error. A powerful technique is to use safely() and map() together to attempt an operation on each element of a list:

safe_log <- safely(log)
x <- list(10, "a", 5)
log_x <- x %>% map(safe_log)

str(log_x)
#> List of 3
#>  $ :List of 2
#>   ..$ result: num 2.3
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: NULL
#>   ..$ error :List of 2
#>   .. ..$ message: chr "non-numeric argument to mathematical function"
#>   .. ..$ call   : language .f(...)
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
#>  $ :List of 2
#>   ..$ result: num 1.61
#>   ..$ error : NULL

This output is slightly inconvenient because you’d rather have a list of three results and another list of three errors. You can use the new transpose() function to switch the order of the first and second levels in the hierarchy:

log_x %>% transpose() %>% str()
#> List of 2
#>  $ result:List of 3
#>   ..$ : num 2.3
#>   ..$ : NULL
#>   ..$ : num 1.61
#>  $ error :List of 3
#>   ..$ : NULL
#>   ..$ :List of 2
#>   .. ..$ message: chr "non-numeric argument to mathematical function"
#>   .. ..$ call   : language .f(...)
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
#>   ..$ : NULL

This makes it easy to extract the inputs where the original functions failed, or keep just the successful results:

results <- x %>% map(safe_log) %>% transpose()

(ok <- results$error %>% map_lgl(is_null))
#> [1]  TRUE FALSE  TRUE
(bad_inputs <- x %>% discard(ok))
#> [[1]]
#> [1] "a"
(successes <- results$result %>% keep(ok) %>% flatten_dbl())
#> [1] 2.302585 1.609438

I’m very pleased to announce the release of ggplot2 2.0.0. I know I promised that there wouldn’t be any more updates, but while working on the 2nd edition of the ggplot2 book, I just couldn’t stop myself from fixing some long-standing problems.

On the scale of ggplot2 releases, this one is huge with over one hundred fixes and improvements. This might break some of your existing code (although I’ve tried to minimise breakage as much as possible), but I hope the new features make up for any short term hassle. This blog post documents the most important changes:

  • ggplot2 now has an official extension mechanism.
  • There are a handful of new geoms, and updates to existing geoms.
  • The default appearance has been thoroughly tweaked so most plots should look better.
  • Facets have a much richer set of labelling options.
  • The documentation has been overhauled to be more helpful, and to require less jumping between multiple pages.
  • A number of older and less used features have been deprecated.

These are described in more detail below. See the release notes for a complete list of all changes.

Extensibility

Perhaps the biggest news in this release is that ggplot2 now has an official extension mechanism. This means that others can now easily create their own stats, geoms and positions, and provide them in other packages. This should allow the ggplot2 community to flourish, even as less development work happens in ggplot2 itself. See vignette("extending-ggplot2") for details.

Coupled with this change, ggplot2 no longer uses proto or reference classes. Instead, we now use ggproto, a new OO system designed specifically for ggplot2. Unlike proto and RC, ggproto supports clean cross-package inheritance, which is necessary for extensibility. Creating a new OO system isn’t usually the right solution, but I’m pretty sure it was necessary here. Read more about it in the vignette.
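
To give a flavour of what an extension looks like, here is a minimal sketch of a hypothetical stat built with ggproto, following the pattern in the vignette (StatCentroid and stat_centroid() are illustrative names, not part of ggplot2):

library(ggplot2)

StatCentroid <- ggproto("StatCentroid", Stat,
  required_aes = c("x", "y"),
  compute_group = function(data, scales) {
    # Reduce each group to a single point at its centroid
    data.frame(x = mean(data$x), y = mean(data$y))
  }
)

stat_centroid <- function(mapping = NULL, data = NULL, geom = "point",
                          position = "identity", show.legend = NA,
                          inherit.aes = TRUE, ...) {
  layer(
    stat = StatCentroid, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend,
    inherit.aes = inherit.aes, params = list(...)
  )
}

ggplot(mpg, aes(displ, hwy, colour = factor(cyl))) +
  geom_point() +
  stat_centroid(size = 5)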

New and updated geoms

  • ggplot no longer throws an error if your plot has no layers. Instead it automatically adds geom_blank():
    ggplot(mpg, aes(cyl, hwy))

  • geom_count() (a new alias for the old stat_sum()) counts the number of points at unique locations on a scatterplot, and maps the size of the point to the count:
    ggplot(mpg, aes(cty, hwy)) + 
      geom_point()
    ggplot(mpg, aes(cty, hwy)) +
      geom_count()

  • geom_curve() draws curved lines in the same way that geom_segment() draws straight lines:
    df <- expand.grid(x = 1:2, y = 1:2)
    ggplot(df, aes(x, y, xend = x + 0.5, yend = y + 0.5)) +
      geom_curve(aes(colour = "curve")) +
      geom_segment(aes(colour = "segment"))

  • geom_bar() now behaves differently from geom_histogram(). Instead of binning the data, it counts the number of unique observations at each location:
    ggplot(mpg, aes(cyl)) + 
      geom_bar()
    
    ggplot(mpg, aes(cyl)) + 
      geom_histogram(binwidth = 1)

    If you got into the (bad) habit of using geom_histogram() to create bar charts, or geom_bar() to create histograms, you’ll need to switch.

  • Layers are now much stricter about their arguments – you will get an error if you’ve supplied an argument that isn’t an aesthetic or a parameter. This breaks the handful of geoms/stats that used ... to pass additional arguments on to the underlying computation. Now geom_smooth()/stat_smooth() and geom_quantile()/stat_quantile() use method.args instead; and stat_summary(), stat_summary_hex(), and stat_summary2d() use fun.args. This is likely to cause some short-term pain but in the long-term it will make it much easier to spot spelling mistakes and other errors.
  • geom_text() has been overhauled to make labelling your data a little easier. You can use the nudge_x and nudge_y arguments to offset labels from their corresponding points. check_overlap = TRUE provides a simple way to avoid overplotting of labels: labels that would otherwise overlap are omitted.
    ggplot(mtcars, aes(wt, mpg, label = rownames(mtcars))) +
      geom_point() + 
      geom_text(nudge_y = 0.5, check_overlap = TRUE)

    (Labelling points well is still a huge pain, but at least these new features make life a little better.)

  • geom_label() works like geom_text() but draws a rounded rectangle underneath each label:
    grid <- expand.grid(
      x = seq(-pi, pi, length = 50),
      y = seq(-pi, pi, length = 50)
    ) %>% mutate(r = x ^ 2 + y ^ 2, z = cos(r ^ 2) * exp(-r / 6))
    
    ggplot(grid, aes(x, y)) +
      geom_raster(aes(fill = z)) +
      geom_label(data = data.frame(x = 0, y = 0), label = "Center") +
      theme(legend.position = "none") +
      coord_fixed()

  • aes_() replaces aes_q(), and works like the SE functions in dplyr and my other recent packages. It supports formulas, so the most concise SE version of aes(carat, price) is now aes_(~carat, ~price). You may want to use this form in packages, as it will avoid spurious R CMD check warnings about undefined global variables.
    ggplot(mpg, aes_(~displ, ~cty)) + 
      geom_point()
    # Same as
    ggplot(mpg, aes(displ, cty)) + 
      geom_point()

Appearance

I’ve made a number of small tweaks to the default appearance:

  • The default theme_grey() background colour has been changed from “grey90” to “grey92”: this makes the background a little less visually prominent.
  • Labels and titles have been tweaked for readability. Axis labels are darker, and legend titles get the same visual treatment as axis labels.
  • The default font size dropped from 12 to 11. You might be surprised that I’ve made the default text size smaller as it was already hard for many people to read. It turns out there was a bug in RStudio (fixed in 0.99.724), that shrunk the text of all grid based graphics. Once that was resolved the defaults seemed too big to my eyes.
  • scale_size() now maps values to area, not radius. Use scale_radius() if you want the old behaviour (not recommended, except perhaps for lines). Continue to use scale_size_area() if you want 0 values to have 0 area.
  • Bar and rectangle legends no longer get a diagonal line. Instead, the border has been tweaked to make it visible, and more closely match the size of line drawn on the plot.
    ggplot(mpg, aes(factor(cyl), fill = drv)) +  
      geom_bar(colour = "black", size = 1) + 
      coord_flip()

  • geom_point() now uses shape 19 instead of 16. This looks much better on the default Linux graphics device. (It’s very slightly smaller than the old point, but it shouldn’t affect any graphics significantly). You can now control the width of the outline on shapes 21-25 with the stroke parameter.
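
    A quick sketch of stroke (values chosen purely for illustration):

    ggplot(mtcars, aes(wt, mpg)) +
      geom_point(shape = 21, colour = "black", fill = "white",
        size = 5, stroke = 2)
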
  • The default legend will now allocate multiple rows (if vertical) or columns (if horizontal) in order to make a legend that is more likely to fit on the screen. You can override this with the nrow/ncol arguments to guide_legend().
    p <- ggplot(mpg, aes(displ, hwy, colour = manufacturer)) +
      geom_point() + 
      theme(legend.position = "bottom")
    p
    # Revert back to previous behaviour
    p + guides(colour = guide_legend(nrow = 1))

  • Two new themes were contributed by Jean-Olivier Irisson: theme_void() is completely empty and theme_dark() has a dark background designed to make colours pop out.

Facet labels

Thanks to the work of Lionel Henry, facet labels have received three major improvements:

  1. You can switch the position of facet labels so they’re next to the axes.
  2. facet_wrap() now supports custom labellers.
  3. You can create combined labels when facetting by multiple variables.

Switching the labels

The new switch argument allows you to switch the labels to display near the axes:

data <- transform(mtcars,
  am = factor(am, levels = 0:1, c("Automatic", "Manual")),
  gear = factor(gear, levels = 3:5, labels = c("Three", "Four", "Five"))
)

ggplot(data, aes(mpg, disp)) +
  geom_point() +
  facet_grid(am ~ gear, switch = "both")

This is especially useful when the labels directly characterise the axes. In that situation, switching the labels can make the plot clearer and more readable. You may also want to use a neutral label background by setting strip.background to element_blank():

data <- mtcars %>% 
  mutate(
    Logarithmic = log(mpg),
    Inverse = 1 / mpg,
    Cubic = mpg ^ 3,
    Original = mpg
  ) %>% 
  tidyr::gather(transformation, mpg2, Logarithmic:Original)

ggplot(data, aes(mpg2, disp)) +
  geom_point() +
  facet_wrap(~transformation, scales = "free", switch = "x") +
  theme(strip.background = element_blank())

Wrap labeller

A longstanding issue in ggplot was that facet_wrap() did not support custom labellers. Labellers are small functions that make it easy to customise the labels. You can now supply labellers to both wrap and grid facets:

ggplot(data, aes(mpg2, disp)) +
  geom_point() +
  facet_wrap(~transformation, scales = "free", labeller = "label_both")

Composite margins

Labellers now have better support for composite margins when you facet over multiple variables with +. All labellers gain a multi_line argument to control whether labels should be displayed as a single line or over multiple lines, one for each factor.

The labellers still work the same way except for label_bquote(). That labeller makes it easy to write mathematical expressions involving the values of facetted factors. Historically, label_bquote() could only specify a single expression for all margins and factors. The factor value was referred to via the backquoted placeholder .(x). Now that it supports expressions combining multiple factors, you must backquote the variable names themselves. In addition, you can provide different expressions for each margin:

my_labeller <- label_bquote(
  rows = .(am) / alpha,
  cols = .(vs) ^ .(cyl)
)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_grid(am ~ vs + cyl, labeller = my_labeller)

Documentation

I’ve given the documentation a thorough overhaul:

  • Tightly linked geoms and stats (e.g. geom_boxplot() and stat_boxplot()) are now documented in the same file so you can see all the arguments in one place. Similarly, variations on a theme (like geom_path(), geom_line(), and geom_step()) are documented together.
  • I’ve tried to reduce the use of ... so that you can see all the documentation in one place rather than having to follow links around. In some cases this has involved adding additional arguments to geoms to make it more clear what you can do.
  • Thanks to Bob Rudis, the use of qplot() in examples has been greatly reduced. This is in line with the 2nd edition of the ggplot2 book, which eliminates qplot() in favour of ggplot().

Deprecated features

  • The order aesthetic is officially deprecated. It never really worked, and was poorly documented.
  • The stat and position arguments to qplot() have been deprecated. qplot() is designed for quick plots – if you need to specify position or stat, use ggplot() instead.
  • The theme setting axis.ticks.margin has been deprecated: now use the margin property of axis.ticks.
  • stat_abline(), stat_hline() and stat_vline() have been removed: these were never suitable for use other than with their corresponding geoms and were not documented.
  • show_guide has been renamed to show.legend: this more accurately reflects what it does (controls appearance of layer in legend), and uses the same convention as other ggplot2 arguments (i.e. a . between names). (Yes, I know that’s inconsistent with function names (which use _) but it’s too late to change now.)

A number of geoms have been renamed to be more consistent. The previous names will continue to work for the foreseeable future, but you should switch to the new names for new work.

  • stat_binhex() and stat_bin2d() have been renamed to stat_bin_hex() and stat_bin_2d(). stat_summary2d() has been renamed to stat_summary_2d(), geom_density2d()/stat_density2d() has been renamed to geom_density_2d()/stat_density_2d().
  • stat_spoke() is now geom_spoke() since I realised it’s a reparameterisation of geom_segment().
  • stat_bindot() has been removed because it’s so tightly coupled to geom_dotplot(). If you happened to use stat_bindot(), just change to geom_dotplot().

All defunct functions have been removed.

I’m pleased to announce a new package for producing SVGs from R: svglite. This package is a fork of Matthieu Decorde’s RSvgDevice and wouldn’t be possible without his hard work. I’d also like to thank David Gohel who wrote the gdtools package: it solves all the hardest problems associated with making good SVGs from R.

Today, most browsers have good support for SVG and it is a great way of displaying vector graphics on the web. Unfortunately, R’s built-in svg() device is focussed on high quality rendering, not size or speed. It renders text as individual polygons: this ensures a graphic will look exactly the same regardless of what fonts you have installed, but makes output considerably larger (and harder to edit in other tools). svglite produces hand-optimised SVG that is as small as possible.

Features

svglite is a complete graphics device: that means you can give it any graphic and it will look the same as the equivalent .pdf or .png. Please file an issue if you discover a plot that doesn’t look right.

Use

In an interactive session, you use it like any other R graphics device:

svglite::svglite("myfile.svg")
plot(runif(10), runif(10))
dev.off()

If you want to use it in knitr, just set your chunk options as follows:

```{r setup, include = FALSE}
library(svglite)
knitr::opts_chunk$set(
  dev = "svglite",
  fig.ext = ".svg"
)
```

(Thanks to Bob Rudis for the tip)

There are also a few helper functions:

  • htmlSVG() makes it easy to preview the SVG in RStudio.
  • editSVG() opens the SVG file in your default SVG editor.
  • xmlSVG() returns the SVG as an xml2 object.
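
Each of these takes plotting code directly. A quick sketch (assuming an interactive session):

library(svglite)

htmlSVG(plot(cars))        # preview in the RStudio viewer
editSVG(plot(cars))        # open in your default SVG editor
svg <- xmlSVG(plot(cars))  # capture the SVG as an xml2 document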

roxygen2 5.0.0 is now available on CRAN. roxygen2 helps you document your packages by turning specially formatted inline comments into R’s standard Rd format. Learn more at http://r-pkgs.had.co.nz/man.html.

In this release:

  • Roxygen records its version in a single place: the RoxygenNote field in your DESCRIPTION. This should make it easier to see what’s changed when you upgrade roxygen2, because only files with differences will be modified. Previously every Rd file was modified to update the version number.
  • You can now easily document functions that you’ve imported from another package:
    #' @importFrom magrittr %>%
    #' @export
    magrittr::`%>%`

    All imported-and-re-exported functions will be documented in the same file (reexports.Rd), with a brief description and links to the original documentation.

  • You can more easily generate package documentation by documenting the special string "_PACKAGE":
    #' @details Details
    "_PACKAGE" 

    The title and description will be automatically filled in from the DESCRIPTION.

  • New tags @rawRd and @rawNamespace allow you to insert raw (unescaped) text in Rd and the NAMESPACE. @evalRd() is similar, but instead of literal Rd, you give it R code that produces literal Rd code when run. This should make it easier to experiment with new types of output.
  • Roxygen2 now parses the source code files in the order specified in the Collate field in DESCRIPTION. This improves the ordering of the generated documentation when using @describeIn and/or @rdname split across several .R files, as often happens when working with S4.
  • The parser has been completely rewritten in C++. This gives a nice performance boost and improves the error messages: they now give the line number of the tag, not the start of the block.
  • @family now cross-links each manual page only once, instead of linking to all aliases.

There were many other minor improvements and bug fixes; please see the release notes for a complete list. A big thanks goes to all the contributors who made this release possible.

readr 0.2.0 is now available on CRAN. readr makes it easy to read many types of tabular data, including csv, tsv and fixed width. Compared to base equivalents like read.csv(), readr is much faster and gives more convenient output: it never converts strings to factors, can parse date/times, and it doesn’t munge the column names.

This is a big release, so below I describe the new features divided into four main categories:

  • Improved support for international data.
  • Column parsing improvements.
  • File parsing improvements, including support for comments.
  • Improved writers.

There were too many minor improvements and bug fixes to describe in detail here. See the release notes for a complete list.

Internationalisation

readr now has a strategy for dealing with settings that vary across languages and localities: locales. A locale, created with locale(), includes:

  • The names of months and days, used when parsing dates.
  • The default time zone, used when parsing datetimes.
  • The character encoding, used when reading non-ASCII strings.
  • Default date format, used when guessing column types.
  • The decimal and grouping marks, used when reading numbers.

I’ll cover the most important of these parameters below. For more details, see vignette("locales").

To override the default US-centric locale, you pass a custom locale to read_csv(), read_tsv(), or read_fwf(). Rather than showing those functions here, I’ll use the parse_*() functions because they work with character vectors instead of files, but are otherwise identical.

Names of months and days

The first argument to locale() is date_names, which controls what values are used for month and day names. The easiest way to specify them is with an ISO 639 language code:

locale("ko") # Korean
#> <locale>
#> Numbers:  123,456.78
#> Formats:  %Y%.%m%.%d / %H:%M
#> Timezone: UTC
#> Encoding: UTF-8
#> <date_names>
#> Days:   일요일 (일), 월요일 (월), 화요일 (화), 수요일 (수), 목요일 (목),
#>         금요일 (금), 토요일 (토)
#> Months: 1월, 2월, 3월, 4월, 5월, 6월, 7월, 8월, 9월, 10월, 11월, 12월
#> AM/PM:  오전/오후
locale("fr") # French
#> <locale>
#> Numbers:  123,456.78
#> Formats:  %Y%.%m%.%d / %H:%M
#> Timezone: UTC
#> Encoding: UTF-8
#> <date_names>
#> Days:   dimanche (dim.), lundi (lun.), mardi (mar.), mercredi (mer.),
#>         jeudi (jeu.), vendredi (ven.), samedi (sam.)
#> Months: janvier (janv.), février (févr.), mars (mars), avril (avr.), mai
#>         (mai), juin (juin), juillet (juil.), août (août),
#>         septembre (sept.), octobre (oct.), novembre (nov.),
#>         décembre (déc.)
#> AM/PM:  AM/PM

This allows you to parse dates in other languages:

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
#> [1] "2015-01-01"
parse_date("14 oct. 1979", "%d %b %Y", locale = locale("fr"))
#> [1] "1979-10-14"

Timezones

readr assumes that times are in Coordinated Universal Time, aka UTC. UTC is the best timezone for data because it doesn’t have daylight savings. If your data isn’t already in UTC, you’ll need to supply a tz in the locale:

parse_datetime("2001-10-10 20:10")
#> [1] "2001-10-10 20:10:00 UTC"
parse_datetime("2001-10-10 20:10", 
  locale = locale(tz = "Pacific/Auckland"))
#> [1] "2001-10-10 20:10:00 NZDT"
parse_datetime("2001-10-10 20:10", 
  locale = locale(tz = "Europe/Dublin"))
#> [1] "2001-10-10 20:10:00 IST"

List all available time zones with OlsonNames(). If you’re American, note that “EST” is not Eastern Standard Time – it’s a Canadian time zone that doesn’t have DST! Instead of relying on ambiguous abbreviations, use:

  • PST/PDT = “US/Pacific”
  • CST/CDT = “US/Central”
  • MST/MDT = “US/Mountain”
  • EST/EDT = “US/Eastern”

Default formats

Locales also provide default date and time formats. The time format isn’t currently used for anything, but the date format is used when guessing column types. The default date format is %Y-%m-%d because that’s unambiguous:

str(parse_guess("2010-10-10"))
#>  Date[1:1], format: "2010-10-10"

If you’re an American, you might want to use your illogical date system:

str(parse_guess("01/02/2013"))
#>  chr "01/02/2013"
str(parse_guess("01/02/2013", 
  locale = locale(date_format = "%d/%m/%Y")))
#>  Date[1:1], format: "2013-02-01"

Character encoding

All readr functions yield strings encoded in UTF-8. This encoding is the most likely to give good results in the widest variety of settings. By default, readr assumes that your input is also in UTF-8, which may not be the case, especially when you’re working with older datasets. To parse a dataset that’s not in UTF-8, you need to supply the encoding.

The following code creates a string encoded with latin1 (aka ISO-8859-1), shows how it’s different from the string encoded as UTF-8, and how to parse it with readr:

x <- "Émigré cause célèbre déjà vu.\n"
y <- stringi::stri_conv(x, "UTF-8", "Latin1")

# These strings look like they're identical:
x
#> [1] "Émigré cause célèbre déjà vu.\n"
y
#> [1] "Émigré cause célèbre déjà vu.\n"
identical(x, y)
#> [1] TRUE

# But they have different encodings:
Encoding(x)
#> [1] "UTF-8"
Encoding(y)
#> [1] "latin1"

# That means while they print the same, their raw (binary)
# representation is actually rather different:
charToRaw(x)
#>  [1] c3 89 6d 69 67 72 c3 a9 20 63 61 75 73 65 20 63 c3 a9 6c c3 a8 62 72
#> [24] 65 20 64 c3 a9 6a c3 a0 20 76 75 2e 0a
charToRaw(y)
#>  [1] c9 6d 69 67 72 e9 20 63 61 75 73 65 20 63 e9 6c e8 62 72 65 20 64 e9
#> [24] 6a e0 20 76 75 2e 0a

# readr expects strings to be encoded as UTF-8. If they're
# not, you'll get weird characters
parse_character(x)
#> [1] "Émigré cause célèbre déjà vu.\n"
parse_character(y)
#> [1] "\xc9migr\xe9 cause c\xe9l\xe8bre d\xe9j\xe0 vu.\n"

# If you know the encoding, supply it:
parse_character(y, locale = locale(encoding = "latin1"))
#> [1] "Émigré cause célèbre déjà vu.\n"

If you don’t know what encoding the file uses, try guess_encoding(). It’s not 100% perfect (as it’s fundamentally a heuristic), but should at least get you pointed in the right direction:

guess_encoding(y)
#>     encoding confidence
#> 1 ISO-8859-2        0.4
#> 2 ISO-8859-1        0.3

# Note that the first guess produces a valid string, 
# but isn't correct:
parse_character(y, locale = locale(encoding = "ISO-8859-2"))
#> [1] "Émigré cause célčbre déjŕ vu.\n"
# But ISO-8859-1 is another name for latin1
parse_character(y, locale = locale(encoding = "ISO-8859-1"))
#> [1] "Émigré cause célèbre déjà vu.\n"

Numbers

Some countries use the decimal point, while others use the decimal comma. The decimal_mark option controls which readr uses when parsing doubles:

parse_double("1,23", locale = locale(decimal_mark = ","))
#> [1] 1.23

The grouping_mark option describes which character is used to space groups of digits. Do you write 1,000,000, 1.000.000, 1 000 000, or 1'000'000? Specifying the grouping mark allows parse_number() to parse large numbers as they’re commonly written:

parse_number("1,234.56")
#> [1] 1234.56

# readr is smart enough to guess that if you're using , for 
# decimals then you're probably using . for grouping:
parse_number("1.234,56", locale = locale(decimal_mark = ","))
#> [1] 1234.56

Column parsing improvements

One of the most useful parts of readr is the column parsers: the tools that turn character input into usefully typed data frame columns. This process is now described more fully in a new vignette: vignette("column-types").

By default, column types are guessed by looking at the data. I’ve made a number of tweaks to make it more likely that your code will load correctly the first time:

  • readr now looks at the first 1000 rows (instead of just the first 100) when guessing column types: this only takes a fraction more time, but should hopefully yield better guesses for more inputs.

  • col_date() and col_datetime() no longer recognise partial dates like 19, 1900, 1900-01. These triggered many false positives and after re-reading the ISO8601 spec, I believe they actually refer to periods of time, so should not be parsed into a specific instant.

  • col_integer() no longer recognises values starting with zeros (e.g. 0001) as these are often used as identifiers.

  • col_number() will automatically recognise numbers containing the grouping mark (see below for more details).

But you can override these defaults with the col_types argument. In this version, col_types gains some much-needed flexibility:

  • New cols() function takes care of assembling the list of column types, and with its .default argument, allows you to control the default column type:
    read_csv("x,y\n1,2", col_types = cols(.default = "c"))
    #> Source: local data frame [1 x 2]
    #> 
    #>       x     y
    #>   (chr) (chr)
    #> 1     1     2

    You can refer to parsers with their full name (e.g. col_character()) or their one letter abbreviation (e.g. c). The default value of .default is “?”: guess the type of column from the data.

  • cols_only() allows you to load only the specified columns:

    read_csv("a,b,c\n1,2,3", col_types = cols_only("b" = "?"))
    #> Source: local data frame [1 x 1]
    #> 
    #>       b
    #>   (int)
    #> 1     2

Many of the individual parsers have also been improved:

  • col_integer() and col_double() no longer silently ignore trailing characters after the number.

  • New col_number()/parse_number() replace the old col_numeric()/parse_numeric(). This parser is less flexible, so it’s less likely to silently ignore bad input. It’s designed specifically to read currencies and percentages. It only reads the first number from a string, ignoring the grouping mark defined by the locale:

    parse_number("1,234,566")
    #> [1] 1234566
    parse_number("$1,234")
    #> [1] 1234
    parse_number("27%")
    #> [1] 27
  • New parse_time() and col_time() allow you to parse times. They have an optional format argument, that uses the same components as parse_datetime(). If format is omitted, they use a flexible parser that looks for hours, then an optional colon, then minutes, then an optional colon, then optional seconds, then optional am/pm.
    parse_time(c("1:45 PM", "1345", "13:45:00"))
    #> [1] 13:45:00 13:45:00 13:45:00

    parse_time() returns the number of seconds since midnight as an integer with class “time”. readr includes a basic print method.

  • parse_date()/col_date() and parse_datetime()/col_datetime() gain two new format strings: “%+” skips one or more non-digits, and %p reads in AM/PM (and am/pm).
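
    A quick sketch of %p (the value is illustrative, %I reads 12-hour clock hours, and the output is approximate):

    parse_datetime("2015-10-10 07:30 PM", "%Y-%m-%d %I:%M %p")
    #> [1] "2015-10-10 19:30:00 UTC"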

File parsing improvements

read_csv(), read_tsv(), and read_delim() gain extra arguments that allow you to parse more files:

  • Multiple NA values can be specified by passing a character vector to na. The default has been changed to na = c("", "NA").
    read_csv("a,b\n.,NA\n1,3", na = c(".", "NA"))
    #> Source: local data frame [2 x 2]
    #> 
    #>       a     b
    #>   (int) (int)
    #> 1    NA    NA
    #> 2     1     3
  • New comment argument allows you to ignore all text after a string:
    read_csv(
    "#This is a comment
    #This is another comment
    a,b
    1,10
    2,20", comment = "#")
    #> Source: local data frame [2 x 2]
    #> 
    #>       a     b
    #>   (int) (int)
    #> 1     1    10
    #> 2     2    20
  • trim_ws argument controls whether leading and trailing whitespace is removed. It defaults to TRUE.
    read_csv("a,b\n     1,     2")
    #> Source: local data frame [1 x 2]
    #> 
    #>       a     b
    #>   (int) (int)
    #> 1     1     2
    read_csv("a,b\n     1,     2", trim_ws = FALSE)
    #> Source: local data frame [1 x 2]
    #> 
    #>        a      b
    #>    (chr)  (chr)
    #> 1      1      2

Specifying the wrong number of column names, or having rows with an unexpected number of columns, now gives a warning, rather than an error:

read_csv("a,b,c\n1,2\n1,2,3,4")
#> Warning: 2 parsing failures.
#> row col  expected    actual
#>   1  -- 3 columns 2 columns
#>   2  -- 3 columns 4 columns
#> Source: local data frame [2 x 3]
#> 
#>       a     b     c
#>   (int) (int) (int)
#> 1     1     2    NA
#> 2     1     2     3

Note that the warning message now also shows you the first five problems. I hope this will often allow you to iterate immediately, rather than having to look at the full problems().

Writers

Despite the name, readr also provides some tools for writing data frames to disk. In this version there are three output functions:

  • write_csv() and write_tsv() write comma- and tab-delimited files, and write_delim() writes with a user-specified delimiter.

  • write_rds() and read_rds() wrap around saveRDS() and readRDS(), defaulting to no compression, because you’re usually more interested in saving time (expensive) than disk space (cheap).

All these functions invisibly return their output so you can use them as part of a pipeline:

my_df %>%
  some_manipulation() %>%
  write_csv("interim-a.csv") %>%
  some_more_manipulation() %>%
  write_csv("interim-b.csv") %>%
  even_more_manipulation() %>%
  write_csv("final.csv")

You can now control how missing values are written with the na argument, and the quoting algorithm has been further refined to only add quotes when needed: when the string contains a quote, the delimiter, a newline, or the same text as the missing value.

Output for doubles now uses the same precision as R, and POSIXt vectors are saved in an ISO8601-compatible format.

For testing, you can use format_csv(), format_tsv(), and format_delim() to write csv to a string:

mtcars %>%
  head(4) %>%
  format_csv() %>%
  cat()
#> mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
#> 21,6,160,110,3.9,2.62,16.46,0,1,4,4
#> 21,6,160,110,3.9,2.875,17.02,0,1,4,4
#> 22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
#> 21.4,6,258,110,3.08,3.215,19.44,1,0,3,1

This is particularly useful for generating reprexes.

testthat 0.11.0 is now available on CRAN. Testthat makes it easy to turn your existing informal tests into formal automated tests that you can rerun quickly and easily. Learn more at http://r-pkgs.had.co.nz/tests.html. Install the latest version with:

install.packages("testthat")

In this version:

  • New expect_silent() ensures that code produces no output, messages, or warnings. expect_output(), expect_message(), expect_warning(), and expect_error() now accept NA as the second argument to indicate that there shouldn’t be any output, messages, warnings, or errors (i.e. they should be missing)
    f <- function() {
      print(1)
      message(2)
      warning(3)
    }
    expect_silent(f())
    #> Error: f() produced output, warnings, messages
    
    expect_warning(log(-1), NA)
    #> Error: log(-1) expected no warnings:
    #> *  NaNs produced
  • Praise gets more diverse thanks to Gabor Csardi’s praise package, and you now also get random encouragement if your tests don’t pass.
  • testthat no longer muffles warning messages. This was a bug in the previous version, as warning messages are usually important and should be dealt with explicitly, either by resolving the problem or explicitly capturing them with expect_warning().
  • Two new skip functions make it easier to skip tests that don’t work in certain environments: skip_on_os() skips tests on the specified operating system, and skip_on_appveyor() skips tests on Appveyor.
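
    A hypothetical test showing both skips (the expectation itself is just illustrative):

    test_that("home directory expands", {
      skip_on_os("windows")
      skip_on_appveyor()
      expect_identical(path.expand("~/R"), file.path(path.expand("~"), "R"))
    })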

There were a number of other minor improvements and bug fixes. See the release notes for a complete list.

A big thanks goes out to all the contributors who made this release happen. There’s no way I could be as productive without the fantastic community of R developers who come up with thoughtful new features, and who discover and fix my bugs!

Purrr is a new package that fills in the missing pieces in R’s functional programming tools: it’s designed to make your pure functions purrr. Like many of my recent packages, it works with magrittr to allow you to express complex operations by combining simple pieces in a standard way.

Install it with:

install.packages("purrr")

Purrr wouldn’t be possible without Lionel Henry. He wrote a lot of the package and his insightful comments helped me rapidly iterate towards a stable, useful, and understandable package.

Map functions

The core of purrr is a set of functions for manipulating vectors (atomic vectors, lists, and data frames). The goal is similar to dplyr: help you tackle the most common 90% of data manipulation challenges. But where dplyr focusses on data frames, purrr focusses on vectors. For example, the following code splits the built-in mtcars dataset up by number of cylinders (using the base split() function), fits a linear model to each piece, summarises each model, then extracts the R²:

mtcars %>%
  split(.$cyl) %>%
  map(~lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
#>     4     6     8 
#> 0.509 0.465 0.423

The first argument to all map functions is the vector to operate on. The second argument, .f, specifies what to do with each piece. It can be:

  • A function, like summary().
  • A formula, which is converted to an anonymous function, so that ~ lm(mpg ~ wt, data = .) is shorthand for function(x) lm(mpg ~ wt, data = x).
  • A string or number, which is used to extract components, i.e. "r.squared" is shorthand for function(x) x[["r.squared"]] and 1 is shorthand for function(x) x[[1]].
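
    A quick sketch of the string and number shorthands (x is an illustrative list):

    x <- list(list(a = 1, b = 2), list(a = 3, b = 4))
    x %>% map_dbl("b")
    #> [1] 2 4
    x %>% map_dbl(1)
    #> [1] 1 3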

Map functions come in a few different variations based on their inputs and output:

  • map() takes a vector (list or atomic vector) and returns a list. map_lgl(), map_int(), map_dbl(), and map_chr() take a vector and return an atomic vector. flatmap() works similarly, but allows the function to return arbitrary length vectors.
  • map_if() only applies .f to those elements of the list where .p is true. For example, the following snippet converts factors into characters:
    iris %>% map_if(is.factor, as.character) %>% str()
    #> 'data.frame':    150 obs. of  5 variables:
    #>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    #>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    #>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    #>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    #>  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...

    map_at() works similarly but instead of working with a logical vector or predicate function, it works with an integer vector of element positions.

  • map2() takes a pair of lists and iterates through them in parallel:
    map2(1:3, 2:4, c)
    #> [[1]]
    #> [1] 1 2
    #> 
    #> [[2]]
    #> [1] 2 3
    #> 
    #> [[3]]
    #> [1] 3 4
    map2(1:3, 2:4, ~ .x * (.y - 1))
    #> [[1]]
    #> [1] 1
    #> 
    #> [[2]]
    #> [1] 4
    #> 
    #> [[3]]
    #> [1] 9

    map3() does the same thing for three lists, and map_n() does it in general.

  • invoke(), invoke_lgl(), invoke_int(), invoke_dbl(), and invoke_chr() take a list of functions, and call each one with the supplied arguments:
    list(m1 = mean, m2 = median) %>%
      invoke_dbl(rcauchy(100))
    #>    m1    m2 
    #> 9.765 0.117
  • walk() takes a vector, calls a function on each piece, and returns its original input. It’s useful for functions called for their side-effects; it returns the input so you can use it in a pipe.
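
    A small sketch of walk() in the middle of a pipeline (the cat() call is just an illustrative side-effect):

    mtcars %>%
      split(.$cyl) %>%
      walk(~ cat("group with", nrow(.), "rows\n")) %>%
      map_int(nrow)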

Purrr and dplyr

I’m becoming increasingly enamoured with the list-columns in data frames. The following example combines purrr and dplyr to generate 100 random test-training splits in order to compute an unbiased estimate of prediction quality. These tools are still experimental (and currently need quite a bit of extra scaffolding), but I think the basic approach is really appealing.

library(dplyr)
random_group <- function(n, probs) {
  probs <- probs / sum(probs)
  g <- findInterval(seq(0, 1, length = n), c(0, cumsum(probs)),
    rightmost.closed = TRUE)
  names(probs)[sample(g)]
}
partition <- function(df, n, probs) {
  n %>% 
    replicate(split(df, random_group(nrow(df), probs)), FALSE) %>%
    zip_n() %>%
    as_data_frame()
}

msd <- function(x, y) sqrt(mean((x - y) ^ 2))

# Generate 100 random test-training splits
cv <- mtcars %>%
  partition(100, c(training = 0.8, test = 0.2)) %>% 
  mutate(
    # Fit the model
    model = map(training, ~ lm(mpg ~ wt, data = .)),
    # Make predictions on test data
    pred = map2(model, test, predict),
    # Calculate root-mean-squared prediction error
    diff = map2(pred, test %>% map("mpg"), msd) %>% flatten()
  )
cv
#> Source: local data frame [100 x 5]
#> 
#>                   test             training   model     pred  diff
#>                 (list)               (list)  (list)   (list) (dbl)
#> 1  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.70
#> 2  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  2.03
#> 3  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  2.29
#> 4  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  4.88
#> 5  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.20
#> 6  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  4.68
#> 7  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.39
#> 8  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.82
#> 9  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  2.56
#> 10 <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.40
#> ..                 ...                  ...     ...      ...   ...
mean(cv$diff)
#> [1] 3.22

Other functions

There are too many other pieces of purrr to describe in detail here. A few of the most useful functions are noted below:

  • zip_n() allows you to turn a list of lists “inside-out”:
    x <- list(list(a = 1, b = 2), list(a = 2, b = 1))
    x %>% str()
    #> List of 2
    #>  $ :List of 2
    #>   ..$ a: num 1
    #>   ..$ b: num 2
    #>  $ :List of 2
    #>   ..$ a: num 2
    #>   ..$ b: num 1
    
    x %>%
      zip_n() %>%
      str()
    #> List of 2
    #>  $ a:List of 2
    #>   ..$ : num 1
    #>   ..$ : num 2
    #>  $ b:List of 2
    #>   ..$ : num 2
    #>   ..$ : num 1
    
    x %>%
      zip_n(.simplify = TRUE) %>%
      str()
    #> List of 2
    #>  $ a: num [1:2] 1 2
    #>  $ b: num [1:2] 2 1
  • keep() and discard() allow you to filter a vector based on a predicate function. compact() is a helpful wrapper that throws away empty elements of a list.
    1:10 %>% keep(~. %% 2 == 0)
    #> [1]  2  4  6  8 10
    1:10 %>% discard(~. %% 2 == 0)
    #> [1] 1 3 5 7 9
    
    list(list(x = TRUE, y = 10), list(x = FALSE, y = 20)) %>%
      keep("x") %>% 
      str()
    #> List of 1
    #>  $ :List of 2
    #>   ..$ x: logi TRUE
    #>   ..$ y: num 10
    
    list(NULL, 1:3, NULL, 7) %>% 
      compact() %>%
      str()
    #> List of 2
    #>  $ : int [1:3] 1 2 3
    #>  $ : num 7
  • lift() (and friends) allow you to convert a function that takes multiple arguments into a function that takes a list. It helps you compose functions by lifting their domain from one kind of input to another. The domain can be changed to and from a list (l), a vector (v) and dots (d).
  • cross2(), cross3() and cross_n() allow you to create the Cartesian product of the inputs (with optional filtering).
  • A number of functions let you manipulate functions: negate(), compose(), partial().
  • A complete set of predicate functions provides predictable versions of the is.* functions: is_logical(), is_list(), is_bare_double(), is_scalar_character(), etc.
  • Other functions wrap existing base R equivalents into the consistent design of purrr: replicate() -> rerun(), Reduce() -> reduce(), Find() -> detect(), Position() -> detect_index().
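
    A quick sketch of the renamed equivalents (illustrative values):

    1:10 %>% reduce(`+`)
    #> [1] 55
    c(3, 8, 12) %>% detect(~ . > 5)
    #> [1] 8
    c(3, 8, 12) %>% detect_index(~ . > 5)
    #> [1] 2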

Design philosophy

The goal of purrr is not to turn R into Haskell: it does not implement currying, destructuring binds, or pattern matching. The goal is to give you similar expressiveness to a classical FP language, while allowing you to write code that looks and feels like R.

  • Anonymous functions are verbose in R, so we provide two convenient shorthands. For predicate functions, ~ .x + 1 is equivalent to function(.x) .x + 1. For chains of transformation functions, . %>% f() %>% g() is equivalent to function(.) . %>% f() %>% g().
  • R is weakly typed, so we can implement general zip_n(), rather than having to specialise on the number of arguments. That said, we still provide map2() and map3() since it’s useful to clearly separate which arguments are vectorised over. Functions are designed to be output type-stable (respecting Postel’s law) so you can rely on the output being as you expect.
  • R has named arguments, so instead of providing different functions for minor variations (e.g. detect() and detectLast()) we use named arguments.
  • Instead of currying, we use ... to pass in extra arguments. Arguments of purrr functions always start with . to avoid matching to the arguments of .f passed in via ....
  • Instead of point-free style, use the pipe, %>%, to write code that can be read from left to right.