You are currently browsing the monthly archive for September 2015.

Purrr is a new package that fills in the missing pieces in R’s functional programming tools: it’s designed to make your pure functions purrr. Like many of my recent packages, it works with magrittr to allow you to express complex operations by combining simple pieces in a standard way.

Install it with:

install.packages("purrr")

Purrr wouldn’t be possible without Lionel Henry. He wrote a lot of the package and his insightful comments helped me rapidly iterate towards a stable, useful, and understandable package.

Map functions

The core of purrr is a set of functions for manipulating vectors (atomic vectors, lists, and data frames). The goal is similar to dplyr: help you tackle the most common 90% of data manipulation challenges. But where dplyr focusses on data frames, purrr focusses on vectors. For example, the following code splits the built-in mtcars dataset up by number of cylinders (using the base split() function), fits a linear model to each piece, summarises each model, then extracts the \(R^2\):

mtcars %>%
  split(.$cyl) %>%
  map(~lm(mpg ~ wt, data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")
#>     4     6     8 
#> 0.509 0.465 0.423

The first argument to all map functions is the vector to operate on. The second argument, .f, specifies what to do with each piece. It can be:

  • A function, like summary().
  • A formula, which is converted to an anonymous function, so that ~ lm(mpg ~ wt, data = .) is shorthand for function(x) lm(mpg ~ wt, data = x).
  • A string or number, which is used to extract components, i.e. "r.squared" is shorthand for function(x) x[["r.squared"]] and 1 is shorthand for function(x) x[[1]].
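For instance, here's a minimal sketch of the extractor shorthands (the list x is made up for illustration):

```r
library(purrr)

x <- list(list(a = 1, b = 10), list(a = 2, b = 20))

# A string extracts by name: equivalent to function(el) el[["a"]]
map_dbl(x, "a")
#> [1] 1 2

# A number extracts by position: equivalent to function(el) el[[2]]
map_dbl(x, 2)
#> [1] 10 20
```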

Map functions come in a few different variations based on their inputs and output:

  • map() takes a vector (list or atomic vector) and returns a list. map_lgl(), map_int(), map_dbl(), and map_chr() take a vector and return an atomic vector. flatmap() works similarly, but allows the function to return arbitrary length vectors.
  • map_if() only applies .f to those elements of the list where .p is true. For example, the following snippet converts factors into characters:
    iris %>% map_if(is.factor, as.character) %>% str()
    #> 'data.frame':    150 obs. of  5 variables:
    #>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    #>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    #>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    #>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    #>  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...

    map_at() works similarly, but instead of working with a logical vector or predicate function, it works with an integer vector of element positions.

  • map2() takes a pair of lists and iterates through them in parallel:
    map2(1:3, 2:4, c)
    #> [[1]]
    #> [1] 1 2
    #> 
    #> [[2]]
    #> [1] 2 3
    #> 
    #> [[3]]
    #> [1] 3 4
    map2(1:3, 2:4, ~ .x * (.y - 1))
    #> [[1]]
    #> [1] 1
    #> 
    #> [[2]]
    #> [1] 4
    #> 
    #> [[3]]
    #> [1] 9

    map3() does the same thing for three lists, and map_n() does it in general.

  • invoke(), invoke_lgl(), invoke_int(), invoke_dbl(), and invoke_chr() take a list of functions, and call each one with the supplied arguments:
    list(m1 = mean, m2 = median) %>%
      invoke_dbl(rcauchy(100))
    #>    m1    m2 
    #> 9.765 0.117
  • walk() takes a vector, calls a function on each piece, and returns its original input. It’s useful for functions called for their side-effects; returning the input lets you continue the pipe.
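As a small sketch (the cat() call is just an illustrative side-effect), walk() hands its input back so the pipe can continue:

```r
library(purrr)

out <- c("a", "b", "c") %>% walk(~ cat("saving", .x, "\n"))

# walk() printed three messages, but out is still the original vector
identical(out, c("a", "b", "c"))
#> [1] TRUE
```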

Purrr and dplyr

I’m becoming increasingly enamoured with the list-columns in data frames. The following example combines purrr and dplyr to generate 100 random test-training splits in order to compute an unbiased estimate of prediction quality. These tools are still experimental (and currently need quite a bit of extra scaffolding), but I think the basic approach is really appealing.

library(dplyr)
random_group <- function(n, probs) {
  probs <- probs / sum(probs)
  g <- findInterval(seq(0, 1, length = n), c(0, cumsum(probs)),
    rightmost.closed = TRUE)
  names(probs)[sample(g)]
}
partition <- function(df, n, probs) {
  n %>% 
    replicate(split(df, random_group(nrow(df), probs)), simplify = FALSE) %>%
    zip_n() %>%
    as_data_frame()
}

msd <- function(x, y) sqrt(mean((x - y) ^ 2))

# Generate 100 random test-training splits
cv <- mtcars %>%
  partition(100, c(training = 0.8, test = 0.2)) %>% 
  mutate(
    # Fit the model
    model = map(training, ~ lm(mpg ~ wt, data = .)),
    # Make predictions on test data
    pred = map2(model, test, predict),
    # Calculate root mean squared difference
    diff = map2(pred, test %>% map("mpg"), msd) %>% flatten()
  )
cv
#> Source: local data frame [100 x 5]
#> 
#>                   test             training   model     pred  diff
#>                 (list)               (list)  (list)   (list) (dbl)
#> 1  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.70
#> 2  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  2.03
#> 3  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  2.29
#> 4  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  4.88
#> 5  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.20
#> 6  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  4.68
#> 7  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.39
#> 8  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.82
#> 9  <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  2.56
#> 10 <data.frame [7,11]> <data.frame [25,11]> <S3:lm> <dbl[7]>  3.40
#> ..                 ...                  ...     ...      ...   ...
mean(cv$diff)
#> [1] 3.22

Other functions

There are too many other pieces of purrr to describe in detail here. A few of the most useful functions are noted below:

  • zip_n() allows you to turn a list of lists “inside-out”:
    x <- list(list(a = 1, b = 2), list(a = 2, b = 1))
    x %>% str()
    #> List of 2
    #>  $ :List of 2
    #>   ..$ a: num 1
    #>   ..$ b: num 2
    #>  $ :List of 2
    #>   ..$ a: num 2
    #>   ..$ b: num 1
    
    x %>%
      zip_n() %>%
      str()
    #> List of 2
    #>  $ a:List of 2
    #>   ..$ : num 1
    #>   ..$ : num 2
    #>  $ b:List of 2
    #>   ..$ : num 2
    #>   ..$ : num 1
    
    x %>%
      zip_n(.simplify = TRUE) %>%
      str()
    #> List of 2
    #>  $ a: num [1:2] 1 2
    #>  $ b: num [1:2] 2 1
  • keep() and discard() allow you to filter a vector based on a predicate function. compact() is a helpful wrapper that throws away empty elements of a list.
    1:10 %>% keep(~. %% 2 == 0)
    #> [1]  2  4  6  8 10
    1:10 %>% discard(~. %% 2 == 0)
    #> [1] 1 3 5 7 9
    
    list(list(x = TRUE, y = 10), list(x = FALSE, y = 20)) %>%
      keep("x") %>% 
      str()
    #> List of 1
    #>  $ :List of 2
    #>   ..$ x: logi TRUE
    #>   ..$ y: num 10
    
    list(NULL, 1:3, NULL, 7) %>% 
      compact() %>%
      str()
    #> List of 2
    #>  $ : int [1:3] 1 2 3
    #>  $ : num 7
  • lift() (and friends) allow you to convert a function that takes multiple arguments into a function that takes a list. They help you compose functions by lifting their domain from one kind of input to another. The domain can be changed to and from a list (l), a vector (v), and dots (d).
  • cross2(), cross3() and cross_n() allow you to create the Cartesian product of the inputs (with optional filtering).
  • A number of functions let you manipulate functions: negate(), compose(), partial().
  • A complete set of predicate functions provides predictable versions of the is.* functions: is_logical(), is_list(), is_bare_double(), is_scalar_character(), etc.
  • Other functions wrap existing base R functions in the consistent design of purrr: replicate() -> rerun(), Reduce() -> reduce(), Find() -> detect(), Position() -> detect_index().
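A quick sketch of those base-R equivalents in action:

```r
library(purrr)

reduce(1:5, `+`)             # like Reduce(): 1 + 2 + 3 + 4 + 5
#> [1] 15

detect(1:10, ~ .x > 5)       # like Find(): first element matching the predicate
#> [1] 6

detect_index(1:10, ~ .x > 5) # like Position(): the position of that element
#> [1] 6
```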

Design philosophy

The goal of purrr is not to turn R into Haskell: it does not implement currying, or destructuring binds, or pattern matching. The goal is to give you similar expressiveness to a classical FP language, while allowing you to write code that looks and feels like R.

  • Anonymous functions are verbose in R, so we provide two convenient shorthands. For predicate functions, ~ .x + 1 is equivalent to function(.x) .x + 1. For chains of transformations, . %>% f() %>% g() is equivalent to function(x) g(f(x)).
  • R is weakly typed, so we can implement general zip_n(), rather than having to specialise on the number of arguments. That said, we still provide map2() and map3() since it’s useful to clearly separate which arguments are vectorised over. Functions are designed to be output type-stable (respecting Postel’s law) so you can rely on the output being as you expect.
  • R has named arguments, so instead of providing different functions for minor variations (e.g. detect() and detectLast()) we use named arguments.
  • Instead of currying, we use ... to pass in extra arguments. Arguments of purrr functions always start with . to avoid matching to the arguments of .f passed in via ....
  • Instead of point free style, use the pipe, %>%, to write code that can be read from left to right.
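To illustrate the two anonymous-function shorthands (f below is a made-up name):

```r
library(purrr)
library(magrittr)

# Formula shorthand: ~ .x * 2 becomes function(.x) .x * 2
map_dbl(1:3, ~ .x * 2)
#> [1] 2 4 6

# Pipe shorthand: starting a chain with . creates a function
f <- . %>% abs() %>% sqrt()
f(-16)
#> [1] 4
```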

I’m pleased to announce rvest 0.3.0 is now available on CRAN. Rvest makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like beautiful soup. It is designed to work with pipes so that you can express complex operations by composing simple pieces. Install it with:

install.packages("rvest")

What’s new

The biggest change in this version is that rvest now uses the xml2 package instead of XML. This makes rvest much simpler, eliminates memory leaks, and should improve performance a little.

A number of functions have changed names to improve consistency with other packages: most importantly html() is now read_html(), and html_tag() is now html_name(). The old versions still work, but are deprecated and will be removed in rvest 0.4.0.

html_node() now throws an error if there are no matches, and a warning if there’s more than one match. I think this should make it more likely to fail clearly when the structure of the page changes. If you don’t want this behaviour, use html_nodes().
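A minimal sketch of the difference, using an inline snippet of HTML rather than a live page:

```r
library(rvest)

page <- read_html("<div><p>one</p><p>two</p></div>")

# html_nodes() returns all matches (possibly zero)
page %>% html_nodes("p") %>% html_text()
#> [1] "one" "two"

# html_node() expects exactly one match: here it would complain
# because there are two <p> elements, and asking for a selector
# with no matches at all (e.g. "span") would fail outright
```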

There were a number of other bug fixes and minor improvements as described in the release notes.

RStudio will again teach the new essentials for doing (big) data science in R at this year’s Strata NYC conference, September 29 2015 (http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/44154). You will learn from Garrett Grolemund, Yihui Xie, and Nathan Stephens, who are all working on fascinating new ways to keep the R ecosystem abreast of the challenges facing those who work with data.

Topics include:

  • R Quickstart: Wrangle, transform, and visualize data
    Instructor: Garrett Grolemund (90 minutes)
  • Work with Big Data in R
    Instructor: Nathan Stephens (90 minutes)
  • Reproducible Reports with Big Data
    Instructor: Yihui Xie (90 minutes)
  • Interactive Shiny Applications built on Big Data
    Instructor: Garrett Grolemund (90 minutes)

If you plan to stay for the full Strata Conference+Hadoop World be sure to look us up at booth 633 during the Expo Hall hours. We’ll have the latest books from RStudio authors and “shiny” t-shirts to win. Share with us what you’re doing with RStudio and get your product and company questions answered by RStudio employees.

See you in New York City! (http://strataconf.com/big-data-conference-ny-2015)

Devtools 1.9.1 is now available on CRAN. Devtools makes package building so easy a package can become your default way to organise code, data, and documentation. You can learn more about developing packages in R packages, my book about package development that’s freely available online.

Get the latest version of devtools with:

install.packages("devtools")

There are three major improvements that I contributed:

  • check() is now much closer to what CRAN does – it passes on --as-cran to R CMD check, using an env var to turn off the incoming CRAN checks. These are turned off because they’re slow (they have to retrieve data from CRAN), and are not necessary except just prior to release (so release() turns them back on again).
  • install_deps() now automatically upgrades out of date dependencies. This is typically what you want when you’re working on a development version of a package: otherwise you can get an unpleasant surprise when you go to submit your package to CRAN and discover it doesn’t work with the latest version of its dependencies. To suppress this behaviour, set upgrade_dependencies = FALSE.
  • revdep_check() received a number of tweaks that I’ve found helpful when preparing my packages for CRAN:
    • Suggested dependencies of the revdeps are installed by default.
    • The NOT_CRAN env var is set to false so tests that are skipped on CRAN are also skipped for you.
    • The RGL_USE_NULL env var is set to true to stop rgl windows from popping up during testing.
    • All revdep sources are downloaded at the start of the checks. This makes life a bit easier if you’re on a flaky internet connection.

But like many recent devtools releases, most of the coolest new features have been contributed by the community:

  • Jim Hester implemented experimental remote dependencies for install(). You can now tell devtools where to find dependencies with a Remotes field:
    Imports:
      MASS,
      testthat
    Remotes:
      hadley/testthat

    The default allows you to refer to GitHub repos, but you can easily add deps from any of the other sources that devtools supports: see vignette("dependencies") for more details.

    Support for installing development dependencies is still experimental so we appreciate any feedback.

  • Jenny Bryan considerably improved the existing GitHub integration. use_github() now pushes to the newly created GitHub repo, and sets a remote tracking branch. It also populates the URL and BugReports fields of your DESCRIPTION.
  • Kirill Müller contributed many bug fixes, minor improvements and test cases.

See the release notes for complete bug fixes and other minor changes.

tidyr 0.3.0 is now available on CRAN. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has variables in columns and observations in rows, and is described in more detail in the tidy data vignette. Install tidyr with:

install.packages("tidyr")

tidyr contains four new verbs: fill(), replace_na(), complete(), and unnest(), plus lots of smaller bug fixes and improvements.

fill()

The new fill() function fills in missing observations from the last non-missing value. This is useful if you’re getting data from Excel users who haven’t read Karl Broman’s excellent data organisation guide and who leave cells blank to indicate that the previous value should be carried forward:

df <- dplyr::data_frame(
  year = c(2015, NA, NA, NA), 
  trt = c("A", NA, "B", NA)
)
df
#> Source: local data frame [4 x 2]
#> 
#>    year   trt
#>   (dbl) (chr)
#> 1  2015     A
#> 2    NA    NA
#> 3    NA     B
#> 4    NA    NA
df %>% fill(year, trt)
#> Source: local data frame [4 x 2]
#> 
#>    year   trt
#>   (dbl) (chr)
#> 1  2015     A
#> 2  2015     A
#> 3  2015     B
#> 4  2015     B

replace_na() and complete()

replace_na() makes it easy to replace missing values on a column-by-column basis:

df <- dplyr::data_frame(
  x = c(1, 2, NA), 
  y = c("a", NA, "b")
)
df %>% replace_na(list(x = 0, y = "unknown"))
#> Source: local data frame [3 x 2]
#> 
#>       x       y
#>   (dbl)   (chr)
#> 1     1       a
#> 2     2 unknown
#> 3     0       b

It is particularly useful when called from complete(), which makes it easy to fill in missing combinations of your data:

df <- dplyr::data_frame(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)
df
#> Source: local data frame [3 x 5]
#> 
#>   group item_id item_name value1 value2
#>   (dbl)   (dbl)     (chr)  (int)  (int)
#> 1     1       1         a      1      4
#> 2     2       2         b      2      5
#> 3     1       2         b      3      6

df %>% complete(group, c(item_id, item_name))
#> Source: local data frame [4 x 5]
#> 
#>   group item_id item_name value1 value2
#>   (dbl)   (dbl)     (chr)  (int)  (int)
#> 1     1       1         a      1      4
#> 2     1       2         b      3      6
#> 3     2       1         a     NA     NA
#> 4     2       2         b      2      5

df %>% complete(
  group, c(item_id, item_name), 
  fill = list(value1 = 0)
)
#> Source: local data frame [4 x 5]
#> 
#>   group item_id item_name value1 value2
#>   (dbl)   (dbl)     (chr)  (dbl)  (int)
#> 1     1       1         a      1      4
#> 2     1       2         b      3      6
#> 3     2       1         a      0     NA
#> 4     2       2         b      2      5

Note how I’ve grouped item_id and item_name together with c(item_id, item_name). This treats them as nested, not crossed, so we don’t get every combination of group, item_id and item_name, as we would otherwise:

df %>% complete(group, item_id, item_name)
#> Source: local data frame [8 x 5]
#> 
#>    group item_id item_name value1 value2
#>    (dbl)   (dbl)     (chr)  (int)  (int)
#> 1      1       1         a      1      4
#> 2      1       1         b     NA     NA
#> 3      1       2         a     NA     NA
#> 4      1       2         b      3      6
#> 5      2       1         a     NA     NA
#> ..   ...     ...       ...    ...    ...

Read more about this behaviour in ?expand.

unnest()

unnest() is out of beta, and is now ready to help you unnest columns that are lists of vectors. This can occur when you have hierarchical data that’s been collapsed into a string:

df <- dplyr::data_frame(x = 1:2, y = c("1,2", "3,4,5,6,7"))
df
#> Source: local data frame [2 x 2]
#> 
#>       x         y
#>   (int)     (chr)
#> 1     1       1,2
#> 2     2 3,4,5,6,7

df %>% 
  dplyr::mutate(y = strsplit(y, ","))
#> Source: local data frame [2 x 2]
#> 
#>       x        y
#>   (int)   (list)
#> 1     1 <chr[2]>
#> 2     2 <chr[5]>

df %>% 
  dplyr::mutate(y = strsplit(y, ",")) %>%
  unnest()
#> Source: local data frame [7 x 2]
#> 
#>        x     y
#>    (int) (chr)
#> 1      1     1
#> 2      1     2
#> 3      2     3
#> 4      2     4
#> 5      2     5
#> ..   ...   ...

unnest() also works on columns that are lists of data frames. This is admittedly esoteric, but I think it might be useful when you’re generating pairs of test-training splits. I’m still thinking about this idea, so look for more examples and better support across my packages in the future.
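Here’s a small sketch of that idea, using a hand-built list-column of data frames:

```r
library(dplyr)
library(tidyr)

df <- dplyr::data_frame(
  group = c("a", "b"),
  data  = list(head(mtcars, 2), head(mtcars, 3))
)

# Each row's data frame is expanded, with group repeated alongside:
# 2 + 3 = 5 rows in total
out <- df %>% unnest()
nrow(out)
#> [1] 5
```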

Minor improvements

There were 13 minor improvements and bug fixes. The most important are listed below. To read about the rest, please consult the release notes.

  • %>% is re-exported from magrittr: this means that you no longer need to load dplyr or magrittr if you want to use the pipe.
  • extract() and separate() now return multiple NA columns for NA inputs:
    df <- dplyr::data_frame(x = c("a-b", NA, "c-d"))
    df %>% separate(x, c("x", "y"), "-")
    #> Source: local data frame [3 x 2]
    #> 
    #>       x     y
    #>   (chr) (chr)
    #> 1     a     b
    #> 2    NA    NA
    #> 3     c     d
  • separate() gains finer control if there are too few matches:
    df <- dplyr::data_frame(x = c("a-b-c", "a-c"))
    df %>% separate(x, c("x", "y", "z"), "-")
    #> Warning: Too few values at 1 locations: 2
    #> Source: local data frame [2 x 3]
    #> 
    #>       x     y     z
    #>   (chr) (chr) (chr)
    #> 1     a     b     c
    #> 2     a     c    NA
    df %>% separate(x, c("x", "y", "z"), "-", fill = "right")
    #> Source: local data frame [2 x 3]
    #> 
    #>       x     y     z
    #>   (chr) (chr) (chr)
    #> 1     a     b     c
    #> 2     a     c    NA
    df %>% separate(x, c("x", "y", "z"), "-", fill = "left")
    #> Source: local data frame [2 x 3]
    #> 
    #>       x     y     z
    #>   (chr) (chr) (chr)
    #> 1     a     b     c
    #> 2    NA     c     a

    This complements the support for too many matches:

    df <- dplyr::data_frame(x = c("a-b-c", "a-c"))
    df %>% separate(x, c("x", "y"), "-")
    #> Warning: Too many values at 1 locations: 1
    #> Source: local data frame [2 x 2]
    #> 
    #>       x     y
    #>   (chr) (chr)
    #> 1     a     b
    #> 2     a     c
    df %>% separate(x, c("x", "y"), "-", extra = "merge")
    #> Source: local data frame [2 x 2]
    #> 
    #>       x     y
    #>   (chr) (chr)
    #> 1     a   b-c
    #> 2     a     c
    df %>% separate(x, c("x", "y"), "-", extra = "drop")
    #> Source: local data frame [2 x 2]
    #> 
    #>       x     y
    #>   (chr) (chr)
    #> 1     a     b
    #> 2     a     c
  • tidyr no longer depends on reshape2. This should fix issues when you load reshape and tidyr at the same time. It also frees tidyr to evolve in a different direction to the more general reshape2.

dplyr 0.4.3 includes over 30 minor improvements and bug fixes, which are described in detail in the release notes. Here I want to draw your attention to five small, but important, changes:

  • mutate() no longer randomly crashes! (Sorry it took us so long to fix this – I know it’s been causing a lot of pain.)
  • dplyr now has much better support for non-ASCII column names. It’s probably not perfect, but should be a lot better than previous versions.
  • When printing a tbl_df, you now see the types of all columns, not just those that don’t fit on the screen:
    data_frame(x = 1:3, y = letters[x], z = factor(y))
    #> Source: local data frame [3 x 3]
    #> 
    #>       x     y      z
    #>   (int) (chr) (fctr)
    #> 1     1     a      a
    #> 2     2     b      b
    #> 3     3     c      c
  • bind_rows() gains a .id argument. When supplied, it creates a new column that gives the name of each data frame:
    a <- data_frame(x = 1, y = "a")
    b <- data_frame(x = 2, y = "c")
    
    bind_rows(a = a, b = b)
    #> Source: local data frame [2 x 2]
    #> 
    #>       x     y
    #>   (dbl) (chr)
    #> 1     1     a
    #> 2     2     c
    bind_rows(a = a, b = b, .id = "source")
    #> Source: local data frame [2 x 3]
    #> 
    #>   source     x     y
    #>    (chr) (dbl) (chr)
    #> 1      a     1     a
    #> 2      b     2     c
    # Or equivalently
    bind_rows(list(a = a, b = b), .id = "source")
    #> Source: local data frame [2 x 3]
    #> 
    #>   source     x     y
    #>    (chr) (dbl) (chr)
    #> 1      a     1     a
    #> 2      b     2     c
  • dplyr is now more forgiving of unknown attributes. All functions should now copy column attributes from the input to the output, instead of complaining. Additionally arrange(), filter(), slice(), and summarise() preserve attributes of the data frame itself.
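For example, a sketch of the column-attribute behaviour (the "label" attribute is made up for illustration):

```r
library(dplyr)

df <- data_frame(x = 1:3, y = c("a", "b", "c"))
attr(df$x, "label") <- "a count"

# Previously some verbs dropped, or complained about, unknown
# attributes; now they are copied from the input to the output
filtered <- df %>% filter(x > 1)
attr(filtered$x, "label")
```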