On May 19 and 20, 2016, Hadley Wickham will teach his two-day Master R Developer Workshop in the centrally located European city of Amsterdam.

Are you ready to upgrade your R skills?  Register soon to secure your seat.

For the convenience of those who may travel to the workshop, it will be held at the Hotel NH Amsterdam Schiphol Airport.

Hadley teaches a few workshops each year and this is the only one planned for Europe. They are very popular and hotel rooms are limited. Please register soon.

We look forward to seeing you in May!

We are pleased to announce version 1.0.0 of the memoise package is now available on CRAN. Memoization stores the value of a function call and returns the cached result when the function is called again with the same arguments.

The following function computes Fibonacci numbers and illustrates the usefulness of memoization. Because the function is recursive, memoising it means the intermediate results are looked up rather than recalculated at each level of recursion, which reduces the runtime drastically. And when the memoised function is called again with the same argument, the final result is simply returned from the cache, so no measurable execution time is recorded.

fib <- function(n) {
  if (n < 2) {
    return(n)
  } else {
    return(fib(n-1) + fib(n-2))
  }
}
system.time(x <- fib(30))
#>    user  system elapsed 
#>   4.454   0.010   4.472
library(memoise)
fib <- memoise(fib)
system.time(y <- fib(30))
#>    user  system elapsed 
#>   0.004   0.000   0.004
system.time(z <- fib(30))
#>    user  system elapsed 
#>       0       0       0
all.equal(x, y)
#> [1] TRUE
all.equal(x, z)
#> [1] TRUE

Memoization is also very useful for storing queries to external resources, such as network APIs and databases.
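
For example, here is a minimal sketch of caching web API calls (the wrapper name and URL are hypothetical; it assumes the httr package):

library(httr)
# Repeated calls with the same URL return the cached response
# rather than hitting the network again.
cached_GET <- memoise(function(url) GET(url))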

Improvements in this release make memoised functions much nicer to use interactively. Memoised functions now have a print method which outputs the original function definition rather than the memoization code.

mem_sum <- memoise(sum)
mem_sum
#> Memoised Function:
#> function (..., na.rm = FALSE)  .Primitive("sum")

Memoised functions now forward their arguments from the original function rather than simply passing them with .... This allows autocompletion to work transparently for memoised functions and also fixes a bug related to non-constant default arguments. [1]

mem_scan <- memoise(scan)
args(mem_scan)
#> function (file = "", what = double(), nmax = -1L, n = -1L, sep = "", 
#>     quote = if (identical(sep, "\n")) "" else "'\"", dec = ".", 
#>     skip = 0L, nlines = 0L, na.strings = "NA", flush = FALSE, 
#>     fill = FALSE, strip.white = FALSE, quiet = FALSE, blank.lines.skip = TRUE, 
#>     multi.line = TRUE, comment.char = "", allowEscapes = FALSE, 
#>     fileEncoding = "", encoding = "unknown", text, skipNul = FALSE) 
#> NULL

Memoisation can now depend on external variables aside from the function arguments. This feature can be used in a variety of ways, such as invalidating the memoisation when a new package is attached.

mem_f <- memoise(runif, ~search())
mem_f(2)
#> [1] 0.009113091 0.988083122
mem_f(2)
#> [1] 0.009113091 0.988083122
library(ggplot2)
mem_f(2)
#> [1] 0.89150566 0.01128355

Or invalidating the memoisation after a given amount of time has elapsed. A timeout() helper function is provided to make this feature easier to use.

mem_f <- memoise(runif, ~timeout(10))
mem_f(2)
#> [1] 0.6935329 0.3584699
mem_f(2)
#> [1] 0.6935329 0.3584699
Sys.sleep(10)
mem_f(2)
#> [1] 0.2008418 0.4538413

A great deal of thanks for this release goes to Kirill Müller, who wrote the argument forwarding implementation and added comprehensive tests to the package. [2, 3]

See the release notes for a complete list of changes.

I’m pleased to announce tidyr 0.4.0. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has a simple convention: put variables in the columns and observations in the rows. You can learn more about it in the tidy data vignette. Install it with:

install.packages("tidyr")

There are two big features in this release: support for nested data frames, and improved tools for turning implicit missing values into explicit missing values. These are described in detail below. As well as these big features, all tidyr verbs now handle grouped_df objects created by dplyr, gather() makes a character key column (instead of a factor), and there are lots of other minor fixes and improvements. Please see the release notes for a complete list of changes.

Nested data frames

nest() and unnest() have been overhauled to support a new way of structuring your data: the nested data frame. In a grouped data frame, you have one row per observation, and additional metadata define the groups. In a nested data frame, you have one row per group, and the individual observations are stored in a column that is a list of data frames. This is a useful structure when you have lists of other objects (like models) with one element per group.

For example, take the gapminder dataset:

library(gapminder)
library(dplyr)

gapminder
#> Source: local data frame [1,704 x 6]
#> 
#>        country continent  year lifeExp      pop gdpPercap
#>         (fctr)    (fctr) (int)   (dbl)    (int)     (dbl)
#> 1  Afghanistan      Asia  1952    28.8  8425333       779
#> 2  Afghanistan      Asia  1957    30.3  9240934       821
#> 3  Afghanistan      Asia  1962    32.0 10267083       853
#> 4  Afghanistan      Asia  1967    34.0 11537966       836
#> 5  Afghanistan      Asia  1972    36.1 13079460       740
#> 6  Afghanistan      Asia  1977    38.4 14880372       786
#> 7  Afghanistan      Asia  1982    39.9 12881816       978
#> 8  Afghanistan      Asia  1987    40.8 13867957       852
#> ..         ...       ...   ...     ...      ...       ...

We can plot the trend in life expectancy for each country:

library(ggplot2)

ggplot(gapminder, aes(year, lifeExp)) +
  geom_line(aes(group = country))

[Figure: life expectancy over time, one line per country]

But it’s hard to see what’s going on because of all the overplotting. One interesting solution is to summarise each country with a linear model. To do that most naturally, you want one data frame for each country. nest() creates this structure:

by_country <- gapminder %>% 
  group_by(continent, country) %>% 
  nest()

by_country
#> Source: local data frame [142 x 3]
#> 
#>    continent     country            data
#>       (fctr)      (fctr)          (list)
#> 1       Asia Afghanistan <tbl_df [12,4]>
#> 2     Europe     Albania <tbl_df [12,4]>
#> 3     Africa     Algeria <tbl_df [12,4]>
#> 4     Africa      Angola <tbl_df [12,4]>
#> 5   Americas   Argentina <tbl_df [12,4]>
#> 6    Oceania   Australia <tbl_df [12,4]>
#> 7     Europe     Austria <tbl_df [12,4]>
#> 8       Asia     Bahrain <tbl_df [12,4]>
#> ..       ...         ...             ...

The intriguing thing about this data frame is that it now contains one row per group, and the original data is stored in a new data column: a list of data frames. If we look at the first one, we can see that it contains the complete data for Afghanistan (sans grouping columns):

by_country$data[[1]]
#> Source: local data frame [12 x 4]
#> 
#>     year lifeExp      pop gdpPercap
#>    (int)   (dbl)    (int)     (dbl)
#> 1   1952    43.1  9279525      2449
#> 2   1957    45.7 10270856      3014
#> 3   1962    48.3 11000948      2551
#> 4   1967    51.4 12760499      3247
#> 5   1972    54.5 14760787      4183
#> 6   1977    58.0 17152804      4910
#> 7   1982    61.4 20033753      5745
#> 8   1987    65.8 23254956      5681
#> ..   ...     ...      ...       ...

This form is natural because there are other vectors where you’ll have one value per country. For example, we could fit a linear model to each country with purrr:

by_country <- by_country %>% 
  mutate(model = purrr::map(data, ~ lm(lifeExp ~ year, data = .)))
by_country
#> Source: local data frame [142 x 4]
#> 
#>    continent     country            data   model
#>       (fctr)      (fctr)          (list)  (list)
#> 1       Asia Afghanistan <tbl_df [12,4]> <S3:lm>
#> 2     Europe     Albania <tbl_df [12,4]> <S3:lm>
#> 3     Africa     Algeria <tbl_df [12,4]> <S3:lm>
#> 4     Africa      Angola <tbl_df [12,4]> <S3:lm>
#> 5   Americas   Argentina <tbl_df [12,4]> <S3:lm>
#> 6    Oceania   Australia <tbl_df [12,4]> <S3:lm>
#> 7     Europe     Austria <tbl_df [12,4]> <S3:lm>
#> 8       Asia     Bahrain <tbl_df [12,4]> <S3:lm>
#> ..       ...         ...             ...     ...

Because we used mutate(), we get an extra column containing one linear model per country.

It might seem unnatural to store a list of linear models in a data frame. However, I think it is actually a really convenient and powerful strategy because it allows you to keep related vectors together. If you filter or arrange the data frame, there’s no way for the models to get out of sync with the other columns.
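
For example, a small sketch using the by_country data frame from above: filtering subsets the data and model columns together, so they stay in sync.

# Each remaining row still pairs a country's data with its model
by_country %>% filter(continent == "Europe")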

nest() got us into this form; unnest() gets us out. You give it the list-columns that you want to unnest, and tidyr will automatically repeat the grouping columns. Unnesting data gets us back to the original form:

by_country %>% unnest(data)
#> Source: local data frame [1,704 x 6]
#> 
#>    continent     country  year lifeExp      pop gdpPercap
#>       (fctr)      (fctr) (int)   (dbl)    (int)     (dbl)
#> 1       Asia Afghanistan  1952    43.1  9279525      2449
#> 2       Asia Afghanistan  1957    45.7 10270856      3014
#> 3       Asia Afghanistan  1962    48.3 11000948      2551
#> 4       Asia Afghanistan  1967    51.4 12760499      3247
#> 5       Asia Afghanistan  1972    54.5 14760787      4183
#> 6       Asia Afghanistan  1977    58.0 17152804      4910
#> 7       Asia Afghanistan  1982    61.4 20033753      5745
#> 8       Asia Afghanistan  1987    65.8 23254956      5681
#> ..       ...         ...   ...     ...      ...       ...

When working with models, unnesting is particularly useful when you combine it with broom to extract model summaries:

# Extract model summaries:
by_country %>% unnest(model %>% purrr::map(broom::glance))
#> Source: local data frame [142 x 15]
#> 
#>    continent     country            data   model r.squared
#>       (fctr)      (fctr)          (list)  (list)     (dbl)
#> 1       Asia Afghanistan <tbl_df [12,4]> <S3:lm>     0.985
#> 2     Europe     Albania <tbl_df [12,4]> <S3:lm>     0.888
#> 3     Africa     Algeria <tbl_df [12,4]> <S3:lm>     0.967
#> 4     Africa      Angola <tbl_df [12,4]> <S3:lm>     0.034
#> 5   Americas   Argentina <tbl_df [12,4]> <S3:lm>     0.919
#> 6    Oceania   Australia <tbl_df [12,4]> <S3:lm>     0.766
#> 7     Europe     Austria <tbl_df [12,4]> <S3:lm>     0.680
#> 8       Asia     Bahrain <tbl_df [12,4]> <S3:lm>     0.493
#> ..       ...         ...             ...     ...       ...
#> Variables not shown: adj.r.squared (dbl), sigma (dbl),
#>   statistic (dbl), p.value (dbl), df (int), logLik (dbl),
#>   AIC (dbl), BIC (dbl), deviance (dbl), df.residual (int).

# Extract coefficients:
by_country %>% unnest(model %>% purrr::map(broom::tidy))
#> Source: local data frame [284 x 7]
#> 
#>    continent     country        term  estimate std.error
#>       (fctr)      (fctr)       (chr)     (dbl)     (dbl)
#> 1       Asia Afghanistan (Intercept) -1.07e+03   43.8022
#> 2       Asia Afghanistan        year  5.69e-01    0.0221
#> 3     Europe     Albania (Intercept) -3.77e+02   46.5834
#> 4     Europe     Albania        year  2.09e-01    0.0235
#> 5     Africa     Algeria (Intercept) -6.13e+02   38.8918
#> 6     Africa     Algeria        year  3.34e-01    0.0196
#> 7     Africa      Angola (Intercept) -6.55e+01  202.3625
#> 8     Africa      Angola        year  6.07e-02    0.1022
#> ..       ...         ...         ...       ...       ...
#> Variables not shown: statistic (dbl), p.value (dbl).

# Extract residuals etc:
by_country %>% unnest(model %>% purrr::map(broom::augment))
#> Source: local data frame [1,704 x 11]
#> 
#>    continent     country lifeExp  year .fitted .se.fit
#>       (fctr)      (fctr)   (dbl) (int)   (dbl)   (dbl)
#> 1       Asia Afghanistan    43.1  1952    43.4   0.718
#> 2       Asia Afghanistan    45.7  1957    46.2   0.627
#> 3       Asia Afghanistan    48.3  1962    49.1   0.544
#> 4       Asia Afghanistan    51.4  1967    51.9   0.472
#> 5       Asia Afghanistan    54.5  1972    54.8   0.416
#> 6       Asia Afghanistan    58.0  1977    57.6   0.386
#> 7       Asia Afghanistan    61.4  1982    60.5   0.386
#> 8       Asia Afghanistan    65.8  1987    63.3   0.416
#> ..       ...         ...     ...   ...     ...     ...
#> Variables not shown: .resid (dbl), .hat (dbl), .sigma
#>   (dbl), .cooksd (dbl), .std.resid (dbl).

I think storing multiple models in a data frame is a powerful and convenient technique, and I plan to write more about it in the future.

Expanding

The complete() function allows you to turn implicit missing values into explicit missing values. For example, imagine you’ve collected some data on a yearly basis, but unfortunately some of it has gone missing:

resources <- frame_data(
  ~year, ~metric, ~value,
  1999, "coal", 100,
  2001, "coal", 50,
  2001, "steel", 200
)
resources
#> Source: local data frame [3 x 3]
#> 
#>    year metric value
#>   (dbl)  (chr) (dbl)
#> 1  1999   coal   100
#> 2  2001   coal    50
#> 3  2001  steel   200

Here the value for steel in 1999 is implicitly missing: it’s simply absent from the data frame. We can use complete() to make this missing row explicit, adding that combination of the variables and inserting a placeholder NA:

resources %>% complete(year, metric)
#> Source: local data frame [4 x 3]
#> 
#>    year metric value
#>   (dbl)  (chr) (dbl)
#> 1  1999   coal   100
#> 2  1999  steel    NA
#> 3  2001   coal    50
#> 4  2001  steel   200

With complete() you’re not limited to just combinations that exist in the data. For example, here we know that there should be data for every year, so we can use the full_seq() function to generate every year over the range of the data:

resources %>% complete(year = full_seq(year, 1L), metric)
#> Source: local data frame [6 x 3]
#> 
#>    year metric value
#>   (dbl)  (chr) (dbl)
#> 1  1999   coal   100
#> 2  1999  steel    NA
#> 3  2000   coal    NA
#> 4  2000  steel    NA
#> 5  2001   coal    50
#> 6  2001  steel   200

In other scenarios, you may not want to generate the full set of combinations. For example, imagine you have an experiment where each person is assigned one treatment. You don’t want to expand the combinations of person and treatment, but you do want to make sure every person has all replicates. You can use nesting() to prevent the full Cartesian product from being generated:

experiment <- data_frame(
  person = rep(c("Alex", "Robert", "Sam"), c(3, 2, 1)),
  trt  = rep(c("a", "b", "a"), c(3, 2, 1)),
  rep = c(1, 2, 3, 1, 2, 1),
  measurement_1 = runif(6),
  measurement_2 = runif(6)
)
experiment
#> Source: local data frame [6 x 5]
#> 
#>   person   trt   rep measurement_1 measurement_2
#>    (chr) (chr) (dbl)         (dbl)         (dbl)
#> 1   Alex     a     1        0.7161         0.927
#> 2   Alex     a     2        0.3231         0.942
#> 3   Alex     a     3        0.4548         0.668
#> 4 Robert     b     1        0.0356         0.667
#> 5 Robert     b     2        0.5081         0.143
#> 6    Sam     a     1        0.6917         0.753

experiment %>% complete(nesting(person, trt), rep)
#> Source: local data frame [9 x 5]
#> 
#>    person   trt   rep measurement_1 measurement_2
#>     (chr) (chr) (dbl)         (dbl)         (dbl)
#> 1    Alex     a     1        0.7161         0.927
#> 2    Alex     a     2        0.3231         0.942
#> 3    Alex     a     3        0.4548         0.668
#> 4  Robert     b     1        0.0356         0.667
#> 5  Robert     b     2        0.5081         0.143
#> 6  Robert     b     3            NA            NA
#> 7     Sam     a     1        0.6917         0.753
#> 8     Sam     a     2            NA            NA
#> ..    ...   ...   ...           ...           ...

httr 1.1.0 is now available on CRAN. The httr package makes it easy to talk to web APIs from R. Learn more in the quick start vignette.

Install the latest version with:

install.packages("httr")

When writing this blog post I discovered that I forgot to announce httr 1.0.0. This was a major release marking the transition from the RCurl package to the curl package, a modern binding to libcurl written by Jeroen Ooms. This makes httr more reliable, less likely to leak memory, and prevents the diabolical “easy handle already used in multi handle” error.

httr 1.1.0 includes a couple of new features:

  • stop_for_status(), warn_for_status() and (new) message_for_status() replace the old message argument with a new task argument that optionally describes the current task. This allows API wrappers to provide more informative error messages on failure.

  • http_error() replaces url_ok() and url_successful(). http_error() more clearly conveys intent and works with urls, responses and status codes (see the sketch below).
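
Here’s a quick sketch of both helpers (the URL is only for illustration):

library(httr)
r <- GET("https://httpbin.org/status/404")
http_error(r)                                  # TRUE for any 4xx/5xx response
stop_for_status(r, task = "download the page") # error message includes the task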

Otherwise, OAuth support continues to improve thanks to support from the community:

  • Nathan Goulding added RSA-SHA1 signature support to oauth1.0_token(). He also fixed bugs in oauth_service_token() and improved the caching behaviour of refresh_oauth2.0(). This makes httr easier to use with Google’s service accounts.

  • Graham Parsons added support for HTTP basic authentication to oauth2.0_token() with the use_basic_auth argument. This is now the default method used when retrieving a token.

  • Daniel Lockau implemented user_params which allows you to pass arbitrary additional parameters to the token access endpoint when acquiring or refreshing a token. This allows you to use httr with Microsoft Azure. He also wrote a demo so you can see exactly how this works.

To see the full list of changes, please read the release notes for 1.0.0 and 1.1.0.

Devtools 1.10.0 is now available on CRAN. Devtools makes package building so easy that a package can become your default way to organise code, data, documentation, and tests. You can learn more about creating your own package in R packages. Install devtools with:

install.packages("devtools")

This version is mostly a collection of bug fixes and minor improvements. For example:

  • Devtools employs a new strategy for detecting Rtools on Windows: we now only check for Rtools if you need to load_all() or build() a package with compiled code. This should make life easier for most Windows users.
  • Package installation received a lot of tweaks from the community. Devtools now makes use of the Additional_repositories field, which is useful if you’re using drat for non-CRAN packages. install_github() is now lazy and won’t reinstall if the currently installed version is the same as the one on GitHub. Local installs now add git and GitHub metadata, if available.
  • use_news_md() adds a (very) basic NEWS.md template. CRAN now accepts NEWS.md files, so release() warns if you’ve previously added it to .Rbuildignore.
  • use_mit_license() writes the necessary infrastructure to declare that your package is MIT licensed (in a CRAN-compliant way); see the sketch after this list.
  • check(cran = TRUE) automatically adds --run-donttest as this is a de facto CRAN standard.
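
Here’s a brief sketch of these helpers in action, run from within a package directory (arguments kept minimal for illustration):

# Add a basic NEWS.md and declare an MIT license
devtools::use_news_md()
devtools::use_mit_license()

# Check the package the way CRAN does, including \donttest examples
devtools::check(cran = TRUE)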

To see the full list of changes, please read the release notes.

Shiny 0.13.0 is now available on CRAN! This release has some of the most exciting features we’ve shipped since the first version of Shiny. Highlights include:

  • Shiny Gadgets
  • HTML templates
  • Shiny modules
  • Error stack traces
  • Checking for missing inputs
  • New JavaScript events

For a comprehensive list of changes, see the NEWS file.

To install the new version from CRAN, run:

install.packages("shiny")


(Post by Dirk Eddelbuettel and JJ Allaire)

A common theme over the last few decades was that we could afford to simply sit back and let computer (hardware) engineers take care of increases in computing speed thanks to Moore’s law. That same line of thought now frequently points out that we are getting closer and closer to the physical limits of what Moore’s law can do for us.

So the new best hope is (and has been) parallel processing. Even our smartphones have multiple cores, and most if not all retail PCs now possess two, four or more cores. Real computers, aka somewhat decent servers, can be had with 24, 32 or more cores as well, and all that is before we even consider GPU coprocessors or other upcoming changes.

Sometimes our tasks are embarrassingly parallel, as is the case with many data-parallel jobs: we can use higher-level operations such as those offered by the base R package parallel to spawn multiple processing tasks and gather the results. Dirk covered all this in some detail in previous talks on High Performance Computing with R (and you can also consult the CRAN Task View on High Performance Computing with R).
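
For example, a minimal sketch with the base parallel package:

library(parallel)
# Spawn two worker processes, apply a function across them, gather the results
cl <- makeCluster(2)
parLapply(cl, 1:4, function(x) x^2)
stopCluster(cl)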

But sometimes we cannot use data-parallel approaches. Hence we have to redo our algorithms. Which is really hard. R itself has been relying on the (fairly mature) OpenMP standard for some of its operations. Luke Tierney’s keynote at the 2014 R/Finance conference mentioned some of the issues related to OpenMP, which works really well on Linux but currently not so well on other platforms. R is expected to make wider use of it in future versions once compiler support for OpenMP on Windows and OS X improves.

In the meantime, the RcppParallel package provides a complete toolkit for creating portable, high-performance parallel algorithms without requiring direct manipulation of operating system threads. RcppParallel includes:

  • Intel Thread Building Blocks (v4.3), a C++ library for task parallelism with a wide variety of parallel algorithms and data structures (Windows, OS X, Linux, and Solaris x86 only).
  • TinyThread, a C++ library for portable use of operating system threads.
  • RVector and RMatrix wrapper classes for safe and convenient access to R data structures in a multi-threaded environment.
  • High level parallel functions (parallelFor and parallelReduce) that use Intel TBB as a back-end on systems that support it and TinyThread on other platforms.

RcppParallel is available on CRAN now, and several packages, including dbmss, gaston, markovchain, rPref, SpatPCA, StMoSim, and text2vec, are already taking advantage of it (you can read more about the text2vec implementation here).

For more background and documentation see the RcppParallel web site as well as the slides from the talk we gave on RcppParallel at the Workshop for Distributed Computing in R.

In addition, the Rcpp Gallery includes four articles demonstrating the use of RcppParallel.

All four are interesting and demonstrate different aspects of parallel computing via RcppParallel, but one in particular is key: it shows how a particular matrix distance metric (which is missing from R) can be implemented in a serial manner in both R and Rcpp. The fastest implementation, however, uses both Rcpp and RcppParallel, and thereby achieves a truly impressive speed gain: the gains from using compiled code (via Rcpp) and from using a parallel algorithm (via RcppParallel) are multiplicative. On a couple of four-core machines the RcppParallel version was between 200 and 300 times faster than the R version.

Exciting times for parallel programming in R! To learn more head over to the RcppParallel package and start playing.

I’m pleased to announce purrr 0.2.0. Purrr fills in the missing pieces in R’s functional programming tools, and is designed to make your pure (and now type-stable) functions purr.

I’m still working out exactly what purrr should do, and how it compares to existing functions in base R, dplyr, and tidyr. One main insight that has affected much of the current version is that functions designed for programming should be type-stable. Type-stability is an idea brought to my attention by the Julia language. Even though functions in R and Julia can return different types of output, by and large, you should strive to make functions that always return the same type of data structure. This makes functions more robust to varying input, and makes them easier to reason about (and in Julia, to optimise). (But not every function can be type-stable – how could $ work?)
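
A quick illustration of why $ can’t be type-stable: the type of the result depends entirely on which column you extract.

df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df$x)
#>  int [1:3] 1 2 3
str(df$y)
#>  chr [1:3] "a" "b" "c"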

Purrr 0.2.0 adds type-stable alternatives for maps, flattens, and try(), as described below. There were a lot of other minor improvements, bug fixes, and a number of deprecations. Please see the release notes for a complete list of changes.

Type-stable maps

A map is a function that calls another function on each element of a vector. Map functions in base R are the “applys”: lapply(), sapply(), vapply(), etc. lapply() is type-stable: no matter what the inputs are, the output is always a list. sapply() is not type-stable: it can return different types of output depending on the input. The following code shows a simple (if somewhat contrived) example of sapply() returning either a vector, a matrix, or a list, depending on its inputs:

df <- data.frame(
  a = 1L,
  b = 1.5,
  y = Sys.time(),
  z = ordered(1)
)

df[1:4] %>% sapply(class) %>% str()
#> List of 4
#>  $ a: chr "integer"
#>  $ b: chr "numeric"
#>  $ y: chr [1:2] "POSIXct" "POSIXt"
#>  $ z: chr [1:2] "ordered" "factor"
df[1:2] %>% sapply(class) %>% str()
#>  Named chr [1:2] "integer" "numeric"
#>  - attr(*, "names")= chr [1:2] "a" "b"
df[3:4] %>% sapply(class) %>% str()
#>  chr [1:2, 1:2] "POSIXct" "POSIXt" "ordered" "factor"
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : NULL
#>   ..$ : chr [1:2] "y" "z"

This behaviour makes sapply() appropriate for interactive use, since it usually guesses correctly and gives a useful data structure. It’s not appropriate for use in package or production code because if the input isn’t what you expect, it won’t fail, and will instead return an unexpected data structure. This typically causes an error further along the process, so you get a confusing error message and it’s difficult to isolate the root cause.

Base R has a type-stable version of sapply() called vapply(). It takes an additional argument that determines what the output will be. purrr takes a different approach. Instead of one function that does it all, purrr has multiple functions, one for each common type of output: map_lgl(), map_int(), map_dbl(), map_chr(), and map_df(). These either produce the specified type of output or throw an error. This forces you to deal with the problem right away:

df[1:4] %>% map_chr(class)
#> Error: Result 3 is not a length 1 atomic vector
df[1:4] %>% map_chr(~ paste(class(.), collapse = "/"))
#>                a                b                y                z 
#>        "integer"        "numeric" "POSIXct/POSIXt" "ordered/factor"

Other variants of map() have similar suffixes. For example, map2() allows you to iterate over two vectors in parallel:

x <- list(1, 3, 5)
y <- list(2, 4, 6)
map2(x, y, c)
#> [[1]]
#> [1] 1 2
#> 
#> [[2]]
#> [1] 3 4
#> 
#> [[3]]
#> [1] 5 6

map2() always returns a list. If you want to add together the corresponding values and store the result as a double vector, you can use map2_dbl():

map2_dbl(x, y, `+`)
#> [1]  3  7 11

Another map variant is invoke_map(), which takes a list of functions and list of arguments. It also has type-stable suffixes:

spread <- list(sd = sd, iqr = IQR, mad = mad)
x <- rnorm(100)

invoke_map_dbl(spread, x = x)
#>        sd       iqr       mad 
#> 0.9121309 1.2515807 0.9774154

Type-stable flatten

Another situation when type-stability is important is flattening a nested list into a simpler data structure. Base R has unlist(), but it’s dangerous because it always succeeds. As an alternative, purrr provides flatten_lgl(), flatten_int(), flatten_dbl(), and flatten_chr():

x <- list(1L, 2:3, 4L)
x %>% str()
#> List of 3
#>  $ : int 1
#>  $ : int [1:2] 2 3
#>  $ : int 4
x %>% flatten() %>% str()
#> List of 4
#>  $ : int 1
#>  $ : int 2
#>  $ : int 3
#>  $ : int 4
x %>% flatten_int() %>% str()
#>  int [1:4] 1 2 3 4

Type-stable try()

Another function in base R that is not type-stable is try(). try() ensures that an expression always succeeds, either returning the original value or the error message:

str(try(log(10)))
#>  num 2.3
str(try(log("a"), silent = TRUE))
#> Class 'try-error'  atomic [1:1] Error in log("a") : non-numeric argument to mathematical function
#> 
#>   ..- attr(*, "condition")=List of 2
#>   .. ..$ message: chr "non-numeric argument to mathematical function"
#>   .. ..$ call   : language log("a")
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

safely() is a type-stable version of try(). It always returns a list of two elements, the result and the error, and one will always be NULL.

safely(log)(10)
#> $result
#> [1] 2.302585
#> 
#> $error
#> NULL
safely(log)("a")
#> $result
#> NULL
#> 
#> $error
#> <simpleError in .f(...): non-numeric argument to mathematical function>

Notice that safely() takes a function as input and returns a “safe” function, a function that never throws an error. A powerful technique is to use safely() and map() together to attempt an operation on each element of a list:

safe_log <- safely(log)
x <- list(10, "a", 5)
log_x <- x %>% map(safe_log)

str(log_x)
#> List of 3
#>  $ :List of 2
#>   ..$ result: num 2.3
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: NULL
#>   ..$ error :List of 2
#>   .. ..$ message: chr "non-numeric argument to mathematical function"
#>   .. ..$ call   : language .f(...)
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
#>  $ :List of 2
#>   ..$ result: num 1.61
#>   ..$ error : NULL

This output is slightly inconvenient because you’d rather have a list of three results and another list of three errors. You can use the new transpose() function to switch the order of the first and second levels in the hierarchy:

log_x %>% transpose() %>% str()
#> List of 2
#>  $ result:List of 3
#>   ..$ : num 2.3
#>   ..$ : NULL
#>   ..$ : num 1.61
#>  $ error :List of 3
#>   ..$ : NULL
#>   ..$ :List of 2
#>   .. ..$ message: chr "non-numeric argument to mathematical function"
#>   .. ..$ call   : language .f(...)
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
#>   ..$ : NULL

This makes it easy to extract the inputs where the original function failed, or to keep just the successful results:

results <- x %>% map(safe_log) %>% transpose()

(ok <- results$error %>% map_lgl(is_null))
#> [1]  TRUE FALSE  TRUE
(bad_inputs <- x %>% discard(ok))
#> [[1]]
#> [1] "a"
(successes <- results$result %>% keep(ok) %>% flatten_dbl())
#> [1] 2.302585 1.609438

I’m very pleased to announce the release of ggplot2 2.0.0. I know I promised that there wouldn’t be any more updates, but while working on the 2nd edition of the ggplot2 book, I just couldn’t stop myself from fixing some long-standing problems.

On the scale of ggplot2 releases, this one is huge with over one hundred fixes and improvements. This might break some of your existing code (although I’ve tried to minimise breakage as much as possible), but I hope the new features make up for any short term hassle. This blog post documents the most important changes:

  • ggplot2 now has an official extension mechanism.
  • There are a handful of new geoms, and updates to existing geoms.
  • The default appearance has been thoroughly tweaked so most plots should look better.
  • Facets have a much richer set of labelling options.
  • The documentation has been overhauled to be more helpful, and to require less navigation across multiple pages.
  • A number of older and less used features have been deprecated.

These are described in more detail below. See the release notes for a complete list of all changes.

Extensibility

Perhaps the biggest news in this release is that ggplot2 now has an official extension mechanism. This means that others can now easily create their own stats, geoms and positions, and provide them in other packages. This should allow the ggplot2 community to flourish, even as less development work happens in ggplot2 itself. See vignette("extending-ggplot2") for details.

Coupled with this change, ggplot2 no longer uses proto or reference classes. Instead, we now use ggproto, a new OO system designed specifically for ggplot2. Unlike proto and RC, ggproto supports clean cross-package inheritance, which is necessary for extensibility. Creating a new OO system isn’t usually the right solution, but I’m pretty sure it was necessary here. Read more about it in the vignette.
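
As a taste, here’s a minimal sketch of a ggproto object, in the spirit of the examples in vignette("extending-ggplot2"):

library(ggplot2)
# A ggproto object bundles fields and methods; `self` is bound automatically
Counter <- ggproto("Counter", NULL,
  n = 0,
  add = function(self, x) self$n + x
)
Counter$add(5)
#> [1] 5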

New and updated geoms

  • ggplot no longer throws an error if your plot has no layers. Instead it automatically adds geom_blank():
    ggplot(mpg, aes(cyl, hwy))

  • geom_count() (a new alias for the old stat_sum()) counts the number of points at unique locations on a scatterplot, and maps the size of the point to the count:
    ggplot(mpg, aes(cty, hwy)) + 
      geom_point()
    ggplot(mpg, aes(cty, hwy)) +
      geom_count()

  • geom_curve() draws curved lines in the same way that geom_segment() draws straight lines:
    df <- expand.grid(x = 1:2, y = 1:2)
    ggplot(df, aes(x, y, xend = x + 0.5, yend = y + 0.5)) +
      geom_curve(aes(colour = "curve")) +
      geom_segment(aes(colour = "segment"))

  • geom_bar() now behaves differently from geom_histogram(). Instead of binning the data, it counts the number of unique observations at each location:
    ggplot(mpg, aes(cyl)) + 
      geom_bar()
    
    ggplot(mpg, aes(cyl)) + 
      geom_histogram(binwidth = 1)

    If you got into the (bad) habit of using geom_histogram() to create bar charts, or geom_bar() to create histograms, you’ll need to switch.

  • Layers are now much stricter about their arguments – you will get an error if you’ve supplied an argument that isn’t an aesthetic or a parameter. This breaks the handful of geoms/stats that used ... to pass additional arguments on to the underlying computation. Now geom_smooth()/stat_smooth() and geom_quantile()/stat_quantile() use method.args instead; and stat_summary(), stat_summary_hex(), and stat_summary2d() use fun.args. This is likely to cause some short-term pain but in the long-term it will make it much easier to spot spelling mistakes and other errors.
  • geom_text() has been overhauled to make labelling your data a little easier. You can use nudge_x and nudge_y arguments to offset labels from their corresponding points. check_overlap = TRUE provides a simple way to avoid overplotting of labels: labels that would otherwise overlap are omitted.
    ggplot(mtcars, aes(wt, mpg, label = rownames(mtcars))) +
      geom_point() + 
      geom_text(nudge_y = 0.5, check_overlap = TRUE)

    (Labelling points well is still a huge pain, but at least these new features make life a little better.)

  • geom_label() works like geom_text() but draws a rounded rectangle underneath each label:
    grid <- expand.grid(
      x = seq(-pi, pi, length = 50),
      y = seq(-pi, pi, length = 50)
    ) %>% mutate(r = x ^ 2 + y ^ 2, z = cos(r ^ 2) * exp(-r / 6))
    
    ggplot(grid, aes(x, y)) +
      geom_raster(aes(fill = z)) +
      geom_label(data = data.frame(x = 0, y = 0), label = "Center") +
      theme(legend.position = "none") +
      coord_fixed()

  • aes_() replaces aes_q(), and works like the SE functions in dplyr and my other recent packages. It supports formulas, so the most concise SE version of aes(carat, price) is now aes_(~carat, ~price). You may want to use this form in packages, as it will avoid spurious R CMD check warnings about undefined global variables.
    ggplot(mpg, aes_(~displ, ~cty)) + 
      geom_point()
    # Same as
    ggplot(mpg, aes(displ, cty)) + 
      geom_point()

Appearance

I’ve made a number of small tweaks to the default appearance:

  • The default theme_grey() background colour has been changed from “grey90” to “grey92”: this makes the background a little less visually prominent.
  • Labels and titles have been tweaked for readability. Axis labels are darker, and legend titles get the same visual treatment as axis labels.
  • The default font size dropped from 12 to 11. You might be surprised that I’ve made the default text size smaller as it was already hard for many people to read. It turns out there was a bug in RStudio (fixed in 0.99.724), that shrunk the text of all grid based graphics. Once that was resolved the defaults seemed too big to my eyes.
  • scale_size() now maps values to area, not radius. Use scale_radius() if you want the old behaviour (not recommended, except perhaps for lines). Continue to use scale_size_area() if you want 0 values to have 0 area.
  • Bar and rectangle legends no longer get a diagonal line. Instead, the border has been tweaked to make it visible, and more closely match the size of line drawn on the plot.
    ggplot(mpg, aes(factor(cyl), fill = drv)) +  
      geom_bar(colour = "black", size = 1) + 
      coord_flip()

  • geom_point() now uses shape 19 instead of 16. This looks much better on the default Linux graphics device. (It’s very slightly smaller than the old point, but it shouldn’t affect any graphics significantly.) You can now control the width of the outline on shapes 21-25 with the stroke parameter; see the sketch after this list.
  • The default legend will now allocate multiple rows (if vertical) or columns (if horizontal) in order to make a legend that is more likely to fit on the screen. You can override with the nrow/ncol arguments to guide_legend()
    p <- ggplot(mpg, aes(displ, hwy, colour = manufacturer)) +
      geom_point() + 
      theme(legend.position = "bottom")
    p
    # Revert back to previous behaviour
    p + guides(colour = guide_legend(nrow = 1))

  • Two new themes were contributed by Jean-Olivier Irisson: theme_void() is completely empty and theme_dark() has a dark background designed to make colours pop out.
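
For example, here’s a quick sketch of the stroke parameter mentioned above:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(shape = 21, fill = "white", stroke = 1.5)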

Facet labels

Thanks to the work of Lionel Henry, facet labels have received three major improvements:

  1. You can switch the position of facet labels so they’re next to the axes.
  2. facet_wrap() now supports custom labellers.
  3. You can create combined labels when facetting by multiple variables.

Switching the labels

The new switch argument allows you to switch the labels to display near the axes:

data <- transform(mtcars,
  am = factor(am, levels = 0:1, labels = c("Automatic", "Manual")),
  gear = factor(gear, levels = 3:5, labels = c("Three", "Four", "Five"))
)

ggplot(data, aes(mpg, disp)) +
  geom_point() +
  facet_grid(am ~ gear, switch = "both")

This is especially useful when the labels directly characterise the axes. In that situation, switching the labels can make the plot clearer and more readable. You may also want to use a neutral label background by setting strip.background to element_blank():

data <- mtcars %>%
  mutate(
    Logarithmic = log(mpg),
    Inverse = 1 / mpg,
    Cubic = mpg ^ 3,
    Original = mpg
  ) %>%
  tidyr::gather(transformation, mpg2, Logarithmic:Original)

ggplot(data, aes(mpg2, disp)) +
  geom_point() +
  facet_wrap(~transformation, scales = "free", switch = "x") +
  theme(strip.background = element_blank())

Wrap labeller

A longstanding issue in ggplot2 was that facet_wrap() did not support custom labellers. Labellers are small functions that make it easy to customise the labels. You can now supply labellers to both wrap and grid facets:

ggplot(data, aes(mpg2, disp)) +
  geom_point() +
  facet_wrap(~transformation, scales = "free", labeller = "label_both")

Composite margins

Labellers now have better support for composite margins when you facet over multiple variables with +. All labellers gain a multi_line argument to control whether labels should be displayed as a single line or over multiple lines, one for each factor.
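
For example, a minimal sketch of turning multi-line labels off via the labeller() helper:

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_grid(am ~ vs + cyl, labeller = labeller(.multi_line = FALSE))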

The labellers still work the same way except for label_bquote(). That labeller makes it easy to write mathematical expressions involving the values of facetted factors. Historically, label_bquote() could only specify a single expression for all margins and factors, and the factor value was referred to via the backquoted placeholder .(x). Now that it supports expressions combining multiple factors, you must backquote the variable names themselves. In addition, you can provide different expressions for each margin:

my_labeller <- label_bquote(
  rows = .(am) / alpha,
  cols = .(vs) ^ .(cyl)
)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_grid(am ~ vs + cyl, labeller = my_labeller)

Documentation

I’ve given the documentation a thorough overhaul:

  • Tightly linked geoms and stats (e.g. geom_boxplot() and stat_boxplot()) are now documented in the same file so you can see all the arguments in one place. Similarly, variations on a theme (like geom_path(), geom_line(), and geom_step()) are documented together.
  • I’ve tried to reduce the use of ... so that you can see all the documentation in one place rather than having to follow links around. In some cases this has involved adding additional arguments to geoms to make it more clear what you can do.
  • Thanks to Bob Rudis, the use of qplot() in examples has been greatly reduced. This is in line with the 2nd edition of the ggplot2 book, which eliminates qplot() in favour of ggplot().

Deprecated features

  • The order aesthetic is officially deprecated. It never really worked, and was poorly documented.
  • The stat and position arguments to qplot() have been deprecated. qplot() is designed for quick plots – if you need to specify position or stat, use ggplot() instead.
  • The theme setting axis.ticks.margin has been deprecated: now use the margin property of axis.ticks.
  • stat_abline(), stat_hline() and stat_vline() have been removed: these were never suitable for use other than with their corresponding geoms and were not documented.
  • show_guide has been renamed to show.legend: this more accurately reflects what it does (controls appearance of layer in legend), and uses the same convention as other ggplot2 arguments (i.e. a . between names). (Yes, I know that’s inconsistent with function names (which use _) but it’s too late to change now.)

A number of geoms have been renamed to be more consistent. The previous names will continue to work for the foreseeable future, but you should switch to the new names for new work.

  • stat_binhex() and stat_bin2d() have been renamed to stat_bin_hex() and stat_bin_2d(). stat_summary2d() has been renamed to stat_summary_2d(), geom_density2d()/stat_density2d() has been renamed to geom_density_2d()/stat_density_2d().
  • stat_spoke() is now geom_spoke() since I realised it’s a reparameterisation of geom_segment().
  • stat_bindot() has been removed because it’s so tightly coupled to geom_dotplot(). If you happened to use stat_bindot(), just change to geom_dotplot().

All defunct functions have been removed.

I’m pleased to announce a new package for producing SVGs from R: svglite. This package is a fork of Matthieu Decorde’s RSvgDevice and wouldn’t be possible without his hard work. I’d also like to thank David Gohel, who wrote the gdtools package: it solves all the hardest problems associated with making good SVGs from R.

Today, most browsers have good support for SVG and it is a great way of displaying vector graphics on the web. Unfortunately, R’s built-in svg() device is focussed on high quality rendering, not size or speed. It renders text as individual polygons: this ensures a graphic will look exactly the same regardless of what fonts you have installed, but makes output considerably larger (and harder to edit in other tools). svglite produces hand-optimised SVG that is as small as possible.

Features

svglite is a complete graphics device: that means you can give it any graphic and it will look the same as the equivalent .pdf or .png. Please file an issue if you discover a plot that doesn’t look right.

Use

In an interactive session, you use it like any other R graphics device:

svglite::svglite("myfile.svg")
plot(runif(10), runif(10))
dev.off()

If you want to use it in knitr, just set your chunk options as follows:

```{r setup, include = FALSE}
library(svglite)
knitr::opts_chunk$set(
  dev = "svglite",
  fig.ext = ".svg"
)
```

(Thanks to Bob Rudis for the tip)

There are also a few helper functions:

  • htmlSVG() makes it easy to preview the SVG in RStudio.
  • editSVG() opens the SVG file in your default SVG editor.
  • xmlSVG() returns the SVG as an xml2 object.
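
For example, a quick sketch of two of these helpers (each takes plotting code as its argument):

library(svglite)
htmlSVG(plot(1:10))      # preview the SVG in the RStudio viewer
s <- xmlSVG(plot(1:10))  # capture the SVG as an xml2 document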