I’m pleased to announce that dplyr 0.7.0 is now on CRAN! (This was dplyr 0.6.0 previously; more on that below.) dplyr provides a “grammar” of data transformation, making it easy and elegant to solve the most common data manipulation challenges. dplyr supports multiple backends: as well as in-memory data frames, you can also use it with remote SQL databases. If you haven’t heard of dplyr before, the best place to start is the Data transformation chapter in R for Data Science.

You can install the latest version of dplyr with:

install.packages("dplyr")

Features

dplyr 0.7.0 is a major release including over 100 improvements and bug fixes, as described in the release notes. In this blog post, I want to discuss one big change and a handful of smaller updates. This version of dplyr also saw a major revamp of database connections. That’s a big topic, so it’ll get its own blog post next week.

Tidy evaluation

The biggest change is a new system for programming with dplyr, called tidy evaluation, or tidy eval for short. Tidy eval is a system for capturing expressions and later evaluating them in the correct context. It is important because it allows you to interpolate values in contexts where dplyr usually works with expressions:

my_var <- quo(homeworld)

starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)
#> # A tibble: 49 x 3
#>         homeworld   height  mass
#>             <chr>    <dbl> <dbl>
#>  1       Alderaan 176.3333  64.0
#>  2    Aleen Minor  79.0000  15.0
#>  3         Bespin 175.0000  79.0
#>  4     Bestine IV 180.0000 110.0
#>  5 Cato Neimoidia 191.0000  90.0
#>  6          Cerea 198.0000  82.0
#>  7       Champala 196.0000   NaN
#>  8      Chandrila 150.0000   NaN
#>  9   Concord Dawn 183.0000  79.0
#> 10       Corellia 175.0000  78.5
#> # ... with 39 more rows

This makes it possible to write your own functions that work like dplyr functions, reducing the amount of copy-and-paste in your code:

starwars_mean <- function(my_var) {
  my_var <- enquo(my_var)
  
  starwars %>%
    group_by(!!my_var) %>%
    summarise_at(vars(height:mass), mean, na.rm = TRUE)
}
starwars_mean(homeworld)

You can also use the new .data pronoun to refer to variables with strings:

my_var <- "homeworld"

starwars %>%
  group_by(.data[[my_var]]) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)

This is useful when you’re writing packages that use dplyr code because it avoids an annoying note from R CMD check.
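As a minimal sketch of how this might look inside a package, the function below takes the grouping column as a string. The name mean_mass_by() and its arguments are hypothetical and exist only for illustration; they are not part of dplyr. In a real package you would also import .data from rlang so that R CMD check recognises it.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): `group_col` is a column name
# supplied as a string and looked up through the .data pronoun.
mean_mass_by <- function(data, group_col) {
  data %>%
    group_by(.data[[group_col]]) %>%
    summarise(mass = mean(mass, na.rm = TRUE))
}

by_species <- mean_mass_by(starwars, "species")
by_species
```

Because the column is referred to through .data rather than as a bare name, the function body contains no undefined global variable for R CMD check to complain about (mass would still need the same treatment in package code).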

To learn more about how tidy eval helps solve data analysis challenges, please read the new programming with dplyr vignette. Tidy evaluation is implemented in the rlang package, which also provides a vignette on the theoretical underpinnings. Tidy eval is a rich system and it takes a while to get your head around, but we are confident that learning tidy eval will pay off, especially as it rolls out to other packages in the tidyverse (tidyr and ggplot2 are next on the todo list).

The introduction of tidy evaluation means that the standard evaluation (underscored) version of each main verb (filter_(), select_() etc) is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility).
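If you have code that relies on the underscored verbs, one possible translation is sketched below. It assumes the column name arrives as a string in a variable called var (a name chosen here purely for illustration) and uses the .data pronoun and rlang::sym() in place of filter_() and select_().

```r
library(dplyr)

var <- "cyl"

# Previously something like: filter_(mtcars, ~cyl == 4) or select_(mtcars, var)
four_cyl <- mtcars %>% filter(.data[[var]] == 4)
only_var <- mtcars %>% select(!!rlang::sym(var))

nrow(four_cyl)
names(only_var)
```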

Character encoding

We have done a lot of work to ensure that dplyr works with encodings other than Latin1 on Windows. This is most likely to affect you if you work with data that contains Chinese, Japanese, or Korean (CJK) characters. dplyr should now just work with such data. Please let us know if you have problems!

New datasets

dplyr has some new datasets that will help write more interesting examples:

  • starwars, shown above, contains information about characters from the Star Wars movies, sourced from the Star Wars API. It contains a number of list-columns.
    starwars
    #> # A tibble: 87 x 13
    #>                  name height  mass    hair_color  skin_color eye_color
    #>                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
    #>  1     Luke Skywalker    172    77         blond        fair      blue
    #>  2              C-3PO    167    75          <NA>        gold    yellow
    #>  3              R2-D2     96    32          <NA> white, blue       red
    #>  4        Darth Vader    202   136          none       white    yellow
    #>  5        Leia Organa    150    49         brown       light     brown
    #>  6          Owen Lars    178   120   brown, grey       light      blue
    #>  7 Beru Whitesun lars    165    75         brown       light      blue
    #>  8              R5-D4     97    32          <NA>  white, red       red
    #>  9  Biggs Darklighter    183    84         black       light     brown
    #> 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
    #> # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
    #> #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
    #> #   vehicles <list>, starships <list>
  • storms has the trajectories of ~200 tropical storms. It contains a strong grouping structure.
    storms
    #> # A tibble: 10,010 x 13
    #>     name  year month   day  hour   lat  long              status category
    #>    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>               <chr>    <ord>
    #>  1   Amy  1975     6    27     0  27.5 -79.0 tropical depression       -1
    #>  2   Amy  1975     6    27     6  28.5 -79.0 tropical depression       -1
    #>  3   Amy  1975     6    27    12  29.5 -79.0 tropical depression       -1
    #>  4   Amy  1975     6    27    18  30.5 -79.0 tropical depression       -1
    #>  5   Amy  1975     6    28     0  31.5 -78.8 tropical depression       -1
    #>  6   Amy  1975     6    28     6  32.4 -78.7 tropical depression       -1
    #>  7   Amy  1975     6    28    12  33.3 -78.0 tropical depression       -1
    #>  8   Amy  1975     6    28    18  34.0 -77.0 tropical depression       -1
    #>  9   Amy  1975     6    29     0  34.4 -75.8      tropical storm        0
    #> 10   Amy  1975     6    29     6  34.0 -74.8      tropical storm        0
    #> # ... with 10,000 more rows, and 4 more variables: wind <int>,
    #> #   pressure <int>, ts_diameter <dbl>, hu_diameter <dbl>
  • band_members, band_instruments and band_instruments2 have a tiny amount of data about bands. They’re designed to be very simple so you can illustrate how joins work without getting distracted by the details of the data.
    band_members
    #> # A tibble: 3 x 2
    #>    name    band
    #>   <chr>   <chr>
    #> 1  Mick  Stones
    #> 2  John Beatles
    #> 3  Paul Beatles
    band_instruments
    #> # A tibble: 3 x 2
    #>    name  plays
    #>   <chr>  <chr>
    #> 1  John guitar
    #> 2  Paul   bass
    #> 3 Keith guitar

New and improved verbs

  • The pull() generic allows you to extract a single column either by name or position. It’s similar to select() but returns a vector, rather than a smaller tibble.
    mtcars %>% pull(-1) %>% str()
    #>  num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
    mtcars %>% pull(cyl) %>% str()
    #>  num [1:32] 6 6 4 6 8 6 8 4 4 6 ...

    Thanks to Paul Poncet for the idea!

  • arrange() for grouped data frames gains a .by_group argument so you can choose to sort by groups if you want to (defaults to FALSE).
  • All single table verbs now have scoped variants suffixed with _if(), _at() and _all(). Use these if you want to do something to every variable (_all), variables selected by their names (_at), or variables that satisfy some predicate (_if).
    iris %>% summarise_if(is.numeric, mean)
    starwars %>% select_if(Negate(is.list))
    storms %>% group_by_at(vars(month:hour))
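To make the difference between pull() and select() concrete, here is a small sketch using mtcars. The object names cyl_vec and cyl_df are just illustrative.

```r
library(dplyr)

# pull() returns the column itself (a plain vector) ...
cyl_vec <- mtcars %>% pull(cyl)
# ... while select() returns a data frame with a single column.
cyl_df <- mtcars %>% select(cyl)

is.numeric(cyl_vec)
is.data.frame(cyl_df)

# Negative positions count from the right, so pull(-1) is the last column.
last_col <- mtcars %>% pull(-1)
identical(last_col, mtcars$carb)
```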

Other important changes

  • Local join functions can now control how missing values are matched. The default value is na_matches = "na", which treats two missing values as equal. To prevent missing values from matching, use na_matches = "never".

    You can change the default behaviour by calling pkgconfig::set_config("dplyr::na_matches", "never").

  • bind_rows() and combine() are more strict when coercing. Logical values are no longer coerced to integer and numeric. Date, POSIXct and other integer or double-based classes are no longer coerced to integer or double to avoid dropping important metadata. We plan to continue improving this interface in the future.
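The effect of na_matches is easiest to see on a tiny example. The two data frames below are made up for illustration; each has one missing key.

```r
library(dplyr)

x <- data.frame(key = c(1, NA), x_val = c("a", "b"), stringsAsFactors = FALSE)
y <- data.frame(key = c(1, NA), y_val = c("c", "d"), stringsAsFactors = FALSE)

# Default: the NA in x matches the NA in y, so both rows survive.
matched_na <- inner_join(x, y, by = "key")

# With "never", missing keys match nothing and only key == 1 is kept.
never_na <- inner_join(x, y, by = "key", na_matches = "never")

nrow(matched_na)
nrow(never_na)
```

The "never" behaviour is closer to what you would get from a SQL database, where NULL keys do not match each other.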

Breaking changes

From time to time I discover that I made a mistake in an older version of dplyr and developed what is now a clearly suboptimal API. If the problem isn’t too big, I try to just leave it – the cost of making small improvements is not worth it when compared to the cost of breaking existing code. However, there are bigger improvements where I believe the short-term pain of breaking code is worth the long-term payoff of a better API.

Regardless, it’s still frustrating when an update to dplyr breaks your code. To minimise this pain, I plan to do two things going forward:

  • Adopt an odd-even release cycle so that API breaking changes only occur in odd numbered releases. Even numbered releases will only contain bug fixes and new features. This is why I’ve skipped dplyr 0.6.0 and gone directly to dplyr 0.7.0.
  • Invest time in developing better tools for isolating packages across projects so that you can choose when to upgrade a package on a project-by-project basis, and if something goes wrong, easily roll back to a version that worked. Look for news about this later in the year.

Contributors

dplyr is truly a community effort. Apart from the dplyr team (myself, Kirill Müller, and Lionel Henry), this release wouldn’t have been possible without patches from Christophe Dervieux, Dean Attali, Ian Cook, Ian Lyttle, Jake Russ, Jay Hesselberth, Jennifer (Jenny) Bryan, @lindbrook, Mauro Lepore, Nicolas Coutin, Daniel, Tony Fischetti, Hiroaki Yutani and Sergio Oller. Thank you all for your contributions!

I’m very pleased to announce ggplot2 2.2.0. It includes four major new features:

  • Subtitles and captions.
  • A large rewrite of the facetting system.
  • Improved theme options.
  • Better stacking.

It also includes numerous bug fixes and minor improvements, as described in the release notes.

The majority of this work was carried out by Thomas Pedersen, who I was lucky to have as my “ggplot2 intern” this summer. Make sure to check out his other visualisation packages: ggraph, ggforce, and tweenr.

Install ggplot2 with:

install.packages("ggplot2")

Subtitles and captions

Thanks to Bob Rudis, you can now add subtitles and captions to your plots:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE, method = "loess") +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )

 

These are controlled by the theme settings plot.subtitle and plot.caption.

The plot title is now aligned to the left by default. To return to the previous centered alignment, use theme(plot.title = element_text(hjust = 0.5)).

Facets

The facet and layout implementation has been moved to ggproto and received a large rewrite and refactoring. This will allow others to create their own facetting systems, as described in vignette("extending-ggplot2"). Along with the rewrite, a number of features and improvements have been added, most notably:

  • You can now use functions in facetting formulas, thanks to Dan Ruderman.
    ggplot(diamonds, aes(carat, price)) + 
      geom_hex(bins = 20) + 
      facet_wrap(~cut_number(depth, 6))

  • Axes are now drawn under the panels in facet_wrap() when the rectangle is not completely filled.
    ggplot(mpg, aes(displ, hwy)) + 
      geom_point() + 
      facet_wrap(~class)

  • You can set the position of the axes with the position argument.
    ggplot(mpg, aes(displ, hwy)) + 
      geom_point() + 
      scale_x_continuous(position = "top") + 
      scale_y_continuous(position = "right")

  • You can display a secondary axis that is a one-to-one transformation of the primary axis with sec.axis.
    ggplot(mpg, aes(displ, hwy)) + 
      geom_point() + 
      scale_y_continuous(
        "mpg (US)", 
        sec.axis = sec_axis(~ . * 1.20, name = "mpg (UK)")
      )

     

  • Strips can be placed on any side, and the placement with respect to axes can be controlled with the strip.placement theme option.
    ggplot(mpg, aes(displ, hwy)) + 
      geom_point() + 
      facet_wrap(~ drv, strip.position = "bottom") + 
      theme(
        strip.placement = "outside",
        strip.background = element_blank(),
        strip.text = element_text(face = "bold")
      ) +
      xlab(NULL)

Theming

  • The theme() function now has named arguments so autocomplete and documentation suggestions are vastly improved.
  • Blank elements can now be overridden again so you get the expected behavior when setting e.g. axis.line.x.
  • element_line() gets an arrow argument that lets you put arrows on axes.
    arrow <- arrow(length = unit(0.4, "cm"), type = "closed")
    
    ggplot(mpg, aes(displ, hwy)) + 
      geom_point() + 
      theme_minimal() + 
      theme(
        axis.line = element_line(arrow = arrow)
      )

  • Control of legend styling has been improved. The whole legend area can be aligned with the plot area and a box can be drawn around all legends:
    ggplot(mpg, aes(displ, hwy, shape = drv, colour = fl)) + 
      geom_point() + 
      theme(
        legend.justification = "top", 
        legend.box = "horizontal",
        legend.box.margin = margin(3, 3, 3, 3, "mm"), 
        legend.margin = margin(),
        legend.box.background = element_rect(colour = "grey50")
      )

  • panel.margin and legend.margin have been renamed to panel.spacing and legend.spacing respectively, as this better indicates their roles. A new legend.margin actually controls the margin around each legend.
  • When computing the height of titles, ggplot2 now includes the height of the descenders (i.e. the bits of g and y that hang underneath). This improves the margins around titles, particularly the y axis label. I have also very slightly increased the inner margins of axis titles, and removed the outer margins.
  • The default themes have been tweaked by Jean-Olivier Irisson, making them better match theme_grey().
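As a rough sketch of the renamed theme settings in use, the plot below sets panel.spacing and legend.spacing explicitly. The particular values (1 and 0.5 lines) are arbitrary choices for illustration, not recommended defaults.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  facet_wrap(~class) +
  theme(
    # Formerly panel.margin: the gap between facet panels.
    panel.spacing = unit(1, "lines"),
    # Formerly legend.margin: the gap between separate legends.
    legend.spacing = unit(0.5, "lines"),
    # The new legend.margin: the margin around each individual legend.
    legend.margin = margin(2, 2, 2, 2, "mm")
  )
p
```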

Stacking bars

position_stack() and position_fill() now stack values in the reverse order of the grouping, which makes the default stack order match the legend.

library(dplyr)

avg_price <- diamonds %>% 
  group_by(cut, color) %>% 
  summarise(price = mean(price)) %>% 
  ungroup() %>% 
  mutate(price_rel = price - mean(price))

ggplot(avg_price) + 
  geom_col(aes(x = cut, y = price, fill = color))

(Note also the new geom_col() which is short-hand for geom_bar(stat = "identity"), contributed by Bob Rudis.)

If you want to stack in the opposite order, try forcats::fct_rev():

ggplot(avg_price) + 
  geom_col(aes(x = cut, y = price, fill = forcats::fct_rev(color)))

Additionally, you can now stack negative values:

ggplot(avg_price) + 
  geom_col(aes(x = cut, y = price_rel, fill = color))

The overall ordering cannot necessarily be matched in the presence of negative values, but the ordering on either side of the x-axis will match.

Labels can also be stacked, but the default position is suboptimal:

series <- data.frame(
  time = c(rep(1, 4),rep(2, 4), rep(3, 4), rep(4, 4)),
  type = rep(c('a', 'b', 'c', 'd'), 4),
  value = rpois(16, 10)
)

ggplot(series, aes(time, value, group = type)) +
  geom_area(aes(fill = type)) +
  geom_text(aes(label = type), position = "stack")

You can improve the position with the vjust parameter. A vjust of 0.5 will center the labels inside the corresponding area:

ggplot(series, aes(time, value, group = type)) +
  geom_area(aes(fill = type)) +
  geom_text(aes(label = type), position = position_stack(vjust = 0.5))

We are pleased to announce that version 1.0.0 of the memoise package is now available on CRAN. Memoization stores the value of a function call and returns the cached result when the function is called again with the same arguments.

The following function computes Fibonacci numbers and illustrates the usefulness of memoization. Because the function definition is recursive, the intermediate results can be looked up rather than recalculated at each level of recursion, which reduces the runtime drastically. The last time the memoised function is called the final result can simply be returned, so no measurable execution time is recorded.

library(memoise)

fib <- function(n) {
  if (n < 2) {
    return(n)
  } else {
    return(fib(n-1) + fib(n-2))
  }
}
system.time(x <- fib(30))
#>    user  system elapsed 
#>   4.454   0.010   4.472
fib <- memoise(fib)
system.time(y <- fib(30))
#>    user  system elapsed 
#>   0.004   0.000   0.004
system.time(z <- fib(30))
#>    user  system elapsed 
#>       0       0       0
all.equal(x, y)
#> [1] TRUE
all.equal(x, z)
#> [1] TRUE

Memoization is also very useful for storing queries to external resources, such as network APIs and databases.
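The sketch below stands in for such an external resource with a plain function that counts how often it is really called. The names lookup(), cached_lookup() and calls are hypothetical; in practice lookup() would be your API or database query.

```r
library(memoise)

calls <- 0

# Hypothetical stand-in for an expensive external query.
lookup <- function(id) {
  calls <<- calls + 1
  paste0("record-", id)
}

cached_lookup <- memoise(lookup)

cached_lookup(1)  # first call with id 1: runs lookup()
cached_lookup(1)  # same argument: served from the cache
cached_lookup(2)  # new argument: runs lookup() again

calls             # lookup() itself ran only twice
```

If the remote data can change, forget(cached_lookup) clears the cache so the next call queries the resource again.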

Improvements in this release make memoised functions much nicer to use interactively. Memoised functions now have a print method which outputs the original function definition rather than the memoization code.

mem_sum <- memoise(sum)
mem_sum
#> Memoised Function:
#> function (..., na.rm = FALSE)  .Primitive("sum")

Memoised functions now forward their arguments from the original function rather than simply passing them with .... This allows autocompletion to work transparently for memoised functions and also fixes a bug related to non-constant default arguments. [1]

mem_scan <- memoise(scan)
args(mem_scan)
#> function (file = "", what = double(), nmax = -1L, n = -1L, sep = "", 
#>     quote = if (identical(sep, "\n")) "" else "'\"", dec = ".", 
#>     skip = 0L, nlines = 0L, na.strings = "NA", flush = FALSE, 
#>     fill = FALSE, strip.white = FALSE, quiet = FALSE, blank.lines.skip = TRUE, 
#>     multi.line = TRUE, comment.char = "", allowEscapes = FALSE, 
#>     fileEncoding = "", encoding = "unknown", text, skipNul = FALSE) 
#> NULL

Memoisation can now depend on external variables aside from the function arguments. This feature can be used in a variety of ways, such as invalidating the memoisation when a new package is attached.

mem_f <- memoise(runif, ~search())
mem_f(2)
#> [1] 0.009113091 0.988083122
mem_f(2)
#> [1] 0.009113091 0.988083122
library(ggplot2)
mem_f(2)
#> [1] 0.89150566 0.01128355

Or invalidating the memoisation after a given amount of time has elapsed. A timeout() helper function is provided to make this feature easier to use.

mem_f <- memoise(runif, ~timeout(10))
mem_f(2)
#> [1] 0.6935329 0.3584699
mem_f(2)
#> [1] 0.6935329 0.3584699
Sys.sleep(10)
mem_f(2)
#> [1] 0.2008418 0.4538413

A great amount of thanks for this release goes to Kirill Müller, who wrote the argument forwarding implementation and added comprehensive tests to the package. [2, 3]

See the release notes for a complete list of changes.

testthat 0.10.0 is now available on CRAN. Testthat makes it easy to turn the informal testing that you’re already doing into formal automated tests. Learn more at http://r-pkgs.had.co.nz/tests.html. Install the latest version with:

install.packages("testthat")

There are four big changes in this release:

  • test_check() uses a new reporter specifically designed for R CMD check. It displays a summary at the end of the tests, designed to be <13 lines long so test failures in R CMD check display are as useful as possible.
  • New skip_if_not_installed() skips tests if a package isn’t installed: this is useful if you want tests to skip if a suggested package isn’t installed.
  • The expect_that(a, equals(b)) style of testing has been soft-deprecated in favour of expect_equal(a, b). It will keep working, but it’s no longer demonstrated in the documentation, and new expectations will only be available in expect_equal(a, b) style.
  • compare() is now documented and exported: compare is used to display test failures for expect_equal(), and is designed to help you spot exactly where the failure occurred. It currently has methods for character and numeric vectors.
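A short sketch of the newer style, together with skip_if_not_installed(), might look like the following. The test descriptions are made up, and jsonlite is used only as an example of a suggested package.

```r
library(testthat)

test_that("addition works", {
  # Preferred style: expect_equal(a, b) rather than expect_that(a, equals(b)).
  expect_equal(1 + 1, 2)
})

test_that("JSON parsing works when jsonlite is available", {
  # Skipped (not failed) if the suggested package is missing.
  skip_if_not_installed("jsonlite")
  parsed <- jsonlite::fromJSON('{"a": 1}')
  expect_true(is.list(parsed))
})
```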

There were a number of other minor improvements and bug fixes. See the release notes for a complete list.

I’m very pleased to announce that dplyr 0.4.0 is now available from CRAN. Get the latest version by running:

install.packages("dplyr")

dplyr 0.4.0 includes over 80 minor improvements and bug fixes, which are described in detail in the release notes. Here I wanted to draw your attention to two areas that have particularly improved since dplyr 0.3: two-table verbs and data frame support.

Two table verbs

dplyr now has full support for all two-table verbs provided by SQL:

  • Mutating joins, which add new variables to one table from matching rows in another: inner_join(), left_join(), right_join(), full_join(). (Support for non-equi joins is planned for dplyr 0.5.0.)
  • Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table: semi_join(), anti_join().
  • Set operations, which combine the observations in two data sets as if they were set elements: intersect(), union(), setdiff().

Together, these verbs should allow you to solve 95% of data manipulation problems that involve multiple tables. If any of the concepts are unfamiliar to you, I highly recommend reading the two-table vignette (and if you still don’t understand, please let me know so I can make it better.)
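Here is a small sketch of the three families side by side. The members and instruments data frames are invented for illustration.

```r
library(dplyr)

members <- data.frame(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles"),
  stringsAsFactors = FALSE
)
instruments <- data.frame(
  name = c("John", "Paul", "Keith"),
  plays = c("guitar", "bass", "guitar"),
  stringsAsFactors = FALSE
)

# Mutating join: keep every member, add `plays` where there is a match.
joined <- left_join(members, instruments, by = "name")

# Filtering joins: keep or drop members based on a match, adding no columns.
with_instrument <- semi_join(members, instruments, by = "name")
without_instrument <- anti_join(members, instruments, by = "name")

# Set operations: treat whole rows as set elements.
first <- data.frame(x = 1:3)
second <- data.frame(x = 2:4)
common <- intersect(first, second)
either <- union(first, second)
only_first <- setdiff(first, second)
```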

Data frames

dplyr wraps data frames in a tbl_df class. These objects are structured in exactly the same way as regular data frames, but their behaviour has been tweaked a little to make them easier to work with. The new data_frames vignette describes how dplyr works with data frames in general, and below I highlight some of the features new in 0.4.0.

Printing

The biggest difference is printing: print.tbl_df() doesn’t try and print 10,000 rows! Printing got a lot of love in dplyr 0.4 and now:

  • All print() methods invisibly return their input so you can interleave print() statements into a pipeline to see interim results.
  • If you’ve managed to produce a 0-row data frame, dplyr won’t try to print the data, but will tell you the column names and types:
    data_frame(x = numeric(), y = character())
    #> Source: local data frame [0 x 2]
    #> 
    #> Variables not shown: x (dbl), y (chr)
  • dplyr never prints row names since no dplyr method is guaranteed to preserve them:
    df <- data.frame(x = c(a = 1, b = 2, c = 3))
    df
    #>   x
    #> a 1
    #> b 2
    #> c 3
    df %>% tbl_df()
    #> Source: local data frame [3 x 1]
    #> 
    #>   x
    #> 1 1
    #> 2 2
    #> 3 3

    I don’t think using row names is a good idea because it violates one of the principles of tidy data: every variable should be stored in the same way.

    To make life a bit easier if you do have row names, you can use the new add_rownames() to turn your row names into a proper variable:

    df %>% 
      add_rownames()
    #>   rowname x
    #> 1       a 1
    #> 2       b 2
    #> 3       c 3

    (But you’re better off never creating them in the first place.)

  • options(dplyr.print_max) is now 20, so dplyr will never print more than 20 rows of data (previously it was 100). The best way to see more rows of data is to use View().
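Because print() returns its input invisibly, you can drop it into the middle of a pipeline. A minimal sketch using mtcars (the threshold of 20 is arbitrary):

```r
library(dplyr)

result <- mtcars %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg)) %>%
  print() %>%          # shows the interim summary, then passes it on
  filter(mpg > 20)

result
```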

Coercing lists to data frames

When you have a list of vectors of equal length that you want to turn into a data frame, dplyr provides as_data_frame() as a simple alternative to as.data.frame(). as_data_frame() is considerably faster than as.data.frame() because it does much less:

l <- replicate(26, sample(100), simplify = FALSE)
names(l) <- letters
microbenchmark::microbenchmark(
  as_data_frame(l),
  as.data.frame(l)
)
#> Unit: microseconds
#>              expr      min        lq   median        uq      max neval
#>  as_data_frame(l)  101.856  112.0615  124.855  143.0965  254.193   100
#>  as.data.frame(l) 1402.075 1466.6365 1511.644 1635.1205 3007.299   100

It’s difficult to precisely describe what as.data.frame(x) does, but it’s similar to do.call(cbind, lapply(x, data.frame)) – it coerces each component to a data frame and then cbind()s them all together.

The speed of as.data.frame() is not usually a bottleneck in interactive use, but can be a problem when combining thousands of lists into one tidy data frame (this is common when working with data stored in json or xml).

Binding rows and columns

dplyr now provides bind_rows() and bind_cols() for binding data frames together. Compared to rbind() and cbind(), the functions:

  • Accept either individual data frames, or a list of data frames:
    a <- data_frame(x = 1:5)
    b <- data_frame(x = 6:10)
    
    bind_rows(a, b)
    #> Source: local data frame [10 x 1]
    #> 
    #>    x
    #> 1  1
    #> 2  2
    #> 3  3
    #> 4  4
    #> 5  5
    #> .. .
    bind_rows(list(a, b))
    #> Source: local data frame [10 x 1]
    #> 
    #>    x
    #> 1  1
    #> 2  2
    #> 3  3
    #> 4  4
    #> 5  5
    #> .. .

    If x is a list of data frames, bind_rows(x) is equivalent to do.call(rbind, x).

  • Are much faster:
    dfs <- replicate(100, data_frame(x = runif(100)), simplify = FALSE)
    microbenchmark::microbenchmark(
      do.call("rbind", dfs),
      bind_rows(dfs)
    )
    #> Unit: microseconds
    #>                   expr      min        lq   median        uq       max
    #>  do.call("rbind", dfs) 5344.660 6605.3805 6964.236 7693.8465 43457.061
    #>         bind_rows(dfs)  240.342  262.0845  317.582  346.6465  2345.832
    #>  neval
    #>    100
    #>    100

(Generally you should avoid bind_cols() in favour of a join; otherwise check carefully that the rows are in a compatible order).

List-variables

Data frames are usually made up of a list of atomic vectors that all have the same length. However, it’s also possible to have a variable that’s a list, which I call a list-variable. Because of data.frame()’s complex coercion rules, the easiest way to create a data frame containing a list-variable is with data_frame():

data_frame(x = 1, y = list(1), z = list(list(1:5, "a", "b")))
#> Source: local data frame [1 x 3]
#> 
#>   x        y         z
#> 1 1 <dbl[1]> <list[3]>

Note how list-variables are printed: a list-variable could contain a lot of data, so dplyr only shows a brief summary of the contents. List-variables are useful for:

  • Working with summary functions that return more than one value:
    qs <- mtcars %>%
      group_by(cyl) %>%
      summarise(y = list(quantile(mpg)))
    
    # Unnest input to collapse into rows
    qs %>% tidyr::unnest(y)
    #> Source: local data frame [15 x 2]
    #> 
    #>    cyl    y
    #> 1    4 21.4
    #> 2    4 22.8
    #> 3    4 26.0
    #> 4    4 30.4
    #> 5    4 33.9
    #> .. ...  ...
    
    # To extract individual elements into columns, wrap the result in rowwise()
    # then use summarise()
    qs %>% 
      rowwise() %>% 
      summarise(q25 = y[2], q75 = y[4])
    #> Source: local data frame [3 x 2]
    #> 
    #>     q25   q75
    #> 1 22.80 30.40
    #> 2 18.65 21.00
    #> 3 14.40 16.25
  • Keeping associated data frames and models together:
    by_cyl <- split(mtcars, mtcars$cyl)
    models <- lapply(by_cyl, lm, formula = mpg ~ wt)
    
    data_frame(cyl = c(4, 6, 8), data = by_cyl, model = models)
    #> Source: local data frame [3 x 3]
    #> 
    #>   cyl            data   model
    #> 1   4 <S3:data.frame> <S3:lm>
    #> 2   6 <S3:data.frame> <S3:lm>
    #> 3   8 <S3:data.frame> <S3:lm>

dplyr’s support for list-variables continues to mature. In 0.4.0, you can join and row bind list-variables and you can create them in summarise and mutate.

My vision of list-variables is still partial and incomplete, but I’m convinced that they will make pipeable APIs for modelling much easier. See the draft lowliner package for more explorations in this direction.

Bonus

My colleague, Garrett, helped me make a cheat sheet that summarizes the data wrangling features of dplyr 0.4.0. You can download it from RStudio’s new gallery of R cheat sheets.

httr 0.6.0 is now available on CRAN. The httr package makes it easy to talk to web APIs from R. Learn more in the quick start vignette.

This release is mostly bug fixes and minor improvements. The most important are:

  • handle_reset(), which allows you to reset the default handle if you get the error “easy handle already used in multi handle”.
  • write_stream() which lets you process the response from a server as a stream of raw vectors (#143).
  • VERB() allows you to send a request with a custom http verb.
  • brew_dr() checks for common problems. It currently checks if your libcurl uses NSS. This is unlikely to work so it gives you some advice on how to fix the problem (thanks to Dirk Eddelbuettel for debugging this problem and suggesting a remedy).
  • Added support for Google OAuth2 service accounts. (#119, thanks to help from @siddharthab). See ?oauth_service_token for details.

I’ve also switched from RC to R6 (which should make it easier to extend OAuth classes for non-standard OAuth implementations), and tweaked the use of the backend SSL certificate details bundled with httr. See the release notes for complete details.

ggvis 0.4 is now available on CRAN. You can install it with:

install.packages("ggvis")

The major features of this release are:

  • Boxplots, with layer_boxplots()
    chickwts %>% ggvis(~feed, ~weight) %>% layer_boxplots()

  • Better stability when errors occur.
  • Better handling of empty data and malformed data.
  • More consistent handling of data in compute pipeline functions.

Because of these changes, interactive graphics with dynamic data sources will work more reliably.

Additionally, there are many small improvements and bug fixes under the hood. You can see the full change log here.

RStudio is planning a new Master R Developer Workshop to be taught by Hadley Wickham in the San Francisco Bay Area on January 19-20. This will be the same workshop that Hadley is teaching in September in New York City to a sold out audience.

If you did not get a chance to register for the NYC workshop but wished to, consider attending the January Bay Area workshop. We will open registration once we have planned out all of the event details. If you would like to be notified when registration opens, leave a contact address here.

tidyr is a new package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages). The two most important properties of tidy data are:

  • Each column is a variable.
  • Each row is an observation.

Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When you use tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.

To tidy messy data, you first identify the variables in your dataset, then use the tools provided by tidyr to move them into columns. tidyr provides three main functions for tidying your messy data: gather(), separate() and spread().

gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer. Other names for gather include melt (reshape2), pivot (spreadsheets) and fold (databases). Here’s an example of how you might use gather() on a made-up dataset. In this experiment we’ve given three people two different drugs and recorded their heart rate:

library(tidyr)
library(dplyr)

messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)
messy
#>      name  a  b
#> 1  Wilbur 67 56
#> 2 Petunia 80 90
#> 3 Gregory 64 50

We have three variables (name, drug and heartrate), but only name is currently in a column. We use gather() to gather the a and b columns into key-value pairs of drug and heartrate:

messy %>%
  gather(drug, heartrate, a:b)
#>      name drug heartrate
#> 1  Wilbur    a        67
#> 2 Petunia    a        80
#> 3 Gregory    a        64
#> 4  Wilbur    b        56
#> 5 Petunia    b        90
#> 6 Gregory    b        50

Sometimes two variables are clumped together in one column. separate() allows you to tease them apart (extract() works similarly but uses regexp groups instead of a splitting pattern or position). Take this example from stackoverflow (modified slightly for brevity). We have some measurements of how much time people spend on their phones, measured at two locations (work and home), at two times. Each person has been randomly assigned to either treatment or control.

set.seed(10)
messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c('control', 'treatment'), each = 2)),
  work.T1 = runif(4),
  home.T1 = runif(4),
  work.T2 = runif(4),
  home.T2 = runif(4)
)

To tidy this data, we first use gather() to turn columns work.T1, home.T1, work.T2 and home.T2 into a key-value pair of key and time. (Only the first eight rows are shown to save space.)

tidier <- messy %>%
  gather(key, time, -id, -trt)
tidier %>% head(8)
#>   id       trt     key    time
#> 1  1 treatment work.T1 0.08514
#> 2  2   control work.T1 0.22544
#> 3  3 treatment work.T1 0.27453
#> 4  4   control work.T1 0.27231
#> 5  1 treatment home.T1 0.61583
#> 6  2   control home.T1 0.42967
#> 7  3 treatment home.T1 0.65166
#> 8  4   control home.T1 0.56774

Next we use separate() to split the key into location and time, using a regular expression to describe the character that separates them.

tidy <- tidier %>%
  separate(key, into = c("location", "time"), sep = "\\.") 
tidy %>% head(8)
#>   id       trt location time    time
#> 1  1 treatment     work   T1 0.08514
#> 2  2   control     work   T1 0.22544
#> 3  3 treatment     work   T1 0.27453
#> 4  4   control     work   T1 0.27231
#> 5  1 treatment     home   T1 0.61583
#> 6  2   control     home   T1 0.42967
#> 7  3 treatment     home   T1 0.65166
#> 8  4   control     home   T1 0.56774

The last tool, spread(), takes two columns (a key-value pair) and spreads them into multiple columns, making “long” data wider. Spread is known by other names in other places: it’s cast in reshape2, unpivot in spreadsheets and unfold in databases. spread() is used when you have variables that form rows instead of columns. You need spread() less frequently than gather() or separate(), so to learn more, check out the documentation and the demos.
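As a quick sketch, spread() can undo the gather() from the heart rate example. The long data frame below simply retypes that gathered result so the snippet stands on its own.

```r
library(tidyr)

long <- data.frame(
  name = rep(c("Wilbur", "Petunia", "Gregory"), 2),
  drug = rep(c("a", "b"), each = 3),
  heartrate = c(67, 80, 64, 56, 90, 50),
  stringsAsFactors = FALSE
)

# `drug` supplies the new column names, `heartrate` fills the cells.
wide <- spread(long, drug, heartrate)
wide
```

The result has one row per person and one column per drug, the same information as the original messy data frame.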

Just as reshape2 did less than reshape, tidyr does less than reshape2. It’s designed specifically for tidying data, not general reshaping. In particular, existing methods only work for data frames, and tidyr never aggregates. This makes each function in tidyr simpler: each function does one thing well. For more complicated operations you can string together multiple simple tidyr and dplyr functions with %>%.

You can learn more about the underlying principles in my tidy data paper. To see more examples of data tidying, read the vignette, vignette("tidy-data"), or check out the demos, demo(package = "tidyr"). Alternatively, check out some of the great stackoverflow answers that use tidyr. Keep up-to-date with development at http://github.com/hadley/tidyr, report bugs at http://github.com/hadley/tidyr/issues and get help with data manipulation challenges at https://groups.google.com/group/manipulatr. If you ask a question specifically about tidyr on stackoverflow, please tag it with tidyr and I’ll make sure to read it.