I’m very pleased to announce that dplyr 0.4.0 is now available from CRAN. Get the latest version by running:
dplyr 0.4.0 includes over 80 minor improvements and bug fixes, which are described in detail in the release notes. Here I want to draw your attention to two areas that have particularly improved since dplyr 0.3: two-table verbs and data frame support.
Two-table verbs
dplyr now has full support for all two-table verbs provided by SQL:
- Mutating joins, which add new variables to one table from matching rows in another: inner_join(), left_join(), right_join(), full_join(). (Support for non-equi joins is planned for dplyr 0.5.0.)
- Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table: semi_join(), anti_join().
- Set operations, which combine the observations in two data sets as if they were set elements: intersect(), union(), setdiff().
Together, these verbs should allow you to solve 95% of data manipulation problems that involve multiple tables. If any of the concepts are unfamiliar to you, I highly recommend reading the two-table vignette (and if you still don’t understand, please let me know so I can make it better.)
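As a rough sketch of the three families of two-table verbs, here is one example of each on two small made-up tables. The tables people and pets and all the result names are hypothetical, and the sketch uses base data.frame() so that it does not depend on a particular dplyr version:

```r
library(dplyr)

# Hypothetical example tables (not from the release notes).
people <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  city = c("Austin", "Boston", "Chicago"),
  stringsAsFactors = FALSE
)
pets <- data.frame(
  name = c("Wilbur", "Petunia", "Petunia"),
  pet = c("cat", "dog", "fish"),
  stringsAsFactors = FALSE
)

# Mutating join: keeps every row of `people` and adds `pet` where a match exists.
with_pets <- left_join(people, pets, by = "name")

# Filtering joins: keep or drop rows of `people` without adding any columns.
owners  <- semi_join(people, pets, by = "name")  # people with at least one pet
petless <- anti_join(people, pets, by = "name")  # people with no pets

# Set operations treat whole rows as set elements.
a <- data.frame(x = 1:3)
b <- data.frame(x = 2:4)
both   <- intersect(a, b)
either <- union(a, b)
only_a <- setdiff(a, b)
```

Note that the mutating join can return more rows than its first input (Petunia has two pets), while the filtering joins never do.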
dplyr wraps data frames in a tbl_df class. These objects are structured in exactly the same way as regular data frames, but their behaviour has been tweaked a little to make them easier to work with. The new data_frames vignette describes how dplyr works with data frames in general, and below I highlight some of the features new in 0.4.0.
The biggest difference is printing: print.tbl_df() doesn’t try to print 10,000 rows! Printing got a lot of love in dplyr 0.4 and now:
- All print() methods invisibly return their input so you can interleave print() statements into a pipeline to see interim results.
- If you’ve managed to produce a 0-row data frame, dplyr won’t try to print the data, but will tell you the column names and types:
data_frame(x = numeric(), y = character())
#> Source: local data frame [0 x 2]
#>
#> Variables not shown: x (dbl), y (chr)
- dplyr never prints row names since no dplyr method is guaranteed to preserve them:
df <- data.frame(x = c(a = 1, b = 2, c = 3))
df
#>   x
#> a 1
#> b 2
#> c 3
df %>% tbl_df()
#> Source: local data frame [3 x 1]
#>
#>   x
#> 1 1
#> 2 2
#> 3 3
I don’t think using row names is a good idea because it violates one of the principles of tidy data: every variable should be stored in the same way.
To make life a bit easier if you do have row names, you can use the new add_rownames() to turn your row names into a proper variable:
df %>% add_rownames()
#>   rowname x
#> 1       a 1
#> 2       b 2
#> 3       c 3
(But you’re better off never creating them in the first place.)
- options(dplyr.print_max) is now 20, so dplyr will never print more than 20 rows of data (previously it was 100). The best way to see more rows of data is to use View().
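Because print() returns its input invisibly, it can sit in the middle of a pipeline without interrupting it. A small sketch using the built-in mtcars data (the name result is only for illustration):

```r
library(dplyr)

result <- mtcars %>%
  group_by(cyl) %>%
  print() %>%                       # shows the grouped data, then passes it on
  summarise(mean_mpg = mean(mpg))   # the pipeline continues as if print() were absent
```

Removing the print() line should leave result unchanged, which makes this a cheap way to inspect an interim step.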
Coercing lists to data frames
When you have a list of vectors of equal length that you want to turn into a data frame, dplyr provides as_data_frame() as a simple alternative to as.data.frame(). as_data_frame() is considerably faster than as.data.frame() because it does much less:
l <- replicate(26, sample(100), simplify = FALSE)
names(l) <- letters
microbenchmark::microbenchmark(
  as_data_frame(l),
  as.data.frame(l)
)
#> Unit: microseconds
#>              expr      min        lq   median        uq      max neval
#>  as_data_frame(l)  101.856  112.0615  124.855  143.0965  254.193   100
#>  as.data.frame(l) 1402.075 1466.6365 1511.644 1635.1205 3007.299   100
It’s difficult to precisely describe what as.data.frame(x) does, but it’s similar to do.call(cbind, lapply(x, data.frame)) – it coerces each component to a data frame and then cbind()s them all together.
The speed of as.data.frame() is not usually a bottleneck in interactive use, but can be a problem when combining thousands of lists into one tidy data frame (this is common when working with data stored in JSON or XML).
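To make the "many lists into one data frame" pattern concrete, here is a minimal sketch. The records list is made up to resemble the output of a JSON parser, and the sketch uses base as.data.frame() for the per-record coercion so it runs on any dplyr version; with thousands of records that per-record step is where the cost described above accumulates:

```r
library(dplyr)

# Hypothetical records, shaped like parsed JSON objects.
records <- list(
  list(id = 1L, name = "a", score = 0.5),
  list(id = 2L, name = "b", score = 0.7),
  list(id = 3L, name = "c", score = 0.9)
)

# Coerce each record to a one-row data frame, then stack the rows.
rows <- lapply(records, as.data.frame, stringsAsFactors = FALSE)
tidy <- bind_rows(rows)
```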
Binding rows and columns
dplyr now provides bind_rows() and bind_cols() for binding data frames together. Compared to rbind() and cbind(), the functions:
- Accept either individual data frames, or a list of data frames:
a <- data_frame(x = 1:5)
b <- data_frame(x = 6:10)
bind_rows(a, b)
#> Source: local data frame [10 x 1]
#>
#>    x
#> 1  1
#> 2  2
#> 3  3
#> 4  4
#> 5  5
#> .. .
bind_rows(list(a, b))
#> Source: local data frame [10 x 1]
#>
#>    x
#> 1  1
#> 2  2
#> 3  3
#> 4  4
#> 5  5
#> .. .
If x is a list of data frames, bind_rows(x) is equivalent to
- Are much faster:
dfs <- replicate(100, data_frame(x = runif(100)), simplify = FALSE)
microbenchmark::microbenchmark(
  do.call("rbind", dfs),
  bind_rows(dfs)
)
#> Unit: microseconds
#>                   expr      min        lq   median        uq       max
#>  do.call("rbind", dfs) 5344.660 6605.3805 6964.236 7693.8465 43457.061
#>         bind_rows(dfs)  240.342  262.0845  317.582  346.6465  2345.832
#>  neval
#>    100
#>    100
(Generally you should avoid bind_cols() in favour of a join; otherwise check carefully that the rows are in a compatible order).
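The warning about row order can be illustrated with a small sketch. The tables left and right are made up, and they deliberately list the same ids in a different order:

```r
library(dplyr)

left  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"), stringsAsFactors = FALSE)
right <- data.frame(id = c(3, 1, 2), y = c("C", "A", "B"), stringsAsFactors = FALSE)

# bind_cols() pairs rows purely by position, so id 1 is paired with the y of id 3.
by_position <- bind_cols(left, right["y"])

# A join pairs rows by key, so each id gets its own y regardless of row order.
by_key <- left_join(left, right, by = "id")
```

Here by_position silently mismatches every row, while by_key lines the values up correctly.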
Data frames are usually made up of a list of atomic vectors that all have the same length. However, it’s also possible to have a variable that’s a list, which I call a list-variable. Because of data.frame()’s complex coercion rules, the easiest way to create a data frame containing a list-variable is with data_frame():
data_frame(x = 1, y = list(1), z = list(list(1:5, "a", "b")))
#> Source: local data frame [1 x 3]
#>
#>   x        y         z
#> 1 1 <dbl[1]> <list[3]>
Note how list-variables are printed: a list-variable could contain a lot of data, so dplyr only shows a brief summary of the contents. List-variables are useful for:
- Working with summary functions that return more than one value:
qs <- mtcars %>%
  group_by(cyl) %>%
  summarise(y = list(quantile(mpg)))

# Unnest input to collapse into rows
qs %>% tidyr::unnest(y)
#> Source: local data frame [15 x 2]
#>
#>    cyl    y
#> 1    4 21.4
#> 2    4 22.8
#> 3    4 26.0
#> 4    4 30.4
#> 5    4 33.9
#> ..  ...  ...

# To extract individual elements into columns, wrap the result in rowwise()
# then use summarise(): the 2nd and 4th quantiles are the 25% and 75% points
qs %>%
  rowwise() %>%
  summarise(q25 = y[2], q75 = y[4])
#> Source: local data frame [3 x 2]
#>
#>     q25   q75
#> 1 22.80 30.40
#> 2 18.65 21.00
#> 3 14.40 16.25
- Keeping associated data frames and models together:
by_cyl <- split(mtcars, mtcars$cyl)
models <- lapply(by_cyl, lm, formula = mpg ~ wt)

data_frame(cyl = c(4, 6, 8), data = by_cyl, model = models)
#> Source: local data frame [3 x 3]
#>
#>   cyl            data   model
#> 1   4 <S3:data.frame> <S3:lm>
#> 2   6 <S3:data.frame> <S3:lm>
#> 3   8 <S3:data.frame> <S3:lm>
dplyr’s support for list-variables continues to mature. In 0.4.0, you can join and row bind list-variables and you can create them in summarise and mutate.
My vision of list-variables is still partial and incomplete, but I’m convinced that they will make pipeable APIs for modelling much easier. See the draft lowliner package for more explorations in this direction.
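One way to see why keeping models in a list-variable is convenient: once each model sits next to its group, per-model summaries become ordinary columns. This is only a sketch in base R (so it does not depend on data_frame()); the names fits and slope are made up for illustration:

```r
by_cyl <- split(mtcars, mtcars$cyl)
models <- lapply(by_cyl, lm, formula = mpg ~ wt)

# One row per group; assigning a list with one element per row creates a list-variable.
fits <- data.frame(cyl = as.numeric(names(models)))
fits$model <- unname(models)

# Pull one number out of each model and store it as a regular column.
fits$slope <- vapply(fits$model, function(m) coef(m)[["wt"]], numeric(1))
```

Each row of fits now carries the group, the fitted model and a value derived from it, so they cannot drift out of sync.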
My colleague, Garrett, helped me make a cheat sheet that summarizes the data wrangling features of dplyr 0.4.0. You can download it from RStudio’s new gallery of R cheat sheets.
httr 0.6.0 is now available on CRAN. The httr package makes it easy to talk to web APIs from R. Learn more in the quick start vignette.
This release is mostly bug fixes and minor improvements. The most important are:
- handle_reset(), which allows you to reset the default handle if you get the error “easy handle already used in multi handle”.
- write_stream(), which lets you process the response from a server as a stream of raw vectors (#143).
- VERB(), which allows you to send a request with a custom HTTP verb.
- brew_dr(), which checks for common problems. It currently checks if your libcurl uses NSS. This is unlikely to work, so it gives you some advice on how to fix the problem (thanks to Dirk Eddelbuettel for debugging this problem and suggesting a remedy).
- Added support for Google OAuth2 service accounts. (#119, thanks to help from @siddharthab). See
I’ve also switched from RC to R6 (which should make it easier to extend OAuth classes for non-standard OAuth implementations), and tweaked the use of the backend SSL certificate details bundled with httr. See the release notes for complete details.
ggvis 0.4 is now available on CRAN. You can install it with:
The major features of this release are:
- Boxplots, with layer_boxplots():
chickwts %>% ggvis(~feed, ~weight) %>% layer_boxplots()
- Better stability when errors occur.
- Better handling of empty data and malformed data.
- More consistent handling of data in compute pipeline functions.
Because of these changes, interactive graphics with dynamic data sources will work more reliably.
Additionally, there are many small improvements and bug fixes under the hood. You can see the full change log here.
RStudio is planning a new Master R Developer Workshop to be taught by Hadley Wickham in the San Francisco Bay Area on January 19-20. This will be the same workshop that Hadley is teaching in September in New York City to a sold out audience.
If you did not get a chance to register for the NYC workshop but wished to, consider attending the January Bay Area workshop. We will open registration once we have planned out all of the event details. If you would like to be notified when registration opens, leave a contact address here.
tidyr is a new package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages). The two most important properties of tidy data are:
- Each column is a variable.
- Each row is an observation.
Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When you use tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.
To tidy messy data, you first identify the variables in your dataset, then use the tools provided by tidyr to move them into columns. tidyr provides three main functions for tidying your messy data: gather(), separate() and spread().
gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer. Other names for gather include melt (reshape2), pivot (spreadsheets) and fold (databases). Here’s an example of how you might use gather() on a made-up dataset. In this experiment we’ve given three people two different drugs and recorded their heart rate:
library(tidyr)
library(dplyr)

messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)
messy
#>      name  a  b
#> 1  Wilbur 67 56
#> 2 Petunia 80 90
#> 3 Gregory 64 50
We have three variables (name, drug and heartrate), but only name is currently in a column. We use gather() to gather the a and b columns into key-value pairs of drug and heartrate:
messy %>%
  gather(drug, heartrate, a:b)
#>      name drug heartrate
#> 1  Wilbur    a        67
#> 2 Petunia    a        80
#> 3 Gregory    a        64
#> 4  Wilbur    b        56
#> 5 Petunia    b        90
#> 6 Gregory    b        50
Sometimes two variables are clumped together in one column. separate() allows you to tease them apart (extract() works similarly but uses regexp groups instead of a splitting pattern or position). Take this example from stackoverflow (modified slightly for brevity). We have some measurements of how much time people spend on their phones, measured at two locations (work and home), at two times. Each person has been randomly assigned to either treatment or control.
set.seed(10)
messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c('control', 'treatment'), each = 2)),
  work.T1 = runif(4),
  home.T1 = runif(4),
  work.T2 = runif(4),
  home.T2 = runif(4)
)
To tidy this data, we first use gather() to turn columns work.T1, home.T1, work.T2 and home.T2 into a key-value pair of key and time. (Only the first eight rows are shown to save space.)
tidier <- messy %>%
  gather(key, time, -id, -trt)
tidier %>% head(8)
#>   id       trt     key    time
#> 1  1 treatment work.T1 0.08514
#> 2  2   control work.T1 0.22544
#> 3  3 treatment work.T1 0.27453
#> 4  4   control work.T1 0.27231
#> 5  1 treatment home.T1 0.61583
#> 6  2   control home.T1 0.42967
#> 7  3 treatment home.T1 0.65166
#> 8  4   control home.T1 0.56774
Next we use separate() to split the key into location and time, using a regular expression to describe the character that separates them.
tidy <- tidier %>%
  separate(key, into = c("location", "time"), sep = "\\.")
tidy %>% head(8)
#>   id       trt location time    time
#> 1  1 treatment     work   T1 0.08514
#> 2  2   control     work   T1 0.22544
#> 3  3 treatment     work   T1 0.27453
#> 4  4   control     work   T1 0.27231
#> 5  1 treatment     home   T1 0.61583
#> 6  2   control     home   T1 0.42967
#> 7  3 treatment     home   T1 0.65166
#> 8  4   control     home   T1 0.56774
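For comparison, here is a sketch of the extract() approach mentioned earlier, which describes the pieces to keep with regexp groups instead of describing the separator. The small keys table, the column name period and the exact pattern are assumptions made for this illustration:

```r
library(tidyr)

# A made-up table holding the same kind of combined keys as above.
keys <- data.frame(key = c("work.T1", "home.T2"), stringsAsFactors = FALSE)

# Each parenthesised group in `regex` becomes one of the `into` columns.
parts <- tidyr::extract(
  keys, key,
  into = c("location", "period"),
  regex = "^([a-z]+)\\.(T[0-9])$"
)
```

The explicit tidyr:: prefix avoids picking up a different extract() if another attached package also exports one.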
The last tool, spread(), takes two columns (a key-value pair) and spreads them into multiple columns, making “long” data wider. Spread is known by other names in other places: it’s cast in reshape2, unpivot in spreadsheets and unfold in databases. spread() is used when you have variables that form rows instead of columns. You need spread() less frequently than gather() or separate(), so to learn more, check out the documentation and the demos.
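As a brief sketch of spread(), the following reverses the gather() step on the drug data from earlier. The names long and wide are chosen only for this illustration:

```r
library(tidyr)

messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50),
  stringsAsFactors = FALSE
)

# Wide to long: one row per (name, drug) pair.
long <- gather(messy, drug, heartrate, a:b)

# Long to wide again: each value of `drug` becomes its own column.
wide <- spread(long, drug, heartrate)
```

The round trip recovers the same values, although the row order of wide may differ from messy.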
Just as reshape2 did less than reshape, tidyr does less than reshape2. It’s designed specifically for tidying data, not general reshaping. In particular, existing methods only work for data frames, and tidyr never aggregates. This makes each function in tidyr simpler: each function does one thing well. For more complicated operations you can string together multiple simple tidyr and dplyr functions with %>%.
You can learn more about the underlying principles in my tidy data paper. To see more examples of data tidying, read the vignette, vignette("tidy-data"), or check out the demos, demo(package = "tidyr"). Alternatively, check out some of the great stackoverflow answers that use tidyr. Keep up-to-date with development at http://github.com/hadley/tidyr, report bugs at http://github.com/hadley/tidyr/issues and get help with data manipulation challenges at https://groups.google.com/group/manipulatr. If you ask a question specifically about tidyr on stackoverflow, please tag it with tidyr and I’ll make sure to read it.