I’m pleased to announce tidyr 0.6.0. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has a simple convention: put variables in the columns and observations in the rows. You can learn more about it in the tidy data vignette. Install it with:

install.packages("tidyr")

I mostly released this version to bundle up a number of small tweaks needed for R for Data Science. But there’s one nice new feature, contributed by Jan Schulz: drop_na(). drop_na() drops rows containing missing values:

df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df
#> # A tibble: 3 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2  <NA>
#> 3    NA     b

# Called without arguments, it drops rows containing
# missing values in any variable:
df %>% drop_na()
#> # A tibble: 1 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a

# Or you can restrict the variables it looks at, 
# using select() style syntax:
df %>% drop_na(x)
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2  <NA>

Please see the release notes for a complete list of changes.

readr 1.0.0 is now available on CRAN. readr makes it easy to read many types of rectangular data, including csv, tsv and fixed width files. Compared to base equivalents like read.csv(), readr is much faster and gives more convenient output: it never converts strings to factors, can parse date/times, and it doesn’t munge the column names. Install the latest version with:

install.packages("readr")

Releasing a version 1.0.0 was a deliberate choice to reflect the maturity and stability of readr, thanks largely to work by Jim Hester. readr is by no means perfect, but I don’t expect any major changes to the API in the future.

In this version we:

  • Use a better strategy for guessing column types.
  • Improve the default date and time parsers.
  • Provide a full set of lower-level file and line readers and writers.
  • Fix many bugs.

Column guessing

The process by which readr guesses the types of columns has received a substantial overhaul to make it easier to fix problems when the initial guesses aren’t correct, and to make it easier to generate reproducible code. Now column specifications are printed by default when you read from a file:

mtcars2 <- read_csv(readr_example("mtcars.csv"))
#> Parsed with column specification:
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer()
#> )

The thought is that once you’ve figured out the correct column types for a file, you should make the parsing strict. You can do this either by copying and pasting the printed column specification or by saving the spec to disk:

# Once you've figured out the correct types
mtcars_spec <- write_rds(spec(mtcars2), "mtcars2-spec.rds")

# Every subsequent load
mtcars2 <- read_csv(
  readr_example("mtcars.csv"), 
  col_types = read_rds("mtcars2-spec.rds")
)
# In production, you might want to throw an error if there
# are any parsing problems.
stop_for_problems(mtcars2)

You can now also adjust the number of rows that readr uses to guess the column types with guess_max:

challenge <- read_csv(readr_example("challenge.csv"))
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )
#> Warning: 1000 parsing failures.
#>  row col               expected             actual
#> 1001   x no trailing characters .23837975086644292
#> 1002   x no trailing characters .41167997173033655
#> 1003   x no trailing characters .7460716762579978 
#> 1004   x no trailing characters .723450553836301  
#> 1005   x no trailing characters .614524137461558  
#> .... ... ...................... ..................
#> See problems(...) for more details.
challenge <- read_csv(readr_example("challenge.csv"), guess_max = 1500)
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )

(If you want to suppress the printed specification, just provide the dummy spec col_types = cols().)
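For example, a minimal sketch using the same example file:

# An empty cols() specification still triggers guessing,
# but suppresses the printed specification
mtcars2 <- read_csv(readr_example("mtcars.csv"), col_types = cols())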

You can now access the guessing algorithm from R: guess_parser() will tell you which parser readr will select.

guess_parser("1,234")
#> [1] "number"

# Were previously guessed as numbers
guess_parser(c(".", "-"))
#> [1] "character"
guess_parser(c("10W", "20N"))
#> [1] "character"

# Now uses the default time format
guess_parser("10:30")
#> [1] "time"

Date-time parsing improvements

The date-time parsers recognise three new format strings:

  • %I for 12 hour time format:
    library(hms)
    parse_time("1 pm", "%I %p")
    #> 13:00:00

    Note that parse_time() returns an hms object from the hms package, rather than a custom time class.

  • %AD and %AT are “automatic” date and time parsers. They are both slightly less flexible than the previous defaults: the automatic date parser requires a four-digit year and only accepts - and / as separators, while the automatic time parser now requires colons between hours and minutes, with optional seconds.
    parse_date("2010-01-01", "%AD")
    #> [1] "2010-01-01"
    parse_time("15:01", "%AT")
    #> 15:01:00

If the format argument is omitted in parse_date() or parse_time(), the default date and time formats specified in the locale are used. These now default to %AD and %AT respectively. You may want to override them in your locale() if the conventions are different where you live.
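For instance, if you live somewhere with day/month/year conventions, a sketch of overriding the locale default might look like this:

# Override the default date format for day/month/year dates
parse_date("14/10/2016", locale = locale(date_format = "%d/%m/%Y"))
#> [1] "2016-10-14"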

Low-level readers and writers

readr now contains a full set of efficient lower-level readers:

  • read_file() reads a file into a length-1 character vector; read_file_raw() reads a file into a single raw vector.
  • read_lines() reads a file into a character vector with one entry per line; read_lines_raw() reads into a list of raw vectors with one entry per line.

These are paired with write_lines() and write_file(), which efficiently write character and raw vectors back to disk.
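A quick sketch of the paired readers and writers, using a temporary file:

tmp <- tempfile()
write_lines(c("a", "b", "c"), tmp)
read_lines(tmp)
#> [1] "a" "b" "c"
read_file(tmp)
#> [1] "a\nb\nc\n"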

Other changes

  • read_fwf() was overhauled to reliably read only a partial set of columns, to read files with ragged final columns (by setting the final position/width to NA), and to skip comments (with the comment argument).
  • readr contains an experimental API for reading a file in chunks, e.g. read_csv_chunked() and read_lines_chunked(). These allow you to work with files that are bigger than memory. We haven’t yet finalised the API so please use with care, and send us your feedback. (A sketch follows this list.)
  • There are many other bug fixes and minor improvements. You can see a complete list in the release notes.
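Here is a sketch of the chunked API. Since the API isn’t finalised, treat the details as illustrative:

# Filter each chunk as it is read, then combine the pieces;
# chunk_size is deliberately tiny here for illustration
read_csv_chunked(
  readr_example("mtcars.csv"),
  DataFrameCallback$new(function(x, pos) x[x$mpg > 30, ]),
  chunk_size = 10
)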

A big thanks goes to all the community members who contributed to this release: @antoine-lizee, @fpinter, @ghaarsma, @jennybc, @jeroenooms, @leeper, @LluisRamon, @noamross, and @tvedebrink.

We’re proud to announce version 1.1 of the tibble package. Tibbles are a modern reimagining of the data frame, keeping what time has shown to be effective, and throwing out what is not. Grab the latest version with:

install.packages("tibble")

There are three major new features:

  • A more consistent naming scheme
  • Changes to how columns are extracted
  • Tweaks to the output

There are many other small improvements and bug fixes: please see the release notes for a complete list.

A better naming scheme

It’s caused some confusion that you use data_frame() and as_data_frame() to create and coerce tibbles. It’s also become more important to clearly distinguish tibbles from data frames as we evolve a little further away from data frame semantics.

Now, we’re consistently using “tibble” as the key word in creation, coercion, and testing functions:

tibble(x = 1:5, y = letters[1:5])
#> # A tibble: 5 x 2
#>       x     y
#>   <int> <chr>
#> 1     1     a
#> 2     2     b
#> 3     3     c
#> 4     4     d
#> 5     5     e
as_tibble(data.frame(x = runif(5)))
#> # A tibble: 5 x 1
#>           x
#>       <dbl>
#> 1 0.4603887
#> 2 0.4824339
#> 3 0.4546795
#> 4 0.5042028
#> 5 0.4558387
is_tibble(data.frame())
#> [1] FALSE

Previously tibble() was an alias for frame_data(). If you were using tibble() to create tibbles by rows, you’ll need to switch to frame_data(). This is a breaking change, but we believe that the new naming scheme will be less confusing in the long run.
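For example, row-wise creation now goes through frame_data():

frame_data(
  ~x, ~y,
  "a", 1,
  "b", 2
)
#> # A tibble: 2 x 2
#>       x     y
#>   <chr> <dbl>
#> 1     a     1
#> 2     b     2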

Extracting columns

The previous version of tibble was a little too strict when you attempted to retrieve a column that did not exist: we had forgotten that many people check for the presence of a column with is.null(df$x). This is a bad idea because of partial matching, but it is common:

df1 <- data.frame(xyz = 1)
df1$x
#> [1] 1

Now, instead of throwing an error, tibble will return NULL. If you use $, which is common in interactive scripts, tibble will generate a warning:

df2 <- tibble(xyz = 1)
df2$x
#> Warning: Unknown column 'x'
#> NULL
df2[["x"]]
#> NULL

We also provide a convenient helper for detecting the presence/absence of a column:

has_name(df1, "x")
#> [1] FALSE
has_name(df2, "x")
#> [1] FALSE

Output tweaks

We’ve tweaked the output to have a shorter header and more information in the footer. We’re using # consistently to denote metadata, and we print missing character values as <NA> (instead of NA).

The example below shows the new rendering of the flights table.

nycflights13::flights
#> # A tibble: 336,776 x 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1   2013     1     1      517            515         2      830
#> 2   2013     1     1      533            529         4      850
#> 3   2013     1     1      542            540         2      923
#> 4   2013     1     1      544            545        -1     1004
#> 5   2013     1     1      554            600        -6      812
#> 6   2013     1     1      554            558        -4      740
#> 7   2013     1     1      555            600        -5      913
#> 8   2013     1     1      557            600        -3      709
#> 9   2013     1     1      557            600        -3      838
#> 10  2013     1     1      558            600        -2      753
#> # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <time>

Thanks to Lionel Henry for contributing an option for determining the number of printed extra columns: getOption("tibble.max_extra_cols"). This is particularly important for the ultra-wide tables often released by statistical offices and other institutions.
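For example, to show at most five extra columns in the footer:

options(tibble.max_extra_cols = 5)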

Expect the printed output to continue to evolve. In the next version, we hope to do better with very wide columns (e.g. from long strings), and to make better use of now unused horizontal space (e.g. from long column names).

httr 1.2.0 is now available on CRAN. The httr package makes it easy to talk to web APIs from R. Learn more in the quick start vignette. Install the latest version with:

install.packages("httr")

There are a few small new features:

  • The new RETRY() function retries a request multiple times until it succeeds, which is useful if you are trying to talk to an unreliable service. To avoid hammering the server, it uses exponential backoff with jitter, as described in https://www.awsarchitectureblog.com/2015/03/backoff.html. (See the sketch after this list.)
  • DELETE() gains a body parameter.
  • Functions that accept a body gain an encode = "raw" parameter. This allows you to do your own encoding.
  • http_type() returns the content/mime type of a request, sans parameters.
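A minimal sketch of RETRY(), using a placeholder URL that stands in for an unreliable service:

library(httr)
# Retry up to five times, backing off exponentially between attempts
r <- RETRY("GET", "http://httpbin.org/status/500", times = 5)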

There is one important bug fix:

  • httr no longer uses custom requests for standard POST requests. This has the side-effect of properly following redirects after POST, fixing some login issues in rvest.

httr 1.2.1 includes a fix for a small bug that I discovered shortly after releasing 1.2.0.

For the complete list of improvements, please see the release notes.

We are pleased to announce that xml2 1.0.0 is now available on CRAN. xml2 is a wrapper around the comprehensive libxml2 C library that makes it easy to work with XML and HTML files in R. Install the latest version with:

install.packages("xml2")

There are three major improvements in 1.0.0:

  1. You can now modify and create XML documents.
  2. xml_find_first() replaces xml_find_one(), and provides better semantics for missing nodes.
  3. Improved namespace handling when working with XPath.

There are many other small improvements and bug fixes: please see the release notes for a complete list.

Modification and creation

xml2 now supports modification and creation of XML nodes. This includes new functions xml_new_document(), xml_new_child(), xml_new_sibling(), xml_set_namespace(), xml_remove(), xml_replace(), xml_root(), and replacement methods for xml_name(), xml_attr(), xml_attrs() and xml_text().

The basic process of creating an XML document by hand looks something like this:

root <- xml_new_document() %>% xml_add_child("root")

root %>% 
  xml_add_child("a1", x = "1", y = "2") %>% 
  xml_add_child("b") %>% 
  xml_add_child("c") %>% 
  invisible()

root %>% 
  xml_add_child("a2") %>% 
  xml_add_sibling("a3") %>% 
  invisible()

cat(as.character(root))
#> <?xml version="1.0"?>
#> <root><a1 x="1" y="2"><b><c/></b></a1><a2/><a3/></root>

For a complete description of creation and mutation, please see vignette("modification", package = "xml2").

xml_find_first()

xml_find_one() has been deprecated in favor of xml_find_first(). xml_find_first() now always returns a single node: if there are multiple matches, it returns the first (without a warning), and if there are no matches, it returns a new xml_missing object.

This makes it much easier to work with ragged/inconsistent hierarchies:

x1 <- read_xml("<a>
  <b></b>
  <b><c>See</c></b>
  <b><c>Sea</c><c /></b>
</a>")

c <- x1 %>% 
  xml_find_all(".//b") %>% 
  xml_find_first(".//c")
c
#> {xml_nodeset (3)}
#> [1] <NA>
#> [2] <c>See</c>
#> [3] <c>Sea</c>

Missing nodes are replaced by missing values in functions that return vectors:

xml_name(c)
#> [1] NA  "c" "c"
xml_text(c)
#> [1] NA    "See" "Sea"

XPath and namespaces

XPath is challenging to use if your document contains any namespaces:

x <- read_xml('
 <root>
   <doc1 xmlns = "http://foo.com"><baz /></doc1>
   <doc2 xmlns = "http://bar.com"><baz /></doc2>
 </root>
')
x %>% xml_find_all(".//baz")
#> {xml_nodeset (0)}

To make life slightly easier, the default xml_ns() object is automatically passed to xml_find_*():

x %>% xml_ns()
#> d1 <-> http://foo.com
#> d2 <-> http://bar.com
x %>% xml_find_all(".//d1:baz")
#> {xml_nodeset (1)}
#> [1] <baz/>

If you just want to avoid the hassle of namespaces altogether, we have a new nuclear option: xml_ns_strip():

xml_ns_strip(x)
x %>% xml_find_all(".//baz")
#> {xml_nodeset (2)}
#> [1] <baz/>
#> [2] <baz/>

I’m very pleased to announce that dplyr 0.5.0 is now available from CRAN. Get the latest version with:

install.packages("dplyr")

dplyr 0.5.0 is a big release with a heap of new features, a whole bunch of minor improvements, and many bug fixes, both from me and from the broader dplyr community. In this blog post, I’ll highlight the most important changes:

  • Some breaking changes to single table verbs.
  • New tibble and dtplyr packages.
  • New vector functions.
  • Replacements for summarise_each() and mutate_each().
  • Improvements to SQL translation.

To see the complete list, please read the release notes.

Breaking changes

arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it’s not going to change again, as more changes will just cause more confusion.

mtcars %>% 
  group_by(cyl) %>% 
  arrange(desc(mpg))
#> Source: local data frame [32 x 11]
#> Groups: cyl [3]
#> 
#> # A tibble: 32 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
#> 2  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
#> 3  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
#> 4  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
#> 5  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
#> ... with 27 more rows

If you give distinct() a list of variables, it now only keeps those variables (instead of, as previously, keeping the first value from the other variables). To preserve the previous behaviour, use .keep_all = TRUE:

df <- data_frame(x = c(1, 1, 1, 2, 2), y = 1:5)

# Now only keeps x variable
df %>% distinct(x)
#> # A tibble: 2 x 1
#>       x
#>   <dbl>
#> 1     1
#> 2     2

# Previous behaviour preserved all variables
df %>% distinct(x, .keep_all = TRUE)
#> # A tibble: 2 x 2
#>       x     y
#>   <dbl> <int>
#> 1     1     1
#> 2     2     4

The select() helper functions starts_with(), ends_with(), etc. are now real exported functions. This means that they have better documentation, and there’s an extension mechanism if you want to write your own helpers.

Tibble and dtplyr packages

Functions related to the creation and coercion of tbl_dfs (“tibbles” for short) now live in their own package: tibble. See vignette("tibble") for more details.

Similarly, all code related to the data.table backend has been separated out into a new dtplyr package. This decouples the development of the data.table interface from the development of the dplyr package, and I hope it will spur improvements to the backend. If both data.table and dplyr are loaded, you’ll get a message reminding you to load dtplyr.

Vector functions

This version of dplyr gains a number of vector functions inspired by SQL. Two functions make it a little easier to eliminate or generate missing values:

  • Given a set of vectors, coalesce() finds the first non-missing value in each position:
    x <- c(1,  2, NA, 4, NA, 6)
    y <- c(NA, 2,  3, 4,  5, NA)
    
    # Use this to piece together a complete vector:
    coalesce(x, y)
    #> [1] 1 2 3 4 5 6
    
    # Or just replace missing value with a constant:
    coalesce(x, 0)
    #> [1] 1 2 0 4 0 6
  • The complement of coalesce() is na_if(): it replaces a specified value with an NA.
    x <- c(1, 5, 2, -99, -99, 10)
    na_if(x, -99)
    #> [1]  1  5  2 NA NA 10

Three functions provide convenient ways of replacing values. In order from simplest to most complicated, they are:

  • if_else(), a vectorised if statement, takes a logical vector (usually created with a comparison operator like ==, <, or %in%) and replaces TRUEs with one vector and FALSEs with another.
    x1 <- sample(5)
    if_else(x1 < 5, "small", "big")
    #> [1] "small" "small" "big"   "small" "small"

    if_else() is similar to base::ifelse(), but has two useful improvements.
    First, it has a fourth argument that will replace missing values:

    x2 <- c(NA, x1)
    if_else(x2 < 5, "small", "big", "unknown")
    #> [1] "unknown" "small"   "small"   "big"     "small"   "small"

    Secondly, it has stricter semantics than ifelse(): the true and false arguments must be the same type. This gives a less surprising return type, and preserves S3 vectors like dates and factors:

    x <- factor(sample(letters[1:5], 10, replace = TRUE))
    ifelse(x %in% c("a", "b", "c"), x, factor(NA))
    #>  [1] NA NA  1 NA  3  2  3 NA  3  2
    if_else(x %in% c("a", "b", "c"), x, factor(NA))
    #>  [1] <NA> <NA> a    <NA> c    b    c    <NA> c    b   
    #> Levels: a b c d e

    Currently, if_else() is very strict, so you’ll need to carefully match the types of true and false. This is most likely to bite you when you’re using missing values, and you’ll need to use a specific NA: NA_integer_, NA_real_, or NA_character_:

    if_else(TRUE, 1, NA)
    #> Error: `false` has type 'logical' not 'double'
    if_else(TRUE, 1, NA_real_)
    #> [1] 1
  • recode(), a vectorised switch(), takes a numeric vector, character vector, or factor, and replaces elements based on their values.
    x <- sample(c("a", "b", "c", NA), 10, replace = TRUE)
    
    # The default is to leave non-replaced values as is
    recode(x, a = "Apple")
    #>  [1] "c"     "Apple" NA      NA      "c"     NA      "b"     NA     
    #>  [9] "c"     "Apple"
    # But you can choose to override the default:
    recode(x, a = "Apple", .default = NA_character_)
    #>  [1] NA      "Apple" NA      NA      NA      NA      NA      NA     
    #>  [9] NA      "Apple"
    # You can also choose what value is used for missing values
    recode(x, a = "Apple", .default = NA_character_, .missing = "Unknown")
    #>  [1] NA        "Apple"   "Unknown" "Unknown" NA        "Unknown" NA       
    #>  [8] "Unknown" NA        "Apple"
  • case_when() is a vectorised set of if and else if statements. You provide it a set of test-result pairs as formulas: the left hand side of each formula should return a logical vector, and the right hand side should return either a single value or a vector the same length as the left hand side. All results must be the same type of vector.
    x <- 1:40
    case_when(
      x %% 35 == 0 ~ "fizz buzz",
      x %% 5 == 0 ~ "fizz",
      x %% 7 == 0 ~ "buzz",
      TRUE ~ as.character(x)
    )
    #>  [1] "1"         "2"         "3"         "4"         "fizz"     
    #>  [6] "6"         "buzz"      "8"         "9"         "fizz"     
    #> [11] "11"        "12"        "13"        "buzz"      "fizz"     
    #> [16] "16"        "17"        "18"        "19"        "fizz"     
    #> [21] "buzz"      "22"        "23"        "24"        "fizz"     
    #> [26] "26"        "27"        "buzz"      "29"        "fizz"     
    #> [31] "31"        "32"        "33"        "34"        "fizz buzz"
    #> [36] "36"        "37"        "38"        "39"        "fizz"

    case_when() is still somewhat experimental and does not currently work inside mutate(). That will be fixed in a future version.

I also added one small helper for dealing with floating point comparisons: near() tests for equality with numeric tolerance (abs(x - y) < tolerance).

x <- sqrt(2) ^ 2

x == 2
#> [1] FALSE
near(x, 2)
#> [1] TRUE

Predicate functions

Thanks to ideas and code from Lionel Henry, a new family of functions improve upon summarise_each() and mutate_each():

  • summarise_all() and mutate_all() apply a function to all (non-grouped) columns:
    mtcars %>% group_by(cyl) %>% summarise_all(mean)    
    #> # A tibble: 3 x 11
    #>     cyl      mpg     disp        hp     drat       wt     qsec        vs
    #>   <dbl>    <dbl>    <dbl>     <dbl>    <dbl>    <dbl>    <dbl>     <dbl>
    #> 1     4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909
    #> 2     6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
    #> 3     8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
    #> ... with 3 more variables: am <dbl>, gear <dbl>, carb <dbl>
  • summarise_at() and mutate_at() operate on a subset of columns. You can select columns with:
    • a character vector of column names,
    • a numeric vector of column positions, or
    • a column specification with select() semantics generated with the new vars() helper.
    mtcars %>% group_by(cyl) %>% summarise_at(c("mpg", "wt"), mean)
    #> # A tibble: 3 x 3
    #>     cyl      mpg       wt
    #>   <dbl>    <dbl>    <dbl>
    #> 1     4 26.66364 2.285727
    #> 2     6 19.74286 3.117143
    #> 3     8 15.10000 3.999214
    mtcars %>% group_by(cyl) %>% summarise_at(vars(mpg, wt), mean)
    #> # A tibble: 3 x 3
    #>     cyl      mpg       wt
    #>   <dbl>    <dbl>    <dbl>
    #> 1     4 26.66364 2.285727
    #> 2     6 19.74286 3.117143
    #> 3     8 15.10000 3.999214
  • summarise_if() and mutate_if() take a predicate function (a function that returns TRUE or FALSE when given a column). This makes it easy to apply a function only to numeric columns:
    iris %>% summarise_if(is.numeric, mean)
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width
    #> 1     5.843333    3.057333        3.758    1.199333

All of these functions pass ... on to the individual funs:

iris %>% summarise_if(is.numeric, mean, trim = 0.25)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.802632    3.032895     3.934211    1.230263

A new select_if() allows you to pick columns with a predicate function:

df <- data_frame(x = 1:3, y = c("a", "b", "c"))
df %>% select_if(is.numeric)
#> # A tibble: 3 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3
df %>% select_if(is.character)
#> # A tibble: 3 x 1
#>       y
#>   <chr>
#> 1     a
#> 2     b
#> 3     c

summarise_each() and mutate_each() will be deprecated in a future release.

SQL translation

I have completely overhauled the translation of dplyr verbs into SQL statements. Previously, dplyr used a rather ad-hoc approach which tried to guess when a new subquery was needed. Unfortunately this approach was fraught with bugs, so I have now implemented a richer internal data model. In the short term, this is likely to lead to some minor performance decreases (as the generated SQL is more complex), but dplyr is much more likely to generate correct SQL. In the long term, these abstractions will make it possible to write a query optimiser/compiler in dplyr, which would make it possible to generate much more succinct queries. If you know anything about writing query optimisers or compilers and are interested in working on this problem, please let me know!
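You can get a feel for the translation layer with translate_sql(); the exact SQL generated may vary between versions, so treat this output as a sketch:

library(dplyr)
# Translate an R expression into its SQL equivalent
translate_sql(x + 1)
#> <SQL> "x" + 1.0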

I’m pleased to announce tidyr 0.5.0. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has a simple convention: put variables in the columns and observations in the rows. You can learn more about it in the tidy data vignette. Install it with:

install.packages("tidyr")

This release has three useful new features:

  1. separate_rows() separates values that contain multiple entries separated by a delimiter into multiple rows. Thanks to Aaron Wolen for the contribution!
    df <- data_frame(x = 1:2, y = c("a,b", "d,e,f"))
    df %>% 
      separate_rows(y, sep = ",")
    #> Source: local data frame [5 x 2]
    #> 
    #>       x     y
    #>   <int> <chr>
    #> 1     1     a
    #> 2     1     b
    #> 3     2     d
    #> 4     2     e
    #> 5     2     f

    Compare with separate() which separates into (named) columns:

    df %>% 
      separate(y, c("y1", "y2", "y3"), sep = ",", fill = "right")
    #> Source: local data frame [2 x 4]
    #> 
    #>       x    y1    y2    y3
    #> * <int> <chr> <chr> <chr>
    #> 1     1     a     b  <NA>
    #> 2     2     d     e     f
  2. spread() gains a sep argument. Setting this will name columns as “key<sep>value”. This is useful when you’re spreading based on a numeric column:
    df <- data_frame(
      x = c(1, 2, 1), 
      key = c(1, 1, 2), 
      val = c("a", "b", "c")
    )
    df %>% spread(key, val)
    #> Source: local data frame [2 x 3]
    #> 
    #>       x     1     2
    #> * <dbl> <chr> <chr>
    #> 1     1     a     c
    #> 2     2     b  <NA>
    df %>% spread(key, val, sep = "_")
    #> Source: local data frame [2 x 3]
    #> 
    #>       x key_1 key_2
    #> * <dbl> <chr> <chr>
    #> 1     1     a     c
    #> 2     2     b  <NA>
  3. unnest() gains a .sep argument. This is useful if you have multiple columns of data frames that have the same variable names:
    df <- data_frame(
      x = 1:2,
      y1 = list(
        data_frame(y = 1),
        data_frame(y = 2)
      ),
      y2 = list(
        data_frame(y = "a"),
        data_frame(y = "b")
      )
    )
    df %>% unnest()
    #> Source: local data frame [2 x 3]
    #> 
    #>       x     y     y
    #>   <int> <dbl> <chr>
    #> 1     1     1     a
    #> 2     2     2     b
    df %>% unnest(.sep = "_")
    #> Source: local data frame [2 x 3]
    #> 
    #>       x  y1_y  y2_y
    #>   <int> <dbl> <chr>
    #> 1     1     1     a
    #> 2     2     2     b

    It also gains a .id argument that makes the names of the list explicit:

    df <- data_frame(
      x = 1:2,
      y = list(
        a = 1:3,
        b = 3:1
      )
    )
    df %>% unnest()
    #> Source: local data frame [6 x 2]
    #> 
    #>       x     y
    #>   <int> <int>
    #> 1     1     1
    #> 2     1     2
    #> 3     1     3
    #> 4     2     3
    #> 5     2     2
    #> 6     2     1
    df %>% unnest(.id = "id")
    #> Source: local data frame [6 x 3]
    #> 
    #>       x     y    id
    #>   <int> <int> <chr>
    #> 1     1     1     a
    #> 2     1     2     a
    #> 3     1     3     a
    #> 4     2     3     b
    #> 5     2     2     b
    #> 6     2     1     b

tidyr 0.5.0 also includes a bumper crop of bug fixes, including fixes for spread() and gather() in the presence of list-columns. Please see the release notes for a complete list of changes.

testthat 1.0.0 is now available on CRAN. Testthat makes it easy to turn your existing informal tests into formal automated tests that you can rerun quickly and easily. Learn more at http://r-pkgs.had.co.nz/tests.html. Install the latest version with:

install.packages("testthat")

This version of testthat saw a major behind-the-scenes overhaul. This is the reason for the 1.0.0 release, and it will make it easier to add new expectations and reporters in the future. As well as the internal changes, there are improvements in four main areas:

  • New expectations.
  • Support for the pipe.
  • More consistent tests for side-effects.
  • Support for testing C++ code.

These are described in detail below. For a complete set of changes, please see the release notes.

Improved expectations

There are five new expectations:

  • expect_type() checks the base type of an object (with typeof()), expect_s3_class() tests that an object is an S3 object with a given class, and expect_s4_class() tests that an object is an S4 object with a given class. I recommend using these more specific expectations instead of the generic expect_is(), because they more clearly convey intent.
  • expect_length() checks that an object has the expected length.
  • expect_output_file() compares the output of a function with a text file, optionally updating the file. This is useful for regression tests of print() methods.
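A few illustrative uses of the new expectations:

expect_type(1L, "integer")
expect_s3_class(factor("a"), "factor")
expect_length(letters, 26)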

A number of older expectations have been deprecated:

  • expect_more_than() and expect_less_than() have been deprecated. Please use expect_gt() and expect_lt() instead.
  • takes_less_than() has been deprecated.
  • not() has been deprecated. Please use the explicit individual forms expect_error(..., NA), expect_warning(..., NA), etc.

We also did a thorough review of the documentation, ensuring that related expectations are documented together.

Piping

Most expectations now invisibly return the input object. This makes it possible to chain together expectations with magrittr:

factor("a") %>% 
  expect_type("integer") %>% 
  expect_s3_class("factor") %>% 
  expect_length(1)

To make this style even easier, testthat now imports and re-exports the pipe so you don’t need to explicitly attach magrittr.

Side-effects

Expectations that test for side-effects (i.e. expect_message(), expect_warning(), expect_error(), and expect_output()) are now more consistent:

  • expect_message(f(), NA) will fail if a message is produced (i.e. it’s not missing), and similarly for expect_output(), expect_warning(), and expect_error().
    quiet <- function() {}
    noisy <- function() message("Hi!")
    
    expect_message(quiet(), NA)
    expect_message(noisy(), NA)
    #> Error: noisy() showed 1 message. 
    #> * Hi!
  • expect_message(f(), NULL) will fail if a message isn’t produced, and similarly for expect_output(), expect_warning(), and expect_error().
    expect_message(quiet(), NULL)
    #> Error: quiet() showed 0 messages
    expect_message(noisy(), NULL)

There were three other changes made in the interest of consistency:

  • Previously testing for one side-effect (e.g. messages) tended to muffle other side effects (e.g. warnings). This is no longer the case.
  • Warnings that are not captured explicitly by expect_warning() are tracked and reported. These do not currently cause a test suite to fail, but may do in the future.
  • If you want to test a print method, expect_output() now requires you to explicitly print the object: expect_output("a", "a") will fail, expect_output(print("a"), "a") will succeed. This makes it more consistent with the other side-effect functions.
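A minimal sketch of the new expect_output() behaviour:

# The object must be printed explicitly;
# expect_output("a", "a") would now fail
expect_output(print("a"), "a")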

C++

Thanks to the work of Kevin Ushey, testthat now includes a simple interface to unit test C++ code using the Catch library. Using Catch in your packages is easy – just call testthat::use_catch() and the necessary infrastructure, alongside a few sample test files, will be generated for your package. By convention, you can place your unit tests in src/test-<name>.cpp. Here’s a simple example of a test file you might write when using testthat + Catch:

#include <testthat.h>
context("Addition") {
  test_that("two plus two equals four") {
    int result = 2 + 2;
    expect_true(result == 4);
  }
}

These unit tests will be compiled and run during calls to devtools::test(), as well as R CMD check. See ?use_catch for a full list of functions supported by testthat, and for more details.

For now, Catch unit tests will only be compiled when using the gcc and clang compilers – this implies that the unit tests you write will not be compiled + run on Solaris, which should make it easier to submit packages that use testthat for C++ unit tests to CRAN.

Wes McKinney, Software Engineer, Cloudera
Hadley Wickham, Chief Scientist, RStudio

This past January, we (Hadley and Wes) met and discussed some of the systems challenges facing the Python and R open source communities. In particular, we wanted to see if there were some opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems.

One thing that struck us was that while R’s data frames and Python’s pandas data frames use very different internal memory representations, they share a very similar semantic model. In both R and pandas, data frames are lists of named, equal-length columns, which can be numeric, boolean, date-and-time, categorical (factor), or string. Every column can have missing values.

Around this time, the open source community had just started the new Apache Arrow project, designed to improve data interoperability for systems dealing with columnar tabular data.

In discussing Apache Arrow in the context of Python and R, we wanted to see if we could use its insights to design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born.

What is Feather?

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

  • Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible
  • Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.
  • High read and write performance. When possible, Feather operations should be bound by local disk performance.

Code examples

The Feather API is designed to make reading and writing data frames as easy as possible. In R, the code might look like:

library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

Analogously, in Python, we have:

import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)

How fast is Feather?

Feather is extremely fast. Since Feather does not currently use any compression internally, it works best with solid-state drives, such as those that come with most of today’s laptops. For this first release, we prioritized a simple implementation and are thus writing unmodified Arrow memory to disk.

To give you an idea, here is a Python benchmark writing an approximately 800MB pandas DataFrame to disk:

import feather
import pandas as pd
import numpy as np
arr = np.random.randn(10000000) # 10% nulls
arr[::10] = np.nan
df = pd.DataFrame({'column_{0}'.format(i): arr for i in range(10)})
feather.write_dataframe(df, 'test.feather')

On Wes’s laptop (latest-gen Intel processor with SSD), this takes:

In [9]: %time df = feather.read_dataframe('test.feather')
CPU times: user 316 ms, sys: 944 ms, total: 1.26 s
Wall time: 1.26 s

In [11]: 800 / 1.26
Out[11]: 634.9206349206349

This is effective performance of over 600 MB/s. Of course, the performance you see will depend on your hardware configuration.

And in R (on Hadley’s laptop, which is very similar):

library(feather)

x <- runif(1e7)
x[sample(1e7, 1e6)] <- NA # 10% NAs
df <- as.data.frame(replicate(10, x))
write_feather(df, 'test.feather')

system.time(read_feather('test.feather'))
#>   user  system elapsed 
#>  0.731   0.287   1.020 

How can I get Feather?

The Feather source code is hosted at http://github.com/wesm/feather.

Installing Feather for R

Feather is currently available from GitHub, and you can install it with:

devtools::install_github("wesm/feather/R")

Feather uses C++11, so if you’re on Windows, you’ll need the new gcc 4.9.3 toolchain. (All going well, this will be included in R 3.3.0, which is scheduled for release on April 14. We’ll aim for a CRAN release soon after that.)

Installing Feather for Python

For Python, you can install Feather from PyPI like so:

$ pip install feather-format

We will look into providing more installation options, such as conda builds, in the future.

What should you not use Feather for?

Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable between versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.

Feather, Apache Arrow, and the community

One of the great parts of Feather is that the file format is language agnostic. Other languages, such as Julia or Scala (for Spark users), can read and write the format without knowledge of details of Python or R.

Feather is one of the first projects to bring the tangible benefits of the Arrow spec to users in the form of an efficient, language-agnostic representation of tabular data on disk. Since Arrow does not provide for a file format, we are using Google’s Flatbuffers library (github.com/google/flatbuffers) to serialize column types and related metadata in a language-independent way in the file.

The Python interface uses Cython to expose Feather’s C++11 core to users, while the R interface uses Rcpp for the same task.

I’m pleased to announce tibble, a new package for manipulating and printing data frames in R. Tibbles are a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. The name comes from dplyr: originally you created these objects with tbl_df(), which was most easily pronounced as “tibble diff”.

Install tibble with:

install.packages("tibble")

This package extracts out the tbl_df class associated functions from dplyr. Kirill Müller extracted the code from dplyr, enhanced the tests, and added a few minor improvements.

Creating tibbles

You can create a tibble from an existing object with as_data_frame():

as_data_frame(iris)
#> Source: local data frame [150 x 5]
#> 
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> ..          ...         ...          ...         ...     ...

This works for data frames, lists, matrices, and tables.

You can also create a new tibble from individual vectors with data_frame():

data_frame(x = 1:5, y = 1, z = x ^ 2 + y)
#> Source: local data frame [5 x 3]
#> 
#>       x     y     z
#>   (int) (dbl) (dbl)
#> 1     1     1     2
#> 2     2     1     5
#> 3     3     1    10
#> 4     4     1    17
#> 5     5     1    26

data_frame() does much less than data.frame(): it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row.names(). You can read more about these features in the vignette, vignette("tibble").
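A quick sketch of the contrast (recall that data.frame(), with R’s current defaults, converts strings to factors):

str(data.frame(x = "a")$x)
#>  Factor w/ 1 level "a": 1
str(data_frame(x = "a")$x)
#>  chr "a"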

You can define a tibble row-by-row with frame_data():

frame_data(
  ~x, ~y,  ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)
#> Source: local data frame [2 x 3]
#> 
#>       x     y     z
#>   (chr) (dbl) (dbl)
#> 1     a     2   3.6
#> 2     b     1   8.5

Tibbles vs data frames

There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from str():

library(nycflights13)
flights
#> Source: local data frame [336,776 x 16]
#> 
#>     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
#>    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
#> 1   2013     1     1      517         2      830        11      UA  N14228
#> 2   2013     1     1      533         4      850        20      UA  N24211
#> 3   2013     1     1      542         2      923        33      AA  N619AA
#> 4   2013     1     1      544        -1     1004       -18      B6  N804JB
#> 5   2013     1     1      554        -6      812       -25      DL  N668DN
#> 6   2013     1     1      554        -4      740        12      UA  N39463
#> 7   2013     1     1      555        -5      913        19      B6  N516JB
#> 8   2013     1     1      557        -3      709       -14      EV  N829AS
#> 9   2013     1     1      557        -3      838        -8      B6  N593JB
#> 10  2013     1     1      558        -2      753         8      AA  N3ALAA
#> ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
#> Variables not shown: flight (int), origin (chr), dest (chr), air_time
#>   (dbl), distance (dbl), hour (dbl), minute (dbl).

Tibbles are strict about subsetting. If you try to access a variable that does not exist, you’ll get an error:

flights$yea
#> Error: Unknown column 'yea'

Tibbles also clearly delineate [ and [[: [ always returns another tibble, [[ always returns a vector. No more drop = FALSE!

class(iris[ , 1])
#> [1] "numeric"
class(iris[ , 1, drop = FALSE])
#> [1] "data.frame"
class(as_data_frame(iris)[ , 1])
#> [1] "tbl_df"     "tbl"        "data.frame"
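And [[ always extracts the underlying vector, even from a tibble:

class(as_data_frame(iris)[[1]])
#> [1] "numeric"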

Interacting with legacy code

A handful of functions don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back into a data frame:

class(as.data.frame(tbl_df(iris)))
#> [1] "data.frame"