The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.

The best place to learn about all the packages in the tidyverse and how they fit together is R for Data Science. Expect to hear more about the tidyverse in the coming months as I work on improving package websites, making citation easier, and providing a common home for discussions about data analysis with the tidyverse.

Installation

You can install the tidyverse with:

install.packages("tidyverse")

This will install the core tidyverse packages that you are likely to use in almost every analysis:

  • ggplot2, for data visualisation.
  • dplyr, for data manipulation.
  • tidyr, for data tidying.
  • readr, for data import.
  • purrr, for functional programming.
  • tibble, for tibbles, a modern re-imagining of data frames.

It also installs a selection of other tidyverse packages that you’re likely to use frequently, but probably not in every analysis. This includes packages for data import:

  • DBI, for databases.
  • haven, for SPSS, SAS and Stata files.
  • httr, for web APIs.
  • jsonlite, for JSON.
  • readxl, for .xls and .xlsx files.
  • rvest, for web scraping.
  • xml2, for XML.

And modelling:

  • modelr, for simple modelling within a pipeline.
  • broom, for turning models into tidy data.

These packages will be installed along with tidyverse, but you’ll load them explicitly with library().
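For example, to use readxl (installed as part of the tidyverse, but not attached by it), load it yourself. A minimal sketch; the file name is hypothetical:

library(readxl)
sales <- read_excel("sales.xlsx")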

Usage

library(tidyverse) will load the core tidyverse packages: ggplot2, tibble, tidyr, readr, purrr, and dplyr. You also get a condensed summary of conflicts with other packages you have loaded:

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ---------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

You can see conflicts created later with tidyverse_conflicts():

library(MASS)
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
tidyverse_conflicts()
#> Conflicts with tidy packages --------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
#> select(): dplyr, MASS

And you can check that all tidyverse packages are up-to-date with tidyverse_update():

tidyverse_update()
#> The following packages are out of date:
#>  * broom (0.4.0 -> 0.4.1)
#>  * DBI   (0.4.1 -> 0.5)
#>  * Rcpp  (0.12.6 -> 0.12.7)
#> Update now?
#> 
#> 1: Yes
#> 2: No
install.packages("lubridate")

This release includes a range of bug fixes and minor improvements. Some highlights from this release include:

  • period() and duration() constructors now accept character strings and allow a very flexible specification of timespans:
    period("3H 2M 1S")
    #> [1] "3H 2M 1S"
    
    duration("3 hours, 2 mins, 1 secs")
    #> [1] "10921s (~3.03 hours)"
    
    # Missing numerals default to 1. 
    # Repeated units are summed
    period("hour minute minute")
    #> [1] "1H 2M 0S"

    Period and duration parsing allows for arbitrary abbreviations of time units as long as the specification is unambiguous. For single-letter specs, strptime() rules are followed, so m stands for months and M for minutes.

    These same rules allow you to compare strings and durations/periods:

    "2mins 1 sec" > period("2mins")
    #> [1] TRUE
  • Date-time rounding (with round_date(), floor_date() and ceiling_date()) now supports unit multipliers, like “3 days” or “2 months”:
    ceiling_date(ymd_hms("2016-09-12 17:10:00"), unit = "5 minutes")
    #> [1] "2016-09-12 17:10:00 UTC"
  • The behavior of ceiling_date() for Date objects is now more intuitive. In short, dates are now interpreted as time intervals that are physically part of longer unit intervals:
    |day1| ... |day31|day1| ... |day28| ...
    |    January     |   February     | ...

    That means that rounding up 2000-01-01 by a month is done to the boundary between January and February, i.e. 2000-02-01:

    ceiling_date(ymd("2000-01-01"), unit = "month")
    #> [1] "2000-02-01"

    This behavior is controlled by the change_on_boundary argument.
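
    A minimal sketch of that argument; assuming the documented behaviour, FALSE leaves dates that are already on a boundary unchanged:

    ceiling_date(ymd("2000-01-01"), unit = "month", change_on_boundary = FALSE)
    #> [1] "2000-01-01"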

  • It is now possible to compare POSIXct and Date objects:
    ymd_hms("2000-01-01 00:00:01") > ymd("2000-01-01")
    #> [1] TRUE
  • C-level parsing now handles English month names and AM/PM indicators regardless of your locale. This means that English date-times are always handled by lubridate’s C-level parsing, and you don’t need to explicitly switch the locale (see the sketch after this list).
  • New parsing function yq() allows you to parse a year + quarter:
    yq("2016-02")
    #> [1] "2016-04-01"

    The new q format is available in all lubridate parsing functions.
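
As mentioned above, English month names now parse regardless of the active locale. A minimal sketch:

dmy("14 July 2016")
#> [1] "2016-07-14"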

See the release notes for the full list of changes. A big thanks goes to everyone who contributed: @arneschillert, @cderv, @ijlyttle, @jasonelaw, @jonboiser, and @krlmlr.

Factors have a bad reputation in R because they often turn up when you don’t want them. If you use packages from the tidyverse (like tibble and readr) you don’t need to worry about getting factors when you don’t want them. But factors are a useful data structure in their own right, particularly for modelling and visualisation, because they allow you to control the order of the levels. Working with factors in base R can be a little frustrating because of a handful of missing tools. The goal of forcats is to fill in those missing pieces so you can access the power of factors with a minimum of pain.

Install forcats with:

install.packages("forcats")

forcats provides two main types of tools: those that change the values of factor levels, and those that change their order. I’ll call out some of the most important functions below, using the included gss_cat dataset, which contains a selection of categorical variables from the General Social Survey.

library(dplyr)
library(ggplot2)
library(forcats)

gss_cat
#> # A tibble: 21,483 × 9
#>    year       marital   age   race        rincome            partyid
#>   <int>        <fctr> <int> <fctr>         <fctr>             <fctr>
#> 1  2000 Never married    26  White  $8000 to 9999       Ind,near rep
#> 2  2000      Divorced    48  White  $8000 to 9999 Not str republican
#> 3  2000       Widowed    67  White Not applicable        Independent
#> 4  2000 Never married    39  White Not applicable       Ind,near rep
#> 5  2000      Divorced    25  White Not applicable   Not str democrat
#> 6  2000       Married    25  White $20000 - 24999    Strong democrat
#> # ... with 2.148e+04 more rows, and 3 more variables: relig <fctr>,
#> #   denom <fctr>, tvhours <int>

Change level values

You can recode specified factor levels with fct_recode():

gss_cat %>% count(partyid)
#> # A tibble: 10 × 2
#>              partyid     n
#>               <fctr> <int>
#> 1          No answer   154
#> 2         Don't know     1
#> 3        Other party   393
#> 4  Strong republican  2314
#> 5 Not str republican  3032
#> 6       Ind,near rep  1791
#> # ... with 4 more rows

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
#> # A tibble: 10 × 2
#>                 partyid     n
#>                  <fctr> <int>
#> 1             No answer   154
#> 2            Don't know     1
#> 3           Other party   393
#> 4    Republican, strong  2314
#> 5      Republican, weak  3032
#> 6 Independent, near rep  1791
#> # ... with 4 more rows

Note that unmentioned levels are left as is, and the order of the levels is preserved.

fct_lump() allows you to lump the rarest (or most common) levels into a new “other” level. The default behaviour is to collapse the smallest levels into other, ensuring that it’s still the smallest level. For the religion variable, that tells us that Protestants outnumber all other religions, which is interesting, but we probably want more levels.

gss_cat %>% 
  mutate(relig = fct_lump(relig)) %>% 
  count(relig)
#> # A tibble: 2 × 2
#>        relig     n
#>       <fctr> <int>
#> 1      Other 10637
#> 2 Protestant 10846

Alternatively you can supply a number of levels to keep, n, or a minimum proportion for inclusion, prop. If you use negative values, fct_lump() will change direction, combining the most common values while preserving the rarest.

gss_cat %>% 
  mutate(relig = fct_lump(relig, n = 5)) %>% 
  count(relig)
#> # A tibble: 6 × 2
#>        relig     n
#>       <fctr> <int>
#> 1      Other   913
#> 2  Christian   689
#> 3       None  3523
#> 4     Jewish   388
#> 5   Catholic  5124
#> 6 Protestant 10846

gss_cat %>% 
  mutate(relig = fct_lump(relig, prop = -0.10)) %>% 
  count(relig)
#> # A tibble: 12 × 2
#>                     relig     n
#>                    <fctr> <int>
#> 1               No answer    93
#> 2              Don't know    15
#> 3 Inter-nondenominational   109
#> 4         Native american    23
#> 5               Christian   689
#> 6      Orthodox-christian    95
#> # ... with 6 more rows

Change level order

There are four simple helpers for common operations, illustrated in the sketch after this list:

  • fct_relevel() is similar to stats::relevel() but allows you to move any number of levels to the front.
  • fct_inorder() orders according to the first appearance of each level.
  • fct_infreq() orders from most common to rarest.
  • fct_rev() reverses the order of levels.
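
A minimal sketch of these four helpers on a toy factor (the level orders in the comments follow the documented behaviour):

f <- factor(c("b", "a", "c", "b", "b"))
levels(fct_relevel(f, "c"))  # "c" "a" "b": "c" moved to the front
levels(fct_inorder(f))       # "b" "a" "c": order of first appearance
levels(fct_infreq(f))        # "b" "a" "c": most common first
levels(fct_rev(f))           # "c" "b" "a": reversed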

fct_reorder() and fct_reorder2() are useful for visualisations. fct_reorder() reorders the factor levels by another variable. This is useful when you map a categorical variable to position, as shown in the following example which shows the average number of hours spent watching television across religions.

relig <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig, aes(tvhours, relig)) + geom_point()

ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()

fct_reorder2() extends the same idea to plots where a factor is mapped to another aesthetic, like colour. The defaults are designed to make legends easier to read for line plots, as shown in the following example looking at marital status by age.

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  group_by(age, marital) %>%
  count() %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop)) +
  geom_line(aes(colour = marital))

ggplot(by_age, aes(age, prop)) +
  geom_line(aes(colour = fct_reorder2(marital, age, prop))) +
  labs(colour = "marital")

Learning more

You can learn more about forcats in R for Data Science, and on the forcats website.

Please let me know if you have more factor problems that forcats doesn’t help with!

We’re proud to announce version 1.2.0 of the tibble package. Tibbles are a modern reimagining of the data frame, keeping what time has shown to be effective, and throwing out what is not. Grab the latest version with:

install.packages("tibble")

This is mostly a maintenance release, with the following major changes:

  • More options for adding individual rows and (new!) columns
  • Improved function names
  • Minor tweaks to the output

There are many other small improvements and bug fixes: please see the release notes for a complete list.

Thanks to Jenny Bryan for add_row() and add_column() improvements and ideas, to William Dunlap for pointing out a bug with tibble’s implementation of all.equal(), to Kevin Wright for pointing out a rare bug with glimpse(), and to all the other contributors. Use the issue tracker to submit bugs or suggest ideas; your contributions are always welcome.

Adding rows and columns

There are now more options for adding individual rows, and columns can be added in a similar way, illustrated with this small tibble:

df <- tibble(x = 1:3, y = 3:1)
df
#> # A tibble: 3 × 2
#>       x     y
#>   <int> <int>
#> 1     1     3
#> 2     2     2
#> 3     3     1

The add_row() function allows control over where the new rows are added. In the following example, the row (4, 0) is added before the second row:

df %>% 
  add_row(x = 4, y = 0, .before = 2)
#> # A tibble: 4 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     3
#> 2     4     0
#> 3     2     2
#> 4     3     1

Adding more than one row is now fully supported, although not recommended in general because it can be a bit hard to read.

df %>% 
  add_row(x = 4:5, y = 0:-1)
#> # A tibble: 5 × 2
#>       x     y
#>   <int> <int>
#> 1     1     3
#> 2     2     2
#> 3     3     1
#> 4     4     0
#> 5     5    -1

Columns can now be added in much the same way with the new add_column() function:

df %>% 
  add_column(z = -1:1, w = 0)
#> # A tibble: 3 × 4
#>       x     y     z     w
#>   <int> <int> <int> <dbl>
#> 1     1     3    -1     0
#> 2     2     2     0     0
#> 3     3     1     1     0

It also supports .before and .after arguments:

df %>% 
  add_column(z = -1:1, .after = 1)
#> # A tibble: 3 × 3
#>       x     z     y
#>   <int> <int> <int>
#> 1     1    -1     3
#> 2     2     0     2
#> 3     3     1     1

df %>%  
  add_column(w = 0:2, .before = "x")
#> # A tibble: 3 × 3
#>       w     x     y
#>   <int> <int> <int>
#> 1     0     1     3
#> 2     1     2     2
#> 3     2     3     1

The add_column() function will never alter your existing data: you can’t overwrite existing columns, and you can’t add new observations.
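
For example, attempting to redefine an existing column errors rather than silently overwriting it (a sketch; the exact error message may vary between tibble versions):

df %>% 
  add_column(x = 4:6)  # errors, because `x` already exists in df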

Function names

frame_data() is now tribble(), which stands for “transposed tibble”. The old name still works, but will be deprecated eventually.

tribble(
  ~x, ~y,
   1, "a",
   2, "z"
)
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2     z

Output tweaks

We’ve tweaked the output again to use the multiplication character × instead of x when printing dimensions (this still renders nicely on Windows). We surround non-syntactic column names with backticks, and dttm is now used instead of time to distinguish POSIXt and hms (or difftime) values.

The example below shows the new rendering:

tibble(`date and time` = Sys.time(), time = hms::hms(minutes = 3))
#> # A tibble: 1 × 2
#>       `date and time`     time
#>                <dttm>   <time>
#> 1 2016-08-29 16:48:57 00:03:00

Expect the printed output to continue to evolve in the next release. Stay tuned for a new function that reconstructs tribble() calls from existing data frames.

I’m pleased to announce version 1.1.0 of stringr. stringr makes string manipulation easier by using consistent function and argument names, and eliminating options that you don’t need 95% of the time. To get started with stringr, check out the strings chapter in R for data science. Install it with:

install.packages("stringr")

This release is mostly bug fixes, but there are a couple of new features you might care about.

  • There are three new datasets, fruit, words and sentences, to help you practice your regular expression skills:
    str_subset(fruit, "(..)\\1")
    #> [1] "banana"      "coconut"     "cucumber"    "jujube"      "papaya"     
    #> [6] "salal berry"
    head(words)
    #> [1] "a"        "able"     "about"    "absolute" "accept"   "account"
    sentences[1]
    #> [1] "The birch canoe slid on the smooth planks."
  • More functions work with boundary(): str_detect() and str_subset() can detect boundaries, and str_extract() and str_extract_all() pull out the components between boundaries. This is particularly useful if you want to extract logical constructs like words or sentences.
    x <- "This is harder than you might expect, e.g. punctuation!"
    x %>% str_extract_all(boundary("word")) %>% .[[1]]
    #> [1] "This"        "is"          "harder"      "than"        "you"        
    #> [6] "might"       "expect"      "e.g"         "punctuation"
    x %>% str_extract(boundary("sentence"))
    #> [1] "This is harder than you might expect, e.g. punctuation!"
  • str_view() and str_view_all() create HTML widgets that display regular expression matches. This is particularly useful for teaching.
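
    For instance, a call like the one below opens a widget highlighting each match (a minimal sketch; the rendered HTML can’t be shown here):

    str_view_all(c("abc", "banana"), "an")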

For a complete list of changes, please see the release notes.

I’m pleased to announce tidyr 0.6.0. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has a simple convention: put variables in the columns and observations in the rows. You can learn more about it in the tidy data vignette. Install it with:

install.packages("tidyr")

I mostly released this version to bundle up a number of small tweaks needed for R for Data Science. But there’s one nice new feature, contributed by Jan Schulz: drop_na(), which drops rows containing missing values:

df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df
#> # A tibble: 3 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2  <NA>
#> 3    NA     b

# Called without arguments, it drops rows containing
# missing values in any variable:
df %>% drop_na()
#> # A tibble: 1 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a

# Or you can restrict the variables it looks at, 
# using select() style syntax:
df %>% drop_na(x)
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2  <NA>

Please see the release notes for a complete list of changes.

readr 1.0.0 is now available on CRAN. readr makes it easy to read many types of rectangular data, including csv, tsv and fixed width files. Compared to base equivalents like read.csv(), readr is much faster and gives more convenient output: it never converts strings to factors, can parse date/times, and doesn’t munge the column names. Install the latest version with:

install.packages("readr")

Releasing version 1.0.0 was a deliberate choice to reflect the maturity and stability of readr, thanks largely to work by Jim Hester. readr is by no means perfect, but I don’t expect any major changes to the API in the future.

In this version we:

  • Used a better strategy for guessing column types.
  • Improved the default date and time parsers.
  • Provided a full set of lower-level file and line readers and writers.
  • Fixed many bugs.

Column guessing

The process by which readr guesses the types of columns has received a substantial overhaul to make it easier to fix problems when the initial guesses aren’t correct, and to make it easier to generate reproducible code. Column specifications are now printed by default when you read from a file:

mtcars2 <- read_csv(readr_example("mtcars.csv"))
#> Parsed with column specification:
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer()
#> )

The thought is that once you’ve figured out the correct column types for a file, you should make the parsing strict. You can do this either by copying and pasting the printed column specification or by saving the spec to disk:

# Once you've figured out the correct types
mtcars_spec <- write_rds(spec(mtcars2), "mtcars2-spec.rds")

# Every subsequent load
mtcars2 <- read_csv(
  readr_example("mtcars.csv"), 
  col_types = read_rds("mtcars2-spec.rds")
)
# In production, you might want to throw an error if there
# are any parsing problems.
stop_for_problems(mtcars2)

You can now also adjust the number of rows that readr uses to guess the column types with guess_max:

challenge <- read_csv(readr_example("challenge.csv"))
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )
#> Warning: 1000 parsing failures.
#>  row col               expected             actual
#> 1001   x no trailing characters .23837975086644292
#> 1002   x no trailing characters .41167997173033655
#> 1003   x no trailing characters .7460716762579978 
#> 1004   x no trailing characters .723450553836301  
#> 1005   x no trailing characters .614524137461558  
#> .... ... ...................... ..................
#> See problems(...) for more details.
challenge <- read_csv(readr_example("challenge.csv"), guess_max = 1500)
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )

(If you want to suppress the printed specification, just provide the dummy spec col_types = cols().)
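
For example, a minimal sketch using the bundled example file:

mtcars2 <- read_csv(readr_example("mtcars.csv"), col_types = cols())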

You can now access the guessing algorithm from R: guess_parser() will tell you which parser readr will select.

guess_parser("1,234")
#> [1] "number"

# Were previously guessed as numbers
guess_parser(c(".", "-"))
#> [1] "character"
guess_parser(c("10W", "20N"))
#> [1] "character"

# Now uses the default time format
guess_parser("10:30")
#> [1] "time"

Date-time parsing improvements

The date-time parsers recognise three new format strings:

  • %I for 12 hour time format:
    library(hms)
    parse_time("1 pm", "%I %p")
    #> 13:00:00

    Note that parse_time() returns hms from the hms package, rather than a custom time class.

  • %AD and %AT are “automatic” date and time parsers. They are both slightly less flexible than the previous defaults. The automatic date parser requires a four-digit year, and only accepts - and / as separators. The flexible time parser now requires colons between hours and minutes, with seconds optional.
    parse_date("2010-01-01", "%AD")
    #> [1] "2010-01-01"
    parse_time("15:01", "%AT")
    #> 15:01:00

If the format argument is omitted in parse_date() or parse_time(), the default date and time formats specified in the locale will be used. These now default to %AD and %AT respectively. You may want to override them in your standard locale() if the conventions are different where you live.
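
A minimal sketch of overriding the locale’s date format (the day-first format string is an assumption for illustration):

parse_date("14/09/2016", locale = locale(date_format = "%d/%m/%Y"))
#> [1] "2016-09-14"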

Low-level readers and writers

readr now contains a full set of efficient lower-level readers:

  • read_file() reads a file into a length-1 character vector; read_file_raw() reads a file into a single raw vector.
  • read_lines() reads a file into a character vector with one entry per line; read_lines_raw() reads into a list of raw vectors with one entry per line.

These are paired with write_lines() and write_file(), which efficiently write character and raw vectors back to disk.
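
A quick sketch of a round trip with the line-level functions, writing to a temporary file:

lines <- read_lines(readr_example("mtcars.csv"), n_max = 3)
write_lines(lines, tempfile(fileext = ".csv"))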

Other changes

  • read_fwf() was overhauled to reliably read only a partial set of columns, to read files with ragged final columns (by setting the final position/width to NA), and to skip comments (with the comment argument). See the sketch after this list.
  • readr contains an experimental API for reading a file in chunks, e.g. read_csv_chunked() and read_lines_chunked(). These allow you to work with files that are bigger than memory. We haven’t yet finalised the API, so please use it with care, and send us your feedback.
  • There are many other bug fixes and minor improvements. You can see a complete list in the release notes.
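
A minimal read_fwf() sketch, following the pattern in the readr documentation; the widths and names below are assumptions about the bundled sample file:

read_fwf(
  readr_example("fwf-sample.txt"),
  fwf_widths(c(20, 10, 12), c("name", "state", "ssn"))
)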

A big thanks goes to all the community members who contributed to this release: @antoine-lizee, @fpinter, @ghaarsma, @jennybc, @jeroenooms, @leeper, @LluisRamon, @noamross, and @tvedebrink.

We’re proud to announce version 1.1 of the tibble package. Tibbles are a modern reimagining of the data frame, keeping what time has shown to be effective, and throwing out what is not. Grab the latest version with:

install.packages("tibble")

There are three major new features:

  • A more consistent naming scheme
  • Changes to how columns are extracted
  • Tweaks to the output

There are many other small improvements and bug fixes: please see the release notes for a complete list.

A better naming scheme

It’s caused some confusion that you use data_frame() and as_data_frame() to create and coerce tibbles. It’s also become more important to clearly distinguish tibbles from data frames as we evolve a little further away from the semantics of data frames.

Now, we’re consistently using “tibble” as the key word in creation, coercion, and testing functions:

tibble(x = 1:5, y = letters[1:5])
#> # A tibble: 5 x 2
#>       x     y
#>   <int> <chr>
#> 1     1     a
#> 2     2     b
#> 3     3     c
#> 4     4     d
#> 5     5     e
as_tibble(data.frame(x = runif(5)))
#> # A tibble: 5 x 1
#>           x
#>       <dbl>
#> 1 0.4603887
#> 2 0.4824339
#> 3 0.4546795
#> 4 0.5042028
#> 5 0.4558387
is_tibble(data.frame())
#> [1] FALSE

Previously tibble() was an alias for frame_data(). If you were using tibble() to create tibbles by rows, you’ll need to switch to frame_data(). This is a breaking change, but we believe that the new naming scheme will be less confusing in the long run.
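
For example, row-wise creation now looks like this (a minimal sketch):

frame_data(
  ~x, ~y,
  1, "a",
  2, "b"
)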

Extracting columns

The previous version of tibble was a little too strict when you attempted to retrieve a column that did not exist: we had forgotten that many people check for the presence of a column with is.null(df$x). This is a bad idea because of partial matching, but it is common:

df1 <- data.frame(xyz = 1)
df1$x
#> [1] 1

Now, instead of throwing an error, tibble will return NULL. If you use $, common in interactive scripts, tibble will generate a warning:

df2 <- tibble(xyz = 1)
df2$x
#> Warning: Unknown column 'x'
#> NULL
df2[["x"]]
#> NULL

We also provide a convenient helper for detecting the presence/absence of a column:

has_name(df1, "x")
#> [1] FALSE
has_name(df2, "x")
#> [1] FALSE

Output tweaks

We’ve tweaked the output to have a shorter header and more information in the footer. We’re using # consistently to denote metadata, and we print missing character values as <NA> (instead of NA).

The example below shows the new rendering of the flights table.

nycflights13::flights
#> # A tibble: 336,776 x 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1   2013     1     1      517            515         2      830
#> 2   2013     1     1      533            529         4      850
#> 3   2013     1     1      542            540         2      923
#> 4   2013     1     1      544            545        -1     1004
#> 5   2013     1     1      554            600        -6      812
#> 6   2013     1     1      554            558        -4      740
#> 7   2013     1     1      555            600        -5      913
#> 8   2013     1     1      557            600        -3      709
#> 9   2013     1     1      557            600        -3      838
#> 10  2013     1     1      558            600        -2      753
#> # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <time>

Thanks to Lionel Henry for contributing an option for determining the number of printed extra columns: getOption("tibble.max_extra_cols"). This is particularly important for the ultra-wide tables often released by statistical offices and other institutions.
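
A small sketch of setting that option (5 is an arbitrary choice):

options(tibble.max_extra_cols = 5)
nycflights13::flights  # the footer now lists at most 5 extra columns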

Expect the printed output to continue to evolve. In the next version, we hope to do better with very wide columns (e.g. from long strings), and to make better use of now unused horizontal space (e.g. from long column names).

httr 1.2.0 is now available on CRAN. The httr package makes it easy to talk to web APIs from R. Learn more in the quick start vignette. Install the latest version with:

install.packages("httr")

There are a few small new features:

  • The new RETRY() function allows you to retry a request multiple times until it succeeds, which is useful if you are talking to an unreliable service. To avoid hammering the server, it uses exponential backoff with jitter, as described in https://www.awsarchitectureblog.com/2015/03/backoff.html. See the sketch after this list.
  • DELETE() gains a body parameter.
  • New encode = "raw" option for functions that accept bodies. This allows you to do your own encoding.
  • http_type() returns the content/mime type of a request, sans parameters.
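
A minimal RETRY() sketch; httpbin.org is just a convenient test endpoint:

# Retry the request up to 3 times, backing off between attempts
resp <- RETRY("GET", "http://httpbin.org/status/200", times = 3)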

There is one important bug fix:

  • httr no longer uses custom requests for standard POST requests. This has the side-effect of properly following redirects after POST, which fixes some login issues in rvest.

httr 1.2.1 includes a fix for a small bug that I discovered shortly after releasing 1.2.0.

For the complete list of improvements, please see the release notes.

We are pleased to announce that xml2 1.0.0 is now available on CRAN. xml2 is a wrapper around the comprehensive libxml2 C library that makes it easy to work with XML and HTML files in R. Install the latest version with:

install.packages("xml2")

There are three major improvements in 1.0.0:

  1. You can now modify and create XML documents.
  2. xml_find_first() replaces xml_find_one(), and provides better semantics for missing nodes.
  3. Improved namespace handling when working with XPath.

There are many other small improvements and bug fixes: please see the release notes for a complete list.

Modification and creation

xml2 now supports modification and creation of XML nodes. This includes new functions xml_new_document(), xml_new_child(), xml_new_sibling(), xml_set_namespace(), xml_remove(), xml_replace(), xml_root(), and replacement methods for xml_name(), xml_attr(), xml_attrs() and xml_text().

The basic process of creating an XML document by hand looks something like this:

root <- xml_new_document() %>% xml_add_child("root")

root %>% 
  xml_add_child("a1", x = "1", y = "2") %>% 
  xml_add_child("b") %>% 
  xml_add_child("c") %>% 
  invisible()

root %>% 
  xml_add_child("a2") %>% 
  xml_add_sibling("a3") %>% 
  invisible()

cat(as.character(root))
#> <?xml version="1.0"?>
#> <root><a1 x="1" y="2"><b><c/></b></a1><a2/><a3/></root>

For a complete description of creation and mutation, please see vignette("modification", package = "xml2").

xml_find_first()

xml_find_one() has been deprecated in favor of xml_find_first(). xml_find_first() now always returns a single node: if there are multiple matches, it returns the first (without a warning), and if there are no matches, it returns a new xml_missing object.

This makes it much easier to work with ragged/inconsistent hierarchies:

x1 <- read_xml("<a>
  <b></b>
  <b><c>See</c></b>
  <b><c>Sea</c><c /></b>
</a>")

c <- x1 %>% 
  xml_find_all(".//b") %>% 
  xml_find_first(".//c")
c
#> {xml_nodeset (3)}
#> [1] <NA>
#> [2] <c>See</c>
#> [3] <c>Sea</c>

Missing nodes are replaced by missing values in functions that return vectors:

xml_name(c)
#> [1] NA  "c" "c"
xml_text(c)
#> [1] NA    "See" "Sea"

XPath and namespaces

XPath is challenging to use if your document contains any namespaces:

x <- read_xml('
 <root>
   <doc1 xmlns = "http://foo.com"><baz /></doc1>
   <doc2 xmlns = "http://bar.com"><baz /></doc2>
 </root>
')
x %>% xml_find_all(".//baz")
#> {xml_nodeset (0)}

To make life slightly easier, the default xml_ns() object is automatically passed to xml_find_*():

x %>% xml_ns()
#> d1 <-> http://foo.com
#> d2 <-> http://bar.com
x %>% xml_find_all(".//d1:baz")
#> {xml_nodeset (1)}
#> [1] <baz/>

If you just want to avoid the hassle of namespaces altogether, we have a new nuclear option: xml_ns_strip():

xml_ns_strip(x)
x %>% xml_find_all(".//baz")
#> {xml_nodeset (2)}
#> [1] <baz/>
#> [2] <baz/>