You are currently browsing the monthly archive for August 2016.
I’m excited to announce forcats, a new package for categorical variables, or factors. Factors have a bad rap in R because they often turn up when you don’t want them. That’s because historically, factors were more convenient than character vectors, as discussed in stringsAsFactors: An unauthorized biography by Roger Peng, and stringsAsFactors = <sigh> by Thomas Lumley.
If you use packages from the tidyverse (like tibble and readr) you don’t need to worry about getting factors when you don’t want them. But factors are a useful data structure in their own right, particularly for modelling and visualisation, because they allow you to control the order of the levels. Working with factors in base R can be a little frustrating because of a handful of missing tools. The goal of forcats is to fill in those missing pieces so you can access the power of factors with a minimum of pain.
Install forcats with:

install.packages("forcats")
forcats provides two main types of tools to change either the values or the order of the levels. I’ll call out some of the most important functions below, using the included gss_cat dataset, which contains a selection of categorical variables from the General Social Survey.
library(dplyr)
library(ggplot2)
library(forcats)

gss_cat
#> # A tibble: 21,483 × 9
#>    year       marital   age   race        rincome            partyid
#>   <int>        <fctr> <int> <fctr>         <fctr>             <fctr>
#> 1  2000 Never married    26  White  $8000 to 9999       Ind,near rep
#> 2  2000      Divorced    48  White  $8000 to 9999 Not str republican
#> 3  2000       Widowed    67  White Not applicable        Independent
#> 4  2000 Never married    39  White Not applicable       Ind,near rep
#> 5  2000      Divorced    25  White Not applicable   Not str democrat
#> 6  2000       Married    25  White $20000 - 24999    Strong democrat
#> # ... with 2.148e+04 more rows, and 3 more variables: relig <fctr>,
#> #   denom <fctr>, tvhours <int>
Change level values
You can recode specified factor levels with fct_recode():
gss_cat %>% count(partyid)
#> # A tibble: 10 × 2
#>              partyid     n
#>               <fctr> <int>
#> 1          No answer   154
#> 2         Don't know     1
#> 3        Other party   393
#> 4  Strong republican  2314
#> 5 Not str republican  3032
#> 6       Ind,near rep  1791
#> # ... with 4 more rows

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
#> # A tibble: 10 × 2
#>                 partyid     n
#>                  <fctr> <int>
#> 1             No answer   154
#> 2            Don't know     1
#> 3           Other party   393
#> 4    Republican, strong  2314
#> 5      Republican, weak  3032
#> 6 Independent, near rep  1791
#> # ... with 4 more rows
Note that unmentioned levels are left as is, and the order of the levels is preserved.
fct_lump() allows you to lump the rarest (or most common) levels into a new “other” level. The default behaviour is to collapse the smallest levels into other, ensuring that it’s still the smallest level. For the religion variable, that tells us that Protestants outnumber all other religions, which is interesting, but we probably want more levels.
gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
#> # A tibble: 2 × 2
#>        relig     n
#>       <fctr> <int>
#> 1      Other 10637
#> 2 Protestant 10846
Alternatively you can supply a number of levels to keep, n, or a minimum proportion for inclusion, prop. If you use negative values, fct_lump() will change direction, and combine the most common values while preserving the rarest.
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 5)) %>%
  count(relig)
#> # A tibble: 6 × 2
#>        relig     n
#>       <fctr> <int>
#> 1      Other   913
#> 2  Christian   689
#> 3       None  3523
#> 4     Jewish   388
#> 5   Catholic  5124
#> 6 Protestant 10846

gss_cat %>%
  mutate(relig = fct_lump(relig, prop = -0.10)) %>%
  count(relig)
#> # A tibble: 12 × 2
#>                     relig     n
#>                    <fctr> <int>
#> 1               No answer    93
#> 2              Don't know    15
#> 3 Inter-nondenominational   109
#> 4         Native american    23
#> 5               Christian   689
#> 6      Orthodox-christian    95
#> # ... with 6 more rows
Change level order
There are four simple helpers for common operations:
- fct_relevel() is similar to stats::relevel() but allows you to move any number of levels to the front.
- fct_inorder() orders according to the first appearance of each level.
- fct_infreq() orders from most common to rarest.
- fct_rev() reverses the order of levels.
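The four helpers above can be tried on a small factor. This is a minimal sketch on made-up data (the vector f is purely illustrative and not part of gss_cat); it only prints the resulting level order for each helper.

```r
library(forcats)

# A toy factor; the default level order is alphabetical: a, b, c
f <- factor(c("b", "c", "c", "a", "a", "a"))

# Move "c" to the front, leaving the remaining levels in their existing order
lv_relevel <- levels(fct_relevel(f, "c"))   # c, a, b

# Order levels by first appearance in the data
lv_inorder <- levels(fct_inorder(f))        # b, c, a

# Order levels from most common to rarest
lv_infreq <- levels(fct_infreq(f))          # a, c, b

# Reverse the existing level order
lv_rev <- levels(fct_rev(f))                # c, b, a

print(list(relevel = lv_relevel, inorder = lv_inorder,
           infreq = lv_infreq, rev = lv_rev))
```

None of these calls changes the values stored in the factor, only the order of its levels.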
fct_reorder() and fct_reorder2() are useful for visualisations.
fct_reorder() reorders the factor levels by another variable. This is useful when you map a categorical variable to position, as shown in the following example which shows the average number of hours spent watching television across religions.
relig <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig, aes(tvhours, relig)) + geom_point()

ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) + geom_point()
fct_reorder2() extends the same idea to plots where a factor is mapped to another aesthetic, like colour. The defaults are designed to make legends easier to read for line plots, as shown in the following example looking at marital status by age.
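As a minimal sketch of the idea on made-up data (the data frame toy and its columns grp, x and y are illustrative names, not taken from gss_cat): with the defaults in recent forcats releases, fct_reorder2() orders the levels by the y value observed at the largest x, in descending order, so that a legend lines up with the right-hand ends of the lines.

```r
library(forcats)

# Three hypothetical groups, each observed at x = 1, 2, 3
toy <- data.frame(
  grp = factor(rep(c("a", "b", "c"), each = 3)),
  x   = rep(1:3, times = 3),
  y   = c(1, 2, 3,   # group a ends at y = 3
          3, 2, 1,   # group b ends at y = 1
          2, 4, 9)   # group c ends at y = 9
)

# Reorder grp by the y value at the largest x (descending by default)
reordered <- fct_reorder2(toy$grp, toy$x, toy$y)
new_levels <- levels(reordered)
print(new_levels)
```

In a line plot, mapping colour to the reordered factor would list group c first in the legend, then a, then b, matching the vertical order of the line ends.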
We’re proud to announce version 1.2.0 of the tibble package. Tibbles are a modern reimagining of the data frame, keeping what time has shown to be effective, and throwing out what is not. Grab the latest version with:

install.packages("tibble")
This is mostly a maintenance release, with the following major changes:
- More options for adding individual rows and (new!) columns
- Improved function names
- Minor tweaks to the output
There are many other small improvements and bug fixes: please see the release notes for a complete list.
Thanks to Jenny Bryan for add_column() improvements and ideas, to William Dunlap for pointing out a bug with tibble’s implementation of all.equal(), to Kevin Wright for pointing out a rare bug with glimpse(), and to all the other contributors. Use the issue tracker to submit bugs or suggest ideas; your contributions are always welcome.
Adding rows and columns
There are now more options for adding individual rows, and columns can be added in a similar way, illustrated with this small tibble:
df <- tibble(x = 1:3, y = 3:1)
df
#> # A tibble: 3 × 2
#>       x     y
#>   <int> <int>
#> 1     1     3
#> 2     2     2
#> 3     3     1
The add_row() function allows control over where the new rows are added. In the following example, the row (4, 0) is added before the second row:
df %>% add_row(x = 4, y = 0, .before = 2)
#> # A tibble: 4 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     3
#> 2     4     0
#> 3     2     2
#> 4     3     1
Adding more than one row is now fully supported, although not recommended in general because it can be a bit hard to read.
df %>% add_row(x = 4:5, y = 0:-1)
#> # A tibble: 5 × 2
#>       x     y
#>   <int> <int>
#> 1     1     3
#> 2     2     2
#> 3     3     1
#> 4     4     0
#> 5     5    -1
Columns can now be added in much the same way with the new add_column() function:
df %>% add_column(z = -1:1, w = 0)
#> # A tibble: 3 × 4
#>       x     y     z     w
#>   <int> <int> <int> <dbl>
#> 1     1     3    -1     0
#> 2     2     2     0     0
#> 3     3     1     1     0
It also supports the .before and .after arguments:
df %>% add_column(z = -1:1, .after = 1)
#> # A tibble: 3 × 3
#>       x     z     y
#>   <int> <int> <int>
#> 1     1    -1     3
#> 2     2     0     2
#> 3     3     1     1

df %>% add_column(w = 0:2, .before = "x")
#> # A tibble: 3 × 3
#>       w     x     y
#>   <int> <int> <int>
#> 1     0     1     3
#> 2     1     2     2
#> 3     2     3     1
The add_column() function will never alter your existing data: you can’t overwrite existing columns, and you can’t add new observations.
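A minimal sketch of these two safeguards, reusing the small tibble from above. The exact error messages differ between tibble versions, so the sketch only captures the conditions with tryCatch() and checks that an error was raised.

```r
library(tibble)

df <- tibble(x = 1:3, y = 3:1)

# Trying to overwrite the existing column x is expected to fail
overwrite <- tryCatch(
  add_column(df, x = 4:6),
  error = function(e) e
)

# Trying to add a column with more rows than df is expected to fail too
too_long <- tryCatch(
  add_column(df, z = 1:4),
  error = function(e) e
)

print(inherits(overwrite, "error"))
print(inherits(too_long, "error"))
```

Because both calls stop with an error, df itself is left untouched.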
frame_data() is now tribble(), which stands for “transposed tibble”. The old name still works, but will be deprecated eventually.
tribble(
  ~x, ~y,
   1, "a",
   2, "z"
)
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2     z
We’ve tweaked the output again to use the multiply character × instead of x when printing dimensions (this still renders nicely on Windows). We surround non-syntactic column names with backticks, and dttm is now used instead of time for date-time columns, to distinguish them from time values such as those from the hms package.
The example below shows the new rendering:
tibble(`date and time` = Sys.time(), time = hms::hms(minutes = 3))
#> # A tibble: 1 × 2
#>       `date and time`     time
#>                <dttm>   <time>
#> 1 2016-08-29 16:48:57 00:03:00
Expect the printed output to continue to evolve in the next release. Stay tuned for a new function that reconstructs tribble() calls from existing data frames.
I’m pleased to announce version 1.1.0 of stringr. stringr makes string manipulation easier by using consistent function and argument names, and eliminating options that you don’t need 95% of the time. To get started with stringr, check out the strings chapter in R for Data Science. Install it with:

install.packages("stringr")

This release is mostly bug fixes, but there are a couple of new features you might care about.
- There are three new datasets, fruit, words and sentences, to help you practice your regular expression skills:
str_subset(fruit, "(..)\\1")
#> [1] "banana"      "coconut"     "cucumber"    "jujube"      "papaya"
#> [6] "salal berry"
head(words)
#> [1] "a"        "able"     "about"    "absolute" "accept"   "account"
sentences[1]
#> [1] "The birch canoe slid on the smooth planks."
- More functions work with boundary(): str_detect() and str_subset() can detect boundaries, and str_extract() and str_extract_all() pull out the components between boundaries. This is particularly useful if you want to extract logical constructs like words or sentences.
x <- "This is harder than you might expect, e.g. punctuation!"
x %>% str_extract_all(boundary("word")) %>% .[[1]]
#> [1] "This"        "is"          "harder"      "than"        "you"
#> [6] "might"       "expect"      "e.g"         "punctuation"
x %>% str_extract(boundary("sentence"))
#> [1] "This is harder than you might expect, e.g. punctuation!"
- str_view() and str_view_all() create HTML widgets that display regular expression matches. This is particularly useful for teaching.
For a complete list of changes, please see the release notes.
Want to Master R? There’s no better time or place if you’re within an easy train, plane, automobile ride or a short jog of Hadley Wickham’s workshop on September 12th and 13th at the AMA Conference Center in New York City.
As of today, there are just 20+ seats left. Discounts are still available for academics (students or faculty) and for 5 or more attendees from any organization. Email email@example.com if you have any questions about the workshop that you don’t find answered on the registration page.
Hadley has no Master R workshops planned for Boston, Washington DC, New York City or any location in the Northeast in the next year. If you’ve always wanted to take Master R but haven’t found the time, well, there’s truly no better time!
P.S. We’ve arranged a “happy hour” reception after class on Monday the 12th. Be sure to set aside an hour or so after the first day to talk to your classmates and Hadley about what’s happening in R.
I’m pleased to announce tidyr 0.6.0. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has a simple convention: put variables in the columns and observations in the rows. You can learn more about it in the tidy data vignette. Install it with:

install.packages("tidyr")
I mostly released this version to bundle up a number of small tweaks needed for R for Data Science. But there’s one nice new feature, contributed by Jan Schulz:
drop_na() drops rows containing missing values:
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df
#> # A tibble: 3 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2  <NA>
#> 3    NA     b

# Called without arguments, it drops rows containing
# missing values in any variable:
df %>% drop_na()
#> # A tibble: 1 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a

# Or you can restrict the variables it looks at,
# using select() style syntax:
df %>% drop_na(x)
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2  <NA>
Please see the release notes for a complete list of changes.
The R package DT v0.2 is on CRAN now. You may install it from CRAN via install.packages('DT'), or update your R packages if you have already installed it. It has been over a year since the last CRAN release of DT, and there have been a lot of changes in both DT and the upstream DataTables library. You may read the release notes to see all the changes; we want to highlight two major ones here:
- Two extensions “TableTools” and “ColVis” have been removed from DataTables, and a new extension named “Buttons” was added. See this page for examples.
- For tables in the server-side processing mode (the default mode for tables in Shiny), the selected row indices are integers instead of characters (row names) now. This is for consistency with the client-side mode (which returns integer indices). In many cases, it does not make much difference if you index an R object with integers or names, and we hope this will not be a breaking change to your Shiny apps.
In terms of new features added in the new version of DT, the most notable ones are:
- Besides row selections, you can also select columns or cells. Please note the implementation is not based on the “Select” extension of DataTables, so not all features of “Select” are available in DT. You can find examples of row/column/cell selections on this page.
- There are a number of new functions to modify an existing table instance in a Shiny app without rebuilding the full table widget. One significant advantage of this feature is that it is much faster and more efficient to update certain aspects of a table, e.g., you can change the table caption, or set the global search keyword of a table, without making DT create the whole table from scratch. You can even replace the data object behind the table on the fly (using DT::replaceData()), and after the data is updated, the table state can be preserved (e.g., sorting and filtering can remain the same).
- A few formatting functions such as formatString() were also added to the package.
readr 1.0.0 is now available on CRAN. readr makes it easy to read many types of rectangular data, including csv, tsv and fixed width files. Compared to base equivalents like read.csv(), readr is much faster and gives more convenient output: it never converts strings to factors, can parse date/times, and it doesn’t munge the column names. Install the latest version with:

install.packages("readr")

Releasing a version 1.0.0 was a deliberate choice to reflect the maturity and stability of readr, thanks largely to work by Jim Hester. readr is by no means perfect, but I don’t expect any major changes to the API in the future.
In this version we:
- Used a better strategy for guessing column types.
- Improved the default date and time parsers.
- Provided a full set of lower-level file and line readers and writers.
- Fixed many bugs.
The process by which readr guesses the types of columns has received a substantial overhaul to make it easier to fix problems when the initial guesses aren’t correct, and to make it easier to generate reproducible code. Column specifications are now printed by default when you read from a file:
mtcars2 <- read_csv(readr_example("mtcars.csv"))
#> Parsed with column specification:
#> cols(
#>   mpg = col_double(),
#>   cyl = col_integer(),
#>   disp = col_double(),
#>   hp = col_integer(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_integer(),
#>   am = col_integer(),
#>   gear = col_integer(),
#>   carb = col_integer()
#> )
The thought is that once you’ve figured out the correct column types for a file, you should make the parsing strict. You can do this either by copying and pasting the printed column specification or by saving the spec to disk:
# Once you've figured out the correct types
mtcars_spec <- write_rds(spec(mtcars2), "mtcars2-spec.rds")

# Every subsequent load
mtcars2 <- read_csv(
  readr_example("mtcars.csv"),
  col_types = read_rds("mtcars2-spec.rds")
)

# In production, you might want to throw an error if there
# are any parsing problems.
stop_for_problems(mtcars2)
You can now also adjust the number of rows that readr uses to guess the column types with guess_max:
challenge <- read_csv(readr_example("challenge.csv"))
#> Parsed with column specification:
#> cols(
#>   x = col_integer(),
#>   y = col_character()
#> )
#> Warning: 1000 parsing failures.
#>  row col               expected             actual
#> 1001   x no trailing characters .23837975086644292
#> 1002   x no trailing characters .41167997173033655
#> 1003   x no trailing characters .7460716762579978
#> 1004   x no trailing characters .723450553836301
#> 1005   x no trailing characters .614524137461558
#> .... ... ...................... ..................
#> See problems(...) for more details.

challenge <- read_csv(readr_example("challenge.csv"), guess_max = 1500)
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )
(If you want to suppress the printed specification, just provide the dummy spec col_types = cols().)
You can now access the guessing algorithm from R:
guess_parser() will tell you which parser readr will select.
guess_parser("1,234")
#> [1] "number"

# Were previously guessed as numbers
guess_parser(c(".", "-"))
#> [1] "character"
guess_parser(c("10W", "20N"))
#> [1] "character"

# Now uses the default time format
guess_parser("10:30")
#> [1] "time"
Date-time parsing improvements:
The date time parsers recognise three new format strings:
%I for 12 hour time format:
library(hms)
parse_time("1 pm", "%I %p")
#> 13:00:00
Note that parse_time() returns an hms object from the hms package, rather than a custom time class.
%AD and %AT are “automatic” date and time parsers. They are both slightly less flexible than the previous defaults. The automatic date parser requires a four digit year, and only accepts - and / as separators. The flexible time parser now requires colons between hours and minutes, with optional seconds.
parse_date("2010-01-01", "%AD")
#> [1] "2010-01-01"
parse_time("15:01", "%AT")
#> 15:01:00
If the format argument is omitted in parse_date() or parse_time(), the default date and time formats specified in the locale will be used. These now default to %AD and %AT respectively. You may want to override them in your standard locale() if the conventions are different where you live.
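A minimal sketch of such an override, assuming a hypothetical locale where dates are written day/month/year and times use a dot between hours and minutes (the object names day_first and dotted_time are illustrative only):

```r
library(readr)

# Hypothetical locale: day/month/year dates instead of the %AD default
day_first <- locale(date_format = "%d/%m/%Y")

# With no explicit format, parse_date() falls back to the locale's date_format
d <- parse_date("29/08/2016", locale = day_first)
print(d)

# Hypothetical locale: "16.48" style times instead of the %AT default
dotted_time <- locale(time_format = "%H.%M")

# With no explicit format, parse_time() falls back to the locale's time_format
t <- parse_time("16.48", locale = dotted_time)
print(t)
```

Passing the same locale object to read_csv() and friends applies these defaults to every date and time column that does not specify its own format.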
Low-level readers and writers
readr now contains a full set of efficient lower-level readers:
- read_file() reads a file into a length-1 character vector; read_file_raw() reads a file into a single raw vector.
- read_lines() reads a file into a character vector with one entry per line; read_lines_raw() reads into a list of raw vectors with one entry per line.
These are paired with write_lines() and write_file() to efficiently write character and raw vectors back to disk.
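A minimal round-trip sketch using a temporary file (the file contents and variable names are made up for illustration). Arguments are passed by position because the name of the path argument has changed between readr releases.

```r
library(readr)

path <- tempfile(fileext = ".txt")

# Write a character vector, one element per line
write_lines(c("first line", "second line"), path)

# Read it back in the four ways described above
lines     <- read_lines(path)       # character vector, one entry per line
whole     <- read_file(path)        # length-1 character vector
raw_lines <- read_lines_raw(path)   # list of raw vectors, one per line
raw_whole <- read_file_raw(path)    # single raw vector

# write_file() writes a single string back to disk unchanged
copy <- tempfile(fileext = ".txt")
write_file(whole, copy)
round_trip <- read_file(copy)

print(lines)
print(rawToChar(raw_lines[[1]]))
```

The raw variants are handy when the encoding of a file is unknown and the bytes need to be inspected before converting them to text.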
- read_fwf() was overhauled to reliably read only a partial set of columns, to read files with ragged final columns (by setting the final position/width to NA), and to skip comments (with the comment argument).
- readr contains an experimental API for reading a file in chunks, e.g. read_lines_chunked(). These functions allow you to work with files that are bigger than memory. We haven’t yet finalised the API so please use with care, and send us your feedback.
- There are many other bug fixes and minor improvements. You can see a complete list in the release notes.