You are currently browsing hadleywickham’s articles.

I’m pleased to announced that the first version of xml2 is now available on CRAN. Xml2 is a wrapper around the comprehensive libxml2 C library that makes it easier to work with XML and HTML in R:

  • Read XML and HTML with read_xml() and read_html().
  • Navigate the tree with xml_children(), xml_siblings() and xml_parent(). Alternatively, use xpath to jump directly to the nodes you’re interested in with xml_find_one() and xml_find_all(). Get the full path to a node with xml_path().
  • Extract various components of a node with xml_text(), xml_attrs(), xml_attr(), and xml_name().
  • Convert to list with as_list().
  • Where appropriate, functions support namespaces with a global url -> prefix lookup table. See xml_ns() for more details.
  • Convert relative urls to absolute with url_absolute(), and transform in the opposite direction with url_relative(). Escape and unescape special characters with url_escape() and url_unescape().
  • Support for modifying and creating xml documents in planned in a future version.

This package owes a debt of gratitude to Duncan Temple Lang who’s XML package has made it possible to use XML with R for almost 15 years!

Usage

You can install it by running:

install.packages("xml2")

(If you’re on a mac, you might need to wait a couple of days – CRAN is busy rebuilding all the packages for R 3.2.0 so it’s running a bit behind.)

Here’s a small example working with an inline XML document:

library(xml2)
x <- read_xml("<foo>
  <bar>text <baz id = 'a' /></bar>
  <bar>2</bar>
  <baz id = 'b' /> 
</foo>")

xml_name(x)
#> [1] "foo"
xml_children(x)
#> {xml_nodeset (3)}
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>

# Find all baz nodes anywhere in the document
baz <- xml_find_all(x, ".//baz")
baz
#> {xml_nodeset (2)}
#> [1] <baz id="a"/>
#> [2] <baz id="b"/>
xml_path(baz)
#> [1] "/foo/bar[1]/baz" "/foo/baz"
xml_attr(baz, "id")
#> [1] "a" "b"

Development

Xml2 is still under active development. If notice any problems (including crashes), please try the development version, and if that doesn’t work, file an issue.

I’m pleased to announced that the first version of readxl is now available on CRAN. Readxl makes it easy to get tabular data out of excel. It:

  • Supports both the legacy .xls format and the modern xml-based .xlsx format. .xls support is made possible the with libxls C library, which abstracts away many of the complexities of the underlying binary format. To parse .xlsx, we use the insanely fast RapidXML C++ library.
  • Has no external dependencies so it’s easy to use on all platforms.
  • Re-encodes non-ASCII characters to UTF-8.
  • Loads datetimes into POSIXct columns. Both Windows (1900) and Mac (1904) date specifications are processed correctly.
  • Blank columns are automatically dropped.
  • Returns output with class c("tbl_df", "tbl", "data.frame") so if you also use dplyr you’ll get an enhanced print method (i.e. you’ll see just the first ten rows, not the first 10,000!).

You can install it by running:

install.packages("readxl")

There’s not really much to say about how to use it:

library(readxl)
# Use a excel file included in the package
sample <- system.file("extdata", "datasets.xlsx", package = "readxl")

# Read by position
head(read_excel(sample, 2))
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Or by name:
excel_sheets(sample)
#> [1] "iris"     "mtcars"   "chickwts" "quakes"
head(read_excel(sample, "mtcars"))
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

You can see the documentation for more info on the col_names, col_types and na arguments.

Readxl is still under active development. If you have problems loading a dataset, please try the development version, and if that doesn’t work, file an issue.

I’m pleased to announced that readr is now available on CRAN. Readr makes it easy to read many types of tabular data:

  • Delimited files withread_delim(), read_csv(), read_tsv(), and read_csv2().
  • Fixed width files with read_fwf(), and read_table().
  • Web log files with read_log().

You can install it by running:

install.packages("readr")

Compared to the equivalent base functions, readr functions are around 10x faster. They’re also easier to use because they’re more consistent, they produce data frames that are easier to use (no more stringsAsFactors = FALSE!), they have a more flexible column specification, and any parsing problems are recorded in a data frame. Each of these features is described in more detail below.

Input

All readr functions work the same way. There are four important arguments:

  • file gives the file to read; a url or local path. A local path can point to a a zipped, bzipped, xzipped, or gzipped file – it’ll be automatically uncompressed in memory before reading. You can also pass in a connection or a raw vector.

    For small examples, you can also supply literal data: if file contains a new line, then the data will be read directly from the string. Thanks to data.table for this great idea!

    library(readr)
    read_csv("x,y\n1,2\n3,4")
    #>   x y
    #> 1 1 2
    #> 2 3 4
  • col_names: describes the column names (equivalent to header in base R). It has three possible values:
    • TRUE will use the the first row of data as column names.
    • FALSE will number the columns sequentially.
    • A character vector to use as column names.
  • col_types: overrides the default column types (equivalent to colClasses in base R). More on that below.
  • progress: By default, readr will display a progress bar if the estimated loading time is greater than 5 seconds. Use progress = FALSE to suppress the progress indicator.

Output

The output has been designed to make your life easier:

  • Characters are never automatically converted to factors (i.e. no more stringsAsFactors = FALSE!).
  • Column names are left as is, not munged into valid R identifiers (i.e. there is no check.names = TRUE). Use backticks to refer to variables with unusual names, e.g. df$`Income ($000)`.
  • The output has class c("tbl_df", "tbl", "data.frame") so if you also use dplyr you’ll get an enhanced print method (i.e. you’ll see just the first ten rows, not the first 10,000!).
  • Row names are never set.

Column types

Readr heuristically inspects the first 100 rows to guess the type of each columns. This is not perfect, but it’s fast and it’s a reasonable start. Readr can automatically detect these column types:

  • col_logical() [l], contains only T, F, TRUE or FALSE.
  • col_integer() [i], integers.
  • col_double() [d], doubles.
  • col_euro_double() [e], “Euro” doubles that use , as the decimal separator.
  • col_date() [D]: Y-m-d dates.
  • col_datetime() [T]: ISO8601 date times
  • col_character() [c], everything else.

You can manually specify other column types:

  • col_skip() [_], don’t import this column.
  • col_date(format) and col_datetime(format, tz), dates or date times parsed with given format string. Dates and times are rather complex, so they’re described in more detail in the next section.
  • col_numeric() [n], a sloppy numeric parser that ignores everything apart from 0-9, - and . (this is useful for parsing currency data).
  • col_factor(levels, ordered), parse a fixed set of known values into a (optionally ordered) factor.

There are two ways to override the default choices with the col_types argument:

  • Use a compact string: "dc__d". Each letter corresponds to a column so this specification means: read first column as double, second as character, skip the next two and read the last column as a double. (There’s no way to use this form with column types that need parameters.)
  • With a (named) list of col objects:
    read_csv("iris.csv", col_types = list(
      Sepal.Length = col_double(),
      Sepal.Width = col_double(),
      Petal.Length = col_double(),
      Petal.Width = col_double(),
      Species = col_factor(c("setosa", "versicolor", "virginica"))
    ))

    Any omitted columns will be parsed automatically, so the previous call is equivalent to:

    read_csv("iris.csv", col_types = list(
      Species = col_factor(c("setosa", "versicolor", "virginica"))
    )

Dates and times

One of the most helpful features of readr is its ability to import dates and date times. It can automatically recognise the following formats:

  • Dates in year-month-day form: 2001-10-20 or 2010/15/10 (or any non-numeric separator). It can’t automatically recongise dates in m/d/y or d/m/y format because they’re ambiguous: is 02/01/2015 the 2nd of January or the 1st of February?
  • Date times as ISO8601 form: e.g. 2001-02-03 04:05:06.07 -0800, 20010203 040506, 20010203 etc. I don’t support every possible variant yet, so please let me know if it doesn’t work for your data (more details in ?parse_datetime).

If your dates are in another format, don’t despair. You can use col_date() and col_datetime() to explicit specify a format string. Readr implements it’s own strptime() equivalent which supports the following format strings:

  • Year: \%Y (4 digits). \%y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
  • Month: \%m (2 digits), \%b (abbreviated name in current locale), \%B (full name in current locale).
  • Day: \%d (2 digits), \%e (optional leading space)
  • Hour: \%H
  • Minutes: \%M
  • Seconds: \%S (integer seconds), \%OS (partial seconds)
  • Time zone: \%Z (as name, e.g. America/Chicago), \%z (as offset from UTC, e.g. +0800)
  • Non-digits: \%. skips one non-digit charcater, \%* skips any number of non-digit characters.
  • Shortcuts: \%D = \%m/\%d/\%y, \%F = \%Y-\%m-\%d, \%R = \%H:\%M, \%T = \%H:\%M:\%S, \%x = \%y/\%m/\%d.

To practice parsing date times with out having to load the file each time, you can use parse_datetime() and parse_date():

parse_date("2015-10-10")
#> [1] "2015-10-10"
parse_datetime("2015-10-10 15:14")
#> [1] "2015-10-10 15:14:00 UTC"

parse_date("02/01/2015", "%m/%d/%Y")
#> [1] "2015-02-01"
parse_date("02/01/2015", "%d/%m/%Y")
#> [1] "2015-01-02"

Problems

If there are any problems parsing the file, the read_ function will throw a warning telling you how many problems there are. You can then use the problems() function to access a data frame that gives information about each problem:

csv <- "x,y
1,a
b,2
"

df <- read_csv(csv, col_types = "ii")
#> Warning: 2 problems parsing literal data. See problems(...) for more
#> details.
problems(df)
#>   row col   expected actual
#> 1   1   2 an integer      a
#> 2   2   1 an integer      b
df
#>    x  y
#> 1  1 NA
#> 2 NA  2

Helper functions

Readr also provides a handful of other useful functions:

  • read_lines() works the same way as readLines(), but is a lot faster.
  • read_file() reads a complete file into a string.
  • type_convert() attempts to coerce all character columns to their appropriate type. This is useful if you need to do some manual munging (e.g. with regular expressions) to turn strings into numbers. It uses the same rules as the read_* functions.
  • write_csv() writes a data frame out to a csv file. It’s quite a bit faster than write.csv() and it never writes row.names. It also escapes " embedded in strings in a way that read_csv() can read.

Development

Readr is still under very active development. If you have problems loading a dataset, please try the development version, and if that doesn’t work, file an issue.

I’m pleased to announced that the new haven package is now available on CRAN. Haven makes it easy to read data from SAS, SPSS and Stata. Haven has the same goal as the foreign package, but it:

  • Can read binary SAS7BDAT files.
  • Can read Stata13 files.
  • Always returns a data frame.

(Haven also has experimental support for writing SPSS and Stata data. This still has some rough edges but please try it out and report any problems that you find.)

Haven is a binding to the excellent ReadStat C library by Evan Miller. Haven wouldn’t be possible without his hard work – thanks Evan! I’d also like to thank Matt Shotwell who spend a lot of time reverse engineering the SAS binary data format, and Dennis Fisher who tested the SAS code with thousands of SAS files.

Usage

Using haven is easy:

  • Install it, install.packages("haven"),
  • Load it, library(haven),
  • Then pick the appropriate read function:
    • SAS: read_sas()
    • SPSS: read_sav() or read_por()
    • Stata: read_dta().

These only need the name of the path. (read_sas() optionally also takes the path to a catolog file.)

Output

All functions return a data frame:

  • The output also has class tbl_df which will improve the default print method (to only show the first ten rows and the variables that fit on one screen) if you have dplyr loaded. If you don’t use dplyr, it has no effect.
  • Variable labels are attached as an attribute to each variable. These are not printed (because they tend to be long), but if you have a preview version of RStudio, you’ll see them in the revamped viewer pane.
  • Missing values in numeric variables should be seemlessly converted. Missing values in character variables are converted to the empty string, "": if you want to convert them to missing values, use zap_empty().
  • Dates are converted in to Dates, and datetimes to POSIXcts. Time variables are read into a new class called hms which represents an offset in seconds from midnight. It has print() and format() methods to nicely display times, but otherwise behaves like an integer vector.
  • Variables with labelled values are turned into a new labelled class, as described next.

Labelled variables

SAS, Stata and SPSS all have the notion of a “labelled” variable. These are similar to factors, but:

  • Integer, numeric and character vectors can be labelled.
  • Not every value must be associated with a label.

Factors, by contrast, are always integers and every integer value must be associated with a label.

Haven provides a labelled class to model these objects. It doesn’t implement any common methods, but instead focusses of ways to turn a labelled variable into standard R variable:

  • as_factor(): turns labelled integers into factors. Any values that don’t have a label associated with them will become a missing value. (NB: there’s no way to make as.factor() work with labelled variables, so you’ll need to use this new function.)
  • zap_labels(): turns any labelled values into missing values. This deals with the common pattern where you have a continuous variable that has missing values indiciated by sentinel values.

If you have a use case that’s not covered by these function, please let me know.

Development

Haven is still under very active development. If you have problems loading a dataset, please try the development version, and if that doesn’t work, file an issue.

I’m very pleased to announce that Epoch.com has stepped up as a sponsor for the RMySQL package.

For the last 20 years, Epoch.com has built its Internet Payment Service Provider infrastructure on open source software. Their data team, led by Szilard Pafka, PhD, has been using R for nearly a decade, developing cutting-edge data visualization, machine learning and other analytical applications. According to Epoch, “We have always believed in the value of R and in the importance of contributing to the open source community.”

This sort of sponsorship is very important to me. While I already spend most of my time working on R packages, I don’t have the skills to fix every problem. Sponsorship allows me to hire outside experts. In this case, Epoch.com’s sponsorship allowed me to work with Jeroen Ooms to improve the build system for RMySQL so that a CRAN binary is available for every platform.

Is your company interested in sponsoring other infrastructure work that benefits the whole R community? If so, please get in touch.

I’m very pleased to announce that dplyr 0.4.0 is now available from CRAN. Get the latest version by running:

install.packages("dplyr")

dplyr 0.4.0 includes over 80 minor improvements and bug fixes, which are described in detail in the release notes. Here I wanted to draw your attention to two areas that have particularly improved since dplyr 0.3, two-table verbs and data frame support.

Two table verbs

dplyr now has full support for all two-table verbs provided by SQL:

  • Mutating joins, which add new variables to one table from matching rows in another: inner_join(), left_join(), right_join(), full_join(). (Support for non-equi joins is planned for dplyr 0.5.0.)
  • Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table: semi_join(), anti_join().
  • Set operations, which combine the observations in two data sets as if they were set elements: intersect(), union(), setdiff().

Together, these verbs should allow you to solve 95% of data manipulation problems that involve multiple tables. If any of the concepts are unfamiliar to you, I highly recommend reading the two-table vignette (and if you still don’t understand, please let me know so I can make it better.)

Data frames

dplyr wraps data frames in a tbl_df class. These objects are structured in exactly the same way as regular data frames, but their behaviour has been tweaked a little to make them easier to work with. The new data_frames vignette describes how dplyr works with data frames in general, and below I highlight some of the features new in 0.4.0.

Printing

The biggest difference is printing: print.tbl_df() doesn’t try and print 10,000 rows! Printing got a lot of love in dplyr 0.4 and now:

  • All print() method methods invisibly return their input so you can interleave print() statements into a pipeline to see interim results.
  • If you’ve managed to produce a 0-row data frame, dplyr won’t try to print the data, but will tell you the column names and types:
    data_frame(x = numeric(), y = character())
    #> Source: local data frame [0 x 2]
    #> 
    #> Variables not shown: x (dbl), y (chr)
  • dplyr never prints row names since no dplyr method is guaranteed to preserve them:
    df <- data.frame(x = c(a = 1, b = 2, c = 3))
    df
    #>   x
    #> a 1
    #> b 2
    #> c 3
    df %>% tbl_df()
    #> Source: local data frame [3 x 1]
    #> 
    #>   x
    #> 1 1
    #> 2 2
    #> 3 3

    I don’t think using row names is a good idea because it violates one of the principles of tidy data: every variable should be stored in the same way.

    To make life a bit easier if you do have row names, you can use the new add_rownames() to turn your row names into a proper variable:

    df %>% 
      add_rownames()
    #>   rowname x
    #> 1       a 1
    #> 2       b 2
    #> 3       c 3

    (But you’re better off never creating them in the first place.)

  • options(dplyr.print_max) is now 20, so dplyr will never print more than 20 rows of data (previously it was 100). The best way to see more rows of data is to use View().

Coercing lists to data frames

When you have a list of vectors of equal length that you want to turn into a data frame, dplyr provides as_data_frame() as a simple alternative to as.data.frame(). as_data_frame() is considerably faster than as.data.frame() because it does much less:

l <- replicate(26, sample(100), simplify = FALSE)
names(l) <- letters
microbenchmark::microbenchmark(
  as_data_frame(l),
  as.data.frame(l)
)
#> Unit: microseconds
#>              expr      min        lq   median        uq      max neval
#>  as_data_frame(l)  101.856  112.0615  124.855  143.0965  254.193   100
#>  as.data.frame(l) 1402.075 1466.6365 1511.644 1635.1205 3007.299   100

It’s difficult to precisely describe what as.data.frame(x) does, but it’s similar to do.call(cbind, lapply(x, data.frame)) – it coerces each component to a data frame and then cbind()s them all together.

The speed of as.data.frame() is not usually a bottleneck in interactive use, but can be a problem when combining thousands of lists into one tidy data frame (this is common when working with data stored in json or xml).

Binding rows and columns

dplyr now provides bind_rows() and bind_cols() for binding data frames together. Compared to rbind() and cbind(), the functions:

  • Accept either individual data frames, or a list of data frames:
    a <- data_frame(x = 1:5)
    b <- data_frame(x = 6:10)
    
    bind_rows(a, b)
    #> Source: local data frame [10 x 1]
    #> 
    #>    x
    #> 1  1
    #> 2  2
    #> 3  3
    #> 4  4
    #> 5  5
    #> .. .
    bind_rows(list(a, b))
    #> Source: local data frame [10 x 1]
    #> 
    #>    x
    #> 1  1
    #> 2  2
    #> 3  3
    #> 4  4
    #> 5  5
    #> .. .

    If x is a list of data frames, bind_rows(x) is equivalent to do.call(rbind, x).

  • Are much faster:
    dfs <- replicate(100, data_frame(x = runif(100)), simplify = FALSE)
    microbenchmark::microbenchmark(
      do.call("rbind", dfs),
      bind_rows(dfs)
    )
    #> Unit: microseconds
    #>                   expr      min        lq   median        uq       max
    #>  do.call("rbind", dfs) 5344.660 6605.3805 6964.236 7693.8465 43457.061
    #>         bind_rows(dfs)  240.342  262.0845  317.582  346.6465  2345.832
    #>  neval
    #>    100
    #>    100

(Generally you should avoid bind_cols() in favour of a join; otherwise check carefully that the rows are in a compatible order).

List-variables

Data frames are usually made up of a list of atomic vectors that all have the same length. However, it’s also possible to have a variable that’s a list, which I call a list-variable. Because of data.frame()s complex coercion rules, the easiest way to create a data frame containing a list-column is with data_frame():

data_frame(x = 1, y = list(1), z = list(list(1:5, "a", "b")))
#> Source: local data frame [1 x 3]
#> 
#>   x        y         z
#> 1 1 <dbl[1]> <list[3]>

Note how list-variables are printed: a list-variable could contain a lot of data, so dplyr only shows a brief summary of the contents. List-variables are useful for:

  • Working with summary functions that return more than one value:
    qs <- mtcars %>%
      group_by(cyl) %>%
      summarise(y = list(quantile(mpg)))
    
    # Unnest input to collpase into rows
    qs %>% tidyr::unnest(y)
    #> Source: local data frame [15 x 2]
    #> 
    #>    cyl    y
    #> 1    4 21.4
    #> 2    4 22.8
    #> 3    4 26.0
    #> 4    4 30.4
    #> 5    4 33.9
    #> .. ...  ...
    
    # To extract individual elements into columns, wrap the result in rowwise()
    # then use summarise()
    qs %>% 
      rowwise() %>% 
      summarise(q25 = y[2], q75 = y[4])
    #> Source: local data frame [3 x 2]
    #> 
    #>     q25   q75
    #> 1 22.80 30.40
    #> 2 18.65 21.00
    #> 3 14.40 16.25
  • Keeping associated data frames and models together:
    by_cyl <- split(mtcars, mtcars$cyl)
    models <- lapply(by_cyl, lm, formula = mpg ~ wt)
    
    data_frame(cyl = c(4, 6, 8), data = by_cyl, model = models)
    #> Source: local data frame [3 x 3]
    #> 
    #>   cyl            data   model
    #> 1   4 <S3:data.frame> <S3:lm>
    #> 2   6 <S3:data.frame> <S3:lm>
    #> 3   8 <S3:data.frame> <S3:lm>

dplyr’s support for list-variables continues to mature. In 0.4.0, you can join and row bind list-variables and you can create them in summarise and mutate.

My vision of list-variables is still partial and incomplete, but I’m convinced that they will make pipeable APIs for modelling much eaiser. See the draft lowliner package for more explorations in this direction.

Bonus

My colleague, Garrett, helped me make a cheat sheet that summarizes the data wrangling features of dplyr 0.4.0. You can download it from RStudio’s new gallery of R cheat sheets.

Data wrangling cheatsheet

Jeroen Ooms and I are very pleased to announce a new version of RMySQL, the R package that allows you to talk to MySQL (and MariaDB) databases. We have taken over maintenance from Jeffrey Horner, who has done a great job of maintaining the package of the last few years, but no longer has time to look after it. Thanks for all your hard work Jeff!

Using RMySQL

library(DBI)

# Connect to a public database that I'm running on Google's 
# cloud SQL service. It contains a copy of the data in the
# datasets package.
con <-  dbConnect(RMySQL::MySQL(), 
  username = "public", 
  password = "F60RUsyiG579PeKdCH",
  host = "173.194.227.144", 
  port = 3306, 
  dbname = "datasets"
)

# Run a query
dbGetQuery(con, "SELECT * FROM mtcars WHERE cyl = 4 AND mpg < 23")
#>       row_names  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1    Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 2      Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> 3 Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> 4    Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

# It's polite to let the database know when you're done
dbDisconnect(con)
#> [1] TRUE

It’s generally a bad idea to put passwords in your code, so instead of typing them directly, you can create a file called ~/.my.cnf that contains

[cloudSQL]
username=public
password=F60RUsyiG579PeKdCH
host=173.194.227.144
port=3306
database=datasets

Then you can connect with:

con <-  dbConnect(RMySQL::MySQL(), group = "cloudSQL")

Changes in this release

RMySQL 0.10.0 is mostly a cleanup release. RMySQL is one of the oldest packages on CRAN, and according to the timestamps, it is older than many recommended packages, and only slightly younger than MASS! That explains why a facelift was well overdue.

The most important change is an improvement to the build process so that CRAN binaries are now available for Windows and OS X Mavericks. This should make your life much easier if you’re on one of these platforms. We’d love your feedback on the new build scripts. There have been many problems in the past, so we’d like to know that this client works well across platforms and versions of MySQL server.

Otherwise, the changes update RMySQL for DBI 0.3 compatibility:

  • Internal mysql*() functions are no longer exported. Please use the corresponding DBI generics instead.
  • RMySQL gains transaction support with dbBegin(), dbCommit(), and dbRollback(). (But note that MySQL does not allow data definition language statements to be rolled back.)
  • Added method for dbFetch(). Please use this instead of fetch(). dbFetch() now returns a 0-row data frame (instead of an 0-col data frame) if there are no results.
  • Added methods for dbIsValid(). Please use these instead of isIdCurrent().
  • dbWriteTable() has been rewritten. It uses a better quoting strategy, throws errors on failure, and only automatically adds row names only if they’re strings. (NB: dbWriteTable() also has a method that allows you load files directly from disk – this is likely to be faster if your file is one of the formats supported.)

For a complete list of changes, please see the full release notes.

ggplot2 1.0.0

As you might have noticed, ggplot2 recently turned 1.0.0. This release incorporated a handful of new features and bug fixes, but most importantly reflects that ggplot2 is now a mature plotting system and it will not change significantly in the future.

This does not mean ggplot2 is dead! The ggplot2 community is rich and vibrant and the number of packages that build on top of ggplot2 continues to grow. We are committed to maintaining ggplot2 so that you can continue to rely on it for years to come.

The ggplot2 book

Since ggplot2 is now stable, and the ggplot2 book is over five years old and rather out of date, I’m also happy to announce that I’m working on a second edition. I’ll be ably assisted in this endeavour by Carson Sievert, who’s so far done a great job of converting the source to Rmd and updating many of the examples to work with ggplot2 1.0.0. In the coming months we’ll be rewriting the data chapter to reflect modern best practices (e.g. tidyr and dplyr), and adding sections about new features.

We’d love your help! The source code for the book is available on github. If you’ve spotted any mistakes in the first edition that you’d like to correct, we’d really appreciate a pull request. If there’s a particular section of the book that you think needs an update (or is just plain missing), please let us know by filing an issue. Unfortunately we can’t turn the book into a free website because of my agreement with the publisher, but at least you can now get easily get to the source.

httr 0.6.0 is now available on CRAN. The httr packages makes it easy to talk to web APIs from R. Learn more in the quick start vignette.

This release is mostly bug fixes and minor improvements. The most important are:

  • handle_reset(), which allows you to reset the default handle if you get the error “easy handle already used in multi handle”.
  • write_stream() which lets you process the response from a server as a stream of raw vectors (#143).
  • VERB() allows to you send a request with a custom http verb.
  • brew_dr() checks for common problems. It currently checks if your libcurl uses NSS. This is unlikely to work so it gives you some advice on how to fix the problem (thanks to Dirk Eddelbuettel for debugging this problem and suggesting a remedy).
  • Added support for Google OAuth2 service accounts. (#119, thanks to help from @siddharthab). See ?oauth_service_token for details.

I’ve also switched from RC to R6 (which should make it easier to extend OAuth classes for non-standard OAuth implementations), and tweaked the use of the backend SSL certificate details bundled with httr. See the release notes for complete details.

tidyr 0.2.0 is now available on CRAN. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has variables in columns and observations in rows, and is described in more detail in the tidy data vignette. Install tidyr with:

install.packages("tidyr")

There are three important additions to tidyr 0.2.0:

  • expand() is a wrapper around expand.grid() that allows you to generate all possible combinations of two or more variables. In conjunction with dplyr::left_join(), this makes it easy to fill in missing rows of data.
    sales <- dplyr::data_frame(
      year = rep(c(2012, 2013), c(4, 2)),
      quarter = c(1, 2, 3, 4, 2, 3), 
      sales = sample(6) * 100
    )
    
    # Missing sales data for 2013 Q1 & Q4
    sales
    #> Source: local data frame [6 x 3]
    #> 
    #>   year quarter sales
    #> 1 2012       1   400
    #> 2 2012       2   200
    #> 3 2012       3   500
    #> 4 2012       4   600
    #> 5 2013       2   300
    #> 6 2013       3   100
    
    # Missing values are now explicit
    sales %>% 
      expand(year, quarter) %>%
      dplyr::left_join(sales)
    #> Joining by: c("year", "quarter")
    #> Source: local data frame [8 x 3]
    #> 
    #>   year quarter sales
    #> 1 2012       1   400
    #> 2 2012       2   200
    #> 3 2012       3   500
    #> 4 2012       4   600
    #> 5 2013       1    NA
    #> 6 2013       2   300
    #> 7 2013       3   100
    #> 8 2013       4    NA
  • In the process of data tidying, it’s sometimes useful to have a column of a data frame that is a list of vectors. unnest() lets you simplify that column back down to an atomic vector, duplicating the original rows as needed. (NB: If you’re working with data frames containing lists, I highly recommend using dplyr’s tbl_df, which will display list-columns in a way that makes their structure more clear. Use dplyr::data_frame() to create a data frame wrapped with the tbl_df class.)
    raw <- dplyr::data_frame(
      x = 1:3,
      y = c("a", "d,e,f", "g,h")
    )
    # y is character vector containing comma separated strings
    raw
    #> Source: local data frame [3 x 2]
    #> 
    #>   x     y
    #> 1 1     a
    #> 2 2 d,e,f
    #> 3 3   g,h
    
    # y is a list of character vectors
    as_list <- raw %>% mutate(y = strsplit(y, ","))
    as_list
    #> Source: local data frame [3 x 2]
    #> 
    #>   x        y
    #> 1 1 <chr[1]>
    #> 2 2 <chr[3]>
    #> 3 3 <chr[2]>
    
    # y is a character vector; rows are duplicated as needed
    as_list %>% unnest(y)
    #> Source: local data frame [6 x 2]
    #> 
    #>   x y
    #> 1 1 a
    #> 2 2 d
    #> 3 2 e
    #> 4 2 f
    #> 5 3 g
    #> 6 3 h
  • separate() has a new extra argument that allows you to control what happens if a column doesn’t always split into the same number of pieces.
    raw %>% separate(y, c("trt", "B"), ",")
    #> Error: Values not split into 2 pieces at 1, 2
    raw %>% separate(y, c("trt", "B"), ",", extra = "drop")
    #> Source: local data frame [3 x 3]
    #> 
    #>   x trt  B
    #> 1 1   a NA
    #> 2 2   d  e
    #> 3 3   g  h
    raw %>% separate(y, c("trt", "B"), ",", extra = "merge")
    #> Source: local data frame [3 x 3]
    #> 
    #>   x trt   B
    #> 1 1   a  NA
    #> 2 2   d e,f
    #> 3 3   g   h

To read about the other minor changes and bug fixes, please consult the release notes.

reshape2 1.4.1

There’s also a new version of reshape2, 1.4.1. It includes three bug fixes for melt.data.frame() contributed by Kevin Ushey. Read all about them on the release notes and install it with:

install.packages("reshape2")
Follow

Get every new post delivered to your Inbox.

Join 12,630 other followers