You are currently browsing hadleywickham’s articles.

devtools 1.5 is now available on CRAN. It includes four new functions to make it easier to add useful infrastructure to packages:

  • add_test_infrastructure() will create testthat infrastructure when needed.

  • add_rstudio_project() adds an RStudio project file to your package.

  • add_travis() adds a basic template for travis-ci.

  • add_build_ignore() makes it easy to add files to .Rbuildignore,
    escaping special characters as needed.
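
The escaping matters because each line of .Rbuildignore is treated as a regular expression. The sketch below illustrates the idea in base R. It is not the devtools implementation: the helper names escape_for_buildignore() and add_ignore_line(), and the temporary package directory, are hypothetical.

```r
# Hypothetical helper: turn a literal file name into an anchored regular
# expression suitable for one line of .Rbuildignore.
escape_for_buildignore <- function(path) {
  # Put a backslash in front of every regex metacharacter
  escaped <- gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", path, perl = TRUE)
  paste0("^", escaped, "$")
}

# Hypothetical helper: append the escaped pattern, avoiding duplicate lines.
add_ignore_line <- function(pkg_dir, path) {
  ignore_file <- file.path(pkg_dir, ".Rbuildignore")
  existing <- if (file.exists(ignore_file)) readLines(ignore_file) else character()
  writeLines(union(existing, escape_for_buildignore(path)), ignore_file)
  invisible(ignore_file)
}

# Try it on a throwaway directory standing in for a package
pkg_dir <- tempfile("pkg")
dir.create(pkg_dir)
add_ignore_line(pkg_dir, "notes.md")
readLines(file.path(pkg_dir, ".Rbuildignore"))
```

Without the escaping, the dot in notes.md would match any character, so the pattern could exclude files you meant to keep.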

We’ve also bumped two dependencies: devtools now requires R 3.0.2 and roxygen2 3.0.0. This release includes many minor improvements and bug fixes, particularly for package installation. For example, install_github() now prefers the safer GitHub personal access token, and does a better job of installing the dependencies that you actually need. We also provide versions of help(), ? and system.file() that work with all packages, regardless of how they’re loaded. See a complete list of changes in the full release notes.

We’re very pleased to announce the release of httr 0.3. httr makes it
easy to work with modern web APIs so that you can work with web data
almost as easily as local data. For example, this code shows how you might
find the most recently asked question about R on Stack Overflow:

# install.packages("httr")
library(httr)

# Find the most recent R questions on stackoverflow
r <- GET(
  "http://api.stackexchange.com",
  path = "questions",
  query = list(
    site = "stackoverflow.com",
    tagged = "r"
  )
)

# Check the request succeeded
stop_for_status(r)

# Automatically parse the json output
questions <- content(r)
questions$items[[1]]$title
#> [1] "Remove NAs from data frame without deleting entire rows/columns"

httr 0.3 received a major overhaul to OAuth support. OAuth is a modern
standard for authentication used when you want to allow a service (e.g. an R
package) access to your account on a website. This version of httr
provides an improved initial authentication experience and supports
caching so that you only need to authenticate once per project. A big
thanks goes to Craig Citro (Google) who contributed a lot of code and
ideas to make this possible.

httr 0.3 also includes many other bug fixes and minor improvements. You
can read about these in the github release notes.

dplyr 0.1.3 is now on CRAN. It fixes an incompatibility with the latest version of Rcpp, and a number of other bugs that were causing dplyr to crash R. See the full details in the release notes.

We’re pleased to announce a new major version of testthat. Version 0.8 comes with a new recommended structure for storing your tests. To better meet CRAN recommended practices, we now recommend that tests live in tests/testthat, instead of inst/tests. This makes it possible for users to choose whether or not to install tests. With this new structure, you’ll need to use test_check() instead of test_package() in the test file (usually tests/testthat.R) that runs all testthat unit tests.
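
As a rough sketch of the layout described above, the snippet below writes the two kinds of file into a temporary directory so the structure is visible. The package name mypackage and the test file name test-example.R are hypothetical, and the snippet only creates files; it does not run any tests.

```r
# Build the recommended layout for a hypothetical package called "mypackage"
pkg_dir <- file.path(tempdir(), "mypackage")
dir.create(file.path(pkg_dir, "tests", "testthat"),
           recursive = TRUE, showWarnings = FALSE)

# tests/testthat.R: the runner file, which hands over to testthat
writeLines(
  c('library(testthat)',
    'test_check("mypackage")'),
  file.path(pkg_dir, "tests", "testthat.R")
)

# tests/testthat/test-example.R: individual test files live one level down
writeLines(
  c('test_that("addition works", {',
    '  expect_equal(1 + 1, 2)',
    '})'),
  file.path(pkg_dir, "tests", "testthat", "test-example.R")
)

# Show the resulting files relative to the package root
list.files(pkg_dir, recursive = TRUE)
```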

Another big improvement comes from Karl Forner. He contributed code which provides line numbers in test errors so you can see exactly where the problems are. There are also four new expectations (expect_null(), expect_named(), expect_more_than(), expect_less_than()) and many other minor improvements and bug fixes. For a complete list of changes, please see the github release. After the release of 0.8 to CRAN, we discovered two small bugs. These were fixed in 0.8.1.
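
A minimal sketch of two of the new expectations in use, assuming testthat is installed. The test description and the example values are made up for illustration.

```r
library(testthat)

test_that("new expectations behave as described", {
  # expect_null() passes only when the value is NULL
  expect_null(NULL)

  # expect_named() checks that an object has exactly these names
  expect_named(c(a = 1, b = 2), c("a", "b"))
})
```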

As always, you can install the latest version with install.packages("testthat").

We’re pleased to announce a new minor version of dplyr. This fixes a number of bugs that crashed R, and considerably improves the functionality of select(). You can now use named arguments to rename existing variables, and use the new functions starts_with(), ends_with(), contains(), matches() and num_range() to select variables based on their names. Finally, select() now makes a shallow copy, substantially reducing its memory impact. I’ve also added the summarize() alias for people from countries who don’t spell correctly ;)
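
Here is a small sketch of the select() features mentioned above, assuming dplyr is installed. The data frame df and its column names are invented for the example.

```r
library(dplyr)

# A made-up data frame with a few differently named columns
df <- data.frame(x1 = 1:3, x2 = 4:6, y_total = 7:9, id = c("a", "b", "c"))

# Named arguments rename a variable while selecting it
renamed <- select(df, key = id, x1)

# Helper functions pick columns based on their names
x_cols     <- select(df, num_range("x", 1:2))  # x1 and x2
y_cols     <- select(df, starts_with("y"))     # names beginning with "y"
total_cols <- select(df, ends_with("total"))   # names ending in "total"
under_cols <- select(df, contains("_"))        # names containing "_"
i_cols     <- select(df, matches("^i"))        # names matching a regex

names(renamed)
names(x_cols)
```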

For a complete list of changes, please see the github release, and as always, you can install the latest version with install.packages("dplyr").

We’re pleased to announce a new minor version of dplyr. This fixes a few bugs that crashed R, adds a few minor new features (like a sort argument to tally()), and uses shallow copying in a few more places. There is one backward incompatible change: explain_tbl() has been renamed to explain(). For a complete list of changes, please see the github release notice.
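
As a quick illustration of the sort argument to tally(), here is a sketch using the built-in mtcars dataset (chosen only for convenience) and assuming dplyr is installed.

```r
library(dplyr)

# Count cars per number of cylinders, with the largest groups first
by_cyl <- group_by(mtcars, cyl)
counts <- tally(by_cyl, sort = TRUE)

counts
```

With sort = TRUE the rows come back ordered by count, largest first, so there is no need for a separate arrange() step.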

As always, you can install the latest version with install.packages("dplyr").

We’re pleased to announce a new version of roxygen2. The biggest news is that roxygen2 now recognises reference class method docstrings and will automatically add them to the documentation. 3.1.0 also offers a number of minor improvements and bug fixes, as listed on the github release notice.
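
A docstring here means a string literal placed as the first expression in a reference class method. The sketch below shows what such a method might look like; the Account class and its deposit() method are hypothetical, and the roxygen comments are inert when the code is run as a plain script.

```r
library(methods)

#' A bank account
#'
#' A toy reference class used to illustrate method docstrings.
Account <- setRefClass("Account",
  fields = list(balance = "numeric"),
  methods = list(
    deposit = function(amount) {
      "Add amount to the balance and return the account invisibly."
      balance <<- balance + amount
      invisible(.self)
    }
  )
)

# The docstring does not affect how the method behaves
acct <- Account$new(balance = 10)
acct$deposit(5)
acct$balance
```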

As always, you can install the latest version with install.packages("roxygen2").

dplyr is a new package which provides a set of tools for efficiently manipulating datasets in R. dplyr is the next iteration of plyr, focussing on only data frames. dplyr is faster, has a more consistent API and should be easier to use. There are three key ideas that underlie dplyr:

  1. Your time is important, so Romain Francois has written the key pieces in Rcpp to provide blazing fast performance. Performance will only get better over time, especially once we figure out the best way to make the most of multiple processors.
  2. Tabular data is tabular data regardless of where it lives, so you should use the same functions to work with it. With dplyr, anything you can do to a local data frame you can also do to a remote database table. PostgreSQL, MySQL, SQLite and Google bigquery support is built-in; adding a new backend is a matter of implementing a handful of S3 methods.
  3. The bottleneck in most data analyses is the time it takes for you to figure out what to do with your data, and dplyr makes this easier by having individual functions that correspond to the most common operations (group_by, summarise, mutate, filter, select and arrange). Each function does only one thing, but does it well.

Let’s compare plyr and dplyr with a little example, using the Batting dataset from the fantastic Lahman package, which makes the complete Lahman baseball database easily accessible from R. Pretend we want to find the five players who have batted in the most games in all of baseball history.

In plyr, we might write code like this:

library(Lahman)
library(plyr)

games <- ddply(Batting, "playerID", summarise, total = sum(G))
head(arrange(games, desc(total)), 5)

We use ddply() to break up the Batting dataframe into pieces according to the playerID variable, then apply summarise() to reduce the player data to a single row. Each row in Batting represents one year of data for one player, so we figure out the total number of games with sum(G) and save it in a new variable called total. We sort the result so the most games come at the top and then use head() to pull off the first five.

In dplyr, the code is similar:

library(Lahman)
library(dplyr)

players <- group_by(Batting, playerID)
games <- summarise(players, total = sum(G))
head(arrange(games, desc(total)), 5)

But grouping is now a top-level operation performed by group_by(), and summarise() works directly on the grouped data, rather than being called from inside another function. The other big difference is speed. plyr took about 7s on my computer, and dplyr took 0.2s, a 35x speed-up. This is common when switching from plyr to dplyr, and for many operations you’ll see a 20x-1000x speedup.

dplyr provides another innovation over plyr: the ability to chain operations together from left to right with the %.% operator. This makes dplyr behave a little like a grammar of data manipulation:

Batting %.%
  group_by(playerID) %.%
  summarise(total = sum(G)) %.%
  arrange(desc(total)) %.%
  head(5)

Read more about it in the help, ?"%.%".

If this small example has whetted your interest, you can learn more from the built-in vignettes. First install dplyr with install.packages("dplyr"), then run browseVignettes(package = "dplyr").

You can track development progress at http://github.com/hadley/dplyr, report bugs at http://github.com/hadley/dplyr/issues and get help with data manipulation challenges at https://groups.google.com/group/manipulatr. If you ask a question specifically about dplyr on StackOverflow, please tag it with dplyr and I’ll make sure to read it.

We’re pleased to announce a new version of roxygen2. The biggest news is that you can painlessly document your S4 classes, S4 methods and RC classes with roxygen2 – you can safely remove workarounds that used @alias and @usage, and simply rely on roxygen2 to do the right thing. Roxygen2 is also much smarter when it comes to S3: you can remove existing uses of @method, and can replace @S3method with @export.
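
For S3, the change might look like the sketch below, where a plain @export tag on a method is all that is needed. The temperature class and its functions are hypothetical, and the roxygen comments only take effect when roxygen2 processes a package; run as a script, this is ordinary R code.

```r
#' Create a temperature object
#'
#' @param celsius Numeric vector of temperatures in degrees Celsius.
#' @export
temperature <- function(celsius) {
  structure(list(celsius = celsius), class = "temperature")
}

#' Format a temperature for display
#'
#' No @method or @S3method tag here: @export alone marks the method.
#'
#' @param x A temperature object.
#' @param ... Ignored.
#' @export
format.temperature <- function(x, ...) {
  paste0(x$celsius, " C")
}

# S3 dispatch finds the method through the generic
format(temperature(21.5))
```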

Version 3.0 also includes many other improvements, including better generation of usage, the ability to turn off wrapping in your Rd files and choose default roclets for a package, a safer roxygenise() (or roxygenize() if you prefer) and many other bug fixes and improvements. See the full list on the github release.

As always, you can install the latest version with install.packages("roxygen2").

We’re pleased to announce two upcoming in-person training opportunities:

  • Advanced R programming. SF, Dec 16-17. Learn the most important topics from advanced R programming in person. On day one, you’ll learn about metaprogramming, functional programming and object oriented programming in R, as well as general best practices for programming. Taught by Hadley Wickham, RStudio’s Chief Scientist.
  • Public workshop. Boston, Jan 27-28. In this two-day workshop, you’ll get a comprehensive introduction to R, and you’ll be visualising, manipulating and modeling data in no time. Taught by Garrett Grolemund, RStudio Master Instructor.

We have discounts available for students (66%) and academics (33%). Please contact Josh Paulson for details.


RStudio is an affiliated project of the Foundation for Open Access Statistics
