
I’ve released four new data packages to CRAN: babynames, fueleconomy, nasaweather and nycflights13. The goal of these packages is to provide some interesting, and relatively large, datasets to demonstrate various data analysis challenges in R. The package source code (on github, linked above) is fully reproducible so that you can see some data tidying in action, or make your own modifications to the data.

Below, I’ve listed the primary dataset found in each package. Most packages also include a number of supplementary datasets that provide additional information. Check out the docs for more details.

  • babynames::babynames: US baby name data for each year from 1880 to 2013, giving the number of children of each sex given each name. All names used 5 or more times are included. 1,792,091 rows, 5 columns (year, sex, name, n, prop). (Source: Social Security Administration)
  • fueleconomy::vehicles: Fuel economy data for all cars sold in the US from 1984 to 2015. 33,442 rows, 12 variables. (Source: Environmental Protection Agency)
  • nasaweather::atmos: Data from the 2006 ASA data expo. Contains monthly atmospheric measurements from Jan 1995 to Dec 2000 on a 24 x 24 grid over Central America. 41,472 observations, 11 variables. (Source: ASA data expo)
  • nycflights13::flights: This package contains information about all flights that departed from NYC (i.e., EWR, JFK and LGA) in 2013: 336,776 flights with 16 variables. To help understand what causes delays, it also includes a number of other useful datasets: weather, planes, airports, airlines. (Source: Bureau of Transportation Statistics)

NB: since the datasets are large, I’ve tagged each data frame with the tbl_df class. If you don’t use dplyr, this has no effect. If you do use dplyr, this ensures that you won’t accidentally print thousands of rows of data. Instead, you’ll just see the first 10 rows and as many columns as will fit on screen. This makes interactive exploration much easier.

library(dplyr)
library(nycflights13)
flights
#> Source: local data frame [336,776 x 16]
#> 
#>    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1  2013     1   1      517         2      830        11      UA  N14228
#> 2  2013     1   1      533         4      850        20      UA  N24211
#> 3  2013     1   1      542         2      923        33      AA  N619AA
#> 4  2013     1   1      544        -1     1004       -18      B6  N804JB
#> 5  2013     1   1      554        -6      812       -25      DL  N668DN
#> 6  2013     1   1      554        -4      740        12      UA  N39463
#> 7  2013     1   1      555        -5      913        19      B6  N516JB
#> 8  2013     1   1      557        -3      709       -14      EV  N829AS
#> 9  2013     1   1      557        -3      838        -8      B6  N593JB
#> 10 2013     1   1      558        -2      753         8      AA  N3ALAA
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...
#> Variables not shown: flight (int), origin (chr), dest (chr), air_time
#>   (dbl), distance (dbl), hour (dbl), minute (dbl)
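
As a rough illustration of the kind of exploration these datasets support, the sketch below computes the mean departure delay per carrier and attaches the carrier names from the supplementary airlines table. It assumes the carrier and dep_delay columns shown above and that airlines has carrier and name columns; the object name carrier_delays is made up for this example.

```r
library(dplyr)
library(nycflights13)

# Mean departure delay per carrier, ignoring flights with no recorded delay
carrier_delays <- flights %>%
  group_by(carrier) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  # Attach the full airline name from the supplementary airlines table
  left_join(airlines, by = "carrier") %>%
  # Worst average delay first
  arrange(desc(mean_dep_delay))

carrier_delays
```

Because the result is small, it prints in full even without the tbl_df printing behaviour described above.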

We’re excited to announce a new release of Packrat, a tool for making R projects more isolated and reproducible by managing their package dependencies.

This release brings a number of exciting features to Packrat that significantly improve the user experience:

  • Automatic snapshots ensure that new packages installed in your project library are automatically tracked by Packrat.
  • Bundle and share your projects with packrat::bundle() and packrat::unbundle() — whether you want to freeze an analysis, or exchange it for collaboration with colleagues.
  • Packrat mode can now be turned on and off at will, allowing you to navigate between different Packrat projects in a single R session. Use packrat::on() to activate Packrat in the current directory, and packrat::off() to turn it off.
  • Local repositories (ie, directories containing R package sources) can now be specified for projects, allowing local source packages to be used in a Packrat project alongside CRAN, BioConductor and GitHub packages (see this and more with ?"packrat-options").
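
The functions named in the list above might be combined along these lines. This is only a sketch: share_project() is a hypothetical helper, it assumes the directory has already been set up as a Packrat project (for example with packrat::init()), and the file name in the final comment is a placeholder.

```r
# Hypothetical helper: snapshot a Packrat project and bundle it for sharing.
# Nothing runs until the function is called with a real project directory.
share_project <- function(project) {
  packrat::on(project)                     # switch this R session into Packrat mode
  packrat::snapshot(project)               # record the packages the project uses
  bundle_path <- packrat::bundle(project)  # create a shareable archive
  packrat::off(project)                    # leave Packrat mode again
  bundle_path
}

# A collaborator could then unpack the archive with something like:
# packrat::unbundle("myproject.tar.gz", where = "~/projects")
```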

In addition, Packrat is now tightly integrated with the RStudio IDE, making it easier to manage project dependencies than ever. Download today’s RStudio IDE 0.98.978 release and try it out!

[Screenshot: Packrat RStudio package pane integration]

You can install the latest version of Packrat from GitHub with:

    devtools::install_github("rstudio/packrat")

Packrat will be coming to CRAN soon as well.

If you try it, we’d love to get your feedback. Leave a comment here or post in the packrat-discuss Google group.

 

tidyr is a new package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages). The two most important properties of tidy data are:

  • Each column is a variable.
  • Each row is an observation.

Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When you use tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.

To tidy messy data, you first identify the variables in your dataset, then use the tools provided by tidyr to move them into columns. tidyr provides three main functions for tidying your messy data: gather(), separate() and spread().

gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer. Other names for gather include melt (reshape2), pivot (spreadsheets) and fold (databases). Here’s an example of how you might use gather() on a made-up dataset. In this experiment we’ve given three people two different drugs and recorded their heart rate:

library(tidyr)
library(dplyr)

messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)
messy
#>      name  a  b
#> 1  Wilbur 67 56
#> 2 Petunia 80 90
#> 3 Gregory 64 50

We have three variables (name, drug and heartrate), but only name is currently in a column. We use gather() to gather the a and b columns into key-value pairs of drug and heartrate:

messy %>%
  gather(drug, heartrate, a:b)
#>      name drug heartrate
#> 1  Wilbur    a        67
#> 2 Petunia    a        80
#> 3 Gregory    a        64
#> 4  Wilbur    b        56
#> 5 Petunia    b        90
#> 6 Gregory    b        50

Sometimes two variables are clumped together in one column. separate() allows you to tease them apart (extract() works similarly but uses regexp groups instead of a splitting pattern or position). Take this example from stackoverflow (modified slightly for brevity). We have some measurements of how much time people spend on their phones, measured at two locations (work and home), at two times. Each person has been randomly assigned to either treatment or control.

set.seed(10)
messy <- data.frame(
  id = 1:4,
  trt = sample(rep(c('control', 'treatment'), each = 2)),
  work.T1 = runif(4),
  home.T1 = runif(4),
  work.T2 = runif(4),
  home.T2 = runif(4)
)

To tidy this data, we first use gather() to turn columns work.T1, home.T1, work.T2 and home.T2 into a key-value pair of key and time. (Only the first eight rows are shown to save space.)

tidier <- messy %>%
  gather(key, time, -id, -trt)
tidier %>% head(8)
#>   id       trt     key    time
#> 1  1 treatment work.T1 0.08514
#> 2  2   control work.T1 0.22544
#> 3  3 treatment work.T1 0.27453
#> 4  4   control work.T1 0.27231
#> 5  1 treatment home.T1 0.61583
#> 6  2   control home.T1 0.42967
#> 7  3 treatment home.T1 0.65166
#> 8  4   control home.T1 0.56774

Next we use separate() to split the key into location and time, using a regular expression to describe the character that separates them.

tidy <- tidier %>%
  separate(key, into = c("location", "time"), sep = "\\.") 
tidy %>% head(8)
#>   id       trt location time    time
#> 1  1 treatment     work   T1 0.08514
#> 2  2   control     work   T1 0.22544
#> 3  3 treatment     work   T1 0.27453
#> 4  4   control     work   T1 0.27231
#> 5  1 treatment     home   T1 0.61583
#> 6  2   control     home   T1 0.42967
#> 7  3 treatment     home   T1 0.65166
#> 8  4   control     home   T1 0.56774
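
For comparison, here is a minimal sketch of the extract() alternative mentioned earlier, which captures pieces of the key with regular expression groups instead of splitting on a separator. The data frame tidier_demo and its values are made up for illustration, and the second new column is called period here only to avoid repeating an existing column name.

```r
library(tidyr)

# Made-up long-format data shaped like `tidier` above
tidier_demo <- data.frame(
  id = c(1, 2, 1, 2),
  key = c("work.T1", "work.T1", "home.T2", "home.T2"),
  minutes = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Each parenthesised group in the regex becomes one new column;
# the original key column is dropped by default
extracted <- tidyr::extract(
  tidier_demo, key,
  into = c("location", "period"),
  regex = "^([a-z]+)\\.(T[0-9]+)$"
)
extracted
```

Calling tidyr::extract() with its namespace avoids a clash with magrittr::extract() if magrittr is also loaded.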

The last tool, spread(), takes two columns (a key-value pair) and spreads them into multiple columns, making “long” data wider. Spread is known by other names in other places: it’s cast in reshape2, unpivot in spreadsheets and unfold in databases. spread() is used when you have variables that form rows instead of columns. You need spread() less frequently than gather() or separate(), so to learn more, check out the documentation and the demos.
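
As a small sketch of spread() in action, the code below starts from a long version of the heart-rate data used in the gather() example and spreads the drug/heartrate pair back into one column per drug. The object names long and wide are chosen for this example only.

```r
library(tidyr)

# Long-format heart-rate data: one row per person and drug
long <- data.frame(
  name = rep(c("Wilbur", "Petunia", "Gregory"), times = 2),
  drug = rep(c("a", "b"), each = 3),
  heartrate = c(67, 80, 64, 56, 90, 50),
  stringsAsFactors = FALSE
)

# The values of `drug` become column names; `heartrate` fills the cells
wide <- spread(long, drug, heartrate)
wide
```

The result has one row per person and the columns name, a and b, which is the shape the gather() example started from.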

Just as reshape2 did less than reshape, tidyr does less than reshape2. It’s designed specifically for tidying data, not general reshaping. In particular, existing methods only work for data frames, and tidyr never aggregates. This makes each function in tidyr simpler: each function does one thing well. For more complicated operations you can string together multiple simple tidyr and dplyr functions with %>%.

You can learn more about the underlying principles in my tidy data paper. To see more examples of data tidying, read the vignette, vignette("tidy-data"), or check out the demos, demo(package = "tidyr"). Alternatively, check out some of the great stackoverflow answers that use tidyr. Keep up-to-date with development at http://github.com/hadley/tidyr, report bugs at http://github.com/hadley/tidyr/issues and get help with data manipulation challenges at https://groups.google.com/group/manipulatr. If you ask a question specifically about tidyr on stackoverflow, please tag it with tidyr and I’ll make sure to read it.

The RStudio team recently rolled out new capabilities in RStudio, shiny, ggvis, dplyr, knitr, R Markdown, and packrat. The “Essential Tools for Data Science with R” free webinar series is the perfect place to learn more about the power of these R packages from the authors themselves.

Click to learn more and register for one or more webinar sessions. You must register for each separately. If you miss a live webinar or want to review them, recorded versions will be available to registrants within 30 days.

The Grammar and Graphics of Data Science
Live! Wednesday, July 30 at 11am Eastern Time US  Click to register

  • dplyr: a grammar of data manipulation – Hadley Wickham
  • ggvis: Interactive graphics in R – Winston Chang

Reproducible Reporting 
Live! Wednesday, August 13 at 11am Eastern Time US  Click to register

  • The Next Generation of R Markdown – Jeff Allen
  • Knitr Ninja – Yihui Xie
  • Packrat – A Dependency Management System for R – J.J. Allaire & Kevin Ushey

Interactive Reporting
Live! Wednesday, September 3 at 11am Eastern Time US  Click to register

  • Embedding Shiny Apps in R Markdown documents – Garrett Grolemund
  • Shiny: R made interactive – Joe Cheng

 

Our first public release of ggvis, version 0.3, is now available on CRAN. What is ggvis? It’s a new package for data visualization. Like ggplot2, it is built on concepts from the grammar of graphics, but it also adds interactivity, a new data pipeline, and it renders in a web browser. Our goal is to make an interface that’s flexible, so that you can compose new kinds of visualizations, yet simple, so that it’s accessible to all R users.
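
To give a feel for the syntax, here is a minimal sketch of a ggvis scatterplot. It uses the built-in mtcars data as a stand-in; the object name scatter is made up for this example.

```r
library(ggvis)

# Build (but do not yet display) a scatterplot of weight against fuel economy
scatter <- mtcars %>%
  ggvis(x = ~wt, y = ~mpg) %>%
  layer_points()

# Printing the object renders it in a web browser or the RStudio viewer:
# scatter
```

The ~ in front of wt and mpg marks them as variables in the data rather than constant values.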

Update: there was an issue affecting interactive plots in version 0.3. Version 0.3.0.1 fixes the issue. The updated source package is now on CRAN, and Windows and Mac binary packages will be available shortly.

ggvis integrates with Shiny, so you can use dynamic, interactive ggvis graphics in Shiny applications. We hope that the combination of ggvis and Shiny will make it easy for you to create applications for interactive data exploration and presentation. ggvis plots are inherently reactive and they render in the browser, so they can take advantage of the capabilities provided by modern web browsers. You can use Shiny’s interactive components for interactivity as well as more direct forms of interaction with the plot, such as hovering, clicking, and brushing.

ggvis works seamlessly with R Markdown v2 and interactive documents, so you can easily add interactive graphics to your R Markdown documents:

[Screenshots: a Shiny document containing a ggvis plot; a ggvis density plot]

And don’t worry — ggvis isn’t only meant to be used with Shiny and interactive documents. Because the RStudio IDE is also a web browser, ggvis plots can display in the IDE, like any other R graphics:

[Screenshot: ggvis in RStudio IDE]

There’s much more to come with ggvis. To learn more, visit the ggvis website.

Please note that ggvis is still young, and lacks a number of important features from ggplot2. But we’re working hard on ggvis and expect many improvements in the months to come.

Shiny 0.10 is now available on CRAN.

Interactive documents

In this release, the biggest changes were under the hood to support the creation of interactive documents. If you haven’t had a chance to check out interactive documents, we really encourage you to do so—it may be the easiest way to learn Shiny.

New layout functions

Three new functions—flowLayout(), splitLayout(), and inputPanel()—were added for putting UI elements side by side.

  • flowLayout() lays out its children in a left-to-right, top-to-bottom arrangement.
  • splitLayout() evenly divides its horizontal space among its children (or divides it unevenly if the cellWidths argument is provided).
  • inputPanel() is like flowLayout(), but with a light grey background, and is intended for encapsulating small input controls wherever vertical space is at a premium.

A new logical argument inline was also added to checkboxGroupInput() and radioButtons() to arrange check boxes and radio buttons horizontally.
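
The sketch below shows one way the three layout functions and the inline argument might be used together. All input and output IDs (rows, title, dist, hist, vars) are made up for illustration, and the example only builds the UI objects without starting an app.

```r
library(shiny)

# flowLayout(): children flow left to right, wrapping onto new rows
flow <- flowLayout(
  numericInput("rows", "Rows", value = 5),
  textInput("title", "Title")
)

# splitLayout(): two cells taking 25% and 75% of the available width
split_row <- splitLayout(
  cellWidths = c("25%", "75%"),
  radioButtons("dist", "Distribution",
               c("Normal" = "norm", "Uniform" = "unif"), inline = TRUE),
  plotOutput("hist")
)

# inputPanel(): a flow layout on a light grey background
panel <- inputPanel(
  checkboxGroupInput("vars", "Variables", c("mpg", "wt", "hp"), inline = TRUE)
)

ui <- fluidPage(flow, split_row, panel)
```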

Custom validation error messages

Sometimes you don’t want your reactive expressions or output renderers in server.R to proceed unless certain input conditions are satisfied, e.g. a select input value has been chosen, or a sensible combination of inputs has been provided. In these cases, you might want to stop the render function quietly, or you might want to give the user a custom message. In shiny 0.10.0, we introduced the functions validate() and need() which you can use to enforce validation conditions. This won’t be the last word on input validation in Shiny, but it should be a lot safer and more convenient than how most of us have been doing it.

See the article Write error messages for your UI with validate for details and examples.
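
As a rough sketch of how validate() and need() fit together, the helper below stops with a custom message unless a data set name has been chosen. The helper name require_choice, the input ID dataset and the output ID preview are all assumptions made for this example, not part of Shiny.

```r
library(shiny)

# Hypothetical helper: stop with a custom message unless a name was chosen
require_choice <- function(choice) {
  validate(
    need(is.character(choice) && nzchar(choice), "Please select a data set")
  )
  invisible(choice)
}

server <- function(input, output) {
  output$preview <- renderTable({
    # Rendering stops here, showing the message, until a data set is picked
    require_choice(input$dataset)
    head(get(input$dataset, "package:datasets"))
  })
}
```

When the condition passed to need() fails, validate() signals an error carrying the message, so the rest of the render function never runs.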

Server-side processing for Selectize input

In the previous release of Shiny, we added support for Selectize, a powerful select box widget. At that time, our implementation passed all of the data to the web page and used JavaScript to do any paging, filtering, and sorting. It worked great for small numbers of items but didn’t scale well beyond a few thousand items.

For Shiny 0.10, we greatly improved the performance of our existing client-side Selectize binding, but also added a new mode that allows the paging, filtering, and sorting to all happen on the server. Only the results that are actually displayed are downloaded to the client. This approach works well for hundreds of thousands or millions of rows.

For more details and examples, see the article Using selectize input on shiny.rstudio.com.
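
A minimal sketch of the server-side mode might look like the following. The input ID item and the vector many_choices are made up for illustration; the key piece is the server = TRUE argument to updateSelectizeInput().

```r
library(shiny)

# Made-up choices: enough items that sending them all to the browser would be slow
many_choices <- paste("item", seq_len(100000))

ui <- fluidPage(
  # Start with no choices; they are supplied from the server
  selectizeInput("item", "Pick an item", choices = NULL)
)

server <- function(input, output, session) {
  # server = TRUE keeps the choices on the server and sends only matching results
  updateSelectizeInput(session, "item", choices = many_choices, server = TRUE)
}

# shinyApp(ui, server)  # uncomment to run interactively
```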

htmltools

We also split off Shiny’s HTML generating library (tags and friends) into a separate htmltools package. If you’re writing a package that needs to generate HTML programmatically, it’s far easier and safer to use htmltools than to paste HTML strings together yourself. We’ll have more to share about htmltools in the months to come.
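
To illustrate the safety point, this small sketch compares pasting a string into HTML by hand with building the same element through tags. The variable names (user_text, pasted, built) are made up for this example.

```r
library(htmltools)

# Text that happens to contain HTML special characters
user_text <- "5 < 10 & <b>not bold</b>"

# Pasting strings leaves the markup in user_text unescaped
pasted <- paste0("<p class=\"note\">", user_text, "</p>")

# tags$p() escapes the text so that it is displayed literally
built <- as.character(tags$p(class = "note", user_text))

pasted
built
```

In the pasted version the browser would interpret the b element, whereas the tags version turns the special characters into entities.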

Other changes

  • New actionLink() input control: behaves like actionButton() but looks like a link
  • renderPlot() now calls print() on its result if it’s visible–no more explicit print() required for ggplot2
  • Sliders and select boxes now use a fixed horizontal size instead of filling up all available horizontal space; pass width="100%" if you need the old behavior
  • The session object that can be passed into a server function is now documented: see ?session
  • New reactive domains feature makes it easy to get callbacks when the current session ends, without having to pass session everywhere
  • Thanks to reactive domains, by default, observers now automatically stop executing when the Shiny session that created them ends
  • shinyUI and shinyServer

For the full list, you can take a look at the NEWS file. Please let us know if you have any comments or questions.

People rarely agree on a best authoring tool or language. Some people cannot live without \LaTeX{} because of the beauty and quality of its PDF output. Some \feel{} \uncomfortable{} \with{} \backslashes{}, and would rather live in another World Word. We have also witnessed the popularity of Markdown, an incredibly simple language (seriously? a LANGUAGE?) that has made reproducible research much easier.

Thinking of all these tools and languages, every developer will dream about “One ring to rule them all“. \section{}, <h1></h1>, ===, #, … Why cannot we write the first-level section header in a single way? Yes, we are aware of the danger of “adding yet another so-called universal standard that covers all the previous standards”. However, we believe Pandoc has done a fairly good job in terms of “yet another Markdown standard”. Standing on the shoulders of Pandoc, today we are excited to announce the second episode of our journey into the development of the tools for authoring dynamic documents:

The Return of R Markdown!

The R package markdown (plus knitr) was our first version of R Markdown. The primary output format was HTML, which certainly could not satisfy all users in the World Word. It did not have features like citations, footnotes, or metadata (title, author, and date, etc), either. When we were asked how one could convert Markdown to PDF/Word, we used to tell users to try Pandoc. The problem is that Pandoc’s great power comes with a lot of command line options (more than 70), and knitr has the same problem of too many options. That is why we created the second generation of R Markdown, represented by the rmarkdown package, to provide reasonably good defaults and an R-friendly interface to customize Pandoc options.

The new version of RStudio (v0.98.932) includes everything you need to use R Markdown v2 (including pandoc and the rmarkdown package). If you are not using RStudio you can install rmarkdown and pandoc separately as described here. To get started with a “Hello Word” example, simply click the menu File -> New File -> R Markdown in RStudio IDE. You can choose the output format from the drop-down menu on the toolbar.

R Markdown Formats

The built-in output formats include HTML, LaTeX/PDF, Word, Beamer slides, HTML5 presentations, and so on. Pandoc’s Markdown allows us to write richer content such as tables, citations, and footnotes. For power users who understand LaTeX/HTML, you can even embed raw LaTeX/HTML code in Markdown, and Pandoc is smart enough to process these raw fragments. If you cannot remember the possible options for a certain output format in the YAML metadata (data between --- and --- in the beginning of a document), you can use the Settings button on the toolbar.

Extensive documentation for R Markdown v2 and all of its supported output formats is available on the new R Markdown website at http://rmarkdown.rstudio.com.

We understand users will never be satisfied by our default templates, regardless of how hard we try to make them appealing. The rmarkdown package is fully customizable and extensible in the sense that you can define your custom templates and output formats. You want to contribute an article to The R Journal, or JSS (Journal of Statistical Software), but prefer writing in Markdown instead of LaTeX? No problem! Pandoc also supports many other output formats, and you want EPUB books, or a different type of HTML5 slides? No problem! Not satisfied with one single static output document? You can embed interactive widgets into R Markdown documents as well! Let there be Shiny! The more you learn about rmarkdown and Pandoc, the more freedom you will get.

For a brief video introduction, you may watch the talk below (jump to 18:30 if you only want to see the demos):

The rmarkdown package is open-source (GPL-3) and is both included in the RStudio IDE and available on GitHub. The package is not on CRAN yet, but will be there as soon as we make all the improvements requested by early users.

To clarify the relationship between rmarkdown and RStudio IDE, our IDE is absolutely not the only way to compile R Markdown documents. You are free to call functions in rmarkdown in any environment. Please check out the R package documentation, in particular, the render() function in rmarkdown.
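
For example, a document could be compiled from a plain R session roughly as follows. This is a sketch under a few assumptions: the document content and the names rmd_file and out_file are made up, and render() is only called if Pandoc can be found on the system.

```r
library(rmarkdown)

# Write a tiny R Markdown document to a temporary file
fence <- strrep("`", 3)  # three backticks marking an R code chunk
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Hello Word\"",
  "output: html_document",
  "---",
  "",
  "A summary of the built-in cars data:",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), rmd_file)

# render() needs Pandoc, so only call it when Pandoc is available
if (pandoc_available()) {
  out_file <- render(rmd_file, output_format = "html_document", quiet = TRUE)
  print(out_file)
}
```

Changing output_format to another built-in format, such as "word_document", would produce a different file type from the same source.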

Please let us know if you have any questions or comments, and your feedback is greatly appreciated. We hope you will enjoy R Markdown v2.

Keep Calm and Markdown

I’m very excited to announce dplyr 0.2. It has three big features:

  • improved piping courtesy of the magrittr package

  • a vastly more useful implementation of do()

  • five new verbs: sample_n(), sample_frac(), summarise_each(), mutate_each() and glimpse().

These features are described in more detail below. To learn more about the 35 new minor improvements and bug fixes, please read the full release notes.

Improved piping

dplyr now imports %>% from the magrittr package by Stefan Milton Bache. I recommend that you use this instead of %.% because it is easier to type (since you can hold down the shift key) and is more flexible. With %>%, you can control which argument on the RHS receives the LHS by using the pronoun . (a single dot). This makes %>% more useful with base R functions because they don’t always take the data frame as the first argument. For example, you could pipe mtcars to xtabs() with:

mtcars %>% xtabs( ~ cyl + vs, data = .)

dplyr only exports %>% from magrittr, but magrittr contains many other useful functions. To use them, load magrittr explicitly with library(magrittr). For more details, see vignette("magrittr").
%.% will be deprecated in a future version of dplyr, but it won’t happen for a while. I’ve deprecated chain() to encourage a single style of dplyr usage: please use %>% instead.

Do

do() has been completely overhauled, and group_by() + do() is now equivalent in power to plyr::dlply(). There are two ways to use do(): either with multiple named arguments or with a single unnamed argument. If you use named arguments, each argument becomes a list-variable in the output. A list-variable can contain any arbitrary R object, which makes this form of do() useful for storing models:

library(dplyr)
# Fit one linear model per cylinder group; each fit is stored in the list-variable model
models <- mtcars %>% group_by(cyl) %>% do(model = lm(mpg ~ wt, data = .))
models %>% summarise(rsq = summary(model)$r.squared)

If you use an unnamed argument, the result should be a data frame. This allows you to apply arbitrary functions to each group.

mtcars %>% group_by(cyl) %>% do(head(., 1))

Note the use of the pronoun . to refer to the data in the current group.
do() also has an automatic progress bar. It appears if the computation takes longer than 2 seconds and estimates how long the job will take to complete.

New verbs

sample_n() randomly samples a fixed number of rows from a tbl; sample_frac() randomly samples a fixed fraction of rows. They currently only work for local data frames and data tables.
summarise_each() and mutate_each() make it easy to apply one or more functions to multiple columns in a tbl. These work for all srcs that summarise() and mutate() work for.
glimpse() makes it possible to see all the columns in a tbl, displaying as much data for each variable as can be fit on a single line.
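
Here is a short sketch of three of the new verbs, using the built-in mtcars data as a stand-in. It leaves out summarise_each() and mutate_each(), whose calling conventions depend on the dplyr version installed.

```r
library(dplyr)

set.seed(1)  # make the random samples repeatable

five_rows <- sample_n(mtcars, 5)        # exactly 5 rows
quarter   <- sample_frac(mtcars, 0.25)  # 25% of the 32 rows, i.e. 8 rows

# One line per column, showing as many values as fit
glimpse(mtcars)
```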

We’re pleased to announce a new version of roxygen2. Roxygen2 allows you to write documentation comments that are automatically converted to R’s standard Rd format, saving you time and reducing duplication. This release is a major update that provides enhanced error handling and considerably safer default behaviour. Roxygen2 now adds a comment to all generated files so that you know they shouldn’t be edited by hand. This also ensures that roxygen2 will never overwrite a file that it did not create, and can automatically remove files that are no longer needed.

I’ve also written some vignettes to help you understand how to use roxygen2. Six new vignettes provide a comprehensive overview of using roxygen2 in practice. Run browseVignettes("roxygen2") to read them. In an effort to make roxygen2 easier to use and more consistent between package authors, I’ve made parsing considerably stricter, and made sure that all errors give you the line number of the associated roxygen block. Every input is now checked to make sure that its braces match (e.g. every { has a matching }). This should prevent frustrating errors that require careful reading of .Rd files. Similarly, @section titles and @export tags can now only span a single line, as this prevents a number of common bugs.

Other features include two new tags @describeIn and @field, and you can document objects (like datasets) by documenting their name as a string. For example, to document a dataset called mydata, you can do:

#' Mydata set
#'
#' Some data I collected about myself
"mydata"

For a complete list of bug fixes and improvements, please see the release notes for roxygen2 4.0.0. Roxygen2 4.0.1 fixed a couple of minor bugs and greatly improved the upgrade process.

reshape2 1.4 is now available on CRAN. This version adds a number of useful arguments and messages, but most importantly it gains a C++ implementation of melt.data.frame(). This new method should be much, much faster (>10x) and does a better job of preserving existing attributes. For full details, see the release notes on GitHub.
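
For readers who have not used melt() on a data frame, here is a minimal sketch. The data are made up, and the object names wide and long are chosen for this example only.

```r
library(reshape2)

# Made-up wide data: one column per drug
wide <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)

# Stack the a and b columns into drug/heartrate pairs, keeping name as the id
long <- melt(wide, id.vars = "name",
             variable.name = "drug", value.name = "heartrate")
long
```

Calling melt() on a data frame dispatches to melt.data.frame(), so this is the code path that the new C++ implementation speeds up.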

The C++ implementation of melt was contributed by Kevin Ushey, who we’re very pleased to announce has joined RStudio. You may be familiar with Kevin from his contributions to Rcpp, or his CRAN packages Kmisc and timeit.


RStudio is an affiliated project of the Foundation for Open Access Statistics
