You are currently browsing the monthly archive for March 2016.

Wes McKinney, Software Engineer, Cloudera
Hadley Wickham, Chief Scientist, RStudio

This past January, we (Hadley and Wes) met and discussed some of the systems challenges facing the Python and R open source communities. In particular, we wanted to see if there were some opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems.

One thing that struck us was that while R’s data frames and Python’s pandas data frames utilize very different internal memory representations, they share a very similar semantic model. In both R and Panda’s, data frames are lists of named, equal-length columns, which can be numeric, boolean, and date-and-time, categorical (factors), or string. Every column can have missing values.

Around this time, the open source community had just started the new Apache Arrow project, designed to improve data interoperability for systems dealing with columnar tabular data.

In discussing Apache Arrow in the context of Python and R, we wanted to see if we could use the insights from feather to design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born.

What is Feather?

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

  • Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible
  • Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.
  • High read and write performance. When possible, Feather operations should be bound by local disk performance.

Code examples

The Feather API is designed to make reading and writing data frames as easy as possible. In R, the code might look like:

library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

Analogously, in Python, we have:

import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)

How fast is Feather?

Feather is extremely fast. Since Feather does not currently use any compression internally, it works best when used with solid-state drives as come with most of today’s laptop computers. For this first release, we prioritized a simple implementation and are thus writing unmodified Arrow memory to disk.

To give you an idea, here is a Python benchmark writing an approximately 800MB pandas DataFrame to disk:

import feather
import pandas as pd
import numpy as np
arr = np.random.randn(10000000) # 10% nulls
arr[::10] = np.nan
df = pd.DataFrame({'column_{0}'.format(i): arr for i in range(10)})
feather.write_dataframe(df, 'test.feather')

On Wes’s laptop (latest-gen Intel processor with SSD), this takes:

In [9]: %time df = feather.read_dataframe('test.feather')
CPU times: user 316 ms, sys: 944 ms, total: 1.26 s
Wall time: 1.26 s

In [11]: 800 / 1.26
Out[11]: 634.9206349206349

This is effective performance of over 600 MB/s. Of course, the performance you see will depend on your hardware configuration.

And in R (on Hadley’s laptop, which is very similar):

library(feather)

x <- runif(1e7)
x[sample(1e7, 1e6)] <- NA # 10% NAs
df <- as.data.frame(replicate(10, x))
write_feather(df, 'test.feather')

system.time(read_feather('test.feather'))
#>   user  system elapsed 
#>  0.731   0.287   1.020 

How can I get Feather?

The Feather source code is hosted at http://github.com/wesm/feather.

Installing Feather for R

Feather is currently available from github, and you can install with:

devtools::install_github("wesm/feather/R")

Feather uses C++11, so if you’re on windows, you’ll need the new gcc 4.93 toolchain. (All going well this will be included in R 3.3.0, which is scheduled for release on April 14. We’ll aim for a CRAN release soon after that).

Installing Feather for Python

For Python, you can install Feather from PyPI like so:

$ pip install feather-format

We will look into providing more installation options, such as conda builds, in the future.

What should you not use Feather for?

Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable between versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.

Feather, Apache Arrow, and the community

One of the great parts of Feather is that the file format is language agnostic. Other languages, such as Julia or Scala (for Spark users), can read and write the format without knowledge of details of Python or R.

Feather is one of the first projects to bring the tangible benefits of the Arrow spec to users in the form of an efficient, language-agnostic representation of tabular data on disk. Since Arrow does not provide for a file format, we are using Google’s Flatbuffers library (github.com/google/flatbuffers) to serialize column types and related metadata in a language-independent way in the file.

The Python interface uses Cython to expose Feather’s C++11 core to users, while the R interface uses Rcpp for the same task.

If you’re a data wrangler or data scientist, ODSC East in Boston from May 20-22 is a wonderful opportunity to get up-to-date on the latest open source tools and trends. R and RStudio will have a significant presence.

J.J. Allaire, RStudio founder and CEO, will talk about recent and upcoming improvements in R Markdown.

The creator of Shiny and CTO of RStudio, Joe Cheng, will review the progress made bridging modern web browsers and R, along with the newest updates to htmlwidgets and Shiny frameworks. In addition, Joe will join Zev Ross Spatial Analysis to offer a Shiny developer workshop for those interested in a deeper dive.

Other notable R speakers include Max Kuhn, the author of the Caret package for machine learning and Jared Lander, R contributor and author of R for Everyone.

For RStudio and R enthusiasts, ODSC has graciously offered discounted tickets.

We hope to see you there!

I’m pleased to announce tibble, a new package for manipulating and printing data frames in R. Tibbles are a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. The name comes from dplyr: originally you created these objects with tbl_df(), which was most easily pronounced as “tibble diff”.

Install tibble with:

install.packages("tibble")

This package extracts out the tbl_df class associated functions from dplyr. Kirill Müller extracted the code from dplyr, enhanced the tests, and added a few minor improvements.

Creating tibbles

You can create a tibble from an existing object with as_data_frame():

as_data_frame(iris)
#> Source: local data frame [150 x 5]
#> 
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           (dbl)       (dbl)        (dbl)       (dbl)  (fctr)
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> ..          ...         ...          ...         ...     ...

This works for data frames, lists, matrices, and tables.

You can also create a new tibble from individual vectors with data_frame():

data_frame(x = 1:5, y = 1, z = x ^ 2 + y)
#> Source: local data frame [5 x 3]
#> 
#>       x     y     z
#>   (int) (dbl) (dbl)
#> 1     1     1     2
#> 2     2     1     5
#> 3     3     1    10
#> 4     4     1    17
#> 5     5     1    26

data_frame() does much less than data.frame(): it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, and it never creates row.names(). You can read more about these features in the vignette, vignette("tibble").

You can define a tibble row-by-row with frame_data():

frame_data(
  ~x, ~y,  ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)
#> Source: local data frame [2 x 3]
#> 
#>       x     y     z
#>   (chr) (dbl) (dbl)
#> 1     a     2   3.6
#> 2     b     1   8.5

Tibbles vs data frames

There are two main differences in the usage of a data frame vs a tibble: printing, and subsetting.

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from str():

library(nycflights13)
flights
#> Source: local data frame [336,776 x 16]
#> 
#>     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
#>    (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
#> 1   2013     1     1      517         2      830        11      UA  N14228
#> 2   2013     1     1      533         4      850        20      UA  N24211
#> 3   2013     1     1      542         2      923        33      AA  N619AA
#> 4   2013     1     1      544        -1     1004       -18      B6  N804JB
#> 5   2013     1     1      554        -6      812       -25      DL  N668DN
#> 6   2013     1     1      554        -4      740        12      UA  N39463
#> 7   2013     1     1      555        -5      913        19      B6  N516JB
#> 8   2013     1     1      557        -3      709       -14      EV  N829AS
#> 9   2013     1     1      557        -3      838        -8      B6  N593JB
#> 10  2013     1     1      558        -2      753         8      AA  N3ALAA
#> ..   ...   ...   ...      ...       ...      ...       ...     ...     ...
#> Variables not shown: flight (int), origin (chr), dest (chr), air_time
#>   (dbl), distance (dbl), hour (dbl), minute (dbl).

Tibbles are strict about subsetting. If you try to access a variable that does not exist, you’ll get an error:

flights$yea
#> Error: Unknown column 'yea'

Tibbles also clearly delineate [ and [[: [ always returns another tibble, [[ always returns a vector. No more drop = FALSE!

class(iris[ , 1])
#> [1] "numeric"
class(iris[ , 1, drop = FALSE])
#> [1] "data.frame"
class(as_data_frame(iris)[ , 1])
#> [1] "tbl_df"     "tbl"        "data.frame"

Interacting with legacy code

A handful of functions are don’t work with tibbles because they expect df[, 1] to return a vector, not a data frame. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data frame:

class(as.data.frame(tbl_df(iris)))

The R Markdown package ships with a raft of output formats including HTML, PDF, MS Word, R package vignettes, as well as Beamer and HTML5 presentations. This isn’t the entire universe of available formats though (far from it!). R Markdown formats are fully extensible and as a result there are several R packages that provide additional formats. In this post we wanted to highlight a few of these packages, including:

  • tufte — Documents in the style of Edward Tufte
  • rticles — Formats for creating LaTeX based journal articles
  • rmdformats — Formats for creating HTML documents

We’ll also discuss how to create your own custom formats as well as re-usable document templates for existing formats.

Using Custom Formats

Custom R Markdown formats are just R functions which return a definition of the format’s behavior. For example, here’s the metadata for a document that uses the html_document format:

---
title: "My Document"
output: html_document
---

When rendering, R Markdown calls the rmarkdown::html_document function to get the definition of the output format. A custom format works just the same way but is also qualified with the name of the package that contains it. For example, here’s the metadata for a document that uses the tufte_handout format:

---
title: "My Document"
output: tufte::tufte_handout
---

Custom formats also typically register a template that helps you get started with using them. If you are using RStudio you can easily create a new document based on a custom format via the New R Markdown dialog:

Screen Shot 2016-03-21 at 11.16.04 AM

Tufte Handouts

The tufte package includes custom formats for creating documents in the style that Edward Tufte uses in his books and handouts. Tufte’s style is known for its extensive use of sidenotes, tight integration of graphics with text, and well-set typography. Formats for both LaTeX and HTML/CSS output are provided (these are in turn based on the work in tufte-latex and tufte-css). Here’s some example output from the LaTeX format:

If you want LaTeX/PDF output, you can use the tufte_handout format for handouts and tufte_book for books. For HTML output, you use the tufte_html format. For example:

---
title: "An Example Using the Tufte Style"
author: "John Smith"
output:
  tufte::tufte_handout: default
  tufte::tufte_html: default
---

You can install the tufte package from CRAN as follows:

install.packages("tufte")

See the tufte package website for additional documentation on using the Tufte custom formats.

Journal Articles

The rticles package provides a suite of custom R Markdown LaTeX formats and templates for various journal article formats, including:

Screen Shot 2016-03-21 at 11.48.40 AM

You can install the rticles package from CRAN as follows:

install.packages("rticles")

See the rticles repository for more details on using the formats included with the package. The source code of the rticles package is an excellent resource for learning how to create LaTeX based custom formats.

rmdformats Package

The rmdformats package from Julien Barnier includes three HTML based document formats that provide nice alternatives to the default html_document format that is included in the rmarkdown package. The readthedown format is inspired by the Read the docs Sphinx theme and is fully responsive, with collapsible navigation:

readthedown

The html_docco and html_clean formats both provide provide automatic thumbnails for figures with lightbox display, and html_clean provides an automatic and dynamic table of contents:

html_docco html_clean

You can install the rmdformats package from CRAN as follows:

install.packages("rmdformats")

See the rmdformats repository for documentation on using the readthedown, html_docco, and html_clean formats.

Creating New Formats

Hopefully checking out some of the custom formats described above has you inspired to create your very own new formats. The R Markdown website includes documentation on how to create a custom format. In addition, the source code of the tufte, rticles, and rmdformats packages provide good examples to work from.

Short of creating a brand new format, it’s also possible to create a re-usable document template that shows up within the RStudio New R Markdown dialog box. This would be appropriate if an existing template met your needs but you wanted to have an easy way to create documents with a pre-set list of options and skeletal content. See the article on document templates for additional details on how to do this.

A new release of the rmarkdown package is now available on CRAN. This release features some long-requested enhancements to the HTML document format, including:

  1. The ability to have a floating (i.e. always visible) table of contents.
  2. Folding and unfolding for R code (to easily show and hide code for either an entire document or for individual chunks).
  3. Support for presenting content within tabbed sections (e.g. several plots could each have their own tab).
  4. Five new themes including “lumen”, “paper”, “sandstone”, “simplex”, & “yeti”.

There are also three new formats for creating GitHub, OpenDocument, and RTF documents as well as a number of smaller enhancements and bug fixes (see the package NEWS for all of the details).

Floating TOC

You can specify the toc_float option to float the table of contents to the left of the main document content. The floating table of contents will always be visible even when the document is scrolled. For example:

---
title: &quot;Habits&quot;
output:
  html_document:
    toc: true
    toc_float: true
---

Here’s what the floating table of contents looks like on one of the R Markdown website’s pages:

FloatingTOC

Code Folding

When the knitr chunk option echo = TRUE is specified (the default behavior) the R source code within chunks is included within the rendered document. In some cases it may be appropriate to exclude code entirely (echo = FALSE) but in other cases you might want the code available but not visible by default.

The code_folding: hide option enables you to include R code but have it hidden by default. Users can then choose to show hidden R code chunks either indvidually or document wide. For example:

---
title: &quot;Habits&quot;
output:
  html_document:
    code_folding: hide
---

Here’s the default HTML document template with code folding enabled. Note that each chunk has it’s own toggle for showing or hiding code and there is also a global menu for operating on all chunks at once.

Screen Shot 2016-03-21 at 7.27.40 AM

Note that you can specify code_folding: show to still show all R code by default but then allow users to hide the code if they wish.

Tabbed Sections

You can organize content using tabs by applying the .tabset class attribute to headers within a document. This will cause all sub-headers of the header with the .tabset attribute to appear within tabs rather than as standalone sections. For example:

## Sales Report {.tabset}

### By Product

(tab content)

### By Region

(tab content)

Here’s what tabbed sections look like within a rendered document:

Screen Shot 2016-03-21 at 7.43.38 AM

Authoring Enhancements

We also shouldn’t fail to mention that the most recent release of RStudio included several enhancements to R Markdown document editing. There’s now an optional outline view that enables quick navigation across larger documents:

Screen Shot 2015-12-22 at 9.27.34 AM

We also also added inline UI to code chunks for running individual chunks, running all previous chunks, and specifying various commonly used knit options:

Screen Shot 2015-12-22 at 9.30.11 AM

What’s Next

We’ve got lots of additional work planned for R Markdown including new document formats, additional authoring enhancements in RStudio, and some new tools to make it easier to publish and manage documents created with R Markdown. More details to follow soon!

 

Support for building R projects on Travis has recently undergone improvements which we hope will make it an even better tool for the R community. Feature highlights include:

  • Support for Travis’ container-based infrastructure.
  • Package dependency caching (on the container-based builds).
  • Building with multiple R versions (R-devel, R-release (3.2.3) and R-oldrel (3.1.3)).
  • Log filtering to improve readability and hide less relevant information.
  • Updated dependencies TexLive (2015) and pandoc (1.15.2).

See the Travis documentation on building an R project for complete details on the available options.

Using the container-based infrastructure with package caching is now recommended for nearly all projects. There are more compute and network resources available for container based builds, which means they start processing in less time and run faster. The package caching makes package installation comparable or faster than using binary packages.

A minimal .travis.yml file that is suitable for most cases is

language: r
sudo: false
cache: packages

New packages can omit sudo: false, as it is the default for new repositories. However older repositories will have to explicitly set sudo: false to use the container based infrastructure.

If your package depends on development packages that are not on CRAN (such as GitHub) we recommend you use the Remotes: annotation in your package DESCRPITION file. This will allow your package and dependencies to be easily installed by devtools::install_github() as well as on Travis (Examples). It is generally no longer necessary to use r_github_packages, r_packages, r_binary_packages, etc. as this can be handled with Remotes.

If you need system dependencies, first check to see if they’re available with the apt-addon and include them in your .travis.yml. This will allow you to install them without sudo and still use the container based infrastructure.

addons:
  apt:
    packages:
      - libv8-dev

We hope these improvements will make your use of Travis with R simple and useful. Please file any issues found at https://github.com/travis-ci/travis-ci/issues and mention @craigcitro, @hadley and @jimhester in the issue.

I’m very pleased to announce the release of ggplot2 2.1.0, scales 0.4.0, and gtable 0.2.0. These are set of relatively minor updates that fix a whole bunch of little problems that crept in during the last big update. The most important changes are described below.

  1. When mapping an aesthetic to a constant the default guide title is the name of the aesthetic (i.e. “colour”), not the value (i.e. “loess”). This is a really handy technique for labelling individual layers:
    ggplot(mpg, aes(displ, 1 / hwy)) +
      geom_point() + 
      geom_smooth(method = lm, aes(colour = "lm"), se = FALSE) + 
      geom_smooth(aes(colour = "loess"), se = FALSE)

    unnamed-chunk-2-1

  2. stat_bin() (which powers geom_histogram() and geom_freqpoly()), has been overhauled to use the same algorithm as ggvis. This has considerably better parameters and defaults thanks to the work of Randall Pruim. Changes include:
    • Better arguments and a better algorithm for determining the origin. You can now specify either boundary (i.e. the position of the left or right side) or the center of a bin. origin has been deprecated in favour of these arguments.
    • drop is deprecated in favour of pad, which adds extra 0-count bins at either end, as is needed for frequency polygons. geom_histogram() defaults to pad = FALSE which considerably improves the default limits for the histogram, especially when the bins are big.
    • The default algorithm does a (somewhat) better job at picking nice widths and origins across a wider range of input data.

    You can see the impact of these changes on the following two histograms:

    ggplot(diamonds, aes(carat)) + 
      geom_histogram(binwidth = 1)    
    ggplot(diamonds, aes(carat)) + 
      geom_histogram(binwidth = 1, boundary = 0)

    unnamed-chunk-3-1

  3. All layer functions (geom_*() + stat_*()) functions now have a consistent argument order: data, mapping, then geom/stat/position, then ..., then layer specific arguments, then common layer arguments. This might break some code if you were relying on partial name matching, but in the long-term should make ggplot2 easier to use. In particular, you can now set the n parameter in geom_density2d() without it partially matching na.rm.
  4. For geoms with both colour and fill, alpha once again only affects fill. alpha was changed to modify both colour and fill in 2.0.0, but I’ve reverted it to the old behaviour because it was causing pain for quite a few people.

You can see a full list of changes in the release notes.