You are currently browsing the monthly archive for December 2014.

Give yourself the gift of “mastering” R to start 2015!

Join RStudio Chief Data Scientist Hadley Wickham at the Westin San Francisco on January 19 and 20 for this rare opportunity to learn from one of the R community’s most popular and innovative authors and package developers.

As of this post, the workshop is two-thirds sold out. If you’re in or near California and want to boost your R programming skills, this is Hadley’s only West Coast public workshop planned for 2015.

Register here: http://rstudio-sfbay.eventbrite.com/

Today we’re excited to announce htmlwidgets, a new framework that brings the best of JavaScript data visualization libraries to R. There are already several packages that take advantage of the framework (leaflet, dygraphs, networkD3, DataTables, and rthreejs) with hopefully many more to come.

An htmlwidget works just like an R plot except it produces an interactive web visualization. A line or two of R code is all it takes to produce a D3 graphic or Leaflet map. Widgets can be used at the R console as well as embedded in R Markdown reports and Shiny web applications. Here’s an example of using leaflet directly from the R console:

rconsole.2x

When printed at the console the leaflet widget displays in the RStudio Viewer pane. All of the tools typically available for plots are also available for widgets, including history, zooming, and export to file/clipboard (note that when not running within RStudio widgets will display in an external web browser).

Here’s the same widget in an R Markdown report. Widgets automatically print as HTML within R Markdown documents and even respect the default knitr figure width and height.

rmarkdown.2x

Widgets also provide Shiny output bindings so can be easily used within web applications. Here’s the same widget in a Shiny application:

shiny.2x

Bringing JavaScript to R

The htmlwidgets framework is a collaboration between Ramnath Vaidyanathan (rCharts), Kenton Russell (Timely Portfolio), and RStudio. We’ve all spent countless hours creating bindings between R and the web and were motivated to create a framework that made this as easy as possible for all R developers.

There are a plethora of libraries available that create attractive and fully interactive data visualizations for the web. However, the programming interface to these libraries is JavaScript, which places them outside the reach of nearly all statisticians and analysts. htmlwidgets makes it extremely straightforward to create an R interface for any JavaScript library.

Here are a few widget libraries that have been built so far:

  • leaflet, a library for creating dynamic maps that support panning and zooming, with various annotations like markers, polygons, and popups.
  • dygraphs, which provides rich facilities for charting time-series data and includes support for many interactive features including series/point highlighting, zooming, and panning.
  • networkD3, a library for creating D3 network graphs including force directed networks, Sankey diagrams, and Reingold-Tilford tree networks.
  • DataTables, which displays R matrices or data frames as interactive HTML tables that support filtering, pagination, and sorting.
  • rthreejs, which features 3D scatterplots and globes based on WebGL.

All of these libraries combine visualization with direct interactivity, enabling users to explore data dynamically. For example, time-series visualizations created with dygraphs allow dynamic panning and zooming:

NewHavenTemps

Learning More

To learn more about the framework and see a showcase of the available widgets in action check out the htmlwidgets web site. To learn more about building your own widgets, install the htmlwidgets package from CRAN and check out the developer documentation.

 

httr 0.6.0 is now available on CRAN. The httr packages makes it easy to talk to web APIs from R. Learn more in the quick start vignette.

This release is mostly bug fixes and minor improvements. The most important are:

  • handle_reset(), which allows you to reset the default handle if you get the error “easy handle already used in multi handle”.
  • write_stream() which lets you process the response from a server as a stream of raw vectors (#143).
  • VERB() allows to you send a request with a custom http verb.
  • brew_dr() checks for common problems. It currently checks if your libcurl uses NSS. This is unlikely to work so it gives you some advice on how to fix the problem (thanks to Dirk Eddelbuettel for debugging this problem and suggesting a remedy).
  • Added support for Google OAuth2 service accounts. (#119, thanks to help from @siddharthab). See ?oauth_service_token for details.

I’ve also switched from RC to R6 (which should make it easier to extend OAuth classes for non-standard OAuth implementations), and tweaked the use of the backend SSL certificate details bundled with httr. See the release notes for complete details.

tidyr 0.2.0 is now available on CRAN. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has variables in columns and observations in rows, and is described in more detail in the tidy data vignette. Install tidyr with:

install.packages("tidyr")

There are three important additions to tidyr 0.2.0:

  • expand() is a wrapper around expand.grid() that allows you to generate all possible combinations of two or more variables. In conjunction with dplyr::left_join(), this makes it easy to fill in missing rows of data.
    sales <- dplyr::data_frame(
      year = rep(c(2012, 2013), c(4, 2)),
      quarter = c(1, 2, 3, 4, 2, 3), 
      sales = sample(6) * 100
    )
    
    # Missing sales data for 2013 Q1 & Q4
    sales
    #> Source: local data frame [6 x 3]
    #> 
    #>   year quarter sales
    #> 1 2012       1   400
    #> 2 2012       2   200
    #> 3 2012       3   500
    #> 4 2012       4   600
    #> 5 2013       2   300
    #> 6 2013       3   100
    
    # Missing values are now explicit
    sales %>% 
      expand(year, quarter) %>%
      dplyr::left_join(sales)
    #> Joining by: c("year", "quarter")
    #> Source: local data frame [8 x 3]
    #> 
    #>   year quarter sales
    #> 1 2012       1   400
    #> 2 2012       2   200
    #> 3 2012       3   500
    #> 4 2012       4   600
    #> 5 2013       1    NA
    #> 6 2013       2   300
    #> 7 2013       3   100
    #> 8 2013       4    NA
  • In the process of data tidying, it’s sometimes useful to have a column of a data frame that is a list of vectors. unnest() lets you simplify that column back down to an atomic vector, duplicating the original rows as needed. (NB: If you’re working with data frames containing lists, I highly recommend using dplyr’s tbl_df, which will display list-columns in a way that makes their structure more clear. Use dplyr::data_frame() to create a data frame wrapped with the tbl_df class.)
    raw <- dplyr::data_frame(
      x = 1:3,
      y = c("a", "d,e,f", "g,h")
    )
    # y is character vector containing comma separated strings
    raw
    #> Source: local data frame [3 x 2]
    #> 
    #>   x     y
    #> 1 1     a
    #> 2 2 d,e,f
    #> 3 3   g,h
    
    # y is a list of character vectors
    as_list <- raw %>% mutate(y = strsplit(y, ","))
    as_list
    #> Source: local data frame [3 x 2]
    #> 
    #>   x        y
    #> 1 1 <chr[1]>
    #> 2 2 <chr[3]>
    #> 3 3 <chr[2]>
    
    # y is a character vector; rows are duplicated as needed
    as_list %>% unnest(y)
    #> Source: local data frame [6 x 2]
    #> 
    #>   x y
    #> 1 1 a
    #> 2 2 d
    #> 3 2 e
    #> 4 2 f
    #> 5 3 g
    #> 6 3 h
  • separate() has a new extra argument that allows you to control what happens if a column doesn’t always split into the same number of pieces.
    raw %>% separate(y, c("trt", "B"), ",")
    #> Error: Values not split into 2 pieces at 1, 2
    raw %>% separate(y, c("trt", "B"), ",", extra = "drop")
    #> Source: local data frame [3 x 3]
    #> 
    #>   x trt  B
    #> 1 1   a NA
    #> 2 2   d  e
    #> 3 3   g  h
    raw %>% separate(y, c("trt", "B"), ",", extra = "merge")
    #> Source: local data frame [3 x 3]
    #> 
    #>   x trt   B
    #> 1 1   a  NA
    #> 2 2   d e,f
    #> 3 3   g   h

To read about the other minor changes and bug fixes, please consult the release notes.

reshape2 1.4.1

There’s also a new version of reshape2, 1.4.1. It includes three bug fixes for melt.data.frame() contributed by Kevin Ushey. Read all about them on the release notes and install it with:

install.packages("reshape2")

We’ve teamed up with DataCamp to make a self-paced online course that teaches ggvis, the newest data visualization package by Hadley Wickham and Winston Chang. The ggvis course pairs challenging exercises, interactive feedback, and “to the point” videos to let you learn ggvis in a guided way.

In the course, you will learn how to make and customize graphics with ggvis. You’ll learn the commands and syntax that ggvis uses to build graphics, and you’ll learn the theory that underlies ggvis. ggvis implements the grammar of graphics, a logical method for building graphs that is easy to use and to extend. Finally, since this is ggvis, you’ll learn to make interactive graphics with sliders and other user controls.

The first part of the tutorial is available for free, so you can start learning immediately.

(Posted on behalf of Stefan Milton Bache)

Sometimes it’s the small things that make a big difference. For me, the introduction of our awkward looking friend, %>%, was one such little thing. I’d never suspected that it would have such an impact on the way quite a few people think and write R (including my own), or that pies would be baked (see here) and t-shirts printed (e.g. here) in honor of the successful three-char-long and slightly overweight operator. Of course a big part of the success is the very fruitful relationship with dplyr and its powerful verbs.

Quite some time went by without any changes to the CRAN version of magrittr. But many ideas have been evaluated and tested, and now we are happy to finally bring an update which brings both some optimization and a few nifty features — we hope that we have managed to strike a balance between simplicity and usefulness and that you will benefit from this update. You can install it now with:

install.packages("magrittr")

The underlying evaluation model is more coherent in this release; this makes the new features more natural extensions and improves performance somewhat. Below I’ll recap some of the important new features, which include functional sequences, a few specialized supplementary operators and better lambda syntax.

Functional sequences

The basic (pseudo) usage of the pipe operator goes something like this:

awesome_data <-
  raw_interesting_data %>%
  transform(somehow) %>%
  filter(the_good_parts) %>%
  finalize

This statement has three parts: an input, an output, and a sequence transformations. That’s suprisingly close to the definition of a function, so in magrittr is really just a convenient way of of defining and applying a function.
A new really useful feature of magrittr 1.5 makes that explicit: you can use %>% to not only produce values but also to produce functions (or functional sequences)! It’s really all the same, except sometimes the function is applied instantly and produces a result, and sometimes it is not, in which case the function itself is returned. In this case, there is no initial value, so we replace that with the dot placeholder. Here is how:

mae <- . %>% abs %>% mean(na.rm = TRUE)
mae(rnorm(10))
#> [1] 0.5605

That’s equivalent to:

mae <- function(x) {
  mean(abs(x), na.rm = TRUE)
}

Even for a short function, this is more compact, and is easier to read as it is defined linearly from left to right.
There are some really cool use cases for this: functionals! Consider how clean it is to pass a function to lapply or aggregate!

info <-
  files %>%
  lapply(. %>% read_file %>% extract(the_goodies))

Functions made this way can be indexed with [ to get a new function containing only a subset of the steps.

Lambda expressions

The new version makes it clearer that each step is really just a single-statement body of a unary function. What if we need a little more than one command to make a satisfactory “step” in a chain? Before, one might either define a function outside the chain, or even anonymously inside the chain, enclosing the entire definition in parentheses. Now extending that one command is like extending a standard one-command function: enclose whatever you’d like in braces, and that’s it:

value %>%
  foo %>% {
    x <- bar(.)
    y <- baz(.)
    x * y
  } %>%
  and_whatever

As usual, the name of the argument to that unary function is ..

Nested function calls

In this release the dot (.) will work also in nested function calls on the right-hand side, e.g.:

1:5 %>% 
  paste(letters[.])
#> [1] "1 a" "2 b" "3 c" "4 d" "5 e"

When you use . inside a function call, it’s used in addition to, not instead of, . at the top-level. For example, the previous command is equivalent to:

1:5 %>% 
  paste(., letters[.])
#> [1] "1 a" "2 b" "3 c" "4 d" "5 e"

If you don’t want this behaviour, wrap the function call in {:

1:5 %>% {
  paste(letters[.])
}
#> [1] "a" "b" "c" "d" "e"

A few of %>%’s friends

We also introduce a few operators. These are supplementary operators that just make some situations more comfortable.
The tee operator, %T>%, enables temporary branching in a pipeline to apply a few side-effect commands to the current value, like plotting or logging, and is inspired by the Unix tee command. The only difference to %>% is that %T>% returns the left-hand side rather than the result of applying the right-hand side:

value %>%
  transform %T>%
  plot %>%
  transform(even_more)

This is a shortcut for:

value %>%
  transform %>%
  { plot(.); . } %>%
  transform(even_more)

because plot() doesn’t normally return anything that can be piped along!
The exposition operator, %$%, is a wrapper around with(),
which makes it easy to refer to the variables inside a data frame:

mtcars %$%
  plot(mpg, wt)

Finally, we also have %<>%, the compound assignment pipe operator. This must be the first operator in the chain, and it will assign the result of the pipeline to the left-hand side name or expression. It’s purpose is to shorten expressions like this:

data$some_variable <-
  data$some_variable %>%
  transform

and turn them into something like this:

data$some_variable %<>%
  transform

Even a small example like x %<>% sort has its appeal!
In summary there is a few new things to get to know; but magrittr is like it always was. Just a little coolr!