“Master” R in Washington DC this September!

Join RStudio Chief Data Scientist Hadley Wickham at the AMA – Executive Conference Center in Arlington, VA on September 14 and 15, 2015 for this rare opportunity to learn from one of the R community’s most popular and innovative authors and package developers.

It will be at least another year before Hadley returns to teach his class on the East Coast, so don’t miss this opportunity to learn from him in person. The venue is conveniently located next to Ronald Reagan Washington National Airport and a short distance from the Metro. Attendance is limited. Past events have sold out.

Register today!

testthat 0.10.0 is now available on CRAN. Testthat makes it easy to turn the informal testing that you’re already doing into formal automated tests. Learn more at http://r-pkgs.had.co.nz/tests.html. Install the latest version with:

install.packages("testthat")

There are four big changes in this release:

  • test_check() uses a new reporter specifically designed for R CMD check. It displays a summary at the end of the tests, designed to be <13 lines long so test failures in R CMD check display are as useful as possible.
  • New skip_if_not_installed() skips tests if a package isn’t installed: this is useful if you want tests to skip if a suggested package isn’t installed.
  • The expect_that(a, equals(b)) style of testing has been soft-deprecated in favour of expect_equal(a, b). It will keep working, but it’s no longer demonstrated in the documentation, and new expectations will only be available in the expect_equal(a, b) style.
  • compare() is now documented and exported: compare() is used to display test failures for expect_equal(), and is designed to help you spot exactly where the failure occurred. It currently has methods for character and numeric vectors.
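
Together, the new-style expectations and skips look something like this in a test file (a sketch; the test names and the use of ggplot2 as the suggested package are illustrative):

```r
library(testthat)

test_that("basic arithmetic works", {
  expect_equal(1 + 1, 2)
})

test_that("ggplot2-dependent behaviour", {
  # Skipped (not failed) when the suggested package is absent
  skip_if_not_installed("ggplot2")
  expect_true(nrow(ggplot2::mpg) > 0)
})
```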

There were a number of other minor improvements and bug fixes. See the release notes for a complete list.

This is a guest post by Vincent Warmerdam of koaning.io.

SparkR preview in RStudio

Apache Spark is the hip new technology on the block. It allows you to write scripts in a functional style, and the technology behind it will allow you to run iterative tasks very quickly on a cluster of machines. It’s benchmarked to be quicker than Hadoop for most machine learning use cases (by a factor of 10 to 100), and soon Spark will also have support for the R language. As of April 2015, SparkR has been merged into Apache Spark and will ship with the upcoming 1.4 release, due early summer 2015. In the meantime, you can use this tutorial to get familiar with the current version of SparkR.

**Disclaimer**: although you can use this tutorial to write Spark jobs with R right now, the new API due this summer will most likely have breaking changes.

Running Spark Locally

The main feature of Spark is the resilient distributed dataset, which is a dataset that can be queried in memory, in parallel on a cluster of machines. You don’t need a cluster of machines to get started with Spark though. Even on a single machine, Spark is able to efficiently use any configured resources. To keep it simple we will ignore this configuration for now and do a quick one-click install. You can use devtools to download and install Spark with SparkR.

library(devtools)
install_github("amplab-extras/SparkR-pkg", subdir = "pkg")

This might take a while. But after the installation, the following R code will run Spark jobs for you:


library(SparkR)
library(magrittr)

sc <- sparkR.init(master = "local")

sc %>% 
  parallelize(1:100000) %>%
  count

This small program generates a list, gives it to Spark (which turns it into an RDD, Spark’s Resilient Distributed Dataset structure) and then counts the number of items in it. SparkR exposes the RDD API of Spark as distributed lists in R, which plays very nicely with magrittr. As long as you follow the API, you don’t need to worry much about parallelizing for performance for your programs.

A more elaborate example

Spark also allows for grouped operations, which might remind you a bit of dplyr.

nums = runif(100000) * 10

sc %>% 
  parallelize(nums) %>% 
  map(function(x) round(x)) %>%
  filterRDD(function(x) x %% 2) %>% 
  map(function(x) list(x, 1)) %>%
  reduceByKey(function(x, y) x + y, 1L) %>% 
  collect

The Spark API will look very ‘functional’ to programmers used to functional programming languages (which should come as no surprise if you know that Spark is written in Scala). This script will do the following:

  1. it will create an RDD Spark object from the original data
  2. it will map each number to a rounded number
  3. it will filter all even numbers out of the RDD
  4. next it will create key/value pairs that can be counted
  5. it then reduces the key value pairs (the 1L is the number of partitions for the resulting RDD)
  6. and it collects the results

Spark will have started running services locally on your computer, which can be viewed at http://localhost:4040/stages/. You should be able to see all the jobs you’ve run here. You will also see which jobs have failed with the error log.

Bootstrapping with Spark

These examples are nice, but you can also use the power of Spark for more common data science tasks. Let’s sample a dataset to generate a large RDD, which we will then summarise via bootstrapping. Instead of parallelizing numbers, I will now parallelize dataframe samples.

sc <- sparkR.init(master="local")

sample_cw <- function(n, s){
  set.seed(s)
  ChickWeight[sample(nrow(ChickWeight), n), ]
}

data_rdd <- sc %>%
  parallelize(1:200, 20) %>% 
  map(function(s) sample_cw(250, s))

With the parallelize function we can specify the number of partitions Spark can use for the resulting RDD. My s argument ensures that each partition will use a different random seed when sampling. This data_rdd is useful because it can be reused for multiple purposes.

You can use it to estimate the mean of the weight.

data_rdd %>% 
  map(function(x) mean(x$weight)) %>% 
  collect %>% 
  as.numeric %>% 
  hist(20, main="mean weight, bootstrap samples")

Or you can use it to perform bootstrapped regressions.

train_lm <- function(data_in){
  lm(data=data_in, weight ~ Time)
}

coef_rdd <- data_rdd %>% 
  map(train_lm) %>% 
  map(function(x) x$coefficients) 

get_coef <- function(k) { 
  coef_rdd %>%  
    map(function(x) x[k]) %>% 
    collect %>%
    as.numeric
}

df <- data.frame(intercept = get_coef(1), time_coef = get_coef(2))
df$intercept %>% hist(breaks = 30, main="beta coef for intercept")
df$time_coef %>% hist(breaks = 30, main="beta coef for time")

The slow part of this chain of operations is the creation of the data, because it has to happen locally through R. A more common use case for Spark would be to load a large dataset from S3 into a large EC2 cluster of Spark machines.

More power?

Running Spark locally is nice and already allows for parallelism, but the real gains come from running Spark on a cluster of computers. The nice thing is that Spark comes with a script that automates the provisioning of a Spark cluster on Amazon AWS.

To get a cluster started, start up an EC2 cluster with the supplied ec2 folder from the Apache Spark GitHub repo. A more elaborate tutorial can be found here, but if you are already an Amazon user, provisioning a cluster on Amazon is as simple as calling a one-liner:

./spark-ec2 \
--key-pair=spark-df \
--identity-file=/path/spark-df.pem \
--region=eu-west-1 \
-s 3 \
--instance-type c3.2xlarge \
launch my-spark-cluster

If you SSH into the master node that has just been set up, you can run the following code:

cd /root
git clone https://github.com/amplab-extras/SparkR-pkg.git
cd SparkR-pkg
SPARK_VERSION=1.2.1 ./install-dev.sh
cp -a /root/SparkR-pkg/lib/SparkR /usr/share/R/library/
/root/spark-ec2/copy-dir /root/SparkR-pkg
/root/spark/sbin/slaves.sh cp -a /root/SparkR-pkg/lib/SparkR /usr/share/R/library/

Launch SparkR on a Cluster

Finally to launch SparkR and connect to the Spark EC2 cluster, we run the following code on the master machine:

MASTER=spark://<master-node-hostname>:7077 ./sparkR

The hostname can be retrieved using:

cat /root/spark-ec2/cluster-url

You can check on the status of your cluster via Spark’s Web UI at http://<master-node-hostname>:8080.

The future

Everything described in this document is subject to change with the next Spark release, but it should help you get familiar with how Spark works. There will be R support for Spark: less so for low-level RDD operations, but more so for its distributed machine learning algorithms as well as DataFrame objects.

The support for R in the Spark universe might be a game changer. R has always been great on doing exploratory and interactive analysis on small to medium datasets. With the addition of Spark, R can become a more viable tool for big datasets.

June is the currently planned release date for Spark 1.4, which will allow R users to run data frame operations in parallel on the distributed memory of a cluster of computers. All of this is completely open source.

It will be interesting to see what possibilities this brings for the R community.

We’re pleased to announce that the final version of RStudio v0.99 is available for download now. Highlights of the release include:

  • A new data viewer with support for large datasets, filtering, searching, and sorting.
  • Complete overhaul of R code completion with many new features and capabilities.
  • The source editor now provides code diagnostics (errors, warnings, etc.) as you work.
  • User customizable code snippets for automating common editing tasks.
  • Tools for Rcpp: completion, diagnostics, code navigation, find usages, and automatic indentation.
  • Many additional source editor improvements including multiple cursors, tab re-ordering, and several new themes.
  • An enhanced Vim mode with visual block selection, macros, marks, and a subset of : commands.

There are also lots of smaller improvements and bug fixes across the product. Check out the v0.99 release notes for details on all of the changes.

Data Viewer

We’ve completely overhauled the data viewer with many new capabilities including live update, sorting and filtering, full text searching, and no row limit on viewed datasets.


See the data viewer documentation for more details.

Code Completion

Previously RStudio only completed variables that already existed in the global environment. Now completion is based on source code analysis, so it is provided even for objects that haven’t been fully evaluated:


Completions are also provided for a wide variety of specialized contexts including dimension names in [ and [[:


Code Diagnostics

We’ve added a new inline code diagnostics feature that highlights various issues in your R code as you edit.

For example, here we’re getting a diagnostic that notes that there is an extra parenthesis:


Here the diagnostic indicates that we’ve forgotten a comma within a shiny UI definition:


A wide variety of diagnostics are supported, including optional diagnostics for code style issues (e.g. the inclusion of unnecessary whitespace). Diagnostics are also available for several other languages including C/C++, JavaScript, HTML, and CSS. See the code diagnostics documentation for additional details.

Code Snippets

Code snippets are text macros that are used for quickly inserting common snippets of code. For example, the fun snippet inserts an R function definition:

Insert Snippet

If you select the snippet from the completion list it will be inserted along with several text placeholders which you can fill in by typing and then pressing Tab to advance to the next placeholder:

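
For example, accepting the fun snippet inserts a skeleton along these lines, with name, variables, and the function body as the successive placeholders (a sketch of the inserted text, not an exact transcript):

```r
name <- function(variables) {
  
}
```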

Other useful snippets include:

  • lib, req, and source for the library, require, and source functions
  • df and mat for defining data frames and matrices
  • if, el, and ei for conditional expressions
  • apply, lapply, sapply, etc. for the apply family of functions
  • sc, sm, and sg for defining S4 classes/methods.

See the code snippets documentation for additional details.

Try it Out

RStudio v0.99 is available for download now. We hope you enjoy the new release and as always please let us know how it’s working and what else we can do to make the product better.

Join RStudio Chief Data Scientist Hadley Wickham at the University of Illinois at Chicago, on Wednesday & Thursday, May 27th & 28th, for this rare opportunity to learn from one of the R community’s most popular and innovative authors and package developers.

As of this post, the workshop is two-thirds sold out. If you’re in or near Chicago and want to boost your R programming skills, this is Hadley’s only Central US public workshop planned for 2015.

Register here: https://rstudio-chicago.eventbrite.com

Devtools 1.8 is now available on CRAN. Devtools makes it so easy to build a package that it becomes your default way to organise code, data and documentation. You can learn more about developing packages at http://r-pkgs.had.co.nz/.

Get the latest version of devtools with:

install.packages("devtools")

There are three main improvements:

  • More helpers to get you up and running with package development as quickly as possible.

  • Better tools for package installation (including checking that all dependencies are up to date).

  • Improved reverse dependency checking for CRAN packages.

There were many other minor improvements and bug fixes. See the release notes for a complete list of changes. The last release announcement was for devtools 1.6 since there weren’t many big changes in devtools 1.7. I’ve included the most important points in this announcement, labelled with [1.7].

Helpers

The number of functions designed to get you up and going with package development continues to grow. This version sees the addition of:

  • dr_devtools(), which runs some common diagnostics: are you using the latest version of R and devtools? Similarly, dr_github() checks for common Git/GitHub configuration problems.

  • lint() runs lintr::lint_package() to check the style of package code [1.7].

  • use_code_of_conduct() adds a contributor code of conduct from http://contributor-covenant.org.

  • use_cran_badge() adds a CRAN status badge that you can copy into a README file. Green indicates the package is on CRAN; packages not yet submitted or accepted to CRAN get a red badge.

  • use_cran_comments() creates a cran-comments.md template and adds it to .Rbuildignore to help with CRAN submissions. [1.7]

  • use_coveralls() allows you to easily add test coverage with coveralls.

  • use_git() sets up a package to use git, initialising the repo and checking the existing files.

  • use_test() adds a new test file in tests/testthat.

  • use_readme_rmd() sets up a template to generate a README.md from a README.Rmd with knitr. [1.7]
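
A typical session for a new package might chain several of these helpers together (a sketch; the test file name is illustrative, and each helper is run from the package’s root directory):

```r
library(devtools)

dr_devtools()            # diagnose R/devtools version problems
use_git()                # initialise a git repo for the package
use_test("my-function")  # create a test file in tests/testthat
use_cran_badge()         # CRAN status badge to copy into the README
use_readme_rmd()         # README.Rmd template, knit to README.md
```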

Package installation and info

When developing packages, it’s common to run into problems because you’ve updated a package but forgotten to update its dependencies (install.packages() doesn’t do this automatically). The new package_deps() solves this problem by finding all recursive dependencies of a package and determining whether they’re out of date:

# Find out which dependencies are out of date
package_deps("devtools")
# Update them
update(package_deps("devtools"))

This code is used in install_deps() and revdep_check() – devtools is now aggressive about updating packages, which should avoid potential problems in CRAN submissions.
The new update_packages() uses these tools to install a package (and its dependencies) only if they’re not already installed and current.

Reverse dependency checking

Devtools 1.7 included considerable improvements to reverse dependency checking. This sort of checking is important if your package gets popular and is used by other CRAN packages. Before submitting updates to CRAN, you need to make sure that you have not broken the CRAN packages that use your package. Read more about it in the R packages book. To get started, run use_revdep(), then run the code in revdep/check.R.
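
In code, the workflow sketched above amounts to (assuming your working directory is the package root):

```r
library(devtools)

use_revdep()    # one-time setup: creates the revdep/ infrastructure
revdep_check()  # check all CRAN packages that depend on yours
```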

We’ve blogged previously about various improvements we’ve made to the source editor in RStudio v0.99 including enhanced code completion, snippets, diagnostics, and an improved Vim mode. Besides these larger scale features we’ve made lots of smaller improvements that we also wanted to highlight. You can try out all of these features now in the RStudio v0.99 preview release.

Multiple Cursors

You can now create and use multiple cursors within RStudio. Multiple cursors can be created in a variety of ways:

  • Press Ctrl + Alt + {Up/Down} to create a new cursor in the pressed direction,
  • Press Ctrl + Alt + Shift + {Direction} to move a second cursor in the specified direction,
  • Use Alt and drag with the mouse to create a rectangular selection,
  • Use Alt + Shift and click to create a rectangular selection from the current cursor position to the clicked position.

RStudio also makes use of multiple cursors in its Find / Replace toolbar now. After entering a search term, if you press the All button, all items matching that search term are selected.


You can then begin typing to replace each match with a new term—each matched entry will be updated as you type.

Rearrangeable Tabs

You can (finally!) move tabs around in the Source pane by clicking and dragging. In the example below, the file ‘file_4.R’ is currently selected and being dragged into place.


New, Improved Editor Themes

A number of new editor themes have been added to RStudio, and older editor themes have been tweaked to ensure that brackets are given a distinct color from text for better legibility.


Select / Expand Selection

You can use Ctrl + Shift + E to select everything within the nearest pair of opening and closing brackets, or use Ctrl + Alt + Shift + E to expand the selection up to the next closing bracket.


Fuzzy Navigation

You can use CTRL + . to quickly navigate between files and symbols within a project. Previously, this search utility performed prefix matching, and so it was difficult to use with long file / symbol names. Now, the CTRL + . navigator uses fuzzy matching to narrow the candidate set down based on subsequence matching, which makes it easier to navigate when many files share a common prefix—for example, to test- files for a project managing its tests with testthat.


Insert Roxygen Skeleton

RStudio now provides a means for inserting a Roxygen documentation skeleton above functions. The skeleton generator is smart enough to understand plain R functions, as well as S4 generics, methods and classes—it will automatically fill in documentation for available parameters and slots.


 More Languages

We’ve also added syntax highlighting modes for many new languages including Clojure, CoffeeScript, C#, Graphviz, Go, Groovy, Haskell, Java, Julia, Lisp, Lua, Matlab, Perl, Ruby, Rust, Scala, and Stan. There’s also some basic keyword and text based code completion for several languages including JavaScript, HTML, CSS, Python, and SQL.

Try it Out

You can try out all of the new editor features by downloading the latest preview release of RStudio. As always, let us know how the new features are working as well as what else you’d like to see us do.

I’m very excited to announce the 1.0.0 release of the stringr package. If you haven’t heard of stringr before, it makes string manipulation easier by:

  • Using consistent function and argument names: all functions start with str_, and the first argument is always the input string. This makes stringr easier to learn and easy to use with the pipe.
  • Eliminating options that you don’t need 95% of the time.

To get started with stringr, check out the new vignette.

What’s new?

The biggest change in this release is that stringr is now powered by the stringi package instead of base R. This has two big benefits: stringr is now much faster, and has much better Unicode support.

If you’ve used stringi before, you might wonder why stringr is still necessary: stringi does everything that stringr does, and much much more. There are two reasons that I think stringr is still important:

  1. Lots of people use it already, so this update will give many people a performance boost for free.
  2. The smaller API of stringr makes it a little easier to learn.

That said, once you’ve learned stringr, using stringi should be easy, so it’s a great place to start if you need a tool that doesn’t exist in stringr.

New features and functions

  • str_replace_all() gains a convenient syntax for applying multiple pairs of pattern and replacement to the same vector:
    x <- c("abc", "def")
    str_replace_all(x, c("[ad]" = "!", "[cf]" = "?"))
    #> [1] "!b?" "!e?"
  • str_subset() keeps values that match a pattern:
    x <- c("abc", "def", "jhi", "klm", "nop")
    str_subset(x, "[aeiou]")
    #> [1] "abc" "def" "jhi" "nop"
  • str_order() and str_sort() sort and order strings in a specified locale. str_conv() converts strings from a specified encoding to UTF-8.
    # The vowels come before the consonants in Hawaiian
    str_sort(letters[1:10], locale = "haw")
    #>  [1] "a" "e" "i" "b" "c" "d" "f" "g" "h" "j"
  • New modifier boundary() allows you to count, locate and split by character, word, line and sentence boundaries.
    words <- c("These are   some words. Some more words.")
    str_count(words, boundary("word"))
    #> [1] 7
    str_split(words, boundary("word"))
    #> [[1]]
    #> [1] "These" "are"   "some"  "words" "Some"  "more"  "words"

There were two minor changes to make stringr a little more consistent:

  • str_c() now returns a zero length vector if any of its inputs are zero length vectors. This is consistent with all other functions, and standard R recycling rules. Similarly, using str_c("x", NA) now yields NA. If you want "xNA", use str_replace_na() on the inputs.
  • str_match() now returns NA if an optional group doesn’t match (previously it returned “”). This is more consistent with str_extract() and other match failures.
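
Concretely, the new recycling and NA behaviour of str_c() looks like this (a small sketch; the inputs are illustrative):

```r
library(stringr)

str_c("x", character())        # zero-length input gives a zero-length output
str_c("x", NA)                 # NA
str_c("x", str_replace_na(NA)) # "xNA", if that is what you wanted
```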


Stringr is over five years old and is quite stable (the last release was over two years ago). Although I’ve endeavoured to make the change to stringi as seamless as possible, it’s likely that it has created some new bugs. If you have problems, please try the development version, and if that doesn’t help, file an issue on GitHub.

Soon after the announcement of htmlwidgets, Rich Iannone released the DiagrammeR package, which makes it easy to generate graph and flowchart diagrams using text in a Markdown-like syntax. The package is very flexible and powerful, and includes:

  1. Rendering of Graphviz graph visualizations (via viz.js)
  2. Creating diagrams and flowcharts using mermaid.js
  3. Facilities for mapping R objects into graphs, diagrams, and flowcharts.

We’re very excited about the prospect of creating sophisticated diagrams using an easy to author plain-text syntax, and built some special authoring support for DiagrammeR into RStudio v0.99 (which you can download a preview release of now).

Graphviz Meets R

If you aren’t familiar with Graphviz, it’s a tool for rendering DOT (a plain text graph description language). DOT draws directed graphs as hierarchies. Its features include well-tuned layout algorithms for placing nodes and edge splines, edge labels, “record” shapes with “ports” for drawing data structures, and cluster layouts (see http://www.graphviz.org/pdf/dotguide.pdf for an introductory guide).

DiagrammeR can render any DOT script. For example, with the following source file (“boxes.dot”):


You can render the diagram with:
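
A minimal sketch, assuming the file sits in the working directory and using DiagrammeR’s grViz() function (which accepts either a path to a DOT file or DOT code as a string):

```r
library(DiagrammeR)

# Render the Graphviz DOT source file as an htmlwidget
grViz("boxes.dot")
```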



Since the diagram is an htmlwidget it can be used at the R console, within R Markdown documents, and within Shiny applications. Within RStudio you can preview a Graphviz or mermaid source file the same way you source an R script via the Preview button or the Ctrl+Shift+Enter keyboard shortcut.

This simple example only scratches the surface of what’s possible; see the DiagrammeR Graphviz documentation for more details and examples.

Diagrams with mermaid.js

Support for mermaid.js in DiagrammeR enables you to create several other diagram types not supported by Graphviz. For example, here’s the code required to create a sequence diagram:


You can render the diagram with:
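
A minimal sketch using DiagrammeR’s mermaid() function (the diagram text below is illustrative, not the example from the original post):

```r
library(DiagrammeR)

# A small mermaid.js sequence diagram, defined as plain text
spec <- "
sequenceDiagram
  customer->>ticket seller: ask for a ticket
  ticket seller-->>customer: here is your ticket
"
mermaid(spec)
```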



See the DiagrammeR mermaid.js documentation for additional details.

Generating Diagrams from R Code

Both of the examples above illustrate creating diagrams by directly editing DOT and mermaid scripts. The latest version of DiagrammeR (v0.6, just released to CRAN) also includes facilities for generating diagrams from R code. This can be done in a couple of ways:

  1. Using text substitution, whereby you create placeholders within the diagram script and substitute their values from R objects. See the documentation on Graphviz Substitution for more details.
  2. Using the graphviz_graph function you can specify nodes and edges directly using a data frame.

Future versions of DiagrammeR are expected to include additional features to support direct generation of diagrams from R.

Publishing with DiagrammeR

Diagrams created with DiagrammeR act a lot like R plots; however, there’s an important difference: they are rendered as HTML content rather than using an R graphics device. This has the following implications for how they can be published and re-used:

  1. Within RStudio you can save diagrams as an image (PNG, BMP, etc.) or copy them to clipboard for re-use in other applications.
  2. For a more reproducible workflow, diagrams can be embedded within R Markdown documents just like plots (all of the required HTML and JS is automatically included). Note that because the diagrams depend on HTML and JavaScript for rendering they can only be used in HTML based output formats (they don’t work in PDFs or MS Word documents).
  3. From within RStudio you can also publish diagrams to RPubs or save them as standalone web pages.


See the DiagrammeR documentation on I/O for additional details.

Try it Out

To get started with DiagrammeR check out the excellent collection of demos and documentation on the project website. To take advantage of the new RStudio features that support DiagrammeR you should download the latest RStudio v0.99 Preview Release.




In RStudio v0.99 we’ve made a major investment in R source code analysis. This work resulted in significant improvements in code completion, and in the latest preview release it enables a new inline code diagnostics feature that highlights various issues in your R code as you edit.

For example, here we’re getting a diagnostic that notes that there is an extra parenthesis:


Here the diagnostic indicates that we’ve forgotten a comma within a shiny UI definition:


This diagnostic flags an unknown parameter to a function call:


This diagnostic indicates that we’ve referenced a variable that doesn’t exist and suggests a fix based on another variable in scope:


A wide variety of diagnostics are supported, including optional diagnostics for code style issues (e.g. the inclusion of unnecessary whitespace). Diagnostics are also available for several other languages including C/C++, JavaScript, HTML, and CSS.

Configuring Diagnostics

By default, code in the current source file is checked whenever it is saved, as well as if the keyboard is idle for a period of time. You can tweak this behavior using the Code -> Diagnostics options:


Note that several of the available diagnostics are disabled by default. This is because we’re in the process of refining their behavior to eliminate “false positives” where correct code is flagged as having a problem. We’ll continue to improve these diagnostics and enable them by default when we feel they are ready.

Trying it Out

You can try out the new code diagnostics by downloading the latest preview release of RStudio. This feature is a work in progress and we’re particularly interested in feedback on how well it works. Please also let us know if there are common coding problems which you think we should add new diagnostics for. We hope you try out the preview and let us know how we can make it better.


