You are currently browsing the category archive for the ‘Training’ category.

RStudio will again teach the new essentials for doing (big) data science in R at this year’s Strata NYC conference, September 29, 2015. You will learn from Garrett Grolemund, Yihui Xie, and Nathan Stephens, who are all working on fascinating new ways to keep the R ecosystem apace of the challenges facing those who work with data.

Topics include:

  • R Quickstart: Wrangle, transform, and visualize data
    Instructor: Garrett Grolemund (90 minutes)
  • Work with Big Data in R
    Instructor: Nathan Stephens (90 minutes)
  • Reproducible Reports with Big Data
    Instructor: Yihui Xie (90 minutes)
  • Interactive Shiny Applications built on Big Data
    Instructor: Garrett Grolemund (90 minutes)

If you plan to stay for the full Strata Conference+Hadoop World be sure to look us up at booth 633 during the Expo Hall hours. We’ll have the latest books from RStudio authors and “shiny” t-shirts to win. Share with us what you’re doing with RStudio and get your product and company questions answered by RStudio employees.

See you in New York City!

The articles section of the Shiny Dev Center has lots of great advice for Shiny developers.

A recent article by Dean Attali demonstrates how to save data from a Shiny app to persistent storage structures, like local files, servers, databases, and more. When you do this, your data remains after the app has closed, which opens new doors for data collection and analysis.
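Dean’s article walks through several storage back ends; as a rough illustration of the simplest one, local files, a small app could append each submission to a CSV with a helper like the hypothetical save_response() below (a minimal sketch, not Dean’s code):

library(shiny)

# Hypothetical helper: append one submission to a local CSV file
save_response <- function(response, file = "responses.csv") {
  write.table(response, file,
              append = file.exists(file), sep = ",",
              row.names = FALSE, col.names = !file.exists(file))
}

ui <- fluidPage(
  textInput("name", "Your name"),
  actionButton("submit", "Submit")
)

server <- function(input, output) {
  observeEvent(input$submit, {
    save_response(data.frame(name = input$name, time = Sys.time()))
  })
}

shinyApp(ui, server)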


Read Dean’s article and more at the Shiny Dev Center.

We’ve added two new tools that make it even easier to learn Shiny.

Video tutorial


The How to Start with Shiny training video provides a new way to teach yourself Shiny. The video covers everything you need to know to build your own Shiny apps. You’ll learn:

  • The architecture of a Shiny app
  • A template for making apps quickly (see the sketch after this list)
  • The basics of building Shiny apps
  • How to add sliders, drop down menus, buttons, and more to your apps
  • How to share Shiny apps
  • How to control reactions in your apps to
    • update displays
    • trigger code
    • reduce computation
    • delay reactions
  • How to add design elements to your apps
  • How to customize the layout of an app
  • How to style your apps with CSS
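
To give a flavor of what the video builds toward, here is a minimal single-file Shiny app with a slider input and a reactive plot (my own sketch of the common template, not code taken from the video):

library(shiny)

# ui: what the app looks like
ui <- fluidPage(
  titlePanel("Hello Shiny"),
  sliderInput("n", "Number of observations:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# server: how the app responds to input
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste(input$n, "random draws"))
  })
}

shinyApp(ui = ui, server = server)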

Altogether, the video contains two hours and 25 minutes of material organized around a navigable table of contents.

Best of all, the video tutorial is completely free. The video is the result of our recent How to Start Shiny webinar series. Thank you to everyone who attended and made the series a success!

Watch the new video tutorial here.

New cheat sheet

The new Shiny cheat sheet provides an up-to-date reference to the most important Shiny functions.


The cheat sheet replaces the previous cheat sheet, adding new sections on single-file apps, reactivity, CSS and more. The new sheet also gave us a chance to apply some of the things we’ve learned about making cheat sheets since the original Shiny cheat sheet came out.

Get the new Shiny cheat sheet here.

“Master” R in Washington DC this September!

Join RStudio Chief Data Scientist Hadley Wickham at the AMA Executive Conference Center in Arlington, VA on September 14 and 15, 2015, for this rare opportunity to learn from one of the R community’s most popular and innovative authors and package developers.

It will be at least another year before Hadley returns to teach his class on the East Coast, so don’t miss this opportunity to learn from him in person. The venue is conveniently located next to Ronald Reagan Washington National Airport and a short distance from the Metro. Attendance is limited. Past events have sold out.

Register today!

This is a guest post by Vincent Warmerdam.

SparkR preview in RStudio

Apache Spark is the hip new technology on the block. It allows you to write scripts in a functional style, and the technology behind it lets you run iterative tasks very quickly on a cluster of machines. It has been benchmarked to be quicker than Hadoop for most machine learning use cases (by a factor of 10 to 100), and soon Spark will also have support for the R language. As of April 2015, SparkR has been merged into Apache Spark and will ship with the upcoming 1.4 release, due early summer 2015. In the meantime, you can use this tutorial to get familiar with the current version of SparkR.

Disclaimer: although you can use this tutorial to write Spark jobs with R right now, the new API due this summer will most likely introduce breaking changes.

Running Spark Locally

The main feature of Spark is the resilient distributed dataset (RDD), a dataset that can be queried in memory, in parallel, on a cluster of machines. You don’t need a cluster of machines to get started with Spark, though. Even on a single machine, Spark is able to efficiently use whatever resources you configure. To keep it simple, we will ignore this configuration for now and do a quick one-line install. You can use devtools to download and install Spark with SparkR.

library(devtools)
install_github("amplab-extras/SparkR-pkg", subdir="pkg")

This might take a while. But after the installation, the following R code will run Spark jobs for you:


library(SparkR)
library(magrittr)

sc <- sparkR.init(master="local")

sc %>%
  parallelize(1:100000) %>%
  count

This small program generates a list, gives it to Spark (which turns it into an RDD, Spark’s Resilient Distributed Dataset structure) and then counts the number of items in it. SparkR exposes the RDD API of Spark as distributed lists in R, which plays very nicely with magrittr. As long as you follow the API, you don’t need to worry much about parallelizing your programs for performance.

A more elaborate example

Spark also allows for grouped operations, which might remind you a bit of dplyr.

nums = runif(100000) * 10

sc %>%
  parallelize(nums) %>%
  map(function(x) round(x)) %>%
  filterRDD(function(x) x %% 2) %>%
  map(function(x) list(x, 1)) %>%
  reduceByKey(function(x,y) x + y, 1L) %>%
  collect

The Spark API will look very ‘functional’ to programmers used to functional programming languages (which should come as no surprise if you know that Spark is written in Scala). This script will do the following:

  1. it will create a Spark RDD object from the original data
  2. it will map each number to a rounded number
  3. it will filter all even numbers out of the RDD
  4. next it will create key/value pairs that can be counted
  5. it then reduces the key/value pairs (the 1L is the number of partitions for the resulting RDD)
  6. and it collects the results
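
For readers who know dplyr, roughly the same computation on a local, in-memory data frame looks like this (an illustrative aside using dplyr’s count(), not part of the original SparkR example):

library(dplyr)

# count how often each odd rounded value occurs, locally and in memory
data.frame(x = nums) %>%
  mutate(x = round(x)) %>%
  filter(x %% 2 == 1) %>%
  count(x)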

Spark will have started running services locally on your computer, which can be viewed at http://localhost:4040/stages/. You should be able to see all the jobs you’ve run there, including any jobs that failed, along with their error logs.

Bootstrapping with Spark

These examples are nice, but you can also use the power of Spark for more common data science tasks. Let’s sample a dataset to generate a large RDD, which we will then summarise via bootstrapping. Instead of parallelizing numbers, I will now parallelize dataframe samples.

sc <- sparkR.init(master="local")

sample_cw <- function(n, s){
  set.seed(s)
  ChickWeight[sample(nrow(ChickWeight), n), ]
}

data_rdd <- sc %>%
  parallelize(1:200, 20) %>%
  map(function(s) sample_cw(250, s))

For the parallelize function we can set the number of partitions Spark can use for the resulting RDD (20 in this case). My s argument ensures that each partition uses a different random seed when sampling. This data_rdd is useful because it can be reused for multiple purposes.

You can use it to estimate the mean of the weight.

data_rdd %>% 
  map(function(x) mean(x$weight)) %>% 
  collect %>% 
  as.numeric %>% 
  hist(20, main="mean weight, bootstrap samples")

Or you can use it to perform bootstrapped regressions.

train_lm <- function(data_in){
  lm(weight ~ Time, data = data_in)
}

coef_rdd <- data_rdd %>%
  map(train_lm) %>%
  map(function(x) x$coefficients)

get_coef <- function(k) {
  coef_rdd %>%
    map(function(x) x[k]) %>%
    collect %>%
    as.numeric
}

df <- data.frame(intercept = get_coef(1), time_coef = get_coef(2))
df$intercept %>% hist(breaks = 30, main="beta coef for intercept")
df$time_coef %>% hist(breaks = 30, main="beta coef for time")

The slow part of this chain of operations is the creation of the data, because it has to happen locally in R. A more common use case for Spark would be to load a large dataset from S3 into a large EC2 cluster of Spark machines.

More power?

Running Spark locally is nice and already allows for parallelism, but the real payoff comes from running Spark on a cluster of computers. Conveniently, Spark comes with a script that automates the provisioning of a Spark cluster on Amazon AWS.

To get a cluster started, use the spark-ec2 script in the ec2 folder of Apache Spark’s GitHub repo. A more elaborate tutorial can be found in the Spark documentation, but if you are already an Amazon user, provisioning a cluster is as simple as calling a one-liner:

./spark-ec2 \
--key-pair=spark-df \
--identity-file=/path/spark-df.pem \
--region=eu-west-1 \
-s 3 \
--instance-type c3.2xlarge \
launch my-spark-cluster

If you ssh into the master node that has just been set up, you can run the following code:

cd /root
git clone https://github.com/amplab-extras/SparkR-pkg.git
cd SparkR-pkg
./install-dev.sh                    # build the SparkR package
cp -a /root/SparkR-pkg/lib/SparkR /usr/share/R/library/
/root/spark-ec2/copy-dir /root/SparkR-pkg
/root/spark/sbin/slaves.sh cp -a /root/SparkR-pkg/lib/SparkR /usr/share/R/library/

Launch SparkR on a Cluster

Finally, to launch SparkR and connect to the Spark EC2 cluster, we run the following command on the master machine:

MASTER=spark://<master-hostname>:7077 ./sparkR

The hostname can be retrieved using:

cat /root/spark-ec2/cluster-url

You can check on the status of your cluster via Spark’s web UI at http://<master-hostname>:8080.

The future

Everything described in this document is subject to change with the next Spark release, but it should help you get familiar with how Spark works. There will be R support in Spark, less so for low-level RDD operations and more so for its distributed machine learning algorithms as well as DataFrame objects.

The support for R in the Spark universe might be a game changer. R has always been great at exploratory and interactive analysis of small to medium datasets. With the addition of Spark, R can become a more viable tool for big datasets.

June is the currently planned release date for Spark 1.4, which will allow R users to run data frame operations in parallel on the distributed memory of a cluster of computers. All of this is completely open source.

It will be interesting to see what possibilities this brings for the R community.

Join RStudio Chief Data Scientist Hadley Wickham at the University of Illinois at Chicago on Wednesday and Thursday, May 27th & 28th, for this rare opportunity to learn from one of the R community’s most popular and innovative authors and package developers.

As of this post, the workshop is two-thirds sold out. If you’re in or near Chicago and want to boost your R programming skills, this is Hadley’s only Central US public workshop planned for 2015.

Register here:

Data visualization cheatsheet

We’ve added a new cheatsheet to our collection. Data Visualization with ggplot2 describes how to build a plot with ggplot2 and the grammar of graphics. You will find helpful reminders of how to use:

  • geoms
  • stats
  • scales
  • coordinate systems
  • facets
  • position adjustments
  • legends, and
  • themes

The cheatsheet also documents tips on zooming.
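
As a quick reminder of how those pieces fit together, here is a small ggplot2 call (my own example, not taken from the cheatsheet) that layers a geom, a scale, facets, and a theme onto one plot:

library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +                       # geom: one point per car
  scale_y_continuous("highway mpg") +  # scale: label the y axis
  facet_wrap(~ drv) +                  # facets: one panel per drive train
  theme_minimal()                      # theme: overall look of the plot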

Download the cheatsheet here.

Bonus – Frans van Dunné of Innovate Online has provided Spanish translations of the Data Wrangling, R Markdown, Shiny, and Package Development cheatsheets. Download them at the bottom of the cheatsheet gallery.


We’ve added a new cheatsheet to our collection! Package Development with devtools will help you find the most useful functions for building packages in R. The cheatsheet will walk you through the steps of building a package:

  • Setting up the package structure
  • Adding a DESCRIPTION file
  • Writing code
  • Writing tests
  • Writing documentation with roxygen
  • Adding data sets
  • Building a NAMESPACE, and
  • Including vignettes

The sheet focuses on Hadley Wickham’s devtools package, and it is a useful supplement to Hadley’s book R Packages, which you can read online.
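
As a rough sketch of that workflow with devtools (assuming a new package called, say, mypkg — the name is just a placeholder):

library(devtools)

create("mypkg")           # set up the package structure and DESCRIPTION
# ...add R files under mypkg/R/ with roxygen comments above each function...

setwd("mypkg")
document()                # generate man/ pages and the NAMESPACE from roxygen tags
use_testthat()            # set up tests/testthat/ for unit tests
test()                    # run the tests
use_vignette("intro")     # create a vignette skeleton
check()                   # run R CMD check
build()                   # build the source package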

Download the sheet here.

Bonus – Vivian Zhang of SupStat Analytics has kindly translated the existing Data Wrangling, R Markdown, and Shiny cheatsheets into Chinese. You can download the translations at the bottom of the cheatsheet gallery.

Great news for Shiny and R Markdown enthusiasts!

An Interactive Reporting Workshop with Shiny and R Markdown is coming to a city near you. Act fast as only 20 seats are available for each workshop.

You can find out more / register by clicking on the link for your city!

East Coast
  • March 2 – Washington, DC
  • March 4 – New York, NY
  • March 6 – Boston, MA

West Coast
  • April 15 – Los Angeles, CA
  • April 17 – San Francisco, CA
  • April 20 – Seattle, WA

You’ll want to take this workshop if…

You have some experience working with R already. You should have written a number of functions, and be comfortable with R’s basic data structures (vectors, matrices, arrays, lists, and data frames).

You will learn from…

The workshop is taught by Garrett Grolemund. Garrett is the Editor-in-Chief of the Shiny Dev Center, the development center for the Shiny R package. He is also the author of Hands-On Programming with R as well as Data Science with R, a forthcoming book from O’Reilly Media. Garrett works as a Data Scientist and Chief Instructor for RStudio, Inc.

Give yourself the gift of “mastering” R to start 2015!

Join RStudio Chief Data Scientist Hadley Wickham at the Westin San Francisco on January 19 and 20 for this rare opportunity to learn from one of the R community’s most popular and innovative authors and package developers.

As of this post, the workshop is two-thirds sold out. If you’re in or near California and want to boost your R programming skills, this is Hadley’s only West Coast public workshop planned for 2015.

Register here:

