To paraphrase Yogi Berra, “Predicting is hard, especially about the future”. In 1993, when Ross Ihaka and Robert Gentleman first started working on R, who would have predicted that it would be used by millions in a world that increasingly rewards data literacy? It’s impossible to know where R will go in the next 20 years, but at RStudio we’re working hard to make sure the future is bright.

Today, we’re excited to announce our participation in the R Consortium, a new 501(c)6 nonprofit organization. The R Consortium is a collaboration between the R Foundation, RStudio, Microsoft, TIBCO, Google, Oracle, HP and others. It’s chartered to fund and inspire ideas that will enable R to become an even better platform for science, research, and industry. The R Consortium complements the R Foundation by providing a convenient funding vehicle for the many commercial beneficiaries of R to give back to the community, and will provide the resources to embark on ambitious new projects to make R even better.

We believe the R Consortium is critically important to the future of R and despite our small size, we chose to join it at the highest contributor level (alongside Microsoft). Open source is a key component of our mission and giving back to the community is extremely important to us.

The community of R users and developers has a big stake in the language and its long-term success. We all want free and open source R to continue thriving and growing for the next 20 years and beyond. The fact that so many of the technology industry’s largest companies are willing to stand behind R as part of the consortium is remarkable, and we think it bodes incredibly well for the future of R.

We are excited to announce that a new package, leaflet, has been released on CRAN. The R package leaflet is an interface to the JavaScript library Leaflet for creating interactive web maps. It is built on top of the htmlwidgets framework, which means the maps can be rendered in R Markdown (v2) documents, Shiny apps, and the RStudio IDE / R console. Please see http://rstudio.github.io/leaflet for the full documentation. To install the package, run

install.packages('leaflet')

We quietly introduced this package in December when we announced htmlwidgets, but in the months since then we’ve added a lot of new features and launched a new set of documentation. If you haven’t looked at leaflet lately, now is a great time to get reacquainted!

The Map Widget

The basic usage of this package is to create a map widget with the leaflet() function and then add layers to it with layer functions such as addTiles(), addMarkers(), and so on. Layers can be chained with the pipe operator %>% from magrittr (although you are not required to use %>%):

library(leaflet)

m <- leaflet() %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addMarkers(lng=174.768, lat=-36.852,
    popup="The birthplace of R")
m  # Print the map

(Screenshot: the basic leaflet map produced by the code above)

There are a variety of layers that you can add to a map widget (a few are combined in the sketch after this list), including:

  • Map tiles
  • Markers / Circle Markers
  • Polygons / Rectangles
  • Lines
  • Popups
  • GeoJSON / TopoJSON
  • Raster Images
  • Color Legends
  • Layer Groups and Layer Control
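
Here is a minimal sketch combining a couple of the layer types listed above (the coordinates are arbitrary and only for illustration):

library(leaflet)

leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = 174.768, lat = -36.852, radius = 8,
    popup = "A circle marker") %>%       # circle marker with a popup
  addRectangles(lng1 = 174.70, lat1 = -36.90,
    lng2 = 174.80, lat2 = -36.80)        # a simple rectangle layer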

There is also a set of methods to manipulate the attributes of a map, such as setView() and fitBounds(); see the help page ?setView for details.
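
For example (a small sketch reusing the map m created above; the coordinates and zoom level are arbitrary):

m %>% setView(lng = 174.768, lat = -36.852, zoom = 12)  # center the map and set the zoom level
m %>% fitBounds(lng1 = 174.70, lat1 = -36.90,
  lng2 = 174.80, lat2 = -36.80)                         # fit the view to a bounding box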


We are happy to announce that a new package, DT, is now available on CRAN. DT is an interface to the JavaScript library DataTables, built on the htmlwidgets framework, for presenting rectangular R data objects (such as data frames and matrices) as HTML tables. You can filter, search, and sort the data in the table. See http://rstudio.github.io/DT/ for the full documentation and examples. To install the package, run

install.packages('DT')
# run DT::datatable(iris) to see a "hello world" example

DataTables

The main function in this package is datatable(), which returns a table widget that can be rendered in R Markdown documents, Shiny apps, and the R console. It is easy to customize the style (cell borders, row striping, row highlighting, etc.), theme (default or Bootstrap), row/column names, table caption, and so on.
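
For instance, a hedged sketch of that kind of customization (the argument values here are only illustrative):

library(DT)
datatable(
  iris,
  class = 'cell-border stripe',        # cell borders plus row striping
  caption = 'Table 1: The iris data',  # table caption
  rownames = FALSE,                    # hide row names
  options = list(pageLength = 5)       # show 5 rows per page
)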


We’re pleased to announce d3heatmap, our new package for generating interactive heat maps using d3.js and htmlwidgets. Tal Galili, author of dendextend, collaborated with us on this package.

d3heatmap is designed to have a familiar feature set and API for anyone who has used heatmap or heatmap.2 to create static heatmaps. You can specify dendrogram, clustering, and scaling options in the same way.

d3heatmap includes the following features:

  • Shows the row/column/value under the mouse cursor
  • Click row/column labels to highlight
  • Drag a rectangle over the image to zoom in
  • Works from the R console, in RStudio, with R Markdown, and with Shiny

Installation

install.packages("d3heatmap")

Examples

Here’s a very simple example (source: flowingdata):

library(d3heatmap)
url <- "http://datasets.flowingdata.com/ppg2008.csv"
nba_players <- read.csv(url, row.names = 1)
d3heatmap(nba_players, scale = "column")

d3heatmap

You can easily customize the colors using the colors parameter. This can take an RColorBrewer palette name, a vector of colors, or a function that takes (potentially scaled) data points as input and returns colors.
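
For example (an illustrative sketch reusing the nba_players data from above):

d3heatmap(nba_players, scale = "column", colors = "Blues")                  # RColorBrewer palette name
d3heatmap(nba_players, scale = "column", colors = c("white", "steelblue"))  # vector of colors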


It’s time to share “What’s New” in shinyapps.io!

  • Custom Domains – host your Shiny applications on your own domain (Professional Plan only). Learn more.
  • Bigger applications – include up to 1GB of data with your application bundles!
  • Bigger packages – until now shinyapps.io could only support installation of packages under 100MB; now it’s 1GB! (attention users of BioConductor packages especially)
  • Better locale detection – the newest shinyapps package now detects and maps your locale appropriately if it differs from the locale of shinyapps.io (you will need to update your shinyapps package)
  • Application deletion – you can now delete applications permanently. First archive your application, then delete it. Note: Be careful; all deletes are permanent.
  • Transfer / Rename accounts – select a different name for your account or transfer control to another shinyapps.io account holder.
  • “What’s New” is New – your dashboard displays the latest enhancements to shinyapps.io under…you guessed it… “What’s New”!

We’ve added two new tools that make it even easier to learn Shiny.

Video tutorial

(Screenshot: the How to Start with Shiny video tutorial)

The How to Start with Shiny training video provides a new way to teach yourself Shiny. The video covers everything you need to know to build your own Shiny apps. You’ll learn:

  • The architecture of a Shiny app
  • A template for making apps quickly (a minimal sketch follows this list)
  • The basics of building Shiny apps
  • How to add sliders, drop-down menus, buttons, and more to your apps
  • How to share Shiny apps
  • How to control reactions in your apps to
    • update displays
    • trigger code
    • reduce computation
    • delay reactions
  • How to add design elements to your apps
  • How to customize the layout of an app
  • How to style your apps with CSS
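
For reference, here is a minimal single-file app of the kind the video walks through (a sketch of the general pattern, not code taken from the video itself):

library(shiny)

ui <- fluidPage(
  sliderInput("n", "Number of points:", min = 10, max = 100, value = 50),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(runif(input$n), runif(input$n))  # re-renders whenever the slider changes
  })
}

shinyApp(ui = ui, server = server)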

Altogether, the video contains two hours and 25 minutes of material organized around a navigable table of contents.

Best of all, the video tutorial is completely free. The video is the result of our recent How to Start Shiny webinar series. Thank you to everyone who attended and made the series a success!

Watch the new video tutorial here.

New cheat sheet

The new Shiny cheat sheet provides an up-to-date reference to the most important Shiny functions.

(Screenshot: the Shiny cheat sheet)

The new cheat sheet replaces the previous one, adding sections on single-file apps, reactivity, CSS, and more. It also gave us a chance to apply some of the things we’ve learned about making cheat sheets since the original Shiny cheat sheet came out.

Get the new Shiny cheat sheet here.

Shiny 0.12 has been released to CRAN!

Compared to version 0.11.1, the major changes are:

  • Interactive plots with base graphics and ggplot2
  • Switch from RJSONIO to jsonlite

For a full list of changes and bugfixes in this version, see the NEWS file.

To install the new version of Shiny, run:

install.packages(c("shiny", "htmlwidgets"))

htmlwidgets is not required, but Shiny 0.12 will not work with older versions of htmlwidgets, so it’s a good idea to install a fresh copy along with Shiny.

Interactive plots with base graphics and ggplot2

(Animation: excluding points from an interactive plot)

The major new feature in this version of Shiny is the ability to create interactive plots using R’s base graphics or ggplot2. Adding interactivity is easy: it just requires using one option in plotOutput(), and then the information about mouse events will be available via the input object.

You can use mouse events to read mouse coordinates, select or deselect points, and implement zooming.
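
As a small illustration of the idea (a minimal sketch, not one of the demo apps), clicking on the plot below prints the mouse coordinates:

library(shiny)

ui <- fluidPage(
  plotOutput("plot1", click = "plot_click"),  # the click argument enables mouse events
  verbatimTextOutput("info")
)

server <- function(input, output) {
  output$plot1 <- renderPlot({
    plot(mtcars$wt, mtcars$mpg)
  })
  output$info <- renderPrint({
    input$plot_click  # NULL until the plot is clicked; then a list with x and y
  })
}

shinyApp(ui, server)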

For more information, see the Interactive Plots articles in the Shiny Dev Center, and the demo apps in the gallery.

Switch from RJSONIO to jsonlite

Shiny uses the JSON format to send data between the server (running R) and the client web browser (running JavaScript).

In previous versions of Shiny, the data was serialized to and from JSON using the RJSONIO package. As of 0.12.0, Shiny has switched from RJSONIO to jsonlite, because jsonlite has better-defined conversion behavior and better performance (much of it is now implemented in C).
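
To give a flavor of jsonlite’s conversion rules (this is only an illustration of the library’s default behavior, not Shiny’s internal code):

library(jsonlite)
toJSON(head(iris, 2))  # data frames become arrays of row objects; factors become strings
toJSON(42)             # length-one vectors are still serialized as JSON arrays by default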

For the vast majority of users, this will have no impact on existing Shiny apps.

The htmlwidgets package has also switched to jsonlite, and any Shiny apps that use htmlwidgets also require an upgrade to that package.

A note about Data Tables

The version we just released to CRAN is actually 0.12.1; the previous version, 0.12.0, was released three weeks ago. That release deprecated Shiny’s dataTableOutput and renderDataTable functions and instructed you to migrate to the nascent DT package instead. (We’ll talk more about DT in a future blog post.)

User feedback has indicated that this transition was too abrupt, so we’ve un-deprecated these functions in 0.12.1. We’ll continue to support them until DT has had more time to mature.
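
If you do want to try DT in a Shiny app today, the migration is roughly a matter of switching to the DT equivalents (a hedged sketch; see the DT documentation for details):

library(shiny)
library(DT)

ui <- fluidPage(
  DT::dataTableOutput("tbl")               # instead of shiny::dataTableOutput()
)

server <- function(input, output) {
  output$tbl <- DT::renderDataTable(iris)  # instead of shiny::renderDataTable()
}

shinyApp(ui, server)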

“Master” R in Washington DC this September!

Join RStudio Chief Data Scientist Hadley Wickham at the AMA – Executive Conference Center in Arlington, VA on September 14 and 15, 2015 for this rare opportunity to learn from one of the R community’s most popular and innovative authors and package developers.

It will be at least another year before Hadley returns to teach his class on the East Coast, so don’t miss this opportunity to learn from him in person. The venue is conveniently located next to Ronald Reagan Washington National Airport and a short distance from the Metro. Attendance is limited. Past events have sold out.

Register today!

testthat 0.10.0 is now available on CRAN. Testthat makes it easy to turn the informal testing that you’re already doing into formal automated tests. Learn more at http://r-pkgs.had.co.nz/tests.html. Install the latest version with:

install.packages("testthat")

There are four big changes in this release:

  • test_check() uses a new reporter specifically designed for R CMD check. It displays a summary at the end of the tests, designed to be <13 lines long so that the test failures shown by R CMD check are as useful as possible.
  • New skip_if_not_installed() skips a test if a package isn’t installed: this is useful when you want a test to be skipped if a suggested package isn’t available.
  • The expect_that(a, equals(b)) style of testing has been soft-deprecated in favour of expect_equal(a, b). It will keep working, but it’s no longer demonstrated in the documentation, and new expectations will only be available in the expect_equal(a, b) style.
  • compare() is now documented and exported: it is used to display test failures for expect_equal(), and is designed to help you spot exactly where the failure occurred. It currently has methods for character and numeric vectors; a short example follows this list.
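
A short sketch showing these pieces together (the test content itself is only illustrative):

library(testthat)

test_that("basic expectations", {
  skip_if_not_installed("MASS")    # skip this test if a suggested package is missing
  expect_equal(1 + 1, 2)           # preferred expect_equal(a, b) style
  expect_that(sqrt(4), equals(2))  # old style: still works, but soft-deprecated
})

compare(c(1, 2, 3), c(1, 2, 4))    # shows where the two vectors differ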

There were a number of other minor improvements and bug fixes. See the release notes for a complete list.

This is a guest post by Vincent Warmerdam of koaning.io.

SparkR preview in RStudio

Apache Spark is the hip new technology on the block. It allows you to write scripts in a functional style, and the technology behind it lets you run iterative tasks very quickly on a cluster of machines. It has been benchmarked to be faster than Hadoop for most machine learning use cases (by a factor of 10 to 100), and soon Spark will also have support for the R language. As of April 2015, SparkR has been merged into Apache Spark and will ship with the upcoming 1.4 release, due in early summer 2015. In the meantime, you can use this tutorial to get familiar with the current version of SparkR.

Disclaimer: although you can use this tutorial to write Spark jobs with R right now, the new API due this summer will most likely have breaking changes.

Running Spark Locally

The main feature of Spark is the resilient distributed dataset (RDD), a dataset that can be queried in memory, in parallel, across a cluster of machines. You don’t need a cluster of machines to get started with Spark, though: even on a single machine, Spark can make efficient use of whatever resources are configured. To keep it simple we will ignore this configuration for now and do a quick one-click install. You can use devtools to download and install Spark with SparkR.

library(devtools)
install_github("amplab-extras/SparkR-pkg", subdir="pkg")

This might take a while, but once the installation is done, the following R code will run a Spark job for you:

library(magrittr)
library(SparkR)

sc <- sparkR.init(master="local")

sc %>% 
  parallelize(1:100000) %>%
  count

This small program generates a list, hands it to Spark (which turns it into an RDD, Spark’s resilient distributed dataset structure), and then counts the number of items in it. SparkR exposes Spark’s RDD API as distributed lists in R, which plays very nicely with magrittr. As long as you follow the API, you don’t need to worry much about how your programs are parallelized.

A more elaborate example

Spark also allows for grouped operations, which might remind you a bit of dplyr.

nums <- runif(100000) * 10

sc %>% 
  parallelize(nums) %>% 
  map(function(x) round(x)) %>%
  filterRDD(function(x) x %% 2) %>% 
  map(function(x) list(x, 1)) %>%
  reduceByKey(function(x,y) x + y, 1L) %>% 
  collect

The Spark API will look very ‘functional’ to programmers used to functional programming languages (which should come as no surprise if you know that Spark is written in Scala). This script will do the following:

  1. it will create an RDD from the original data
  2. it will map each number to a rounded number
  3. it will filter all even numbers out of the RDD
  4. next it will create key/value pairs that can be counted
  5. it then reduces the key/value pairs (the 1L is the number of partitions for the resulting RDD)
  6. and it collects the results

Spark will have started running services locally on your computer, which can be viewed at http://localhost:4040/stages/. You should be able to see all the jobs you’ve run there, including any jobs that failed, along with their error logs.

Bootstrapping with Spark

These examples are nice, but you can also use the power of Spark for more common data science tasks. Let’s sample a dataset to generate a large RDD, which we will then summarise via bootstrapping. Instead of parallelizing numbers, I will now parallelize data frame samples.

sc <- sparkR.init(master="local")

sample_cw <- function(n, s){
  set.seed(s)
  ChickWeight[sample(nrow(ChickWeight), n), ]
}

data_rdd <- sc %>%
  parallelize(1:200, 20) %>% 
  map(function(s) sample_cw(250, s))

In the parallelize() call we can specify the number of partitions Spark should use for the resulting RDD. The s argument ensures that each partition uses a different random seed when sampling. This data_rdd is useful because it can be reused for multiple purposes.

You can use it to estimate the mean of the weight.

data_rdd %>% 
  map(function(x) mean(x$weight)) %>% 
  collect %>% 
  as.numeric %>% 
  hist(20, main="mean weight, bootstrap samples")

Or you can use it to perform bootstrapped regressions.

train_lm <- function(data_in){
  lm(data=data_in, weight ~ Time)
}

coef_rdd <- data_rdd %>% 
  map(train_lm) %>% 
  map(function(x) x$coefficients) 

get_coef <- function(k) { 
  coef_rdd %>%
    map(function(x) x[k]) %>% 
    collect %>%
    as.numeric
}

df <- data.frame(intercept = get_coef(1), time_coef = get_coef(2))
df$intercept %>% hist(breaks = 30, main="beta coef for intercept")
df$time_coef %>% hist(breaks = 30, main="beta coef for time")

The slow part of this chain of operations is the creation of the data, because this has to occur locally through R. A more common use case for Spark would be to load a large dataset from S3 into a large EC2 cluster of Spark machines and process it there.
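
For instance, SparkR-pkg also exposes a textFile() function for reading text data directly into an RDD; a hypothetical sketch of that workflow (the S3 bucket name is a placeholder, and AWS credentials are assumed to be configured on the cluster) might look like:

lines_rdd <- sc %>%
  textFile("s3n://my-bucket/big-dataset.csv")      # placeholder S3 path

lines_rdd %>%
  map(function(line) strsplit(line, ",")[[1]]) %>% # split each line into fields
  count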

More power?

Running Spark locally is nice and already allows for parallelism, but the real gains come from running Spark on a cluster of computers. Conveniently, Spark comes with a script that automates the provisioning of a Spark cluster on Amazon AWS.

To get a cluster started, start up an EC2 cluster with the supplied ec2 folder from Apache Spark’s GitHub repo. A more elaborate tutorial can be found here, but if you are already an Amazon user, provisioning a cluster on Amazon is as simple as calling a one-liner:

./spark-ec2 \
--key-pair=spark-df \
--identity-file=/path/spark-df.pem \
--region=eu-west-1 \
-s 3 \
--instance-type c3.2xlarge \
launch my-spark-cluster

If you ssh into the master node that has just been set up, you can run the following code:

cd /root
git clone https://github.com/amplab-extras/SparkR-pkg.git
cd SparkR-pkg
SPARK_VERSION=1.2.1 ./install-dev.sh
cp -a /root/SparkR-pkg/lib/SparkR /usr/share/R/library/
/root/spark-ec2/copy-dir /root/SparkR-pkg
/root/spark/sbin/slaves.sh cp -a /root/SparkR-pkg/lib/SparkR /usr/share/R/library/

Launch SparkR on a Cluster

Finally to launch SparkR and connect to the Spark EC2 cluster, we run the following code on the master machine:

MASTER=spark://<master-hostname>:7077 ./sparkR

The hostname can be retrieved using:

cat /root/spark-ec2/cluster-url

You can check on the status of your cluster via Spark’s Web UI at http://<master-hostname>:8080.

The future

Everything described in this document is subject to change with the next Spark release, but it should help you get familiar with how Spark works. There will be R support for Spark, with less emphasis on low-level RDD operations and more on its distributed machine learning algorithms as well as DataFrame objects.

The support for R in the Spark universe might be a game changer. R has always been great for exploratory and interactive analysis on small to medium datasets. With the addition of Spark, R can become a more viable tool for big datasets.

Spark 1.4, currently planned for release in June, will allow R users to run data frame operations in parallel on the distributed memory of a cluster of computers. All of it is completely open source.

It will be interesting to see what possibilities this brings for the R community.
