The Joint Statistical Meetings, starting August 8, are the biggest gathering of statisticians in the world. Navigating the sheer quantity of interesting talks is challenging – there can be up to 50 sessions going on at a time!

To prepare for Seattle, we asked RStudio’s Chief Data Scientist Hadley Wickham for his top session picks. Here are 9 talks, admittedly biased towards R, graphics, and education, that really interested him and might interest you, too.

Check out these talks

Undergraduate Curriculum: The Pathway to Sustainable Growth in Our Discipline
Sunday, 1400-1550, CC-607
Statistics with Computing in the Evolving Undergraduate Curriculum
Sunday, 1600-1750
In these back-to-back sessions, learn how statisticians are rising to the challenge of teaching computing and big(ger) data.

Recent Advances in Interactive Graphics for Data Analysis
Monday, 1030-1220, CC-608
This is an exciting session discussing innovations in interactive visualisation, and it’s telling that all of them connect with R. Hadley will be speaking about ggvis in this session.

Preparing Students to Work in Industry
Monday, 1400-1550, CC-4C4
If you’re a student about to graduate, we bet there will be some interesting discussion for you here.

Stat computing and graphics mixer
Monday, 1800-2000, S-Ravenna
This is advertised as a business meeting, but don’t be confused. It’s the premier social event for anyone interested in computing or visualisation!

Statistical Computing and Graphics Student Paper Competition
Tuesday, 0830-1020, CC-308
Hear this year’s winners of the student paper award talk about visualising phylogenetic data, multiple change point detection, capture-recapture data and teaching intro stat with R. All four talks come with accompanying R packages!

The Statistics Identity Crisis: Are We Really Data Scientists?
Tuesday, 0830-1020, CC-609
This session, organised by Jeffrey Leek, features an all-star cast of Alyssa Frazee, Chris Volinsky, Lance Waller, and Jenny Bryan.

Doing Good with Data Viz
Wednesday, 0830-1020, CC-2B
Hear Jake Porway, Patrick Ball, and Dino Citraro talk about using data to do good. Of all the sessions you shouldn’t miss, this is the one you really shouldn’t miss. (But unfortunately it conflicts with another great session – you have difficult choices ahead.)

Statistics at Scale: Applications from Tech Companies
Wednesday, 0830-1020, CC-204
Hilary Parker has organised a fantastic session where you’ll learn how companies like Etsy, Microsoft, and Facebook do statistics at scale. Get there early because this session is going to be PACKED!

There are hundreds of sessions at the JSM, so no doubt we’ve missed other great ones. If you think we’ve missed a “don’t miss” session, please add it to the comments so others can find it.

Visit RStudio at booth #435

In between session times you’ll find the RStudio team hanging out at booth #435 in the expo center (at the far back on the right). Please stop by and say hi! We’ll have stickers to show your R pride and printed copies of many of our cheatsheets.

See you in Seattle!

We’ve added a new articles section to the R Markdown development center at rmarkdown.rstudio.com/articles.html. Here you can find expert advice and tips on how to use R Markdown efficiently.

In one of the first articles, Richard Layton of graphdoctor.com explains the best tips for using R Markdown to generate Microsoft Word documents. You’ll learn how to

  • set Word styles (sketched briefly after this list)
  • tweak margins
  • handle relative paths
  • make better tables
  • add bibliographies, and more
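
As a quick taste of the first item, here is a hedged sketch of rendering to Word with a reference document that carries your styles. This is our illustration, not code from the article, and "report.Rmd" and "my-styles.docx" are hypothetical file names:

library(rmarkdown)
# Render an R Markdown file to Word, pulling heading and paragraph styles
# from a reference .docx ("my-styles.docx" is a placeholder name).
render("report.Rmd",
       output_format = word_document(reference_docx = "my-styles.docx"))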

Check it out, and then check back often.

The articles section on shiny.rstudio.com has lots of great advice for Shiny developers.

A recent article by Dean Attali demonstrates how to save data from a Shiny app to persistent storage structures, like local files, servers, databases, and more. When you do this, your data remains after the app has closed, which opens new doors for data collection and analysis.
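
To give a flavour of the simplest option, here is a rough sketch of the local-file approach. This is our illustration rather than Dean’s exact code; the helper names saveData() and loadData() and the responses directory are assumptions:

# Persist each submitted data frame as its own CSV file in a local
# directory, then read all of the files back into one data frame.
outputDir <- "responses"

saveData <- function(data) {
  dir.create(outputDir, showWarnings = FALSE)
  fileName <- sprintf("%s.csv", format(Sys.time(), "%Y%m%d-%H%M%OS3"))
  write.csv(data, file.path(outputDir, fileName), row.names = FALSE)
}

loadData <- function() {
  files <- list.files(outputDir, full.names = TRUE)
  do.call(rbind, lapply(files, read.csv))
}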

[Figure: persistent storage options for a Shiny app]

Read Dean’s article and more at shiny.rstudio.com/articles

Today’s guest post is written by Vincent Warmerdam of GoDataDriven and is reposted with Vincent’s permission from blog.godatadriven.com. You can learn more about how to use SparkR with RStudio at the 2015 EARL Conference in Boston November 2-4, where Vincent will be speaking live.

This document contains a tutorial on how to provision a Spark cluster with RStudio. You will need a machine that can run bash scripts and a functioning account on AWS. Note that this tutorial is meant for Spark 1.4.0. Future versions will most likely be provisioned in another way, but this should be good enough to help you get started. At the end of this tutorial you will have a fully provisioned Spark cluster that allows you to handle simple dataframe operations on gigabytes of data within RStudio.

AWS prep

Make sure you have an AWS account with billing. Next make sure that you have downloaded your .pem files and that you have your keys ready.

Spark Startup

Next, download Spark to your local machine from the Spark homepage. It’s a pretty big download. Once it has downloaded, unzip it and go to the ec2 folder inside the Spark folder. Then run the following command from the command line.

# Launch a standalone Spark cluster named "mysparkr" on EC2:
#   --key-pair / --identity-file   your AWS key pair name and .pem file
#   --region                       the AWS region to launch in
#   -s 1                           the number of worker (slave) nodes
#   --instance-type                the EC2 instance type to use
./spark-ec2 \
--key-pair=spark-df \
--identity-file=/Users/code/Downloads/spark-df.pem \
--region=eu-west-1 \
-s 1 \
--instance-type c3.2xlarge \
launch mysparkr

This script will use your keys to connect to Amazon and set up a standalone Spark cluster for you. You can specify what type of machines you want to use, as well as how many and in which region. You will only need to wait until everything is installed, which can take up to 10 minutes. More info can be found here.
When the command signals that it is done, you can ssh into your machine via the command line.
./spark-ec2 -k spark-df -i /Users/code/Downloads/spark-df.pem --region=eu-west-1 login mysparkr
Once you are on your Amazon machine you can immediately run SparkR from the terminal.

chmod u+w /root/spark/
./spark/bin/sparkR 

As a toy example, you should be able to confirm that the following code already works.

ddf <- createDataFrame(sqlContext, faithful) 
head(ddf)
printSchema(ddf)

This ddf dataframe is no ordinary dataframe object. It is a distributed dataframe, one that is spread across a network of workers so that Spark can run queries against it in parallel.
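
To get a feel for this, here are a few more operations you can try on ddf. This is a small sketch based on the SparkR 1.4 API; the heavy lifting happens on the workers, and head() only brings a handful of rows back to the local R session:

count(ddf)                           # number of rows, computed on the cluster
head(filter(ddf, ddf$waiting > 70))  # push a filter down to the workers
head(select(ddf, ddf$eruptions))     # project a single column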

Spark UI

The R command you have just run launches a Spark job. Spark has a web UI so you can keep track of the cluster. To visit the web UI, first confirm the IP address of the master node by running this command on it:

curl icanhazip.com

You can now visit the web UI in your browser:

<master-node-ip>:4040

From here you can view anything you may want to know about your Spark cluster (like executor status, job progress, and even a DAG visualisation).

This is a good moment to pause and realize that this, in its own right, is already very cool. We can start up a Spark cluster in 15 minutes and use R to control it. We can specify how many servers we need just by changing a number on the command line, and without any real developer effort we gain access to all this parallel computing power.
Still, working from a terminal might not be too productive. We’d prefer to work with a GUI and we would like some basic plotting functionality when working with data. So let’s install RStudio and get some tools connected.

RStudio setup

Get out of the SparkR shell by entering q(). Next, download and install RStudio Server.
wget http://download2.rstudio.org/rstudio-server-rhel-0.99.446-x86_64.rpm
sudo yum install --nogpgcheck -y rstudio-server-rhel-0.99.446-x86_64.rpm
rstudio-server restart
While this is installing, make sure that TCP port 8787 is open in the AWS security group settings for the master node. A recommended setting is to only allow access from your own IP address.

Then add a user that can access RStudio, and make sure that this user can also access all the files it needs.

adduser analyst
passwd analyst

You also need to run the commands below (the details of why are a bit involved); these edits are needed because the analyst user doesn’t have root permissions.
# Let non-root users write to the Spark scratch directories
chmod a+w /mnt/spark
chmod a+w /mnt2/spark
# Comment out the ulimit call in spark-env.sh (it fails for the non-root analyst user) ...
sed -e 's/^ulimit/#ulimit/g' /root/spark/conf/spark-env.sh > /root/spark/conf/spark-env2.sh
mv /root/spark/conf/spark-env2.sh /root/spark/conf/spark-env.sh
# ... and raise the open-file limit for the current session instead
ulimit -n 1000000
When this is done, point your browser to <master-node-ip>:8787 and log in as analyst.

Loading data from S3

Let’s confirm that we can now play with the RStudio stack by loading some libraries and running them against data that lives on S3.
library(magrittr)  # for the %>% pipe operator used below
small_file <- "s3n://<AWS-ID>:<AWS-SECRET-KEY>@<bucket_name>/data.json"
dist_df <- read.df(sqlContext, small_file, "json") %>% cache
This dist_df is now a distributed dataframe, which has a different API than a normal R dataframe but one that is similar to dplyr’s.
head(summarize(groupBy(dist_df, dist_df$type), count = n(dist_df$auc)))
With magrittr loaded, we can also use the %>% pipe to make this code look a lot nicer.

local_df <- dist_df %>%
  groupBy(dist_df$type) %>%
  summarize(count = n(dist_df$id)) %>%
  collect

The collect method pulls the distributed dataframe back into a normal dataframe on a single machine, so you can use plotting methods on it again and use R as you would normally. A common use case is to use Spark to sample or aggregate a large dataset, which can then be explored further in R.
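
As a small illustration of that pattern, here is a sketch that reuses the toy ddf from earlier (the faithful data): aggregate on the cluster, collect the small result, and then plot it locally with ordinary R functions.

waiting_counts <- ddf %>%
  groupBy(ddf$waiting) %>%
  summarize(count = n(ddf$waiting)) %>%
  collect
# waiting_counts is now a plain local dataframe, so base R plotting works again.
plot(waiting_counts$waiting, waiting_counts$count, type = "h",
     xlab = "waiting time", ylab = "count")
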
Again, if you want to view the Spark UI for these jobs you can just go to:

<master-node-ip>:4040

A more complete stack

Unfortunately this stack comes with an old version of R (we need version 3.2 to get the newest versions of ggplot2 and dplyr). Also, as of right now there is no support for the machine learning libraries yet. These are known issues at the moment, and version 1.5 should bring some fixes. Version 1.5 will also feature RStudio installation as part of the ec2 stack.
Another issue is that the namespace of dplyr currently conflicts with SparkR’s; time will tell how this gets resolved. The same goes for other data features, like windowing functions and more elaborate data types.
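
One pragmatic workaround, sketched below with the toy ddf from earlier, is to be explicit about which package a verb comes from, since dplyr and SparkR both export functions such as filter(), select(), and summarize():

library(dplyr)
library(SparkR)   # masks several dplyr verbs when loaded second
dplyr::filter(mtcars, mpg > 25)         # runs locally on an ordinary dataframe
SparkR::filter(ddf, ddf$waiting > 70)   # runs on the cluster, returns a distributed dataframe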

Killing the cluster

When you are done with the cluster, you only need to exit the ssh connection and run the following command:
./spark-ec2 -k spark-df -i /Users/code/Downloads/spark-df.pem --region=eu-west-1 destroy mysparkr

Conclusion

The economics of Spark are very interesting. We only pay Amazon for the time that we are using Spark as a compute engine; at all other times we’d only pay for S3. This means that if we analyse for 8 hours, we’d only pay for 8 hours. Spark is also very flexible in that it allows us to keep coding in R (or Python or Scala) without having to learn multiple domain-specific languages or frameworks, as in Hadoop. Spark makes big data really simple again.
This document is meant to help you get started with Spark and RStudio, but in a production environment there are a few things you still need to account for:

  • Security: our web connection is not made over HTTPS, so even though we are telling Amazon to only allow our own IP, we may be at risk if there is a man in the middle listening.
  • Multiple users: this setup will work fine for a single user, but if multiple users are working on such a cluster you may need to rethink some steps with regard to user groups, file access, and resource management.
  • Privacy: this setup works well on EC2, but if you have sensitive, private user data you may need to do this on premise because the data cannot leave your own datacenter. Most install steps would be the same, but the initial installation of Spark would require the most work. See the docs for more information.

Spark is an amazing tool; expect more features in the future.

Possible Gotcha

Hanging

It can happen that the ec2 script hangs at the "Waiting for cluster to enter 'ssh-ready' state" step. This can happen if you use Amazon a lot. To prevent it, you may want to remove some lines from ~/.ssh/known_hosts. More info here. Another option is to add the following lines to your ~/.ssh/config file.

# AWS EC2 public hostnames (changing IPs)
Host *.compute.amazonaws.com 
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null

To paraphrase Yogi Berra, “Predicting is hard, especially about the future”. In 1993, when Ross Ihaka and Robert Gentleman first started working on R, who would have predicted that it would be used by millions in a world that increasingly rewards data literacy? It’s impossible to know where R will go in the next 20 years, but at RStudio we’re working hard to make sure the future is bright.

Today, we’re excited to announce our participation in the R Consortium, a new 501(c)6 nonprofit organization. The R Consortium is a collaboration between the R Foundation, RStudio, Microsoft, TIBCO, Google, Oracle, HP and others. It’s chartered to fund and inspire ideas that will enable R to become an even better platform for science, research, and industry. The R Consortium complements the R Foundation by providing a convenient funding vehicle for the many commercial beneficiaries of R to give back to the community, and will provide the resources to embark on ambitious new projects to make R even better.

We believe the R Consortium is critically important to the future of R and despite our small size, we chose to join it at the highest contributor level (alongside Microsoft). Open source is a key component of our mission and giving back to the community is extremely important to us.

The community of R users and developers have a big stake in the language and its long-term success. We all want free and open source R to continue thriving and growing for the next 20 years and beyond. The fact that so many of the technology industry’s largest companies are willing to stand behind R as part of the consortium is remarkable and we think bodes incredibly well for the future of R.

We are excited to announce that a new package, leaflet, has been released on CRAN. The R package leaflet is an interface to the JavaScript library Leaflet for creating interactive web maps. It was developed on top of the htmlwidgets framework, which means the maps can be rendered in R Markdown (v2) documents, Shiny apps, and the RStudio IDE / R console. Please see http://rstudio.github.io/leaflet for the full documentation. To install the package, run

install.packages('leaflet')

We quietly introduced this package in December when we announced htmlwidgets, but in the months since then we’ve added a lot of new features and launched a new set of documentation. If you haven’t looked at leaflet lately, now is a great time to get reacquainted!

The Map Widget

The basic usage of this package is that you create a map widget using the leaflet() function, and add layers to the map using the layer functions such as addTiles(), addMarkers(), and so on. Adding layers can be done through the pipe operator %>% from magrittr (you are not required to use %>%, though):

library(leaflet)

m <- leaflet() %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addMarkers(lng=174.768, lat=-36.852,
    popup="The birthplace of R")
m  # Print the map

[Figure: a basic leaflet map with a marker at the birthplace of R]

There are a variety of layers that you can add to a map widget, including:

  • Map tiles
  • Markers / Circle Markers
  • Polygons / Rectangles
  • Lines
  • Popups
  • GeoJSON / TopoJSON
  • Raster Images
  • Color Legends
  • Layer Groups and Layer Control

There is also a set of methods to manipulate the attributes of a map, such as setView() and fitBounds(). You can find the details on the help page ?setView.
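
For instance, here is a small sketch reusing the map m from above (the coordinates are simply the Auckland area around the marker we added):

m %>% setView(lng = 174.768, lat = -36.852, zoom = 12)   # centre and zoom the map
m %>% fitBounds(lng1 = 174.70, lat1 = -36.90,
                lng2 = 174.85, lat2 = -36.80)            # or fit a bounding box instead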

We are happy to announce that a new package, DT, is available on CRAN now. DT is an interface to the JavaScript library DataTables, based on the htmlwidgets framework, for presenting rectangular R data objects (such as data frames and matrices) as HTML tables. You can filter, search, and sort the data in the table. See http://rstudio.github.io/DT/ for the full documentation and examples of this package. To install the package, run

install.packages('DT')
# run DT::datatable(iris) to see a "hello world" example

DataTables

The main function in this package is datatable(), which returns a table widget that can be rendered in R Markdown documents, Shiny apps, and the R console. It is easy to customize the style (cell borders, row striping, and row highlighting, etc), theme (default or Bootstrap), row/column names, table caption, and so on.
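
For example, here is a brief sketch of a customized table; the specific settings shown (such as pageLength) are standard DataTables options we picked purely for illustration:

library(DT)
datatable(iris,
  style = 'bootstrap',                  # Bootstrap theme instead of the default
  caption = 'Table 1: The iris data',   # table caption
  options = list(pageLength = 5)        # show 5 rows per page
)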

We’re pleased to announce d3heatmap, our new package for generating interactive heat maps using d3.js and htmlwidgets. Tal Galili, author of dendextend, collaborated with us on this package.

d3heatmap is designed to have a familiar feature set and API for anyone who has used heatmap or heatmap.2 to create static heatmaps. You can specify dendrogram, clustering, and scaling options in the same way.

d3heatmap includes the following features:

  • Shows the row/column/value under the mouse cursor
  • Click row/column labels to highlight
  • Drag a rectangle over the image to zoom in
  • Works from the R console, in RStudio, with R Markdown, and with Shiny

Installation

install.packages("d3heatmap")

Examples

Here’s a very simple example (source: flowingdata):

library(d3heatmap)
url <- "http://datasets.flowingdata.com/ppg2008.csv"
nba_players <- read.csv(url, row.names = 1)
d3heatmap(nba_players, scale = "column")

[Figure: an interactive d3heatmap of the NBA player data]

You can easily customize the colors using the colors parameter. This can take an RColorBrewer palette name, a vector of colors, or a function that takes (potentially scaled) data points as input and returns colors.
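
Here is a small sketch of the first two forms, using the built-in mtcars data rather than the NBA example above:

d3heatmap(mtcars, scale = "column", colors = "Blues")                  # an RColorBrewer palette name
d3heatmap(mtcars, scale = "column", colors = c("white", "steelblue"))  # a vector of colors to interpolate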

It’s time to share “What’s New” in shinyapps.io!

  • Custom Domains – host your Shiny applications on your own domain (Professional Plan only). Learn more.
  • Bigger applications – include up to 1GB of data with your application bundles!
  • Bigger packages – until now shinyapps.io could only support installation of packages under 100MB; now it’s 1GB! (attention users of BioConductor packages especially)
  • Better locale detection – the newest shinyapps package now detects and maps your locale appropriately if it varies from the locale of shinyapps.io  (you will need to update  your shinyapps package)
  • Application deletion – you can now delete applications permanently. First archive your application, then delete it. Note: Be careful; all deletes are permanent.
  • Transfer / Rename accounts – select a different name for your account or transfer control to another shinyapps.io account holder.
  • “What’s New” is new – your dashboard displays the latest enhancements to shinyapps.io under… you guessed it… “What’s New”!

We’ve added two new tools that make it even easier to learn Shiny.

Video tutorial

[Screenshot: the How to Start with Shiny video tutorial]

The How to Start with Shiny training video provides a new way to teach yourself Shiny. The video covers everything you need to know to build your own Shiny apps. You’ll learn:

  • The architecture of a Shiny app
  • A template for making apps quickly (a minimal example is sketched after this list)
  • The basics of building Shiny apps
  • How to add sliders, drop down menus, buttons, and more to your apps
  • How to share Shiny apps
  • How to control reactions in your apps to
    • update displays
    • trigger code
    • reduce computation
    • delay reactions
  • How to add design elements to your apps
  • How to customize the layout of an app
  • How to style your apps with CSS
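
For reference, here is what such a minimal template can look like (our sketch, not code taken from the video): one input, one reactive output, all in a single app.R file.

library(shiny)

ui <- fluidPage(
  sliderInput("n", "Number of points", min = 10, max = 500, value = 100),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    # Re-draws automatically whenever the slider changes
    plot(rnorm(input$n), rnorm(input$n))
  })
}

shinyApp(ui, server)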

Altogether, the video contains two hours and 25 minutes of material organized around a navigable table of contents.

Best of all, the video tutorial is completely free. The video is the result of our recent How to Start Shiny webinar series. Thank you to everyone who attended and made the series a success!

Watch the new video tutorial here.

New cheat sheet

The new Shiny cheat sheet provides an up-to-date reference to the most important Shiny functions.

[Image: the new Shiny cheat sheet]

The cheat sheet replaces the previous cheat sheet, adding new sections on single-file apps, reactivity, CSS and more. The new sheet also gave us a chance to apply some of the things we’ve learned about making cheat sheets since the original Shiny cheat sheet came out.

Get the new Shiny cheat sheet here.
