I’m planning to release ggplot2 2.2.0 in early November. In preparation, I’d like to announce that a release candidate is now available: version 2.1.0.9001. Please try it out, and file an issue on GitHub if you discover any problems. I hope we can find and fix any major issues before the official release.
Install the pre-release version with:
# install.packages("devtools")
devtools::install_github("hadley/ggplot2")
If you discover a major bug that breaks your plots, please file a minimal reprex, and then roll back to the released version with:
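install.packages("ggplot2")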
ggplot2 2.2.0 will be a relatively major release including:
- Subtitles and captions.
- A large rewrite of the facetting system.
- Improved theme options.
- Better stacking.
- Numerous bug fixes and minor improvements.
The majority of this work was carried out by Thomas Lin Pedersen, whom I was lucky to have as my “ggplot2 intern” this summer. Make sure to check out his other visualisation packages: ggraph, ggforce, and tweenr.
The facet and layout implementation has been moved to ggproto and received a large rewrite and refactoring. This will allow others to create their own facetting systems, as described in the Extending ggplot2 vignette. Along with the rewrite, a number of features and improvements have been added, most notably:
- Functions in facetting formulas, thanks to Dan Ruderman.
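For example (a minimal sketch; the particular binning function and variables are illustrative, not taken from the original post):

ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  facet_wrap(~ cut_number(depth, 6))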
- Previously, axes were dropped when the panels in facet_wrap() did not completely fill the rectangle. Now, an axis is drawn underneath the hanging panels:
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class)
- It is now possible to set the position of the axes through the position argument in the scale constructor:
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(position = "top") +
  scale_y_continuous(position = "right")
- You can display a secondary axis that is a one-to-one transformation of the primary axis with the sec.axis argument:
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(
    "mpg (US)",
    sec.axis = sec_axis(~ . * 1.20, name = "mpg (UK)")
  )
- Strips can be placed on any side, and the placement with respect to axes can be controlled with the strip.placement theme option:
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~ drv, strip.position = "bottom") +
  theme(
    strip.placement = "outside",
    strip.background = element_blank(),
    strip.text = element_text(face = "bold")
  ) +
  xlab(NULL)
- Blank elements can now be overridden again, so you get the expected behavior when setting e.g. axis.line = element_line().
- element_line() now takes an arrow argument that lets you put arrows on axes:
arrow <- arrow(length = unit(0.4, "cm"), type = "closed")

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  theme_minimal() +
  theme(
    axis.line = element_line(arrow = arrow)
  )
- Control of legend styling has been improved. The whole legend area can be aligned according to the plot area and a box can be drawn around all legends:
ggplot(mpg, aes(displ, hwy, shape = drv, colour = fl)) +
  geom_point() +
  theme(
    legend.justification = "top",
    legend.box.margin = margin(3, 3, 3, 3, "mm"),
    legend.box.background = element_rect(colour = "grey50")
  )
- The existing panel.margin and legend.margin have been renamed to panel.spacing and legend.spacing respectively, as this better indicates their roles. A new legend.margin now actually controls the margin around each legend.
- When computing the height of titles, ggplot2 now includes the height of the descenders (i.e. the bits of letters like “y” and “g” that hang underneath). This improves the margins around titles, particularly the y axis label. I have also very slightly increased the inner margins of axis titles, and removed the outer margins.
- The default themes have been tweaked by Jean-Olivier Irisson, making them better match theme_grey().
- Lastly, the theme() function now has named arguments, so autocomplete and documentation suggestions are vastly improved.
position_stack() and position_fill() now stack values in the reverse order of the grouping, which makes the default stack order match the legend.
library(dplyr)

avg_price <- diamonds %>%
  group_by(cut, color) %>%
  summarise(price = mean(price)) %>%
  ungroup() %>%
  mutate(price_rel = price - mean(price))

ggplot(avg_price) +
  geom_col(aes(x = cut, y = price, fill = color))
(Note also the new geom_col(), which is shorthand for geom_bar(stat = "identity"), contributed by Bob Rudis.)
Additionally, you can now stack negative values:
ggplot(avg_price) +
  geom_col(aes(x = cut, y = price_rel, fill = color))
The overall ordering cannot necessarily be matched in the presence of negative values, but the ordering on either side of the x-axis will match.
If you want to stack in the opposite order, try forcats::fct_rev():
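For example, reversing the levels of the fill variable reverses the stacking order (a minimal sketch reusing avg_price from above; the forcats call is illustrative):

library(forcats)

ggplot(avg_price) +
  geom_col(aes(x = cut, y = price, fill = fct_rev(color)))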
Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one: the new sparklyr package. sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include:
- Interactively manipulate Spark data using both dplyr and SQL (via DBI).
- Filter and aggregate Spark datasets then bring them into R for analysis and visualization.
- Orchestrate distributed machine learning from R using either Spark MLlib or H2O Sparkling Water.
- Create extensions that call the full Spark API and provide interfaces to Spark packages.
- Integrated support for establishing Spark connections and browsing Spark data frames within the RStudio IDE.
We’re also excited to be working with several industry partners. IBM is incorporating sparklyr into their Data Science Experience, Cloudera is working with us to ensure that sparklyr meets the requirements of their enterprise customers, and H2O has provided an integration between sparklyr and H2O Sparkling Water.
You can install sparklyr from CRAN as follows:
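install.packages("sparklyr")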
You should also install a local version of Spark for development purposes:
library(sparklyr)
spark_install(version = "1.6.2")
If you use the RStudio IDE, you should also download the latest preview release of the IDE which includes several enhancements for interacting with Spark.
Extensive documentation and examples are available at http://spark.rstudio.com.
Connecting to Spark
You can connect to both local instances of Spark as well as remote Spark clusters. Here we’ll connect to a local instance of Spark:
library(sparklyr)
sc <- spark_connect(master = "local")
The returned Spark connection (sc) provides a remote dplyr data source to the Spark cluster.
You can copy R data frames into Spark using the dplyr copy_to function (more typically though you’ll read data within the Spark cluster using the spark_read family of functions). For the examples below we’ll copy some datasets from R into Spark (note that you may need to install the nycflights13 and Lahman packages in order to execute this code):
library(dplyr)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
We can now use all of the available dplyr verbs against the tables within the cluster. Here’s a simple filtering example:
# filter by departure delay
flights_tbl %>% filter(dep_delay == 2)
Introduction to dplyr provides additional dplyr examples you can try. For example, consider the last example from the tutorial which plots data on flight delays:
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

# plot delays
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
dplyr window functions are also supported, for example:
batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)
For additional documentation on using dplyr with Spark see the dplyr section of the sparklyr website.
It’s also possible to execute SQL queries directly against tables within a Spark cluster. The spark_connection object implements a DBI interface for Spark, so you can use dbGetQuery() to execute SQL and return the result as an R data frame:
library(DBI)
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
You can orchestrate machine learning algorithms in a Spark cluster via either Spark MLlib or via the H2O Sparkling Water extension package. Both provide a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.
In this example we’ll use ml_linear_regression to fit a linear regression model. We’ll use the built-in mtcars dataset, and see if we can predict a car’s fuel consumption (mpg) based on its weight (wt) and the number of cylinders the engine contains (cyl). We’ll assume in each case that the relationship between mpg and each of our features is linear.
# copy mtcars into spark
mtcars_tbl <- copy_to(sc, mtcars)

# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

# fit a linear model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit, and the statistical significance of each of our predictors.
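For example:

summary(fit)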
Spark machine learning supports a wide array of algorithms and feature transformations, and as illustrated above it’s easy to chain these functions together with dplyr pipelines. To learn more see the Spark MLlib section of the sparklyr website.
H2O Sparkling Water
Let’s walk through the same mtcars example, but in this case use H2O’s machine learning algorithms via the H2O Sparkling Water extension. The dplyr code used to prepare the data is the same, but after partitioning into test and training data we call h2o.glm rather than ml_linear_regression:
# convert to h2o_frame (uses the same underlying rdd)
training <- as_h2o_frame(partitions$training)
test <- as_h2o_frame(partitions$test)

# fit a linear model to the training dataset
fit <- h2o.glm(x = c("wt", "cyl"),
               y = "mpg",
               training_frame = training,
               lambda_search = TRUE)

# inspect the model
print(fit)
For linear regression models produced by H2O, we can use either print() or summary() to learn a bit more about the quality of our fit. The summary() method returns some extra information about scoring history and variable importance.
To learn more see the H2O Sparkling Water section of the sparklyr website.
The facilities used internally by sparklyr for its dplyr and machine learning interfaces are available to extension packages. Since Spark is a general purpose cluster computing system there are many potential applications for extensions (e.g. interfaces to custom machine learning pipelines, interfaces to 3rd party Spark packages, etc.).
We’re excited to see what other sparklyr extensions the R community creates. To learn more see the Extensions section of the sparklyr website.
The latest RStudio Preview Release of the RStudio IDE includes integrated support for Spark and the sparklyr package, including tools for:
- Creating and managing Spark connections
- Browsing the tables and columns of Spark DataFrames
- Previewing the first 1,000 rows of Spark DataFrames
Once you’ve installed the sparklyr package, you should find a new Spark pane within the IDE. This pane includes a New Connection dialog which can be used to make connections to local or remote Spark instances:
Once you’ve connected to Spark you’ll be able to browse the tables contained within the Spark cluster:
The Spark DataFrame preview uses the standard RStudio data viewer:
The RStudio IDE features for sparklyr are available now as part of the RStudio Preview Release. The final version of RStudio IDE that includes integrated support for sparklyr will ship within the next few weeks.
We’re very pleased to be joined in this announcement by IBM, Cloudera, and H2O, who are working with us to ensure that sparklyr meets the requirements of enterprise customers and is easy to integrate with current and future deployments of Spark.
“With our latest contributions to Apache Spark and the release of sparklyr, we continue to emphasize R as a primary data science language within the Spark community. Additionally, we are making plans to include sparklyr in Data Science Experience to provide the tools data scientists are comfortable with to help them bring business-changing insights to their companies faster,” said Ritika Gunnar, vice president of Offering Management, IBM Analytics.
“At Cloudera, data science is one of the most popular use cases we see for Apache Spark as a core part of the Apache Hadoop ecosystem, yet the lack of a compelling R experience has limited data scientists’ access to available data and compute,” said Charles Zedlewski, vice president, Products at Cloudera. “We are excited to partner with RStudio to help bring sparklyr to the enterprise, so that data scientists and IT teams alike can get more value from their existing skills and infrastructure, all with the security, governance, and management our customers expect.”
“At H2O.ai, we’ve been focused on bringing the best of breed open source machine learning to data scientists working in R & Python. However, the lack of robust tooling in the R ecosystem for interfacing with Apache Spark has made it difficult for the R community to take advantage of the distributed data processing capabilities of Apache Spark.
We’re excited to work with RStudio to bring the ease of use of dplyr and the distributed machine learning algorithms from H2O’s Sparkling Water to the R community via the sparklyr & rsparkling packages.”
We’ve just released Shiny Server and Shiny Server Pro 1.4.6. Relative to 1.4.2, our previously blogged-about version, the 1.4.6 release primarily includes bug fixes, and mitigations for low-severity security issues found by penetration testing. The full list of changes is after the jump.
If you’re running a Shiny Server Pro release that is older than 1.4.3 and are configured to use SSL/TLS, it’s especially important that you upgrade, as the versions of Node.js that are bundled with Shiny Server Pro 1.4.3 and earlier include vulnerable versions of OpenSSL.
Shiny Server (Open Source): Download now
The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.
The best place to learn about all the packages in the tidyverse and how they fit together is R for Data Science. Expect to hear more about the tidyverse in the coming months as I work on improved package websites, making citation easier, and providing a common home for discussions about data analysis with the tidyverse.
You can install tidyverse with:
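install.packages("tidyverse")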
This will install the core tidyverse packages that you are likely to use in almost every analysis:
- ggplot2, for data visualisation.
- dplyr, for data manipulation.
- tidyr, for data tidying.
- readr, for data import.
- purrr, for functional programming.
- tibble, for tibbles, a modern re-imagining of data frames.
It also installs a selection of other tidyverse packages that you’re likely to use frequently, but probably not in every analysis. This includes packages for data manipulation:
- DBI, for databases.
- haven, for SPSS, SAS and Stata files.
- httr, for web APIs.
- jsonlite for JSON.
- readxl, for .xls and .xlsx files.
- rvest, for web scraping.
- xml2, for XML.
These packages will be installed along with tidyverse, but you’ll load them explicitly with library().
library(tidyverse) will load the core tidyverse packages: ggplot2, tibble, tidyr, readr, purrr, and dplyr. You also get a condensed summary of conflicts with other packages you have loaded:
library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ---------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
You can see conflicts created later with tidyverse_conflicts():
library(MASS)
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#>     select

tidyverse_conflicts()
#> Conflicts with tidy packages --------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats
#> select(): dplyr, MASS
And you can check that all tidyverse packages are up-to-date with tidyverse_update():
tidyverse_update()
#> The following packages are out of date:
#> * broom (0.4.0 -> 0.4.1)
#> * DBI   (0.4.1 -> 0.5)
#> * Rcpp  (0.12.6 -> 0.12.7)
#> Update now?
#>
#> 1: Yes
#> 2: No
I am pleased to announce lubridate 1.6.0. Lubridate is designed to make working with dates and times as pleasant as possible, and is maintained by Vitalie Spinu. You can install the latest version with:
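install.packages("lubridate")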
This release includes a range of bug fixes and minor improvements. Some highlights from this release include:
- The period() and duration() constructors now accept character strings and allow a very flexible specification of timespans:
period("3H 2M 1S") #>  "3H 2M 1S" duration("3 hours, 2 mins, 1 secs") #>  "10921s (~3.03 hours)" # Missing numerals default to 1. # Repeated units are summed period("hour minute minute") #>  "1H 2M 0S"
- Period and duration parsing allows for arbitrary abbreviations of time units as long as the specification is unambiguous. For single letter specs, strptime() rules are followed, so m stands for months and M for minutes. These same rules allow you to compare strings and durations/periods:
"2mins 1 sec" > period("2mins") #>  TRUE
- Date-time rounding (with round_date(), floor_date() and ceiling_date()) now supports unit multipliers, like “3 days” or “2 months”:
ceiling_date(ymd_hms("2016-09-12 17:10:00"), unit = "5 minutes")
#> "2016-09-12 17:10:00 UTC"
- The behavior of ceiling_date() for Date objects is now more intuitive. In short, dates are now interpreted as time intervals that are physically part of longer unit intervals:
|day1| ... |day31|day1| ... |day28| ... | January | February | ...
That means that rounding up 2000-01-01 by a month is done to the boundary between January and February:
ceiling_date(ymd("2000-01-01"), unit = "month")
#> "2000-02-01"
This behavior is controlled by the change_on_boundary argument.
- It is now possible to compare POSIXct and Date objects:
ymd_hms("2000-01-01 00:00:01") > ymd("2000-01-01") #>  TRUE
- C-level parsing now handles English months and AM/PM indicator regardless of your locale. This means that English date-times are now always handled by lubridate C-level parsing and you don’t need to explicitly switch the locale.
- The new parsing function yq() allows you to parse a year + quarter:
yq("2016-02") #>  "2016-04-01"
The q format is available in all lubridate parsing functions.
A new Shiny release is upon us! There are many new exciting features, bug fixes, and library updates. We’ll just highlight the most important changes here, but you can browse through the full changelog for details. This will likely be the last release before shiny 1.0, so get out your party hats!
To install it, you can just run:
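install.packages("shiny")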
Shiny now supports bookmarkable state: users can save the state of an application and get a URL which will restore the application with that state. There are two types of bookmarking: encoding the state in a URL, and saving the state to the server. With an encoded state, the entire state of the application is contained in the URL’s query string. You can see this in action with this app: https://gallery.shinyapps.io/113-bookmarking-url/. An example of a bookmark URL for this app is https://gallery.shinyapps.io/113-bookmarking-url/?inputs&n=200. When the state is saved to the server, the URL might look something like: https://gallery.shinyapps.io/bookmark-saved/?state_id=d80625dc681e913a (note that this URL is not for an active app).
Important note: Saved-to-server bookmarking currently works with Shiny Server Open Source. Support on Shiny Server Pro, RStudio Connect, and shinyapps.io is under development and testing. However, URL-encoded bookmarking works on all hosting platforms.
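As a rough sketch of how URL-encoded bookmarking is wired up (the slider input here is just a placeholder), the UI is written as a function of request, a bookmarkButton() is added, and enableBookmarking = "url" is passed to shinyApp():

library(shiny)

ui <- function(request) {
  fluidPage(
    sliderInput("n", "Number of points", min = 1, max = 500, value = 200),
    bookmarkButton()
  )
}

server <- function(input, output, session) {}

shinyApp(ui, server, enableBookmarking = "url")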
Shiny can now display notifications on the client browser by using the showNotification() function. Use this demo app to play around with the notification API. For more, see our article about notifications.
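For instance, a minimal sketch from server code (the input$save button is hypothetical):

# inside a server function: notify the user after an action completes
observeEvent(input$save, {
  showNotification("Data saved.", duration = 5, type = "message")
})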
If your Shiny app contains computations that take a long time to complete, a progress bar can improve the user experience by communicating how far along the computation is, and how much is left. Progress bars were added in Shiny 0.10.2. In Shiny 0.14, we’ve changed them to use the notifications system, which gives them a different look.
Important note: If you were already using progress bars and had customized them with your own CSS, you can add the style = "old" argument to your withProgress() call (or Progress$new()). This will result in the same appearance as before. You can also call shinyOptions(progress.style = "old") in your app’s server function to make all progress indicators use the old styling.
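For instance, a sketch of opting into the old styling for a single call (the computation is just a placeholder):

# inside a server function
withProgress(
  {
    Sys.sleep(2)   # placeholder for a long-running computation
    incProgress(1)
  },
  message = "Computing...",
  style = "old"
)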
Shiny now has built-in support for displaying modal dialogs like the one below (live app here):
To learn more about modal dialogs in Shiny, read the article about them.
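A minimal sketch from server code (the input$show trigger is hypothetical):

observeEvent(input$show, {
  showModal(modalDialog(
    title = "Important message",
    "This modal can be dismissed by clicking outside of it.",
    easyClose = TRUE
  ))
})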
Sometimes in a Shiny app, arbitrary HTML UI may need to be created on-the-fly in response to user input. The existing uiOutput() and renderUI() functions let you continue using reactive logic to call UI functions and make the results appear in a predetermined place in the UI. The new insertUI() and removeUI() functions, which are used in the server code, allow you to use imperative logic to add and remove arbitrary chunks of HTML (all independent from one another), as many times as you want, whenever you want, wherever you want. This option may be more convenient when you want to, for example, add a new model to your app each time the user selects a different option (and leave previous models unchanged, rather than substitute the previous one for the latest one).
See this simple demo app of how one could use insertUI and removeUI to insert and remove text elements using a queue. Also see this other app that demonstrates how to insert and remove a few common Shiny input objects. Finally, this app shows how to dynamically insert modules using insertUI().
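A minimal sketch of the pattern (the button inputs and the #placeholder container are hypothetical names):

# UI: include an empty container somewhere in the page, e.g. div(id = "placeholder")

# server: insert a new paragraph each time the add button is clicked,
# and remove the last one when the remove button is clicked
observeEvent(input$add, {
  insertUI(
    selector = "#placeholder",
    where = "beforeEnd",
    ui = p(paste("New element", input$add))
  )
})

observeEvent(input$remove, {
  removeUI(selector = "#placeholder p:last-child")
})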
Documentation for connecting to an external database
Many Shiny users have asked about best practices for accessing external databases from their Shiny applications. Although database access has long been possible using various database connector packages in R, it can be challenging to use them robustly in the dynamic environment that Shiny provides. So far, it has been mostly up to application authors to find the appropriate database drivers and to discover how to manage the database connections within an application. In order to demystify this process, we wrote a series of articles (first one here) that covers the basics of connecting to an external database, as well as some security precautions to keep in mind (e.g. how to avoid SQL injection attacks).
There are a few packages that you should look at if you’re using a relational database in a Shiny app: the dplyr and DBI packages (both featured in the article linked to above), and the brand new pool package, which provides a further layer of abstraction to make it easier and safer to use either dplyr or DBI (pool is not yet on CRAN). In particular, pool will take care of managing connections, preventing memory leaks, and ensuring the best performance. See this pool basics article and the more advanced-level article if you’re feeling adventurous! (Both of these articles contain Shiny app examples that use DBI to connect to an external MySQL database.) If you are more comfortable with dplyr, don’t miss the article about the integration of pool with dplyr.
If you’re new to databases in the Shiny world, we recommend using dplyr and pool if possible. If you need greater control than dplyr offers (for example, if you need to modify data in the database or use transactions), then use DBI and pool. The pool package was introduced to make your life easier, but in no way constrains you, so we don’t envision any situation in which you’d be better off not using it. The only caveat is that pool is not yet on CRAN, so you may prefer to wait for that.
There are many more minor features, small improvements, and bug fixes than we can cover here, so we’ll just mention a few of the more noteworthy ones. (For more, you can see the full changelog.)
- Error Sanitization: you now have the option to sanitize error messages; in other words, the content of the original error message can be suppressed so that it doesn’t leak any sensitive information. To sanitize errors everywhere in your app, just add options(shiny.sanitize.errors = TRUE) somewhere in your app. Read this article for more, or play with the demo app.
- Code Diagnostics: if there is an error parsing global.R, Shiny will search the code for missing commas, extra commas, and unmatched braces, parens, and brackets, and will print out messages pointing out those problems. (#1126)
- Reactlog visualization: by default, the showReactLog() function (which brings up the reactive graph) also displays the time that each reactive and observer were active for:
Additionally, to organize the graph, you can now drag any of the nodes to a specific position and leave it there.
- Nicer-looking tables: we’ve made tables generated with renderTable() look cleaner and more modern. While this won’t break any older code, the finished look of your table will be quite a bit different, as the following image shows: