You are currently browsing the category archive for the ‘Packages’ category.

I’m pleased to announce that readxl 1.0.0 is available on CRAN. readxl makes it easy to bring tabular data out of Excel and into R, for modern .xlsx files and the legacy .xls format. readxl does not have any tricky external dependencies, such as Java or Perl, and is easy to install and use on Mac, Windows, and Linux.

You can install it with:

install.packages("readxl")

As well as fixing many bugs, this release:

  • Allows you to target specific cells for reading, in a variety of ways
  • Adds two new column types: "logical" and "list", for data of disparate type
  • Is more resilient to the wondrous diversity in spreadsheets, e.g., those written by 3rd party tools

You can see a full list of changes in the release notes. This is the first release maintained by Jenny Bryan.

Specifying the data rectangle

In an ideal world, data would live in a neat rectangle in the upper left corner of a spreadsheet. But spreadsheets often serve multiple purposes for users with different priorities. It is common to encounter several rows of notes above or below the data, for example. The new range argument provides a flexible interface for describing the data rectangle, including Excel-style ranges and row- or column-only ranges.

library(readxl)
read_excel(
  readxl_example("deaths.xlsx"),
  range = "arts!A5:F15"
)
#> # A tibble: 10 × 6
#>            Name Profession   Age `Has kids` `Date of birth`
#>                                  
#> 1   David Bowie   musician    69       TRUE      1947-01-08
#> 2 Carrie Fisher      actor    60       TRUE      1956-10-21
#> 3   Chuck Berry   musician    90       TRUE      1926-10-18
#> 4   Bill Paxton      actor    61       TRUE      1955-05-17
#> # ... with 6 more rows, and 1 more variables: `Date of death` 

read_excel(
  readxl_example("deaths.xlsx"),
  sheet = "other",
  range = cell_rows(5:15)
)
#> # A tibble: 10 × 6
#>           Name Profession   Age `Has kids` `Date of birth`
#>                                           
#> 1   Vera Rubin  scientist    88       TRUE      1928-07-23
#> 2  Mohamed Ali    athlete    74       TRUE      1942-01-17
#> 3 Morley Safer journalist    84       TRUE      1931-11-08
#> 4 Fidel Castro politician    90       TRUE      1926-08-13
#> # ... with 6 more rows, and 1 more variables: `Date of death`

There is also a new argument n_max that limits the number of data rows read from the sheet. It is an example of readxl’s evolution towards a readr-like interface. The Sheet Geometry vignette goes over all the options.

Column typing

The new ability to target cells for reading means that readxl’s automatic column typing will “just work” for most sheets, most of the time. Above, the Has kids column is automatically detected as logical, which is a new column type for readxl.

You can still specify column type explicitly via col_types, which gets a couple new features. If you provide exactly one type, it is recycled to the necessary length. The new type "guess" can be mixed with explicit types to specify some types, while leaving others to be guessed.

read_excel(
  readxl_example("deaths.xlsx"),
  range = "arts!A5:C15",
  col_types = c("guess", "skip", "numeric")
)
#> # A tibble: 10 × 2
#>            Name   Age
#>            
#> 1   David Bowie    69
#> 2 Carrie Fisher    60
#> 3   Chuck Berry    90
#> 4   Bill Paxton    61
#> # ... with 6 more rows

The new argument guess_max limits the rows used for type guessing. Leading and trailing whitespace is trimmed when the new trim_ws argument is TRUE, which is the default. Finally, thanks to Jonathan Marshall, multiple na values are accepted. The Cell and Column Types vignette has more detail.

"list" columns

Thanks to Greg Freedman Ellis we now have a "list" column type. This is useful if you want to bring truly disparate data into R without the coercion required by atomic vector types.

(df <- read_excel(
  readxl_example("clippy.xlsx"),
  col_types = c("text", "list")
))
#> # A tibble: 4 × 2
#>                   name      value
#>                  <chr>     <list>
#> 1                 Name  <chr [1]>
#> 2              Species  <chr [1]>
#> 3 Approx date of death <dttm [1]>
#> 4      Weight in grams  <dbl [1]>

tibble::deframe(df)
#> $Name
#> [1] "Clippy"
#> 
#> $Species
#> [1] "paperclip"
#> 
#> $`Approx date of death`
#> [1] "2007-01-01 UTC"
#> 
#> $`Weight in grams`
#> [1] 0.9

Everything else

To learn more, read the vignettes and articles or release notes. Highlights include:

  • General rationalization of sheet geometry, including detection and treatment of empty rows and columns.
  • Improved behavior and messaging around coercion and mismatched cell and column types.
  • Improved handling of datetimes with respect to 3rd party software, rounding, and the Lotus 1-2-3 leap year bug.
  • read_xls() and read_xlsx() are now exposed, so that files without an .xls or .xlsx extension can be read. Thanks Jirka Lewandowski!
  • readxl Workflows showcases patterns that reduce tedium and increase reproducibility when raw data arrives in a spreadsheet.

I’m planning to submit dplyr 0.6.0 to CRAN on May 11 (in four weeks time). In preparation, I’d like to announce that the release candidate, dplyr 0.5.0.9002 is now available. I would really appreciate it if you’d try it out and report any problems. This will ensure that the official release has as few bugs as possible.

Installation

Install the pre-release version with:

# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")

If you discover any problems, please file a minimal reprex on GitHub. You can roll back to the released version with:

install.packages("dplyr")

Features

dplyr 0.6.0 is a major release including over 100 bug fixes and improvements. There are three big changes that I want to touch on here:

  • Databases
  • Improved encoding support (particularly for CJK on windows)
  • Tidyeval, a new framework for programming with dplyr

You can see a complete list of changes in the draft release notes.

Databases

Almost all database related code has been moved out of dplyr and into a new package, dbplyr. This makes dplyr simpler, and will make it easier to release fixes for bugs that only affect databases.

To install the development version of dbplyr so you can try it out, run:

devtools::install_github("hadley/dbplyr")

There’s one major change, as well as a whole heap of bug fixes and minor improvements. It is now no longer necessary to create a remote “src”. Instead you can work directly with the database connection returned by DBI, reflecting the robustness of the DBI ecosystem. Thanks largely to the work of Kirill Muller (funded by the R Consortium) DBI backends are now much more consistent, comprehensive, and easier to use. That means that there’s no longer a need for a layer between you and DBI.

You can continue to use src_mysql(), src_postgres(), and src_sqlite() (which still live in dplyr), but I recommend a new style that makes the connection to DBI more clear:

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "iris", iris)
#> [1] TRUE

iris2 <- tbl(con, "iris")
iris2
#> Source:     table<iris> [?? x 5]
#> Database:   sqlite 3.11.1 [:memory:]
#> 
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl>   <chr>
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> # ... with more rows

This is particularly useful if you want to perform non-SELECT queries as you can do whatever you want with DBI::dbGetQuery() and DBI::dbExecute().

If you’ve implemented a database backend for dplyr, please read the backend news to see what’s changed from your perspective (not much). If you want to ensure your package works with both the current and previous version of dplyr, see wrap_dbplyr_obj() for helpers.

Character encoding

We have done a lot of work to ensure that dplyr works with encodings other that Latin1 on Windows. This is most likely to affect you if you work with data that contains Chinese, Japanese, or Korean (CJK) characters. dplyr should now just work with such data.

Tidyeval

dplyr has a new approach to non-standard evaluation (NSE) called tidyeval. Tidyeval is described in detail in a new vignette about programming with dplyr but, in brief, it gives you the ability to interpolate values in contexts where dplyr usually works with expressions:

my_var <- quo(homeworld)

starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)
#> # A tibble: 49 × 3
#>         homeworld   height  mass
#>             <chr>    <dbl> <dbl>
#> 1        Alderaan 176.3333  64.0
#> 2     Aleen Minor  79.0000  15.0
#> 3          Bespin 175.0000  79.0
#> 4      Bestine IV 180.0000 110.0
#> 5  Cato Neimoidia 191.0000  90.0
#> 6           Cerea 198.0000  82.0
#> 7        Champala 196.0000   NaN
#> 8       Chandrila 150.0000   NaN
#> 9    Concord Dawn 183.0000  79.0
#> 10       Corellia 175.0000  78.5
#> # ... with 39 more rows

This will make it much easier to eliminate copy-and-pasted dplyr code by extracting repeated code into a function.

This also means that the underscored version of each main verb (filter_(), select_() etc). is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility).

Over the couple of months there have been a bunch of smaller releases to packages in the tidyverse. This includes:

  • forcats 0.2.0, for working with factors.
  • readr 1.1.0, for reading flat-files from disk.
  • stringr 1.2.0, for manipulating strings.
  • tibble 1.3.0, a modern re-imagining of the data frame.

This blog post summarises the most important new features, and points to the full release notes where you can learn more.

(If you’ve never heard of the tidyverse before, it’s an set of packages that are designed to work together to help you do data science. The best place to learn all about it is R for Data Science.)

forcats 0.2.0

forcats has three new functions:

  • as_factor() is a generic version of as.factor(), which creates factors from character vectors ordered by appearance, rather than alphabetically. This ensures means that as_factor(x) will always return the same result, regardless of the current locale.
  • fct_other() makes it easier to convert selected levels to “other”:
    x <- factor(rep(LETTERS[1:6], times = c(10, 5, 1, 1, 1, 1)))
    
    x %>% 
      fct_other(keep = c("A", "B")) %>% 
      fct_count()
    #> # A tibble: 3 × 2
    #>        f     n
    #>    
    #> 1      A    10
    #> 2      B     5
    #> 3  Other     4
    
    x %>% 
      fct_other(drop = c("A", "B")) %>% 
      fct_count()
    #> # A tibble: 5 × 2
    #>        f     n
    #>    
    #> 1      C     1
    #> 2      D     1
    #> 3      E     1
    #> 4      F     1
    #> 5  Other    15
  • fct_relabel() allows programmatic relabeling of levels:
    x <- factor(letters[1:3])
    x
    #> [1] a b c
    #> Levels: a b c
    
    x %>% fct_relabel(function(x) paste0("-", x, "-"))
    #> [1] -a- -b- -c-
    #> Levels: -a- -b- -c-

See the full list of other changes in the release notes.

stringr 1.2.0

This release includes a change to the API: str_match_all() now returns NA if an optional group doesn’t match (previously it returned “”). This is more consistent with str_match() and other match failures.

x <- c("a=1,b=2", "c=3", "d=")

x %>% str_match("(.)=(\\d)?")
#>      [,1]  [,2] [,3]
#> [1,] "a=1" "a"  "1" 
#> [2,] "c=3" "c"  "3" 
#> [3,] "d="  "d"  NA
x %>% str_match_all("(.)=(\\d)?,?")
#> [[1]]
#>      [,1]   [,2] [,3]
#> [1,] "a=1," "a"  "1" 
#> [2,] "b=2"  "b"  "2" 
#> 
#> [[2]]
#>      [,1]  [,2] [,3]
#> [1,] "c=3" "c"  "3" 
#> 
#> [[3]]
#>      [,1] [,2] [,3]
#> [1,] "d=" "d"  NA

There are three new features:

  • In str_replace(), replacement can now be a function. The function is once for each match and its return value will be used as the replacement.
    redact <- function(x) {
      str_dup("-", str_length(x))
    }
    
    x <- c("It cost $500", "We spent $1,200 on stickers")
    x %>% str_replace_all("\\$[0-9,]+", redact)
    #> [1] "It cost ----"                "We spent ------ on stickers"
  • New str_which() mimics grep():
    fruit <- c("apple", "banana", "pear", "pinapple")
    
    # Matching positions    
    str_which(fruit, "p")
    #> [1] 1 3 4
    
    # Matching values
    str_subset(fruit, "p")
    #> [1] "apple"    "pear"     "pinapple"
  • A new vignette (vignette("regular-expressions")) describes the details of the regular expressions supported by stringr. The main vignette (vignette("stringr")) has been updated to give a high-level overview of the package.

See the full list of other changes in the release notes.

readr 1.1.0

readr gains two new features:

  • All write_*() functions now support connections. This means that that you can write directly to compressed formats such as .gz, bz2 or .xz (and readr will automatically do so if you use one of those suffixes).
    write_csv(iris, "iris.csv.bz2")
  • parse_factor(levels = NULL) and col_factor(levels = NULL) will produce a factor column based on the levels in the data, mimicing factor parsing in base R (with the exception that levels are created in the order seen).
    iris2 <- read_csv("iris.csv.bz2", col_types = cols(
      Species = col_factor(levels = NULL)
    ))

See the full list of other changes in the release notes.

tibble 1.3.0

tibble has one handy new function: deframe() is the opposite of enframe(): it turns a two-column data frame into a named vector.

df <- tibble(x = c("a", "b", "c"), y = 1:3)
deframe(df)
#> a b c 
#> 1 2 3

See the full list of other changes in the release notes.

Shiny 1.0.1 is now available on CRAN! This release primarily includes bug fixes and minor new features.

The most notable additions in this version of Shiny are the introduction of the reactiveVal() function (like reactiveValues(), but it only stores a single value), and that the choices of radioButtons() and checkboxGroupInput() can now contain HTML content instead of just plain text. We’ve also added compatibility for the development version of ggplot2.

Breaking changes

We unintentionally introduced a minor breaking change in that checkboxGroupInput used to accept choices = NULL to create an empty input. With Shiny 1.0.1, this throws an error; using choices = character(0) works. We intend to eliminate this breakage in Shiny 1.0.2.

Update (4/20/2017): This has now been fixed in Shiny 1.0.2, currently available on CRAN.

Also, the selected argument for radioButtons, checkboxGroupInput, and selectInput once upon a time accepted the name of a choice, instead of the value of a choice; this behavior has been deprecated with a warning for several years now, and in Shiny 1.0.1 it is no longer supported at all.

Storing single reactive values with reactiveVal

The reactiveValues object has been a part of Shiny since the earliest betas. It acts like a reactive version of an environment or named list, in that you can store and retrieve values using names:

rv <- reactiveValues(clicks = 0)

observeEvent(input$button, {
 currentValue <- rv$clicks
 rv$clicks <- currentValue + 1
})

If you only have a single value to store, though, it’s a little awkward that you have to use a data structure designed for multiple values.

With the new reactiveVal function, you can now create a reactive object for a single variable:

clicks <- reactiveVal(0)

observeEvent(input$button, {
 currentValue <- clicks()
 clicks(currentValue + 1)
})

As you can see in this example, you can read the value by calling it like a function with no arguments; and you set the value by calling it with one argument.

This has the added benefit that you can easily pass the clicks object to another function or module (no need to wrap it in a reactive()).

More flexible radioButtons and checkboxGroupInput

It’s now possible to create radio button and checkbox inputs with arbitrary HTML as labels. To do so, however, you need to pass different arguments to the functions. Now, when creating (or updating) either of radioButtons() or checkboxGroupInput(), you can specify the options in one of two (mutually exclusive) ways:

  • What we’ve always had:
    Use the choices argument, which must be a vector or list. The names of each element are displayed in the app UI as labels (i.e. what the user sees in your app), and the values are used for computation (i.e. the value is what’s returned by input$rd, where rd is a radioButtons() input). If the vector (or list) is unnamed, the values provided are used for both the UI labels and the server values.
  • What’s new and allows HTML:
    Use both the choiceNames and the choiceValues arguments, each of which must be an unnamed vector or list (and both must have the same length). The elements in choiceValues must still be plain text (these are the values used for computation). But the elements in choiceNames (the UI labels) can be constructed out of HTML, either using the HTML() function, or an HTML tag generation function, like tags$img() and icon().

Here’s an example app that demos the new functionality (in this case, we have a checkboxGroupInput() whose labels include the flag of the country they correspond to):

ggplot2 > 2.2.1 compatibility

The development version of ggplot2 has some changes that break compatibility with earlier versions of Shiny. The fixes in Shiny 1.0.1 will allow it to work with any version of ggplot2.

A note on Shiny v1.0.0

In January of this year, we quietly released Shiny 1.0.0 to CRAN. A lot of work went into that release, but other than minor bug fixes and features, it was mostly laying the foundation for some important features that will arrive in the coming months. So if you’re wondering if you missed the blog post for Shiny 1.0.0, you didn’t.

Full changes

As always, you can view the full changelog for Shiny 1.0.1 (and 1.0.0!) in our NEWS.md file.

Leaflet 1.1.0 is now available on CRAN! The Leaflet package is a tidy wrapper for the Leaflet.js mapping library, and makes it incredibly easy to generate interactive maps based on spatial data you have in R.

leaflet-choro

This release was nearly a year in the making, and includes many important new features.

  • Easily add textual labels on markers, polygons, etc., either on hover or statically
  • Highlight polygons, lines, circles, and rectangles on hover
  • Markers can now be configured with a variety of colors and icons, via integration with Leaflet.awesome-markers
  • Built-in support for many types of objects from sf, a new way of representing spatial data in R (all basic sf/sfc/sfg types except MULTIPOINT and GEOMETRYCOLLECTION are directly supported)
  • Projections other than Web Mercator are now supported via Proj4Leaflet
  • Color palette functions now natively support viridis palettes; use "viridis", "magma", "inferno", or "plasma" as the palette argument
  • Discrete color palette functions (colorBin, colorQuantile, and colorFactor) work much better with color brewer palettes
  • Integration with several Leaflet.js utility plugins
  • Data with NA points or zero rows no longer causes errors
  • Support for linked brushing and filtering, via Crosstalk (more about this to come in another blog post)

Many thanks to @bhaskarvk who contributed much of the code for this release.

Going forward, our intention is to prevent any more Leaflet.js plugins from accreting in the core leaflet package. Instead, we have made it possible to write 3rd party R packages that extend leaflet (though the process to do this is not documented yet). In the meantime, Bhaskar has started developing his own leaflet.extras package; it already supports several plugins, for everything from animated markers to heatmaps.

If big data is your thing, you use R, and you’re headed to Strata + Hadoop World in San Jose March 13 & 14th, you can experience in person how easy and practical it is to analyze big data with R and Spark.

In a beginner level talk by RStudio’s Edgar Ruiz and an intermediate level  workshop by Win-Vector’s John Mount, we cover the spectrum: What R is, what Spark is, how Sparklyr works, and what is required to set up and tune a Spark cluster. You’ll also learn practical applications including: how to quickly set up a local Spark instance, store big data in Spark and then connect to the data with R, use R to apply machine-learning algorithms to big data stored in Spark, and filter and aggregate big data stored in Spark and then import the results into R for analysis and visualization.

2:40pm–3:20pm Wednesday, March 15, 2017
Sparklyr: An R interface for Apache Spark
Edgar Ruiz (RStudio)
Primary topic: Spark & beyond
Location: LL21 C/D
Level: Beginner
Secondary topics: R

1:30pm–5:00pm Tuesday, March 14, 2017
Modeling big data with R, sparklyr, and Apache Spark
John Mount (Win-Vector LLC)
Primary topic: Data science & advanced analytics
Location: LL21 C/D
Level: Intermediate
Secondary topics: R

While you’re  at the conference be sure to look us up in the Innovator’s Pavilion – booth number P8 during the Expo Hall hours. We’ll have the latest books from RStudio authors, t-shirts to win, demonstrations of RStudio Connect and RStudio Server Pro and, of course, stickers and cheatsheets. Share with us what you’re doing with RStudio and get your product and company questions answered by RStudio employees.

See you in San Jose! (https://conferences.oreilly.com/strata/strata-ca)

roxygen2 6.0.0 is now available on CRAN. roxygen2 helps you document your packages by turning specially formatted inline comments into R’s standard Rd format. It automates everything that can be automated, and provides helpers for sharing documentation between topics. Learn more at http://r-pkgs.had.co.nz/man.html. Install the latest version with:

install.packages("roxygen2")

There are two headline features in this version of roxygen2:

  • Markdown support.
  • Improved documentation inheritance.

These are described in detail below.

This release also included many minor improvements and bug fixes. For a full list of changes, please see release notes. A big thanks to all the contributors to this release: @dlebauer, @fmichonneau, @gaborcsardi, @HenrikBengtsson, @jefferis, @jeroenooms, @jimhester, @kevinushey, @krlmlr, @LiNk-NY, @lorenzwalthert, @maxheld83, @nteetor, @shrektan, @yutannihilation

Markdown

Thanks to the hard work of Gabor Csardi you can now write roxygen2 comments in markdown. While we have tried to make markdown mode as backward compatible as possible, there are a few cases where you will need to make some minor changes. For this reason, you’ll need to explicitly opt-in to markdown support. There are two ways to do so:

  • Add Roxygen: list(markdown = TRUE) to your DESCRIPTION to turn it on everywhere.
  • Add @md to individual roxygen blocks to enable for selected topics.

roxygen2’s markdown dialect supports inline formatting (bold, italics, code), lists (numbered and bulleted), and a number of helpful link shortcuts:

  • [func()]: links to a function in the current package, and is translated to \code{\link[=func]{func()}.
  • [object]: links to an object in the current package, and is translated to \link{object}.
  • [link text][object]: links to an object with custom text, and is translated to \link[=link text]{object}

Similarly, you can link to functions and objects in other packages with [pkg::func()][pkg::object], and [link text][pkg::object]. For a complete list of syntax, and how to handle common problems, please see vignette("markdown") for more details.

To convert an existing roxygen2 package to use markdown, try https://github.com/r-pkgs/roxygen2md. Happy markdown-ing!

Improved inheritance

Writing documentation is challenging because you want to reduce duplication as much as possible (so you don’t accidentally end up with inconsistent documentation) but you don’t want the user to have to follow a spider’s web of cross-references. This version of roxygen2 provides more support for writing documentation in one place then reusing in multiple topics.

The new @inherit tag allows to you inherit parameters, return, references, title, description, details, sections, and seealso from another topic. @inherit my_fun will inherit everything; @inherit my_fun return params will allow to you inherit specified components. @inherits fun sections will inherit all sections; if you’d like to inherit a single section, you can use @inheritSection fun title. You can also inherit from a topic in another package with @inherit pkg::fun.

Another new tag is @inheritDotParams, which allows you to automatically generate parameter documentation for ... for the common case where you pass ... on to another function. The documentation generated is similar to the style used in ?plot and will eventually be incorporated in to RStudio’s autocomplete. When you pass along ... you often override some arguments, so the tag has a flexible specification:

  • @inheritDotParams foo takes all parameters from foo().
  • @inheritDotParams foo a b e:h takes parameters ab, and all parameters between e and h.
  • @inheritDotParams foo -x -y takes all parameters except for x and y.

All the @inherit tags (including the existing @inheritParams) now work recursively, so you can inherit from a function that inherited from elsewhere.

If you want to generate a basic package documentation page (accessible from package?packagename and ?packagename), you can document the special sentinel value "_PACKAGE". It automatically uses the title, description, authors, url and bug reports fields from the DESCRIPTION. The simplest approach is to do this:

#' @keywords internal
"_PACKAGE"

It only includes what’s already in the DESCRIPTION, but it will typically be easier for R users to access.

Today we are pleased to release version 1.1.1 of xml2. xml2 makes it easy to read, create, and modify XML with R. You can install it with:

install.packages("xml2")

As well as fixing many bugs, this release:

  • Makes it easier to create an modify XML
  • Improves roundtrip support between XML and lists
  • Adds support for XML validation and XSLT transformations.

You can see a full list of changes in the release notes. This is the first release maintained by Jim Hester.

Creating and modifying XML

xml2 has been overhauled with a set of methods to make generating and modfying XML easier:

  • xml_new_root() can be used to create a new document and root node simultaneously.
    xml_new_root("x") %>%
      xml_add_child("y") %>%
      xml_root()
    #> {xml_document}
    #> <x>
    #> [1] <y/>
  • New xml_set_text(), xml_set_name(), xml_set_attr(), and xml_set_attrs() make it easy to modify nodes within a pipeline.
    x <- read_xml("<a>
        <b />
        <c><b/></c>
      </a>")
    x
    #> {xml_document}
    #> <a>
    #> [1] <b/>
    #> [2] <c>\n  <b/>\n</c>
    
    x %>% 
      xml_find_all(".//b") %>% 
      xml_set_name("banana") %>% 
      xml_set_attr("oldname", "b")
    x
    #> {xml_document}
    #> <a>
    #> [1] <banana oldname="b"/>
    #> [2] <c>\n  <banana oldname="b"/>\n</c>
  • New xml_add_parent() makes it easy to insert a node as the parent of an existing node.

  • You can create more esoteric node types with xml_comment() (comments), xml_cdata() (CDATA nodes), and xml_dtd() (DTDs).

Coercion to and from R Lists

xml2 1.1.1 improves support for converting to and from R lists, thanks in part to work by Peter Foley and Jenny Bryan. In particular xml2 now supports preserving the root node name as well as saving all xml2 attributes as R attributes. These changes allows you to convert most XML documents to and from R lists with as_list() and as_xml_document() without loss of data.

x <- read_xml("<fruits><apple color = 'red' /></fruits>")
x
#> {xml_document}
#> <fruits>
#> [1] <apple color="red"/>
as_list(x)
#> $apple
#> list()
#> attr(,"color")
#> [1] "red"
as_xml_document(as_list(x))
#> {xml_document}
#> <apple color="red">

XML validation and xslt

xml2 1.1.1 also adds support for XML validation, thanks to Jeroen Ooms. Simply read the document and schema files and call xml_validate().

doc <- read_xml(system.file("extdata/order-doc.xml", package = "xml2"))
schema <- read_xml(system.file("extdata/order-schema.xml", package = "xml2"))
xml_validate(doc, schema)
#> [1] TRUE
#> attr(,"errors")
#> character(0)

Jeroen also released the first xml2 extension package in conjunction with xml2 1.1.1, xslt. xslt allows one to apply XSLT (Extensible Stylesheet Language Transformations) to XML documents, which are great for transforming XML data into other formats such as HTML.

We’re happy to announce that version 0.5 of the sparklyr package is now available on CRAN. The new version comes with many improvements over the first release, including:

  • Extended dplyr support by implementing: do() and n_distinct().
  • New functions including sdf_quantile(), ft_tokenizer() and ft_regex_tokenizer().
  • Improved compatibility, sparklyr now respects the value of the ‘na.action’ R option and dim(), nrow() and ncol().
  • Experimental support for Livy to enable clients, including RStudio, to connect remotely to Apache Spark.
  • Improved connections by simplifying initialization and providing error diagnostics.
  • Certified sparklyr, RStudio Server Pro and ShinyServer Pro with Cloudera.
  • Updated spark.rstudio.com with new deployment examples and a sparklyr cheatsheet.

Additional changes and improvements can be found in the sparklyr NEWS file.

For questions or feedback, please feel free to open a sparklyr github issue or a sparklyr stackoverflow question.

Extended dplyr support

sparklyr 0.5 adds supports for n_distinct() as a faster and more concise equivalent of length(unique(x)) and also adds support for do() as a convenient way to perform multiple serial computations over a group_by() operation:

library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

by_cyl <- group_by(mtcars_tbl, cyl)
fit_sparklyr <- by_cyl %>% 
   do(mod = ml_linear_regression(mpg ~ disp, data = .))

# display results
fit_sparklyr$mod

In this case, . represents a Spark DataFrame, which allows us to perform operations at scale (like this linear regression) for a small set of groups. However, since each group operation is performed sequentially, it is not recommended to use do() with a large number of groups. The code above performs multiple linear regressions with the following output:

[[1]]
Call: ml_linear_regression(mpg ~ disp, data = .)

Coefficients:
 (Intercept)         disp 
19.081987419  0.003605119 

[[2]]
Call: ml_linear_regression(mpg ~ disp, data = .)

Coefficients:
(Intercept)        disp 
 40.8719553  -0.1351418 

[[3]]
Call: ml_linear_regression(mpg ~ disp, data = .)

Coefficients:
(Intercept)        disp 
22.03279891 -0.01963409 

It’s worth mentioning that while sparklyr provides comprehensive support for dplyr, dplyr is not strictly required while using sparklyr. For instance, one can make use of DBI without dplyr as follows:

library(sparklyr)
library(DBI)

sc <- spark_connect(master = "local")
sdf_copy_to(sc, iris)
dbGetQuery(sc, "SELECT * FROM iris LIMIT 4")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

New functions

The new sdf_quantile() function computes approximate quantiles (to some relative error), while the new ft_tokenizer() and ft_regex_tokenizer() functions split a string by white spaces or regex patterns.

For example, ft_tokenizer() can be used as follows:

library(sparklyr)
library(janeaustenr)
library(dplyr)

sc %>%
  spark_dataframe() %>%
  na.omit() %>%
  ft_tokenizer(input.col = “text”, output.col = “tokens”) %>%
  head(4)

Which produces the following output:

                   text                book     tokens
                  <chr>               <chr>     <list>
1 SENSE AND SENSIBILITY Sense & Sensibility <list [3]>
2                       Sense & Sensibility <list [1]>
3        by Jane Austen Sense & Sensibility <list [3]>
4                       Sense & Sensibility <list [1]>

Tokens can be further processed through, for instance, HashingTF.

Improved compatibility

‘na.action’ is a parameter accepted as part of the ‘ml.options’ argument, which defaults to getOption("na.action", "na.omit"). This allows sparklyr to match the behavior of R while processing NA records, for instance, the following linear model drops NA record appropriately:

library(sparklyr)
library(dplyr)
library(nycflights13)

sc <- spark_connect(master = "local")
flights_clean <- na.omit(copy_to(sc, flights))

ml_linear_regression(
  flights_tbl
  response = "dep_delay",
  features = c("arr_delay", "arr_time"))
* Dropped 9430 rows with 'na.omit' (336776 => 327346)
Call: ml_linear_regression(flights_tbl, response = "dep_delay",
                           features = c("arr_delay", "arr_time"))

Coefficients:
 (Intercept)    arr_delay     arr_time 
6.1001212994 0.8210307947 0.0005284729

In addition, dim(), nrow() and ncol() are now supported against Spark DataFrames.

Livy connections

Livy, “An Open Source REST Service for Apache Spark (Apache License)”, is now available in sparklyr 0.5 as an experimental feature. Among many scenarios, this enables connections from the RStudio desktop to Apache Spark when Livy is available and correctly configured in the remote cluster.

Livy running locally

To work with Livy locally, sparklyr supports livy_install() which installs Livy in your local environment, this is similar to spark_install(). Since Livy is a service to enable remote connections into Apache Spark, the service needs to be started with livy_service_start(). Once the service is running, spark_connect() needs to reference the running service and use method = "Livy", then sparklyr can be used as usual. A short example follows:

livy_install()
livy_service_start()

sc <- spark_connect(master = "http://localhost:8998",
                    method = "livy")
copy_to(sc, iris)

spark_disconnect(sc)
livy_service_stop()

Livy running in HDInsight

Microsoft Azure supports Apache Spark clusters configured with Livy and protected with basic authentication in HDInsight clusters. To use sparklyr with HDInsight clusters through Livy, first create the HDInsight cluster with Spark support:

hdinsight-azureCreating Spark Cluster in Microsoft Azure HDInsight

Once the cluster is created, you can connect with sparklyr as follows:

library(sparklyr)
library(dplyr)

config <- livy_config(user = "admin", password = "password")
sc <- spark_connect(master = "https://dm.azurehdinsight.net/livy/",
                    method = "livy",
                    config = config)

copy_to(sc, iris)

From a desktop running RStudio, the remote connection looks like this:

rstudio-hdinsight-azure.png

Improved connections

sparklyr 0.5 no longer requires internet connectivity to download additional Apache Spark packages. This enables connections in secure clusters that do not have internet access or while on the go.

Some community members reported a generic “Ports file does not exists” error while connecting with sparklyr 0.4. In 0.5, we’ve deprecated the ports file and improved error reporting. For instance, the following invalid connection example throws: a descriptive error, the spark-submit parameters and logging information that helps us troubleshoot connection issues.

> library(sparklyr)
> sc <- spark_connect(master = "local",
                      config = list("sparklyr.gateway.port" = "0"))
Error in force(code) : 
  Failed while connecting to sparklyr to port (0) for sessionid (5305): 
  Gateway in port (0) did not respond.
  Path: /spark-1.6.2-bin-hadoop2.6/bin/spark-submit
  Parameters: --class, sparklyr.Backend, 'sparklyr-1.6-2.10.jar', 0, 5305


---- Output Log ----
16/12/12 12:42:35 INFO sparklyr: Session (5305) starting

---- Error Log ----

Additional technical details can be found in the sparklyr gateway socket pull request.

Cloudera certification

sparklyr 0.4, sparklyr 0.5, RStudio Server Pro 1.0 and ShinyServer Pro 1.5 went through Cloudera’s certification and are now certified with Cloudera. Among various benefits, authentication features like Kerberos, have been tested and validated against secured clusters.

For more information see Cloudera’s partner listings.

We have released the R package bookdown (v0.3) to CRAN. It may be old news to some users, but we are happy to make an official announcement today. To install the package from CRAN, you can

install.packages("bookdown")

The bookdown package provides an easier way to write books and technical publications than traditional tools such as LaTeX and Word. It inherits the simplicity of syntax and flexibility for data analysis from R Markdown, and extends R Markdown for technical writing, so that you can make better use of document elements such as figures, tables, equations, theorems, citations, and references, etc. Similar to LaTeX, you can number and cross-reference these elements with bookdown. Read the rest of this entry »