You are currently browsing the monthly archive for July 2016.

New York City is a wonderful place to be most of the time but especially in September!

If you live or work in the city or just want a good business reason to visit, consider joining RStudio Chief Data Scientist Hadley Wickham in the heart of Manhattan on September 12th and 13th, just by Times Square at the AMA Conference Center. It’s a rare opportunity to learn from one of the R community’s most popular and innovative authors and package developers.

Hadley’s workshops usually sell out. This is his only East US public workshop in 2016 and there are no plans to do another in NYC in 2017. If you’re an active R user and have been meaning to take this class, now is the perfect time to do it!

Register here: https://www.eventbrite.com/e/master-r-developer-workshop-new-york-city-tickets-21347014495

We look forward to seeing you in New York!

The JSM conference in Chicago, July 31 thru August 4, 2016, is one of the largest to be found on statistics, with many terrific talks for R users. We’ve listed some of the sessions that we’re particularly excited about below. These include talks from RStudio employees, like Hadley Wickham, Yihui Xie, Mine Cetinkaya-Rundel, Garrett Grolemund, and Joe Cheng, but also include a bunch of other talks about R that we think look interesting.

When you’re not in one of the sessions below, please visit us in the exhibition area, booth #126-128. We’ll have copies of all our cheat sheets and stickers, and it’s a great place to learn about the other stuff we’ve been working on lately:  from Sparklyr and R Markdown Notebooks to the latest in RStudio Server Pro, Shiny Server Pro, shinyapps.io, RStudio Connect (beta) and more!

Another great place to chat with people interested in R is the Statistical Computing and Graphics Mixer at 6pm on Monday in the Hilton Stevens Salon A4. It’s advertised as a business meeting in the program, but don’t let that put you off – it’s open to all.

SUNDAY

Session 21: Statistical Computing and Graphics Student Awards
Sunday, July 31, 2016 : 2:00 PM to 3:50 PM, CC-W175b

Session 47 Making the Most of R Tools
Hadley Wickham, RStudio (Discussant)
Sunday, July 31, 2016: 4:00 PM to 4:50 PM, CC-W183b

Thinking with Data Using R and RStudio: Powerful Idioms for Analysts
Nicholas Jon Horton, Amherst College; Randall Pruim, Calvin College ; Daniel Kaplan, Macalester College
Transform Your Workflow and Deliverables with Shiny and R Markdown
Garrett Grolemund, RStudio

Session 54 Recent Advances in Information Visualization
Yihui Xie, RStudio (organizer)
Sunday, July 31, 2016: 4:00 PM to 4:50 PM, CC-W183c

Session 85 Reproducibility Promotes Transparency, Efficiency, and Aesthetics
Richard Schwinn
Sunday, July 31, 2016 : 5:35 PM to 5:50 PM, CC-W176a

Session 88 Communicate Better with R, R Markdown, and Shiny
Garrett Grolemund, RStudio (Poster Session)
Sunday, July 31, 2016: 6:00 PM to 8:00 PM, CC-Hall F1 West

MONDAY

Session 106  Linked Brushing in R
Hadley Wickham, RStudio
Monday, August 1, 2016 : 8:35 AM to 8:55 AM, CC-W196b

Session 127 R Tools for Statistical Computing
Monday, August 1, 2016 : 8:30 AM to 10:20 AM, CC-W196c

8:35 AM The Biglasso Package: Extending Lasso Model Fitting to Big Data in R — Yaohui Zeng, University of Iowa ; Patrick Breheny, University of Iowa
8:50 AM Independent Sampling for a Spatial Model with Incomplete Data — Harsimran Somal, University of Iowa ; Mary Kathryn Cowles, University of Iowa
9:05 AM Introduction to the TextmineR Package for R — Thomas Jones, Impact Research
9:20 AM Vector-Generalized Time Series Models — Victor Miranda Soberanis, University of Auckland ; Thomas Yee, University of Auckland
9:35 AM New Computational Approaches to Large/Complex Mixed Effects Models — Norman Matloff, University of California at Davis
9:50 AM Broom: An R Package for Converting Statistical Modeling Objects Into Tidy Data Frames — David G. Robinson, Stack Overflow
10:05 AM Exact Parametric and Nonparametric Likelihood-Ratio Tests for Two-Sample Comparisons — Yang Zhao, SUNY Buffalo ; Albert Vexler, SUNY Buffalo ; Alan Hutson, SUNY Buffalo ; Xiwei Chen, SUNY Buffalo

Session 270 Automated Analytics and Data Dashboards for Evaluating the Impacts of Educational Technologies
Daniel Stanhope and Joyce Yu and Karly Rectanus
Monday, August 1, 2016 : 3:05 PM to 3:50 PM, CC-Hall F1 West

TUESDAY

Session 276 Statistical Tools for Clinical Neuroimaging
Ciprian Crainiceanu
Tuesday, August 2, 2016 : 7:00 AM to 8:15 AM, CC-W375a

Session 332 Doing More with Data in and Outside the Undergraduate Classroom
Mine Cetinkaya-Rundel, Duke University (organizer)
Tuesday, August 2, 2016 : 10:30 AM to 12:20 PM, CC-W184bc

Session 407 Interactive Visualizations and Web Applications for Analytics
Tuesday, August 2, 2016 : 2:00 PM to 3:50 PM, CC-W179a

2:05 PM Radiant: A Platform-Independent Browser-Based Interface for Business Analytics in R — Vincent Nijs, Rady School of Management
2:20 PM Rbokeh: An R Interface to the Bokeh Plotting Library — Ryan Hafen, Hafen Consulting
2:35 PM Composable Linked Interactive Visualizations in R with Htmlwidgets and Shiny — Joseph Cheng, RStudio
2:50 PM Papayar: A Better Interactive Neuroimage Plotter in R — John Muschelli, The Johns Hopkins University
3:05 PM Interactive and Dynamic Web-Based Graphics for Data Analysis — Carson Sievert, Iowa State University
3:20 PM HTML Widgets: Interactive Visualizations from R Made Easy! — Yihui Xie, RStudio ; Ramnath Vaidyanathan, Alteryx

WEDNESDAY

Session 475  Steps Toward Reproducible Research
Yihui Xie, RStudio  (Discussant)
Wednesday, August 3, 2016 : 8:30 AM to 10:20 AM, CC-W196c

8:35 AM Reproducibility for All and Our Love/Hate Relationship with Spreadsheets — Jennifer Bryan, University of British Columbia
8:55 AM Steps Toward Reproducible Research — Karl W. Broman, University of Wisconsin – Madison
9:15 AM Enough with Trickle-Down Reproducibility: Scientists, Open This Gate! Scientists, Tear Down This Wall! — Karthik Ram, University of California at Berkeley
9:35 AM Integrating Reproducibility into the Undergraduate Statistics Curriculum — Mine Cetinkaya-Rundel, Duke University

Session 581 Mining Text in R
David Marchette, Naval Surface Warfare Center
Wednesday, August 3, 2016 : 2:05 PM to 2:40 PM, CC-W180

THURSDAY

Session 696 Statistics for Social Good
Hadley Wickham, RStudio (Chair)
Thursday, August 4, 2016 : 10:30 AM to 12:20 PM, CC-W179a

Session 694 Web Application Teaching Tools for Statistics Using R and Shiny
Jimmy Doi and Gail Potter and Jimmy Wong and Irvin Alcaraz and Peter Chi
Thursday, August 4, 2016 : 11:05 AM to 11:20 AM, CC-W192a

We’re proud to announce version 1.1 of the tibble package. Tibbles are a modern reimagining of the data frame, keeping what time has shown to be effective, and throwing out what is not. Grab the latest version with:

install.packages("tibble")

There are three major new features:

  • A more consistent naming scheme
  • Changes to how columns are extracted
  • Tweaks to the output

There are many other small improvements and bug fixes: please see the release notes for a complete list.

A better naming scheme

It’s caused some confusion that you use data_frame() and as_data_frame() to create and coerce tibbles. It’s also more important to make the distinction between tibbles and data frames more clear as we evolve a little further away from the semantics of data frames.

Now, we’re consistently using “tibble” as the key word in creation, coercion, and testing functions:

tibble(x = 1:5, y = letters[1:5])
#> # A tibble: 5 x 2
#>       x     y
#>   <int> <chr>
#> 1     1     a
#> 2     2     b
#> 3     3     c
#> 4     4     d
#> 5     5     e
as_tibble(data.frame(x = runif(5)))
#> # A tibble: 5 x 1
#>           x
#>       <dbl>
#> 1 0.4603887
#> 2 0.4824339
#> 3 0.4546795
#> 4 0.5042028
#> 5 0.4558387
is_tibble(data.frame())
#> [1] FALSE

Previously tibble() was an alias for frame_data(). If you were using tibble() to create tibbles by rows, you’ll need to switch to frame_data(). This is a breaking change, but we believe that the new naming scheme will be less confusing in the long run.

Extracting columns

The previous version of tibble was a little too strict when you attempted to retrieve a column that did not exist: we had forgotten that many people check for the presence of column with is.null(df$x). This is bad idea because of partial matching, but it is common:

df1 <- data.frame(xyz = 1)
df1$x
#> [1] 1

Now, instead of throwing an error, tibble will return NULL. If you use $, common in interactive scripts, tibble will generate a warning:

df2 <- tibble(xyz = 1)
df2$x
#> Warning: Unknown column 'x'
#> NULL
df2[["x"]]
#> NULL

We also provide a convenient helper for detecting the presence/absence of a column:

has_name(df1, "x")
#> [1] FALSE
has_name(df2, "x")
#> [1] FALSE

Output tweaks

We’ve tweaked the output to have a shorter header, more information in the footer. We’re using # consistently to denote metadata, and we print missing character values as <NA> (instead of NA).

The example below shows the new rendering of the flights table.

nycflights13::flights
#> # A tibble: 336,776 x 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1   2013     1     1      517            515         2      830
#> 2   2013     1     1      533            529         4      850
#> 3   2013     1     1      542            540         2      923
#> 4   2013     1     1      544            545        -1     1004
#> 5   2013     1     1      554            600        -6      812
#> 6   2013     1     1      554            558        -4      740
#> 7   2013     1     1      555            600        -5      913
#> 8   2013     1     1      557            600        -3      709
#> 9   2013     1     1      557            600        -3      838
#> 10  2013     1     1      558            600        -2      753
#> # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <time>

Thanks to Lionel Henry for contributing an option for determining the number of printed extra columns: getOption("tibble.max_extra_cols"). This is particularly important for the ultra-wide tables often released by statistical offices and other institutions.

Expect the printed output to continue to evolve. In the next version, we hope to do better with very wide columns (e.g. from long strings), and to make better use of now unused horizontal space (e.g. from long column names).

httr 1.2.0 is now available on CRAN. The httr package makes it easy to talk to web APIs from R. Learn more in the quick start vignette. Install the latest version with:

install.packages("httr")

There are a few small new features:

  • New RETRY() function allows you to retry a request multiple times until it succeeds, if you you are trying to talk to an unreliable service. To avoid hammering the server, it uses exponential backoff with jitter, as described in https://www.awsarchitectureblog.com/2015/03/backoff.html.
  • DELETE() gains a body parameter.
  • encode = "raw" parameter to functions that accept bodies. This allows you to do your own encoding.
  • http_type() returns the content/mime type of a request, sans parameters.

There is one important bug fix:

  • No longer uses use custom requests for standard POST requests. This has the side-effect of properly following redirects after POST, fixing some login issues in rvest.

httr 1.2.1 includes a fix for a small bug that I discovered shortly after releasing 1.2.0.

For the complete list of improvements, please see the release notes.

We are pleased to announced that xml2 1.0.0 is now available on CRAN. Xml2 is a wrapper around the comprehensive libxml2 C library, and makes it easy to work with XML and HTML files in R. Install the latest version with:

install.packages("xml2")

There are three major improvements in 1.0.0:

  1. You can now modify and create XML documents.
  2. xml_find_first() replaces xml_find_one(), and provides better semantics for missing nodes.
  3. Improved namespace handling when working with XPath.

There are many other small improvements and bug fixes: please see the release notes for a complete list.

Modification and creation

xml2 now supports modification and creation of XML nodes. This includes new functions xml_new_document(), xml_new_child(), xml_new_sibling(), xml_set_namespace(), xml_remove(), xml_replace(), xml_root(), and replacement methods for xml_name(), xml_attr(), xml_attrs() and xml_text().

The basic process of creating an XML document by hand looks something like this:

root <- xml_new_document() %>% xml_add_child("root")

root %>% 
  xml_add_child("a1", x = "1", y = "2") %>% 
  xml_add_child("b") %>% 
  xml_add_child("c") %>% 
  invisible()

root %>% 
  xml_add_child("a2") %>% 
  xml_add_sibling("a3") %>% 
  invisible()

cat(as.character(root))
#> <?xml version="1.0"?>
#> <root><a1 x="1" y="2"><b><c/></b></a1><a2/><a3/></root>

For a complete description of creation and mutation, please see vignette("modification", package = "xml2").

xml_find_first()

xml_find_one() has been deprecated in favor of xml_find_first(). xml_find_first() now always returns a single node: if there are multiple matches, it returns the first (without a warning), and if there are no matches, it returns a new xml_missing object.

This makes it much easier to work with ragged/inconsistent hierarchies:

x1 <- read_xml("<a>
  <b></b>
  <b><c>See</c></b>
  <b><c>Sea</c><c /></b>
</a>")

c <- x1 %>% 
  xml_find_all(".//b") %>% 
  xml_find_first(".//c")
c
#> {xml_nodeset (3)}
#> [1] <NA>
#> [2] <c>See</c>
#> [3] <c>Sea</c>

Missing nodes are replaced by missing values in functions that return vectors:

xml_name(c)
#> [1] NA  "c" "c"
xml_text(c)
#> [1] NA    "See" "Sea"

XPath and namespaces

XPath is challenging to use if your document contains any namespaces:

x <- read_xml('
 <root>
   <doc1 xmlns = "http://foo.com"><baz /></doc1>
   <doc2 xmlns = "http://bar.com"><baz /></doc2>
 </root>
')
x %>% xml_find_all(".//baz")
#> {xml_nodeset (0)}

To make life slightly easier, the default xml_ns() object is automatically passed to xml_find_*():

x %>% xml_ns()
#> d1 <-> http://foo.com
#> d2 <-> http://bar.com
x %>% xml_find_all(".//d1:baz")
#> {xml_nodeset (1)}
#> [1] <baz/>

If you just want to avoid the hassle of namespaces altogether, we have a new nuclear option: xml_ns_strip():

xml_ns_strip(x)
x %>% xml_find_all(".//baz")
#> {xml_nodeset (2)}
#> [1] <baz/>
#> [2] <baz/>