You are currently browsing the monthly archive for June 2016.

Following our initial and very gratifying Shiny Developer Conference this past January, which sold out in a few days, RStudio is very excited to announce a new and bigger conference today!

rstudio::conf, the conference about all things R and RStudio, will take place January 13 and 14, 2017 in Orlando, Florida. The conference will feature talks and tutorials from popular RStudio data scientists and developers like Hadley Wickham, Yihui Xie, Joe Cheng, Winston Chang, Garrett Grolemund, and J.J. Allaire, along with lightning talks from RStudio partners and customers.

Preceding the conference, on January 11 and 12, RStudio will offer two days of optional training. Training attendees can choose from Hadley Wickham’s Master R training, a new Intermediate Shiny workshop from Shiny creator Joe Cheng or a new workshop from Garrett Grolemund that is based on his soon-to-be-published book with Hadley: Introduction to Data Science with R.

rstudio::conf is for R and RStudio users who want to learn how to write better shiny applications in a better way, explore all the new capabilities of the R Markdown authoring framework, apply R to big data and work effectively with Spark, understand the RStudio toolchain for data science with R, discover best practices and tips for coding with RStudio, and investigate enterprise scale development and deployment practices and tools, including the new RStudio Connect.

Not to be missed, RStudio has also reserved Universal Studio’s The Wizarding World of Harry Potter on Friday night, January 13, for the exclusive use of conference attendees!

Conference attendance is limited to 400. Training is limited to 70 students for each of the three 2-day workshops. All seats are are available on a first-come, first-serve basis.

Please go to http://www.rstudio.com/conference to purchase.

We hope to see you in Florida at rstudio::conf 2017!

For questions or issues registering, please email conf@rstudio.com. To ask about sponsorship opportunities contact anne@rstudio.com.

UseR! 2016 has arrived and the RStudio team is at Stanford to share our newest products and latest enhancements to Shiny, R Markdown, dplyr, and more. Here’s a quick snapshot of RStudio related sessions. We hope to see you in as many of them as you can attend!

Monday June 27

Morning Tutorials

Afternoon Tutorials

Afternoon short talks moderated by Hadley Wickham

Tuesday June 28

Wednesday June 29

Thursday June 30

Stop by the booth!
Don’t miss our table in the exhibition area during the conference. Come talk to us about your plans for R and learn how RStudio Server Pro and Shiny Server Pro can provide enterprise-ready support and scalability for your RStudio IDE and Shiny deployments.

Note: Although UseR! is sold out, arrangements have been made to stream the keynote talks from https://aka.ms/user2016conference. Video recordings of the other sessions (where permitted by speakers) will be made available by UseR! organizers after the conference.

 

I’m very pleased to announce that dplyr 0.5.0 is now available from CRAN. Get the latest version with:

install.packages("dplyr")

dplyr 0.5.0 is a big release with a heap of new features, a whole bunch of minor improvements, and many bug fixes, both from me and from the broader dplyr community. In this blog post, I’ll highlight the most important changes:

  • Some breaking changes to single table verbs.
  • New tibble and dtplyr packages.
  • New vector functions.
  • Replacements for summarise_each() and mutate_each().
  • Improvements to SQL translation.

To see the complete list, please read the release notes.

Breaking changes

arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it’s not going to change again, as more changes will just cause more confusion.

mtcars %>% 
  group_by(cyl) %>% 
  arrange(desc(mpg))
#> Source: local data frame [32 x 11]
#> Groups: cyl [3]
#> 
#> # A tibble: 32 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
#> 2  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
#> 3  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2
#> 4  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2
#> 5  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1
#> ... with 27 more rows

If you give distinct() a list of variables, it now only keeps those variables (instead of, as previously, keeping the first value from the other variables). To preserve the previous behaviour, use .keep_all = TRUE:

df <- data_frame(x = c(1, 1, 1, 2, 2), y = 1:5)

# Now only keeps x variable
df %>% distinct(x)
#> # A tibble: 2 x 1
#>       x
#>   <dbl>
#> 1     1
#> 2     2

# Previous behaviour preserved all variables
df %>% distinct(x, .keep_all = TRUE)
#> # A tibble: 2 x 2
#>       x     y
#>   <dbl> <int>
#> 1     1     1
#> 2     2     4

The select() helper functions starts_with(), ends_with(), etc are now real exported functions. This means that they have better documentation, and there’s an extension mechnaism if you want to write your own helpers.

Tibble and dtplyr packages

Functions related to the creation and coercion of tbl_dfs (“tibble”s for short), now live in their own package: tibble. See vignette("tibble") for more details.

Similarly, all code related to the data table dplyr backend code has been separated out in to a new dtplyr package. This decouples the development of the data.table interface from the development of the dplyr package, and I hope will spur improvements to the backend. If both data.table and dplyr are loaded, you’ll get a message reminding you to load dtplyr.

Vector functions

This version of dplyr gains a number of vector functions inspired by SQL. Two functions make it a little easier to eliminate or generate missing values:

  • Given a set of vectors, coalesce() finds the first non-missing value in each position:
    x <- c(1,  2, NA, 4, NA, 6)
    y <- c(NA, 2,  3, 4,  5, NA)
    
    # Use this to piece together a complete vector:
    coalesce(x, y)
    #> [1] 1 2 3 4 5 6
    
    # Or just replace missing value with a constant:
    coalesce(x, 0)
    #> [1] 1 2 0 4 0 6
  • The complement of coalesce() is na_if(): it replaces a specified value with an NA.
    x <- c(1, 5, 2, -99, -99, 10)
    na_if(x, -99)
    #> [1]  1  5  2 NA NA 10

Three functions provide convenient ways of replacing values. In order from simplest to most complicated, they are:

  • if_else(), a vectorised if statement, takes a logical vector (usually created with a comparison operator like ==, <, or %in%) and replaces TRUEs with one vector and FALSEs with another.
    x1 <- sample(5)
    if_else(x1 < 5, "small", "big")
    #> [1] "small" "small" "big"   "small" "small"

    if_else() is similar to base::ifelse(), but has two useful improvements.
    First, it has a fourth argument that will replace missing values:

    x2 <- c(NA, x1)
    if_else(x2 < 5, "small", "big", "unknown")
    #> [1] "unknown" "small"   "small"   "big"     "small"   "small"

    Secondly, it also have stricter semantics that ifelse(): the true and false arguments must be the same type. This gives a less surprising return type, and preserves S3 vectors like dates and factors:

    x <- factor(sample(letters[1:5], 10, replace = TRUE))
    ifelse(x %in% c("a", "b", "c"), x, factor(NA))
    #>  [1] NA NA  1 NA  3  2  3 NA  3  2
    if_else(x %in% c("a", "b", "c"), x, factor(NA))
    #>  [1] <NA> <NA> a    <NA> c    b    c    <NA> c    b   
    #> Levels: a b c d e

    Currently, if_else() is very strict, so you’ll need to careful match the types of true and false. This is most likely to bite you when you’re using missing values, and you’ll need to use a specific NA: NA_integer_, NA_real_, or NA_character_:

    if_else(TRUE, 1, NA)
    #> Error: `false` has type 'logical' not 'double'
    if_else(TRUE, 1, NA_real_)
    #> [1] 1
  • recode(), a vectorised switch(), takes a numeric vector, character vector, or factor, and replaces elements based on their values.
    x <- sample(c("a", "b", "c", NA), 10, replace = TRUE)
    
    # The default is to leave non-replaced values as is
    recode(x, a = "Apple")
    #>  [1] "c"     "Apple" NA      NA      "c"     NA      "b"     NA     
    #>  [9] "c"     "Apple"
    # But you can choose to override the default:
    recode(x, a = "Apple", .default = NA_character_)
    #>  [1] NA      "Apple" NA      NA      NA      NA      NA      NA     
    #>  [9] NA      "Apple"
    # You can also choose what value is used for missing values
    recode(x, a = "Apple", .default = NA_character_, .missing = "Unknown")
    #>  [1] NA        "Apple"   "Unknown" "Unknown" NA        "Unknown" NA       
    #>  [8] "Unknown" NA        "Apple"
  • case_when(), is a vectorised set of if and else ifs. You provide it a set of test-result pairs as formulas: The left side of the formula should return a logical vector, and the right hand side should return either a single value, or a vector the same length as the left hand side. All results must be the same type of vector.
    x <- 1:40
    case_when(
      x %% 35 == 0 ~ "fizz buzz",
      x %% 5 == 0 ~ "fizz",
      x %% 7 == 0 ~ "buzz",
      TRUE ~ as.character(x)
    )
    #>  [1] "1"         "2"         "3"         "4"         "fizz"     
    #>  [6] "6"         "buzz"      "8"         "9"         "fizz"     
    #> [11] "11"        "12"        "13"        "buzz"      "fizz"     
    #> [16] "16"        "17"        "18"        "19"        "fizz"     
    #> [21] "buzz"      "22"        "23"        "24"        "fizz"     
    #> [26] "26"        "27"        "buzz"      "29"        "fizz"     
    #> [31] "31"        "32"        "33"        "34"        "fizz buzz"
    #> [36] "36"        "37"        "38"        "39"        "fizz"

    case_when() is still somewhat experiment and does not currently work inside mutate(). That will be fixed in a future version.

I also added one small helper for dealing with floating point comparisons: near() tests for equality with numeric tolerance (abs(x - y) < tolerance).

x <- sqrt(2) ^ 2

x == 2
#> [1] FALSE
near(x, 2)
#> [1] TRUE

Predicate functions

Thanks to ideas and code from Lionel Henry, a new family of functions improve upon summarise_each() and mutate_each():

  • summarise_all() and mutate_all() apply a function to all (non-grouped) columns:
    mtcars %>% group_by(cyl) %>% summarise_all(mean)    
    #> # A tibble: 3 x 11
    #>     cyl      mpg     disp        hp     drat       wt     qsec        vs
    #>   <dbl>    <dbl>    <dbl>     <dbl>    <dbl>    <dbl>    <dbl>     <dbl>
    #> 1     4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727 0.9090909
    #> 2     6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714 0.5714286
    #> 3     8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214 0.0000000
    #> ... with 3 more variables: am <dbl>, gear <dbl>, carb <dbl>
  • summarise_at() and mutate_at() operate on a subset of columns. You can select columns with:
    • a character vector of column names,
    • a numeric vector of column positions, or
    • a column specification with select() semantics generated with the new vars() helper.
    mtcars %>% group_by(cyl) %>% summarise_at(c("mpg", "wt"), mean)
    #> # A tibble: 3 x 3
    #>     cyl      mpg       wt
    #>   <dbl>    <dbl>    <dbl>
    #> 1     4 26.66364 2.285727
    #> 2     6 19.74286 3.117143
    #> 3     8 15.10000 3.999214
    mtcars %>% group_by(cyl) %>% summarise_at(vars(mpg, wt), mean)
    #> # A tibble: 3 x 3
    #>     cyl      mpg       wt
    #>   <dbl>    <dbl>    <dbl>
    #> 1     4 26.66364 2.285727
    #> 2     6 19.74286 3.117143
    #> 3     8 15.10000 3.999214
  • summarise_if() and mutate_if() take a predicate function (a function that returns TRUE or FALSE when given a column). This makes it easy to apply a function only to numeric columns:
    iris %>% summarise_if(is.numeric, mean)
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width
    #> 1     5.843333    3.057333        3.758    1.199333

All of these functions pass ... on to the individual funs:

iris %>% summarise_if(is.numeric, mean, trim = 0.25)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     5.802632    3.032895     3.934211    1.230263

A new select_if() allows you to pick columns with a predicate function:

df <- data_frame(x = 1:3, y = c("a", "b", "c"))
df %>% select_if(is.numeric)
#> # A tibble: 3 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3
df %>% select_if(is.character)
#> # A tibble: 3 x 1
#>       y
#>   <chr>
#> 1     a
#> 2     b
#> 3     c

summarise_each() and mutate_each() will be deprecated in a future release.

SQL translation

I have completely overhauled the translation of dplyr verbs into SQL statements. Previously, dplyr used a rather ad-hoc approach which tried to guess when a new subquery was needed. Unfortunately this approach was fraught with bugs, so I have now implemented a richer internal data model. In the short-term, this is likely to lead to some minor performance decreases (as the generated SQL is more complex), but the dplyr is much more likely to generate correct SQL. In the long-term, these abstractions will make it possible to write a query optimiser/compiler in dplyr, which would make it possible to generate much more succinct queries. If you know anything about writing query optimisers or compilers and are interested in working on this problem, please let me know!

I’m pleased to announce tidyr 0.5.0. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has a simple convention: put variables in the columns and observations in the rows. You can learn more about it in the tidy data vignette. Install it with:

install.packages("tidyr")

This release has three useful new features:

  1. separate_rows() separates values that contain multiple values separated by a delimited into multiple rows. Thanks to Aaron Wolen for the contribution!
    df <- data_frame(x = 1:2, y = c("a,b", "d,e,f"))
    df %>% 
      separate_rows(y, sep = ",")
    #> Source: local data frame [5 x 2]
    #> 
    #>       x     y
    #>   <int> <chr>
    #> 1     1     a
    #> 2     1     b
    #> 3     2     d
    #> 4     2     e
    #> 5     2     f

    Compare with separate() which separates into (named) columns:

    df %>% 
      separate(y, c("y1", "y2", "y3"), sep = ",", fill = "right")
    #> Source: local data frame [2 x 4]
    #> 
    #>       x    y1    y2    y3
    #> * <int> <chr> <chr> <chr>
    #> 1     1     a     b  <NA>
    #> 2     2     d     e     f
  2. spread() gains a sep argument. Setting this will name columns as “key|sep|value”. This is useful when you’re spreading based on a numeric column:
    df <- data_frame(
      x = c(1, 2, 1), 
      key = c(1, 1, 2), 
      val = c("a", "b", "c")
    )
    df %>% spread(key, val)
    #> Source: local data frame [2 x 3]
    #> 
    #>       x     1     2
    #> * <dbl> <chr> <chr>
    #> 1     1     a     c
    #> 2     2     b  <NA>
    df %>% spread(key, val, sep = "_")
    #> Source: local data frame [2 x 3]
    #> 
    #>       x key_1 key_2
    #> * <dbl> <chr> <chr>
    #> 1     1     a     c
    #> 2     2     b  <NA>
  3. unnest() gains a .sep argument. This is useful if you have multiple columns of data frames that have the same variable names:
    df <- data_frame(
      x = 1:2,
      y1 = list(
        data_frame(y = 1),
        data_frame(y = 2)
      ),
      y2 = list(
        data_frame(y = "a"),
        data_frame(y = "b")
      )
    )
    df %>% unnest()
    #> Source: local data frame [2 x 3]
    #> 
    #>       x     y     y
    #>   <int> <dbl> <chr>
    #> 1     1     1     a
    #> 2     2     2     b
    df %>% unnest(.sep = "_")
    #> Source: local data frame [2 x 3]
    #> 
    #>       x  y1_y  y2_y
    #>   <int> <dbl> <chr>
    #> 1     1     1     a
    #> 2     2     2     b

    It also gains a .id column that makes the names of the list explicit:

    df <- data_frame(
      x = 1:2,
      y = list(
        a = 1:3,
        b = 3:1
      )
    )
    df %>% unnest()
    #> Source: local data frame [6 x 2]
    #> 
    #>       x     y
    #>   <int> <int>
    #> 1     1     1
    #> 2     1     2
    #> 3     1     3
    #> 4     2     3
    #> 5     2     2
    #> 6     2     1
    df %>% unnest(.id = "id")
    #> Source: local data frame [6 x 3]
    #> 
    #>       x     y    id
    #>   <int> <int> <chr>
    #> 1     1     1     a
    #> 2     1     2     a
    #> 3     1     3     a
    #> 4     2     3     b
    #> 5     2     2     b
    #> 6     2     1     b

tidyr 0.5.0 also includes a bumper crop of bug fixes, including fixes for spread() and gather() in the presence of list-columns. Please see the release notes for a complete list of changes.