
Join RStudio Chief Data Scientist Hadley Wickham at the University of Illinois at Chicago on Wednesday and Thursday, May 27th and 28th, for this rare opportunity to learn from one of the R community’s most popular and innovative authors and package developers.

As of this post, the workshop is two-thirds sold out. If you’re in or near Chicago and want to boost your R programming skills, this is Hadley’s only Central US public workshop planned for 2015.

Register here: https://rstudio-chicago.eventbrite.com

Devtools 1.8 is now available on CRAN. Devtools makes it so easy to build a package that it becomes your default way to organise code, data and documentation. You can learn more about developing packages at http://r-pkgs.had.co.nz/.

Get the latest version of devtools with:

install.packages("devtools")

There are three main improvements:

  • More helpers to get you up and running with package development as quickly as possible.

  • Better tools for package installation (including checking that all dependencies are up to date).

  • Improved reverse dependency checking for CRAN packages.

There were many other minor improvements and bug fixes. See the release notes for a complete list of changes. The last release announcement was for devtools 1.6 since there weren’t many big changes in devtools 1.7; the most important 1.7 points are included in this announcement, labelled with [1.7].

Helpers

The number of functions designed to get you up and going with package development continues to grow. This version sees the addition of:

  • dr_devtools(), which runs some common diagnostics: are you using the latest version of R and devtools? Similarly, dr_github() checks for common git/github configuration problems.

  • lint() runs lintr::lint_package() to check the style of package code [1.7].

  • use_code_of_conduct() adds a contributor code of conduct from http://contributor-covenant.org.

  • use_cran_badge() adds a CRAN status badge that you can copy into a README file. Green indicates that the package is on CRAN; packages not yet submitted to or accepted by CRAN get a red badge.

  • use_cran_comments() creates a cran-comments.md template and adds it to .Rbuildignore to help with CRAN submissions. [1.7]

  • use_coveralls() allows you to easily track test coverage with Coveralls.

  • use_git() sets up a package to use git, initialising the repo and checking the existing files.

  • use_test() adds a new test file in tests/testthat.

  • use_readme_rmd() sets up a template to generate a README.md from a README.Rmd with knitr. [1.7]
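Together these helpers make setting up a new package almost entirely scriptable. Here’s a minimal sketch of how a few of them might fit together for a hypothetical package (the path and test name are placeholders):

library(devtools)
create("~/rpkgs/mypkg")               # new package skeleton (hypothetical path)
use_git("~/rpkgs/mypkg")              # initialise a git repo
use_readme_rmd("~/rpkgs/mypkg")       # README.Rmd template, knit to README.md
use_cran_comments("~/rpkgs/mypkg")    # cran-comments.md for CRAN submissions
use_test("parsing", "~/rpkgs/mypkg")  # adds tests/testthat/test-parsing.R
dr_devtools()                         # diagnose common setup problems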

Package installation and info

When developing packages it’s common to run into problems because you’ve updated a package but forgotten to update its dependencies (install.packages() doesn’t do this automatically). The new package_deps() solves this problem by finding all recursive dependencies of a package and determining whether they’re out of date:

# Find out which dependencies are out of date
devtools::package_deps("devtools")
# Update them
update(devtools::package_deps("devtools"))

This code is used in install_deps() and revdep_check() – devtools is now aggressive about updating packages, which should avoid potential problems in CRAN submissions.
The new update_packages() uses these tools to install a package (and its dependencies) only if they’re not already installed and current.
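For example, a one-line sketch that installs or refreshes a package and all of its dependencies (any package name works here):

devtools::update_packages("ggplot2")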

Reverse dependency checking

Devtools 1.7 included considerable improvements to reverse dependency checking. This sort of checking is important if your package gets popular and is used by other CRAN packages. Before submitting updates to CRAN, you need to make sure that you haven’t broken the CRAN packages that use your package. Read more about it in the R packages book. To get started, run use_revdep(), then run the code in revdep/check.R.
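In outline the workflow looks like this (a sketch – the generated revdep/check.R is the authoritative version):

devtools::use_revdep()    # sets up revdep/, including a check.R template
devtools::revdep_check()  # checks every CRAN package that depends on yours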

I’m very excited to announce the 1.0.0 release of the stringr package. If you haven’t heard of stringr before, it makes string manipulation easier by:

  • Using consistent function and argument names: all functions start with str_, and the first argument is always the input string. This makes stringr easier to learn and easy to use with the pipe, as sketched below.
  • Eliminating options that you don’t need 95% of the time.
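For instance, because the string always comes first, str_* calls chain naturally (a small sketch; magrittr is assumed to be attached to provide %>%):

library(stringr)
library(magrittr)

c("  apple ", "Banana ", "cherry") %>%
  str_trim() %>%     # strip surrounding whitespace
  str_subset("an")   # keep elements matching the pattern
#> [1] "Banana"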

To get started with stringr, check out the new vignette.

What’s new?

The biggest change in this release is that stringr is now powered by the stringi package instead of base R. This has two big benefits: stringr is now much faster, and has much better unicode support.

If you’ve used stringi before, you might wonder why stringr is still necessary: stringi does everything that stringr does, and much much more. There are two reasons that I think stringr is still important:

  1. Lots of people use it already, so this update will give many people a performance boost for free.
  2. The smaller API of stringr makes it a little easier to learn.

That said, once you’ve learned stringr, using stringi should be easy, so it’s a great place to start if you need a tool that doesn’t exist in stringr.

New features and functions

  • str_replace_all() gains a convenient syntax for applying multiple pairs of pattern and replacement to the same vector:
    x <- c("abc", "def")
    str_replace_all(x, c("[ad]" = "!", "[cf]" = "?"))
    #> [1] "!b?" "!e?"
  • str_subset() keeps values that match a pattern:
    x <- c("abc", "def", "jhi", "klm", "nop")
    str_subset(x, "[aeiou]")
    #> [1] "abc" "def" "jhi" "nop"
  • str_order() and str_sort() order and sort strings in a specified locale. str_conv() converts strings from a specified encoding to UTF-8.
    # The vowels come before the consonants in Hawaiian
    str_sort(letters[1:10], locale = "haw")
    #>  [1] "a" "e" "i" "b" "c" "d" "f" "g" "h" "j"
  • New modifier boundary() allows you to count, locate and split by character, word, line and sentence boundaries.
    words <- c("These are   some words. Some more words.")
    str_count(words, boundary("word"))
    #> [1] 7
    str_split(words, boundary("word"))
    #> [[1]]
    #> [1] "These" "are"   "some"  "words" "Some"  "more"  "words"

There were two minor changes to make stringr a little more consistent:

  • str_c() now returns a zero length vector if any of its inputs are zero length vectors. This is consistent with all other functions, and standard R recycling rules. Similarly, using str_c("x", NA) now yields NA. If you want "xNA", use str_replace_na() on the inputs.
  • str_match() now returns NA if an optional group doesn’t match (previously it returned “”). This is more consistent with str_extract() and other match failures.
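A quick sketch of the new str_c() behaviour:

str_c("x", NA)
#> [1] NA
str_c("x", str_replace_na(NA))
#> [1] "xNA"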

Development

Stringr is over five years old and is quite stable (the last release was over two years ago). Although I’ve endeavoured to make the change to stringi as seamless as possible, it’s likely that it has created some new bugs. If you have problems, please try the development version, and if that doesn’t help, file an issue on github.

I’m pleased to announce that the first version of xml2 is now available on CRAN. Xml2 is a wrapper around the comprehensive libxml2 C library that makes it easier to work with XML and HTML in R:

  • Read XML and HTML with read_xml() and read_html().
  • Navigate the tree with xml_children(), xml_siblings() and xml_parent(). Alternatively, use xpath to jump directly to the nodes you’re interested in with xml_find_one() and xml_find_all(). Get the full path to a node with xml_path().
  • Extract various components of a node with xml_text(), xml_attrs(), xml_attr(), and xml_name().
  • Convert to list with as_list().
  • Where appropriate, functions support namespaces with a global url -> prefix lookup table. See xml_ns() for more details.
  • Convert relative urls to absolute with url_absolute(), and transform in the opposite direction with url_relative(). Escape and unescape special characters with url_escape() and url_unescape().
  • Support for modifying and creating xml documents is planned for a future version.
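The url helpers operate on plain strings, independently of any parsed document. A small sketch:

library(xml2)
url_absolute("foo.html", "http://example.com/bar/")
#> [1] "http://example.com/bar/foo.html"
url_escape("a b c")
#> [1] "a%20b%20c"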

This package owes a debt of gratitude to Duncan Temple Lang, whose XML package has made it possible to use XML with R for almost 15 years!

Usage

You can install it by running:

install.packages("xml2")

(If you’re on a mac, you might need to wait a couple of days – CRAN is busy rebuilding all the packages for R 3.2.0 so it’s running a bit behind.)

Here’s a small example working with an inline XML document:

library(xml2)
x <- read_xml("<foo>
  <bar>text <baz id = 'a' /></bar>
  <bar>2</bar>
  <baz id = 'b' /> 
</foo>")

xml_name(x)
#> [1] "foo"
xml_children(x)
#> {xml_nodeset (3)}
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>

# Find all baz nodes anywhere in the document
baz <- xml_find_all(x, ".//baz")
baz
#> {xml_nodeset (2)}
#> [1] <baz id="a"/>
#> [2] <baz id="b"/>
xml_path(baz)
#> [1] "/foo/bar[1]/baz" "/foo/baz"
xml_attr(baz, "id")
#> [1] "a" "b"

Development

Xml2 is still under active development. If you notice any problems (including crashes), please try the development version, and if that doesn’t work, file an issue.

I’m pleased to announce that the first version of readxl is now available on CRAN. Readxl makes it easy to get tabular data out of Excel. It:

  • Supports both the legacy .xls format and the modern xml-based .xlsx format. .xls support is made possible with the libxls C library, which abstracts away many of the complexities of the underlying binary format. To parse .xlsx, we use the insanely fast RapidXML C++ library.
  • Has no external dependencies so it’s easy to use on all platforms.
  • Re-encodes non-ASCII characters to UTF-8.
  • Loads datetimes into POSIXct columns. Both Windows (1900) and Mac (1904) date specifications are processed correctly.
  • Drops blank columns automatically.
  • Returns output with class c("tbl_df", "tbl", "data.frame") so if you also use dplyr you’ll get an enhanced print method (i.e. you’ll see just the first ten rows, not the first 10,000!).

You can install it by running:

install.packages("readxl")

There’s not really much to say about how to use it:

library(readxl)
# Use an Excel file included in the package
sample <- system.file("extdata", "datasets.xlsx", package = "readxl")

# Read by position
head(read_excel(sample, 2))
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Or by name:
excel_sheets(sample)
#> [1] "iris"     "mtcars"   "chickwts" "quakes"
head(read_excel(sample, "mtcars"))
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

See the documentation for more info on the col_names, col_types and na arguments.
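For instance, here’s a sketch of those arguments in action (treating the string "NA" as the missing-value marker is an assumption for illustration):

read_excel(sample, sheet = "mtcars", col_names = TRUE, na = "NA")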

Readxl is still under active development. If you have problems loading a dataset, please try the development version, and if that doesn’t work, file an issue.

The dygraphs package is an R interface to the dygraphs JavaScript charting library. It provides rich facilities for charting time-series data in R, including:

  • Automatically plots xts time-series objects (or objects convertible to xts).
  • Rich interactive features including zoom/pan and series/point highlighting.
  • Highly configurable axis and series display (including optional 2nd Y-axis).
  • Display upper/lower bars (e.g. prediction intervals) around series.
  • Various graph overlays including shaded regions, event lines, and annotations.
  • Use at the R console just like conventional R plots (via RStudio Viewer).
  • Embeddable within R Markdown documents and Shiny web applications.

The dygraphs package is available on CRAN now and can be installed with:

install.packages("dygraphs")

Examples

Here are some examples of interactive time series visualizations you can create with only a line or two of R code (the screenshots are static, click them to see the interactive version).

Panning and Zooming

This code adds a range selector that can be used to pan and zoom around the series data:

dygraph(nhtemp, main = "New Haven Temperatures") %>%
  dyRangeSelector()

[Screenshot: New Haven temperatures dygraph with range selector]

Point Highlighting

When you hover over the time-series the values of all points at the location of the mouse are shown in the legend:

lungDeaths <- cbind(ldeaths, mdeaths, fdeaths)
dygraph(lungDeaths, main = "Deaths from Lung Disease (UK)") %>%
  dyOptions(colors = RColorBrewer::brewer.pal(3, "Set2"))

[Screenshot: lung deaths dygraph with hover legend showing point values]

Shading and Annotations

There are a wide variety of tools available to annotate time series. Here we demonstrate creating shaded regions:

dygraph(nhtemp, main="New Haven Temperatures") %>% 
  dySeries(label="Temp (F)", color="black") %>%
  dyShading(from="1920-1-1", to="1930-1-1", color="#FFE6E6") %>%
  dyShading(from="1940-1-1", to="1950-1-1", color="#CCEBD6")

[Screenshot: New Haven temperatures dygraph with shaded regions]
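Upper/Lower Bars

You can also display upper/lower bars around a series, e.g. for prediction intervals. Here’s a sketch using a Holt-Winters forecast of the lung deaths data:

hw <- HoltWinters(ldeaths)
predicted <- predict(hw, n.ahead = 72, prediction.interval = TRUE)
dygraph(predicted, main = "Predicted Lung Deaths (UK)") %>%
  dySeries(c("lwr", "fit", "upr"), label = "Deaths")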

You can find additional examples and documentation on the dygraphs for R website.

Bringing JavaScript to R

One of the reasons we are excited about dygraphs is that it takes a mature and feature rich visualization library formerly only accessible to web developers and makes it available to all R users.

This is part of a larger trend enabled by the htmlwidgets package, and we expect that more and more libraries like dygraphs will emerge over the coming months to bring the best of JavaScript data visualization to R.


I’m pleased to announce that readr is now available on CRAN. Readr makes it easy to read many types of tabular data:

  • Delimited files with read_delim(), read_csv(), read_tsv(), and read_csv2().
  • Fixed width files with read_fwf(), and read_table().
  • Web log files with read_log().

You can install it by running:

install.packages("readr")

Compared to the equivalent base functions, readr functions are around 10x faster. They’re also easier to use because they’re more consistent, they produce data frames that are easier to use (no more stringsAsFactors = FALSE!), they have a more flexible column specification, and any parsing problems are recorded in a data frame. Each of these features is described in more detail below.

Input

All readr functions work the same way. There are four important arguments:

  • file gives the file to read; a url or local path. A local path can point to a zipped, bzipped, xzipped, or gzipped file – it’ll be automatically uncompressed in memory before reading. You can also pass in a connection or a raw vector.

    For small examples, you can also supply literal data: if file contains a new line, then the data will be read directly from the string. Thanks to data.table for this great idea!

    library(readr)
    read_csv("x,y\n1,2\n3,4")
    #>   x y
    #> 1 1 2
    #> 2 3 4
  • col_names: describes the column names (equivalent to header in base R). It has three possible values:
    • TRUE will use the first row of data as column names.
    • FALSE will number the columns sequentially.
    • A character vector to use as column names.
  • col_types: overrides the default column types (equivalent to colClasses in base R). More on that below.
  • progress: By default, readr will display a progress bar if the estimated loading time is greater than 5 seconds. Use progress = FALSE to suppress the progress indicator.
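A sketch combining these arguments on literal data:

read_csv("1,2\n3,4", col_names = c("x", "y"), col_types = "ii", progress = FALSE)
#>   x y
#> 1 1 2
#> 2 3 4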

Output

The output has been designed to make your life easier:

  • Characters are never automatically converted to factors (i.e. no more stringsAsFactors = FALSE!).
  • Column names are left as is, not munged into valid R identifiers (i.e. there is no check.names = TRUE). Use backticks to refer to variables with unusual names, e.g. df$`Income ($000)`.
  • The output has class c("tbl_df", "tbl", "data.frame") so if you also use dplyr you’ll get an enhanced print method (i.e. you’ll see just the first ten rows, not the first 10,000!).
  • Row names are never set.

Column types

Readr heuristically inspects the first 100 rows to guess the type of each column. This is not perfect, but it’s fast and it’s a reasonable start. Readr can automatically detect these column types:

  • col_logical() [l], contains only T, F, TRUE or FALSE.
  • col_integer() [i], integers.
  • col_double() [d], doubles.
  • col_euro_double() [e], “Euro” doubles that use , as the decimal separator.
  • col_date() [D]: Y-m-d dates.
  • col_datetime() [T]: ISO8601 date times.
  • col_character() [c], everything else.

You can manually specify other column types:

  • col_skip() [_], don’t import this column.
  • col_date(format) and col_datetime(format, tz), dates or date times parsed with given format string. Dates and times are rather complex, so they’re described in more detail in the next section.
  • col_numeric() [n], a sloppy numeric parser that ignores everything apart from 0-9, - and . (this is useful for parsing currency data).
  • col_factor(levels, ordered), parse a fixed set of known values into a (optionally ordered) factor.

There are two ways to override the default choices with the col_types argument:

  • Use a compact string: "dc__d". Each letter corresponds to a column, so this specification means: read the first column as a double, the second as a character, skip the next two, and read the last column as a double. (There’s no way to use this form with column types that need parameters.)
  • With a (named) list of col objects:
    read_csv("iris.csv", col_types = list(
      Sepal.Length = col_double(),
      Sepal.Width = col_double(),
      Petal.Length = col_double(),
      Petal.Width = col_double(),
      Species = col_factor(c("setosa", "versicolor", "virginica"))
    ))

    Any omitted columns will be parsed automatically, so the previous call is equivalent to:

    read_csv("iris.csv", col_types = list(
      Species = col_factor(c("setosa", "versicolor", "virginica"))
    )
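    For comparison, a compact-string sketch of a similar call (parameterised types like the factor above can’t be expressed this way, so Species falls back to character):

    read_csv("iris.csv", col_types = "ddddc")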

Dates and times

One of the most helpful features of readr is its ability to import dates and date times. It can automatically recognise the following formats:

  • Dates in year-month-day form: 2001-10-20 or 2010/10/15 (or any non-numeric separator). It can’t automatically recognise dates in m/d/y or d/m/y format because they’re ambiguous: is 02/01/2015 the 2nd of January or the 1st of February?
  • Date times in ISO8601 form: e.g. 2001-02-03 04:05:06.07 -0800, 20010203 040506, 20010203 etc. I don’t support every possible variant yet, so please let me know if it doesn’t work for your data (more details in ?parse_datetime).

If your dates are in another format, don’t despair. You can use col_date() and col_datetime() to explicitly specify a format string. Readr implements its own strptime() equivalent which supports the following format strings:

  • Year: %Y (4 digits). %y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
  • Month: %m (2 digits), %b (abbreviated name in current locale), %B (full name in current locale).
  • Day: %d (2 digits), %e (optional leading space).
  • Hour: %H.
  • Minutes: %M.
  • Seconds: %S (integer seconds), %OS (partial seconds).
  • Time zone: %Z (as name, e.g. America/Chicago), %z (as offset from UTC, e.g. +0800).
  • Non-digits: %. skips one non-digit character, %* skips any number of non-digit characters.
  • Shortcuts: %D = %m/%d/%y, %F = %Y-%m-%d, %R = %H:%M, %T = %H:%M:%S, %x = %y/%m/%d.

To practice parsing date times without having to load a file each time, you can use parse_datetime() and parse_date():

parse_date("2015-10-10")
#> [1] "2015-10-10"
parse_datetime("2015-10-10 15:14")
#> [1] "2015-10-10 15:14:00 UTC"

parse_date("02/01/2015", "%m/%d/%Y")
#> [1] "2015-02-01"
parse_date("02/01/2015", "%d/%m/%Y")
#> [1] "2015-01-02"

Problems

If there are any problems parsing the file, the read_ function will throw a warning telling you how many problems there are. You can then use the problems() function to access a data frame that gives information about each problem:

csv <- "x,y
1,a
b,2
"

df <- read_csv(csv, col_types = "ii")
#> Warning: 2 problems parsing literal data. See problems(...) for more
#> details.
problems(df)
#>   row col   expected actual
#> 1   1   2 an integer      a
#> 2   2   1 an integer      b
df
#>    x  y
#> 1  1 NA
#> 2 NA  2

Helper functions

Readr also provides a handful of other useful functions:

  • read_lines() works the same way as readLines(), but is a lot faster.
  • read_file() reads a complete file into a string.
  • type_convert() attempts to coerce all character columns to their appropriate type. This is useful if you need to do some manual munging (e.g. with regular expressions) to turn strings into numbers. It uses the same rules as the read_* functions.
  • write_csv() writes a data frame out to a csv file. It’s quite a bit faster than write.csv() and it never writes row.names. It also escapes " embedded in strings in a way that read_csv() can read.
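A quick sketch of type_convert() on hand-built character columns:

df <- data.frame(x = c("1", "2"), y = c("1.5", "2.5"), stringsAsFactors = FALSE)
type_convert(df)
#>   x   y
#> 1 1 1.5
#> 2 2 2.5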

Development

Readr is still under very active development. If you have problems loading a dataset, please try the development version, and if that doesn’t work, file an issue.


We’ve added a new cheatsheet to our collection. Data Visualization with ggplot2 describes how to build a plot with ggplot2 and the grammar of graphics. You will find helpful reminders of how to use:

  • geoms
  • stats
  • scales
  • coordinate systems
  • facets
  • position adjustments
  • legends, and
  • themes

The cheatsheet also documents tips on zooming.

Download the cheatsheet here.

Bonus – Frans van Dunné of Innovate Online has provided Spanish translations of the Data Wrangling, R Markdown, Shiny, and Package Development cheatsheets. Download them at the bottom of the cheatsheet gallery.


We’ve added a new cheatsheet to our collection! Package Development with devtools will help you find the most useful functions for building packages in R. The cheatsheet will walk you through the steps of building a package from:

  • Setting up the package structure
  • Adding a DESCRIPTION file
  • Writing code
  • Writing tests
  • Writing documentation with roxygen
  • Adding data sets
  • Building a NAMESPACE, and
  • Including vignettes

The sheet focuses on Hadley Wickham’s devtools package, and it is a useful supplement to Hadley’s book R Packages, which you can read online at r-pkgs.had.co.nz.

Download the sheet here.

Bonus – Vivian Zhang of SupStat Analytics has kindly translated the existing Data Wrangling, R Markdown, and Shiny cheatsheets into Chinese. You can download the translations at the bottom of the cheatsheet gallery.

I’m pleased to announce that the new haven package is now available on CRAN. Haven makes it easy to read data from SAS, SPSS and Stata. Haven has the same goal as the foreign package, but it:

  • Can read binary SAS7BDAT files.
  • Can read Stata 13 files.
  • Always returns a data frame.

(Haven also has experimental support for writing SPSS and Stata data. This still has some rough edges but please try it out and report any problems that you find.)

Haven is a binding to the excellent ReadStat C library by Evan Miller. Haven wouldn’t be possible without his hard work – thanks Evan! I’d also like to thank Matt Shotwell, who spent a lot of time reverse engineering the SAS binary data format, and Dennis Fisher, who tested the SAS code with thousands of SAS files.

Usage

Using haven is easy:

  • Install it, install.packages("haven"),
  • Load it, library(haven),
  • Then pick the appropriate read function:
    • SAS: read_sas()
    • SPSS: read_sav() or read_por()
    • Stata: read_dta().

These functions only need a path to the file. (read_sas() optionally also takes the path to a catalog file.)
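For example (with hypothetical file paths):

library(haven)
sas   <- read_sas("data/survey.sas7bdat")
spss  <- read_sav("data/survey.sav")
stata <- read_dta("data/survey.dta")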

Output

All functions return a data frame:

  • The output also has class tbl_df which will improve the default print method (to only show the first ten rows and the variables that fit on one screen) if you have dplyr loaded. If you don’t use dplyr, it has no effect.
  • Variable labels are attached as an attribute to each variable. These are not printed (because they tend to be long), but if you have a preview version of RStudio, you’ll see them in the revamped viewer pane.
  • Missing values in numeric variables should be seamlessly converted. Missing values in character variables are converted to the empty string, "": if you want to convert them to missing values, use zap_empty().
  • Dates are converted into Dates, and datetimes to POSIXcts. Time variables are read into a new class called hms, which represents an offset in seconds from midnight. It has print() and format() methods to nicely display times, but otherwise behaves like an integer vector.
  • Variables with labelled values are turned into a new labelled class, as described next.

Labelled variables

SAS, Stata and SPSS all have the notion of a “labelled” variable. These are similar to factors, but:

  • Integer, numeric and character vectors can be labelled.
  • Not every value must be associated with a label.

Factors, by contrast, are always integers and every integer value must be associated with a label.

Haven provides a labelled class to model these objects. It doesn’t implement any common methods, but instead focuses on ways to turn a labelled variable into a standard R variable:

  • as_factor(): turns labelled integers into factors. Any values that don’t have a label associated with them will become a missing value. (NB: there’s no way to make as.factor() work with labelled variables, so you’ll need to use this new function.)
  • zap_labels(): turns any labelled values into missing values. This deals with the common pattern where you have a continuous variable whose missing values are indicated by sentinel values.
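A small sketch of both helpers (the labelled() constructor and the data are assumptions for illustration; zap_labels() behaviour as described above):

x <- labelled(c(1, 2, 1, 9), c(male = 1, female = 2, refused = 9))
as_factor(x)
#> [1] male    female  male    refused
#> Levels: male female refused

income <- labelled(c(100, 250, 9999, 180), c(refused = 9999))
zap_labels(income)
#> [1] 100 250  NA 180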

If you have a use case that’s not covered by these functions, please let me know.

Development

Haven is still under very active development. If you have problems loading a dataset, please try the development version, and if that doesn’t work, file an issue.
