I’m pleased to announce that readr is now available on CRAN. Readr makes it easy to read many types of tabular data:
- Delimited files with
- Fixed width files with
- Web log files with
You can install it by running:
Compared to the equivalent base functions, readr functions are around 10x faster. They’re also easier to use because they’re more consistent, they produce data frames that are easier to use (no more `stringsAsFactors = FALSE`!), they have a more flexible column specification, and any parsing problems are recorded in a data frame. Each of these features is described in more detail below.
All readr functions work the same way. There are four important arguments:
`file` gives the file to read; a URL or local path. A local path can point to a zipped, bzipped, xzipped, or gzipped file; it’ll be automatically uncompressed in memory before reading. You can also pass in a connection or a raw vector.
For small examples, you can also supply literal data: if `file` contains a newline, then the data will be read directly from the string. Thanks to data.table for this great idea!
```r
library(readr)
read_csv("x,y\n1,2\n3,4")
#>   x y
#> 1 1 2
#> 2 3 4
```
`col_names`: describes the column names (equivalent to `header` in base R). It has three possible values:
- `TRUE` will use the first row of data as column names.
- `FALSE` will number the columns sequentially.
- A character vector to use as column names.
`col_types`: overrides the default column types (equivalent to `colClasses` in base R). More on that below.
`progress`: By default, readr will display a progress bar if the estimated loading time is greater than 5 seconds. Use `progress = FALSE` to suppress the progress indicator.
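As a minimal sketch of the three `col_names` options, the snippet below reads the same kind of two-row literal input each way. It assumes a recent readr release, where literal data is wrapped in `I()`; the column names `a` and `b` are made up for illustration.

```r
library(readr)

# TRUE (the default): the first row supplies the column names.
# I() marks the string as literal data rather than a path
# (an assumption about recent readr versions).
with_header <- read_csv(I("x,y\n1,2\n3,4\n"), col_names = TRUE)

# FALSE: the first row is treated as data and the columns are numbered.
no_header <- read_csv(I("1,2\n3,4\n"), col_names = FALSE)

# A character vector: these names are used and the first row is data.
named <- read_csv(I("1,2\n3,4\n"), col_names = c("a", "b"))

names(with_header)
names(no_header)
names(named)
```

All three calls return two rows; only the source of the column names differs.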
The output has been designed to make your life easier:
- Characters are never automatically converted to factors (i.e. no more `stringsAsFactors = FALSE`!).
- Column names are left as is, not munged into valid R identifiers (i.e. there is no `check.names = TRUE`). Use backticks to refer to variables with unusual names, e.g.
- The output has class `c("tbl_df", "tbl", "data.frame")`, so if you also use dplyr you’ll get an enhanced print method (i.e. you’ll see just the first ten rows, not the first 10,000!).
- Row names are never set.
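The properties listed above can be checked on a small input. This is a sketch only: `my var` and `label` are made-up column names, and `I()` for literal data assumes a recent readr release.

```r
library(readr)

# "my var" is a made-up column name that is not a valid R identifier.
df <- read_csv(I("my var,label\n1,a\n2,b\n"))

names(df)               # the name is kept as is, not munged to my.var
df$`my var`             # backticks refer to the unusual name
is.character(df$label)  # strings stay character rather than factor
inherits(df, "tbl_df")  # the result carries the tbl_df class
```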
Readr heuristically inspects the first 100 rows to guess the type of each column. This is not perfect, but it’s fast and it’s a reasonable start. Readr can automatically detect these column types:
- `col_logical()` [l], contains only
- `col_euro_double()` [e], “Euro” doubles that use `,` as the decimal separator.
- `col_date()` [D]: Y-m-d dates.
- `col_datetime()` [T]: ISO8601 date times.
- `col_character()` [c], everything else.
You can manually specify other column types:
- `col_skip()` [_], don’t import this column.
- `col_datetime(format, tz)`, dates or date times parsed with given format string. Dates and times are rather complex, so they’re described in more detail in the next section.
- `col_numeric()` [n], a sloppy numeric parser that ignores everything apart from 0-9, `.` (this is useful for parsing currency data).
- `col_factor(levels, ordered)`, parse a fixed set of known values into an (optionally ordered) factor.
There are two ways to override the default choices with the
- Use a compact string: `"dc__d"`. Each character corresponds to a column, so this specification means: read the first column as double, the second as character, skip the next two, and read the last column as a double. (There’s no way to use this form with column types that need parameters.)
- With a (named) list of col objects:
```r
read_csv("iris.csv", col_types = list(
  Sepal.Length = col_double(),
  Sepal.Width = col_double(),
  Petal.Length = col_double(),
  Petal.Width = col_double(),
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))
```
Any omitted columns will be parsed automatically, so the previous call is equivalent to:
```r
read_csv("iris.csv", col_types = list(
  Species = col_factor(c("setosa", "versicolor", "virginica"))
))
```
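For comparison, here is a small sketch of the compact string form on made-up data. The five column names and values are hypothetical, and wrapping the literal data in `I()` assumes a recent readr release.

```r
library(readr)

# Five made-up columns; "dc__d" supplies one character per column:
# double, character, skip, skip, double.
txt <- I("a,b,c,d,e\n1.5,x,skip1,skip2,2\n3.5,y,skip3,skip4,4\n")

df <- read_csv(txt, col_types = "dc__d")

names(df)  # columns c and d are dropped by the two underscores
```

The compact form is convenient when every column uses a parameter-free type; a column such as a factor with known levels still needs the list form.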
Dates and times
One of the most helpful features of readr is its ability to import dates and date times. It can automatically recognise the following formats:
- Dates in year-month-day form: `2010/10/15` (or any non-numeric separator). It can’t automatically recognise dates in m/d/y or d/m/y format because they’re ambiguous: is `02/01/2015` the 2nd of January or the 1st of February?
- Date times in ISO8601 form, e.g. `2001-02-03 04:05:06.07 -0800`, `20010203` etc. I don’t support every possible variant yet, so please let me know if it doesn’t work for your data (more details in
If your dates are in another format, don’t despair. You can use `col_datetime()` to explicitly specify a format string. Readr implements its own `strptime()` equivalent, which supports the following format strings:
`%y` (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
`%b` (abbreviated name in current locale),
`%B` (full name in current locale).
`%e` (optional leading space)
- Time zone:
`%Z` (as name, e.g.
`%z` (as offset from UTC, e.g.
`%.` skips one non-digit character,
`%*` skips any number of non-digit characters.
To practice parsing date times without having to load the file each time, you can use
```r
parse_date("2015-10-10")
#> [1] "2015-10-10"
parse_datetime("2015-10-10 15:14")
#> [1] "2015-10-10 15:14:00 UTC"
parse_date("02/01/2015", "%m/%d/%Y")
#> [1] "2015-02-01"
parse_date("02/01/2015", "%d/%m/%Y")
#> [1] "2015-01-02"
```
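A few of the format strings described above can be tried the same way. This is a sketch with made-up inputs; it assumes month names are matched against English, which is the default in recent readr releases.

```r
library(readr)

# %y: two-digit years 00-69 map to 2000-2069, 70-99 to 1970-1999.
recent <- parse_date("01/02/15", "%m/%d/%y")
older  <- parse_date("01/02/70", "%m/%d/%y")

# %b matches an abbreviated month name; %* skips any number of
# non-digit characters (here the comma and the space before the time).
stamp <- parse_datetime("10 Feb 2015, 14:30", "%d %b %Y%*%H:%M")

format(recent)
format(older)
format(stamp, "%Y-%m-%d %H:%M", tz = "UTC")
```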
If there are any problems parsing the file, the `read_` function will throw a warning telling you how many problems there are. You can then use the `problems()` function to access a data frame that gives information about each problem:
```r
csv <- "x,y
1,a
b,2
"

df <- read_csv(csv, col_types = "ii")
#> Warning: 2 problems parsing literal data. See problems(...) for more
#> details.
problems(df)
#>   row col   expected actual
#> 1   1   2 an integer      a
#> 2   2   1 an integer      b
df
#>    x  y
#> 1  1 NA
#> 2 NA  2
```
Readr also provides a handful of other useful functions:
- `read_lines()` works the same way as `readLines()`, but is a lot faster.
- `read_file()` reads a complete file into a string.
- `type_convert()` attempts to coerce all character columns to their appropriate type. This is useful if you need to do some manual munging (e.g. with regular expressions) to turn strings into numbers. It uses the same rules as the
- `write_csv()` writes a data frame out to a csv file. It’s quite a bit faster than `write.csv()` and it never writes row.names. It also escapes `"` embedded in strings in a way that