You are currently browsing the monthly archive for November 2014.
rvest is a new package that makes it easy to scrape (or harvest) data from html web pages, inspired by libraries like beautiful soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces. Install it with:

install.packages("rvest")
rvest in action
To see rvest in action, imagine we’d like to scrape some information about The Lego Movie from IMDB. We start by downloading and parsing the file with html():

library(rvest)
lego_movie <- html("http://www.imdb.com/title/tt1490017/")
To extract the rating, we start with selectorgadget to figure out which css selector matches the data we want: strong span. (If you haven’t heard of selectorgadget, make sure to read vignette("selectorgadget"); it’s the easiest way to determine which selector extracts the data that you’re interested in.) We use html_node() to find the first node that matches that selector, extract its contents with html_text(), and convert it to numeric with as.numeric():

lego_movie %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()
#> [1] 7.9
We use a similar process to extract the cast, using html_nodes() to find all nodes that match the selector:

lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
#>  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"
#>  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
#>  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson"
#> [10] "Will Ferrell"    "Will Forte"      "Dave Franco"
#> [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"
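The difference between html_node() and html_nodes() can be tried offline with a small sketch. The HTML fragment below is made up for illustration (it is not IMDB's markup), and the sketch assumes a later rvest release in which read_html() replaced the html() function used in this post:

```r
library(rvest)

# Hypothetical page fragment, invented for this example.
page <- read_html(paste0(
  "<div id='rating'><strong><span>7.9</span></strong></div>",
  "<ul id='cast'>",
  "<li><span class='name'>Will Arnett</span></li>",
  "<li><span class='name'>Elizabeth Banks</span></li>",
  "</ul>"))

# html_node() keeps only the first node that matches the selector.
rating <- page %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()

# html_nodes() keeps every matching node, so we get one string per cast member.
cast <- page %>%
  html_nodes("#cast .name") %>%
  html_text()
```

Here rating is a single number and cast is a character vector with one element per matching node.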
The titles and authors of recent message board postings are stored in the third table on the page. We can use html_nodes() with [[ to find it, then coerce it to a data frame with html_table():

lego_movie %>%
  html_nodes("table") %>%
  .[[3]] %>%
  html_table()
#>                                              X 1            NA
#> 1 this movie is very very deep and philosophical   mrdoctor524
#> 2 This got an 8.0 and Wizard of Oz got an 8.1...  marr-justinm
#> 3                         Discouraging Building?       Laestig
#> 4                              LEGO - the plural      neil-476
#> 5                                 Academy Awards   browncoatjw
#> 6                    what was the funniest part? actionjacksin
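The same pick-a-table-then-convert pattern can be sketched on an inline document. The two tables and the column names title and author are invented for this example, and read_html() is again assumed in place of the older html():

```r
library(rvest)

# Hypothetical document with two tables; only the second holds the data we want.
page <- read_html(paste0(
  "<table><tr><td>layout</td></tr></table>",
  "<table>",
  "<tr><th>title</th><th>author</th></tr>",
  "<tr><td>LEGO - the plural</td><td>neil-476</td></tr>",
  "<tr><td>Academy Awards</td><td>browncoatjw</td></tr>",
  "</table>"))

# html_nodes() returns all tables; [[ selects one by position.
tables <- page %>% html_nodes("table")

# html_table() turns the chosen table node into a data frame,
# using the <th> row as column names.
posts <- tables[[2]] %>% html_table()
```

Selecting by position is brittle if the page layout changes, so a more specific selector may be preferable when one is available.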
Other important functions
- If you prefer, you can use xpath selectors instead of css: html_nodes(doc, xpath = "//table//td").
- Extract the tag names with html_tag(), text with html_text(), a single attribute with html_attr() or all attributes with html_attrs().
- Detect and repair text encoding problems with guess_encoding() and repair_encoding().
- Navigate around a website as if you’re in a browser with html_session(), jump_to(), follow_link(), back(), and forward(). Extract, modify and submit forms with html_form(), set_values() and submit_form(). (This is still a work in progress, so I’d love your feedback.)
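As a rough illustration of the first two items, the sketch below selects the same cells with a css selector and with an xpath expression, then reads attributes from the links inside them. The HTML is invented for the example, and read_html() is assumed in place of the html() function shown earlier:

```r
library(rvest)

# Hypothetical table with two links, made up for this example.
page <- read_html(paste0(
  "<table><tr>",
  "<td><a href='/a' title='First'>one</a></td>",
  "<td><a href='/b'>two</a></td>",
  "</tr></table>"))

# The same cells, selected with css and with xpath.
css_cells <- page %>% html_nodes("table td") %>% html_text()
xpath_cells <- page %>% html_nodes(xpath = "//table//td") %>% html_text()

links <- page %>% html_nodes("a")

# html_attr() extracts one named attribute from each node.
hrefs <- links %>% html_attr("href")

# html_attrs() returns all attributes of each node; take the first link's.
first_attrs <- html_attrs(links)[[1]]
```

Both selectors return the same cells here, so the choice between css and xpath is mostly a matter of which is easier to write for a given page.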
To see these functions in action, check out package demos with demo(package = "rvest").
RStudio has teamed up with O’Reilly Media to create a new way to learn R!
The Introduction to Data Science with R video course is a comprehensive introduction to the R language. It’s ideal for non-programmers with no data science experience or for data scientists switching to R from Excel, SAS or other software.
Join RStudio Master Instructor Garrett Grolemund as he covers the three skill sets of data science: computer programming (with R), manipulating data sets (including loading, cleaning, and visualizing data), and modeling data with statistical methods. You’ll learn R’s syntax and grammar as well as how to load, save, and transform data, generate beautiful graphs, and fit statistical models to the data.
All of the techniques introduced in this video are motivated by real problems that involve real datasets. You’ll get plenty of hands-on experience with R (and not just hear about it!), and lots of help if you get stuck.
You’ll also learn how to use the ggplot2, reshape2, and dplyr packages.
The course contains over eight hours of instruction. You can access the first hour free from O’Reilly’s website. The course covers the same content as our two-day Introduction to Data Science with R workshop, right down to the same exercises. But unlike our workshops, the videos are self-paced, which can help you learn R in a more relaxed way.
To learn more, visit Introduction to Data Science with R.