My friend Randy Olson and I got into the habit of arguing about the relative merits of our favorite languages for data analysis and visualization. I am an enthusiastic R user (www.r-project.org), while Randy is a fan of Python (www.python.org). One thing we agree on, however, is that our discussions are meaningless unless we actually put R and Python to a series of tests to showcase their relative strengths and weaknesses. Essentially, we will set a common goal (e.g., perform a particular type of data analysis or draw a particular type of graph) and write the R and Python code to achieve it. And since Randy and I are all about sharing, open source, and open access, we decided to make the results of our friendly challenges public, so that you can help us decide between R and Python and, hopefully, also learn something along the way.
Today’s challenge: a data thief manual for honest scientists (Part 1 of 2)
1 - Introduction
Last week we started our challenge series with a rather simple task: plotting a pretty bar chart from some data collected by Randy for his recent post on the “Top 25 most violence packed films” in the history of the movie industry. Today we will try to up our game a little bit with a more complex task. We will show you how you can collect the data that Randy used for his post directly from the website it originates from (www.MovieBodyCounts.com). This is called data scraping, or the art of taking advantage of the gigantic database that is the Internet.
The basic principle behind scraping website data is simple: a website is like a database, and each page of the website is like a table in that database. All we want is to find, within this database, the tables that contain the information we would like to acquire, and then extract that information from them. This task can be relatively easy if all the pages of a website share a similar structure (i.e., if the database is clean and well maintained). In this ideal situation, all we have to do is identify one or more stable markers that delimit the desired information and use them to tell R or Python what to save in memory. Unfortunately, not all websites have a consistent structure across their pages, and it can quickly become a nightmare to identify such markers. Worse, sometimes you will have to resign yourself to scraping or correcting part or all of the data manually.
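To make the idea of a “stable marker” concrete, here is a toy R example. The HTML snippet is entirely made up (it is not taken from www.MovieBodyCounts.com): as long as the value we want is always wrapped in the same element, a single XPath query retrieves it.

```r
# Toy illustration of a "stable marker". The snippet below is hypothetical:
# if every page wraps the value we want in the same element, one XPath
# query is enough to pull it out.
library(XML)
snippet <- "<html><body><span class='kills'>150</span></body></html>"
doc <- htmlParse(snippet, asText = TRUE)
xpathSApply(doc, "//span[@class='kills']", xmlValue)  # returns "150"
```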
For this challenge, we will attempt to recover the following pieces of information for each movie listed on www.MovieBodyCounts.com: title, release year, count of on-screen deaths, and a link to the movie’s page on www.imdb.com (this will be useful for part 2 of this challenge next week). We will detail each step of the process and provide the corresponding code. You will also find the complete code at the end of this document.
2 - Step-by-step process
First things first, let’s set up our working environment by loading some necessary libraries.
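On the R side, a minimal setup could look like the sketch below; it assumes the RCurl and XML packages are installed (Randy’s Python setup is not sketched here).

```r
# Load the packages used in the rest of this post (R side only).
# Assumption: both packages are already installed; if not, run
# install.packages(c("RCurl", "XML")) first.
library(RCurl)  # getURL(): download the raw HTML of a webpage
library(XML)    # htmlParse(), xpathSApply(): parse HTML and run XPath queries
```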
Now a word about the organization of www.MovieBodyCounts.com. To be perfectly honest, it is a bit messy :-) Movies are organized in a series of alphabetically ordered lists (by the first letter of each movie’s title), each letter having its own page (http://www.moviebodycounts.com/movies-[A-Z].htm). There is also a list for movies whose title starts with a number (http://www.moviebodycounts.com/movies-numbers.htm). Finally, all category letters are capitalized in the lists’ URLs, except for the letters v and x. Annoying, right? This is just one of the many little problems one can encounter when dealing with messy databases :-)
With all this information in mind, our first task is to create a list of all these lists.
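In R, one possible way to build this list is the short sketch below; note the special handling of the lowercase v and x and of the extra “numbers” page mentioned above.

```r
# Build the vector of list URLs: one page per letter, plus the "numbers" page.
# Letters v and x must be lowercase in the URLs (see above).
page.ids <- c(LETTERS, "numbers")
page.ids[page.ids %in% c("V", "X")] <- c("v", "x")
list.urls <- paste0("http://www.moviebodycounts.com/movies-", page.ids, ".htm")
```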
Our next task is to go through the HTML code of all these lists and gather the URLs of all the movie webpages. This is where the data scraping really starts.
As you will quickly notice by reading the following code, Randy and I have each decided to use a different approach to identify and collect the desired URLs (and, indeed, all of the data in the rest of this challenge). I have decided to rely on the XML Path Language (XPath), a language that makes it easy to navigate through the elements and attributes of an XML/HTML document. Randy has decided to use an approach based on more “classical” string parsing and manipulation functions. Note that these are just personal preferences: XPath interpreters are also available in Python, and R is fully equipped for manipulating character strings.
For each movie list, we will…
…download the raw HTML content of the webpage,…
…transform the raw HTML into a more convenient format to work with,…
…find the movie page entries, store their URLs for later use, and close the loop, as sketched below.
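A minimal R/XPath version of this loop could look like the following. The “charts-” filter used to recognize movie page links is an assumption made for illustration; the real link naming on the site may require a different pattern.

```r
# Sketch of the loop over the movie lists (R/XPath flavour).
movie.urls <- c()

for (list.url in list.urls) {
  # ...download the raw HTML content of the webpage,...
  raw.html <- getURL(list.url)

  # ...transform it into a queryable document,...
  doc <- htmlParse(raw.html, asText = TRUE)

  # ...extract the href of every link and keep only the movie pages
  # (the "charts-" pattern is an assumption about the site's link naming).
  all.links <- xpathSApply(doc, "//a/@href")
  movie.urls <- c(movie.urls, grep("charts-", all.links, value = TRUE))
}

# Make the URLs absolute and remove duplicates
movie.urls <- unique(paste0("http://www.moviebodycounts.com/", movie.urls))
```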
Now that we know where to find each movie, we can start the hard part of this challenge. We will go through each movie’s webpage and attempt to find its title, release year, count of on-screen deaths, and the link to its page on www.imdb.com. We will save all this information in a .csv file.
For each movie, we will…
…download the raw HTML content of the webpage and transform it into a more convenient format to work with,…
…attempt to find the movie title,…
…attempt to find the movie’s release year,…
…attempt to find the link to the movie’s page on IMDB,…
…and finally attempt to find the on-screen kill count. Here, Randy chose an approach that minimizes his coding effort but will potentially force him to make several manual corrections a posteriori. I chose to find a solution that works with minimal to no manual corrections, but that requires extra coding effort. Which approach is best depends mostly on the amount of data you want to scrape and the time you have to do it.
Almost done! Now we just need to close the loop and write the data frame to a .csv file, as in the sketch below.
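Here is a minimal, self-contained sketch of the whole per-movie loop in R, ending with write.csv(). The XPath expressions and the output file name are illustrative assumptions, not a faithful description of the site’s markup; expect to adjust them (and to make some of the manual corrections discussed above).

```r
# Sketch of the per-movie loop (R/XPath flavour).
movies <- data.frame()

for (movie.url in movie.urls) {
  # Download and parse the movie page
  doc <- htmlParse(getURL(movie.url), asText = TRUE)

  # Title: assumed here to sit in the page's <title> tag
  title <- xpathSApply(doc, "//title", xmlValue)[1]

  # IMDB link and release year: assumed to come from the first link to
  # imdb.com, with the year appearing in that link's text
  imdb.url  <- xpathSApply(doc, "//a[contains(@href, 'imdb.com')]/@href")[1]
  year.text <- xpathSApply(doc, "//a[contains(@href, 'imdb.com')]", xmlValue)[1]
  year <- as.numeric(gsub("[^0-9]", "", year.text))

  # On-screen kill count: assumed to follow a "Body Count" marker in the text
  kill.text <- xpathSApply(doc, "//*[contains(text(), 'Body Count')]", xmlValue)[1]
  kills <- as.numeric(gsub("[^0-9]", "", kill.text))

  # Append this movie to the data frame and close the loop
  movies <- rbind(movies, data.frame(Film = title, Year = year,
                                     Body_Count = kills, IMDB_URL = imdb.url,
                                     stringsAsFactors = FALSE))
}

# Write the data frame to a .csv file in the working directory
# (the file name is arbitrary)
write.csv(movies, "movie-body-counts.csv", row.names = FALSE)
```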
And voilà! You should now have a .csv file somewhere on your computer containing all the information we just scraped from the website. Not too hard, right?
Keep the .csv file: we will use it again next week to complete this challenge by scraping additional information from www.imdb.com.
Today’s challenge was code- and text-heavy. No pretty pictures to please the eye. So, for all the brave people who made it to the end, here is a cat picture :-)