My friend Randy Olson and I got into the habit of arguing about the relative qualities of our favorite languages for data analysis and visualization. I am an enthusiastic R user (www.r-project.org) while Randy is a fan of Python (www.python.org). One thing we agree on, however, is that our discussions are meaningless unless we actually put R and Python to a series of tests to showcase their relative strengths and weaknesses. Essentially, we set a common goal (e.g., perform a particular type of data analysis or draw a particular type of graph) and write the R and Python code to achieve it. And since Randy and I are all about sharing, open source and open access, we decided to make the results of our friendly challenges public so that you can help us decide between R and Python and, hopefully, also learn something along the way.
Today’s challenge: a data thief manual for honest scientists (Part 2 of 2)
1 - Introduction
Last time we showed you how to scrape data from www.MovieBodyCounts.com. Today, we will finish what we started by retrieving additional information from www.imdb.com. In particular, we will attempt to recover the following pieces of information for each of the movies we collected last time: MPAA rating, genre(s), director(s), duration in minutes, IMDb rating, and full cast. We will detail the different steps of the process and provide the corresponding code for each step (red boxes for R, green boxes for Python). You will also find the complete code at the end of this document.
If you think there’s a better way to code this in either language, open a pull request on our GitHub repository or leave a note with suggestions in the comments below.
2 - Step by step process
First things first, let’s set up our working environment by loading some necessary libraries.
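The original post's code boxes are not reproduced here, so as a sketch, here is what a minimal Python setup for this challenge might look like. On Randy's side the real script would also import IMDbPY (typically `from imdb import IMDb`); to keep this snippet self-contained we stick to the standard library.

```python
# Minimal setup sketch (assumed, not the post's actual code box).
# IMDbPY would normally be imported here as well; we keep this
# snippet dependency-free with the standard library only.
import csv   # reading the part-1 data and writing the final .csv
import re    # extracting the IMDb ID from each URL
from urllib.request import urlopen  # fetching IMDb pages directly
```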
Randy is lucky today. Someone else has already written a package (‘IMDbPY’) to scrape data from IMDb. Unfortunately for me, R users are too busy working with serious data sets to take the time to write such a package for my favorite data processing language. Hadley Wickham has included a ‘movies’ data set in the ggplot2 package that contains some of the information stored on IMDb, but some of the pieces we need for today’s challenge are missing.
Since I am not easily discouraged, I decided to write my own IMDb scraping function (see below). It is not as sophisticated as the Python package Randy is using today, but it does the job until someone else decides to write a more complete R/IMDb package. As you will see, I am using the same scraping technique (XPath) as the one I used in the first part of the challenge.
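To give a flavor of the scraping approach in Python without pulling in lxml or IMDbPY, here is a toy parser in the spirit of the technique: walk the page's HTML and collect the text of elements matching a pattern. The HTML snippet and the `class="rating"` attribute below are illustrative, not IMDb's actual markup.

```python
from html.parser import HTMLParser

# A toy scraper sketch: collect the text of every <span class="rating">.
# The markup below is made up for illustration; IMDb's real pages differ.
class RatingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_rating = False
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "rating") in attrs:
            self.in_rating = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_rating = False

    def handle_data(self, data):
        if self.in_rating:
            self.ratings.append(data.strip())

sample = '<div><span class="rating">8.1</span></div>'
parser = RatingParser()
parser.feed(sample)
print(parser.ratings)  # -> ['8.1']
```

An XPath query like `//span[@class="rating"]/text()` expresses the same selection declaratively, which is why the R code leans on XPath instead of a hand-written parser.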
Randy and I now have a working IMDb scraper. We can start collecting and organizing the data that we need.
First, let’s load the data we collected last time.
Then, we will extract each movie’s IMDb ID from the IMDb URL we collected last week. It’s easy: it’s the only number in the URL.
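Since the ID is the only run of digits in the URL, a single regular expression does the job. A minimal sketch (the example URL is for illustration, not necessarily one from our data set):

```python
import re

# The IMDb URLs collected in part 1 look like
# "http://www.imdb.com/title/tt0076759/" (illustrative example).
# The movie ID is the only run of digits in the string.
def imdb_id(url):
    match = re.search(r"\d+", url)
    return match.group(0) if match else None

print(imdb_id("http://www.imdb.com/title/tt0076759/"))  # -> 0076759
```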
Now that this is done, we will simply let the IMDb scraper collect the data we want and we will append it to the data from the first part of the challenge.
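The collect-and-append step can be sketched as a loop over the part-1 rows, merging in whatever the scraper returns. Here `scrape_movie` is a stub standing in for the real scraper (IMDbPY on the Python side), so the example runs without touching the network; the field names are assumptions for illustration.

```python
# Sketch of the collect-and-append step; `scrape_movie` is a stub
# standing in for the real IMDb scraper so this runs offline.
def scrape_movie(imdb_id):
    # The real script would fetch and parse the IMDb page here.
    stub = {"0000001": {"rating": 8.1, "runtime": 106, "genres": "Action"}}
    return stub[imdb_id]

# One row of the (hypothetical) data set from part 1 of the challenge.
movies = [{"title": "Example Movie", "imdb_id": "0000001"}]

for row in movies:
    row.update(scrape_movie(row["imdb_id"]))  # append the new columns

print(movies[0]["runtime"])  # -> 106
```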
And finally, all that is left to do is to save the complete data set into a .csv file and close the script.
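Writing the merged data out is a one-liner with pandas, but even the standard library's `csv` module keeps it short. A minimal sketch; the file name and row contents below are assumptions, not the ones from our script:

```python
import csv

# Write the combined data set to a .csv file (file name assumed).
movies = [{"title": "Example Movie", "imdb_id": "0000001", "rating": 8.1}]

with open("movies_full.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(movies[0]))
    writer.writeheader()
    writer.writerows(movies)
```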
That’s it! You should now have a .csv file somewhere on your computer containing all the information we just scraped in both parts of this challenge.
Sorry it took us so long to complete this part, but the beginning of the semester is always a very busy time at the university.
Stay tuned for our next challenge! It will be about fitting a linear regression, running basic diagnostic tests, and plotting the resulting straight line with its confidence interval.