Scraping IMDB for the Second Best TV Show Ever
Web Scraping IMDB
It is common knowledge that The Wire by David Simon is the best TV show in existence, which raises the question: what comes second? In this blog post, I’ll demonstrate how to find similar shows on IMDB using web scraping.
Here is the link to my GitHub repo hosting the files for my scraper. https://github.com/justinlaicy926/IMDB-Scaper
For this project, we will be using Scrapy, a Python library that makes it convenient to build your own custom scraper. In Scrapy, a spider object is what scrapes the webpage. Our ImdbSpider class will have three parse methods.
Our working assumption is that the show sharing the most actors with The Wire will be of matching quality, so we will compile a table of every other show or movie its actors have appeared in. After that’s done, we’ll figure out which show is the second best.
After installing Scrapy, run this command in a terminal to create an empty Scrapy project.
scrapy startproject IMDB_scraper
We will start from the main page of The Wire. https://www.imdb.com/title/tt0306414/
#imdb_spider.py, saved in the spiders directory of the project
import scrapy
class ImdbSpider(scrapy.Spider):
    #the name used by the scrapy crawl command
    name = "imdb_spider"
    #start from the main page of The Wire
    start_urls = ['https://www.imdb.com/title/tt0306414/']
We will then implement three parse methods in the ImdbSpider class to scrape the show’s main page, the cast page, and each individual actor page. Let’s start with the main parse method.
def parse(self, response):
    """
    Parse method, navigates to the Cast segment of the IMDB page and calls the subsequent function
    """
    #creates new url for the credit page
    cast_link = response.urljoin("fullcredits/")
    #navigates to said page and calls the appropriate function
    yield scrapy.Request(cast_link, callback=self.parse_full_credits)
This method starts from the show’s main page and uses the response’s urljoin method to build the URL of the Full Credits page, where our second method takes over. This works because IMDB’s URLs follow a consistent structure, so the same code can be pointed at any other show of your liking by changing the start URL.
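To see what urljoin does with these URLs, here is a minimal sketch using `urllib.parse.urljoin` from the standard library. I am assuming `response.urljoin` resolves relative paths the same way, with the response’s own URL as the base. The second call shows why the trailing slash in the start URL matters.

```python
from urllib.parse import urljoin

# Base URL taken from start_urls; note the trailing slash.
show_url = "https://www.imdb.com/title/tt0306414/"

# A relative path is resolved against the "directory" of the base URL.
credits_url = urljoin(show_url, "fullcredits/")
print(credits_url)

# Without the trailing slash, the last path segment is replaced instead.
no_slash_url = urljoin("https://www.imdb.com/title/tt0306414", "fullcredits/")
print(no_slash_url)
```

The first call gives the Full Credits URL for the show, while the second drops the show ID entirely, so it is worth keeping the slash when swapping in another show.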
Now that we are on the Full Credits page, let’s implement the parse_full_credits method to get to every actor’s individual page.
def parse_full_credits(self, response):
    """
    Starts at the Full Credits page, collects a link for every actor (crew not included),
    then calls parse_actor_page on each actor's page
    """
    #a list of relative paths, one for each actor
    rel_paths = [a.attrib["href"] for a in response.css("td.primary_photo a")]
    #crawls each link
    for path in rel_paths:
        actor_link = response.urljoin(path)
        yield scrapy.Request(actor_link, callback=self.parse_actor_page)
This function compiles the links to each actor’s individual page in the rel_paths list. It then uses urljoin again to turn each relative path into a full URL and requests that page, with our third method as the callback that finally compiles all of the actor’s works.
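If the `td.primary_photo a` selector feels opaque, the sketch below mimics it on a tiny HTML fragment using only the standard library instead of Scrapy’s selectors. The fragment, the `PhotoLinkParser` class, and the `extract_actor_links` helper are all made up for illustration; IMDB’s real markup may differ or change over time.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical fragment shaped like two rows of a cast table (not copied from IMDB).
SAMPLE_HTML = """
<table>
  <tr>
    <td class="primary_photo"><a href="/name/nm_example_1/"><img src="a.jpg"></a></td>
    <td><a href="/name/nm_example_1/">Actor One</a></td>
  </tr>
  <tr>
    <td class="primary_photo"><a href="/name/nm_example_2/"><img src="b.jpg"></a></td>
    <td><a href="/name/nm_example_2/">Actor Two</a></td>
  </tr>
</table>
"""

class PhotoLinkParser(HTMLParser):
    """Collects href values of <a> tags nested in <td class="primary_photo">."""

    def __init__(self):
        super().__init__()
        self.in_photo_cell = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Track whether the current cell carries the primary_photo class.
            classes = (attrs.get("class") or "").split()
            self.in_photo_cell = "primary_photo" in classes
        elif tag == "a" and self.in_photo_cell and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_photo_cell = False

def extract_actor_links(html, base_url):
    """Return absolute URLs for every link found in a primary_photo cell."""
    parser = PhotoLinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]

links = extract_actor_links(
    SAMPLE_HTML, "https://www.imdb.com/title/tt0306414/fullcredits/"
)
print(links)
```

Only the links inside the photo cells are kept, so each actor is picked up once even though their name cell links to the same page.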
Our third and final method will be parse_actor_page, which compiles a list of all works by one actor.
def parse_actor_page(self, response):
    """
    Crawls each actor page and compiles every work that actor has starred in
    """
    #selects actor name
    actor_name = response.css("span.itemprop::text").get()
    #selects the work from the actor page
    movie_or_TV_name = response.css("div.filmo-row b")
    for movie in movie_or_TV_name:
        yield {"actor": actor_name, "movie_or_TV_name": movie.css("a::text").get()}
This method scrapes the actor’s name first, then selects all of their works and yields one dictionary entry per work.
After finishing the three methods, we can run our spider with this command, which writes every yielded dictionary to results.csv.
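To make the "one dictionary per work" idea concrete, here is a rough sketch of how yielded dictionaries turn into CSV rows. It is not Scrapy’s actual export code: `fake_actor_items` and `items_to_csv` are hypothetical helpers, and the names in the data are made up.

```python
import csv
import io

def fake_actor_items(actor_name, titles):
    """Hypothetical stand-in for parse_actor_page: one dict per credited work."""
    for title in titles:
        yield {"actor": actor_name, "movie_or_TV_name": title}

def items_to_csv(items):
    """Write dicts to CSV text, one row per item, with the keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["actor", "movie_or_TV_name"], lineterminator="\n"
    )
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buffer.getvalue()

# Made-up names, used only to show the shape of the output.
csv_text = items_to_csv(fake_actor_items("Example Actor", ["Show A", "Movie B"]))
print(csv_text)
```

Each actor therefore contributes as many rows as they have credits, which is why the output file grows quickly.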
scrapy crawl imdb_spider -o results.csv
We now have a CSV file with over 18,000 entries. Let’s now perform data analysis to find out what show comes close to the timeless masterpiece that is The Wire.
import pandas as pd
#read our CSV file into a pandas dataframe
df = pd.read_csv("results.csv")
#inspect our dataframe
df
| | actor | movie_or_TV_name |
|---|---|---|
| 0 | Thuliso Dingwall | The Mechanics Rose |
| 1 | Thuliso Dingwall | Person of Interest |
| 2 | Thuliso Dingwall | Unforgettable |
| 3 | Thuliso Dingwall | Ex$pendable |
| 4 | Thuliso Dingwall | Toe to Toe |
| ... | ... | ... |
| 18313 | Lance Reddick | Great Expectations |
| 18314 | Lance Reddick | What the Deaf Man Heard |
| 18315 | Lance Reddick | The Nanny |
| 18316 | Lance Reddick | Swift Justice |
| 18317 | Lance Reddick | New York Undercover |
18318 rows × 2 columns
We are interested in finding out which movie or TV show shares the most actors with The Wire. Let’s use the groupby function to count the entries for each title.
df = df.groupby(["movie_or_TV_name"])["actor"].aggregate(["count"])
df.head()
| movie_or_TV_name | count |
|---|---|
| #BlackGirlhood | 1 |
| #Like | 1 |
| #Lucky Number | 1 |
| #MoreLife | 1 |
| #PrettyPeopleProblems | 1 |
df = df.sort_values("count", ascending=False)
df.head()
| movie_or_TV_name | count |
|---|---|
| The Wire | 758 |
| Homicide: Life on the Street | 120 |
| Law & Order | 104 |
| Law & Order: Special Victims Unit | 102 |
| Veep | 74 |
Unsurprisingly, The Wire itself shares the most actors with The Wire. We’ll be focusing on the runners-up.
Homicide: Life on the Street comes out ahead with a count of 120, followed closely by Law & Order (104) and Law & Order: Special Victims Unit (102), with Veep further back at 74. The top three are all crime dramas, the same genre as The Wire.
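One caveat is that a plain count tallies rows, so an actor who is listed more than once for the same title would be counted more than once. The sketch below is a plain-Python cross-check, not the method used above: it counts distinct actors per title and leaves out the source show. The `rank_by_distinct_actors` helper and the toy rows are made up for illustration.

```python
from collections import Counter

def rank_by_distinct_actors(rows, source_show):
    """Rank titles by number of distinct actors, skipping the source show itself."""
    # A set drops repeated (actor, title) pairs, e.g. one actor credited twice.
    unique_pairs = {(actor, title) for actor, title in rows}
    counts = Counter(title for _, title in unique_pairs if title != source_show)
    # Sort by count descending, then by title, so ties come out in a stable order.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

# Made-up rows in the same (actor, movie_or_TV_name) shape as results.csv.
toy_rows = [
    ("Actor A", "The Wire"), ("Actor A", "Show X"), ("Actor A", "Show X"),
    ("Actor B", "The Wire"), ("Actor B", "Show X"), ("Actor B", "Show Y"),
    ("Actor C", "The Wire"), ("Actor C", "Show Y"), ("Actor C", "Show Z"),
]
ranking = rank_by_distinct_actors(toy_rows, "The Wire")
print(ranking)
```

On the real data, the rows could be read from results.csv with `csv.DictReader`. In pandas, the equivalent distinct count would be `groupby("movie_or_TV_name")["actor"].nunique()`, applied to the DataFrame as first read from the CSV file.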
Here is a visualization of how many actors each title shares with The Wire. Notice how most titles on the plot share very few actors with it. One way to read this is as a reflection of David Simon’s casting choices: he prefers actors who organically resemble their roles to big names. He worked with people who had little or no acting experience but an actual background on the Baltimore streets.
import plotly.express as px
from plotly.io import write_html
#box plot of the number of shared actors per title
fig = px.box(df, y="count",
             labels={"count": "Number of Shared Actors with The Wire"})
fig.show()
#save the plot as a standalone HTML file
write_html(fig, "box.html")
This is a fairly extreme distribution, with most titles bearing little resemblance to The Wire in terms of cast. Again, this shows how special The Wire really is. No movie or TV show ever comes close to its excellence.