Scraping IMDB for the Second Best TV Show Ever

Web Scraping IMDB

It is common knowledge that The Wire by David Simon is the best TV show in existence, which raises the question: what comes second? In this blog post, I’ll demonstrate how to find similar shows on IMDB using web scraping.

Here is the link to my GitHub repo hosting the files for my scraper. https://github.com/justinlaicy926/IMDB-Scaper

This project uses Scrapy, a Python library that makes it convenient to build your own custom scraper. In Scrapy, a spider object is what crawls and scrapes the web pages. Our ImdbSpider class will have three parse methods.

The working assumption is that the show sharing the most actors with The Wire will be of matching quality, so we will compile a table of all other shows with shared actors. After that’s done, we’ll figure out which show is the second best.

After installing Scrapy, run this command in a terminal to create an empty Scrapy project.

scrapy startproject IMDB_scraper

We will start from the main page of The Wire. https://www.imdb.com/title/tt0306414/

#make sure to import scrapy at the start of imdb_spider.py
import scrapy

class ImdbSpider(scrapy.Spider):
    #the name we will later pass to the "scrapy crawl" command
    name = 'imdb_spider'

    #set the start URL to the main page of The Wire
    start_urls = ['https://www.imdb.com/title/tt0306414/']

We will then implement three parse methods on the spider to scrape the main page for the show, the cast page, and each individual actor page. Let’s start with the main parse method.

    def parse(self, response):
        """
        Starts at the show's main IMDB page, builds the URL of the Full Credits page, and hands it to parse_full_credits
        """

        #builds the URL of the Full Credits page
        cast_link = response.urljoin("fullcredits/")

        #requests that page and parses the response with parse_full_credits
        yield scrapy.Request(cast_link, callback=self.parse_full_credits)

This method starts from the show’s main page and uses the response’s urljoin method to build the URL of the Full Credits page, where our second method takes over. This works because IMDB URLs follow a very consistent structure, so the same code carries over to other shows of your liking once you change the start URL.
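
To see what the URL joining does without running a crawl, here is a minimal sketch using urljoin from Python’s standard library. As far as I understand, Scrapy’s response.urljoin follows the same resolution rules with the response’s URL as the base, but treat that as an assumption rather than a statement about Scrapy’s internals. The actor id nm0000000 is a made-up placeholder.

```python
from urllib.parse import urljoin

show_url = "https://www.imdb.com/title/tt0306414/"

# A relative path is appended when the base URL ends with a slash.
cast_url = urljoin(show_url, "fullcredits/")

# Without the trailing slash, the last path segment is replaced instead.
no_slash_url = urljoin("https://www.imdb.com/title/tt0306414", "fullcredits/")

# A root-relative href (placeholder actor id) is resolved against the site root.
actor_url = urljoin(cast_url, "/name/nm0000000/")

print(cast_url)
print(no_slash_url)
print(actor_url)
```

The second case is why the trailing slash in the start URL matters: drop it and the show id disappears from the joined URL.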

Now that we are on the Full Credits page, let’s implement the parse_full_credits method to get to every actor’s individual page.

    def parse_full_credits(self, response):
        """
        Starts at the Full Credits page, collects the link to every actor (crew not included), then calls parse_actor_page on each actor page
        """

        #a list of relative paths, one for each actor
        rel_paths = [a.attrib["href"] for a in response.css("td.primary_photo a")]

        #crawls each link
        for path in rel_paths:
            actor_link = response.urljoin(path)
            yield scrapy.Request(actor_link, callback=self.parse_actor_page)

This method collects the relative link to each actor’s individual page in the rel_paths list. It then uses urljoin again to turn each link into a full URL and requests that page, with our third method as the callback that finally compiles all of the actor’s works.

Our third and final method will be parse_actor_page, which compiles a list of all works by one actor.

    def parse_actor_page(self, response):
        """
        Crawls each actor page and compiles every work that actor has starred in
        """

        #selects actor name
        actor_name = response.css("span.itemprop::text").get()

        #selects the work from the actor page
        movie_or_TV_name = response.css("div.filmo-row b")
        for movie in movie_or_TV_name:
            yield {"actor" : actor_name, "movie_or_TV_name" : movie.css("a::text").get()}

This method scrapes the actor’s name and the list of their works, then yields one dictionary entry for each work.

With the three methods finished, we can run our spider from the project directory using this command.

scrapy crawl imdb_spider -o results.csv
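
To make the shape of the output concrete, here is a rough sketch of how yielded dictionaries turn into CSV rows: the keys become the header and each dictionary becomes one line. This is not Scrapy’s exporter, just an imitation with the standard csv module. The names fake_actor_items and items_to_csv are hypothetical helpers, and the sample rows are taken from the results shown further down.

```python
import csv
import io


def fake_actor_items():
    """Stand-in for parse_actor_page: yields one dict per credited work."""
    actor_name = "Lance Reddick"
    for title in ["The Nanny", "Swift Justice", "New York Undercover"]:
        yield {"actor": actor_name, "movie_or_TV_name": title}


def items_to_csv(items):
    """Writes the items to an in-memory CSV and returns it as a string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["actor", "movie_or_TV_name"],
        lineterminator="\n",
    )
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buffer.getvalue()


csv_text = items_to_csv(fake_actor_items())
print(csv_text)
```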

We now have a CSV file with more than 18,000 rows. Let’s now perform some data analysis to find out what show comes close to the timeless masterpiece that is The Wire.

import pandas as pd
#read our CSV file into a pandas dataframe
df = pd.read_csv("results.csv")

#inspect our dataframe
df
       actor             movie_or_TV_name
0      Thuliso Dingwall  The Mechanics Rose
1      Thuliso Dingwall  Person of Interest
2      Thuliso Dingwall  Unforgettable
3      Thuliso Dingwall  Ex$pendable
4      Thuliso Dingwall  Toe to Toe
...    ...               ...
18313  Lance Reddick     Great Expectations
18314  Lance Reddick     What the Deaf Man Heard
18315  Lance Reddick     The Nanny
18316  Lance Reddick     Swift Justice
18317  Lance Reddick     New York Undercover

18318 rows × 2 columns

We are interested in finding out which movie or TV show shares the most actors with The Wire. Let’s use the groupby function for this purpose.

df = df.groupby(["movie_or_TV_name"])["actor"].aggregate(["count"])
df.head()

                       count
movie_or_TV_name
#BlackGirlhood         1
#Like                  1
#Lucky Number          1
#MoreLife              1
#PrettyPeopleProblems  1

df = df.sort_values("count", ascending=False)
df.head()

                                   count
movie_or_TV_name
The Wire                           758
Homicide: Life on the Street       120
Law & Order                        104
Law & Order: Special Victims Unit  102
Veep                               74

Unsurprisingly, The Wire itself shares the most actors with The Wire. We’ll be focusing on the runners-up.
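
One caveat worth a quick sketch: count tallies rows, so an actor who appears more than once under the same title would be counted more than once. I have not checked whether that happens in results.csv, so take this as a possible refinement rather than a correction. The example below uses a toy DataFrame with made-up actors and shows, counts distinct actors with nunique, and drops the source show before ranking.

```python
import pandas as pd

# Toy stand-in for results.csv; every name here is made up.
toy = pd.DataFrame(
    {
        "actor": ["Actor A", "Actor A", "Actor B", "Actor B", "Actor C", "Actor C"],
        "movie_or_TV_name": ["Show X", "Show X", "Show X", "Show Y", "Show Y", "Show Z"],
    }
)

by_title = toy.groupby("movie_or_TV_name")["actor"]

# Number of rows per title (what "count" measures).
row_counts = by_title.count()

# Number of distinct actors per title.
distinct_actors = by_title.nunique()

# Remove the show we started from, then rank the rest.
source_show = "Show X"
ranking = distinct_actors.drop(source_show).sort_values(ascending=False)

print(row_counts)
print(ranking)
```

Here Actor A is listed twice under Show X, so the row count for Show X is one higher than its number of distinct actors.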

The next three are close: Homicide: Life on the Street (120) edges out Law & Order (104) and Law & Order: Special Victims Unit (102). All three belong to the same genre as The Wire: crime drama.

Here is a visualization of how many actors each show shares with The Wire. Notice how most shows on the plot share very few. This reflects David Simon’s casting choices: he prefers actors who organically resemble their roles over big names, and he worked with people who had little or no acting experience but an actual background on the Baltimore streets.

import plotly.express as px
from plotly.io import write_html

#box plot of the per-show counts
fig = px.box(df, y="count", labels={"count": "Number of Shared Actors with The Wire"})
fig.show()

#save the interactive plot as an HTML file
write_html(fig, "box.html")

This is a fairly extreme distribution, with most shows bearing little resemblance to The Wire’s cast. Again, this shows how special The Wire really is. No movie or TV show ever comes close to its excellence.

Written on April 26, 2022