Scrapy is a Python framework developed for web scraping. It allows you to build scrapers that retrieve a page’s HTML, parse and process the data, and store it in the file format and location of choice.
Scrapy has a ton of useful features, including built-in caching, asynchronous requests (it is built on top of the Twisted networking framework), user-agent customization, and more. While it is a little trickier to use than Python Requests / BeautifulSoup, it has much more robust functionality and is better suited for large-scale scraping.
It is best practice to use a virtual environment for Scrapy (and all Python) projects in order to manage dependencies and versioning. I like to use Anaconda for this. To install Anaconda, just go to their website and download the application version for your operating system. You can find a detailed installation guide here.
Once that’s installed, you can use the below command to create a development environment for your project.
conda create --name project_name anaconda
To activate the environment, use:
conda activate project_name
And to deactivate the environment, use:
conda deactivate
Finally, you can install Scrapy inside the virtual environment using the below command:
conda install -c conda-forge scrapy
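To confirm the install worked, you can check the version from inside the activated environment (the exact version number will depend on when you install):

scrapy version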
Now that that’s out of the way, the process of creating a Scrapy project is super quick. Running the below command will create a project template for you with all of the necessary files:
scrapy startproject my_project
Navigate into the newly created directory and you should see the below file structure:
├── scrapy.cfg
└── my_project
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
Here’s a quick rundown of what each of the files is for:
- scrapy.cfg — the project’s configuration file, which points Scrapy to your settings module and is used for deployment.
- items.py — where you can define the structure (fields) of the data objects you plan to scrape.
- middlewares.py — hooks into Scrapy’s request/response processing, useful for things like custom headers or proxies.
- pipelines.py — post-processing for scraped items, such as cleaning, validating, or saving them to a database.
- settings.py — project-wide configuration, such as the user agent, caching, and request throttling.
- spiders/ — the folder that will hold your spiders, the classes that do the actual scraping.
Now that that’s set up, it’s time to create a spider, the file that does the actual scraping. To do this, just run the below command with the name of the spider you want to create and the website you are planning to scrape.
scrapy genspider clubs transfermarkt.com
Navigate to the spiders folder and you should see a template containing a Spider class with all of the basic components populated. If you want to target a subpage of the domain you provided, you can go ahead and sub that in for the default start_urls value, like I did below.
import scrapy


class ClubsSpider(scrapy.Spider):
    name = 'clubs'
    allowed_domains = ['transfermarkt.com']
    start_urls = ['https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1']

    def parse(self, response):
        pass
The spider class contains the following attributes:
- name — the unique name of the spider, which you will use to run it from the command line.
- allowed_domains — the domains the spider is allowed to crawl; requests to other domains are filtered out.
- start_urls — the list of URLs the spider will request when it starts.
It also contains a parse function that is called after the spider receives a response from the specified start_urls. Inside this function, you will select the particular bits of data you want to scrape from the website and save them in your chosen format.
Scrapy uses XPaths to define its HTML targets. XPath is a syntax for identifying and navigating to parts of an XML or HTML document.
You can easily find an HTML element’s XPath using the Elements tab in Chrome’s developer tools. Just hover over the element you want to select, right-click, and choose Copy > Copy XPath.
Using Transfermarkt as an example, if I wanted to select the “News” nav bar item, I would get the below XPath:
"//*[@id=main']/header/nav/ul/li[1]/span"
Decoding the above XPath:
- //* — select an element of any type, anywhere in the document...
- [@id='main'] — ...as long as its id attribute equals 'main'.
- /header/nav/ul/li[1] — from that element, step down through its header, nav, and ul children to the first li in the list.
- /span — and finally select the span inside that li.
One important thing to note with XPaths is that they don’t use zero-based indexing like Python, so the first member is index 1.
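As a quick illustration of the difference (the list values here are just hypothetical):

# Python list: zero-based, so index 0 is the first element
nav_items = ['News', 'Transfers', 'Rumours']
first_item = nav_items[0]

# XPath: one-based, so li[1] is the first <li> element
# //*[@id='main']/header/nav/ul/li[1]/span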
You can also use CSS selectors (full documentation here) to grab data from a webpage, but I personally find XPaths to be more useful, especially with websites like Transfermarkt that don’t use a lot of id or class names.
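For comparison, here is roughly what the earlier nav-bar XPath could look like as a CSS selector inside the Scrapy shell (introduced next); this is just illustrative, and I will stick with XPaths for the rest of the article:

response.css("#main > header > nav > ul > li:first-child > span::text").extract()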
One of Scrapy’s many useful features is its built-in shell that allows you to test and debug XPath selectors. Using the Scrapy shell, you can see the output of your selector directly in the terminal instead of having to run your spider to view the results.
The below command opens the Scrapy shell:
scrapy shell
Once inside the shell, you can fetch the page you want to scrape and test out your selectors. In this example, I will scrape some basic information from all of the clubs in the English Premier League.
First we pass the full URL to the fetch function:
fetch('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
Note: if you're fetching a valid URL and receiving a 404 error, try adding USER_AGENT = 'project_name (+http://www.your-website.com)' to the settings.py file in your project. This will identify your spider to the host website and can sometimes be a prerequisite to scraping a webpage.
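In practice, that just means adding a single line to settings.py (substituting your own project name and contact URL):

# settings.py
USER_AGENT = 'my_project (+http://www.your-website.com)'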
Once that's working, we can use Chrome’s developer tools to start selecting the information we want. Let’s start with the club names. The closest ID to the table element we want is yw1, so I will target that ID and then work my way down through the nested tags. It is always best to be as specific as possible and to use ID values when you can, since page structures are prone to change.
response.xpath("//*[@id = 'yw1']/table/tbody/tr/td[2]/a/text()").extract()
You will often use the extract method to get the raw result of the XPath selector. In this case, we get a list with all of the club names:
['Manchester City', 'Chelsea FC', 'Liverpool FC' …]
If you want to extract only the first element with the provided XPath selector, you can use the extract_first method instead.
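For example, reusing the club-name selector from above, extract_first returns a single string instead of a list (in this case, the club at the top of the table):

response.xpath("//*[@id = 'yw1']/table/tbody/tr/td[2]/a/text()").extract_first()
# 'Manchester City'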
Now that we know how to select data with XPaths, let’s finish building our simple spider.
As I mentioned earlier, I personally prefer to define my data object in the spider file itself. For this example, we’ll keep it simple and just scrape the club name and market value for each of the clubs. Adding the below code above the spider will define the shape of our output data:
import scrapy


class ClubsItem(scrapy.Item):
    name = scrapy.Field()
    market_value = scrapy.Field()


class ClubsSpider(scrapy.Spider):
    # ...
Next, we will build out the parse function where we will use the XPaths to scrape the data and use our Scrapy item template to save the data to an item:
import scrapy


class ClubsItem(scrapy.Item):
    name = scrapy.Field()
    market_value = scrapy.Field()


class ClubsSpider(scrapy.Spider):
    name = 'clubs'
    allowed_domains = ['transfermarkt.com']
    start_urls = [
        'https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1']

    def parse(self, response):
        club_rows = response.xpath("//*[@id = 'yw1']/table/tbody/tr")

        for row in club_rows:
            club_name = row.xpath('td[2]/a/text()').extract()[0]
            club_mv = row.xpath('td[7]/a/text()').extract()[0]

            yield ClubsItem({'name': club_name, 'market_value': club_mv})
Here, our spider executes the below steps:
Step 1: Makes a request to the start URL, 'https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1'.
Step 2: After receiving a response, it grabs each club row from the table on the page. Note that we do not extract any values yet, since we only use these row selectors to apply relative XPaths inside the loop.
Step 3: Loops through the rows returned and selects the name of the club and the market value, extracting the text values.
Step 4: Finally, for each row, we create and yield a new ClubsItem with our newly scraped values. Yielding items allows them to be stored in whatever output format we specify when running the spider (CSV, JSON, a database, etc.).
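If you want to clean or validate items before they are written out, that is what pipelines.py is for. As a hypothetical sketch (not part of this demo), a pipeline could tidy up the scraped market value string and then be enabled via the ITEM_PIPELINES setting:

# pipelines.py (hypothetical example)
class CleanMarketValuePipeline:
    def process_item(self, item, spider):
        # strip stray whitespace from the scraped market value, e.g. '€1.05bn'
        item['market_value'] = item['market_value'].strip()
        return item

# settings.py
# ITEM_PIPELINES = {'my_project.pipelines.CleanMarketValuePipeline': 300}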
The final step in this process is to actually run the spider, saving the output information in whatever file format we want. We can do so using the below command:
scrapy crawl clubs -o clubs.json
Once this is run, the terminal will log the scraping process and any errors that you might run into. When the scraping is complete, you will see a new file, clubs.json, in your current directory containing all of the data in JSON format.
To change the output file format, just modify the file extension:
scrapy crawl clubs -o clubs.csv
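If you are on Scrapy 2.1 or newer and would rather configure the output once instead of passing -o every time, you can also define feeds in settings.py; a minimal sketch:

# settings.py
FEEDS = {
    'clubs.json': {'format': 'json'},
}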
To view the full code from this demonstration, check out the GitHub repo here.
This article outlines the bare minimum needed to create a spider, and there’s obviously a lot more to cover for more complex tasks such as crawling multiple pages, handling pagination, downloading images, and so on. I have linked some resources below that go into these more advanced topics.
If you’re curious about how Scrapy is used in a larger scale project, you can read about my experience scraping data for a big football migration project I’m working on here.
Data Visualization with Python and JavaScript, 2nd Edition (Chapter 6), Kyran Dale
A Minimalist End-to-End Scrapy Tutorial, Harry Wang