Scrapy is a Python framework developed for web scraping. It allows you to build scrapers that retrieve a page’s HTML, parse and process the data, and store it in the file format and location of choice.
Scrapy has a ton of useful features, including built-in caching, asynchronous requests (it is built on top of the Twisted networking framework), user-agent customization, and more. While it is a little trickier to use than Python Requests / BeautifulSoup, it has much more robust functionality and is better suited for large-scale scraping.
It is best practice to use a virtual environment for Scrapy (and all Python) projects in order to manage dependencies and versioning. I like to use Anaconda for this. To install Anaconda, just go to their website and download the application version for your operating system. You can find a detailed installation guide here.
Once that’s installed, you can use the below command to create a development environment for your project.
conda create --name project_name anaconda
To activate the environment, use:
conda activate project_name
And to deactivate the environment, use:
conda deactivate
Finally, you can install Scrapy inside the virtual environment using the below command:
conda install -c conda-forge scrapy
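To confirm the install worked, you can check the version from inside the activated environment (the exact version number will depend on when you install):

scrapy version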
Now that that’s out of the way, the process of creating a Scrapy project is super quick. Running the below command will create a project template for you with all of the necessary files:
scrapy startproject my_project
Navigate into the newly created directory and you should see the below file structure:
├── scrapy.cfg
└── my_project
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
Here’s a quick rundown of what each of the files is for:
- scrapy.cfg — the project’s configuration file, which points Scrapy to your settings module and is used for deployment.
- items.py — where you can define the structure (fields) of the data objects you plan to scrape.
- middlewares.py — hooks into Scrapy’s request/response processing, useful for things like custom headers or proxies.
- pipelines.py — post-processing for scraped items, such as cleaning, validating, or saving them to a database.
- settings.py — project-wide configuration, such as the user agent, caching, and request throttling.
- spiders/ — the folder that will hold your spiders, the classes that do the actual scraping.
Now that that’s set up, it’s time to create a spider, the file that does the actual scraping. To do this, just run the below command with the name of the spider you want to create and the website you are planning to scrape.
scrapy genspider clubs transfermarkt.com
Navigate to the spiders folder and you should see a template containing a Spider class with all of the basic components populated. If you want to target a subpage of the domain you provided, you can go ahead and sub that in for the default start_urls value, like I did below.
import scrapy


class ClubsSpider(scrapy.Spider):
    name = 'clubs'
    allowed_domains = ['transfermarkt.com']
    start_urls = ['https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1']

    def parse(self, response):
        pass
The spider class contains the following attributes:
- name — the unique name of the spider, which you will use to run it from the command line.
- allowed_domains — the domains the spider is allowed to crawl; requests to other domains are filtered out.
- start_urls — the list of URLs the spider will request when it starts.
It also contains a parse function that is called after the spider receives a response from the specified start_urls. Inside this function, you will select the particular bits of data you want to scrape from the website and save them in your chosen format.
Scrapy uses XPaths to define its HTML targets. XPath is a syntax for identifying and navigating to parts of an XML or HTML document.
You can easily find an HTML element’s XPath using the Elements tab in Chrome’s developer tools. Just hover over the element you want to select, right-click, and choose Copy > Copy XPath.
Using Transfermarkt as an example, if I wanted to select the “News” nav bar item, I would get the below XPath:
"//*[@id=main']/header/nav/ul/li[1]/span"
Decoding the above XPath:
- //* — select an element of any type, anywhere in the document...
- [@id='main'] — ...as long as its id attribute equals 'main'.
- /header/nav/ul/li[1] — from that element, step down through its header, nav, and ul children to the first li in the list.
- /span — and finally select the span inside that li.
One important thing to note with XPaths is that they don’t use zero-based indexing like Python, so the first member is index 1.
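As a quick illustration of the difference (the list values here are just hypothetical):

# Python list: zero-based, so index 0 is the first element
nav_items = ['News', 'Transfers', 'Rumours']
first_item = nav_items[0]

# XPath: one-based, so li[1] is the first <li> element
# //*[@id='main']/header/nav/ul/li[1]/span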
You can also use CSS selectors (full documentation here) to grab data from a webpage, but I personally find XPaths to be more useful, especially with websites like Transfermarkt that don’t use a lot of id or class names.
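For comparison, here is roughly what the earlier nav-bar XPath could look like as a CSS selector inside the Scrapy shell (introduced next); this is just illustrative, and I will stick with XPaths for the rest of the article:

response.css("#main > header > nav > ul > li:first-child > span::text").extract()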
One of Scrapy’s many useful features is its built-in shell that allows you to test and debug XPath selectors. Using the Scrapy shell, you can see the output of your selector directly in the terminal instead of having to run your spider to view the results.
The below command opens the Scrapy shell:
scrapy shell
Once inside the shell, you can fetch the page you want to scrape and test out your selectors. In this example, I will scrape some basic information from all of the clubs in the English Premier League.
First we pass the full URL to the fetch function:
fetch('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
Note: if you're fetching a valid URL and receiving a 404 error, try adding USER_AGENT = 'project_name (+http://www.your-website.com)' to the settings.py file in your project. This will identify your spider to the host website and can sometimes be a prerequisite to scraping a webpage.
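In practice, that just means adding a single line to settings.py (substituting your own project name and contact URL):

# settings.py
USER_AGENT = 'my_project (+http://www.your-website.com)'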
Once that's working, we can use Chrome’s developer tools to start selecting the information we want. Let’s start with the club names. The closest ID to the table element we want is yw1, so I will target that ID and then work my way down through the nested tags. It is always best to be as specific as possible and to use ID values when you can, since page structures are prone to change.
response.xpath("//*[@id = 'yw1']/table/tbody/tr/td[2]/a/text()").extract()
You will often use the extract method to get the raw result of the XPath selector. In this case, we get a list with all of the club names:
['Manchester City', 'Chelsea FC', 'Liverpool FC' …]
If you want to extract only the first element with the provided XPath selector, you can use the extract_first method instead.
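For example, reusing the club-name selector from above, extract_first returns a single string instead of a list (in this case, the club at the top of the table):

response.xpath("//*[@id = 'yw1']/table/tbody/tr/td[2]/a/text()").extract_first()
# 'Manchester City'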
Now that we know how to select data with XPaths, let’s finish building our simple spider.
As I mentioned earlier, I personally prefer to define my data object in the spider file itself. For this example, we’ll keep it simple and just scrape the club name and market value for each of the clubs. Adding the below code above the spider will define the shape of our output data:
import scrapy


class ClubsItem(scrapy.Item):
    name = scrapy.Field()
    market_value = scrapy.Field()


class ClubsSpider(scrapy.Spider):
    # ...
Next, we will build out the parse function where we will use the XPaths to scrape the data and use our Scrapy item template to save the data to an item:
import scrapy


class ClubsItem(scrapy.Item):
    name = scrapy.Field()
    market_value = scrapy.Field()


class ClubsSpider(scrapy.Spider):
    name = 'clubs'
    allowed_domains = ['transfermarkt.com']
    start_urls = [
        'https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1']

    def parse(self, response):
        club_rows = response.xpath("//*[@id = 'yw1']/table/tbody/tr")

        for row in club_rows:
            club_name = row.xpath('td[2]/a/text()').extract()[0]
            club_mv = row.xpath('td[7]/a/text()').extract()[0]

            yield ClubsItem({'name': club_name, 'market_value': club_mv})
Here, our spider executes the below steps:
Step 1: Makes a request to the start URL, 'https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1'.
Step 2: After receiving a response, it grabs each club row from the table on the page. Note that we do not extract any values yet, since we only use these row selectors to apply relative XPaths inside the loop.
Step 3: Loops through the rows returned and selects the name of the club and the market value, extracting the text values.
Step 4: Finally, for each row, we create and yield a new ClubsItem with our newly scraped values. Yielding items allows them to be stored in whatever output format we specify when running the spider (CSV, JSON, a database, etc.).
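If you want to clean or validate items before they are written out, that is what pipelines.py is for. As a hypothetical sketch (not part of this demo), a pipeline could tidy up the scraped market value string and then be enabled via the ITEM_PIPELINES setting:

# pipelines.py (hypothetical example)
class CleanMarketValuePipeline:
    def process_item(self, item, spider):
        # strip stray whitespace from the scraped market value, e.g. '€1.05bn'
        item['market_value'] = item['market_value'].strip()
        return item

# settings.py
# ITEM_PIPELINES = {'my_project.pipelines.CleanMarketValuePipeline': 300}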
The final step in this process is to actually run the spider, saving the output information in whatever file format we want. We can do so using the below command:
scrapy crawl clubs -o clubs.json
Once this is run, the terminal will log the scraping process and any errors that you might run into. When the scraping is complete, you will see a new file, clubs.json, in your current directory containing all of the data in JSON format.
To change the output file format, just modify the file extension:
scrapy crawl clubs -o clubs.csv
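If you are on Scrapy 2.1 or newer and would rather configure the output once instead of passing -o every time, you can also define feeds in settings.py; a minimal sketch:

# settings.py
FEEDS = {
    'clubs.json': {'format': 'json'},
}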
To view the full code from this demonstration, check out the GitHub repo here.
This article outlines the bare minimum needed to create a spider, and there’s obviously a lot more to cover for more complex tasks such as crawling multiple pages, handling pagination, downloading images, and so on. I have linked some resources below that go into these more advanced topics.
If you’re curious about how Scrapy is used in a larger scale project, you can read about my experience scraping data for a big football migration project I’m working on here.
Data Visualization with Python and JavaScript, 2nd Edition (Chapter 6), Kyran Dale
A Minimalist End-to-End Scrapy Tutorial, Harry Wang