Build a Web crawler with Scrapy

By Admin

Creating your first web crawler project with Python and Scrapy

March 3 10:01PM • 20 min read


Learning how to use Scrapy, Python, and web scraping to gather data is no easy task. This tutorial teaches you the basics of working with Scrapy and gives you hands-on experience by scraping the BrainyQuote website. Let's get into creating your first web scraper!

Getting started

First, create a folder on your computer where you will store your code. Name this folder "QuotesScraper". Double-click the folder you just created to open it, then right-click inside it and select the "Open with Visual Studio Code" option.

Opening the folder with Visual Studio Code

Installing Dependencies

To start, we need to install some key dependencies. This tutorial uses Visual Studio Code. If you have not yet installed Visual Studio Code, follow this video: https://www.youtube.com/watch?v=dNFgRUD2w68. You also need to install Scrapy via pip. To do this, make sure you have selected a Python interpreter (on Windows, press Ctrl + Shift + P and run the "Python: Select Interpreter" command). Click "Terminal" in the menu bar at the top left of the Visual Studio Code window, click "New Terminal", and then enter the command:
pip install scrapy
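A quick way to confirm that the install landed in the interpreter you selected is to ask Python whether it can find the package. This is only a sketch using the standard library; the helper name is_installed is made up for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the active interpreter can locate the named package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # If this prints False, pip probably installed Scrapy into a different interpreter.
    print("scrapy installed:", is_installed("scrapy"))
```

If the script reports False, re-check which interpreter Visual Studio Code is using and run the pip command again from a new terminal.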

Starting the project

In your terminal, enter the command:
scrapy startproject QuotesScrape
This will create a "QuotesScrape" directory that contains the key files for this project.
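To check that the command produced the files the rest of this tutorial relies on, a small script along these lines could be run from the folder where you ran startproject. The file list reflects Scrapy's default project template and may differ between versions; the helper names expected_files and missing_files are made up for this sketch.

```python
from pathlib import Path


def expected_files(project_name):
    """Paths created by `scrapy startproject <project_name>`, relative to the new directory."""
    pkg = Path(project_name)
    return [
        Path("scrapy.cfg"),
        pkg / "__init__.py",
        pkg / "items.py",
        pkg / "middlewares.py",
        pkg / "pipelines.py",
        pkg / "settings.py",
        pkg / "spiders" / "__init__.py",
    ]


def missing_files(root, project_name):
    """Return the expected files that do not exist under root."""
    root = Path(root)
    return [p.as_posix() for p in expected_files(project_name) if not (root / p).is_file()]


if __name__ == "__main__":
    # An empty list means every expected file was found.
    print(missing_files("QuotesScrape", "QuotesScrape"))
```

Note that settings.py sits next to the spiders folder, not inside it.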

Creating the Spider

In your terminal, move into the project directory with the command:
cd QuotesScrape
Then enter the command:
scrapy genspider BrainyQuote https://www.brainyquote.com/topics/valentines-day-quotes
This will create a spider file named "BrainyQuote.py" under "QuotesScrape/spiders/".

Allowed Domains

Open the BrainyQuote.py file and make sure that the "allowed_domains" and "start_urls" variables match what is in the image below.

Configuration of code in Visual Studio
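In case the image is not available, the sketch below shows one plausible pair of values and why they fit together. The start URL comes from the genspider command above; the allowed_domains value is an assumption. The is_allowed helper is a simplified stand-in for Scrapy's offsite filtering, not its actual implementation.

```python
from urllib.parse import urlparse

# Assumed value: a bare domain name, with no scheme and no path.
allowed_domains = ["brainyquote.com"]
# Taken from the genspider command used in this tutorial.
start_urls = ["https://www.brainyquote.com/topics/valentines-day-quotes"]


def is_allowed(url, domains):
    """Return True if the URL's host is one of the domains or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == domain or host.endswith("." + domain) for domain in domains)


if __name__ == "__main__":
    for url in start_urls:
        print(url, "->", is_allowed(url, allowed_domains))
```

A common mistake is to paste the full URL into allowed_domains. With this check, a full URL in that list never matches a host name, which mirrors why the value should be a domain only.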

Spider Code

After this, enter the code shown in the image below:

Main Spider Code
First, Scrapy requests each URL in the "start_urls" list and passes the resulting response object to the spider's parse method. The response object has methods such as .css() and .xpath(), which can be used to extract information from the page.
Here,
quoteblocks = response.css(".grid-item.qb.clearfix.bqQt")
selects all elements that have the classes "grid-item qb clearfix bqQt". We choose this class combination because the parent container of each quote and its author carries it.
a picture of the block we want to select
As you can see, the text of each quote and the name of its author reside in a div with a class of "grid-item qb clearfix bqQt", which is why we target and select this element.
for quoteblock in quoteblocks:
    item = {}
    # get the quote text
    quote = quoteblock.xpath(".//a/div/text()").get()

    # get the author name
    author = quoteblock.xpath(".//a[@title='view author']/text()").get()

    item["quote"] = quote
    item["author"] = author

    yield item

We use a for loop to go through all the quote blocks we selected and extract information from each one. We initialize a dictionary named "item" to store each quote and its author. Calling quoteblock.xpath() runs an XPath selector relative to the quoteblock, which makes it easy to select elements inside it. The XPath selector ".//a/div/text()" finds a elements anywhere inside the quoteblock, then the div elements directly inside them, and finally the text nodes of those divs (which contain the quote). Note that .xpath() returns a list of selectors, not a string. To get the text of the first match as a string, we call the .get() method, which returns None if nothing matched.

Selecting the Author

To scrape the name of the author from each quote, we use the following line of code:


author = quoteblock.xpath(".//a[@title='view author']/text()").get()

The XPath selector ".//a[@title='view author']/text()" selects the a elements in each quoteblock whose title attribute is 'view author', and then selects their text nodes. Calling .get() returns the author name as a string.

item["quote"] = quote
item["author"] = author

yield item
                            
We then store the quote text and author name inside the dictionary "item" and yield the item. (Note: this is important because a Scrapy callback such as parse must return or yield item objects, for example dictionaries, or Request objects.)

Launching your spider

Almost done! To launch your web crawler, first open QuotesScrape/QuotesScrape/settings.py and set the parameter "ROBOTSTXT_OBEY" to False. (This stops Scrapy from skipping pages that the site's robots.txt disallows. However, please follow responsible scraping practices and don't send too many requests at once.) After you have completed this step, open a new terminal (click the "Terminal" button in the top left-hand corner), make sure it is in the project directory that contains scrapy.cfg, and type:
scrapy crawl BrainyQuote -o data.json
You should now see the scraper fully working, and all of your scraped data inside a file named data.json!
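As a rough illustration, the relevant part of settings.py could look like the excerpt below. ROBOTSTXT_OBEY is the setting this tutorial changes; the other three are standard Scrapy settings added here as one possible way to avoid sending too many requests at once, and their values are assumptions rather than recommendations from the original code.

```python
# QuotesScrape/QuotesScrape/settings.py (excerpt)

# Do not filter out requests that the site's robots.txt disallows.
ROBOTSTXT_OBEY = False

# Illustrative politeness values (assumptions): wait between requests
# and keep only one request in flight per domain.
DOWNLOAD_DELAY = 1.0  # seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adjust the delay based on how quickly the site responds.
AUTOTHROTTLE_ENABLED = True
```

Note that the value is the Python boolean False, not the string "FALSE".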

Data Collected in a JSON File
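To work with the results afterwards, the file can be read back with Python's standard json module. The sketch below assumes data.json holds a list of objects with "quote" and "author" keys, as produced by the spider in this tutorial; the function name load_quotes is made up for illustration.

```python
import json
from pathlib import Path


def load_quotes(path="data.json"):
    """Read the scraped items and return them as (quote, author) pairs."""
    with Path(path).open(encoding="utf-8") as f:
        items = json.load(f)
    return [(item.get("quote"), item.get("author")) for item in items]


if __name__ == "__main__":
    for quote, author in load_quotes():
        print(f"{quote} ({author})")
```

One thing to watch for: the -o option appends to an existing file, so running the crawl twice can leave data.json in a state that json.load cannot parse. Deleting the file before a new run avoids this, and recent Scrapy versions also accept -O to overwrite it.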