If you want to learn how to build your own web crawler using a VPS, have you considered using Scrapy? In this installment of ColoCrossing Tutorials, we’ll go over the basic functions of the Scrapy web crawling app.
Scrapy is an open source application used to extract data from websites. It is a framework written in Python that lets your VPS perform crawling tasks in a fast, simple and extensible way.
How to Install Scrapy on Ubuntu 16.04 LTS
Scrapy depends on Python, the Python development libraries and pip.
Python should already be pre-installed on your Ubuntu VPS. From there, we only have to install pip and the Python development libraries before installing Scrapy.
Before continuing let’s make sure that our system is up to date. Let’s therefore log into our system and gain root privileges using the following command:
> sudo -i
We can now make sure everything is up to date using the two following commands:
> apt-get update
> apt-get install python
In the next step we are going to install pip. Pip is the replacement for easy_install and is used to install and manage Python packages from the Python Package Index (PyPI). We can install it using the following command:
> apt-get install python-pip
Once pip is installed, we have to install the Python development libraries using the following command:
> apt-get install python-dev
If this package is missing, the installation of Scrapy will fail with an error about the Python.h header file. Make sure to check the output of the previous command before continuing with the next steps of the installation.
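As a quick sanity check before moving on, the short sketch below asks the interpreter where it expects Python.h to be and reports whether the file is there. The helper names `python_header_path` and `has_python_header` are made up for this illustration, and the script should be run with the same Python interpreter that pip will use.

```python
import os
import sysconfig


def python_header_path():
    """Return the path where this interpreter expects to find Python.h."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def has_python_header():
    """Return True if the Python.h header file is present on disk."""
    return os.path.isfile(python_header_path())


if __name__ == "__main__":
    status = "found" if has_python_header() else "missing"
    print("%s: %s" % (python_header_path(), status))
```

If the header is reported as missing, re-run the python-dev installation before installing Scrapy.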
The Scrapy framework can also be installed from a deb package, but in this tutorial we will install it with pip. Run the following command:
> pip install scrapy
The installation will take some time and should end with the following message:
“Successfully installed scrapy queuelib service-identity parsel w3lib PyDispatcher cssselect Twisted pyasn1 pyasn1-modules attrs constantly incremental
Cleaning up…”
If you see that, you have successfully installed Scrapy and you are now ready to start crawling the web!
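If the pip output has already scrolled away, one way to double-check the result is to try importing the packages from Python. This is only a minimal sketch: `is_installed` is a hypothetical helper written for this example, and the package names checked are taken from the pip message quoted above.

```python
import importlib


def is_installed(package_name):
    """Return True if the named package can be imported by this interpreter."""
    try:
        importlib.import_module(package_name)
    except ImportError:
        return False
    return True


# Check Scrapy itself and two of the dependencies pip reported installing.
for name in ("scrapy", "twisted", "w3lib"):
    print("%s: %s" % (name, "OK" if is_installed(name) else "NOT FOUND"))
```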
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:
> scrapy startproject myProject
This will create a “myProject” directory with the following content:
- scrapy.cfg - the project configuration file
- myProject/ - the project's Python module; you'll import your code from here
– items.py – project items definition file
– pipelines.py – project pipelines file
– settings.py – project settings file
– spiders/ – a directory where you’ll later put your spiders
We are now going to create our first spider and execute it to collect some information from the web.
Spiders are classes that you define. Scrapy uses spiders to scrape information from a website (or a group of websites). This is the code for our first Spider. Save it in a file named “quotes_spider.py” under the “myProject/spiders” directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    # The name is what you pass to "scrapy crawl"
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The page number is the second-to-last segment of the URL
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
This code fetches the two pages listed in urls, which contain quotes from different authors, and saves them as HTML files named quotes-1.html and quotes-2.html.
Once you have saved the file with the code you are ready to execute your first crawler using the two following commands:
> cd myProject
> scrapy crawl quotes
The execution of the spider should end with the following line:
“…..[scrapy] INFO: Spider closed (finished)”
If you list the files in your current directory you should see the new html files generated by the spider:
quotes-1.html
quotes-2.html
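The file names come from the URL of each page: the spider splits the URL on "/" and takes the second-to-last piece as the page number. The standalone sketch below reproduces just that step without Scrapy; `page_filename` is a hypothetical helper introduced for the illustration.

```python
def page_filename(url):
    """Build the output file name from a quotes page URL.

    Splitting 'http://quotes.toscrape.com/page/1/' on '/' gives
    ['http:', '', 'quotes.toscrape.com', 'page', '1', ''], so the
    second-to-last element is the page number.
    """
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


for url in ("http://quotes.toscrape.com/page/1/",
            "http://quotes.toscrape.com/page/2/"):
    print(page_filename(url))
# prints quotes-1.html, then quotes-2.html
```

Note that this relies on the trailing slash in the URL; without it, the second-to-last segment would be "page" rather than the number.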
In the following example we are going to follow the links to each author's page, extract the author's information and save the result in a JSON Lines file. We first need to create a new spider in a file named author_spider.py, in the same spiders directory, with the following content:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author+a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_author)

        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            # return the first match for the selector, stripped of whitespace
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
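The links found on a page are usually relative (they start with "/"), which is why the spider passes them through response.urljoin before requesting them. As a rough illustration of that step, the sketch below uses the standard library's urljoin, which should behave the same way for simple cases like these. The two href values are example inputs assumed for this illustration, not output captured from the site.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# URL of the page the links were found on
base_url = "http://quotes.toscrape.com/page/1/"

# Example relative links (assumed values for illustration)
author_href = "/author/Albert-Einstein"
next_page_href = "/page/2/"

print(urljoin(base_url, author_href))
# http://quotes.toscrape.com/author/Albert-Einstein
print(urljoin(base_url, next_page_href))
# http://quotes.toscrape.com/page/2/
```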
We can now execute this new crawler with the following command:
> scrapy crawl author -o author.jl
This will create a file named author.jl containing the extracted data. The JSON Lines format is useful because it is stream-like: each line is a separate JSON object, so you can easily append new records to the file.
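To make the "one JSON object per line" idea concrete, here is a small sketch that appends records to a JSON Lines file and reads them back using only the standard library. The helpers `append_record` and `read_records`, the demo file name and the sample records are all invented for this example; real output from the spider would also contain a bio field.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record to a JSON Lines file as a single line."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


def read_records(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a throwaway file so a real author.jl is not touched.
path = os.path.join(tempfile.mkdtemp(), "author_demo.jl")
append_record(path, {"name": "Example Author", "birthdate": "January 1, 1900"})
append_record(path, {"name": "Another Author", "birthdate": "February 2, 1950"})

records = read_records(path)
print(len(records))        # 2
print(records[0]["name"])  # Example Author
```

Because every record sits on its own line, appending never requires rewriting or re-parsing what is already in the file, which is not the case for a single large JSON array.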
This is just a brief overview of the Scrapy app. As you can see, you can perform some pretty sophisticated tasks using Scrapy on your Ubuntu VPS.
If you’d like more information about Scrapy, the best thing to do is to take a deep dive into Scrapy’s documentation.