Spider is a class that defines how to follow links through a website and how to extract information from its pages.
The default spiders of Scrapy are as follows −
It is the spider from which every other spider must inherit. It has the following class −
class scrapy.spiders.Spider
The following table shows the fields of scrapy.Spider class −
Sr.No | Field & Description |
---|---|
1 | name − It is the name of your spider. |
2 | allowed_domains − It is a list of domains on which the spider crawls. |
3 | start_urls − It is a list of URLs from which the spider begins to crawl; they are the roots for later crawls. |
4 | custom_settings − These are the settings that, when running the spider, override the project-wide configuration. |
5 | crawler − It is an attribute that links to the Crawler object to which the spider instance is bound. |
6 | settings − These are the settings for running a spider. |
7 | logger − It is a Python logger used to send log messages. |
8 | from_crawler(crawler, *args, **kwargs) − It is a class method that creates your spider. Its parameters are crawler (the crawler to which the spider instance will be bound) and args, kwargs (the arguments passed to the __init__() method). |
9 | start_requests() − When no particular URLs are specified and the spider is opened for scraping, Scrapy calls the start_requests() method. |
10 | make_requests_from_url(url) − It is a method used to convert URLs to requests. |
11 | parse(response) − This method processes the response and returns scraped data and/or more URLs to follow. |
12 | log(message[, level, component]) − It is a method that sends a log message through the spider's logger. |
13 | closed(reason) − This method is called when the spider closes. |
Spider arguments are used to specify start URLs and are passed using the crawl command with the -a option, shown as follows −
scrapy crawl first_scrapy -a group=accessories
The following code demonstrates how a spider receives arguments −
import scrapy

class FirstSpider(scrapy.Spider):
   name = "first"

   def __init__(self, group=None, *args, **kwargs):
      super(FirstSpider, self).__init__(*args, **kwargs)
      self.start_urls = ["http://www.example.com/group/%s" % group]
You can subclass your spiders from the generic spiders. Their aim is to follow all links on the website based on certain rules and extract data from all pages.
For the examples used in the following spiders, let's assume we have a project with an item defining the following fields −
import scrapy
from scrapy.item import Item, Field

class First_scrapyItem(scrapy.Item):
   product_title = Field()
   product_link = Field()
   product_description = Field()
CrawlSpider defines a set of rules to follow the links and scrape more than one page. It has the following class −
class scrapy.spiders.CrawlSpider
Following are the attributes of the CrawlSpider class −
rules − It is a list of Rule objects that define how the crawler follows links.
The following table shows the rules of CrawlSpider class −
Sr.No | Rule & Description |
---|---|
1 | LinkExtractor − It specifies how the spider follows links and extracts data. |
2 | callback − It is to be called after each page is scraped. |
3 | follow − It specifies whether to continue following links or not. |
parse_start_url(response) − It returns either an item or a request object, allowing the initial responses to be parsed.
Note − Make sure you name your callback function something other than parse when writing the rules, because the parse function is used by CrawlSpider to implement its logic.
Let's take a look at the following example, where the spider starts crawling demoexample.com's home page, collecting all pages and links, and parsing them with the parse_item method −
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from demoproject.items import DemoItem

class DemoSpider(CrawlSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com"]

   rules = (
      Rule(LinkExtractor(allow=(), restrict_xpaths=("//div[@class = 'next']",)),
         callback="parse_item", follow=True),
   )

   def parse_item(self, response):
      item = DemoItem()
      item["product_title"] = response.xpath("a/text()").extract()
      item["product_link"] = response.xpath("a/@href").extract()
      item["product_description"] = response.xpath("div[@class = 'desc']/text()").extract()
      return item
It is the base class for spiders that scrape from XML feeds by iterating over nodes. It has the following class −
class scrapy.spiders.XMLFeedSpider
The following table shows the class attributes used to set an iterator and a tag name −
Sr.No | Attribute & Description |
---|---|
1 | iterator − It defines the iterator to be used. It can be either iternodes, html or xml. The default is iternodes. |
2 | itertag − It is a string with the name of the node to iterate over. |
3 | namespaces − It is a list of (prefix, uri) tuples that automatically registers namespaces using the register_namespace() method. |
4 | adapt_response(response) − It receives the response and modifies its body as soon as it arrives from the spider middleware, before the spider starts parsing it. |
5 | parse_node(response, selector) − It receives the response and a selector when called for each node matching the provided tag name. Note − Your spider won't work if you don't override this method. |
6 | process_results(response, results) − It receives the response and the list of results returned by the spider, for any last processing before the results are returned. |
It receives a CSV file as a response, iterates through each of its rows, and calls the parse_row() method for each row. It has the following class −
class scrapy.spiders.CSVFeedSpider
The following table shows the options that can be set regarding the CSV file −
Sr.No | Option & Description |
---|---|
1 | delimiter − It is a string containing the comma (',') separator for each field. |
2 | quotechar − It is a string containing the quotation mark ('"') for each field. |
3 | headers − It is a list of the column names in the CSV file, from which the fields can be extracted. |
4 | parse_row(response, row) − It receives the response and each row as a dict with a key for each header. |
from scrapy.spiders import CSVFeedSpider
from demoproject.items import DemoItem

class DemoSpider(CSVFeedSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com/feed.csv"]
   delimiter = ";"
   quotechar = "'"
   headers = ["product_title", "product_link", "product_description"]

   def parse_row(self, response, row):
      self.logger.info("This is row: %r", row)
      item = DemoItem()
      item["product_title"] = row["product_title"]
      item["product_link"] = row["product_link"]
      item["product_description"] = row["product_description"]
      return item
SitemapSpider crawls a website with the help of Sitemaps, locating the sitemap URLs from robots.txt. It has the following class −
class scrapy.spiders.SitemapSpider
The following table shows the fields of SitemapSpider −
Sr.No | Field & Description |
---|---|
1 | sitemap_urls − It is a list of URLs pointing to the sitemaps which you want to crawl. |
2 | sitemap_rules − It is a list of (regex, callback) tuples, where regex is a regular expression, and callback is used to process URLs matching that regular expression. |
3 | sitemap_follow − It is a list of regexes for the sitemaps to follow. |
4 | sitemap_alternate_links − It specifies the alternate links to be followed for a single URL. |
The following SitemapSpider processes all the URLs −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]

   def parse(self, response):
      # You can scrape items here
      pass
The following SitemapSpider processes some URLs with callbacks −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]
   sitemap_rules = [
      ("/item/", "parse_item"),
      ("/group/", "parse_group"),
   ]

   def parse_item(self, response):
      # you can scrape item here
      pass

   def parse_group(self, response):
      # you can scrape group here
      pass
The following code follows only those sitemaps in robots.txt whose URL contains /sitemap_company −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/robots.txt"]
   sitemap_rules = [
      ("/company/", "parse_company"),
   ]
   sitemap_follow = ["/sitemap_company"]

   def parse_company(self, response):
      # you can scrape company here
      pass
You can even combine SitemapSpider with other URLs as shown in the following code −
import scrapy
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/robots.txt"]
   sitemap_rules = [
      ("/company/", "parse_company"),
   ]
   other_urls = ["http://www.demoexample.com/contact-us"]

   def start_requests(self):
      requests = list(super(DemoSpider, self).start_requests())
      requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
      return requests

   def parse_company(self, response):
      # you can scrape company here...
      pass

   def parse_other(self, response):
      # you can scrape other here...
      pass