Spider is a class that defines how to follow links through a website and how to extract information from its pages.
The default spiders of Scrapy are as follows −
It is the spider from which every other spider must inherit. It has the following class −
class scrapy.spiders.Spider
The following table shows the fields of scrapy.Spider class −
Sr.No | Field & Description |
---|---|
1 | name − It is the name of your spider. |
2 | allowed_domains − It is a list of domains on which the spider crawls. |
3 | start_urls − It is a list of URLs from which the spider begins to crawl; they are the roots for later crawls. |
4 | custom_settings − These are the settings that, when running the spider, override the project-wide configuration. |
5 | crawler − It is an attribute that links to the Crawler object to which the spider instance is bound. |
6 | settings − These are the settings for running a spider. |
7 | logger − It is a Python logger used to send log messages. |
8 | from_crawler(crawler, *args, **kwargs) − It is a class method that creates your spider. Its parameters are crawler (the crawler to which the spider instance will be bound) and args, kwargs (the arguments passed to the __init__() method). |
9 | start_requests() − When no particular URLs are specified and the spider is opened for scraping, Scrapy calls the start_requests() method. |
10 | make_requests_from_url(url) − It is a method used to convert URLs to requests. |
11 | parse(response) − This method processes the response and returns scraped data and/or more URLs to follow. |
12 | log(message[, level, component]) − It is a method that sends a log message through the spider's logger. |
13 | closed(reason) − This method is called when the spider closes. |
Spider arguments are used to specify start URLs and are passed using the crawl command with the -a option, shown as follows −
scrapy crawl first_scrapy -a group=accessories
The following code demonstrates how a spider receives arguments −
import scrapy

class FirstSpider(scrapy.Spider):
   name = "first"

   def __init__(self, group=None, *args, **kwargs):
      super(FirstSpider, self).__init__(*args, **kwargs)
      self.start_urls = ["http://www.example.com/group/%s" % group]
You can subclass your spiders from the generic spiders. Their aim is to follow all links on the website based on certain rules and extract data from all pages.
For the examples used in the following spiders, let's assume we have a project with an item defining the following fields −
import scrapy
from scrapy.item import Item, Field

class First_scrapyItem(scrapy.Item):
   product_title = Field()
   product_link = Field()
   product_description = Field()
CrawlSpider defines a set of rules to follow the links and scrape more than one page. It has the following class −
class scrapy.spiders.CrawlSpider
Following are the attributes of the CrawlSpider class −
rules − It is a list of Rule objects that define how the crawler follows links.
The following table shows the rules of CrawlSpider class −
Sr.No | Rule & Description |
---|---|
1 | LinkExtractor − It specifies how the spider follows links and extracts data. |
2 | callback − It is to be called after each page is scraped. |
3 | follow − It specifies whether to continue following links or not. |
parse_start_url(response) − It returns either an item or a request object, allowing the initial responses to be parsed.
Note − Make sure you name your callback function something other than parse when writing the rules, because the parse function is used by CrawlSpider to implement its logic.
Let's take a look at the following example, where the spider starts crawling demoexample.com's home page, collecting all pages and links, and parsing them with the parse_item method −
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from demoproject.items import DemoItem

class DemoSpider(CrawlSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com"]

   rules = (
      Rule(LinkExtractor(allow=(), restrict_xpaths=("//div[@class = 'next']",)),
         callback="parse_item", follow=True),
   )

   def parse_item(self, response):
      item = DemoItem()
      item["product_title"] = response.xpath("a/text()").extract()
      item["product_link"] = response.xpath("a/@href").extract()
      item["product_description"] = response.xpath("div[@class = 'desc']/text()").extract()
      return item
It is the base class for spiders that scrape from XML feeds by iterating over nodes. It has the following class −
class scrapy.spiders.XMLFeedSpider
The following table shows the class attributes used to set an iterator and a tag name −
Sr.No | Attribute & Description |
---|---|
1 | iterator − It defines the iterator to be used. It can be either iternodes, html or xml. The default is iternodes. |
2 | itertag − It is a string with the name of the node to iterate over. |
3 | namespaces − It is a list of (prefix, uri) tuples that automatically registers namespaces using the register_namespace() method. |
4 | adapt_response(response) − It receives the response and modifies its body as soon as it arrives from the spider middleware, before the spider starts parsing it. |
5 | parse_node(response, selector) − It receives the response and a selector when called for each node matching the provided tag name. Note − Your spider won't work if you don't override this method. |
6 | process_results(response, results) − It receives the response and the list of results returned by the spider, for any last processing before the results are returned. |
It receives a CSV file as a response, iterates through each of its rows, and calls the parse_row() method for each row. It has the following class −
class scrapy.spiders.CSVFeedSpider
The following table shows the options that can be set regarding the CSV file −
Sr.No | Option & Description |
---|---|
1 | delimiter − It is a string containing the comma (',') separator for each field. |
2 | quotechar − It is a string containing the quotation mark ('"') for each field. |
3 | headers − It is a list of the column names in the CSV file, from which the fields can be extracted. |
4 | parse_row(response, row) − It receives the response and each row as a dict with a key for each header. |
from scrapy.spiders import CSVFeedSpider
from demoproject.items import DemoItem

class DemoSpider(CSVFeedSpider):
   name = "demo"
   allowed_domains = ["www.demoexample.com"]
   start_urls = ["http://www.demoexample.com/feed.csv"]
   delimiter = ";"
   quotechar = "'"
   headers = ["product_title", "product_link", "product_description"]

   def parse_row(self, response, row):
      self.logger.info("This is row: %r", row)
      item = DemoItem()
      item["product_title"] = row["product_title"]
      item["product_link"] = row["product_link"]
      item["product_description"] = row["product_description"]
      return item
SitemapSpider crawls a website with the help of Sitemaps, locating the sitemap URLs from robots.txt. It has the following class −
class scrapy.spiders.SitemapSpider
The following table shows the fields of SitemapSpider −
Sr.No | Field & Description |
---|---|
1 | sitemap_urls − It is a list of URLs pointing to the sitemaps which you want to crawl. |
2 | sitemap_rules − It is a list of (regex, callback) tuples, where regex is a regular expression, and callback is used to process URLs matching that regular expression. |
3 | sitemap_follow − It is a list of regexes for the sitemaps to follow. |
4 | sitemap_alternate_links − It specifies the alternate links to be followed for a single URL. |
The following SitemapSpider processes all the URLs −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]

   def parse(self, response):
      # You can scrape items here
      pass
The following SitemapSpider processes some URLs with callbacks −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]
   sitemap_rules = [
      ("/item/", "parse_item"),
      ("/group/", "parse_group"),
   ]

   def parse_item(self, response):
      # you can scrape item here
      pass

   def parse_group(self, response):
      # you can scrape group here
      pass
The following code follows only those sitemaps in robots.txt whose URL contains /sitemap_company −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/robots.txt"]
   sitemap_rules = [
      ("/company/", "parse_company"),
   ]
   sitemap_follow = ["/sitemap_company"]

   def parse_company(self, response):
      # you can scrape company here
      pass
You can even combine SitemapSpider with other URLs as shown in the following code −
import scrapy
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
   sitemap_urls = ["http://www.demoexample.com/robots.txt"]
   sitemap_rules = [
      ("/company/", "parse_company"),
   ]
   other_urls = ["http://www.demoexample.com/contact-us"]

   def start_requests(self):
      requests = list(super(DemoSpider, self).start_requests())
      requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
      return requests

   def parse_company(self, response):
      # you can scrape company here...
      pass

   def parse_other(self, response):
      # you can scrape other here...
      pass