Scrapy - Following Links


Advertisements

Description

In this chapter, we'll study how to extract the links of the pages of our interest, follow them and extract data from that page. For this, we need to make the following changes in our previous code shown as follows −

import scrapy
from tutorial.items import DmozItem

class MyprojectSpider(scrapy.Spider):
   name = "project"
   allowed_domains = ["dmoz.org"]
   
   start_urls = [
      "http://www.dmoz.org/Computers/Programming/Languages/Python/",
   ]
   def parse(self, response):
      for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
         url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback = self.parse_dir_contents)

   def parse_dir_contents(self, response):
      for sel in response.xpath('//ul/li'):
         item = DmozItem()
         item['title'] = sel.xpath('a/text()').extract()
         item['link'] = sel.xpath('a/@href').extract()
         item['desc'] = sel.xpath('text()').extract()
         yield item

The above code contains the following methods −

  • parse() − It will extract the links of our interest.

  • response.urljoin − The parse() method will use this method to build a new url and provide a new request, which will be sent later to callback.

  • parse_dir_contents() − This is a callback which will actually scrape the data of interest.

Here, Scrapy uses a callback mechanism to follow links. Using this mechanism, the bigger crawler can be designed and can follow links of interest to scrape the desired data from different pages. The regular method will be callback method, which will extract the items, look for links to follow the next page, and then provide a request for the same callback.

The following example produces a loop, which will follow the links to the next page.

def parse_articles_follow_next_page(self, response):
   for article in response.xpath("//article"):
      item = ArticleItem()
    
      ... extract article data here

      yield item

   next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
   if next_page:
      url = response.urljoin(next_page[0].extract())
      yield scrapy.Request(url, self.parse_articles_follow_next_page)
Advertisements