Scrapy - Requests and Responses



Description

Scrapy crawls websites using Request and Response objects. Request objects are generated in the spiders and pass through the system until they reach the downloader, which executes each request and returns a Response object to the spider that issued it.

Request Objects

A Request object represents an HTTP request, which generates a response when it is executed. Its class signature is as follows −

class scrapy.http.Request(url[, callback, method = 'GET', headers, body, cookies, meta,
   encoding = 'utf-8', priority = 0, dont_filter = False, errback])

The following table shows the parameters of Request objects −

Sr.No Parameter & Description

1. url − A string that specifies the URL of the request.

2. callback − A callable that is called with the response of the request as its first parameter.

3. method − A string that specifies the HTTP method of the request. The default is 'GET'.

4. headers − A dictionary containing the request headers.

5. body − A string or bytes object containing the request body. A string is encoded to bytes using the request encoding.

6. cookies − A dictionary, or a list of dictionaries, containing the request cookies.

7. meta − A dictionary that contains arbitrary metadata for the request.

8. encoding − A string that specifies the encoding of the request (the default is 'utf-8'). It is used to percent-encode the URL and to convert a string body to bytes.

9. priority − An integer that the scheduler uses to define the order in which requests are processed. Requests with a higher priority value are executed earlier.

10. dont_filter − A boolean specifying that the scheduler should not filter out the request as a duplicate.

11. errback − A callable that is called when an exception is raised while processing the request.

Passing Additional Data to Callback Functions

The callback function of a request is called once its response has been downloaded; the Response object is passed to the callback as its first parameter.

For example −

def parse_page1(self, response): 
   return scrapy.Request("http://www.something.com/some_page.html", 
      callback = self.parse_page2)  

def parse_page2(self, response): 
   self.logger.info("%s page visited", response.url) 

If you want to pass additional data to a callback function, you can use the Request.meta attribute. The second callback then reads the data back from response.meta, as shown in the following example −

def parse_page1(self, response): 
   item = DemoItem() 
   item['foremost_link'] = response.url 
   request = scrapy.Request("http://www.something.com/some_page.html", 
      callback = self.parse_page2) 
   request.meta['item'] = item 
   return request  

def parse_page2(self, response): 
   item = response.meta['item'] 
   item['other_link'] = response.url 
   return item

Using errbacks to Catch Exceptions in Request Processing

The errback of a request is a callable that is invoked when an exception is raised while processing the request. It receives a Twisted Failure instance as its first parameter.

The following example demonstrates this −

import scrapy  

from scrapy.spidermiddlewares.httperror import HttpError 
from twisted.internet.error import DNSLookupError 
from twisted.internet.error import TimeoutError, TCPTimedOutError  

class DemoSpider(scrapy.Spider): 
   name = "demo" 
   start_urls = [ 
      "http://www.httpbin.org/",              # HTTP 200 expected 
      "http://www.httpbin.org/status/404",    # Webpage not found  
      "http://www.httpbin.org/status/500",    # Internal server error 
      "http://www.httpbin.org:12345/",        # timeout expected 
      "http://www.httphttpbinbin.org/",       # DNS error expected 
   ]  
   
   def start_requests(self): 
      for u in self.start_urls: 
         yield scrapy.Request(u, callback = self.parse_httpbin, 
         errback = self.errback_httpbin, 
         dont_filter=True)  
   
   def parse_httpbin(self, response): 
      self.logger.info('Received response from {}'.format(response.url)) 
      # ...  
   
   def errback_httpbin(self, failure): 
      # logs failures 
      self.logger.error(repr(failure))  
      
      if failure.check(HttpError): 
         response = failure.value.response 
         self.logger.error("HttpError occurred on %s", response.url)  
      
      elif failure.check(DNSLookupError): 
         request = failure.request 
         self.logger.error("DNSLookupError occurred on %s", request.url) 

      elif failure.check(TimeoutError, TCPTimedOutError): 
         request = failure.request 
         self.logger.error("TimeoutError occurred on %s", request.url) 

Request.meta Special Keys

The Request.meta attribute can hold arbitrary data, but some keys have a special meaning because they are recognized by Scrapy and its built-in extensions.

The following table shows some of the special keys of Request.meta −

Sr.No Key & Description

1. dont_redirect − If set to True, the request is not redirected based on the status of the response.

2. dont_retry − If set to True, a failed request is not retried; the retry middleware ignores it.

3. handle_httpstatus_list − A list of response status codes that are allowed for this request, on a per-request basis.

4. handle_httpstatus_all − If set to True, any response status code is allowed for this request.

5. dont_merge_cookies − If set to True, the cookies of the request are not merged with the existing cookies.

6. cookiejar − Used to keep multiple cookie sessions per spider.

7. dont_cache − If set to True, the response is not cached, whichever cache policy is in use.

8. redirect_urls − Contains the URLs that the request has passed through while being redirected.

9. bindaddress − The outgoing IP address to use for performing the request.

10. dont_obey_robotstxt − If set to True, the request is not filtered out by the robots.txt exclusion standard, even if ROBOTSTXT_OBEY is enabled.

11. download_timeout − The time (in seconds) that the downloader waits before the request times out.

12. download_maxsize − The maximum response size (in bytes) that the downloader will download for the request.

13. proxy − The HTTP proxy to use for the request.

Request Subclasses

You can implement your own custom functionality by subclassing the Request class. The built-in Request subclasses are as follows −

FormRequest Objects

The FormRequest class extends the base Request class with functionality for dealing with HTML forms. Its class signature is as follows −

class scrapy.http.FormRequest(url[,formdata, callback, method = 'GET', headers, body, 
   cookies, meta, encoding = 'utf-8', priority = 0, dont_filter = False, errback])

It adds the following parameter −

formdata − A dictionary (or an iterable of (key, value) tuples) containing HTML form data, which is URL-encoded and assigned to the body of the request.

Note − The remaining parameters are the same as for the Request class and are explained in the Request Objects section.

FormRequest objects support the following class method in addition to the standard Request methods −

classmethod from_response(response[, formname = None, formnumber = 0, formdata = None, 
   formxpath = None, formcss = None, clickdata = None, dont_click = False, ...])

The following table shows the parameters of the above method −

Sr.No Parameter & Description

1. response − A Response object containing the HTML form that is used to pre-populate the form fields.

2. formname − A string; if specified, the form whose name attribute is set to this value is used.

3. formnumber − An integer giving the index of the form to use when the response contains multiple forms. The first form (index 0) is used by default.

4. formdata − A dictionary of fields to override in the form data. If a field is already present in the form of the response, its value is replaced by the one passed here.

5. formxpath − A string; if specified, the first form that matches the XPath expression is used.

6. formcss − A string; if specified, the first form that matches the CSS selector is used.

7. clickdata − A dictionary of attributes used to look up the control that is clicked. If it is not given, the form is submitted by simulating a click on the first clickable element.

8. dont_click − A boolean; if set to True, the form data is submitted without clicking any element.

Examples

Following are some of the request usage examples −

Using FormRequest to send data via HTTP POST

The following code demonstrates how to return a FormRequest object when you want to simulate an HTML form POST in your spider −

# Inside a spider callback; assumes "from scrapy import FormRequest"
return [FormRequest(url = "http://www.something.com/post/action", 
   formdata = {'firstname': 'John', 'lastname': 'dave'}, 
   callback = self.after_post)]

Using FormRequest.from_response() to simulate a user login

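Websites usually provide pre-populated form fields through <input type="hidden"> elements, for example session-related data or authentication tokens.

The FormRequest.from_response() method can be used when you want these fields to be populated automatically while scraping, so that you only have to override a few of them, such as the user name and password.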

The following example demonstrates this.

import scrapy  
class DemoSpider(scrapy.Spider): 
   name = 'demo' 
   start_urls = ['http://www.something.com/users/login.php']  
   def parse(self, response): 
      return scrapy.FormRequest.from_response( 
         response, 
         formdata = {'username': 'admin', 'password': 'confidential'}, 
         callback = self.after_login 
      )  
   
   def after_login(self, response): 
      # response.body is bytes in Python 3, so search the decoded text instead 
      if "authentication failed" in response.text: 
         self.logger.error("Login failed") 
         return  
      # You can continue scraping here

Response Objects

A Response object represents an HTTP response that is fed to the spiders for processing. Its class signature is as follows −

class scrapy.http.Response(url[, status = 200, headers, body, flags])

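The following table shows the parameters of Response objects −

Sr.No Parameter & Description

1. url − A string that specifies the URL of the response.

2. status − An integer that specifies the HTTP status of the response. The default is 200.

3. headers − A dictionary-like object containing the response headers.

4. body − The response body as bytes. To access the decoded text as a string, use response.text from an encoding-aware subclass such as TextResponse.

5. flags − A list containing the flags of the response, for example 'cached' or 'redirected'.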

Response Subclasses

You can implement your own custom functionality by subclassing the Response class. The built-in Response subclasses are as follows −

TextResponse Objects

TextResponse objects add encoding capabilities to the base Response class, which on its own is meant to be used only for binary data such as images, sounds, or other media files. Its class signature is as follows −

class scrapy.http.TextResponse(url[, encoding[,status = 200, headers, body, flags]])

It adds the following parameter −

encoding − A string that specifies the encoding used to decode the response body.

Note − The remaining parameters are the same as for the Response class and are explained in the Response Objects section.

The following table shows the attributes supported by TextResponse objects in addition to the standard Response attributes −

Sr.No Attribute & Description

1. text − The response body as a string, decoded using the response encoding. The decoded result is cached, so response.text can be accessed multiple times without extra overhead.

2. encoding − A string containing the encoding of the response.

3. selector − A Selector instance that uses the response as its target. It is instantiated lazily on first access.

The following table shows the methods supported by TextResponse objects in addition to the standard Response methods −

Sr.No Method & Description

1. xpath(query) − A shortcut to TextResponse.selector.xpath(query).

2. css(query) − A shortcut to TextResponse.selector.css(query).

3. body_as_unicode() − Returns the same value as response.text, but is available as a method. It is kept for backward compatibility; prefer response.text.

HtmlResponse Objects

The HtmlResponse class is a subclass of TextResponse that adds encoding auto-discovery by looking at the HTML meta http-equiv attribute. Its parameters are the same as for the Response class and are explained in the Response Objects section. Its class signature is as follows −

class scrapy.http.HtmlResponse(url[,status = 200, headers, body, flags])

XmlResponse Objects

The XmlResponse class is a subclass of TextResponse that adds encoding auto-discovery by looking at the XML declaration line. Its parameters are the same as for the Response class and are explained in the Response Objects section. Its class signature is as follows −

class scrapy.http.XmlResponse(url[, status = 200, headers, body, flags])