Scrapy crawls websites using Request and Response objects. Request objects are generated in the spiders and pass through the system to the downloader, which executes the request and returns a Response object that travels back to the spider that issued it.
A Request object is an HTTP request that generates a response. It has the following class −
```python
class scrapy.http.Request(url[, callback, method = 'GET', headers, body, cookies, meta,
   encoding = 'utf-8', priority = 0, dont_filter = False, errback])
```
Following table shows the parameters of Request objects −
| Sr.No | Parameter & Description |
| --- | --- |
| 1 | **url** It is a string that specifies the URL of the request. |
| 2 | **callback** It is a callable that is called with the response of this request as its first parameter. |
| 3 | **method** It is a string that specifies the HTTP method of the request. |
| 4 | **headers** It is a dictionary containing the request headers. |
| 5 | **body** It is a string or unicode containing the request body. |
| 6 | **cookies** It is a list or dictionary containing the request cookies. |
| 7 | **meta** It is a dictionary that contains arbitrary metadata values for the request. |
| 8 | **encoding** It is a string with the encoding (defaults to 'utf-8') used to encode the URL and body. |
| 9 | **priority** It is an integer used by the scheduler to define the order in which requests are processed. |
| 10 | **dont_filter** It is a boolean specifying that the scheduler should not filter the request as a duplicate. |
| 11 | **errback** It is a callable to be called if an exception is raised while processing the request. |
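As a hedged sketch of how these parameters fit together (the URL, header values, and callback names below are placeholders, not part of the tutorial's examples), a spider might build a request like this −

```python
import scrapy

class RequestDemoSpider(scrapy.Spider):
   name = "request_demo"

   def start_requests(self):
      # hypothetical URL, headers, and cookies used only for illustration
      yield scrapy.Request(
         url = "http://www.something.com/some_page.html",
         callback = self.parse_detail,
         method = 'GET',
         headers = {'User-Agent': 'demo-bot'},
         cookies = {'session': 'abc123'},
         meta = {'page_name': 'detail'},   # arbitrary user metadata carried with the request
         priority = 10,
         dont_filter = False,
         errback = self.handle_error,
      )

   def parse_detail(self, response):
      self.logger.info("Got %s for %s", response.status, response.meta['page_name'])

   def handle_error(self, failure):
      self.logger.error(repr(failure))
```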
The callback function of a request is called, with the downloaded response as its first parameter, once that response has been downloaded.
For example −
```python
def parse_page1(self, response):
   return scrapy.Request("http://www.something.com/some_page.html",
      callback = self.parse_page2)

def parse_page2(self, response):
   self.logger.info("%s page visited", response.url)
```
You can use the Request.meta attribute if you want to pass arguments to callable functions and receive those arguments in the second callback, as shown in the following example −
```python
def parse_page1(self, response):
   item = DemoItem()
   item['foremost_link'] = response.url
   request = scrapy.Request("http://www.something.com/some_page.html",
      callback = self.parse_page2)
   request.meta['item'] = item
   return request

def parse_page2(self, response):
   item = response.meta['item']
   item['other_link'] = response.url
   return item
```
The errback is a callable to be called when an exception is raised while processing a request.
The following example demonstrates this −
```python
import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class DemoSpider(scrapy.Spider):
   name = "demo"
   start_urls = [
      "http://www.httpbin.org/",             # HTTP 200 expected
      "http://www.httpbin.org/status/404",   # Webpage not found
      "http://www.httpbin.org/status/500",   # Internal server error
      "http://www.httpbin.org:12345/",       # timeout expected
      "http://www.httphttpbinbin.org/",      # DNS error expected
   ]

   def start_requests(self):
      for u in self.start_urls:
         yield scrapy.Request(u, callback = self.parse_httpbin,
            errback = self.errback_httpbin,
            dont_filter = True)

   def parse_httpbin(self, response):
      self.logger.info('Received response from {}'.format(response.url))
      # ...

   def errback_httpbin(self, failure):
      # logs failures
      self.logger.error(repr(failure))

      if failure.check(HttpError):
         response = failure.value.response
         self.logger.error("HttpError occurred on %s", response.url)

      elif failure.check(DNSLookupError):
         request = failure.request
         self.logger.error("DNSLookupError occurred on %s", request.url)

      elif failure.check(TimeoutError, TCPTimedOutError):
         request = failure.request
         self.logger.error("TimeoutError occurred on %s", request.url)
```
Request.meta supports a number of special keys that are recognized by Scrapy and its built-in extensions.
Following table shows some of the keys of Request.meta −
| Sr.No | Key & Description |
| --- | --- |
| 1 | **dont_redirect** It is a key which, when set to True, prevents the request from being redirected based on the status of the response. |
| 2 | **dont_retry** It is a key which, when set to True, prevents failed requests from being retried by the retry middleware. |
| 3 | **handle_httpstatus_list** It is a key that defines which response codes are allowed on a per-request basis. |
| 4 | **handle_httpstatus_all** It is a key used to allow any response code for a request by setting it to True. |
| 5 | **dont_merge_cookies** It is a key used to avoid merging the request cookies with the existing session cookies by setting it to True. |
| 6 | **cookiejar** It is a key used to keep multiple cookie sessions per spider. |
| 7 | **dont_cache** It is a key used to avoid caching HTTP requests and responses with each cache policy. |
| 8 | **redirect_urls** It is a key which contains the URLs through which the request has passed while being redirected. |
| 9 | **bindaddress** It is the IP address of the outgoing network interface used to perform the request. |
| 10 | **dont_obey_robotstxt** It is a key which, when set to True, does not filter out requests prohibited by the robots.txt exclusion standard, even if ROBOTSTXT_OBEY is enabled. |
| 11 | **download_timeout** It is used to set the timeout (in seconds) per request for which the downloader will wait before it times out. |
| 12 | **download_maxsize** It is used to set the maximum response size (in bytes) per request that the downloader will download. |
| 13 | **proxy** It sets the HTTP proxy to be used for this request. |
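For instance, a request might combine several of these keys. The following is a minimal sketch with a placeholder URL and proxy address; in a spider this request would be yielded from a callback −

```python
import scrapy

# Placeholder URL and proxy, used only to illustrate the special meta keys
request = scrapy.Request(
   "http://www.something.com/some_page.html",
   meta = {
      'dont_redirect': True,              # the redirect middleware will not follow 3xx responses
      'dont_retry': True,                 # the retry middleware will not retry this request on failure
      'download_timeout': 30,             # the downloader gives up after 30 seconds
      'proxy': 'http://127.0.0.1:8080',   # route this request through a local proxy
      'cookiejar': 1,                     # keep a separate cookie session for this request chain
   },
)
```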
You can implement your own custom functionality by subclassing the Request class. The built-in Request subclasses are as follows −
FormRequest Objects
The FormRequest class deals with HTML forms by extending the base Request class. It has the following class −
```python
class scrapy.http.FormRequest(url[, formdata, callback, method = 'GET', headers, body,
   cookies, meta, encoding = 'utf-8', priority = 0, dont_filter = False, errback])
```
Following is the parameter −
formdata − It is a dictionary having HTML form data that is assigned to the body of the request.
Note − The remaining parameters are the same as those of the Request class and are explained in the Request Objects section.
The following class methods are supported by FormRequest objects in addition to request methods −
```python
classmethod from_response(response[, formname = None, formnumber = 0, formdata = None,
   formxpath = None, formcss = None, clickdata = None, dont_click = False, ...])
```
The following table shows the parameters of the above class −
| Sr.No | Parameter & Description |
| --- | --- |
| 1 | **response** It is the object whose HTML form is used to pre-populate the form fields. |
| 2 | **formname** It is a string; if specified, the form whose name attribute matches this value is used. |
| 3 | **formnumber** It is an integer giving the index of the form to use when the response contains multiple forms. |
| 4 | **formdata** It is a dictionary of fields that override the values found in the form data. |
| 5 | **formxpath** It is a string; if specified, the form matching the XPath is used. |
| 6 | **formcss** It is a string; if specified, the form matching the CSS selector is used. |
| 7 | **clickdata** It is a dictionary of attributes used to look up the control that is clicked. |
| 8 | **dont_click** When set to True, the form data is submitted without clicking any element. |
Following are some of the request usage examples −
Using FormRequest to send data via HTTP POST
The following code demonstrates how to return a FormRequest object when you want to simulate an HTML form POST in your spider −
```python
return [FormRequest(url = "http://www.something.com/post/action",
   formdata = {'firstname': 'John', 'lastname': 'dave'},
   callback = self.after_post)]
```
Using FormRequest.from_response() to simulate a user login
Normally, websites use elements such as `<input type="hidden">` through which they provide pre-populated form fields.
The FormRequest.from_response() method can be used when you want these fields to be automatically populated while scraping.
The following example demonstrates this.
```python
import scrapy

class DemoSpider(scrapy.Spider):
   name = 'demo'
   start_urls = ['http://www.something.com/users/login.php']

   def parse(self, response):
      return scrapy.FormRequest.from_response(
         response,
         formdata = {'username': 'admin', 'password': 'confidential'},
         callback = self.after_login
      )

   def after_login(self, response):
      if "authentication failed" in response.text:
         self.logger.error("Login failed")
         return
      # You can continue scraping here
```
Response Objects
A Response object represents an HTTP response, which is fed to the spiders for processing. It has the following class −
```python
class scrapy.http.Response(url[, status = 200, headers, body, flags])
```
The following table shows the parameters of Response objects −
| Sr.No | Parameter & Description |
| --- | --- |
| 1 | **url** It is a string that specifies the URL of the response. |
| 2 | **status** It is an integer that contains the HTTP status of the response. |
| 3 | **headers** It is a dictionary containing the response headers. |
| 4 | **body** It is a string with the response body. |
| 5 | **flags** It is a list containing the flags of the response. |
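As an illustrative sketch (not part of the tutorial's example code), a Response can also be constructed by hand, which is handy for exercising a parse callback without a live crawl; the URL, headers, and body below are placeholders −

```python
from scrapy.http import Response

# Hypothetical values, used only to illustrate the constructor parameters
response = Response(
   url = "http://www.something.com/some_page.html",
   status = 200,
   headers = {'Content-Type': 'text/html'},
   body = b"<html><body>Hello</body></html>",
)

print(response.status)                    # 200
print(response.headers['Content-Type'])   # b'text/html'
print(response.body)                      # raw bytes of the body
```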
You can implement your own custom functionality by subclassing the Response class. The built-in response subclasses are as follows −
TextResponse objects
TextResponse objects add encoding capabilities to the base Response class, which is intended to be used only for binary data such as images, sounds, or other media files. It has the following class −
```python
class scrapy.http.TextResponse(url[, encoding[, status = 200, headers, body, flags]])
```
Following is the parameter −
encoding − It is a string with the encoding that is used to encode the response.
Note − The remaining parameters are the same as those of the Response class and are explained in the Response Objects section.
The following table shows the attributes supported by TextResponse objects in addition to Response attributes −
| Sr.No | Attribute & Description |
| --- | --- |
| 1 | **text** It is the response body as unicode; response.text can be accessed multiple times. |
| 2 | **encoding** It is a string containing the encoding of the response. |
| 3 | **selector** It is a Selector instance, instantiated on first access, that uses the response as its target. |
The following table shows the methods supported by TextResponse objects in addition to response methods −
| Sr.No | Method & Description |
| --- | --- |
| 1 | **xpath (query)** It is a shortcut to TextResponse.selector.xpath(query). |
| 2 | **css (query)** It is a shortcut to TextResponse.selector.css(query). |
| 3 | **body_as_unicode()** It returns the response body as unicode, available as a method; response.text is the preferred way to access it. |
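The short sketch below, using a made-up URL and body, illustrates these attributes and shortcuts on a manually built TextResponse; it is only an illustration, not code from the tutorial −

```python
from scrapy.http import TextResponse

# Placeholder URL and body; the body bytes are the UTF-8 encoding of "Café"
response = TextResponse(
   url = "http://www.something.com/hello.html",
   body = b"<html><body><h1>Caf\xc3\xa9</h1></body></html>",
   encoding = 'utf-8',
)

print(response.encoding)                                # utf-8
print(response.text)                                    # decoded body as unicode
print(response.xpath('//h1/text()').extract_first())    # Café
print(response.css('h1::text').extract_first())         # Café
```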
HtmlResponse Objects
It is an object that supports encoding auto-discovery by looking at the meta http-equiv attribute of the HTML. Its parameters are the same as those of the Response class and are explained in the Response Objects section. It has the following class −
```python
class scrapy.http.HtmlResponse(url[, status = 200, headers, body, flags])
```
XmlResponse Objects
It is an object that supports encoding auto-discovery by looking at the XML declaration line. Its parameters are the same as those of the Response class and are explained in the Response Objects section. It has the following class −
```python
class scrapy.http.XmlResponse(url[, status = 200, headers, body, flags])
```
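As a hedged illustration of the auto-discovery described above, the snippet below builds an HtmlResponse and an XmlResponse whose bodies declare their own encodings and prints what Scrapy detects; the URLs and bodies are placeholders, and the exact codec name reported can vary between Scrapy versions −

```python
from scrapy.http import HtmlResponse, XmlResponse

# Bodies that declare their encoding via the meta http-equiv attribute and the XML declaration
html_body = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>hi</body></html>'
xml_body = b'<?xml version="1.0" encoding="utf-8"?><root>hi</root>'

html_response = HtmlResponse(url = "http://www.something.com/page.html", body = html_body)
xml_response = XmlResponse(url = "http://www.something.com/feed.xml", body = xml_body)

# No encoding argument was passed, so it is discovered from the documents themselves
print(html_response.encoding)
print(xml_response.encoding)
```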