The Scrapy shell can be used to scrape data interactively, without writing a spider, and with immediate feedback on errors. Its main purpose is to test XPath or CSS extraction expressions. It also helps you inspect the web pages from which you are scraping the data.
The shell can be configured to use the IPython console (a tool for interactive computing), a powerful interactive shell that provides auto-completion, colorized output, and more.
If you are working on a Unix platform, it is better to install IPython. You can also use bpython if IPython is unavailable.
You can configure the shell by setting the environment variable SCRAPY_PYTHON_SHELL or by adding the following to the scrapy.cfg file −
```
[settings]
shell = bpython
```
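The environment-variable route can be sketched as follows (the ipython value assumes IPython is actually installed; use bpython instead if that is your preference):

```shell
# Make Scrapy use IPython as its interactive shell for this session
export SCRAPY_PYTHON_SHELL=ipython
```

The environment variable takes effect only for the current shell session, whereas the scrapy.cfg setting is persistent for the project.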
Scrapy shell can be launched using the following command −
scrapy shell <url>
Here, url is the URL of the page whose data you want to scrape.
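When the URL contains characters that are special to your operating system's shell (such as the & that separates query-string arguments), it is safer to enclose it in quotes; a sketch, with example.com standing in for a real site:

```shell
# Quote the URL so the shell does not interpret the & characters
scrapy shell "http://www.example.com/page.html?arg1=val1&arg2=val2"
```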
The shell provides some additional shortcuts and Scrapy objects, as described in the following tables.

The shell provides the following shortcuts −
| Sr.No | Shortcut & Description |
| --- | --- |
| 1 | shelp() − Prints a help summary of the available objects and shortcuts. |
| 2 | fetch(request_or_url) − Fetches the response for the given request or URL and updates the associated shell objects accordingly. |
| 3 | view(response) − Opens the given response in your local browser for inspection; a base tag is appended to the response body so that external links display correctly. |
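Note that fetch() accepts a full Request object as well as a plain URL, which makes it easy to test non-GET requests from the shell; a sketch of a shell session (example.com stands in for a real site):

```
>> from scrapy import Request
>> req = Request("http://www.example.com", method="POST")
>> fetch(req)
>> view(response)
```

After the fetch() call, the request and response objects described below refer to the new page.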
The shell provides the following Scrapy objects −
| Sr.No | Object & Description |
| --- | --- |
| 1 | crawler − The current Crawler object. |
| 2 | spider − The spider that handles the current URL, or a default Spider object if no spider is defined for that URL. |
| 3 | request − The Request object of the last fetched page. |
| 4 | response − The Response object of the last fetched page. |
| 5 | settings − The current Scrapy settings. |
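These objects can be inspected directly at the shell prompt; for instance (a sketch, with outputs omitted):

```
>> settings.get('USER_AGENT')
>> request.url
>> response.status
>> response.headers
```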
Let us try scraping the scrapy.org site and then begin scraping data from reddit.com, as described below.
Before moving ahead, first we will launch the shell as shown in the following command −
scrapy shell 'http://scrapy.org' --nolog
Scrapy will display the available objects while using the above URL −
```
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 http://scrapy.org>
[s]   settings   <scrapy.settings.Settings object at 0x2bfd650>
[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s]   shelp()             Provides available objects and shortcuts with help option
[s]   fetch(req_or_url)   Collects the response from the request or URL and associated objects will get updated
[s]   view(response)      View the response for the given request
```
Next, begin working with the objects, as shown below −
```
>> response.xpath('//title/text()').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

>> fetch("http://reddit.com")
[s] Available Scrapy objects:
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 https://www.reddit.com/>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   shelp()             Shell help (print this help)
[s]   fetch(req_or_url)   Fetch request (or URL) and update local objects
[s]   view(response)      View response in a browser

>> response.xpath('//title/text()').extract()
[u'reddit: the front page of the internet']

>> request = request.replace(method="POST")
>> fetch(request)
[s] Available Scrapy objects:
[s]   crawler
...
```
You can inspect responses as they are being processed by the spider, to check whether you are getting the response you expect.
For instance −
```python
import scrapy

class SpiderDemo(scrapy.Spider):
   name = "spiderdemo"
   start_urls = [
      "http://mysite.com",
      "http://mysite1.org",
      "http://mysite2.net",
   ]

   def parse(self, response):
      # You can inspect one specific response
      if ".net" in response.url:
         from scrapy.shell import inspect_response
         inspect_response(response, self)
```
As shown in the above code, you can invoke the shell from a spider to inspect its responses, using the following function −
scrapy.shell.inspect_response
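Since the spider above is a single self-contained file, one way to run it without creating a full project is the runspider command (spiderdemo.py is a hypothetical filename, assumed to be wherever you saved the snippet). When the .net response is reached, inspect_response drops you into the shell; pressing Ctrl-D (Ctrl-Z on Windows) exits the shell and resumes the crawl:

```shell
# Run the standalone spider file directly, without a Scrapy project
scrapy runspider spiderdemo.py
```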
Now run the spider, and you will get the following screen −
```
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite.com> (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite1.org> (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite2.net> (referer: None)
[s] Available Scrapy objects:
[s]   crawler
...
>> response.url
'http://mysite2.net'
```
You can examine whether the extraction code is working using the following command −
>> response.xpath('//div[@class = "val"]')
It displays the output as
[]
The above expression produced only a blank output, which means the XPath did not match any element on the page. To investigate why, you can open the response in your local browser as follows −
>> view(response)
It displays the response as
True