Scrapy - Shell


Description

The Scrapy shell is an interactive console in which you can try out scraping code quickly, without having to run a spider. Its main purpose is to test data extraction code, in particular XPath and CSS expressions, against the web pages you intend to scrape.

Configuring the Shell

If IPython (a console for interactive computing) is installed, the Scrapy shell uses it in place of the standard Python console. IPython is a powerful interactive shell that provides auto-completion, colorized output, and other conveniences.

Installing IPython is especially recommended if you work on a Unix platform. Scrapy also supports bpython and will try to use it when IPython is unavailable.

You can choose the shell explicitly by setting the environment variable SCRAPY_PYTHON_SHELL or by adding the following to your scrapy.cfg file −

[settings]
shell = bpython
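
As a rough illustration of how these two options could interact, the sketch below reads a shell option from scrapy.cfg-style text with Python's configparser and lets SCRAPY_PYTHON_SHELL override it. The helper name pick_shell, the precedence order, and the fallback to the plain python console are assumptions made for this example; it is not a copy of Scrapy's own lookup code.

```python
import configparser

# Hypothetical scrapy.cfg content, matching the fragment shown above.
CFG_TEXT = """
[settings]
shell = bpython
"""


def pick_shell(cfg_text, environ):
    """Return the preferred shell name (illustrative helper, not a Scrapy API).

    Assumed precedence: the SCRAPY_PYTHON_SHELL variable first, then the
    [settings] shell option, then the standard Python console.
    """
    env_value = environ.get("SCRAPY_PYTHON_SHELL")
    if env_value:
        return env_value.strip().lower()

    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if parser.has_option("settings", "shell"):
        return parser.get("settings", "shell").strip().lower()

    return "python"


from_cfg = pick_shell(CFG_TEXT, {})
from_env = pick_shell(CFG_TEXT, {"SCRAPY_PYTHON_SHELL": "ipython"})
print(from_cfg, from_env)  # prints: bpython ipython
```

Passing the environment in as a plain dictionary keeps the helper easy to try out without changing your real environment; in practice you would pass os.environ.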

Launching the Shell

Scrapy shell can be launched using the following command −

scrapy shell <url>

Here, <url> is the URL of the page you want to scrape.

Using the Shell

The shell provides some additional shortcuts and Scrapy objects, as described in the following tables −

Available Shortcuts

The shell provides the following shortcuts −

Sr.No   Shortcut & Description
1       shelp()
        Prints help listing the available objects and shortcuts.
2       fetch(request_or_url)
        Fetches a new response from the given request or URL and updates all related objects accordingly.
3       view(response)
        Opens the given response in your local web browser for inspection. To make external links display correctly, it adds a <base> tag to the response body.

Available Scrapy Objects

The shell provides the following Scrapy objects −

Sr.No   Object & Description
1       crawler
        The current Crawler object.
2       spider
        The spider known to handle the URL, or a new default Spider object if no spider is found for the current URL.
3       request
        The Request object of the last fetched page.
4       response
        The Response object containing the last fetched page.
5       settings
        The current Scrapy settings.

Example of Shell Session

Let us try scraping the scrapy.org site and then scrape data from reddit.com, as described below.

First, launch the shell with the following command −

scrapy shell 'http://scrapy.org' --nolog

The shell fetches the URL and then displays the available objects and shortcuts −

[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 http://scrapy.org>
[s]   settings   <scrapy.settings.Settings object at 0x2bfd650>
[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

Next, start working with the objects, as shown below −

>>> response.xpath('//title/text()').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
>>> fetch("http://reddit.com")
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
[s]   item       {}
[s]   request    <GET http://reddit.com>
[s]   response   <200 https://www.reddit.com/>
[s]   settings   <scrapy.settings.Settings object at 0x2bfd650>
[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>> response.xpath('//title/text()').extract()
[u'reddit: the front page of the internet']
>>> request = request.replace(method="POST")
>>> fetch(request)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...

Invoking the Shell from Spiders to Inspect Responses

Sometimes you want to inspect the responses that are being processed at a certain point in your spider, if only to check that the response you expect is the one arriving there.

For instance −

import scrapy


class SpiderDemo(scrapy.Spider):
    name = "spiderdemo"
    start_urls = [
        "http://mysite.com",
        "http://mysite1.org",
        "http://mysite2.net",
    ]

    def parse(self, response):
        # Inspect one specific response: open the shell only for the .net site
        if ".net" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

As shown in the above code, you can invoke the shell from spiders to inspect the responses using the following function −

scrapy.shell.inspect_response

Now run the spider, and you will get output similar to the following −

2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite.com> (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite1.org> (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite2.net> (referer: None)
[s] Available Scrapy objects:
[s]   crawler
...
>>> response.url
'http://mysite2.net'

You can then check whether your extraction code works as expected −

>>> response.xpath('//div[@class = "val"]')

It displays the output as

[]

The expression returned an empty list, which means that nothing in the response matched it. You can open the response in your web browser to check whether it is the response you expected −

>>> view(response)

The response opens in the browser, and the shell displays

True