Scrapy - Item Loaders



Description

Item Loaders provide a convenient way to populate items with the data scraped from websites.

Declaring Item Loaders

Item Loaders are declared like Items, using a class definition.

For example −

from scrapy.loader import ItemLoader
# Newer Scrapy releases ship these processors in itemloaders.processors
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class DemoLoader(ItemLoader):
   default_output_processor = TakeFirst()
   title_in = MapCompose(str.title)   # on Python 3, str replaces Python 2's unicode
   title_out = Join()
   size_in = MapCompose(str.strip)
   # you can continue declaring processors for other fields here

In the above code, input processors are declared using the _in suffix and output processors are declared using the _out suffix.

The ItemLoader.default_input_processor and ItemLoader.default_output_processor attributes are used to declare default input/output processors.

Using Item Loaders to Populate Items

To use an Item Loader, first instantiate it. You can pass it a dict-like object (such as an Item or a dict); if you pass none, the loader instantiates an item itself using the Item class specified in the ItemLoader.default_item_class attribute.

  • You can use selectors to collect values into the Item Loader.

  • You can add more than one value to the same item field; the Item Loader joins those values later using the appropriate processing function.

The following code demonstrates how items are populated using Item Loaders −

from scrapy.loader import ItemLoader
from demoproject.items import Product

def parse(self, response):
   l = ItemLoader(item = Product(), response = response)
   l.add_xpath("title", "//div[@class = 'product_title']")
   l.add_xpath("title", "//div[@class = 'product_name']")
   l.add_xpath("desc", "//div[@class = 'desc']")
   l.add_css("size", "div#size")
   l.add_value("last_updated", "yesterday")
   return l.load_item()

As shown above, there are two different XPaths from which the title field is extracted using add_xpath() method −

1. //div[@class = "product_title"] 
2. //div[@class = "product_name"]

A similar XPath is then used for the desc field. The size data is extracted using the add_css() method, and last_updated is filled directly with the literal value "yesterday" using the add_value() method.

Once all the data is collected, call the ItemLoader.load_item() method, which returns the item populated with the data previously collected with the add_xpath(), add_css() and add_value() methods.

Input and Output Processors

Each field of an Item Loader contains one input processor and one output processor.

  • When data is extracted, the input processor of the field processes it and the result is collected and kept inside the ItemLoader.

  • After collecting all the data, call the ItemLoader.load_item() method to populate and get the Item object.

  • At that point, the output processor is called with the collected data, and its result is the final value assigned to the item.

The following code demonstrates how input and output processors are called for a specific field −

l = ItemLoader(Product(), some_selector)
l.add_xpath("title", xpath1) # [1]
l.add_xpath("title", xpath2) # [2]
l.add_css("title", css)      # [3]
l.add_value("title", "demo") # [4]
return l.load_item()         # [5]

Line 1 − The data of title is extracted from xpath1 and passed through the input processor and its result is collected and stored in ItemLoader.

Line 2 − Similarly, the title is extracted from xpath2 and passed through the same input processor and its result is added to the data collected for [1].

Line 3 − The title is extracted from css selector and passed through the same input processor and the result is added to the data collected for [1] and [2].

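Line 4 − Next, the value "demo" is assigned directly rather than extracted; it is still passed through the same input processor, and the result is added to the data collected for [1], [2] and [3].

Line 5 − Finally, the data collected in [1] to [4] is passed through the output processor of the title field, and the result is the value assigned to title in the Item.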
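The five steps above can be modelled in a few lines of plain Python. This is a simplified sketch of the data flow only, not Scrapy's implementation: TinyLoader, strip_each and join_with_space are hypothetical names introduced for this illustration, and the selectors are replaced by lists of strings that are assumed to be already extracted.

```python
# Simplified model of the Item Loader data flow (not Scrapy's real code).
# TinyLoader is a hypothetical stand-in used only for illustration.

class TinyLoader:
    def __init__(self, input_processor, output_processor):
        self.input_processor = input_processor
        self.output_processor = output_processor
        self.collected = {}  # field name -> values gathered so far

    def add_value(self, field, values):
        # Steps [1]-[4]: every added batch goes through the input
        # processor, and the result is appended to the field's data.
        processed = self.input_processor(values)
        self.collected.setdefault(field, []).extend(processed)

    def load_item(self):
        # Step [5]: the collected data of each field goes through the
        # output processor, and the result is assigned to the item.
        return {field: self.output_processor(values)
                for field, values in self.collected.items()}


def strip_each(values):
    """Input processor: trim whitespace from every value."""
    return [v.strip() for v in values]


def join_with_space(values):
    """Output processor: merge all collected values into one string."""
    return " ".join(values)


loader = TinyLoader(strip_each, join_with_space)
loader.add_value("title", ["  Plasma "])   # stands in for data from xpath1
loader.add_value("title", [" TV  "])       # stands in for data from xpath2
loader.add_value("title", ["demo"])        # literal value
item = loader.load_item()
print(item)
```

The point of the split is that cleaning happens once per added batch, while the decision about how to merge several values into one field is postponed until load_item() is called.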

Declaring Input and Output Processors

The input and output processors are declared in the ItemLoader definition. Apart from this, they can also be specified in the Item Field metadata.

For example −

import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_size(value):
   # keep only purely numeric sizes; returning None drops the value
   if value.isdigit():
      return value

class Product(scrapy.Item):
   name = scrapy.Field(
      input_processor = MapCompose(remove_tags),
      output_processor = Join(),
   )
   size = scrapy.Field(
      input_processor = MapCompose(remove_tags, filter_size),
      output_processor = TakeFirst(),
   )

>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item = Product())
>>> il.add_value('name', ['Hello', '<strong>world</strong>'])
>>> il.add_value('size', ['<span>100</span>'])
>>> il.load_item()

It displays an output as −

{'name': 'Hello world', 'size': '100'}

Item Loader Context

The Item Loader Context is a dict of arbitrary key/value pairs shared among all input and output processors.

For example, assume you have a function parse_length that receives a text value and extracts a length from it −

def parse_length(text, loader_context): 
   unit = loader_context.get('unit', 'cm') 
   
   # You can write parsing code of length here  
   return parsed_length

By accepting a loader_context argument, the function tells the Item Loader that it can receive the Item Loader context. There are several ways to change the values of the Item Loader context −

  • Modify the currently active Item Loader context −

loader = ItemLoader(product)
loader.context["unit"] = "mm"

  • On Item Loader instantiation −

loader = ItemLoader(product, unit = "mm")

  • On Item Loader declaration, for input/output processors that support being instantiated with an Item Loader context −

class ProductLoader(ItemLoader):
   length_out = MapCompose(parse_length, unit = "mm")
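As an illustration of a context-aware processor, the sketch below fills in parse_length with one possible parsing rule. The regular expression, the TO_MM conversion table and the choice to return a float in the requested unit are assumptions made for this example; only the loader_context parameter and the 'unit' key come from the text above. The function is called directly, with a plain dict standing in for the Item Loader context.

```python
import re

# Hypothetical conversion factors to millimetres, assumed for this sketch.
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}


def parse_length(text, loader_context):
    """Parse strings such as '45 cm' and return the length as a float
    expressed in the unit requested by the context (default: 'cm')."""
    unit = loader_context.get("unit", "cm")
    match = re.search(r"(\d+(?:\.\d+)?)\s*(mm|cm|m)\b", text)
    if match is None:
        return None  # nothing that looks like a length
    number, source_unit = float(match.group(1)), match.group(2)
    return number * TO_MM[source_unit] / TO_MM[unit]


print(parse_length("the length is 45cm", {}))              # default unit
print(parse_length("the length is 45cm", {"unit": "mm"}))  # context override
```

Because the unit is read from the context rather than hard-coded, the same function can serve loaders that want different units.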

ItemLoader Objects

An ItemLoader object is a new Item Loader for populating the given item. The class has the following signature −

class scrapy.loader.ItemLoader([item, selector, response, ]**kwargs)

The following table shows the parameters of ItemLoader objects −

Sr.No   Parameter & Description

1       item
        It is the item to populate by calling add_xpath(), add_css() or add_value().

2       selector
        It is used to extract data from websites.

3       response
        It is used to construct the selector using default_selector_class.

The following table shows the methods of ItemLoader objects −

Sr.No   Method, Description & Example

1       get_value(value, *processors, **kwargs)

        It processes the given value with the given processors and keyword arguments. The re keyword argument is a regular expression used to extract data from the value before the processors are applied.

        >>> from scrapy.loader.processors import TakeFirst
        >>> loader.get_value('title: demoweb', TakeFirst(), str.upper, re = 'title: (.+)')
        'DEMOWEB'

2       add_value(field_name, value, *processors, **kwargs)

        It processes the value and adds it to the given field. The value is first passed through get_value() with the given processors and keyword arguments, and then through the field input processor.

        loader.add_value('title', 'DVD')
        loader.add_value('colors', ['black', 'white'])
        loader.add_value('length', '80')
        loader.add_value('price', '2500')

3       replace_value(field_name, value, *processors, **kwargs)

        It is similar to add_value(), but replaces the collected data with the new value instead of adding to it.

        loader.replace_value('title', 'DVD')
        loader.replace_value('colors', ['black', 'white'])
        loader.replace_value('length', '80')
        loader.replace_value('price', '2500')

4       get_xpath(xpath, *processors, **kwargs)

        It is similar to get_value(), but receives an XPath instead of a value; the XPath is used to extract a list of strings from the selector associated with the Item Loader.

        # HTML code: <div class = "item-name">DVD</div>
        loader.get_xpath("//div[@class = 'item-name']")

        # HTML code: <div id = "length">the length is 45cm</div>
        loader.get_xpath("//div[@id = 'length']", TakeFirst(), re = "the length is (.*)")

5       add_xpath(field_name, xpath, *processors, **kwargs)

        It is similar to add_value(), but receives an XPath instead of a value; the strings extracted with the XPath are added to the field.

        # HTML code: <div class = "item-name">DVD</div>
        loader.add_xpath('name', '//div[@class = "item-name"]')

        # HTML code: <div id = "length">the length is 45cm</div>
        loader.add_xpath('length', '//div[@id = "length"]', re = 'the length is (.*)')

6       replace_xpath(field_name, xpath, *processors, **kwargs)

        It is similar to add_xpath(), but replaces the collected data instead of adding to it.

        # HTML code: <div class = "item-name">DVD</div>
        loader.replace_xpath('name', '//div[@class = "item-name"]')

        # HTML code: <div id = "length">the length is 45cm</div>
        loader.replace_xpath('length', '//div[@id = "length"]', re = 'the length is (.*)')

7       get_css(css, *processors, **kwargs)

        It is similar to get_value(), but receives a CSS selector instead of a value; the selector is used to extract a list of strings.

        loader.get_css("div.item-name")
        loader.get_css("div#length", TakeFirst(), re = "the length is (.*)")

8       add_css(field_name, css, *processors, **kwargs)

        It is similar to add_value(), but receives a CSS selector instead of a value; the strings extracted with the selector are added to the field.

        loader.add_css('name', 'div.item-name')
        loader.add_css('length', 'div#length', re = 'the length is (.*)')

9       replace_css(field_name, css, *processors, **kwargs)

        It is similar to add_css(), but replaces the collected data instead of adding to it.

        loader.replace_css('name', 'div.item-name')
        loader.replace_css('length', 'div#length', re = 'the length is (.*)')

10      load_item()

        It populates the item with the data collected so far and returns it.

        def parse(self, response):
           l = ItemLoader(item = Product(), response = response)
           l.add_xpath('title', '//div[@class = "product_title"]')
           return l.load_item()

11      nested_xpath(xpath)

        It creates a nested loader with an XPath selector, applied relative to the selector of the Item Loader.

        loader = ItemLoader(item = Item(), response = response)
        header_loader = loader.nested_xpath('//header')
        header_loader.add_xpath('social', 'a[@class = "social"]/@href')
        header_loader.add_xpath('email', 'a[@class = "email"]/@href')

12      nested_css(css)

        It creates a nested loader with a CSS selector, applied relative to the selector of the Item Loader.

        loader = ItemLoader(item = Item(), response = response)
        header_loader = loader.nested_css('header')
        header_loader.add_css('social', 'a.social::attr(href)')
        header_loader.add_css('email', 'a.email::attr(href)')

The following table shows the attributes of ItemLoader objects −

Sr.No   Attribute & Description

1       item
        It is the item object being parsed by the Item Loader.

2       context
        It is the currently active context of the Item Loader.

3       default_item_class
        It is the Item class used to instantiate items when no item is given in the constructor.

4       default_input_processor
        It is the default input processor, used only for the fields that don't specify one.

5       default_output_processor
        It is the default output processor, used only for the fields that don't specify one.

6       default_selector_class
        It is the class used to construct the selector when only a response is given in the constructor.

7       selector
        It is the object used to extract data from sites.

Nested Loaders

Nested loaders are useful when parsing related values from a subsection of a document. Without them, you need to specify the full XPath or CSS selector for each value that you want to extract.

For instance, assume that the data is being extracted from the header of a page −

<header>
   <a class = "social" href = "http://facebook.com/whatever">facebook</a>
   <a class = "social" href = "http://twitter.com/whatever">twitter</a>
   <a class = "email" href = "mailto:someone@example.com">send mail</a>
</header>

Next, you can create a nested loader with the header selector and add related values to it using paths relative to the header −

loader = ItemLoader(item = Item(), response = response)
header_loader = loader.nested_xpath('//header')
header_loader.add_xpath('social', 'a[@class = "social"]/@href')
header_loader.add_xpath('email', 'a[@class = "email"]/@href')
loader.load_item()
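The benefit of scoping queries to a subsection can be tried without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors; this is an analogy chosen for the example, not how nested loaders are implemented. The header element is located once, and the remaining paths are written relative to it. The surrounding html and body elements are added here only to give the header a parent.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
<header>
   <a class="social" href="http://facebook.com/whatever">facebook</a>
   <a class="social" href="http://twitter.com/whatever">twitter</a>
   <a class="email" href="mailto:someone@example.com">send mail</a>
</header>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Locate the subsection once ...
header = root.find(".//header")

# ... then query relative to it, instead of repeating the full path.
social = [a.get("href") for a in header.findall("a[@class='social']")]
email = [a.get("href") for a in header.findall("a[@class='email']")]

print(social)
print(email)
```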

Reusing and extending Item Loaders

Item Loaders are designed to ease the maintenance burden of parsing rules, which becomes a fundamental problem as your project grows and acquires more spiders.

For instance, assume that a site encloses its product names in three dashes (e.g. ---DVD---) and you don't want those dashes in the final product names. You can remove them by reusing and extending the DemoLoader, as shown in the following code −

from scrapy.loader.processors import MapCompose 
from demoproject.ItemLoaders import DemoLoader  

def strip_dashes(x): 
   return x.strip('-')  

class SiteSpecificLoader(DemoLoader): 
   title_in = MapCompose(strip_dashes, DemoLoader.title_in)
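The effect of the extended input processor can be checked on plain strings. The sketch below applies strip_dashes and then title-casing to a single value, which is the order in which SiteSpecificLoader.title_in would process each value, assuming DemoLoader.title_in title-cases its input as declared earlier. The site_specific_title helper and the sample product names are hypothetical, and no Scrapy import is needed for this check.

```python
def strip_dashes(x):
    # str.strip('-') removes leading and trailing dashes only
    return x.strip('-')


def site_specific_title(value):
    # Same order as MapCompose(strip_dashes, DemoLoader.title_in):
    # first the site-specific clean-up, then the inherited title-casing.
    return strip_dashes(value).title()


for raw in ["---plasma tv---", "---blu-ray player---"]:
    print(repr(site_specific_title(raw)))
```

Dashes inside a name are left untouched, so only the decoration added by the site is removed.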

Available Built-in Processors

Following are some of the commonly used built-in processors −

class scrapy.loader.processors.Identity

It returns the original value without altering it. For example −

>>> from scrapy.loader.processors import Identity
>>> proc = Identity()
>>> proc(['a', 'b', 'c'])
['a', 'b', 'c']

class scrapy.loader.processors.TakeFirst

It returns the first value that is non-null/non-empty from the list of received values. For example −

>>> from scrapy.loader.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'a', 'b', 'c'])
'a'
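What counts as "non-null/non-empty" is worth spelling out. The function below is a plain-Python sketch of the selection rule, on the assumption that only None and the empty string are skipped while other falsy values such as 0 are kept. The name take_first is hypothetical and is not part of Scrapy.

```python
def take_first(values):
    """Return the first value that is neither None nor '' (else None)."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


print(take_first(["", "a", "b", "c"]))
print(take_first([None, 0, 5]))
print(take_first(["", None]))
```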

class scrapy.loader.processors.Join(separator = ' ')

It returns the values joined with the separator given in the constructor. The default separator is ' ', so by default it is equivalent to the function ' '.join. For example −

>>> from scrapy.loader.processors import Join
>>> proc = Join()
>>> proc(['a', 'b', 'c'])
'a b c'
>>> proc = Join('<br>')
>>> proc(['a', 'b', 'c'])
'a<br>b<br>c'

class scrapy.loader.processors.Compose(*functions, **default_loader_context)

It is a processor built from the composition of the given functions: the input value is passed to the first function, the result of that function is passed to the second function, and so on, until the last function returns the final value as output.

For example −

>>> from scrapy.loader.processors import Compose
>>> proc = Compose(lambda v: v[0], str.upper)
>>> proc(['python', 'scrapy'])
'PYTHON'
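The chaining can be written out by hand to make the order of calls explicit. The compose helper below is a hypothetical plain-Python model, not Scrapy's class; it also assumes that the chain stops as soon as a step yields None, which is the default behaviour as far as this sketch is concerned.

```python
def compose(*functions):
    """Chain functions: the output of one is the input of the next."""
    def processor(value):
        for func in functions:
            if value is None:   # assumed default: stop once a step yields None
                break
            value = func(value)
        return value
    return processor


proc = compose(lambda v: v[0], str.upper)
print(proc(["python", "scrapy"]))
```

Note that the whole list is handed to the first function; nothing is applied element by element here.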

class scrapy.loader.processors.MapCompose(*functions, **default_loader_context)

It is a processor in which the input value is iterated and the first function is applied to each element. The results of these function calls are then concatenated to build a new iterable, which is passed to the second function in the same way, and so on, until the last function. A function may return None for an element, in which case that element is dropped from the chain.

For example −

>>> def filter_scrapy(x):
...    return None if x == 'scrapy' else x
...
>>> from scrapy.loader.processors import MapCompose
>>> proc = MapCompose(filter_scrapy, str.upper)
>>> proc(['hi', 'scrapy', 'im', 'pythonscrapy'])
['HI', 'IM', 'PYTHONSCRAPY']
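The per-element behaviour can be modelled in plain Python. The map_compose helper below is a hypothetical sketch, not Scrapy's implementation: it assumes that None results are dropped and that list or tuple results are flattened into the next stage, and it ignores other iterable types for brevity.

```python
def map_compose(*functions):
    """Apply each function to every element, stage by stage."""
    def processor(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is None:
                    continue                      # None results are dropped
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)    # sequences are flattened
                else:
                    next_values.append(result)
            values = next_values
        return values
    return processor


def filter_scrapy(x):
    return None if x == 'scrapy' else x


proc = map_compose(filter_scrapy, str.upper)
print(proc(['hi', 'scrapy', 'im', 'pythonscrapy']))

# A function that returns a list contributes several elements.
print(map_compose(str.split)(['a b', 'c']))
```

Only an exact match of 'scrapy' is filtered out; 'pythonscrapy' passes through and is upper-cased like the other values.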

class scrapy.loader.processors.SelectJmes(json_path)

This class queries the value using the JMESPath expression provided to the constructor and returns the output. It requires the jmespath library and takes only one input at a time.

For example −

>>> from scrapy.loader.processors import SelectJmes, Compose, MapCompose
>>> proc = SelectJmes("hello")
>>> proc({'hello': 'scrapy'})
'scrapy'
>>> proc({'hello': {'scrapy': 'world'}})
{'scrapy': 'world'}

The following code combines SelectJmes with json.loads to query values held in JSON strings −

>>> import json
>>> proc_single_json_str = Compose(json.loads, SelectJmes("hello"))
>>> proc_single_json_str('{"hello": "scrapy"}')
'scrapy'
>>> proc_json_list = Compose(json.loads, MapCompose(SelectJmes('hello')))
>>> proc_json_list('[{"hello":"scrapy"}, {"world":"env"}]')
['scrapy']