When you are scraping web pages, you need to extract a certain part of the HTML source by using a mechanism called selectors, achieved by using either XPath or CSS expressions. Selectors are built upon the lxml library, which processes XML and HTML in Python.
The following HTML document will be used to demonstrate the different concepts of selectors −
<html>
   <head>
      <title>My Website</title>
   </head>
   <body>
      <span>Hello world!!!</span>
      <div class = 'links'>
         <a href = 'one.html'>Link 1<img src = 'image1.jpg'/></a>
         <a href = 'two.html'>Link 2<img src = 'image2.jpg'/></a>
         <a href = 'three.html'>Link 3<img src = 'image3.jpg'/></a>
      </div>
   </body>
</html>
You can construct Selector class instances by passing either text or a TextResponse object. Based on the provided input type, the selector chooses the appropriate parsing rules. Start by importing the required classes −
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
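For the snippets that follow, assume the sample HTML document shown above is stored in a variable named body −

# Store the sample HTML document shown above as a string,
# so it can be passed to Selector and HtmlResponse below.
body = '''
<html>
   <head><title>My Website</title></head>
   <body>
      <span>Hello world!!!</span>
      <div class = 'links'>
         <a href = 'one.html'>Link 1<img src = 'image1.jpg'/></a>
         <a href = 'two.html'>Link 2<img src = 'image2.jpg'/></a>
         <a href = 'three.html'>Link 3<img src = 'image3.jpg'/></a>
      </div>
   </body>
</html>
'''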
Using the above code, you can construct a selector from the text as −
Selector(text = body).xpath('//span/text()').extract()
It will display the result as −
[u'Hello world!!!']
You can construct a selector from the response as −
response = HtmlResponse(url = 'http://mysite.com', body = body)
Selector(response = response).xpath('//span/text()').extract()
It will display the result as −
[u'Hello world!!!']
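Selectors also support CSS expressions through the .css() method. As a minimal sketch, the same query can be written with Scrapy's ::text CSS extension −

# CSS equivalent of the XPath query above; ::text selects text nodes.
Selector(text = body).css('span::text').extract()

It will display the same result, [u'Hello world!!!'].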
Using the response defined above, you can construct an XPath for selecting the text defined in the title tag, as shown below −
>>> response.selector.xpath('//title/text()')
Now, you can extract the textual data using the .extract() method, as shown below. Note that response.xpath is a convenient shortcut for response.selector.xpath −
>>> response.xpath('//title/text()').extract()
It will produce the result as −
[u'My Website']
You can display the text of all the links as follows −
>>> response.xpath('//div[@class = "links"]/a/text()').extract()
It will display the elements as −
[u'Link 1', u'Link 2', u'Link 3']
If you want to extract only the first matched element, then use the method .extract_first(), as shown below −
>>> response.xpath('//div[@class = "links"]/a/text()').extract_first()
It will display the element as −
u'Link 1'
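If the expression matches nothing, .extract_first() returns None; it also accepts a default keyword argument. A minimal sketch, using a deliberately non-matching class name −

# No <div class = "missing"> exists, so the default value is returned.
>>> response.xpath('//div[@class = "missing"]/a/text()').extract_first(default = 'not-found')
'not-found'

Newer Scrapy versions also provide .get() and .getall() as more readable aliases for .extract_first() and .extract().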
You can also nest selectors to display the page link and image source of each link using the .xpath() method, as shown below −
links = response.xpath('//div[@class = "links"]/a')
for index, link in enumerate(links, start = 1):
   args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
   print 'Link %d pointing to url %s and image %s' % args
It will display the result as −
Link 1 pointing to url [u'one.html'] and image [u'image1.jpg']
Link 2 pointing to url [u'two.html'] and image [u'image2.jpg']
Link 3 pointing to url [u'three.html'] and image [u'image3.jpg']
Scrapy also allows you to extract data using regular expressions via the .re() method. From the above HTML code, we will extract the link texts, as shown below −
>>> response.xpath('//div[@class = "links"]/a/text()').re(r'Link\s*\d')
The above line displays the link texts as −
[u'Link 1', u'Link 2', u'Link 3']
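Scrapy also provides a .re_first() method, which returns only the first match −

>>> response.xpath('//div[@class = "links"]/a/text()').re_first(r'Link\s*\d')
u'Link 1'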
When you are working with nested selectors, remember that an XPath which starts with / is relative to the absolute path of the document, and not to the selector you call it on.
Suppose you want to extract the <p> elements inside the div elements; then, first get all the div elements −
>>> mydiv = response.xpath('//div')
Next, you can extract all the 'p' elements inside, by prefixing the XPath with a dot as .//p, as shown below −
>>> for p in mydiv.xpath('.//p').extract():
...    print p
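For contrast, the following sketch uses the absolute expression //p, which would select every <p> element in the whole document rather than only those inside mydiv −

# Wrong for nested selection: //p ignores the mydiv context.
>>> for p in mydiv.xpath('//p').extract():
...    print p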
EXSLT is a community initiative that provides extensions to XSLT (Extensible Stylesheet Language Transformations), which transforms XML documents into other documents such as XHTML. You can use the EXSLT extensions with their registered namespaces in XPath expressions, as shown in the following table −
Sr.No | Prefix & Usage | Namespace
---|---|---
1 | re (regular expressions) | http://exslt.org/regular-expressions
2 | set (set manipulation) | http://exslt.org/sets
You can check the simple code format for extracting data using regular expressions in the previous section.
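As a minimal sketch of the re prefix in action (Scrapy pre-registers these EXSLT namespaces in its selectors), the following uses the EXSLT re:test() function to match links whose href ends in .html −

# re:test() applies a regular expression inside the XPath itself.
>>> response.xpath('//a[re:test(@href, "\.html$")]/text()').extract()
[u'Link 1', u'Link 2', u'Link 3']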
There are some tips which are useful when using XPath with Scrapy selectors; for more information, refer to the XPath tips in the official Scrapy documentation.