As the name indicates, link extractors are objects used to extract links from web pages, operating on scrapy.http.Response objects. Scrapy ships with built-in extractors such as scrapy.linkextractors.LinkExtractor, and you can write your own link extractor to suit your needs by implementing a simple interface.
Every link extractor has a public method called extract_links, which takes a Response object and returns a list of scrapy.link.Link objects. You instantiate a link extractor only once and then call its extract_links method as many times as needed, with different responses, as the sketch below shows. The CrawlSpider class uses link extractors through a set of rules whose main purpose is to extract links.
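A minimal sketch of this pattern (the spider name and start URL here are placeholders, not part of any Scrapy API):

```python
import scrapy
from scrapy.linkextractors import LinkExtractor

# Instantiated once at module level; the same extractor is
# reused for every response the spider receives.
link_extractor = LinkExtractor()

class LinkSpider(scrapy.Spider):
    name = "link_demo"                       # hypothetical spider name
    start_urls = ["http://www.example.com"]  # hypothetical start URL

    def parse(self, response):
        # extract_links() takes a Response object and returns a
        # list of scrapy.link.Link objects.
        for link in link_extractor.extract_links(response):
            self.logger.info("Found link: %s", link.url)
```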
Link extractors are bundled with Scrapy and provided in the scrapy.linkextractors module. The default link extractor is LinkExtractor, which is functionally identical to LxmlLinkExtractor −
from scrapy.linkextractors import LinkExtractor
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)
LxmlLinkExtractor is the recommended link extractor, because it offers handy filtering options and is built on lxml's robust HTMLParser. A usage example follows the parameter table below.
Sr.No | Parameter & Description |
---|---|
1 | allow (a regular expression, or list of regular expressions) − A single expression or list of expressions that a URL must match in order to be extracted. If not given, it matches all links. |
2 | deny (a regular expression, or list of regular expressions) − A single expression or list of expressions that a URL must match in order to be excluded. If not given or left empty, no links are eliminated. |
3 | allow_domains (str or list) − A single domain or list of domains from which links are extracted. |
4 | deny_domains (str or list) − A single domain or list of domains from which links are not extracted. |
5 | deny_extensions (list) − A list of file extensions to ignore when extracting links. If not set, it defaults to IGNORED_EXTENSIONS, a predefined list in the scrapy.linkextractors package. |
6 | restrict_xpaths (str or list) − An XPath, or list of XPaths, defining the regions of the response from which links are extracted. If given, links are extracted only from the text selected by those XPaths. |
7 | restrict_css (str or list) − Behaves like restrict_xpaths, extracting links from the CSS-selected regions of the response. |
8 | tags (str or list) − A single tag or list of tags to consider when extracting links. Defaults to ('a', 'area'). |
9 | attrs (list) − A single attribute or list of attributes to consider when extracting links. Defaults to ('href',). |
10 | canonicalize (boolean) − Whether each extracted URL is brought to a standard form using scrapy.utils.url.canonicalize_url. Defaults to True. |
11 | unique (boolean) − Whether duplicate filtering is applied to the extracted links. Defaults to True. |
12 | process_value (callable) − A function that receives each value extracted from the scanned tags and attributes. It may modify the value and return it, or return None to reject the link. If not given, it defaults to lambda x: x. |
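As an illustration of these filtering options, the following sketch combines several of them; the patterns, domain, and XPath are examples only, not values prescribed by Scrapy:

```python
from scrapy.linkextractors import LinkExtractor

extractor = LinkExtractor(
    allow=(r"item\.php",),                 # example: only URLs containing item.php
    deny=(r"/login", r"/logout"),          # example: skip these paths
    allow_domains=("example.com",),        # example: stay within this domain
    restrict_xpaths=("//div[@class='content']",),  # example page region
    tags=("a", "area"),                    # the default tags
    attrs=("href",),                       # the default attribute
    unique=True,                           # drop duplicate links
)

# Inside a spider callback:
#     links = extractor.extract_links(response)
```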
For example, consider the following link, in which the target URL is embedded in a JavaScript call −
<a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
The following function can then be passed as process_value to recover the page URL −
import re

def process_value(val):
    # Pull the real page URL out of the javascript:goToPage('...') call.
    m = re.search(r"javascript:goToPage\('(.*?)'", val)
    if m:
        return m.group(1)
    # Implicitly returning None rejects any link that does not match.
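A minimal sketch of wiring this into an extractor (assuming the function above is in scope):

```python
from scrapy.linkextractors import LinkExtractor

# hrefs matching javascript:goToPage('...') are rewritten to the real URL;
# any href that does not match is rejected, because process_value returns None.
extractor = LinkExtractor(process_value=process_value)
```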