Hi, everyone. I'm studying by following the Scrapy tutorial, and I've run into something I can't figure out, so I'm asking a question. The code is as follows.
def parse(self, response):
    for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_dir_contents)
My question is about two parts of the function above: ul.directory.dir-col > li > a::attr('href') and callback=self.parse_dir_contents.
I think the first one is a CSS path, but I wonder whether the ">" symbol is just a replacement for "/". And I don't understand the callback=self.parse_dir_contents part at all, so I'd appreciate it if you could briefly explain the concept.
I've been searching the Internet and looking through the manual, but I still can't figure it out. I'm asking here because I'm frustrated.
python scrapy
First, ul.directory.dir-col > li > a::attr('href')
This is a CSS selector.
In a CSS selector, E > F selects the F elements that are direct children of E. There may be several matching Fs, but only the Fs immediately below E are chosen (a plain space, E F, would match descendants at any depth). So ">" is not a replacement for "/"; it is the CSS child combinator, which plays roughly the same role that "/" plays between the steps of an XPath expression.
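Below is a minimal sketch (not from the original post) showing the difference between the child combinator ">" and the descendant combinator, using scrapy.Selector on a small hypothetical HTML snippet.

from scrapy import Selector

html = """
<ul class="directory dir-col">
  <li><a href="/cats/">Cats</a></li>
  <li><a href="/dogs/">Dogs</a></li>
  <li><span><a href="/hidden/">Hidden</a></span></li>
</ul>
"""
sel = Selector(text=html)

# "> li > a" only matches <a> tags that are direct children of an <li>,
# so the link wrapped in a <span> is skipped.
print(sel.css("ul.directory.dir-col > li > a::attr(href)").getall())
# ['/cats/', '/dogs/']

# A space (descendant combinator) matches <a> tags at any depth below the <li>.
print(sel.css("ul.directory.dir-col li a::attr(href)").getall())
# ['/cats/', '/dogs/', '/hidden/']

The ::attr(href) part at the end simply tells Scrapy to return the value of the href attribute of each matched <a> element instead of the element itself.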
Second, callback=self.parse_dir_contents
The parse() method only extracts the interesting links from the page, builds a full absolute URL using the response.urljoin method (since the links can be relative), and yields new requests to be sent later, registering the method parse_dir_contents() as the callback that will ultimately scrape the data we want. What you see here is Scrapy's mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.
In short, parse() appears to be a function that finds the additional links contained in the document at the given URL. As the passage above describes, it extracts the sub-links from the response, registers them as the next crawling targets, and names parse_dir_contents() as the method Scrapy should call with each of those responses to do the actual scraping.
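For context, here is a minimal sketch of how the two methods fit together in a full spider, roughly following the structure of the Scrapy tutorial; the spider name, start URL, and the fields extracted in parse_dir_contents are hypothetical.

import scrapy

class DirectorySpider(scrapy.Spider):
    name = "directory"
    start_urls = ["https://example.com/directory/"]  # hypothetical start page

    def parse(self, response):
        # Step 1: collect the category links and schedule a request for each.
        # Scrapy will call parse_dir_contents with the response of every request.
        for href in response.css("ul.directory.dir-col > li > a::attr(href)").getall():
            url = response.urljoin(href)  # make relative links absolute
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Step 2: this runs once per followed link; it is where the actual
        # data is scraped (the selectors here are hypothetical).
        for site in response.css("div.site"):
            yield {
                "title": site.css("a::text").get(),
                "link": site.css("a::attr(href)").get(),
            }

So callback=self.parse_dir_contents does not call parse_dir_contents immediately; it just tells Scrapy which method to run later, once the scheduled request has been downloaded.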