I'm studying scraping with Python in a virtual environment. This is from page 106 of Python Crawling & Scraping.
I copied the code exactly as written, but nothing is output. There are no errors, so I assume the code itself is fine. Why is there no output?
Normally, the URLs should be printed after the following line. Thank you in advance.
(scraping) vagrant@ubuntu-bionic:/vagrant$ python python_crawler_1.py
import requests
import lxml.html

response = requests.get('https://gihyo.jp/dp')
html = lxml.html.fromstring(response.text)
html.make_links_absolute(response.url)
for a in html.cssselect('#listbook>li>a [itemprop="url"]'):
    url = a.get('href')
    print(url)
(scraping) vagrant@ubuntu-bionic:/vagrant$ python python_crawler_1.py
(scraping) vagrant@ubuntu-bionic:/vagrant$
The CSS selector is wrong. The correct ID is #listBook, not #listbook.
The print(url) line must be executed for a URL to appear. Since no URL appears, this line is probably not being executed.
The print(url) line is inside the for statement. If this line is not executed, the body of the for statement never ran, not even once. The first suspect is html.cssselect('#listbook>li>a [itemprop="url"]'), so let's print its result.
If you print html.cssselect('#listbook>li>a[itemprop="url"]'), you can see that the result is an empty list. A for statement runs its body once for each element of the list, so if you pass it an empty list, the body never runs because there are zero elements.
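To make the empty-list behavior concrete, here is a minimal sketch using only plain Python lists. The function name print_urls and the sample values are made up for illustration; the empty list stands in for what cssselect returned with the wrong selector.

```python
def print_urls(links):
    """Print every item in links and return how many times the loop body ran."""
    runs = 0
    for url in links:
        print(url)
        runs += 1
    return runs


# Stand-in for the result of html.cssselect(...) with the wrong selector.
empty_result = []

print(empty_result)              # printing the list itself shows it is empty
print(len(empty_result))         # its length is zero
print(print_urls(empty_result))  # the loop body never runs, so nothing else is printed
```

Printing the list (or its length) before the loop is a quick way to tell "the loop ran but printed nothing" apart from "the loop never ran".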
Now we understand why the URLs were not displayed. Next, let's think about why the list is empty.
When I looked at the HTML source of https://gihyo.jp/dp in my browser, I noticed that there is no tag with the ID listbook. There is a tag with the very similar ID listBook, so the two were probably mixed up. ID values are case-sensitive, so #listbook does not match it. I have also confirmed that the tags around it can be selected with #listBook>li>a[itemprop="url"].
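The same check can be done in code instead of by eye. The following is only a sketch: it uses the standard library's html.parser and difflib rather than lxml, and the names IdCollector, suggest_ids, and SAMPLE_HTML are hypothetical. The HTML fragment is a made-up stand-in, not the real page source.

```python
import difflib
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the value of every id attribute found in the HTML."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. Attribute names are
        # lowercased by the parser, but values keep their original case.
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


def suggest_ids(html_text, wanted_id):
    """Return (exact_match_found, similar_ids) for wanted_id."""
    collector = IdCollector()
    collector.feed(html_text)
    exact = wanted_id in collector.ids  # case-sensitive comparison
    similar = difflib.get_close_matches(wanted_id, collector.ids)
    return exact, similar


# Hypothetical fragment standing in for the real page source.
SAMPLE_HTML = """
<ul id="listBook">
  <li><a itemprop="url" href="/dp/ebook/sample">Sample</a></li>
</ul>
<div id="footer"></div>
"""

print(suggest_ids(SAMPLE_HTML, "listbook"))
print(suggest_ids(SAMPLE_HTML, "listBook"))
```

For the real page you would pass response.text instead of SAMPLE_HTML. When the exact match is missing but a similar ID is suggested, a typo or a case mismatch in the selector is a likely cause.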