I'm doing web scraping with Python, as shown below.
Traceback (most recent call last):
  File "link_network.py", line 81, in <module>
    G=make_network(args.url,urls)
  File "link_network.py", line 33, in make_network
    article_name = url.replace(entry_url, "").replace("/", "-")
AttributeError: 'NoneType' object has no attribute 'replace'
This error is displayed.
I don't know where or how I should fix it, so please let me know.
Fix
from urllib import request
from bs4 import BeautifulSoup

def extract_url(root_url):
    page = 1
    is_articles = True
    urls = []
    while is_articles:
        UserAgent = 'Mozilla/5.0 (Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15'
        html = request.urlopen(request.Request(root_url, None, {'User-Agent': UserAgent}))
        soup = BeautifulSoup(html, "html.parser")
        articles = soup.find_all("a")
        for article in articles:
            href = article.get("href")
            if href:  # skip placeholder hyperlinks, where get("href") returns None
                urls.append(href)
        is_articles = False
    return urls
https://docs.python.org/ja/3.7/howto/urllib2.html#headers
As this page explains, you can set the user agent in the request headers.
In addition, some services may block requests per IP if they are accessed very frequently by scraping.
A 403 may be returned in that case, so leave at least one second between accesses.
If that happens, it can become impossible to access the site from the same IP, so be careful.
In the worst case it can even become a matter for the police.
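For the one-second gap, here is a minimal sketch (the page list is a made-up example; header handling is shown further below):

import time
from urllib import request

# Hypothetical list of article pages to visit
page_urls = ['https://example.com/page1', 'https://example.com/page2']

for url in page_urls:
    html = request.urlopen(url)
    # ... parse html here ...
    time.sleep(1)  # leave at least one second between requests so the site does not block your IP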
At a minimum, for the UserAgent: where you open the page with
request.urlopen(URL)
you can pass a request.Request() object instead of the URL string, like this:
UserAgent='Mozilla/5.0 (Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15'
request.urlopen(request.Request(URL, None, {'User-Agent':UserAgent}))
Reference: urllib.request --- Extensible library for opening URLs
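Put together with the code in the question, a rough sketch looks like this (the URL is a placeholder for illustration):

from urllib import request
from bs4 import BeautifulSoup

UserAgent = 'Mozilla/5.0 (Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15'
URL = 'https://example.com'  # placeholder URL for illustration
# urlopen() accepts the Request object just like a plain URL string
html = request.urlopen(request.Request(URL, None, {'User-Agent': UserAgent}))
# The response object can be passed straight to BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(soup.title)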
NameError: name 'html' is not defined
In def make_network(root_url, urls):

try:
    UserAgent='Mozilla/5.0 (Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15'
    html = request.urlopen(request.Request(url, None, {'User-Agent':UserAgent}))
except urllib.error.HTTPError as e:
    print(e.reason)
except urllib.error.URLError as e:
    print(e.reason)
soup = BeautifulSoup(html, "html.parser")
This is probably because html is only assigned inside the try block: if urlopen() raises an exception, html is never defined, so it is undefined when it is passed as the first argument to BeautifulSoup(html, "html.parser").
html = ''
try:
    UserAgent='Mozilla/5.0 (Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15'
    html = request.urlopen(request.Request(url, None, {'User-Agent':UserAgent}))
except urllib.error.HTTPError as e:
    print(e.reason)
except urllib.error.URLError as e:
    print(e.reason)
soup = BeautifulSoup(html, "html.parser")
Why don't you try it?
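Note that with html = '' a failed request still falls through to BeautifulSoup with an empty string. If you would rather skip a URL entirely when the request fails, here is a sketch of one alternative (fetch_soup is a hypothetical helper name, not part of the original code):

import urllib.error
from urllib import request
from bs4 import BeautifulSoup

def fetch_soup(url):
    UserAgent = 'Mozilla/5.0 (Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15'
    try:
        html = request.urlopen(request.Request(url, None, {'User-Agent': UserAgent}))
    except urllib.error.HTTPError as e:
        print(e.reason)
        return None  # give up on this URL instead of parsing an empty string
    except urllib.error.URLError as e:
        print(e.reason)
        return None
    return BeautifulSoup(html, "html.parser")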
Maybe there is a placeholder hyperlink, where a placeholder hyperlink is an a element with no href attribute.
If the href attribute is not specified, the element represents a placeholder hyperlink.
If a placeholder hyperlink exists, the urls variable (a list) will contain None because of the following part of the extract_url function:
def extract_url(root_url):
    :
    while is_articles:
        :
        articles = soup.find_all("a")
        for article in articles:
            urls.append(article.get("href"))
        :
    return urls
If article.get("href") returns None (a placeholder hyperlink), change the code so that it is not added to urls:
for article in articles:
    href = article.get("href")
    if href:
        urls.append(href)
Try it now.
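You can also filter placeholder hyperlinks out at the search step, since find_all() accepts href=True to match only elements that actually have an href attribute:

articles = soup.find_all("a", href=True)  # only <a> elements that have an href
for article in articles:
    urls.append(article.get("href"))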