Error while scraping in Python

I'm scraping with Python using the code below.

Traceback (most recent call last):
  File "link_network.py", line 81, in <module>
    G=make_network(args.url,urls)
  File "link_network.py", line 33, in make_network
    article_name = url.replace(entry_url, "").replace("/", "-")
AttributeError: 'NoneType' object has no attribute 'replace'

The error above is displayed.
I don't know where in the code the fix should go, so please let me know.

Revised code

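from urllib import request

from bs4 import BeautifulSoup


def extract_url(root_url):
    """Return the href of every <a> element on the page at root_url."""
    page = 1
    is_articles = True
    urls = []

    while is_articles:
        # Send a browser-like User-Agent instead of urllib's default one.
        UserAgent = 'Mozilla/5.0 (Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15'
        html = request.urlopen(request.Request(root_url, None, {'User-Agent': UserAgent}))
        soup = BeautifulSoup(html, "html.parser")
        articles = soup.find_all("a")
        for article in articles:
            href = article.get("href")
            # <a> elements without an href return None; skip them.
            if href:
                urls.append(href)
        # Only one page is fetched, so stop after the first pass.
        is_articles = False
    return urls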

python python3 web-scraping

2022-09-30 11:29

3 Answers

https://docs.python.org/ja/3.7/howto/urllib2.html#headers

As this page explains, you can set the User-Agent in the request headers.

Also, some services block an IP address if scraping hits them too frequently.
In that case a 403 may be returned, so you should wait at least one second between requests.
Once you are blocked, you can no longer access the site from the same IP address, so be careful.
In the worst case, the police may get involved.

https://ja.wikipedia.org/wiki/%E5%B2%A1%E5%B4%8E%E5%B8%82%E7%AB%8B%E4%B8%AD%E5%A4%AE%E5%9B%B3%E6%9B%B8%E9%A4%A8%E4%BA%8B%E4%BB%B6
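
The one-second wait suggested above could look roughly like the following. This is only a sketch: fetch_politely, its fetch parameter, and the example.com URLs are hypothetical names introduced for illustration. The fetch function is passed in so that the pacing logic can be tried without touching a real site.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests.

    fetch is any callable that takes a URL and returns the response body
    (hypothetical here; in the question's code it would wrap request.urlopen).
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


def fake_fetch(url):
    # Stand-in fetch function so the example runs without network access.
    return "body of " + url


pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=1.0,
)
print(pages)
```

A fixed delay is the simplest choice; whether one second is enough depends on the site, so check its terms of use and robots.txt as well.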

2022-09-30 11:29

As for the User-Agent at least: where the code opens the URL in this form

request.urlopen(URL)

you can specify it by passing a request.Request() object instead of the URL string.

UserAgent='Mozilla/5.0 (Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15'

request.urlopen(request.Request(URL, None, {'User-Agent':UserAgent}))

Reference: urllib.request --- Extensible library for opening URLs
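
To check that the header is really attached before sending anything, the Request object can be inspected on its own. A small sketch, assuming a placeholder User-Agent string and example.com as the target (neither is from the question):

```python
from urllib import request

USER_AGENT = "Mozilla/5.0 (example-scraper)"  # placeholder value
url = "https://example.com/"  # placeholder URL

# Third positional argument is the headers dict; the second (data) stays None.
req = request.Request(url, None, {"User-Agent": USER_AGENT})

# urllib stores header names capitalized, so the key becomes "User-agent".
print(req.get_header("User-agent"))
print(req.full_url)

# request.urlopen(req) would then send the request with this header.
```

No network access happens here, because request.urlopen is never called.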

Regarding NameError: name 'html' is not defined

In def make_network(root_url, urls):, look at this part:

try:
    UserAgent = 'Mozilla/5.0 (Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15'
    html = request.urlopen(request.Request(url, None, {'User-Agent': UserAgent}))
except urllib.error.HTTPError as e:
    print(e.reason)
except urllib.error.URLError as e:
    print(e.reason)
soup = BeautifulSoup(html, "html.parser")

The error is probably raised because html is assigned only inside the try block. If request.urlopen raises an exception, the assignment never runs, so html is still undefined when it is passed as the first argument of BeautifulSoup(html, "html.parser").

# Give html a default value so that it exists even if urlopen fails.
html = ""
try:
    UserAgent = 'Mozilla/5.0 (Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15'
    html = request.urlopen(request.Request(url, None, {'User-Agent': UserAgent}))
except urllib.error.HTTPError as e:
    print(e.reason)
except urllib.error.URLError as e:
    print(e.reason)
soup = BeautifulSoup(html, "html.parser")

How about trying this?
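
Another way to handle a failed request is to return None and let the caller skip that URL, rather than parsing an empty string. The following is a sketch under assumptions: fetch_html, its opener parameter, and failing_opener are hypothetical helpers that do not appear in the question, and the opener is injectable only so the failure path can be shown without network access.

```python
import urllib.error
from urllib import request

USER_AGENT = "Mozilla/5.0 (example-scraper)"  # placeholder value


def fetch_html(url, opener=request.urlopen):
    """Return the response object for url, or None if the request fails."""
    try:
        return opener(request.Request(url, None, {"User-Agent": USER_AGENT}))
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(url, e.code, e.reason)
    except urllib.error.URLError as e:
        print(url, e.reason)
    return None


def failing_opener(req):
    # Stand-in that simulates a network failure without real access.
    raise urllib.error.URLError("simulated failure")


html = fetch_html("https://example.com/", opener=failing_opener)
if html is None:
    print("skipping this URL")
```

Inside a loop over urls, the caller would use continue when fetch_html returns None, so BeautifulSoup is only called with a real response.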

2022-09-30 11:29

Maybe the page contains a placeholder hyperlink, that is, an a element with no href attribute.

Elements/a - W3C Wiki

If the href attribute is not specified, the element represents a placeholder hyperlink.

If a placeholder hyperlink exists, the urls variable (a list) will contain None because of the following part of the extract_url function:

def extract_url(root_url):
           :    
    while is_articles:
           :    
        articles=soup.find_all("a")
        for article in articles:
            urls.append(article.get("href"))
           :    
    return urls
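
To see how a None ends up in the list, here is a small stand-alone sketch. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; HrefCollector and the sample HTML are made up for illustration. BeautifulSoup's article.get("href") behaves the same way for a missing attribute: it returns None.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href of every <a> start tag, including missing ones."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # dict.get returns None when the attribute is absent.
            self.hrefs.append(dict(attrs).get("href"))


collector = HrefCollector()
collector.feed('<a href="/post/1">Post</a><a>placeholder</a>')
print(collector.hrefs)  # ['/post/1', None]
```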

If article.get("href") returns None (a placeholder hyperlink), change the code so that the value is not added to urls.

        for article in articles:
            href = article.get("href")
            if href:
                urls.append(href)

Please give this a try.
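
The effect of the filter on the line from the traceback can be checked in isolation. In this sketch, keep_valid_hrefs and to_article_name are hypothetical helper names and the URLs are made-up sample data; only the replace expression itself comes from the traceback.

```python
def keep_valid_hrefs(hrefs):
    """Drop None and empty strings, as the `if href:` check does."""
    return [href for href in hrefs if href]


def to_article_name(url, entry_url):
    # Same expression as line 33 of link_network.py in the traceback.
    return url.replace(entry_url, "").replace("/", "-")


entry_url = "https://example.com/"
raw_hrefs = ["https://example.com/2022/09/post", None, "https://example.com/about"]

names = [to_article_name(url, entry_url) for url in keep_valid_hrefs(raw_hrefs)]
print(names)  # ['2022-09-post', 'about']
```

Without the filter, the None entry would reach url.replace and raise the same AttributeError as in the question.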

2022-09-30 11:29
