I'm trying to crawl a website, but I can't get any further because the request fails with an HTTP error.

import urllib.request
from bs4 import BeautifulSoup

url = 'https://kr.iherb.com/search?kw=21st%20century'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

address = soup.find_all(class_='absolute-link product-link')

for i in address:
    print(i.attrs['href'])
    print()

When I searched, I found advice to set a User-Agent header, so I tried that, but I still got an error and I'm stuck. The same code worked with Naver and Google URLs, and I can't figure this out no matter how many HTML or CSS courses I look through.

html python beautifulsoup

2022-09-20 11:05

1 Answer

Try adding the User-Agent header:

url = 'https://somewhere.com'

request = urllib.request.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0')

html = urllib.request.urlopen(request).read()

The User-Agent string used in this example was taken from the Firefox browser.
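If it helps, here is a slightly fuller sketch of the same idea that also catches the HTTP error, so you can see which status code the server returns. The helper names `build_request` and `fetch_html` are my own for illustration, not part of urllib, and I can't promise the site will accept this particular User-Agent.

```python
import urllib.error
import urllib.request

# Same Firefox User-Agent string as in the answer above.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) '
              'Gecko/20100101 Firefox/99.0')


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    # build_request is a hypothetical helper name, not a urllib function.
    request = urllib.request.Request(url)
    request.add_header('User-Agent', USER_AGENT)
    return request


def fetch_html(url):
    """Fetch the page body as bytes, or return None on an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # err.code is the status code the server answered with.
        print('HTTP error', err.code, err.reason)
        return None
```

You could then call `html = fetch_html(url)` with the URL from the question and, if the result is not `None`, pass it to `BeautifulSoup(html, 'html.parser')` exactly as before.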

2022-09-20 11:05
