I wrote a crawler in Python, but it sometimes raises an error while getting the title of a web page. I can't tell which page in the crawl triggers the error, so I can't identify the cause. Other pages are crawled without problems. I would appreciate any advice. Thank you.
Research Results
I suspected the error came from a page without a title tag, but when I checked, that was not the cause.
I also suspected that the title was too long, but that was not it either.
When the title tag is empty, the title simply comes back empty.
Error
Traceback (most recent call last):
  File "/vagrant/pysearch-master/manage.py", line 15, in <module>
    crawl_web('https://applech2.com/', 8)
  File "/vagrant/pysearch-master/web_crawler/crawler.py", line 147, in crawl_web
    title = _get_page_tite(html)
  File "/vagrant/pysearch-master/web_crawler/crawler.py", line 61, in _get_page_tite
    title = BeautifulSoup(html, "html.parser").find('title').text
  File "/home/vagrant/.virtualenvs/dev/local/lib/python3.4/site-packages/bs4/__init__.py", line 192, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'NoneType' has no len()
The crawler code is on GitHub:
https://github.com/wimpykid719/pythonengine/blob/master/web_crawler/crawler.py
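The cause was an HTTP 403 (access denied) response. The page could not be fetched, so html was apparently None when it was passed to BeautifulSoup, which is what the TypeError in the traceback reports.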
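To make this kind of failure easier to track down, the fetch step can log the URL and status code, and the title step can check for None before parsing. The following is only a minimal sketch, not the code from the repository. The names fetch_html, get_page_title and the opener parameter are hypothetical, and the title is extracted with the standard library's html.parser instead of BeautifulSoup so that the example runs without extra packages. The same None check would apply before calling BeautifulSoup(html, "html.parser").

```python
import logging
import urllib.error
import urllib.request
from html.parser import HTMLParser

logger = logging.getLogger("crawler_sketch")


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def fetch_html(url, opener=urllib.request.urlopen):
    """Return the page body as text, or None if the request fails.

    `opener` is a hypothetical hook so the function can be tested
    without network access; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A 403 ends up here; the log line shows which page failed.
        logger.warning("HTTP %s for %s", err.code, url)
    except urllib.error.URLError as err:
        logger.warning("Could not reach %s: %s", url, err.reason)
    return None


def get_page_title(html):
    """Return the title, or "" when html is None or has no <title>."""
    if html is None:
        return ""
    parser = _TitleParser()
    parser.feed(html)
    return "".join(parser.parts).strip()
```

In the crawl loop, a page for which fetch_html returns None can then be skipped, and the warning in the log shows which URL was refused and with which status code.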
This answer was posted as a community wiki based on @wataru's comments.