I want to scrape posts from 5ch, but I can't.

I like the anime live-commentary threads on 5ch, so I thought about fetching two different live boards, merging their posts, and sorting them in chronological order.

However, I was unable to scrape the posts on hawk.5ch.net.
Please tell me why it fails and how I can make it work.

For example, the himawari.5ch.net URL returns its HTML as expected:

import requests
from bs4 import BeautifulSoup
res=requests.get('https://himawari.5ch.net/test/read.cgi/livetx/1523962661/')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup)

but the hawk.5ch.net URL fails with an error:

import requests
from bs4 import BeautifulSoup
res=requests.get('https://hawk.5ch.net/test/read.cgi/livejupiter/1523982845/')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup)

python web-scraping beautifulsoup

2022-09-29 22:46

2 Answers

It worked for me with the lxml parser. Garbled characters can also be fixed by setting res.encoding.

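import requests
from bs4 import BeautifulSoup

#res = requests.get('https://himawari.5ch.net/test/read.cgi/livetx/1523962661/')
res = requests.get('https://hawk.5ch.net/test/read.cgi/livejupiter/1523982845/')
# Re-decode the body using the encoding detected from its content
res.encoding = res.apparent_encoding
# Parse with lxml instead of html.parser
soup = BeautifulSoup(res.text, 'lxml')
print(soup)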
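
To see why setting res.encoding helps, here is a minimal offline sketch. It assumes the page body is Shift_JIS (an assumption on my part, not something confirmed above) and that the first decode used ISO-8859-1, which requests falls back to for text responses that declare no charset. The sample bytes are a made-up stand-in for a real response body.

```python
# Hypothetical stand-in for res.content; assumed to be Shift_JIS-encoded.
original = "<p>名無しさん</p>"
raw = original.encode("shift_jis")

# Decoding with the wrong encoding never fails for ISO-8859-1,
# it just silently produces garbled text.
garbled = raw.decode("iso-8859-1")

# Decoding the same bytes with the right encoding recovers the text.
# Setting res.encoding before reading res.text has the same effect.
fixed = raw.decode("shift_jis")

print(repr(garbled))
print(fixed)
```

The bytes are never damaged, only misread, so the fix is to decode them again with the right encoding rather than to repair the garbled string.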

2022-09-29 22:46

If you look at the Beautiful Soup documentation, you will find a section called "Installing a parser".

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib

If you want a parser that handles HTML the way a browser does, html5lib is one of the options.

$ lsb_release -ir
Distributor ID: Ubuntu
Release: 21.04

$ pip3 install html5lib

With html5lib installed and used as the parser, the problem does not occur. The problem you are experiencing arises because html.parser keeps parsing malformed HTML (for example, elements with no closing tag) without any limit.

import requests
from bs4 import BeautifulSoup

res=requests.get('https://hawk.5ch.net/test/read.cgi/livejupiter/1523982845/')
soup = BeautifulSoup(res.content, 'html5lib')

print(soup)

Note that I pass res.content rather than res.text. res.text is a string that requests has already decoded using the encoding it guessed, whereas res.content is the raw byte string. When you initialize a BeautifulSoup instance with a byte string, it detects the character encoding itself and converts the document to Unicode internally.
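
Once both threads parse, the merge described in the question could look roughly like the sketch below. The posts and the (timestamp, board, text) tuple layout are hypothetical; how you extract the timestamp from each post depends on the page markup, which is not covered here. It also assumes each thread is already in posting order, so heapq.merge can interleave them without a full sort.

```python
from datetime import datetime
from heapq import merge

# Hypothetical posts, already extracted from each thread
# as (timestamp, board, text) tuples in posting order.
livetx = [
    (datetime(2018, 4, 17, 20, 0, 5), "livetx", "post A"),
    (datetime(2018, 4, 17, 20, 0, 30), "livetx", "post C"),
]
livejupiter = [
    (datetime(2018, 4, 17, 20, 0, 10), "livejupiter", "post B"),
    (datetime(2018, 4, 17, 20, 0, 45), "livejupiter", "post D"),
]

# Interleave the two sorted lists by timestamp only.
timeline = list(merge(livetx, livejupiter, key=lambda post: post[0]))

for when, board, text in timeline:
    print(when.isoformat(), board, text)
```

If a thread might not be in order, sorted(livetx + livejupiter, key=lambda post: post[0]) is the safer choice.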

2022-09-29 22:46
