I want to fix garbled characters when scraping.

https://live23.5ch.net/test/read.cgi/livetbs/1220170942/
I'd like to scrape the replies from this URL, but the following code produces garbled characters.

import requests
from bs4 import BeautifulSoup

res = requests.get("https://live23.5ch.net/test/read.cgi/livetbs/1220170942/")
soup = BeautifulSoup(res.text, 'lxml')
threadRes = soup.find_all('dd')
print(threadRes)  # => garbled characters

Also, if I pass res.content instead of res.text as the first argument to BeautifulSoup, the garbled characters are fixed, but not all replies are scraped.
(This URL has 1001 replies, but only 223 are retrieved.)

res = requests.get("https://live23.5ch.net/test/read.cgi/livetbs/1220170942/")
soup = BeautifulSoup(res.content, 'lxml')
print(soup)
threadRes = soup.find_all('dd')
print(len(threadRes))  # => 223

How can I fix the garbled characters and scrape all replies?

python web-scraping beautifulsoup

2022-09-30 22:04

1 Answer

In my environment, the text was still garbled when I used res.encoding = res.apparent_encoding, as suggested in one of the comments on the question, but I confirmed that setting res.encoding = "shift_jis" displays it correctly.

import requests
from bs4 import BeautifulSoup

res = requests.get("https://live23.5ch.net/test/read.cgi/livetbs/1220170942/")
# res.encoding = res.apparent_encoding  # still garbled in my environment
res.encoding = "shift_jis"  # decode res.text as Shift_JIS

soup = BeautifulSoup(res.text, 'lxml')
threadRes = soup.find_all('dd')
print(threadRes)
print(len(threadRes))  # 1001
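
As a rough illustration of why the codec matters, the sketch below decodes the same bytes with a wrong codec and with a Shift_JIS-family codec. It uses only the standard library and runs offline. The sample HTML and the helper name decode_shift_jis are made up for the example; raw stands in for res.content. Using "cp932" (Microsoft's extension of Shift_JIS) and errors="replace" is an assumption on my part, in case a page contains characters or byte sequences that the plain "shift_jis" codec rejects. I have not checked whether this particular thread needs it.

```python
# Made-up sample page; `raw` plays the role of res.content.
html = "<dl><dt>1</dt><dd>こんにちは</dd><dt>2</dt><dd>テスト①</dd></dl>"
raw = html.encode("cp932")


def decode_shift_jis(data: bytes) -> str:
    """Decode Shift_JIS-family bytes (hypothetical helper for this sketch).

    errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    """
    return data.decode("cp932", errors="replace")


# Wrong codec: latin-1 never raises, but the Japanese text becomes mojibake.
garbled = raw.decode("latin-1")

# Right codec: the original text comes back intact.
fixed = decode_shift_jis(raw)

print(garbled == html)
print(fixed == html)
print(fixed.count("<dd>"))
```

With requests, the equivalent would be either setting res.encoding = "cp932" before reading res.text, or calling res.content.decode("cp932", errors="replace") and passing the resulting string to BeautifulSoup.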

References

Python3 – requests anti-spark

2022-09-30 22:04
