How can I fix garbled characters when scraping?

Asked 1 year ago, Updated 1 year ago, 399 views

https://live23.5ch.net/test/read.cgi/livetbs/1220170942/
I'd like to scrape the replies at this URL, but the following code produces garbled characters.

import requests
from bs4 import BeautifulSoup

res = requests.get("https://live23.5ch.net/test/read.cgi/livetbs/1220170942/")
soup = BeautifulSoup(res.text, 'lxml')
threadRes = soup.find_all('dd')
print(threadRes)  # => garbled characters

Also, if I pass res.content instead of res.text as the first argument to BeautifulSoup, the garbled characters are fixed, but not all replies are scraped.
(This URL has 1001 replies, but only 223 are retrieved.)

res = requests.get("https://live23.5ch.net/test/read.cgi/livetbs/1220170942/")
soup = BeautifulSoup(res.content, 'lxml')
print(soup)
threadRes = soup.find_all('dd')
print(len(threadRes))  # => 223

How can I fix the garbled characters and scrape all of the replies?

python web-scraping beautifulsoup

2022-09-30 22:04

1 Answer

In my environment, setting res.encoding = res.apparent_encoding, as suggested in one of the comments on the question, still produced garbled characters. However, I have verified that setting res.encoding = "shift_jis" displays the text correctly and returns all 1001 replies.

import requests
from bs4 import BeautifulSoup

res = requests.get("https://live23.5ch.net/test/read.cgi/livetbs/1220170942/")
# res.encoding = res.apparent_encoding  # still garbled in my environment
res.encoding = "shift_jis"  # set the encoding explicitly before reading res.text

soup = BeautifulSoup(res.text, 'lxml')
threadRes = soup.find_all('dd')
print(threadRes)
print(len(threadRes))  # => 1001
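
To illustrate why the explicit encoding matters, here is a minimal offline sketch using only the standard library. It assumes the page body is Shift_JIS and that requests fell back to ISO-8859-1 because the response declared no charset; that is a likely explanation, not something verified against this server. The sample string and the `decode_body` helper are hypothetical and are not part of requests or BeautifulSoup.

```python
# Hypothetical sample text standing in for one reply; not taken from the thread.
sample = "名無しさん"

# The bytes a Shift_JIS server would send for that text.
raw = sample.encode("shift_jis")

# ISO-8859-1 maps every byte to some character, so decoding with it
# never raises an error; it silently produces garbled text instead.
garbled = raw.decode("iso-8859-1")

# Decoding with the matching codec recovers the original text.
fixed = raw.decode("shift_jis")


def decode_body(body: bytes, encoding: str = "cp932") -> str:
    """Decode raw response bytes (hypothetical helper).

    cp932 is Microsoft's extension of Shift_JIS and covers extra
    characters such as circled digits. Undecodable bytes are replaced
    with U+FFFD instead of raising an exception.
    """
    return body.decode(encoding, errors="replace")


print(garbled == sample)
print(fixed == sample)
print(decode_body(raw) == sample)
```

With requests, the equivalent of `decode_body(res.content)` would be setting `res.encoding = "cp932"` before reading `res.text`. Whether cp932 works better than shift_jis for this particular thread is an assumption worth checking against the actual output.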


2022-09-30 22:04


© 2024 OneMinuteCode. All rights reserved.