I like the anime live-commentary threads on 5ch, so I wanted to fetch threads from two different live boards, sort their posts in chronological order, and merge them.
However, I was unable to scrape the posts on hawk.5ch.net.
Please tell me why it fails and how to make it work.
For example, this himawari.5ch.net URL returns the HTML as expected:
import requests
from bs4 import BeautifulSoup
res=requests.get('https://himawari.5ch.net/test/read.cgi/livetx/1523962661/')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup)
For a hawk.5ch.net URL, however, the same code raises an error:
import requests
from bs4 import BeautifulSoup
res=requests.get('https://hawk.5ch.net/test/read.cgi/livejupiter/1523982845/')
soup = BeautifulSoup(res.text, 'html.parser')
print(soup)
It worked when I used lxml as the parser. Also, the garbled characters can be fixed by setting the response's encoding (res.encoding in the code here).
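import requests
from bs4 import BeautifulSoup
# res = requests.get('https://himawari.5ch.net/test/read.cgi/livetx/1523962661/')
res = requests.get('https://hawk.5ch.net/test/read.cgi/livejupiter/1523982845/')
# Use the encoding detected from the body instead of the one guessed from the headers
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text, 'lxml')
print(soup)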
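To illustrate why setting the encoding fixes the garbled characters, here is a minimal offline sketch using only the standard library. It assumes the page body is Shift_JIS (an assumption for illustration; check the real response) and that the charset was guessed as ISO-8859-1, which is what requests falls back to for text responses whose headers carry no charset. The sample string and the helper name decode_body are made up for this example.

```python
def decode_body(raw: bytes, encoding: str) -> str:
    """Decode a response body the way res.text does for a given res.encoding."""
    return raw.decode(encoding, errors="replace")


# Stand-in for res.content: Shift_JIS bytes (assumed encoding, hypothetical text).
raw = "実況スレ".encode("shift_jis")

# Wrong guess: every byte maps to some Latin-1 character, so nothing fails,
# but the result is unreadable.
garbled = decode_body(raw, "iso-8859-1")

# Right charset: the original text comes back. Setting res.encoding before
# reading res.text has the same effect on a real response.
fixed = decode_body(raw, "shift_jis")

print(repr(garbled))
print(fixed)
```

With a real response, res.apparent_encoding plays the role of the "right charset" here, since it is detected from the body rather than taken from the headers.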
The Beautiful Soup documentation has a section called "Installing a parser", which says:
Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:
$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib
If you want to use an HTML parser, html5lib is one of the options.
$ lsb_release -ir
Distributor ID: Ubuntu
Release: 21.04
$ pip3 install html5lib
With html5lib installed in this environment, the problem does not occur. The problem you are experiencing appears to be that the parser keeps going without limit on malformed HTML text (for example, tags that are never closed).
import requests
from bs4 import BeautifulSoup
res=requests.get('https://hawk.5ch.net/test/read.cgi/livejupiter/1523982845/')
soup = BeautifulSoup(res.content, 'html5lib')
print(soup)
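As a rough way to see what "malformed HTML without a close tag" can do, the sketch below uses the standard library's html.parser.HTMLParser to count how deep the open tags pile up. It is only a diagnostic under stated assumptions: the names DepthProbe and max_open_depth are made up for this example, the counter ignores the implied-close rules that real tree builders apply, and whether the hawk.5ch.net page actually nests this deeply has not been verified here.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not count as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DepthProbe(HTMLParser):
    """Track the deepest run of start tags that were not yet closed."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Clamp at zero so stray end tags do not push the count negative.
        self.depth = max(0, self.depth - 1)


def max_open_depth(markup: str) -> int:
    probe = DepthProbe()
    probe.feed(markup)
    probe.close()
    return probe.max_depth


print(max_open_depth("<b>x</b>" * 500))  # 1: every tag is closed
print(max_open_depth("<b>x" * 500))      # 500: nothing is ever closed
```

If max_open_depth(res.text) comes out very large compared with sys.getrecursionlimit(), a tree that deep can be a problem for code that walks it recursively, and trying lxml or html5lib is a reasonable next step.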
Note that the code passes res.content rather than res.text. res.text is a string that requests has already decoded using res.encoding, whereas res.content is the raw byte string. When a BeautifulSoup instance is initialized with a byte string, it detects the character encoding itself and converts the document to Unicode internally, which usually avoids garbled text without setting res.encoding by hand.
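Once both threads parse, the original goal of merging two boards chronologically could be sketched as follows. This is a minimal sketch, not a scraper: it assumes each thread has already been turned into a list of posts that is in time order within the thread, and the Post type, the merge_threads helper, and the sample timestamps are all hypothetical.

```python
import heapq
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Post(NamedTuple):
    posted_at: datetime  # parsed from the post header (parsing not shown)
    board: str
    text: str


def merge_threads(*threads: Iterable[Post]) -> List[Post]:
    """Merge already-sorted threads into one chronological list."""
    # heapq.merge requires each input to be sorted by the same key.
    return list(heapq.merge(*threads, key=lambda post: post.posted_at))


# Hypothetical sample data standing in for two scraped threads.
livetx = [
    Post(datetime(2018, 4, 17, 20, 0, 5), "livetx", "first"),
    Post(datetime(2018, 4, 17, 20, 0, 9), "livetx", "third"),
]
livejupiter = [
    Post(datetime(2018, 4, 17, 20, 0, 7), "livejupiter", "second"),
]

for post in merge_threads(livetx, livejupiter):
    print(post.posted_at.isoformat(), post.board, post.text)
```

heapq.merge is used here because each thread is already ordered, so the result is produced without re-sorting everything; if the inputs might be out of order, sorted() over the concatenated lists with the same key is the simpler choice.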