HTML fetched from a website with Python comes out garbled.

I fetched the HTML with the following code:

import urllib2
fp=urllib2.urlopen('http://2689.web.fc2.com/1989/GS/GS1.html')
html=fp.read()
fp.close()

The characters are garbled as follows:

Output (partial)

<div class='score'>
<p class='data-ce'><span>4??8??@1??@???h?[???@56,000?l</span></p>;
<div class='float-clear'></div>
<table border='1'cellspacing='2'class='board1'>

How can I fix the garbled characters? Please let me know.

python

2022-09-30 16:54

2 Answers

You can use the chardet library's detect function to guess the character encoding of the fetched bytes, then decode them to Unicode with that encoding. This works whatever encoding the page was served in, as long as the guess is right.

import urllib
import chardet

url='http://2689.web.fc2.com/1989/GS/GS1.html'
# fetch the raw bytes (Python 2)
data=''.join(urllib.urlopen(url).readlines())
# guess the encoding
guess=chardet.detect(data)
# decode to Unicode
unicode_data=data.decode(guess['encoding'])
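
The snippet above is Python 2. If chardet is not installed, a rough alternative is to try a short list of likely encodings in order and keep the first one that decodes without errors. The following is only a sketch for Python 3: the function name decode_with_fallbacks and the candidate list (UTF-8, then Shift_JIS, then EUC-JP, which are common on Japanese pages) are my own assumptions, not something the page or any library dictates, and the order is a heuristic that can pick the wrong encoding for some inputs.

```python
# Python 3 sketch: decode fetched bytes by trying likely encodings in order.
# decode_with_fallbacks and the candidate list are illustrative assumptions.

def decode_with_fallbacks(raw, candidates=("utf-8", "shift_jis", "euc_jp")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid byte sequences.
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # No candidate decoded strictly: fall back to the first one and
    # substitute U+FFFD for the bytes that could not be decoded.
    return raw.decode(candidates[0], errors="replace"), candidates[0]
```

In practice, raw would be the bytes returned by urllib.request.urlopen(url).read(), and the call would look like text, encoding = decode_with_fallbacks(raw).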

Also, you can use BeautifulSoup or PyQuery for scraping.

Reference URL: http://ymotongpoo.hatenablog.com/entry/20110103/1294032545

2022-09-30 16:54