I fetched the HTML with the following code:
import urllib2
fp=urllib2.urlopen('http://2689.web.fc2.com/1989/GS/GS1.html')
html=fp.read()
fp.close()
The characters are garbled as follows:
Output (partial)
<div class='score'>
<p class='data-ce'><span>4??8??@1??@???h?[???@56,000?l</span></p>;
<div class='float-clear'></div>
<table border='1'cellspacing='2'class='board1'>
How can I fix the garbled characters? Please let me know.
python
You can use detect from the chardet library to determine the character encoding, and then decode to convert the data to Unicode, whatever encoding the page was retrieved in.
import urllib
import chardet
url='http://2689.web.fc2.com/1989/GS/GS1.html'
# fetch the raw bytes of the page
data=''.join(urllib.urlopen(url).readlines())
# detect the encoding
guess=chardet.detect(data)
# decode to Unicode using the detected encoding
unicode_data=data.decode(guess['encoding'])
Also, you can use BeautifulSoup or PyQuery for scraping.
Reference URL: http://ymotongpoo.hatenablog.com/entry/20110103/1294032545
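If chardet is not installed, a rough stand-in for detection is to try a few likely encodings in strict mode and keep the first one that decodes without error. The sketch below only illustrates that idea: the helper name guess_and_decode and the candidate list are hypothetical, not part of chardet, and strict decoding can still pick the wrong encoding for short or ambiguous input.

```python
def guess_and_decode(raw, candidates=("utf-8", "shift_jis", "euc_jp")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly.

    Hypothetical helper: the candidate list is an assumption suited to
    Japanese pages, not something chardet itself uses.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit; try the next one
    # Nothing decoded cleanly: fall back to the first candidate,
    # replacing undecodable bytes instead of raising.
    return raw.decode(candidates[0], "replace"), candidates[0]


# Sample bytes standing in for a downloaded page (no network access needed).
sample = u"\u65e5\u672c\u8a9e"  # a short Japanese sample string
text, encoding = guess_and_decode(sample.encode("shift_jis"))
```

With real data, the bytes returned by urlopen(url).read() would be passed in place of the sample.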
html=fp.read()
Change the line above to:
html=fp.read().decode('shift_jis')
With that change, I think the garbling will be fixed.
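One caveat with hard-coding 'shift_jis': pages written on Japanese Windows may contain vendor extension characters such as circled numbers, which Python's shift_jis codec rejects but its cp932 codec (Microsoft's Shift_JIS variant) accepts. Whether this particular page needs it is not known; the following is a minimal sketch, assuming the page really is in some Shift_JIS variant, and the helper name decode_sjis is hypothetical.

```python
def decode_sjis(raw):
    """Decode Shift_JIS bytes, falling back to cp932 for vendor extensions."""
    try:
        return raw.decode("shift_jis")
    except UnicodeDecodeError:
        # cp932 covers the extension characters; any bytes that are still
        # invalid become U+FFFD instead of raising an exception.
        return raw.decode("cp932", "replace")


# b"\x87\x40" is the cp932 encoding of U+2460 (circled digit one),
# which the plain shift_jis codec cannot decode.
circled_one = decode_sjis(b"\x87\x40")
plain = decode_sjis(u"\u65e5\u672c\u8a9e".encode("shift_jis"))
```

In the question's code this would be used as html=decode_sjis(fp.read()).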