I fetched HTML from a website with Python, but the text is garbled.

Asked 2 years ago, Updated 2 years ago, 18 views

I fetched the HTML with the following code:

import urllib2
fp=urllib2.urlopen('http://2689.web.fc2.com/1989/GS/GS1.html')
html=fp.read()
fp.close()

The characters are garbled as follows:

Output (partial)

<div class='score'>
<p class='data-ce'><span>4??8??@1??@???h?[???@56,000?l</span></p>;
<div class='float-clear'></div>
<table border='1'cellspacing='2'class='board1'>

How can I fix the garbled characters? Please let me know.

python

2022-09-30 16:54

2 Answers

You can use the chardet library's detect function to guess the character encoding of the retrieved bytes, then decode them to Unicode with whatever encoding it reports. This works regardless of which encoding the page uses.

import urllib
import chardet

url='http://2689.web.fc2.com/1989/GS/GS1.html'
# fetch the raw bytes
data=''.join(urllib.urlopen(url).readlines())
# guess the encoding
guess=chardet.detect(data)
# decode to Unicode
unicode_data=data.decode(guess['encoding'])
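
For readers on Python 3, the same detect-then-decode idea can be sketched without chardet. This is a simpler stand-in, not what chardet does: it tries a short list of candidate encodings in order and keeps the first one that decodes without errors. The function name decode_with_fallback, the candidate list, and the sample string are assumptions made for illustration; none of them come from the original answer.

```python
# Python 3 sketch. The candidate list is an assumption; adjust it for your pages.
CANDIDATES = ("utf-8", "shift_jis", "euc_jp")


def decode_with_fallback(raw, candidates=CANDIDATES):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: decode anyway, marking bad bytes with U+FFFD.
    return raw.decode(candidates[0], errors="replace"), None


# Hypothetical sample bytes, encoded locally so no network access is needed.
sample = "56,000人".encode("shift_jis")
text, encoding = decode_with_fallback(sample)
print(encoding, text)
```

Trial decoding can pick the wrong encoding when the bytes happen to be valid in more than one candidate, so a statistical detector such as chardet is probably the safer choice for pages of unknown origin.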

Also, you can use BeautifulSoup or PyQuery for scraping.

Reference URL: http://ymotongpoo.hatenablog.com/entry/20110103/1294032545


2022-09-30 16:54

Change this line:

html=fp.read()

to this:

html=fp.read().decode('shift_jis')

I think that will fix it.
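
On Python 3, where urllib2 has become urllib.request, the same fix might look roughly like this. It is a sketch under assumptions: the helper names decode_body and fetch_html are made up, and Shift_JIS is the fallback only because this answer suggests it for that page. When the server declares a charset in its Content-Type header, that one is tried first.

```python
from urllib.request import urlopen


def decode_body(raw, declared=None, fallback="shift_jis"):
    """Decode raw bytes with the declared charset, or the fallback if none is usable."""
    try:
        # errors="replace" substitutes U+FFFD for stray bytes instead of raising.
        return raw.decode(declared or fallback, errors="replace")
    except LookupError:
        # The declared charset name is unknown to Python; use the fallback.
        return raw.decode(fallback, errors="replace")


def fetch_html(url, fallback="shift_jis"):
    """Fetch url and return its body as str."""
    with urlopen(url) as fp:
        raw = fp.read()
        # None when the Content-Type header carries no charset parameter.
        declared = fp.headers.get_content_charset()
    return decode_body(raw, declared, fallback)


# Offline check with locally encoded bytes (hypothetical sample text).
print(decode_body("56,000人".encode("shift_jis")))
```

Calling fetch_html('http://2689.web.fc2.com/1989/GS/GS1.html') would then return text rather than raw bytes, assuming the page really is Shift_JIS or declares its charset correctly.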


2022-09-30 16:54



© 2024 OneMinuteCode. All rights reserved.