About Python Character Codes

Asked 2 years ago, Updated 2 years ago, 18 views

I'm a Python beginner.
I tried scraping with BeautifulSoup, but I don't know much about character encodings and I can't get it to work.

html = urllib2.urlopen(req)
html2 = html.read()
soup = BeautifulSoup(html2, "html.parser")
tag = soup.findAll("p", attrs={"class": "txt"})
a = str(tag)

When I checked the character encoding of the text I got,

print chardet.detect(a)
{'confidence': 1.0, 'encoding': 'ascii'}

it turned out to be ASCII.
I then tried to write the text to a file encoded as Shift-JIS, but it didn't work.

If anyone knows a solution, please let me know.

Note: http://www.goo-net.com/php/car_review/detail_list.php?car_cd=10101044
I'm trying to scrape the text from this site.

python

2022-09-30 18:23

1 Answer

I think the str(tag) call is simply the mistake. I was able to create a Shift-JIS file as follows:

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.goo-net.com/php/car_review/detail_list.php?car_cd=10101044'
req = urllib2.Request(url)
html = urllib2.urlopen(req)
html2 = html.read()
soup = BeautifulSoup(html2, "html.parser")
tags = soup.findAll("p", attrs={"class": "txt"})
# Encode each tag on its own instead of calling str() on the whole result list.
with open('output.txt', 'wb') as f:
    for t in tags:
        print >>f, t.encode('cp932')
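
One plausible reason chardet reported ascii is that str() on the result list produced an escaped representation rather than the real text. That is my assumption, not something confirmed in this thread. The sketch below uses only the standard library, with a made-up sample string standing in for a scraped review, and compares an escaped form with real cp932 bytes. The helper is_ascii is a hypothetical name for illustration.

```python
# -*- coding: utf-8 -*-
# Sample text standing in for a scraped review (not taken from the site).
text = u"燃費が良い"

# Escaped form: every non-ASCII character becomes a \uXXXX sequence,
# so the resulting bytes are pure ASCII and a detector will call it ascii.
escaped = text.encode("unicode_escape")

# Real Shift-JIS (Windows code page 932) bytes for the same text.
sjis = text.encode("cp932")


def is_ascii(data):
    """Return True if every byte in data is below 0x80."""
    return all(b < 0x80 for b in bytearray(data))


print(is_ascii(escaped))  # the escaped form contains only ASCII bytes
print(is_ascii(sjis))     # the cp932 form contains bytes of 0x80 and above
```

If that is what happened, then detecting the encoding of str(tag) tells you nothing about the page itself, which is why encoding each element directly is the safer route.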

As an aside, if you use PyQuery, you can write it like this:

#!/usr/bin/env python
from pyquery import PyQuery as pq

url = 'http://www.goo-net.com/php/car_review/detail_list.php?car_cd=10101044'

dom = pq(url)

# p.text() returns the element's text, which is then encoded as Shift-JIS.
with open('output2.txt', 'wb') as f:
    for p in dom('p.txt').items():
        print >>f, p.text().encode('cp932')
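
One thing to watch for with either version: encode('cp932') raises UnicodeEncodeError if a review contains a character that Shift-JIS cannot represent. Here is a minimal sketch of one way around that, assuming you already have the review text as Unicode strings. It lets io.open do the encoding and replaces unencodable characters with '?'. The function name write_sjis and the sample strings are made up for illustration, and the file goes to a temporary directory rather than output.txt.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_sjis(path, lines):
    """Write each text line to path as cp932, replacing unencodable characters."""
    # newline="\n" keeps line endings identical on every platform.
    with io.open(path, "w", encoding="cp932", errors="replace", newline="\n") as f:
        for line in lines:
            f.write(line + u"\n")


# Made-up sample lines; the second one contains an emoji that cp932 lacks.
path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_sjis(path, [u"乗り心地が良い", u"smile \U0001F600"])

# Read the raw bytes back to check what was actually written.
with io.open(path, "rb") as f:
    raw = f.read()

print(raw.decode("cp932"))
```

Whether silently replacing characters is acceptable depends on what you do with the file afterwards. Use errors="strict" instead if you would rather find out when something does not fit.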


2022-09-30 18:23

© 2024 OneMinuteCode. All rights reserved.