I'm a Python beginner.
I tried scraping with BeautifulSoup, but I don't know much about character encodings and can't get it to work.
html=urllib2.urlopen(req)
html2 = html.read()
soup = BeautifulSoup(html2, "html.parser")
tag=soup.findAll("p", attrs={"class":"txt"})
a=str(tag)
When I checked the encoding of the resulting text:
print chardet.detect(a)
{'confidence': 1.0, 'encoding': 'ascii'}
it was reported as ASCII.
So when I tried to write the text to a file encoded as Shift-JIS, it didn't work.
If anyone knows a solution, please let me know.
Note: http://www.goo-net.com/php/car_review/detail_list.php?car_cd=10101044
I'm trying to scrape the text on this site.
I think str(tag) is simply the wrong approach. I was able to create an SJIS file as follows:
import urllib2
from bs4 import BeautifulSoup
url='http://www.goo-net.com/php/car_review/detail_list.php?car_cd=10101044'
req = urllib2.Request(url)
html=urllib2.urlopen(req)
html2 = html.read()
soup = BeautifulSoup(html2, "html.parser")
tags=soup.findAll("p", attrs={"class":"txt"})
with open('output.txt', 'wb') as f:
    for t in tags:
        print >>f, t.encode('cp932')
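The important step in the script above is encoding each piece of Unicode text to cp932 right before it is written as bytes. Here is a minimal sketch of just that step, written for Python 3. The helper name write_cp932, the sample strings, and the temp-file path are all made up for illustration; the samples stand in for scraped text. Passing errors="replace" is an assumption about how you would want to handle characters that cp932 cannot represent.

```python
import os
import tempfile


def write_cp932(lines, path):
    """Write Unicode strings to `path` as cp932 (Shift-JIS) bytes, one per line."""
    with open(path, "wb") as f:
        for line in lines:
            # errors="replace" turns characters cp932 cannot represent into "?"
            # instead of raising UnicodeEncodeError.
            f.write(line.encode("cp932", errors="replace") + b"\n")


# Hard-coded sample text standing in for scraped paragraphs (assumption).
# The emoji U+1F697 has no cp932 representation, so it becomes "?".
samples = ["燃費が良い", "乗り心地 \U0001F697"]
out_path = os.path.join(tempfile.gettempdir(), "cp932_demo.txt")
write_cp932(samples, out_path)

# Read the bytes back and decode them to confirm the file really is cp932.
with open(out_path, "rb") as f:
    data = f.read()
print(data.decode("cp932"))
```

Without errors="replace", a single unencodable character in a review would abort the whole write with UnicodeEncodeError, so it is worth deciding up front whether you prefer to replace, ignore, or fail.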
Addendum
By the way, if you use PyQuery, you can write it like this.
#!/usr/bin/env python
from pyquery import PyQuery as pq
url='http://www.goo-net.com/php/car_review/detail_list.php?car_cd=10101044'
dom = pq(url)
with open('output2.txt', 'wb') as f:
    for p in dom('p.txt').items():
        print >>f, p.text().encode('cp932')
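Both scripts in this answer target Python 2 (urllib2, print >>f). For readers on Python 3, here is a rough sketch of the same idea: collect the text of each p.txt element and write it out as cp932. It deliberately swaps in the standard library's html.parser in place of BeautifulSoup or PyQuery so that it runs without extra packages. The class TxtParagraphs, the function extract_txt, the sample HTML, and the output path are all hypothetical. It takes already-decoded text; fetching the page and decoding the response with the site's actual encoding (not stated in this thread) is left out.

```python
import os
import tempfile
from html.parser import HTMLParser


class TxtParagraphs(HTMLParser):
    """Collect the text of every <p class="txt"> element (illustrative helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []  # finished paragraph texts
        self._buf = None      # text chunks while inside a matching <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # a new <p> implicitly closes any open one
            classes = (dict(attrs).get("class") or "").split()
            if "txt" in classes:
                self._buf = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def flush(self):
        if self._buf is not None:
            self.paragraphs.append("".join(self._buf).strip())
            self._buf = None


def extract_txt(html_text):
    """Return the text of each <p class="txt"> in an already-decoded HTML string."""
    parser = TxtParagraphs()
    parser.feed(html_text)
    parser.close()
    parser.flush()  # handle a matching <p> left unclosed at end of input
    return parser.paragraphs


# Sample markup standing in for the downloaded page (assumption).
sample_html = '<p class="txt">静かで乗りやすい</p><p class="note">ignored</p>'
paragraphs = extract_txt(sample_html)

# In Python 3, text mode with encoding="cp932" does the encoding on write.
out_path = os.path.join(tempfile.gettempdir(), "output_py3.txt")
with open(out_path, "w", encoding="cp932", errors="replace") as f:
    for text in paragraphs:
        f.write(text + "\n")
```

With BeautifulSoup installed, the extraction part would more likely stay as soup.find_all("p", attrs={"class": "txt"}) together with get_text(); the standard-library parser is used here only to keep the sketch dependency-free.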