I tried to scrape a Wikipedia page with Scrapy, but the crawl does not work well.
Specifically, the following message is displayed, so I don't think the spider is retrieving anything.
Partial message excerpt:
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
I will share the spider file and the items file below.
Could you tell me what corrections might be needed?
The site to be crawled is as follows.
List of Japanese Sake brands
https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E9%85%92%E3%81%AE%E9%8A%98%E6%9F%84%E4%B8%80%E8%A6%A7
I would like to obtain the prefecture name (prefecture), the brewery (syuzo), and the sake name (sake_name).
scrapy_sake.py
# -*- coding: utf-8 -*-
import scrapy
from sake.items import SakeItem


class ScrapySakeSpider(scrapy.Spider):
    name = 'scrapy_sake'
    allowed_domains = ['ja.wikipedia.org']
    start_urls = ['https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E9%85%92%E3%81%AE%E9%8A%98%E6%9F%84%E4%B8%80%E8%A6%A7']

    def parse(self, response):
        """
        Process the retrieved response.
        """
        items = []
        for post in response.css('.mw-parser-output div ul li'):
            # Create the item defined in items.py and pass it on
            item = SakeItem()
            item["prefecture"] = post.css('.mw-parser-output .mw-headline::text').extract()
            # Prefer the linked brewery name; fall back to the plain bold text
            syuzo = post.css('.mw-parser-output div p b a::text').extract()
            if not syuzo:
                syuzo = post.css('.mw-parser-output div p b::text').extract()
            item["syuzo"] = syuzo
            item["sake_name"] = post.css('.mw-parser-output div li::text').extract()
            items.append(item)
        return items
items.py
import scrapy


class SakeItem(scrapy.Item):
    prefecture = scrapy.Field()
    syuzo = scrapy.Field()
    sake_name = scrapy.Field()
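One thing worth checking here: scrapy.Item raises KeyError when you assign a key that was not declared as a Field, so the keys the spider assigns (prefecture, syuzo, sake_name) must match the declared Field names exactly. A minimal stdlib-only sketch of that contract (a stand-in class mimicking the behavior, not Scrapy itself):

```python
# Stand-in mimicking scrapy.Item's contract: only declared fields may be set.
class StrictItem(dict):
    fields = ()  # subclasses declare their field names here

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class SakeItemSketch(StrictItem):
    # must mirror the keys the spider assigns
    fields = ("prefecture", "syuzo", "sake_name")


item = SakeItemSketch()
item["prefecture"] = "Hokkaido"      # declared field: accepted
try:
    item["prefect"] = "Hokkaido"     # undeclared key: rejected, like scrapy.Item
    mismatch_caught = False
except KeyError:
    mismatch_caught = True
```

If the spider and items file disagree on a field name, the spider fails with this KeyError at the first assignment rather than scraping zero items silently, which makes the mismatch easy to spot in the log.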
I looked at the source (HTML) of the sake brand list, but it is not very well structured. For now, I extract all the tags I need and build SakeItem instances from them.
def parse(self, response):
    """
    Process the retrieved response.
    """
    item, items = None, []
    for post in response.css('h3 .mw-headline, h3 ~ div p, h3 ~ p, ul, #footnote'):
        prefecture = post.css('span.mw-headline::text').get(default=False)
        maker = post.css('p::text').getall()
        label = post.css('ul li::text').getall()
        end = post.css('#footnote').get(default=False)
        if end and item is not None:
            # the footnote marks the end of the list: flush the last item
            items.append(item)
            break
        if prefecture:
            # a new headline starts a new prefecture: flush the previous item
            if item is not None:
                items.append(item)
            item = SakeItem({
                'prefecture': prefecture, 'syuzo': [], 'sake_name': []
            })
        elif len(maker) > 0:
            item['syuzo'] += [''.join(maker).strip()]
        elif len(label) > 0:
            item['sake_name'] += label
    return items
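The grouping logic in this parse() can be checked offline, without Scrapy or the network. The sketch below re-implements the same state machine over a flat list of (kind, text) pairs in document order: a headline opens a new item, paragraph text extends 'syuzo', list entries extend 'sake_name', and the footnote sentinel flushes the last item. The node kinds and sample data are hypothetical stand-ins for what response.css() yields from the real page.

```python
def group_nodes(nodes):
    # nodes: (kind, text) pairs in document order, as the selector loop sees them
    item, items = None, []
    for kind, text in nodes:
        if kind == "footnote":
            # end-of-list sentinel: flush the item in progress and stop
            if item is not None:
                items.append(item)
            break
        if kind == "headline":
            # a new prefecture heading: flush the previous item, start a new one
            if item is not None:
                items.append(item)
            item = {"prefecture": text, "syuzo": [], "sake_name": []}
        elif item is not None and kind == "p":
            item["syuzo"].append(text.strip())
        elif item is not None and kind == "li":
            item["sake_name"].append(text)
    return items


# Hypothetical sample mimicking the page structure
sample = [
    ("headline", "Hokkaido"),
    ("p", "Tanaka Brewery (Otaru City) "),
    ("li", "Otaru no Kiyoshu"),
    ("headline", "Aomori Prefecture"),
    ("p", "Rokka Sake Brewery (Hirosaki City)"),
    ("li", "Reiho"),
    ("footnote", ""),
]
```

Exercising the function on the sample yields two dicts, one per headline, with the paragraphs and list entries between two headlines attached to the first of them.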
Output results
[{'prefecture': 'Hokkaido',
  'sake_name': ['Otaru no Kiyoshu',
                'Kikkogura Daiginjo',
                'Takaragawa',
                :
  'syuzo': ['Tanaka Brewery (Otaru City)',
            'Otokoyama (Asahikawa City)',
            'Takasago Brewery (Asahikawa City)',
            :
 {'prefecture': 'Aomori Prefecture',
  'sake_name': ['A good cup',
                'Reiho',
                :
  'syuzo': ['Rokka Sake Brewery (Hirosaki City)',
            'Miura Sake Brewery (Hirosaki City)',
            'Saito Sake Brewery (Hirosaki City)',
            :