I tried to scrape a Wikipedia page with Scrapy, but the crawl does not work well.
Specifically, the following message is displayed, so I don't think the spider is retrieving anything.
Partial message excerpt:
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
I will share the spider file and the items file below.
Could you tell me what corrections might be needed?
The site to be crawled is as follows.
List of Japanese Sake brands
https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E9%85%92%E3%81%AE%E9%8A%98%E6%9F%84%E4%B8%80%E8%A6%A7
I would like to obtain the prefecture name (prefecture), the brewery (syuzo), and the sake name (sake_name).
scrapy_sake.py
# -*- coding: utf-8 -*-
import scrapy
from sake.items import SakeItem


class ScrapySakeSpider(scrapy.Spider):
    name = 'scrapy_sake'
    allowed_domains = ['ja.wikipedia.org']
    start_urls = ['https://ja.wikipedia.org/wiki/%E6%97%A5%E6%9C%AC%E9%85%92%E3%81%AE%E9%8A%98%E6%9F%84%E4%B8%80%E8%A6%A7']

    def parse(self, response):
        """
        Process the retrieved response.
        """
        items = []
        for post in response.css('.mw-parser-output div ul li'):
            # Create the item defined in items.py and pass it on
            item = SakeItem()
            item["prefecture"] = post.css('.mw-parser-output .mw-headline::text').extract()
            # Prefer the linked brewery name; fall back to the plain bold text
            syuzo = post.css('.mw-parser-output div p b a::text').extract()
            if not syuzo:
                syuzo = post.css('.mw-parser-output div p b::text').extract()
            item["syuzo"] = syuzo
            item["sake_name"] = post.css('.mw-parser-output div li::text').extract()
            items.append(item)
        return items
items.py
import scrapy


class SakeItem(scrapy.Item):
    prefecture = scrapy.Field()
    syuzo = scrapy.Field()
    sake_name = scrapy.Field()
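One thing worth checking here: scrapy.Item raises KeyError when you assign a key that was not declared as a Field, so the keys the spider assigns (prefecture, syuzo, sake_name) must match the declared Field names exactly. A minimal stdlib-only sketch of that contract (a stand-in class mimicking the behavior, not Scrapy itself):

```python
# Stand-in mimicking scrapy.Item's contract: only declared fields may be set.
class StrictItem(dict):
    fields = ()  # subclasses declare their field names here

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class SakeItemSketch(StrictItem):
    # must mirror the keys the spider assigns
    fields = ("prefecture", "syuzo", "sake_name")


item = SakeItemSketch()
item["prefecture"] = "Hokkaido"      # declared field: accepted
try:
    item["prefect"] = "Hokkaido"     # undeclared key: rejected, like scrapy.Item
    mismatch_caught = False
except KeyError:
    mismatch_caught = True
```

If the spider and items file disagree on a field name, the spider fails with this KeyError at the first assignment rather than scraping zero items silently, which makes the mismatch easy to spot in the log.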
I looked at the source (HTML) of the sake brand list, but it is not very well structured. For now, I extract all the tags I need and build SakeItem instances from them.
def parse(self, response):
    """
    Process the retrieved response.
    """
    item, items = None, []
    for post in response.css('h3 .mw-headline, h3 ~ div p, h3 ~ p, ul, #footnote'):
        prefecture = post.css('span.mw-headline::text').get(default=False)
        maker = post.css('p::text').getall()
        label = post.css('ul li::text').getall()
        end = post.css('#footnote').get(default=False)
        if end and item is not None:
            # the footnote marks the end of the list: flush the last item
            items.append(item)
            break
        if prefecture:
            # a new headline starts a new prefecture: flush the previous item
            if item is not None:
                items.append(item)
            item = SakeItem({
                'prefecture': prefecture, 'syuzo': [], 'sake_name': []
            })
        elif len(maker) > 0:
            item['syuzo'] += [''.join(maker).strip()]
        elif len(label) > 0:
            item['sake_name'] += label
    return items
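The grouping logic in this parse() can be checked offline, without Scrapy or the network. The sketch below re-implements the same state machine over a flat list of (kind, text) pairs in document order: a headline opens a new item, paragraph text extends 'syuzo', list entries extend 'sake_name', and the footnote sentinel flushes the last item. The node kinds and sample data are hypothetical stand-ins for what response.css() yields from the real page.

```python
def group_nodes(nodes):
    # nodes: (kind, text) pairs in document order, as the selector loop sees them
    item, items = None, []
    for kind, text in nodes:
        if kind == "footnote":
            # end-of-list sentinel: flush the item in progress and stop
            if item is not None:
                items.append(item)
            break
        if kind == "headline":
            # a new prefecture heading: flush the previous item, start a new one
            if item is not None:
                items.append(item)
            item = {"prefecture": text, "syuzo": [], "sake_name": []}
        elif item is not None and kind == "p":
            item["syuzo"].append(text.strip())
        elif item is not None and kind == "li":
            item["sake_name"].append(text)
    return items


# Hypothetical sample mimicking the page structure
sample = [
    ("headline", "Hokkaido"),
    ("p", "Tanaka Brewery (Otaru City) "),
    ("li", "Otaru no Kiyoshu"),
    ("headline", "Aomori Prefecture"),
    ("p", "Rokka Sake Brewery (Hirosaki City)"),
    ("li", "Reiho"),
    ("footnote", ""),
]
```

Exercising the function on the sample yields two dicts, one per headline, with the paragraphs and list entries between two headlines attached to the first of them.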
Output results
[{'prefecture': 'Hokkaido',
  'sake_name': ['Otaru no Kiyoshu',
                'Kikkogura Daiginjo',
                'Takaragawa',
                :
  'syuzo': ['Tanaka Brewery (Otaru City)',
            'Otokoyama (Asahikawa City)',
            'Takasago Brewery (Asahikawa City)',
            :
 {'prefecture': 'Aomori Prefecture',
  'sake_name': ['A good cup',
                'Reiho',
                :
  'syuzo': ['Rokka Sake Brewery (Hirosaki City)',
            'Miura Sake Brewery (Hirosaki City)',
            'Saito Sake Brewery (Hirosaki City)',
            :