Google search results cannot be scraped.

Asked 2 years ago, Updated 2 years ago, 136 views

To scrape Google search results, I ran the code below, which is based on the site referenced below.
I can't resolve the following issues, so please help.

I would like to know how to fix the following error.

a=str(list[i]).strip('<div class="BNeawe vvjwJb AP7Wnd">')
IndexError: list index out of range

I would also like to know how to get each result's title and its URL.

referenced site
https://qiita.com/ShinKano/items/d4b95ed809bd80329880

code

import requests
from bs4 import BeautifulSoup

with open('keys.csv') as csv_file:
    with open('result.csv', 'w') as f:
        for keys in csv_file:
            result = requests.get('https://www.google.com/search?q={}/'.format(keys))
            soup = BeautifulSoup(result.text, 'html.parser')
            list = soup.findAll(True, {'class': 'BNeawe vvjwJb AP7Wnd'})
            for i in range(3):
                a = str(list[i]).strip('<div class="BNeawe vvjwJb AP7Wnd">')
                result_title = a.strip('</')
                keyword = keys.rstrip("\n")
                f.write('{0},{1}\n'.format(keyword, result_title))

python python3 web-scraping beautifulsoup

2022-09-30 14:01

2 Answers

First, you should not assign anything to the built-in name list.
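
As a small illustration of why this matters (the function and variable names here are made up for the example): once list is rebound to a value, a later call such as list(...) in the same scope no longer reaches the built-in type.

```python
# Hypothetical illustration: rebinding the name `list` hides the built-in type.
def shadowing_demo():
    list = ["a", "b"]          # shadows the built-in inside this function
    try:
        return list("xyz")     # tries to call the list object, not the type
    except TypeError as exc:
        return "TypeError: {}".format(exc)


def safe_demo():
    results = ["a", "b"]       # a descriptive name leaves `list` untouched
    return list("xyz"), results


print(shadowing_demo())
print(safe_demo())
```

Using a name such as results or titles avoids the problem entirely.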

Now, if list[i] raises "list index out of range", there is no element at index i. Either the list is empty, or, since the loop uses range(3), it has fewer than three elements, so at the latest the lookup fails when i is 2.

There are several possible reasons why it does not exist.

  • The list held in the variable list has fewer than 3 elements.
  • The class names are not constant: if the class name BNeawe vvjwJb AP7Wnd is generated dynamically, an element with that class will not always exist.
  • Scraping is blocked: Google detects bot-like behavior and blocks the IP address.

To handle a list with fewer than three elements, replace the loop that uses range(3) with the following:

for i, _ in enumerate(list):
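
Note that enumerate(list) visits every match rather than only the first three. If the intent is still "at most three results", slicing is one possible alternative, because a slice never raises IndexError when the sequence is short. A minimal sketch, where first_n and the sample titles are hypothetical names used only for illustration:

```python
def first_n(items, n=3):
    """Return at most the first n items; shorter inputs are returned whole."""
    return items[:n]


# Works the same whether there are more than, fewer than, or zero matches.
for titles in (["t1", "t2", "t3", "t4"], ["t1"], []):
    for i, title in enumerate(first_n(titles)):
        print(i, title)
```

The list returned by find_all supports slicing in the same way, so the loop body from the question would not need to change.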

To handle dynamically generated class names, you must locate the elements by something other than the class name. Below is roughly the HTML of one result on a Google search results page:

<div class="XXXXXX">
  <a href="https://example.com/" ping="/url?blahblah"><br>
    <h3 class="XXXXXXXXXXXX"><span>TITLE</span></h3>
    <div class="XXXXXXXXXXX">
      <cite class="XXXXXXXXXXXXXXXXX">example.com</cite>
    </div>
  </a>
  (omitted)
</div>

In this case, it is better either to match tag elements against a specific regular expression, or to get all the a elements with bs4 and filter out the ones you don't need. It's a crude approach, but for example:

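from bs4 import BeautifulSoup

def extract_google_url(soup):
    # Yield only the <a> elements that look like search results:
    # they have an href and contain span, div and h3 descendants.
    for x in soup.find_all("a"):
        try:
            x["href"]  # raises KeyError when the attribute is missing
            assert x.find("span") is not None
            assert x.find("div") is not None
            assert x.find("h3") is not None
        except (KeyError, AssertionError):
            continue
        yield x

# Search results page saved locally from the browser
with open("./Downloads/test-Google Search.html") as f:
    data = f.read()

soup = BeautifulSoup(data, "html.parser")
results = list(extract_google_url(soup))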

In the code above, data holds the HTML file of the search results page saved after googling the query "test".
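
To get from the matched a elements to the title and URL that the question asks for, one possible follow-up is to read the href attribute and the text of the nested h3. The sketch below runs on a small inline HTML string shaped like the result block shown earlier instead of a saved page. SAMPLE_HTML and title_and_url are names made up for this illustration, and Google's real markup may differ or change.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the result block shown earlier.
SAMPLE_HTML = """
<div class="XXXXXX">
  <a href="https://example.com/" ping="/url?blahblah"><br>
    <h3 class="XXXX"><span>TITLE</span></h3>
    <div class="XXXX"><cite class="XXXX">example.com</cite></div>
  </a>
</div>
<a href="/preferences">Settings</a>
"""


def title_and_url(soup):
    """Yield (title, url) for each link that has an href and wraps an h3."""
    for a in soup.find_all("a", href=True):   # only <a> tags with an href
        h3 = a.find("h3")
        if h3 is None:                        # skip navigation links etc.
            continue
        yield h3.get_text(strip=True), a["href"]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pairs = list(title_and_url(soup))
print(pairs)
```

Filtering on the h3 rather than on a class name keeps the code independent of the generated class names, at the cost of relying on the tag structure staying the same.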

Google also blocks an IP address when it detects bot-like behavior. To be exact, you will be shown the "I'm not a robot" CAPTCHA. In other words, Google does not permit scraping.
https://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results

If you still want to scrape without being blocked, you have to consider the potential legal risks. Additional techniques may also be required:

  • Scrape using a headless browser.
  • Consider how to get past CAPTCHA authentication.
  • Delete or modify cookies and other identifying information on each access.
  • Use multiple proxies to spread access across multiple IP addresses.
  • Consider pausing between requests and limiting the total number of requests per unit of time so that the scraping behaves more like a human.
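
For the last point, a minimal sketch of spacing requests out with a randomized pause might look like this. It only illustrates rate limiting; polite_delays, run_throttled, the 2 to 5 second range, and the fetch placeholder are all assumptions for the example, not values known to be acceptable to Google.

```python
import random
import time


def polite_delays(count, low=2.0, high=5.0, rng=random):
    """Return `count` random pause lengths in seconds, each within [low, high]."""
    return [rng.uniform(low, high) for _ in range(count)]


def run_throttled(keywords, fetch, sleep=time.sleep):
    """Call fetch(keyword) for each keyword, pausing before every call but the first."""
    results = []
    delays = polite_delays(len(keywords))
    for index, keyword in enumerate(keywords):
        if index > 0:
            sleep(delays[index])
        results.append(fetch(keyword))
    return results


# Usage with stand-ins so nothing is actually requested or slept on.
print(run_throttled(["a", "b"], fetch=str.upper, sleep=lambda seconds: None))
```

In real use, fetch would be the function that performs the request, and the default sleep would apply the pauses.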

For the specific mechanics of parsing the title and URL out of the HTML, I recommend reading the BeautifulSoup documentation.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/


2022-09-30 14:01

I assume you are a beginner, but you should try to use reference material as much as possible.
If you're using Selenium with Python, the following reverse-lookup reference in Japanese is useful. (Of course, if you can read English, the official documentation is best.)
https://www.seleniumqref.com/api/webdriver_gyaku.html
You also need some knowledge of HTML structure, so learn how to get the path of an HTML element from Chrome's developer tools. If you know Python syntax, you can then do it right away.


2022-09-30 14:01



© 2024 OneMinuteCode. All rights reserved.