In order to scrape Google search results, I ran the code below, based on the site referenced below.
I can't solve the following problems, so please let me know.
First, I would like to know how to correct the following error:
```
a = str(list[i]).strip('')
IndexError: list index out of range
```
Second, I would like to know how to get the title and its URL.
Referenced site:
https://qiita.com/ShinKano/items/d4b95ed809bd80329880
Code:
```python
import requests
from bs4 import BeautifulSoup

with open('keys.csv') as csv_file:
    with open('result.csv', 'w') as f:
        for keys in csv_file:
            result = requests.get('https://www.google.com/search?q={}/'.format(keys))
            soup = BeautifulSoup(result.text, 'html.parser')
            list = soup.findAll(True, {'class': 'BNeawe vvjwJb AP7Wnd'})
            for i in range(3):
                a = str(list[i]).strip('<div class="BNeawe vvjwJb AP7Wnd">')
                result_title = a.strip('</')
                keyword = keys.rstrip("\n")
                f.write('{0},{1}\n'.format(keyword, result_title))
```
First, you should not assign anything to the name list, which is a built-in type; use a different variable name.

Next, if list[i] raises "list index out of range", there is no element at the index i you specified: either the list is empty, or, since the loop runs over range(3), the list has fewer than three elements by the time i reaches 2. There are several possible reasons why the elements do not exist.

To remove the dependence on the list's size, change the loop over range(3) to

for i, _ in enumerate(list):

so that the loop only visits elements that actually exist.
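As a sketch of why this helps (using a plain list of strings in place of the real bs4 result list), enumerate only visits elements that actually exist:

```python
# Stand-in for soup.findAll(...): suppose only two matches came back this time.
elements = ['<div>first hit</div>', '<div>second hit</div>']

titles = []
for i, element in enumerate(elements):  # never indexes past the end
    titles.append(str(element))

# A loop over range(3) would have raised IndexError at elements[2];
# enumerate simply stops after the last real element.
```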
Second, the class names are generated dynamically, so to resolve that issue you must select elements without relying on class names. Below is the HTML of a Google search results page:
```html
<div class="XXXXXX">
  <a href="https://example.com/" ping="/url?blahblah">
    <br>
    <h3 class="XXXXXXXXXXXX"><span>TITLE</span></h3>
    <div class="XXXXXXXXXXX">
      <cite class="XXXXXXXXXXXXXXXXX">example.com</cite>
    </div>
  </a>
  (snip)
</div>
```
In this case, it is better either to find tag elements that match a specific regular expression, or to get all the a elements with bs4 and filter out the ones you don't need. It's a dirty approach, but:
```python
from bs4 import BeautifulSoup

def extract_google_url(soup):
    # Yield only <a> elements that have an href and contain
    # <span>, <div>, and <h3> descendants (i.e. search-result links).
    for x in soup.find_all("a"):
        try:
            x["href"]
            assert x.find("span") is not None
            assert x.find("div") is not None
            assert x.find("h3") is not None
        except (KeyError, AssertionError):
            continue
        yield x

with open("./Downloads/test-Google Search.html") as f:
    data = f.read()

soup = BeautifulSoup(data, "html.parser")
results = list(extract_google_url(soup))
```
Doing something like this is one option. Here, data contains the HTML of the search-results page obtained by googling the query "test".
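To then get the title and its URL from the filtered a elements, note (per the HTML structure shown earlier) that the title text sits inside the h3 and the URL is the a element's href attribute. A minimal sketch, using a made-up HTML fragment in place of a real results page:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a real search-results page.
html = '''
<div><a href="https://example.com/" ping="/url?blahblah">
  <h3><span>TITLE</span></h3>
  <div><cite>example.com</cite></div>
</a></div>
'''

soup = BeautifulSoup(html, "html.parser")
pairs = []
for a in soup.find_all("a"):
    h3 = a.find("h3")
    if h3 is None or not a.get("href"):
        continue  # not a search-result link
    pairs.append((h3.get_text(strip=True), a["href"]))
```

Each entry in pairs is then a (title, url) tuple that can be written out to result.csv.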
Also, Google blocks your IP when it detects suspicious behavior; to be precise, you will be shown an "I'm not a robot" captcha. In other words, Google does not allow scraping.
https://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results
Even if you manage to scrape without being blocked, you have to think about the potential legal risks, and additional techniques may be required.
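If you still proceed, one common precaution (an assumption about what reduces immediate blocking, not a way around Google's terms of service) is to space requests out and send a browser-like User-Agent header:

```python
import random
import time

# Assumption: a browser-like User-Agent and randomized delays make the
# traffic look less like a naive script. This does NOT make scraping
# permitted, and Google may still block or captcha the requests.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def polite_delay(base=5.0, jitter=3.0):
    """Sleep for a randomized interval between requests; return the delay used."""
    delay = base + random.uniform(0.0, jitter)
    time.sleep(delay)
    return delay

# Usage (sketch):
# result = requests.get(url, headers=HEADERS)
# polite_delay()
```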
For the specifics of parsing the title and URL out of the HTML, it is recommended that you read the BeautifulSoup documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
You seem to be a beginner, but you should get used to consulting the reference documentation as much as possible.
If you're using Selenium with Python, the following reverse-lookup reference in Japanese is useful. (Of course, if you can read English, the official documentation is best.)
https://www.seleniumqref.com/api/webdriver_gyaku.html
You also need some knowledge of HTML structure, so learn how to get an HTML element's path from Chrome's developer tools. If you know Python's grammar as well, you should be able to manage right away.