Hi, everyone. I would like to extract the lists shown in the image below as text from https://qpldocs.dla.mil/search/parts.aspx?qpl=1780. (Since we need to iterate through each list and do additional crawling, Selenium is essential, not just BeautifulSoup.)
To do this, I tried to extract the list with lxml and a for loop via BeautifulSoup, but when I use lxml I get the error shown below.
> Max retries exceeded with url: /search/parts.aspx?qpl=1780
(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate
If I print just one list item using Selenium alone, without BeautifulSoup, the error does not appear and everything works. (You can verify this by removing the three lines flagged with the error comment in the code.)
I tried running Install Certificates.command again, but the problem did not go away. I look forward to your valuable answers. Thank you!
If my explanation is not enough, please leave a comment so I can explain it in more detail!
Below is my code.
import requests
from bs4 import BeautifulSoup
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from urllib.request import urlopen
def set_chrome_driver():
    chrome_options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
    return driver

driver = webdriver.Chrome("./chromedriver")
url = "https://qpldocs.dla.mil/search/parts.aspx?qpl=1780"
driver.get(url)
# The 3 lines below are where the error occurs
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text,"lxml")
partlist = driver.find_element(By.ID, "Lu_gov_DG_ctl03_btnGovPartNo")
print(partlist.text)
driver.find_element(By.XPATH,"//*[@id='Lu_gov_DG_ctl03_btnGovPartNo']").click()
partauthorizationcompay = driver.find_element(By.ID, "Lu_man_DG_ctl03_lblCompany")
print(partauthorizationcompay.text)
partstatus = driver.find_element(By.ID, "Lu_man_DG_ctl03_imgCompanyStatus")
print(partstatus.text)
The following is the error output, copied verbatim:
> /Users/seoyeonghun/Desktop/Project/WebCrawling_QPD/practice3.py:20: DeprecationWarning: executable_path has been deprecated, please pass in a Service object
driver = webdriver.Chrome("./chromedriver")
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connectionpool.py", line 382, in _make_request
self._validate_conn(conn)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
conn.connect()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connection.py", line 416, in connect
self.sock = ssl_wrap_socket(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
ssl_sock = _ssl_wrap_socket_impl(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 512, in wrap_socket
return self.sslsocket_class._create(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 1070, in _create
self.do_handshake()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py", line 1341, in do_handshake
self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='qpldocs.dla.mil', port=443): Max retries exceeded with url: /search/parts.aspx?qpl=1780 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
File "/Users/seoyeonghun/Desktop/Project/WebCrawling_QPD/practice3.py", line 23, in <module>
res = requests.get(url)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/requests/adapters.py", line 514, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='qpldocs.dla.mil', port=443): Max retries exceeded with url: /search/parts.aspx?qpl=1780 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)'))
Thank you.
selenium beautifulsoup python
The error is caused by SSL certificate verification failing, but you do not actually need to fix that to solve your problem.
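(If you ever do need requests itself to succeed on macOS, a common workaround is to point it at certifi's CA bundle; verify=False also works but disables verification entirely. The request calls below are commented out because they would hit the live site; treat this as a sketch, not a tested fix for this particular server.)

```python
import certifi
import requests

url = "https://qpldocs.dla.mil/search/parts.aspx?qpl=1780"

# Two possible workarounds (sketched, not executed here):
# res = requests.get(url, verify=certifi.where())  # verify against certifi's root CAs
# res = requests.get(url, verify=False)            # last resort: skips verification entirely

# certifi ships its own CA bundle as a file on disk:
print(certifi.where())
```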
Looking at your code, it sends two separate requests to the site: once through Selenium and once through the requests module. The second one is unnecessary.
To parse the HTML that Selenium has already loaded, reuse the driver's page source with BeautifulSoup:
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
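Put together, the flow looks like the sketch below. The HTML snippet stands in for driver.page_source (the element ID pattern is copied from the question's code, but the part numbers are made up), and html.parser is used so the example runs even without lxml installed:

```python
from bs4 import BeautifulSoup

# Stand-in for html = driver.page_source, taken after driver.get(url) has loaded the page.
html = """
<table id="Lu_gov_DG">
  <tr><td><a id="Lu_gov_DG_ctl03_btnGovPartNo">PART-A</a></td></tr>
  <tr><td><a id="Lu_gov_DG_ctl04_btnGovPartNo">PART-B</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect every part-number link by matching the repeating ID suffix,
# instead of hard-coding one row's ID such as Lu_gov_DG_ctl03_btnGovPartNo.
parts = [a.text for a in soup.find_all("a", id=lambda i: i and i.endswith("btnGovPartNo"))]
print(parts)  # ['PART-A', 'PART-B']
```

This way a single page load through Selenium feeds the whole parse, and iterating over rows means you are no longer tied to one row's `ctl03` index.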
By the way, the site appears to be run by the U.S. government, so have you checked the site's usage policy closely before crawling it?
Problems caused by crawling can be difficult to deal with, so we recommend reviewing the relevant policies first, or requesting the data you need through official channels.
© 2024 OneMinuteCode. All rights reserved.