I want to load HTML with Python + Selenium + Chrome

Asked 1 year ago, Updated 1 year ago, 95 views

https://www.coin-laundry.co.jp/userp/shop_detail/10000543.html
I would like to extract the machine-status table from the above site.

The table appears to be generated by JavaScript, so I wrote the following code to drive headless Chrome from Selenium.

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--disable-desktop-notifications')
options.add_argument('--disable-extensions')
options.add_argument('--allow-running-insecure-content')
options.add_argument('--disable-web-security')
options.add_argument('--no-sandbox')
options.add_argument('--lang=ja')
options.add_argument('--window-size=1200,600')  # comma-separated, not '1200x600'
driver = webdriver.Chrome(chrome_options=options)

# Open the URL.
TARGET = 'https://www.coin-laundry.co.jp/userp/shop_detail/10000543.html'

driver.get(TARGET)

time.sleep(2)  # Wait 2 seconds for JavaScript to render.

page_source = driver.page_source
html = BeautifulSoup(page_source, 'html.parser')
print(html)

driver.quit()  # Exit the browser.

Running this code against the site above only prints:

<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>

That is all I get back: only the HTML from before the JavaScript runs, never the rendered result.
I tried changing the time.sleep() duration, but the result is the same.
If I set TARGET to a different URL, the HTML prints correctly.
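
For reference, an explicit wait can replace the fixed sleep, blocking until the rendered element actually appears. A minimal sketch; the 'table' CSS selector is only a placeholder assumption about the page's markup:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the JavaScript-generated table to appear.
# 'table' is a placeholder selector; replace it with the actual element.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table'))
)
page_source = driver.page_source

If the page never renders (as here), this simply times out, but it makes the failure explicit rather than silent.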

I also tried Selenium + PhantomJS 2.1.1 and CasperJS + PhantomJS 2.1.1, but the results were similar and I couldn't retrieve the page.

Environment:
Server: AWS EC2
OS: Ubuntu 16.04
Python: 3.5.2
Selenium + Google Chrome 60.0.3112.90
BeautifulSoup4
Node: 8.2.1

At first I used PhantomJS, but I couldn't fetch the above site. I thought it might be a browser problem, so I accessed the site with Chrome on Ubuntu over remote desktop, and it displayed normally. So I assumed Chrome's headless mode would behave the same as its normal mode...

I'm at a loss, having hit dead ends in every direction.
Any help would be appreciated.

google-chrome python3 selenium web-scraping phantomjs

2022-09-29 22:25

2 Answers

Is the HTTPS connection failing?
Before running it through Selenium, first check, in your environment and as the user that runs Selenium, whether the target page is really reachable without errors.

I accessed the site with Chrome on Ubuntu over remote desktop, and it displayed normally.

There is probably an oversight here.
The site requires an intermediate CA certificate that is not installed on Ubuntu by default, so the HTTPS connection is likely failing.
Install the corresponding intermediate CA certificate following Chromium's certificate-management procedure for Linux:
https://chromium.googlesource.com/chromium/src/+/lkcr/docs/linux_cert_management.md
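
To check this from the account that runs Selenium, a plain TLS handshake in Python will show whether certificate verification is the problem. A minimal sketch; note that Python verifies against the OpenSSL system store while Chrome uses its own NSS store, so the result is only indicative:

import socket
import ssl

# Attempt a TLS handshake against the target host using the default trust store.
# If the server sends an incomplete chain and the missing intermediate CA is not
# installed locally, verification fails with CERTIFICATE_VERIFY_FAILED.
HOST = 'www.coin-laundry.co.jp'
context = ssl.create_default_context()
try:
    with socket.create_connection((HOST, 443), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            print('TLS handshake succeeded:', tls.version())
except ssl.SSLError as err:
    print('TLS handshake failed:', err)

If verification fails, the document above describes how to add the certificate to the per-user NSS database that Chrome reads.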


2022-09-29 22:25

Google's official Node library, Puppeteer, makes it easy to drive headless Chrome. pyppeteer is a Python port of it, and I think it's much easier to use than Selenium.

It can be installed via pip:

python3 -m pip install pyppeteer

You don't need to install Chromium in advance; pyppeteer downloads it automatically (about 100 MB) the first time the program runs.

With pyppeteer, the rendered HTML can be obtained with the code below, which I think will work in this case. You don't need to install Google Chrome, so it's easy to use in a server environment.

import asyncio
from pyppeteer import launch


async def main():
    TARGET = 'http://example.com'  # placeholder URL; the scheme is required
    browser = await launch()
    page = await browser.newPage()
    await page.goto(TARGET)
    # Take a screenshot (optional).
    await page.screenshot({'path': 'example.png'})

    html = await page.content()  # HTML after JavaScript has rendered
    await browser.close()
    return html

html = asyncio.get_event_loop().run_until_complete(main())
print(html)
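
Once the rendered HTML is in hand, it can be parsed with BeautifulSoup just as in the question. A sketch, assuming html is the value returned by the code above and that the status table is an ordinary <table> element (a placeholder assumption):

from bs4 import BeautifulSoup

# 'html' is the rendered page source produced by the pyppeteer code above.
soup = BeautifulSoup(html, 'html.parser')

table = soup.find('table')  # 'table' is a placeholder; adjust to the real markup
if table is not None:
    for row in table.find_all('tr'):
        print([cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])])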


2022-09-29 22:25
