https://www.coin-laundry.co.jp/userp/shop_detail/10000543.html
I would like to extract a table from the above site.
The table appears to be generated by JavaScript, so I wrote the following code to fetch the page with a headless browser (Chrome) driven by Selenium.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--disable-desktop-notifications')
options.add_argument('--disable-extensions')
options.add_argument('--allow-running-insecure-content')
options.add_argument('--disable-web-security')
options.add_argument('--no-sandbox')
options.add_argument('--lang=ja')
options.add_argument('--window-size=1200,600')

driver = webdriver.Chrome(chrome_options=options)

# Open the URL.
TARGET = 'https://www.coin-laundry.co.jp/userp/shop_detail/10000543.html'
driver.get(TARGET)
time.sleep(2)  # Wait 2 seconds for JavaScript to render.

page_source = driver.page_source
html = BeautifulSoup(page_source, 'html.parser')
print(html)

driver.quit()  # Exit the browser.
When I run this code against this particular site, all I get back is:

<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>
I can't even retrieve the HTML as it is before JavaScript rendering.
I tried changing the duration in time.sleep(), but the result is the same.
If I set TARGET to a different URL, the HTML prints out properly.
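For reference, here is how I would wait explicitly for the table instead of using a fixed sleep (continuing from the driver set up above; the 'table' CSS selector is only a placeholder, since I don't know the element's actual markup):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a <table> element to appear in the DOM.
# 'table' is a placeholder selector for the JavaScript-generated table.
wait = WebDriverWait(driver, 10)
table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table')))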
I also tried Selenium + PhantomJS 2.1.1 and CasperJS + PhantomJS 2.1.1, but the results were similar and I couldn't fetch the page either way.
Environment:
Server: AWS EC2
OS: Ubuntu 16.04
Python 3.5.2
Selenium
Google Chrome 60.0.3112.90
BeautifulSoup4
Node 8.2.1
At first I used PhantomJS, but I couldn't fetch the above site. I thought it might be a browser problem, so I accessed the site with Chrome on Ubuntu over a remote desktop, and it displayed normally. So I assumed Chrome's headless mode would behave the same as its normal mode...
I'm at a loss, having hit a dead end in every direction I tried.
Thank you in advance for your help.
Is the HTTPS connection failing?
Before running it through Selenium, first check whether the browser in your environment, run as the same user Selenium runs as, can really access the target page without any errors.
You said you accessed the site from a remote desktop using Chrome on Ubuntu and it looked normal, but there is probably an oversight here: an intermediate CA certificate is required to access this site, and since it is not installed on Ubuntu, the HTTPS connection should have failed.
Install the corresponding intermediate CA certificate following the certificate-management procedure for Chromium on Linux:
https://chromium.googlesource.com/chromium/src/+/lkcr/docs/linux_cert_management.md
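As a rough way to confirm this diagnosis, you can attempt a verified TLS handshake from Python. This is only a sketch: it checks the chain against OpenSSL's system trust store rather than Chrome's NSS store, but a server that doesn't send its intermediate certificate will typically fail here too:

import socket
import ssl

HOST = 'www.coin-laundry.co.jp'

# Attempt a TLS handshake with certificate verification enabled.
# An incomplete chain (e.g. a missing intermediate CA) usually raises
# an ssl.SSLError with CERTIFICATE_VERIFY_FAILED.
context = ssl.create_default_context()
try:
    with socket.create_connection((HOST, 443), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            print('Handshake OK:', tls.version())
except ssl.SSLError as e:
    print('Handshake failed:', e)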
Puppeteer, Google's official Node library for driving headless Chrome, is now available, and pyppeteer is a Python port of it. I think it's much easier to use than Selenium.
Installation can be done via pip:

python3 -m pip install pyppeteer
Chromium (about 100 MB) is downloaded and installed automatically the first time the program runs, so you don't need to install it in advance.
With pyppeteer, fetching the HTML looks like the code below, and I think it would work in your case too. Since you don't need to install Google Chrome, it's easy to use in a server environment.
import asyncio
from pyppeteer import launch

async def main():
    TARGET = 'http://example.com'
    browser = await launch()
    page = await browser.newPage()
    await page.goto(TARGET)
    # Take a screenshot of the rendered page.
    await page.screenshot({'path': 'example.png'})
    # Get the HTML after JavaScript has rendered.
    html = await page.content()
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
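Applied to the page in question, you could then hand the rendered HTML to BeautifulSoup as in your original code. A minimal sketch, assuming the data sits in plain <table> elements (I don't know the page's actual markup), with ignoreHTTPSErrors to work around the certificate issue discussed above:

import asyncio
from pyppeteer import launch
from bs4 import BeautifulSoup

async def fetch_html(url):
    # ignoreHTTPSErrors works around the incomplete certificate chain.
    browser = await launch(ignoreHTTPSErrors=True)
    page = await browser.newPage()
    await page.goto(url)
    html = await page.content()
    await browser.close()
    return html

TARGET = 'https://www.coin-laundry.co.jp/userp/shop_detail/10000543.html'
html = asyncio.get_event_loop().run_until_complete(fetch_html(TARGET))

# Parse the rendered HTML; the plain 'table' lookup is only a guess
# at the page's markup.
soup = BeautifulSoup(html, 'html.parser')
for table in soup.find_all('table'):
    print(table.get_text(' ', strip=True))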