href="javascript:void(0)" scraping using Selenium in Python

Asked 2 years ago, Updated 2 years ago, 369 views

http://bit.sikkou.jp/app/past/pt003/h01/
I would like to create a CSV file that lists all the data on the above site: every district, every court, and all the properties.

Using Selenium with Google Chrome, I cannot retrieve the link elements for each location, click them, or trigger the page transition.
Each location link has href="javascript:void(0)", and from the sites below I learned that such links are driven by JavaScript and that the scraping has to deal with that behavior.

How do I retrieve what the browser displays?
How do I scrape web pages that I can't retrieve with requests?

However, I don't understand the following points well and am stuck:

  • Is it even possible to operate on JavaScript-driven elements with Selenium?
  • How do links whose href is "javascript:void(0)" display different content on each page?
  • How do I search for and narrow down elements in the headless mode of requests_html?

As a beginner I'm sure I'm missing a lot, but I'd appreciate any guidance.

Environment
Operating System: Windows 10
Python: 3.8
IDE: PyCharm

python javascript web-scraping

2022-09-30 21:50

1 Answer

Is it even possible to operate on JavaScript-driven elements with Selenium?

Yes, it's possible. For example, the documentation covers it:
Run Javascript within the selenium-webdriver page

Python
To run JavaScript from Python, use execute_script("JavaScript to run"). execute_script is called on the WebDriver instance, and its argument can be any valid JavaScript.

from selenium import webdriver

driver = webdriver.Chrome()
driver.execute_script("alert('running javascript');")
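
Incidentally, execute_script can also return a value from the page back to Python, which is often useful when scraping:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://bit.sikkou.jp/app/past/pt003/h01/')

# A script that starts with "return" hands its result back to Python
title = driver.execute_script("return document.title;")
print(title)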

How do links whose href is "javascript:void(0)" display different content on each page?

Looking at the site's HTML/CSS/JavaScript source, it appears that each element is given an ID or number, a .click() handler is registered for it, and that value is passed as a parameter when the next page is requested.
HTML source of http://bit.sikkou.jp/app/past/pt003/h01/
Sources of SCPT003.js, common.js, and common_env.js under http://bit.sikkou.jp/app/resource/app/js/

common.js contains a list of each prefecture's IDs and names, routines to convert between them, and the commonSubmit() used below.
The relevant parts of SCPT003.js appear to be the following:

$("#mapa").click(function(){
    idName=$(this).attr("id");
    tdfId = getTdfNameToId(idName);
    $("#prefecturesId").val(tdfId);
    eventID = "h02";
    copyToHiddenValue();
    (event.preventDefault)?event.preventDefault(): event.returnValue=false;
    (event.stopPropagation)?event.stopPropagation(): event.returnValue=false;
    commonSubmit (eventID);
});

$(".arrow_list area").click(function(){
    courtId=$(this).close("li").find("span")[0].innerText;
    $("#courtId").val(courtId);
    eventID = "h03";
    copyToHiddenValue();
    (event.preventDefault)?event.preventDefault(): event.returnValue=false;
    (event.stopPropagation)?event.stopPropagation(): event.returnValue=false;
    commonSubmit (eventID);
});

$("#search").click(function(event){
    var eventID = "h20";
    copyToHiddenValue();
    (event.preventDefault)?event.preventDefault(): event.returnValue=false;
    (event.stopPropagation)?event.stopPropagation(): event.returnValue=false;
    commonSubmit (eventID);
});
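
Given these handlers, one alternative to clicking the elements is to have Selenium run the equivalent JavaScript itself with execute_script. The sketch below is only an illustration: it assumes commonSubmit(), copyToHiddenValue(), and the #prefecturesId hidden field behave as in the quoted source, and the prefecture ID '13' (Tokyo) is an unverified guess that has to be checked against common.js.

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://bit.sikkou.jp/app/past/pt003/h01/')

# Replicate what the map click handler does: set the hidden field, then
# submit. Function and field names come from the quoted source; the value
# '13' is an unverified assumption.
driver.execute_script("""
    $('#prefecturesId').val('13');
    copyToHiddenValue();
    commonSubmit('h02');
""")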

How do I search for and narrow down elements in the headless mode of requests_html?

It's an old discussion, but according to the articles below, there doesn't seem to be an easy way to do this with requests_html.
There may be something more up-to-date, or you may be able to write and call raw JavaScript as in the second article.

Sending a click with requests_html and pyppeteer python
A commenter says it didn't work and they switched to Selenium.
Python requests_html submit a form by clicking a button using JQuery
Like the situation in your question, it writes and executes raw JavaScript.
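
For reference, here is an untested sketch in the style of the second article: requests_html can render a page in headless Chromium and run raw JavaScript via render(script=...). Whether a click fired this way survives this site's form submission is unverified.

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://bit.sikkou.jp/app/past/pt003/h01/')

# render() launches headless Chromium (downloaded on first use) and can run
# a JavaScript snippet; here we try clicking Tokyo on the map (untested).
r.html.render(script="() => document.getElementById('tokyo').click()", sleep=2)
print(r.html.find('title', first=True).text)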

Or, as in the following articles, depending on how you access the site, you may be able to fetch all the items directly.

How to click 'Next' for pagination using Requests - HTML library
An example where all the data could be retrieved in one go.
Python scraping table cannot be retrieved
A comment points out a way to fetch all the records from the backend.

Furthermore, since the prefecture IDs and court IDs are known, it might also work to assemble the form data and submit the request yourself, as in @metropolis's answer to this question:
Python scraping fails to retrieve data
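
As a rough, unverified illustration of that approach (not taken from the linked answer): the JavaScript above suggests the form posts hidden fields such as prefecturesId and courtId, so a direct request might look like the sketch below. The endpoint URL and every value here are assumptions that must be confirmed in the browser's developer tools.

import requests

# Everything below is guessed from the quoted JavaScript; inspect the real
# POST request in the browser's network tab before relying on it.
data = {
    'prefecturesId': '13',   # hypothetical prefecture ID (Tokyo?)
    'courtId': '1010',       # hypothetical court ID
}
resp = requests.post('http://bit.sikkou.jp/app/past/pt003/h20/', data=data)  # assumed endpoint
print(resp.status_code)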

With Selenium, the flow would look like the following (a combined sketch appears after the list):

  • Load the past-data search page:
    driver.get('http://bit.sikkou.jp/app/past/pt003/h01/')
  • Select and click a prefecture (Tokyo in the example below):
    driver.find_element_by_xpath('//*[@id="tokyo"]').click() or driver.find_element_by_id('tokyo').click()
  • Select and click a court:
    first in the list: driver.find_element_by_xpath('//*[@id="left_box"]/div[3]/div[2]/ul/li[1]/a').click()
    second in the list: driver.find_element_by_xpath('//*[@id="left_box"]/div[3]/div[2]/ul/li[2]/a').click()
    Determine how many courts each prefecture has from the number of elements in the list.
  • Select and set search criteria if necessary.
  • Click the Search button:
    driver.find_element_by_xpath('//*[@id="search"]').click() or driver.find_element_by_id('search').click()
  • Collect the property information from the loaded results into a list.
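
Putting those steps together, a rough sketch (written in the Selenium 4 find_element(By, ...) style; the element IDs and XPaths come from the steps above and should be re-checked against the live page):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# Load the past-data search page
driver.get('http://bit.sikkou.jp/app/past/pt003/h01/')

# Select a prefecture (Tokyo in this example)
driver.find_element(By.ID, 'tokyo').click()

# Count the courts listed for this prefecture, then click the first one
courts = driver.find_elements(By.XPATH, '//*[@id="left_box"]/div[3]/div[2]/ul/li')
print(len(courts), 'courts listed')
driver.find_element(By.XPATH, '//*[@id="left_box"]/div[3]/div[2]/ul/li[1]/a').click()

# (Set any search criteria here, then) click the Search button
driver.find_element(By.ID, 'search').click()

# From here, parse the results page and collect the property rows for the CSV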


2022-09-30 21:50
