Python Scraping Can't Get Data

Asked 2 years ago, Updated 2 years ago, 385 views

Sorry, I'm a Python beginner. I'd like to collect information from a website all at once, but the page has various menus, so is there a way to retrieve everything in one go?
I think this is a scraping basic, but I'd appreciate any help.

The web pages are as follows:

https://nintei.nurse.or.jp/certification/General/(X(1)S(efl0y555pect3x45oxjzfw3x))/General/GCPP01LS/GCPP01LS.aspx?AspxAutoDetectCookieSupport=1

I tried the following as a template. It doesn't raise any errors, but nothing gets downloaded.

# For saving
driver_path = r'C:\Anaconda3\chromedriver.exe'  # my chromedriver location

# URL of the page to load
URL = 'https://nintei.nurse.or.jp/certification/General/(X(1)S(efl0y555pect3x45oxjzfw3x))/GCPP01LS/GCPP01LS.aspx'

# Folder to store the downloaded files in
send_path = r'C:\Users\akira\Documents\Python\Company'

from selenium import webdriver
import time
import bs4
import re
import os
import shutil
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

start = time.time()

driver = webdriver.Chrome(driver_path)
driver.get(URL)
time.sleep(3)

soup = bs4.BeautifulSoup(driver.page_source, 'html5lib')

base = 'https://nintei.nurse.or.jp/certification/General/'

soup_file1 = soup.find_all('a')
href_list = []

file_num = 1
sum_file = 1

cc = 0
for s in soup_file1:
    if s.string == 'Search':
        path = base + s.get('href')
        href_list.append(path)

        print(path)
        driver.get(path)

        # element_to_be_clickable takes a (By, locator) tuple
        WebDriverWait(driver, 300).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ctl00_plhContent_btnSearchMain"]')))
        driver.find_element_by_xpath('//*[@id="ctl00_plhContent_btnSearchMain"]').click()

        # wait until a new file appears in the download folder
        while sum_file == file_num:
            sum_file = len(os.listdir(r'C:\Users\akira\Downloads'))
        else:
            print("Current number of downloaded files: {}".format(sum_file - 1))
            file_num += 1

        cc += 1

# Wait a while so that temporary download files don't get in the way
time.sleep(60)

# Moving files
dw_path = r'C:\Users\akira\Downloads'  # source: the browser's download folder
dw_list = os.listdir(dw_path)

dw_xlsx = [f for f in dw_list if f.endswith('.xlsx')]  # keep only the Excel files
for dw in dw_xlsx:
    shutil.move(os.path.join(dw_path, dw), send_path)

python python3 web-scraping

2022-09-30 21:50

2 Answers

The page contains several hidden parameters (for example, __VIEWSTATE and __EVENTVALIDATION). Obtain the values of these parameters on the first access and submit them together with the form data.

Below is a Python script that does this, but the response (an HTML file) contains only the first 50 search results. To get all of the results, you also need to follow the link to each subsequent page and retrieve its HTML.

import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

# first access: get the hidden parameters
url = r'https://nintei.nurse.or.jp/certification/General/(X(1)S(efl0y555pect3x45oxjzfw3x))/General/GCPP01LS/GCPP01LS.aspx'
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, 'lxml')

# generate the form data from the hidden fields
count = int(soup.select('#__VIEWSTATEFIELDCOUNT')[0]['value'])
form_data = {
    '__VIEWSTATEFIELDCOUNT': count,
    '__VIEWSTATE': soup.select('#__VIEWSTATE')[0]['value'],
    '__EVENTVALIDATION': soup.select('#__EVENTVALIDATION')[0]['value']
}
for i in range(1, count):
    form_data[f'__VIEWSTATE{i}'] = soup.select(f'#__VIEWSTATE{i}')[0]['value']

form_data['ctl00$plhContent$btnSearchMain'] = 'Search'
form_data['ctl00$plhContent$drpField'] = -1
form_data['ctl00$plhContent$drpNameOwnerWorking'] = -1
form_data['ctl00$plhContent$drpWorkPrefecture'] = -1
form_data['ctl00$plhContent$drpWorkType'] = -1
form_data['ctl00$plhContent$radlstCert'] = 1

# second access: POST the form data and get the search result
form_data = urllib.parse.urlencode(form_data).encode()
html = urllib.request.urlopen(url, form_data).read().decode()
print(html)
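
As a follow-up, here is a minimal sketch of pulling the individual rows out of the returned HTML. It assumes the results are rendered inside the element with id ctl00_plhContent_dlvMain (the id quoted in the other answer); the exact cell layout is not verified here.

from bs4 import BeautifulSoup

# parse the HTML string returned by the second access above
soup = BeautifulSoup(html, 'lxml')

# assumption: the results sit in the element with this id
# (an ASP.NET DataList, which usually renders as a <table>)
table = soup.find(id='ctl00_plhContent_dlvMain')
if table is not None:
    for row in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if cells:  # skip rows without <td> cells (headers/navigation)
            print(cells)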


2022-09-30 21:50

There are many different ways of holding and presenting information on websites, and no single template can be applied across different sites, even ones created by the same company for the same purpose.

Forget the idea that some applicable template exists, and start by looking at the content and structure of the target site.

  • On this page, it seems that all information can be obtained with the default values, without specifying any search conditions.
    By specifying each of the categories (Certified Nurses, Certified Nursing Managers, and Professional Nurses) in turn, you can get everything.
  • However, only 50 results are shown at a time, so you need to click the "Next" button to page through them.
    The code you posted loops over a list of a tags and clicks a "Search" button inside that loop, but that does not match how this page works.
  • Therefore, the answer to "how to get all of these at once" is probably that you can't.
    Currently there are more than 20,700 registered Certified Nurses, so at 50 per page a loop over 415 pages is required.
    There may be some way to retrieve everything at once, but it is probably private and not generally available.

Note:
An example where the full data is published officially (available as an Excel sheet):
Information Processing Security Advisor Search Service

An example that can be retrieved by a behind-the-scenes technique (the comments show the technique):
Python scraping: table cannot be retrieved
A reference for getting the data by scraping, which I also answered:
Scraping with Selenium (with a "Next" page)

Here's how to think about it (a sketch that puts these steps together follows the list):

  • Review the code from base='https://nintei.nurse.or.jp/certification/General/' onward.
  • First, click the search button once with driver.find_element_by_xpath('//*[@id="ctl00_plhContent_btnSearchMain"]').click()
  • Use Selenium or BeautifulSoup (or whatever other method you find easy to use) to extract the data under the XPath '//*[@id="ctl00_plhContent_dlvMain"]'.
    The links to the previous/next pages and the item names are also included in the table, so strip them out if you don't need them.
  • Click the "Next" button using driver.find_element_by_xpath('//*[@id="page_navi_wrapper"]/ul/li[9]/a').click()
  • If you try to go further after the last page is displayed, the element no longer exists and an exception is raised, so either check for the element in advance or exit the loop by catching the exception with try/except.
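
Putting those steps together, a minimal sketch might look like the following. It is untested against the live page: the XPaths are the ones quoted above, the fixed sleeps are crude placeholders for proper waits, and the old find_element_by_xpath API from the question's Selenium version is kept.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import bs4
import time

driver = webdriver.Chrome(r'C:\Anaconda3\chromedriver.exe')
driver.get('https://nintei.nurse.or.jp/certification/General/(X(1)S(efl0y555pect3x45oxjzfw3x))/General/GCPP01LS/GCPP01LS.aspx')
time.sleep(3)

# step 1: click the search button once
driver.find_element_by_xpath('//*[@id="ctl00_plhContent_btnSearchMain"]').click()
time.sleep(3)

rows = []
while True:
    # step 2: extract the result table on the current page
    soup = bs4.BeautifulSoup(driver.page_source, 'html5lib')
    table = soup.find(id='ctl00_plhContent_dlvMain')
    if table is not None:
        for tr in table.find_all('tr'):
            cells = [td.get_text(strip=True) for td in tr.find_all('td')]
            if cells:
                rows.append(cells)
    # step 3: click "Next"; past the last page the element is gone,
    # so catch the exception and leave the loop
    try:
        driver.find_element_by_xpath('//*[@id="page_navi_wrapper"]/ul/li[9]/a').click()
        time.sleep(3)
    except NoSuchElementException:
        break

print(len(rows), 'rows collected')
driver.quit()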


2022-09-30 21:50


© 2024 OneMinuteCode. All rights reserved.