I am using Python and XPath for web scraping.
The site I am scraping is:
https://www.e-stat.go.jp/stat-search/files?page=1&toukei=00100405&tstat=000001014549
The goal is to extract the tags that carry the information I want (the href attribute). My code is as follows:
import requests
from bs4 import BeautifulSoup
from lxml import html
# XPath for the tags to extract (the CSV download links in the middle row)
target_xpath = '//a[contains(text(), "Proliferation and Retention of Major Durable Goods, etc.")]/parent::div/following-sibling::div[3]/div/a[contains(@data-file_type, "CSV")]'
# URL of the page that links to the data for each fiscal year
cause_url = "https://www.e-stat.go.jp/stat-search/files?page=1&toukei=00100405&tstat=000001014549"
# Base URL used to build the address of each fiscal year's data page
base_url = "https://www.e-stat.go.jp/stat-search/files?page=1&layout=datalist&toukei=00100405&tstat=000001014549&cycle=0"
# Fetch the page that links to the data for each fiscal year
response = requests.get(cause_url)
# Avoid garbled characters
response.encoding = response.apparent_encoding
# Parse the page that links to the data for each fiscal year
soup = BeautifulSoup(response.content, "html.parser")
# Read the data-value attributes from the links on that page
for span in soup.find_all('span', attrs={'data-value1': True, 'data-value2': True}):
    if 'March Survey' in span.text:
        val1 = "&tclass1=" + str(span['data-value1'])
        val2 = "&tclass2=" + str(span['data-value2'])
        # Build the extra query string
        add_url = val1 + val2 + "&tclass3val=0"
        # URL of this fiscal year's data page
        load_url = base_url + add_url
        # Parse the download page (with XPath, because writing out the path by hand is tedious)
        load_request = requests.get(load_url)
        load_html = load_request.text
        load_root = html.fromstring(load_html)
        # Extract the required elements
        looking_tag = load_root.xpath(target_xpath)
        print(looking_tag)
Output Results
[<Element a at 0x2584b9b1040>, <Element a at 0x2584b9b17c0>, <Element a at 0x2584b9b1d60>]
[<Element a at 0x2584b92e680>, <Element a at 0x2584b92e2c0>, <Element a at 0x2584b9b9db0>]
[<Element a at 0x2584c3b5e50>, <Element a at 0x2584c3b5c70>, <Element a at 0x2584cf82f40>]
[<Element a at 0x2584c81a590>, <Element a at 0x2584c81a5e0>, <Element a at 0x2584cd2def0>]
[<Element a at 0x2584cf82db0>, <Element a at 0x2584c518b80>, <Element a at 0x2584c518360>]
[<Element a at 0x2584b92e090>, <Element a at 0x2584cf82f40>, <Element a at 0x2584cf82860>]
[<Element a at 0x2584c81a0e0>, <Element a at 0x2584b92e680>, <Element a at 0x2584b92e2c0>]
[<Element a at 0x2584c81a5e0>, <Element a at 0x2584b9b99a0>, <Element a at 0x2584b9b9db0>]
[]
[<Element a at 0x2584c5184a0>, <Element a at 0x2584c81a590>, <Element a at 0x2584c81a5e0>]
[]
[<Element a at 0x2584c81a0e0>, <Element a at 0x2584b92e680>, <Element a at 0x2584b92e2c0>]
[<Element a at 0x2584b92e090>, <Element a at 0x2584b9b17c0>, <Element a at 0x2584b9b10e0>]
[<Element a at 0x2584c3b5e50>, <Element a at 0x2584cd2def0>, <Element a at 0x2584cd2dd60>]
[<Element a at 0x2584b92e2c0>, <Element a at 0x2584b92e680>, <Element a at 0x2584c3b5c70>]
[<Element a at 0x2584c5184a0>, <Element a at 0x2584cd2def0>, <Element a at 0x2584cd2dd60>]
[<Element a at 0x2584c3b5c70>, <Element a at 0x2584c518b80>, <Element a at 0x2584b9b19a0>]
[<Element a at 0x2584c81a5e0>, <Element a at 0x2584c81a0e0>, <Element a at 0x2584b9b99a0>]
I got this output, but I don't know how to read it. Could you explain it? Also, if there is a better way to do the extraction in Python, please let me know.
python xpath
Append /@href to the XPath to get the value of the href attribute.
target_xpath = '//a[contains(text(), "Proliferation and Retention of Major Durable Goods, etc.")]/parent::div/following-sibling::div[3]/div/a[contains(@data-file_type, "CSV")]/@href'
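To see the difference offline, here is a minimal sketch that runs the same kind of query with and without /@href. The HTML fragment and its href values are made up for illustration and are not the real e-Stat markup; only the lxml calls are real.

```python
from lxml import html

# Hypothetical HTML fragment standing in for the e-Stat page (not the real markup).
SAMPLE_HTML = """
<div>
  <a data-file_type="CSV" href="/download?id=1">CSV</a>
  <a data-file_type="EXCEL" href="/download?id=2">Excel</a>
  <a data-file_type="CSV" href="/download?id=3">CSV</a>
</div>
"""

root = html.fromstring(SAMPLE_HTML)

# Without /@href the query returns a list of Element objects.
elements = root.xpath('//a[contains(@data-file_type, "CSV")]')

# With /@href the query returns a list of plain strings.
hrefs = root.xpath('//a[contains(@data-file_type, "CSV")]/@href')

print(elements)
print(hrefs)
```

The first print shows Element objects like the ones in the question, while the second shows the href strings themselves.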
<Element a at 0x...> is how an object of lxml's Element class is displayed.
When you query HTML with xpath(), the matching elements come back as a list.
So print(looking_tag) only shows the list of Element objects, not their contents.
To look inside an element, use its attrib property, which exposes the element's attributes like a dict.
Sample code
        # First half of the code omitted
        # Extract the required elements
        looking_tag = load_root.xpath(target_xpath)
        # print(looking_tag)
        for state in looking_tag:
            print(f'href is {state.attrib["href"]}.')
            print(state.attrib)
        # Still inside the loop over the spans: stop after the first matching page
        break
Run Results
href is /stat-search/file-download?statInfId=000032190006&fileKind=1.
{'href':'/stat-search/file-download?statInfId=000032190006&fileKind=1', 'class':'stat-dl_icon stat-icon_1stat-icon_format js-dl stat-download_icon_left', 'data-file_id':'dat9', ...}
href is /stat-search/file-download?statInfId=000032190011&fileKind=1.
{'href':'/stat-search/file-download?statInfId=000032190011&fileKind=1', 'class':'stat-dl_icon stat-icon_1stat-icon_format js-dl stat-download_icon_left', 'data-file_id':'dat9', ...}
href is /stat-search/file-download?statInfId=000032190016&fileKind=1.
{'href':'/stat-search/file-download?statInfId=000032190016&fileKind=1', 'class':'stat-dl_icon stat-icon_1stat-icon_format js-dl stat-download_icon_left', 'data-file_id':'dat9', ...}
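If you want to try the element inspection without hitting the site, a small sketch like the following may help. The anchor tag and its attribute values are invented for illustration; tag, attrib, get(), text_content() and html.tostring() are the lxml features being shown.

```python
from lxml import html

# Hypothetical anchor element; the attribute values are invented for illustration.
link = html.fromstring(
    '<a class="dl-icon" data-file_type="CSV" href="/download?id=1">CSV file</a>'
)

print(link.tag)                     # element name
print(dict(link.attrib))            # all attributes as a plain dict
print(link.get("href"))             # a single attribute value
print(link.get("title", "(none)"))  # fallback value when the attribute is missing
print(link.text_content())          # text inside the element
print(html.tostring(link, encoding="unicode"))  # element serialized back to HTML
```

Note that state.attrib["href"] raises KeyError when the attribute is missing, whereas get() returns None (or the fallback you pass), which can be more forgiving when some links lack an attribute.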
Also, as @metropolis answered, you can get the attribute values directly as a list by appending /@ followed by the attribute name to the XPath.
This xpath returns a list of strings in the following format:
['/stat-search/file-download?statInfId=000032190006&fileKind=1', '/stat-search/file-download?statInfId=000032190011&fileKind=1', '/stat-search/file-download?statInfId=000032190016&fileKind=1']
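The returned values are site-relative paths, so they need the scheme and host added before they can be requested. A minimal sketch using urljoin from the standard library follows. It assumes the download links resolve against the same host as the listing page, which this thread does not confirm.

```python
from urllib.parse import urljoin

# Page the links were scraped from (URL taken from the question).
page_url = "https://www.e-stat.go.jp/stat-search/files?page=1&toukei=00100405&tstat=000001014549"

# Relative hrefs of the kind returned by the /@href query.
hrefs = [
    "/stat-search/file-download?statInfId=000032190006&fileKind=1",
    "/stat-search/file-download?statInfId=000032190011&fileKind=1",
]

# urljoin resolves each relative path against the page URL.
download_urls = [urljoin(page_url, href) for href in hrefs]

for url in download_urls:
    print(url)
```

Each resulting URL could then be passed to requests.get() in the same way as load_url in the question.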