Web scraping with Python and XPath

Asked 1 year ago, Updated 1 year ago, 116 views

I am doing web scraping with Python and XPath.
The site being scraped is:

https://www.e-stat.go.jp/stat-search/files?page=1&toukei=00100405&tstat=000001014549

The goal is to extract the tags holding the information I want (the href attribute). The code is as follows:

import requests
from bs4 import BeautifulSoup
from lxml import html

# XPath for the tags to extract (the middle tier of download files)
target_xpath='//a[contains(text(), "Proliferation and Retention of Major Durable Goods, etc.")]/parent::div/following-sibling::div[3]/div/a[contains(@data-file_type, "CSV")]'

# URL of the page with entry points to each fiscal year's data
cause_url="https://www.e-stat.go.jp/stat-search/files?page=1&toukei=00100405&tstat=000001014549"

# Base URL used to build links into each fiscal year's data
base_url="https://www.e-stat.go.jp/stat-search/files?page=1&layout=datalist&toukei=00100405&tstat=000001014549&cycle=0"

# Fetch the page with entry points to each fiscal year's data
response=requests.get(cause_url)

# Prevention of garbled characters
response.encoding=response.apparent_encoding

# Parse the page with entry points to each fiscal year's data
# (use response.text so the encoding set above is actually applied)
soup = BeautifulSoup(response.text, "html.parser")

# Get the data-value attributes from the page with entry points to each fiscal year's data
for span in soup.find_all('span', attrs={'data-value1': True, 'data-value2': True}):
    if 'March Survey' in span.text:
        val1="&tclass1="+str(span['data-value1'])
        val2="&tclass2="+str(span['data-value2'])

        # Create additional url
        add_url=val1+val2+"&tclass3val=0"

        # URL of each fiscal year's data page
        load_url = base_url + add_url

        # Parse the download page (using XPath because building the path otherwise is tedious)
        load_request=requests.get(load_url)
        load_html = load_request.text
        load_root=html.fromstring(load_html)

        # Extracting Required Elements
        looking_tag=load_root.xpath(target_xpath)
        print(looking_tag)

Output Results

[<Element a at 0x2584b9b1040>, <Element a at 0x2584b9b17c0>, <Element a at 0x2584b9b1d60>]
[<Element a at 0x2584b92e680>, <Element a at 0x2584b92e2c0>, <Element a at 0x2584b9b9db0>]
[<Element a at 0x2584c3b5e50>, <Element a at 0x2584c3b5c70>, <Element a at 0x2584cf82f40>]
[<Element a at 0x2584c81a590>, <Element a at 0x2584c81a5e0>, <Element a at 0x2584cd2def0>]
[<Element a at 0x2584cf82db0>, <Element a at 0x2584c518b80>, <Element a at 0x2584c518360>]
[<Element a at 0x2584b92e090>, <Element a at 0x2584cf82f40>, <Element a at 0x2584cf82860>]
[<Element a at 0x2584c81a0e0>, <Element a at 0x2584b92e680>, <Element a at 0x2584b92e2c0>]
[<Element a at 0x2584c81a5e0>, <Element a at 0x2584b9b99a0>, <Element a at 0x2584b9b9db0>]
[]
[<Element a at 0x2584c5184a0>, <Element a at 0x2584c81a590>, <Element a at 0x2584c81a5e0>]
[]
[<Element a at 0x2584c81a0e0>, <Element a at 0x2584b92e680>, <Element a at 0x2584b92e2c0>]
[<Element a at 0x2584b92e090>, <Element a at 0x2584b9b17c0>, <Element a at 0x2584b9b10e0>]
[<Element a at 0x2584c3b5e50>, <Element a at 0x2584cd2def0>, <Element a at 0x2584cd2dd60>]
[<Element a at 0x2584b92e2c0>, <Element a at 0x2584b92e680>, <Element a at 0x2584c3b5c70>]
[<Element a at 0x2584c5184a0>, <Element a at 0x2584cd2def0>, <Element a at 0x2584cd2dd60>]
[<Element a at 0x2584c3b5c70>, <Element a at 0x2584c518b80>, <Element a at 0x2584b9b19a0>]
[<Element a at 0x2584c81a5e0>, <Element a at 0x2584c81a0e0>, <Element a at 0x2584b9b99a0>]

I got this output, but I don't know how to interpret it. Could someone explain it? Also, if there is a better way to extract the data (in Python), please let me know.

python xpath

2022-09-30 15:45

2 Answers

Add /@href to the XPath to get the value of the href attribute.

target_xpath='//a[contains(text(), "Proliferation and Retention of Major Durable Goods, etc.")]/parent::div/following-sibling::div[3]/div/a[contains(@data-file_type, "CSV")]/@href'
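A minimal illustration of the difference (using a made-up HTML snippet, not the e-Stat page itself): without /@href, xpath() returns Element objects; with /@href appended, it returns the attribute values as plain strings.

```python
from lxml import html

# Hypothetical snippet mimicking the structure of a download link
snippet = '''
<div>
  <a href="/stat-search/file-download?statInfId=000032190006&amp;fileKind=1" data-file_type="CSV">CSV</a>
  <a href="/some/other/page" data-file_type="DB">DB</a>
</div>
'''
root = html.fromstring(snippet)

# Without /@href: a list of Element objects
elements = root.xpath('//a[contains(@data-file_type, "CSV")]')

# With /@href: a list of strings (the attribute values)
hrefs = root.xpath('//a[contains(@data-file_type, "CSV")]/@href')

print(elements)  # e.g. [<Element a at 0x...>]
print(hrefs)     # ['/stat-search/file-download?statInfId=000032190006&fileKind=1']
```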


2022-09-30 15:45

<Element a at 0x...> is an Element class object in lxml.
When you use xpath() to retrieve HTML elements, the matching elements are returned as a list.
Even if you display this with print(looking_tag), only the Element class objects are shown.

To see the contents of an element, use its attrib property, which lets you enumerate the element's attributes as a dict.

sample code

#First half code omitted
        # Extracting Required Elements
        looking_tag=load_root.xpath(target_xpath)
        # print(looking_tag)
        for state in looking_tag:
            print(f'href is {state.attrib["href"]}.')
            print(state.attrib)
        break

Run Results

 href is /stat-search/file-download?statInfId=000032190006&fileKind=1.
{'href': '/stat-search/file-download?statInfId=000032190006&fileKind=1', 'class': 'stat-dl_icon stat-icon_1 stat-icon_format js-dl stat-download_icon_left', 'data-file_id': 'dat9', ...}
 href is /stat-search/file-download?statInfId=000032190011&fileKind=1.
{'href': '/stat-search/file-download?statInfId=000032190011&fileKind=1', 'class': 'stat-dl_icon stat-icon_1 stat-icon_format js-dl stat-download_icon_left', 'data-file_id': 'dat9', ...}
 href is /stat-search/file-download?statInfId=000032190016&fileKind=1.
{'href': '/stat-search/file-download?statInfId=000032190016&fileKind=1', 'class': 'stat-dl_icon stat-icon_1 stat-icon_format js-dl stat-download_icon_left', 'data-file_id': 'dat9', ...}

Also, as @metropolis answered, you can get attribute values as a list of strings by appending /@attribute-name to the XPath.
That XPath returns a list of strings in the following format:

['/stat-search/file-download?statInfId=000032190006&fileKind=1', '/stat-search/file-download?statInfId=000032190011&fileKind=1', '/stat-search/file-download?statInfId=000032190016&fileKind=1']
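Note that these href values are relative paths. If you want to download the files afterwards, a common approach is to join them with the site root using urllib.parse.urljoin (a sketch, assuming the root is https://www.e-stat.go.jp):

```python
from urllib.parse import urljoin

# Relative hrefs as returned by the /@href XPath above
site_root = 'https://www.e-stat.go.jp'
hrefs = [
    '/stat-search/file-download?statInfId=000032190006&fileKind=1',
    '/stat-search/file-download?statInfId=000032190011&fileKind=1',
    '/stat-search/file-download?statInfId=000032190016&fileKind=1',
]

# Build absolute download URLs
download_urls = [urljoin(site_root, h) for h in hrefs]
print(download_urls[0])
# https://www.e-stat.go.jp/stat-search/file-download?statInfId=000032190006&fileKind=1
```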


2022-09-30 15:45



© 2024 OneMinuteCode. All rights reserved.