CSS selector "table>tbody>tr" does not work in lxml

Asked 2 years ago, Updated 2 years ago, 141 views

https://eiga.com/theater/13/
I want to get the name and address of each movie theater from the website above, so I wrote the code below based on a reference book.

However, the address comes back empty, as shown below.

{'url': 'https://eiga.com/theater/13/130301/3271/', 'name': ['Uplink'], 'address': []}

I don't know what to do, so could you give me some advice? Thank you in advance.

import time
import re

import requests
import lxml.html

def main():
    session = requests.Session()
    response = requests.get('https://eiga.com/theater/13/')
    urls = scrap_list_page(response)
    for url in urls:
        time.sleep(1)  # wait between requests
        response = session.get(url)
        theater = scrap_detail_page(response)
        print(theater)
        break  # stop after the first theater while testing

def scrap_list_page(response):
    root=lxml.html.fromstring(response.content)
    root.make_links_absolute(response.url)

    for a in root.cssselect('#pref_theaters a'):
        url = a.get('href')
        yield url

def scrap_detail_page(response):
    root = lxml.html.fromstring(response.content)
    theater = {
        'url': response.url,
        'name': [h2.text_content() for h2 in root.cssselect('#main>div.wrap_ctsBox>div>h2')],
        'address': [td.text_content() for td in root.cssselect('#ciBox>table>tbody>tr:nth-child(1)>td')],
    }
    return theater

if __name__ == '__main__':
    main()

python css web-scraping

2022-09-30 20:19

1 Answer

Your CSS selector contains tbody, but the page's HTML source omits the tbody tag.
lxml parses the source as written and does not add the missing element, so remove tbody from the selector. It must be:

#ciBox>table>tr:nth-child(1)>td

A browser's DOM, by contrast, fills in the missing tbody element, so the selector above will not work in the browser.
If you instead remove the > between table and tr, for example, you get a selector that works in both environments:

#ciBox>table tr:nth-child(1)>td


2022-09-30 20:19
