https://eiga.com/theater/13/
I want to get the name and address of each movie theater from the website above, so I wrote the code below based on a reference book.
However, the address comes back empty, as shown here:
{'url': 'https://eiga.com/theater/13/130301/3271/', 'name': ['Uplink'], 'address': []}
I don't know what to do, so could you give me some advice? Thank you in advance.
import time
import re
import requests
import lxml.html


def main():
    session = requests.Session()
    response = session.get('https://eiga.com/theater/13/')
    urls = scrap_list_page(response)
    for url in urls:
        time.sleep(1)  # wait one second between requests
        response = session.get(url)
        theater = scrap_detail_page(response)
        print(theater)
        break  # stop after the first theater while testing


def scrap_list_page(response):
    # Yield the absolute URL of every theater link on the list page.
    root = lxml.html.fromstring(response.content)
    root.make_links_absolute(response.url)
    for a in root.cssselect('#pref_theaters a'):
        url = a.get('href')
        yield url


def scrap_detail_page(response):
    # Extract the name and address from a theater detail page.
    root = lxml.html.fromstring(response.content)
    theater = {
        'url': response.url,
        'name': [h2.text_content() for h2 in root.cssselect('#main>div.wrap_ctsBox>div>h2')],
        'address': [td.text_content() for td in root.cssselect('#ciBox>table>tbody>tr:nth-child(1)>td')],
    }
    return theater


if __name__ == '__main__':
    main()
Your CSS selector contains tbody, but the page's HTML source omits tbody.
lxml parses the source code as written, so drop tbody from the selector. It must be:
#ciBox>table>tr:nth-child(1)>td
A browser, on the other hand, fills in the missing tbody element when it builds the DOM, so the selector above (the one without tbody) will not match in the browser.
For example, if you remove the > between table and tr, so that tr only has to be a descendant of table rather than a direct child, you get a selector that works in both environments:
#ciBox>table tr:nth-child(1)>td