(Beginner) My web crawler doesn't work on some pages

Asked 2 years ago, Updated 2 years ago, 15 views

I am studying web crawling. On the same website (Dooin Auction, dooinauction.com), I am trying two categories: "auction" and "public sale". In the code below, line 3 is the auction URL and line 4 is the public sale URL. If I run it with the auction URL (line 3 url), I get a value for tots. If I run it with the public sale URL (line 4 url), I get nothing for tots. (I run only one of lines 3 and 4 at a time.)

From analyzing the HTML of both pages, there is only one 'div.pagn' on each page.

In the end, I want to get the value marked with a square in the HTML code (14032 for auction, 2153 for public sale). Extracting a value from the middle of a string is not easy. I'm not sure whether this is an appropriate question to ask here. I've been at it for hours, but I can't figure it out.

import urllib.request
from bs4 import BeautifulSoup
url = 'http://www.dooinauction.com/auction/ca_list.php'  # auction section
# url = 'http://www.dooinauction.com/pubauct/list.php'  # public sale section (swap with the line above to test)

# Download the page and parse it
req = urllib.request.Request(url)
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')

# Select the pagination block that should contain the total count
tots = soup.select('div.pagn')

print('Test end')

Auction page HTML

Public sale page HTML

python

2022-09-21 23:21

2 Answers

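The public sale page returns nothing because it receives its data dynamically and builds the list on the client side, so the pagination links are not in the HTML you download. On the auction page the links are in the HTML, and the total is in the href of each pagination link: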
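To see the difference between the two cases, here is a minimal sketch using only the standard library. The two HTML strings are made up for illustration (they are not copied from the site), and `PagnLinkCounter` and `count_pagn_links` are hypothetical helper names. It counts the links inside `div.pagn` in HTML as downloaded, before any JavaScript runs.

```python
from html.parser import HTMLParser


class PagnLinkCounter(HTMLParser):
    """Counts <a> tags nested inside <div class="pagn">."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # > 0 while inside div.pagn (also tracks nested divs)
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            classes = (dict(attrs).get('class') or '').split()
            if self.depth or 'pagn' in classes:
                self.depth += 1
        elif tag == 'a' and self.depth:
            self.links += 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1


def count_pagn_links(html):
    parser = PagnLinkCounter()
    parser.feed(html)
    return parser.links


# Hypothetical markup: links already present in the server's HTML
static_html = '<div class="pagn"><a href="?start=20&total_record=14568">2</a></div>'
# Hypothetical markup: empty container that a script fills in later
dynamic_html = '<div class="pagn"></div><script>loadList();</script>'

print(count_pagn_links(static_html))   # 1
print(count_pagn_links(dynamic_html))  # 0
```

If the count is 0 for the downloaded HTML but the links are visible in the browser, the page is most likely filling them in with JavaScript.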

import re
import requests
from bs4 import BeautifulSoup

url = 'http://www.dooinauction.com/auction/ca_list.php' #The auction area

html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
tots = soup.select('div.pagn a')

results = [re.findall(r'total_record=([0-9]+)', link['href'])[0] for link in tots]

print(results)

['14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568']
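Since every pagination link carries the same value, one link is enough. As an alternative to the regular expression, the query string can be parsed with the standard library. This is only a sketch: the href below is a made-up example of the link format, and `total_record_from_href` is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs


def total_record_from_href(href):
    """Return the total_record query parameter as an int, or None if it is absent."""
    query = parse_qs(urlparse(href).query)
    values = query.get('total_record')
    return int(values[0]) if values else None


# Hypothetical href in the same style as the pagination links
href = 'ca_list.php?start=20&total_record=14568'
print(total_record_from_href(href))  # 14568
```

With the answer's code, this could be applied to the first link only, for example `total_record_from_href(tots[0]['href'])`, after checking that `tots` is not empty.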

The public sale page is the problem case. Its list is loaded separately, so you need to request the link below instead. It returns XML, which you can parse and use directly.

http://www.dooinauction.com/xml/pubauct_list.php?pdNo=&pdStatus=1&sdate=&edate=&g_sprice=0&g_eprice=0&ctgr1=0&ctgr2=0&l_sprice=0&l_eprice=0&sido=0&gugun=0&dong=0&ref_page=&ref_sido=&ref_gugun=&ref_dong=&decrease=0&order_type=0&list_scale=20&page_scale=10&start=0&total_record=0

import re
import requests
from bs4 import BeautifulSoup

url = 'http://www.dooinauction.com/xml/pubauct_list.php?pdNo=&pdStatus=1&sdate=&edate=&g_sprice=0&g_eprice=0&ctgr1=0&ctgr2=0&l_sprice=0&l_eprice=0&sido=0&gugun=0&dong=0&ref_page=&ref_sido=&ref_gugun=&ref_dong=&decrease=0&order_type=0&list_scale=20&page_scale=10&start=0&total_record=0' #public sale sector

html = requests.get(url).content
soup = BeautifulSoup(html, 'lxml-xml')

print(soup.find('total_record').text)
2153
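If lxml is not installed, the same value can probably be read with the standard library's ElementTree. This is a sketch under assumptions: the sample XML below is invented to show the idea (the real response's structure was not checked beyond the `total_record` tag used above), and `total_record_from_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def total_record_from_xml(xml_text):
    """Return the first <total_record> value as an int, or None if it is missing."""
    root = ET.fromstring(xml_text)
    # iter() also checks the root element itself, unlike find('.//total_record')
    node = next(root.iter('total_record'), None)
    if node is None or node.text is None:
        return None
    return int(node.text.strip())


# Made-up sample response; only the total_record tag name comes from the answer above
sample_xml = '<result><total_record>2153</total_record><item/></result>'
print(total_record_from_xml(sample_xml))  # 2153
```

With a live request, the text passed in would be something like `requests.get(url).content`.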


2022-09-21 23:21



© 2024 OneMinuteCode. All rights reserved.