(Beginner) My web crawler works on one page but not on another

I am studying web crawling. On the same website (Dooin Auction), I am trying to crawl two categories, "Auction" and "Public Sale". In the code below, line 3 is the auction URL and line 4 is the public sale URL. If I run it with the auction URL (line 3), tots gets a value. If I run it with the public sale URL (line 4), tots comes back empty. (Only one of lines 3 and 4 is active at a time.)

Looking at the HTML of both pages, 'div.pagn' appears exactly once in each.

Ultimately, I want to get the value marked with a box in the HTML code (14032 for the auction page, 2153 for the public sale page). Extracting a value from the middle of a string is not easy. I'm not sure whether this is okay to ask, but I've been at it for hours and can't figure it out.

import urllib.request
from bs4 import BeautifulSoup
url = 'http://www.dooinauction.com/auction/ca_list.php'  # auction page
url = 'http://www.dooinauction.com/pubauct/list.php'  # public sale page

req = urllib.request.Request(url)
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
tots = soup.select('div.pagn')
print('Test end')

Auction page HTML

Public sale page HTML

python

2022-09-21 23:21

2 Answers

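The reason the public sale page doesn't work is that it receives its data dynamically and renders it on the client side, so the HTML you download does not contain the values yet. The auction page already has them in its HTML: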
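One way to confirm this kind of diagnosis is to count the pagination links in the raw HTML before any JavaScript has run. The sketch below uses only the standard library, and both HTML strings are hypothetical stand-ins for the two downloaded pages, not the site's real markup.

```python
from html.parser import HTMLParser


class PagnLinkCounter(HTMLParser):
    """Counts <a> tags nested inside <div class="pagn">."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # <div> nesting depth inside div.pagn (0 means outside)
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            classes = (dict(attrs).get('class') or '').split()
            if self.depth or 'pagn' in classes:
                self.depth += 1
        elif tag == 'a' and self.depth:
            self.links += 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1


def count_pagn_links(html):
    parser = PagnLinkCounter()
    parser.feed(html)
    return parser.links


# Hypothetical markup: a server-rendered page and a client-rendered page.
static_page = '<div class="pagn"><a href="#1">1</a><a href="#2">2</a></div>'
dynamic_page = '<div class="pagn"></div><script>/* links added later */</script>'

print(count_pagn_links(static_page))   # 2
print(count_pagn_links(dynamic_page))  # 0
```

If the count is zero for a page that clearly shows pagination in the browser, the links are most likely filled in by a script after a separate request.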

import re
import requests
from bs4 import BeautifulSoup

url = 'http://www.dooinauction.com/auction/ca_list.php' #The auction area

html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
tots = soup.select('div.pagn a')

results = [re.findall(r'total_record=([0-9]+)', link['href'])[0] for link in tots]

print(results)

['14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568',
 '14568']
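As an alternative to the regular expression, the same value can be read by parsing the link's query string. This is only a sketch: the sample href is hypothetical, shaped like the links that the regex above matches, and the helper name is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def total_record_from_href(href):
    """Return the total_record query value as an int, or None if absent."""
    query = parse_qs(urlparse(href).query)
    values = query.get('total_record')
    return int(values[0]) if values else None


# Hypothetical href for illustration only.
sample = '/auction/ca_list.php?start=20&total_record=14568'

print(total_record_from_href(sample))                  # 14568
print(total_record_from_href('/auction/ca_list.php'))  # None
```

Unlike indexing `re.findall(...)[0]`, this returns None instead of raising an error when a link has no total_record parameter.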

The problem is the public sale page. For that one you need to use the link below instead. It returns XML, so you can parse that and use it.

http://www.dooinauction.com/xml/pubauct_list.php?pdNo=&pdStatus=1&sdate=&edate=&g_sprice=0&g_eprice=0&ctgr1=0&ctgr2=0&l_sprice=0&l_eprice=0&sido=0&gugun=0&dong=0&ref_page=&ref_sido=&ref_gugun=&ref_dong=&decrease=0&order_type=0&list_scale=20&page_scale=10&start=0&total_record=0

import requests
from bs4 import BeautifulSoup

url = 'http://www.dooinauction.com/xml/pubauct_list.php?pdNo=&pdStatus=1&sdate=&edate=&g_sprice=0&g_eprice=0&ctgr1=0&ctgr2=0&l_sprice=0&l_eprice=0&sido=0&gugun=0&dong=0&ref_page=&ref_sido=&ref_gugun=&ref_dong=&decrease=0&order_type=0&list_scale=20&page_scale=10&start=0&total_record=0' #public sale sector

html = requests.get(url).content
soup = BeautifulSoup(html, 'lxml-xml')

print(soup.find('total_record').text)
2153
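If lxml is not installed, the same lookup can probably be done with the standard library. In this sketch only the total_record element name comes from the answer above. The rest of the sample XML is invented for illustration, since the real response structure is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; the real XML layout may differ.
sample_xml = """<list>
  <total_record>2153</total_record>
  <item><name>example</name></item>
</list>"""


def read_total_record(xml_text):
    """Return the first <total_record> value as an int, or None if missing."""
    root = ET.fromstring(xml_text)
    node = next(root.iter('total_record'), None)  # searches root and descendants
    if node is None or node.text is None:
        return None
    return int(node.text.strip())


print(read_total_record(sample_xml))  # 2153
```

With a live request you would pass the downloaded response body to read_total_record instead of sample_xml.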

2022-09-21 23:21

In that link, the start parameter selects the page (list_scale=20 items per page):

Page 1 is start=0

Page 2 is start=20

Page 3 is start=40

and so on.
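The page offsets above can be sketched as a small helper that rewrites only the start parameter of the XML link. It assumes the server accepts the other parameters unchanged for every page, which the thread does not confirm. The function names are made up for illustration.

```python
from math import ceil
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# The XML link from the first answer, split across lines for readability.
XML_URL = (
    'http://www.dooinauction.com/xml/pubauct_list.php'
    '?pdNo=&pdStatus=1&sdate=&edate=&g_sprice=0&g_eprice=0'
    '&ctgr1=0&ctgr2=0&l_sprice=0&l_eprice=0&sido=0&gugun=0&dong=0'
    '&ref_page=&ref_sido=&ref_gugun=&ref_dong=&decrease=0&order_type=0'
    '&list_scale=20&page_scale=10&start=0&total_record=0'
)


def url_for_page(url, page):
    """Return url with start set for the given 1-based page number."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    list_scale = int(params['list_scale'])  # items per page
    params['start'] = str((page - 1) * list_scale)
    return urlunsplit(parts._replace(query=urlencode(params)))


def page_count(total_record, list_scale=20):
    """Number of pages needed to list total_record items."""
    return ceil(total_record / list_scale)


print(url_for_page(XML_URL, 3))  # same link, with start=40
print(page_count(2153))          # 108
```

Looping page from 1 to page_count(total) and requesting url_for_page(XML_URL, page) would walk through the whole list, assuming the server behaves as described.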

2022-09-21 23:21
