If you open URL1 from the code below, there is pagination at the bottom of the page. The listing at URL1 runs from page 1 to page 93, and I would like to collect the links shown on each of those pages.
URL1: https://www.jobplanet.co.kr/companies?sort_by=review_compensation_cache&industry_id=700&page=1
The links I need are the company pages that can be reached from each listing page, such as URL2. URL2: https://www.jobplanet.co.kr/companies/42216
Please help me collect the links to the company-specific pages, such as URL2, that are reachable from each listing page.
(It is enough to show how to collect the 5-digit company code at the end of URL2.)
Thank you in advance for your help!
from bs4 import BeautifulSoup
import csv
import os
import re
import requests
import json

# # jobplanet
BaseUrl = 'https://www.jobplanet.co.kr/companies?sort_by=review_compensation_cache&industry_id=700&page='
for i in range(1, 5, 1):
    url = BaseUrl + str(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text,'lxml')
    body = soup.select('#listCompanies > div > div.section_group > section:nth-child(1) > div > div > dl.content_col2_3.cominfo > dt > a')
    #print(body)
    linkUrl = []
    for item in body:
        link = item.get('href')
        linkUrl.append(link)
    print(linkUrl)
The industry group is selected by the industry_id value, so you can pass in whichever id you need.
import requests
import bs4


def industry_links(industry_id=700):
    company_type = 'https://www.jobplanet.co.kr/companies?sort_by=review_compensation_cache&industry_id={}&page={}'
    company_info = 'https://www.jobplanet.co.kr/companies/{}'

    def get_page_cnt():
        # Read the total company count from the first page; the division assumes 10 companies per page.
        contents = requests.get(company_type.format(industry_id, 1)).content.decode('utf-8')
        soup = bs4.BeautifulSoup(contents, 'html.parser')
        return int(int(soup.find('div', {'class':'result'}).find('span', {'class':'num'}).text) / 10 + 1)

    # Lazily fetch every listing page; page numbers start at 1.
    contents = (requests.get(company_type.format(industry_id, num + 1)).content.decode('utf-8') for num in range(get_page_cnt()))
    # Each company's button carries its id in the data-company_id attribute.
    return [company_info.format(button['data-company_id']) for content in contents
            for button in bs4.BeautifulSoup(content, 'html.parser').find_all('button', {'class': 'btn_heart1'})]


links = industry_links()
print(links)
['https://www.jobplanet.co.kr/companies/90364',
'https://www.jobplanet.co.kr/companies/309507',
'https://www.jobplanet.co.kr/companies/94877',
'https://www.jobplanet.co.kr/companies/52769',
'https://www.jobplanet.co.kr/companies/307694',
'https://www.jobplanet.co.kr/companies/20575',
'https://www.jobplanet.co.kr/companies/16738',
...
...
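The question also asks how to collect just the company code at the end of URL2. As a minimal sketch, assuming company links always contain /companies/<digits> the way URL2 does, the code can be pulled out of any href with a regular expression. The helper names (company_id_from_href, unique_ids, page_count) and the sample hrefs are hypothetical, and the figure of 10 companies per page is inferred from the division by 10 in the code above, not verified against the site.

```python
import math
import re
from typing import List, Optional

# Assumption: company links contain "/companies/<digits>", as URL2 does.
COMPANY_ID_RE = re.compile(r'/companies/(\d+)')


def company_id_from_href(href: str) -> Optional[str]:
    """Return the numeric company code found in a link, or None if there is none."""
    match = COMPANY_ID_RE.search(href)
    return match.group(1) if match else None


def unique_ids(hrefs: List[str]) -> List[str]:
    """Collect company codes in first-seen order, dropping duplicates."""
    seen = {}
    for href in hrefs:
        company_id = company_id_from_href(href)
        if company_id is not None:
            seen.setdefault(company_id, None)
    return list(seen)


def page_count(total_companies: int, per_page: int = 10) -> int:
    """Number of listing pages needed, rounded up (per_page=10 is an assumption)."""
    return math.ceil(total_companies / per_page)


if __name__ == "__main__":
    # Hypothetical hrefs: two company links (one repeated) and one listing link.
    sample = [
        'https://www.jobplanet.co.kr/companies/42216',
        '/companies/90364',
        '/companies?industry_id=700&page=2',
        '/companies/42216',
    ]
    print(unique_ids(sample))   # ['42216', '90364']
    print(page_count(930))      # 93
```

Rounding up with math.ceil gives 93 pages for a hypothetical total of 930 companies, whereas int(930 / 10 + 1) as written in the code above would request a 94th page. Whether that extra request matters depends on how the site responds to a page past the end, which is not confirmed here.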