Collect Python Scraping Destination Links

Asked 1 year ago, updated 1 year ago, 76 views

If you open URL1 below, there is pagination at the bottom; the listing runs from page 1 to page 93, and I would like to collect the links provided on each of those pages.

URL1: https://www.jobplanet.co.kr/companies?sort_by=review_compensation_cache&industry_id=700&page=1

The links I need to collect lead to the company pages that can be reached from each listing page, such as URL2. URL2: https://www.jobplanet.co.kr/companies/42216

Please help me collect the links to the company-specific pages (like URL2) that are reachable from each listing page.

(It would also be enough to show how to collect just the 5-digit corporate code at the end of URL2.)

I ask for your help!

from bs4 import BeautifulSoup
import csv
import os
import re
import requests
import json

# jobplanet company listing (industry_id=700), one page per request
BaseUrl = 'https://www.jobplanet.co.kr/companies?sort_by=review_compensation_cache&industry_id=700&page='

linkUrl = []  # initialise once, outside the loop, so links from every page are kept
for i in range(1, 5):
    url = BaseUrl + str(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # drop the :nth-child(1) restriction so every company section on the page matches,
    # not just the first one
    body = soup.select('#listCompanies > div > div.section_group > section > div > div > dl.content_col2_3.cominfo > dt > a')
    #print(body)

    for item in body:
        link = item.get('href')
        linkUrl.append(link)

print(linkUrl)
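
Since the goal is ultimately the numeric corporate code at the end of each company URL, a minimal sketch like the one below could reduce the collected hrefs to those codes. It assumes the hrefs look like /companies/42216 (possibly followed by a longer path), as in URL2; the helper name extract_company_ids is just illustrative.

import re

def extract_company_ids(hrefs):
    # Illustrative helper: pull the numeric code that follows /companies/ out of each href.
    ids = []
    for href in hrefs:
        m = re.search(r'/companies/(\d+)', href)
        if m:
            ids.append(m.group(1))
    return ids

# e.g. extract_company_ids(['/companies/42216']) -> ['42216']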

python scraping beautifulsoup urllib link

2022-09-22 19:14

1 Answer

Companies are grouped by the industry_id value, so plug in whichever industry_id you need.

import requests
import bs4

def industry_links(industry_id=700):
    company_type = 'https://www.jobplanet.co.kr/companies?sort_by=review_compensation_cache&industry_id={}&page={}'
    company_info = 'https://www.jobplanet.co.kr/companies/{}'

    def get_page_cnt():
        # the listing header (div.result > span.num) shows the total company count;
        # with 10 companies per page this gives the number of listing pages
        contents = requests.get(company_type.format(industry_id, 1)).content.decode('utf-8')
        soup = bs4.BeautifulSoup(contents, 'html.parser')
        return int(soup.find('div', {'class': 'result'}).find('span', {'class': 'num'}).text) // 10 + 1

    contents = (requests.get(company_type.format(industry_id, num + 1)).content.decode('utf-8')
                for num in range(get_page_cnt()))
    # each company card has a "like" button whose data-company_id attribute is the corporate code
    return [company_info.format(button['data-company_id'])
            for content in contents
            for button in bs4.BeautifulSoup(content, 'html.parser').find_all('button', {'class': 'btn_heart1'})]

links = industry_links()
print(links)

['https://www.jobplanet.co.kr/companies/90364',
 'https://www.jobplanet.co.kr/companies/309507',
 'https://www.jobplanet.co.kr/companies/94877',
 'https://www.jobplanet.co.kr/companies/52769',
 'https://www.jobplanet.co.kr/companies/307694',
 'https://www.jobplanet.co.kr/companies/20575',
 'https://www.jobplanet.co.kr/companies/16738',
...
...
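
If the links themselves matter less than the corporate codes at the end (as the question notes), or if the result should be saved for later use, a small follow-up along these lines could work. This is only a sketch, assuming the links list returned above; the file name company_links.csv is illustrative.

import csv

# assuming `links` is the list returned by industry_links() above;
# the file name company_links.csv is just an example
with open('company_links.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['company_url', 'company_code'])
    for link in links:
        writer.writerow([link, link.rstrip('/').rsplit('/', 1)[-1]])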


2022-09-22 19:14
