Beautiful Soup does not return the full HTML when crawling a web page

Asked 2 years ago, Updated 2 years ago, 137 views

While writing Python code to automatically download past questions from Ebsi, I fetched the web page and parsed it with Beautiful Soup, but the HTML I got back was shorter than the HTML shown on the site.


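    import requests
    from bs4 import BeautifulSoup as bs

    # Login form fields (placeholder credentials)
    login_info = {
        'userid': '00000000',
        'passwd': '00000000'
    }

    # A session keeps the login cookies for the request that follows
    with requests.Session() as s:
        login_req = s.post('https://www.ebsi.co.kr/ebs/pot/potl/SSOLoginSubmit.ebs', data=login_info)
        print(login_req.status_code)

        # Fetch the past paper list page
        page_req = s.get('http://www.ebsi.co.kr/ebs/xip/xipc/previousPaperList.ebs')
        html = page_req.text

        # Parse the downloaded HTML as-is (no scripts are executed)
        soup = bs(html, 'html.parser')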

When I look at the soup value this code produces, the part that should contain the content shown on the site comes out empty, like this:

    (omitted)
    </select>
    </span>
    </em>
    </h4>
    <div id="div_contentList"></div>
    </div>
    </div>
    </form>
    </div>
    (omitted)

(The full result is about 5,000 lines, so I uploaded only this capture.)

How can I fix this truncated output?

web-crawling beautifulsoup python

2022-09-22 18:32

1 Answer

A client-side script may be rendering the additional content after the page loads, in which case it does not appear in the HTML that requests downloads.
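One rough way to check for this is to look for containers that are empty in the downloaded HTML. The sketch below is only an illustration: `find_empty_containers` is a hypothetical helper, and `sample_html` is a shortened sample modeled on the output in the question, not the real page. An empty container is a hint, not proof, that a script fills it in later.

```python
from bs4 import BeautifulSoup


def find_empty_containers(html):
    """Return the ids of <div> elements that have an id but no content.

    Hypothetical helper: an empty container in the raw response suggests
    (but does not prove) that a script populates it after the page loads.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [
        div["id"]
        for div in soup.find_all("div", id=True)
        # Empty means: no visible text and no child tags
        if not div.get_text(strip=True) and div.find(True) is None
    ]


# Shortened sample modeled on the output shown in the question
sample_html = """
<form>
  <h4>title</h4>
  <div id="div_contentList"></div>
  <div id="div_filled"><p>static text</p></div>
</form>
"""

print(find_empty_containers(sample_html))
```

If `div_contentList` shows up in the result for the real page while the browser displays content there, the content is probably being added by JavaScript.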

In that case, I think you should use Selenium to render the page in a browser and crawl the rendered result, rather than crawling only the raw HTML.
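A minimal sketch of that approach follows, assuming Selenium 4 with Chrome installed. The function names `fetch_rendered_html` and `container_html` are hypothetical, and the sketch does not log in: the browser does not share cookies with a requests session, so the login step from the question would have to be repeated in the browser.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, container_id, timeout=10):
    """Load `url` in headless Chrome and return the HTML after scripts have run."""
    # Imported here so container_html() below can be used without Selenium
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the container has at least one child element;
        # raises TimeoutException if it stays empty
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, f"#{container_id} > *")
        )
        return driver.page_source
    finally:
        driver.quit()


def container_html(html, container_id):
    """Return the inner HTML of the element with the given id, or '' if absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(id=container_id)
    return "" if node is None else node.decode_contents().strip()
```

Calling `fetch_rendered_html` with the page URL from the question and `"div_contentList"`, then passing the result to `container_html`, should show whether the container is populated once the page has been rendered.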


2022-09-22 18:32



© 2024 OneMinuteCode. All rights reserved.