What should I do when a URL contains # instead of / when crawling with Python?

Asked 2 years ago, Updated 2 years ago, 93 views

Hello, everyone. I'm a beginner who has recently started web crawling with Python. ㅠㅠ

I'm learning bs4 and selenium and practicing crawling, and I learned HTML as well. It has been less than four weeks since I started, but the HTML on most sites is easy to crawl, and even logging in and then crawling is not that hard. On some sites, though, I noticed something different about the URL.

For example, the URL contained a #, like abc.com/abc/abc#bcd:100.

I tried to crawl this site the same way as the others, but I kept getting errors saying the element does not exist. While looking for the cause, I compared the HTML shown in Chromium with the HTML I received through selenium, and the selenium HTML was missing a lot of what Chromium showed.

The missing parts were exactly the HTML that changes whenever the part after # changes.

So, as always, I searched on Google, but useful information was hard to find. All I found was that it is related to jsp, that simple crawlers don't recognize it, that it is hard (impossible?) to crawl, and how to crawl #!

To summarize the question: is it possible to crawl a link that contains #? If so, how? Should I learn jsp? I'd appreciate an answer. ㅠㅠ

crawling crawler python hash selenium

2022-09-22 18:00

1 Answer

What the value after # means depends on how that page uses it.

The classic use is as an anchor that points to a heading within the page.

However, some SPA (single-page application) sites use the hash value to manage history, or as a custom page-state value.
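
To make the role of the hash concrete, here is a small sketch using Python's standard `urllib.parse`. The URL is a made-up one modelled on the example in the question. The part after # is the URL fragment; a browser keeps it on the client side and does not send it to the server, which is why the page's own scripts have to act on it.

```python
from urllib.parse import urlsplit, urldefrag

# Hypothetical URL modelled on the one in the question
url = "https://abc.com/abc/abc#bcd:100"

parts = urlsplit(url)
print(parts.path)      # /abc/abc
print(parts.fragment)  # bcd:100

# urldefrag splits off the fragment and returns the rest of the URL
base, fragment = urldefrag(url)
print(base)            # https://abc.com/abc/abc
```

In other words, two URLs that differ only after # normally fetch the same document from the server, and any difference in what you see comes from scripts running in the browser.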

In most cases like the one you describe, the page adds or rebuilds its HTML dynamically, through scripts that run after the page has loaded.

Static page crawling parses only the response returned by the server and does not see anything that happens after the page loads. That is why the crawled state differs from the state rendered in an actual browser.
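
The effect can be illustrated without any network access. The sketch below parses a made-up server response with the standard `html.parser` module; the markup, the `results` id, and the `LiCounter` class are all assumptions for illustration. The list is empty in the raw HTML, and the script that would fill it is never executed by a static parser.

```python
from html.parser import HTMLParser

# Hypothetical server response: the list is empty, a script would fill it in a browser
RAW_HTML = """
<html><body>
  <ul id="results"></ul>
  <script>
    var li = document.createElement("li");
    li.textContent = "row for " + location.hash;
    document.getElementById("results").appendChild(li);
  </script>
</body></html>
"""

class LiCounter(HTMLParser):
    """Counts <li> start tags that are present in the markup itself."""

    def __init__(self):
        super().__init__()
        self.items = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items += 1

counter = LiCounter()
counter.feed(RAW_HTML)
print(counter.items)  # 0, because the script is treated as text and never run
```

A real browser would run the script and show one list item, which matches the difference the question describes between the Chromium view and the crawled HTML.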

Typically, with selenium or a headless browser, you access the page, use sleep to wait a certain amount of time for these dynamic tasks to finish, and then parse the page to crawl the state after the dynamic change.

This isn't perfect either. If the dynamic work happens in a callback after an Ajax request to the server, and the server responds more slowly than the sleep time, the content is still missing when you parse. Take this into account when adjusting the waiting time.
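
One common way to reduce this risk, offered here as a suggestion rather than part of the approach above, is to poll for the content until it appears instead of sleeping for a fixed time. The sketch below is plain Python; `wait_until` and `FakePage` are hypothetical names, and `FakePage` only stands in for a page whose rows show up after a few checks.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.2):
    """Call condition() repeatedly until it returns a truthy value or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)

class FakePage:
    """Stand-in for a browser page: rows appear only from the third lookup on."""

    def __init__(self, ready_after):
        self.checks = 0
        self.ready_after = ready_after

    def find_rows(self):
        self.checks += 1
        return ["row 1", "row 2"] if self.checks >= self.ready_after else []

page = FakePage(ready_after=3)
rows = wait_until(page.find_rows, timeout=2.0, interval=0.01)
print(rows)  # ['row 1', 'row 2']
```

With selenium, the condition could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, "ul#results li")`, where the selector is an assumption about the target page. Selenium also ships `WebDriverWait` together with `expected_conditions` for the same purpose, so you would normally use those rather than a hand-written loop.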


2022-09-22 18:00



© 2024 OneMinuteCode. All rights reserved.