Please take a look at this web crawling error.

Hello, I am a student learning Java and web crawling.

I want to crawl a YouTube playlist and display it in a RecyclerView in my app. I wrote the code below, but nothing shows up in the RecyclerView, so the screen is just empty.

What is the problem? I would appreciate any advice from the experts.

@Override
protected Void doInBackground(Void... params) {
    try {
        // Fetch the playlist page and select every playlist entry element.
        Document doc = Jsoup.connect("https://www.youtube.com/playlist?list=PLOb0oDPP-6vb0PHm5B-XNEAkPIFkm5po7").get();
        Elements mElementDataSize = doc.select("div[class=style-scope ytd-playlist-video-list-renderer]")
                .select("ytd-playlist-video-renderer");

        // Read the title of each entry and add it to the list backing the RecyclerView.
        for (Element elem : mElementDataSize) {
            String my_title = elem.select("h3[class=style-scope ytd-playlist-video-renderer] span")
                    .text();

            itemData.add(new ItemData(my_title));
        }

        Log.d("debug :", "List " + mElementDataSize);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

▲ For the mElementDataSize selector, I referred to the source that appears when I press F12 on https://www.youtube.com/playlist?list=PLOb0oDPP-6vb0PHm5B-XNEAkPIFkm5po7.

▲ For the my_title (String) selector, I also referred to the source that appears when I press F12 on https://www.youtube.com/playlist?list=PLOb0oDPP-6vb0PHm5B-XNEAkPIFkm5po7.

java android crawler crawling

2022-09-20 22:24

1 Answer

Unfortunately, this page cannot be crawled the way you tried, which assumes it is a static page.

Documents that can be crawled that way look like this one:

view-source:https://hashcode.co.kr/ (paste it into the Chrome address bar)

That page arrives from the server already composed of HTML tags.

A page has to be like this before you can even attempt it. The response data you receive must already contain the markup, so that an XML or HTML parsing and selecting library can parse it properly.

Even then, if the information you want is fetched through additional asynchronous requests within the page, or is generated by JavaScript, it cannot be crawled this way. HTML parsing libraries cannot run JavaScript, so they never see the data that JavaScript inserts into the HTML after it runs.

https://www.youtube.com/playlist?list=PLOb0oDPP-6vb0PHm5B-XNEAkPIFkm5po7

The same goes for the page above. The HTML you see in F12 is the result of rendering after the page's JavaScript has fully executed.

Since you were looking only at the HTML shown in F12, it must have seemed strange that the Java crawling code returned no data.

But open the page, right-click, and choose "View page source" from the menu. This shows the response you actually receive from the server. If the tags, attributes, or elements you want to select are not visible there, you cannot crawl them this way.
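The same check can be done from code instead of the browser. Below is a minimal sketch, not a definitive implementation: the class name RawResponseCheck and both method names are made up for illustration, the body is assumed to be UTF-8, and the User-Agent value is arbitrary. It downloads the response exactly as the server sends it and counts how many opening tags with a given name it contains.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Jsoup or Android.
final class RawResponseCheck {

    /** Downloads the body as the server sends it. No JavaScript is executed. */
    static String fetchRaw(String pageUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0"); // arbitrary value
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            // Assumption: the response is UTF-8 encoded.
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    /** Counts opening tags with the given name, ignoring case. */
    static int countOpeningTags(String html, String tagName) {
        String haystack = html.toLowerCase(Locale.ROOT);
        String needle = "<" + tagName.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            int end = from + needle.length();
            // Only count a match when the tag name ends here, so that
            // "<div" does not match "<divider".
            if (end < haystack.length() && " \t\r\n>/".indexOf(haystack.charAt(end)) >= 0) {
                count++;
            }
            from = end;
        }
        return count;
    }
}
```

If countOpeningTags(fetchRaw(url), "ytd-playlist-video-renderer") returns 0, the element is not in the server response, and a selector for it will come back empty no matter how it is written. The network call must run off the main thread on Android.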

The only possible way is to crawl the result after all the JavaScript has run, using a headless browser. You can pick a headless browser and drive its API yourself, or use a more user-friendly tool such as Selenium or Puppeteer.

None of this should be done inside the app. I think the right approach is to run it on a remote server with a virtual browsing environment, have the app receive the crawling result over the network, and display it.
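On the app side, that split leaves very little work. Here is a rough sketch under a loud assumption: the remote server is hypothetical, and I am assuming it returns the crawled titles as plain text, one title per line. The class name CrawlResultParser and the method parseTitles are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical app-side helper; the response format is an assumption.
final class CrawlResultParser {

    /** Turns a "one title per line" response body into a list of titles. */
    static List<String> parseTitles(String responseBody) {
        List<String> titles = new ArrayList<>();
        if (responseBody == null) {
            return titles;
        }
        // Accept both Unix and Windows line endings.
        for (String line : responseBody.split("\r?\n")) {
            String title = line.trim();
            if (!title.isEmpty()) { // skip blank lines
                titles.add(title);
            }
        }
        return titles;
    }
}
```

Each parsed title could then go through itemData.add(new ItemData(title)) as in the question, followed by notifying the adapter on the main thread.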

Another possibility is to use the Android WebView.

The WebView is itself a full browser, so it can take the place of the headless browser.

I'm not sure about this approach, so I'm linking you to the traces of my trial and error on StackOverflow instead.
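To give a rough idea of the WebView route, here is an untested sketch of only the pure-Java part: building the JavaScript that would run inside the rendered page. The class name WebViewTitleScript and the selector passed to it are assumptions. The Android calls appear only in comments, because they need the Android SDK to compile.

```java
// Hypothetical helper; only builds the script string, it does not touch WebView.
final class WebViewTitleScript {

    /**
     * Builds a script that collects the trimmed text of every element
     * matching cssSelector and returns it as a JavaScript array.
     * Only backslashes and single quotes in the selector are escaped.
     */
    static String build(String cssSelector) {
        String escaped = cssSelector.replace("\\", "\\\\").replace("'", "\\'");
        return "(function(){"
                + "var nodes=document.querySelectorAll('" + escaped + "');"
                + "var out=[];"
                + "for(var i=0;i<nodes.length;i++){out.push(nodes[i].textContent.trim());}"
                + "return out;"
                + "})()";
    }

    // Possible use on Android (assumption, not compiled here):
    //   webView.getSettings().setJavaScriptEnabled(true);
    //   webView.loadUrl(playlistUrl);
    //   // later, e.g. from WebViewClient.onPageFinished:
    //   webView.evaluateJavascript(
    //           WebViewTitleScript.build("ytd-playlist-video-renderer h3"),
    //           json -> { /* json holds the result as a JSON-encoded string */ });
}
```

The selector "ytd-playlist-video-renderer h3" is taken from the element names in the question and may not match the live page. The page may also keep loading entries after onPageFinished fires, so the script might need to run again after a delay.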

P.S. If you know a better way, please leave a comment.

2022-09-20 22:24
