Hello, I'm a student learning Java and web crawling.
I'm trying to crawl a YouTube playlist page and display it in a RecyclerView in my app. I wrote the code below, but nothing appears in the RecyclerView; the screen stays empty.
What is the problem? I'd appreciate some advice from the experts.
@Override
protected Void doInBackground(Void... params) {
    try {
        Document doc = Jsoup.connect("https://www.youtube.com/playlist?list=PLOb0oDPP-6vb0PHm5B-XNEAkPIFkm5po7").get();
        Elements mElementDataSize = doc.select("div[class=style-scope ytd-playlist-video-list-renderer]")
                .select("ytd-playlist-video-renderer");
        for (Element elem : mElementDataSize) {
            String my_title = elem.select("h3[class=style-scope ytd-playlist-video-renderer] span")
                    .text();
            itemData.add(new ItemData(my_title));
        }
        Log.d("debug :", "List " + mElementDataSize);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
Unfortunately, the page you tried cannot be crawled as if it were a static page.
Pages that can be crawled that way look like this:
view-source:https://hashcode.co.kr/ (paste it into the Chrome address bar).
That page's server response is already composed of HTML tags containing the content.
The response has to arrive in that form before an XML/HTML parsing and selecting library can parse it properly.
Even then, if the information you want is fetched through additional asynchronous requests within the page or generated by JavaScript, it cannot be crawled. HTML parsing libraries cannot run JavaScript, so they never see data that only appears in the HTML after scripts have run and the page has been re-rendered.
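To make this concrete, here is a minimal sketch (assuming the jsoup library is on the classpath, as in your code) showing that an HTML parser only sees the markup exactly as the server delivered it. A script that would inject an element at runtime is stored as text, never executed:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticParseDemo {
    public static void main(String[] args) {
        // HTML as a server might deliver it: one static element, plus a
        // script that would create another element if a browser ran it.
        String html = "<html><body>"
                + "<p id=\"static\">I am in the server response</p>"
                + "<script>"
                + "var d = document.createElement('div');"
                + "d.id = 'injected';"
                + "document.body.appendChild(d);"
                + "</script>"
                + "</body></html>";

        Document doc = Jsoup.parse(html);

        // The static element is found in the parsed document...
        System.out.println(doc.select("p#static").size());     // 1
        // ...but jsoup never runs the script, so the injected
        // div simply does not exist in the DOM it builds.
        System.out.println(doc.select("div#injected").size()); // 0
    }
}
```

This is exactly what happens with the YouTube playlist page: the elements your selectors target are created by scripts, so jsoup's document never contains them.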
https://www.youtube.com/playlist?list=PLOb0oDPP-6vb0PHm5B-XNEAkPIFkm5po7
The same applies to this page. The HTML you are looking at in the browser is the result of rendering after JavaScript has fully executed.
If you only look at the HTML shown in the F12 developer tools, it will naturally seem strange that your Java crawling code returns no data.
But open the page, right-click, and choose "View page source" from the menu. That shows the response you actually receive from the server. If the tags, attributes, or components you want to select are not in it, you cannot crawl them.
The only workable approach is this:
You can crawl the result of full JavaScript execution through a headless browser. You can pick a headless browser and drive its API yourself, or use a friendlier tool such as Selenium or Puppeteer.
Rather than doing all of this inside the app, I think the right design is to run the crawl on a remote server with a virtual browser environment and have the app receive the crawling result over the network and display it.
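As a rough sketch of the headless-browser route (assuming the Selenium Java bindings and a matching ChromeDriver are installed on the server; the CSS selector here is an illustrative guess, not verified against YouTube's current markup):

```java
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class PlaylistCrawler {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // run Chrome without a visible window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://www.youtube.com/playlist?list=PLOb0oDPP-6vb0PHm5B-XNEAkPIFkm5po7");
            // After the page loads, the driver's DOM reflects the
            // JavaScript-rendered result, so selectors like the ones in
            // the question can now actually match elements.
            // (Selector is a hypothetical example.)
            List<WebElement> titles =
                    driver.findElements(By.cssSelector("ytd-playlist-video-renderer #video-title"));
            for (WebElement title : titles) {
                System.out.println(title.getText());
            }
        } finally {
            driver.quit(); // always shut the browser down
        }
    }
}
```

Your server would run something like this and return the extracted titles (e.g. as JSON) for the app to load into the RecyclerView.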
Another possibility is to use an Android WebView. A WebView is itself already a full browser, so it can take the place of a headless browser.
I'm not sure about this approach myself, so instead I'll link you to my trial-and-error notes on StackOverflow.
P.S. If you know a better way, please leave a comment.
© 2024 OneMinuteCode. All rights reserved.