extract elements from HTML

Asked 2 years ago, Updated 2 years ago, 121 views

I was able to fetch a site's HTML in Swift and even print it as a String, but I don't know how to extract elements from it.
For example, given the Apple home page, I want to get text such as "All-screen design, the longest battery life on the iPhone."
In other words, I want to remove the unnecessary information from the HTML.
I think deleting everything inside <> would do it, but when I searched for how to do that, I found nothing.
Please let me know if anyone knows how.

Note: Should I use a parser after all?

*This is my first post and I don't understand this well yet, so please let me know even if your approach is a different one.

swift html xcode

2022-09-30 21:40

1 Answer

If you want to retrieve the text that an HTML source displays as a string, you can use NSAttributedString.

The following example works in a playground on macOS 10.14.4 with Xcode 10.2.

import Cocoa

let html = """
<html>
    <title>test title</title>
    <body>
        <p>This is a test</p>
        <p>This is a test two</p>
    </body>
</html>
"""

if let htmlData = html.data(using: String.Encoding.utf8) {
    do {
        // The initializer throws instead of returning nil, so a plain `let` with `try` is enough.
        let attributedHTMLStr: NSAttributedString = try NSAttributedString(data: htmlData, options: [NSAttributedString.DocumentReadingOptionKey.documentType: NSAttributedString.DocumentType.html], documentAttributes: nil)
        print(attributedHTMLStr.string) // attributedHTMLStr.string is the text you want to retrieve.
    } catch let error {
        print(error.localizedDescription) // If the conversion throws, print a description of the error.
    }
}
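
If the conversion is needed in more than one place, the example above can be wrapped in a small function. This is only a sketch: the name plainText(fromHTML:) is hypothetical and not part of the original answer, and the .characterEncoding option is an addition that assumes the input is UTF-8 text. The HTML importer is documented as something to call from the main thread, so the sketch assumes it runs there.

```swift
import Cocoa

/// Hypothetical helper: returns the visible text of an HTML string,
/// or nil if the string cannot be encoded or the conversion throws.
func plainText(fromHTML html: String) -> String? {
    guard let data = html.data(using: .utf8) else { return nil }

    // Tell the importer the data is HTML and (assumption) UTF-8 encoded.
    let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
        .documentType: NSAttributedString.DocumentType.html,
        .characterEncoding: String.Encoding.utf8.rawValue
    ]

    // `try?` turns a thrown error into nil instead of propagating it.
    return try? NSAttributedString(data: data,
                                   options: options,
                                   documentAttributes: nil).string
}

if let text = plainText(fromHTML: "<p>This is a test</p>") {
    print(text)
}
```

Returning an optional keeps the call site short. If you need to know why a conversion failed, keep the do/catch form from the example above instead.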

The string property of an NSAttributedString instance returns its plain-text contents.
In this example the NSAttributedString instance is attributedHTMLStr, so

let str: String = attributedHTMLStr.string

gives you a variable str that holds the text with the tags and formatting removed from the HTML source.
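
For comparison, the "delete everything inside <>" idea from the question can be sketched with a regular expression. This is a naive illustration rather than a recommended method: stripTags is a hypothetical name, the pattern simply removes anything between < and >, it leaves entities such as &amp; untouched, and it keeps the contents of <script> and <style> elements. For real pages, NSAttributedString or a proper HTML parser is likely to be more robust.

```swift
import Foundation

/// Hypothetical helper: naively removes anything that looks like an HTML tag.
func stripTags(_ html: String) -> String {
    // "<[^>]+>" matches "<", then one or more characters that are not ">", then ">".
    let withoutTags = html.replacingOccurrences(of: "<[^>]+>",
                                                with: "",
                                                options: .regularExpression)
    // Drop leading and trailing whitespace left behind by removed tags.
    return withoutTags.trimmingCharacters(in: .whitespacesAndNewlines)
}

print(stripTags("<p>This is a <b>test</b></p>")) // prints "This is a test"
```

This version needs only Foundation, so it also works where Cocoa is not available, but it treats HTML as plain text and has no understanding of the document structure.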


2022-09-30 21:40



© 2024 OneMinuteCode. All rights reserved.