I want to parse a document and save only the content I need as a separate file.
For PDF there is the PDFBox library, so I am using it.
The problem is that I don't know how to read the document and tokenize it.
Text can be read as a string and processed in plain Java,
but because pictures, boxes, and text all arrive together, I have to tokenize and process them myself.
It's not easy.
Is there a way to parse and tokenize documents in other formats too, such as Hancom documents, and not only PDF?
pdf pdfbox
If you only want to extract a specific part of a document, there is a way to do it without tokenization.
You can define a rule (a pattern) and use it to pull the matching parts out of the large string of extracted text.
Of course, finding and writing such rules is not easy.
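As a rough illustration of the rule-based approach, here is a minimal sketch that assumes the document text has already been read into a single Java string (for a PDF, for example, via PDFBox's PDFTextStripper). The class name RuleExtractor, the "Label: value" line layout, and the sample text are all assumptions made up for this example. Your own rule would depend on how your documents are actually laid out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls values out of already-extracted text by rule.
class RuleExtractor {

    // Assumed rule: a line that starts with the label, then a colon,
    // then the value we want to keep (the rest of that line).
    static List<String> extract(String text, String label) {
        Pattern rule = Pattern.compile(
                "^" + Pattern.quote(label) + "[ \\t]*:[ \\t]*(.+)$",
                Pattern.MULTILINE);
        Matcher m = rule.matcher(text);
        List<String> values = new ArrayList<>();
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // Sample text standing in for the string read from a document.
        String text = "Title: Quarterly report\n"
                + "Author: Kim\n"
                + "Some body text...\n"
                + "Author: Lee\n";
        System.out.println(extract(text, "Author"));
    }
}
```

This only works on the text layer. Pictures and boxes are not part of the string, so they would still need separate handling.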