I want to parse a document and save only the content I need as a separate file.
For PDFs there is the PDFBox library, so I am using that.
The problem is that I don't know how to read the document and tokenize it.
Plain text can be read as a string and processed in Java itself.
But pictures, boxes, and text all arrive together, so I have to tokenize and process them myself.
That is not easy.
Is there a way to parse and tokenize documents in formats other than PDF, such as Hancom documents?
pdf pdfbox
If you only want to extract a specific part of a document, there is a way to do it without tokenization.
You can treat the extracted text as one large string and pull out the part you want by applying a rule to it.
Of course, finding and writing such rules is not easy.
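As a rough illustration of the rule-based approach, here is a minimal sketch in Java. It assumes the document text has already been extracted into a single string (for a PDF, for example, with PDFBox's `PDFTextStripper.getText`). The class `RuleExtractor`, the method `extractAfterLabel`, and the "Label: value" rule are hypothetical, chosen only for this example; a real document will need its own rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for this sketch, not part of PDFBox.
final class RuleExtractor {

    // Assumed rule: the wanted value follows "<label>:" and runs to the end of that line.
    static List<String> extractAfterLabel(String text, String label) {
        Pattern rule = Pattern.compile(
                "^\\s*" + Pattern.quote(label) + "\\s*:\\s*(.+?)\\s*$",
                Pattern.MULTILINE);
        List<String> values = new ArrayList<>();
        Matcher m = rule.matcher(text);
        while (m.find()) {
            values.add(m.group(1)); // keep only the captured value, not the label
        }
        return values;
    }

    public static void main(String[] args) {
        // Stand-in for text already extracted from a document.
        String text = "Invoice\nName: Kim\nTotal: 1200\nName: Lee\n";
        System.out.println(extractAfterLabel(text, "Name"));
    }
}
```

This only works when the wanted part follows a predictable textual pattern. Pictures and boxes are not covered by it and would still need format-specific handling.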