Loading, parsing, and tokenizing documents?

Asked 2 years ago, Updated 2 years ago, 71 views

I want to parse a document and write only the content I need to a separate file.

For PDF files there is the PDFBox library, so I am using it.

The problem is that I don't know how to read the document and tokenize it.

Plain text can be read as a string and processed directly in Java.

But because I receive images, boxes, and text all at once, I have to tokenize and process them myself.

It's not easy.

Is there also a way to parse and tokenize non-PDF documents, such as Hancom (HWP) files?

pdf pdfbox

2022-09-22 16:45

1 Answer

If you only need to extract specific parts of a particular document, there is a way to do it without tokenizing.

You can treat the document text as one large string and pull out the parts you need by applying rules (patterns) to it.

Of course, finding and writing those rules is not easy.
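
As a rough illustration of the rule-based approach, here is a minimal Java sketch. It assumes the document text has already been extracted into a single string (with PDFBox this is typically done with PDFTextStripper), and that the parts you want appear on lines shaped like "Label: value". Both that layout and the RuleExtractor class are assumptions made for this example, not something taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the values of "Label: value" lines from text
// that was already extracted from a document as one string.
final class RuleExtractor {

    private RuleExtractor() {}

    // Returns the trimmed value of every line that starts with the given
    // label followed by a colon, in the order the lines appear.
    static List<String> extract(String text, String label) {
        // Pattern.quote keeps any regex metacharacters in the label literal.
        // MULTILINE makes ^ and $ match at the start and end of each line.
        // [ \t]* is used instead of \s* so a match never spills onto the next line.
        Pattern rule = Pattern.compile(
                "^[ \\t]*" + Pattern.quote(label) + "[ \\t]*:[ \\t]*(.*?)[ \\t]*$",
                Pattern.MULTILINE);

        List<String> values = new ArrayList<>();
        Matcher m = rule.matcher(text);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }
}
```

Calling RuleExtractor.extract(text, "Name") returns every value found on a "Name:" line and ignores all other lines. Real documents rarely have a layout this regular, so in practice you would need one such rule per kind of content you want to keep.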


2022-09-22 16:45


