Loading, parsing, and tokenizing documents?

I want to parse a document and write only the content I need to a separate file.

For PDF files there is the PDFBox library, so I am using that.

The problem is that I don't know how to read the document and tokenize it.

Plain text can be read as a string and processed directly in Java.

A document, however, gives me images, boxes, and text all at once, so I have to tokenize and process them myself.

It's not easy.

Is there a way to parse and tokenize documents that are not PDFs, such as Hancom documents?

2022-09-22 16:45

1 Answer

If you only want to extract a specific part of a document, you can do that without tokenizing it.
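
As a minimal sketch of one such approach (not necessarily the one meant above): assuming the document's text has already been pulled into a String, for example with PDFBox's PDFTextStripper, the wanted part can be cut out by searching for a start marker and an end marker. The class name SectionExtractor, the method extractBetween, and the sample markers are hypothetical and only for illustration.

```java
import java.util.Optional;

final class SectionExtractor {

    // Returns the trimmed text found between startMarker and endMarker,
    // or Optional.empty() if the markers do not occur in that order.
    static Optional<String> extractBetween(String text, String startMarker, String endMarker) {
        int start = text.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        int from = start + startMarker.length();
        int end = text.indexOf(endMarker, from);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(from, end).trim());
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for text already extracted from a document.
        String text = String.join("\n",
                "Report", "[Summary]", "Only this part is needed.", "[Appendix]", "Ignored.");
        extractBetween(text, "[Summary]", "[Appendix]").ifPresent(System.out::println);
    }
}
```

This only works when the wanted part is bounded by recognizable text. Images and boxes are not covered by this sketch.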
