Apache Tika produces repeated characters when extracting PDF body text

I'm parsing PDFs in Java with Apache Tika.
Most PDFs load successfully, but some either cause parsing errors or,
even when they can be parsed, produce body text in which the same characters repeat over and over.

What I would like to ask about is the cause of, and countermeasures for, the repeated characters in the body.
Below is an excerpt of one of the recurring patterns from the long parsed output.

(1)(1)(1)(1)(1)(1), Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind,

Probably this happens when Tika extracts the part of the PDF where "(1) DB for Firin Volcanic" is written.
PDF comments? Accessibility data? I can't see it when I open the file normally, but
I wonder whether Tika's parse pulled out something embedded in the PDF. (This is only a guess.)

Source:

InputStream input = new FileInputStream("sample.pdf");
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
new PDFParser().parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);

Are there any known causes and countermeasures (for example, a Tika setting)?

■ After trying the answer I received
I tried the following, but it did not get rid of the repeated text.
As I suspected, I think the config has to be passed in somewhere, for example around parser.parse().
So far I can't find any examples on Japanese sites, so I have been looking for sample code on overseas sites, but I haven't found any there either.

If anyone knows how to solve this, any hint would be welcome.
I look forward to hearing from you.

PDFParser parser = new PDFParser();
PDFParserConfig config = new PDFParserConfig();

// Ignore duplicated overlapping characters (e.g. bold text rendered by drawing the same characters on top of each other)
config.setSuppressDuplicateOverlappingText(true);

// Do not extract text from annotations (underline markup, etc.)
config.setExtractAnnotationText(false);

parser.parse(input, handler, metadata, new ParseContext());

java

2022-09-30 21:18

1 Answer

You can apply the PDF parsing configuration to the PDFParser class as follows:

PDFParser parser = new PDFParser();
PDFParserConfig config = new PDFParserConfig();

// Ignore duplicated overlapping characters (e.g. bold text rendered by drawing the same characters on top of each other)
config.setSuppressDuplicateOverlappingText(true);

// Do not extract text from annotations (underline markup, etc.)
config.setExtractAnnotationText(false);

// Apply the configuration to the parser
parser.setPDFParserConfig(config);

parser.parse(input, handler, metadata, new ParseContext());

I don't know the state of the file you are asking about, so I can't give a precise answer, but why not try changing the parsing options as above?
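
If the repeats still survive the parser options, one possible fallback is to clean up the extracted string afterwards. This is not something proposed in the thread, only a minimal sketch using the standard library: the class name RepeatCollapser and the method collapseRepeats are hypothetical, and it assumes the repeated items are separated by a known delimiter (a comma in the excerpt above). It is lossy, because legitimate consecutive repeats in the document would be collapsed as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache Tika.
public class RepeatCollapser {

    // Splits the text on a literal delimiter, trims each piece, drops empty
    // pieces, and keeps a piece only if it differs from the previously kept one.
    public static String collapseRepeats(String text, String delimiter) {
        String[] parts = text.split(Pattern.quote(delimiter));
        List<String> kept = new ArrayList<>();
        String previous = null;
        for (String part : parts) {
            String token = part.trim();
            if (token.isEmpty() || token.equals(previous)) {
                continue; // skip blanks and consecutive duplicates
            }
            kept.add(token);
            previous = token;
        }
        // Re-join the surviving tokens with the delimiter and a single space.
        return String.join(delimiter + " ", kept);
    }

    public static void main(String[] args) {
        String extracted = "Wind, Wind, Wind, Wind, Rain, Wind,";
        System.out.println(collapseRepeats(extracted, ","));
        // prints: Wind, Rain, Wind
    }
}
```

Only consecutive duplicates are removed, so a token that legitimately appears again later in the text is kept. Runs with no delimiter at all, such as the "(1)(1)(1)" part of the excerpt, are not handled by this sketch.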

2022-09-30 21:18
