Apache Tika Outputs Repeated Characters in PDF Body Text

Asked 1 year ago, Updated 1 year ago, 31 views

I'm parsing PDFs in Java with Apache Tika.
Most PDFs load successfully, but some cause parsing errors, and
even when a file can be parsed, characters in the body text are repeated over and over.

What I would like to ask about is the cause of, and countermeasures for, these repeated characters in the body text.
Below is an excerpt showing one of the similar patterns from the long parsed output.

(1)(1)(1)(1)(1)(1), Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind,

"Probably, when Tika took out the part where ""(1) DB for Firin Volcanic"" is written in the PDF,
" PDF comments? Accessibility? I can't see it when I open it normally, but
I'm wondering if Tika took out the embedded in the PDF with Perth.(Imagination)

Source:

InputStream input = new FileInputStream("sample.pdf");
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
new PDFParser().parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);

What could be the cause, and is there a countermeasure (a Tika setting, for example)?

■After trying the answer I received
I tried the following, but the repeated text is still there.
I suspect the config variable has to be set somewhere, such as on parser.parse().
So far I can't find any examples on Japanese sites, so I'm looking for sample code on overseas sites, but I haven't found any.

If anyone knows how to solve this, even a suggestion to try would be welcome.
I look forward to hearing from you.

PDFParser parser = new PDFParser();
PDFParserConfig config = new PDFParserConfig();

// Suppress duplicate text drawn on top of itself (e.g. overlapping characters used to fake bold)
config.setSuppressDuplicateOverlappingText(true);

// Do not extract text from annotations (underlines, etc.)
config.setExtractAnnotationText(false);

parser.parse(input, handler, metadata, new ParseContext());

java

2022-09-30 21:18

1 Answer

You can apply the PDF parsing configuration to the PDFParser instance as follows:

PDFParser parser = new PDFParser();
PDFParserConfig config = new PDFParserConfig();

// Suppress duplicate text drawn on top of itself (e.g. overlapping characters used to fake bold)
config.setSuppressDuplicateOverlappingText(true);

// Do not extract text from annotations (underlines, etc.)
config.setExtractAnnotationText(false);

// Apply the config to the parser
parser.setPDFParserConfig(config);

parser.parse(input, handler, metadata, new ParseContext());

I don't know the state of the file you asked about, so I can't give an exact answer, but why not try changing the parsing options as shown above?
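
If the repetition is still there after changing the parser options, one possible fallback is to clean up the extracted string afterwards. The following is only a rough sketch under my own assumptions, not a Tika feature: `RepeatCollapser` and `collapseRepeats` are hypothetical names, and the thresholds (a unit of 2 or more characters repeated 3 or more times inside a token, plus identical neighbouring tokens) are guesses based on the sample in the question.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Apache Tika: post-processes extracted text.
final class RepeatCollapser {
    // A unit of 2+ characters that occurs 3 or more times in a row inside one token.
    private static final Pattern INNER_RUN = Pattern.compile("(.{2,}?)\\1{2,}");

    // Collapses repeats line by line so that line breaks are preserved.
    static String collapseRepeats(String text) {
        return Arrays.stream(text.split("\n", -1))
                .map(RepeatCollapser::collapseLine)
                .collect(Collectors.joining("\n"));
    }

    private static String collapseLine(String line) {
        StringBuilder out = new StringBuilder();
        String previous = null;
        for (String token : line.trim().split("\\s+")) {
            // "(1)(1)(1)," becomes "(1),"
            String collapsed = INNER_RUN.matcher(token).replaceAll("$1");
            // Skip empty tokens and tokens identical to the one just kept.
            if (collapsed.isEmpty() || collapsed.equals(previous)) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(collapsed);
            previous = collapsed;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseRepeats("(1)(1)(1), Wind, Wind, Wind,"));
        // prints "(1), Wind,"
    }
}
```

You would call it on the parsed result, for example `RepeatCollapser.collapseRepeats(handler.toString())`. Be aware that it also removes legitimate repetitions (such as the same word written twice in a row) and normalizes whitespace within each line, so treat it as a last resort rather than a fix for the underlying cause.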


2022-09-30 21:18
