Repeated characters in text extracted from a PDF with Apache Tika

Asked 1 year ago, Updated 1 year ago, 36 views

I am using Apache Tika from Java to load a PDF.

When I open the PDF in a viewer, the text is displayed like this:
⇒ (1) DB for Firin Volcanic Mountain (bold, underlined)

However, when I extract the text with Tika, I get output like the following:
⇒ (1) (1) (1) (1) (1) (1) (1) Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind

The same characters in the sentence are repeated over and over.

The other day I was advised to use the following Tika setting:
config.setSuppressDuplicateOverlappingText(true);
but I don't know how to associate it with the parser.
(There seem to be no examples on Japanese sites, so I have been searching overseas sites, but I have not found any.)

How should I change my code?

File document = new File(strFile_path);

Parser parser = new AutoDetectParser();
// PDFParser parser = new PDFParser();
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();

PDFParserConfig config = new PDFParserConfig();
// Drop duplicate overlapping characters (e.g. bold text rendered by drawing the same characters on top of each other)
config.setSuppressDuplicateOverlappingText(true);
// Do not extract annotation text (intended to ignore underlines, etc.)
config.setExtractAnnotationText(false);

ParseContext context = new ParseContext();
context.set (PDFParserConfig.class, new PDFParserConfig());

try {
    // Run the parse.
    parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
}
catch (FileNotFoundException e) {
   :
}
   :
catch (Exception e) {
}

// Print the extracted PDF text
System.out.println("handler:["+handler.toString()+"]");

java

2022-09-30 21:17

1 Answer

Shouldn't this line:

context.set (PDFParserConfig.class, new PDFParserConfig());

be the following, so that the config object you set up is the one that gets registered?

context.set(PDFParserConfig.class, config);
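To illustrate why registering `new PDFParserConfig()` would lose the settings, here is a toy model of a class-keyed context. It is only a sketch: `ToyConfig`, `ToyContext`, and `effectiveFlag` are invented stand-ins, not Tika classes, and the real `ParseContext` may differ in its details.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-ins, NOT Tika classes: every name here is hypothetical.
final class ContextLookupSketch {

    /** Hypothetical settings object with one flag that defaults to false. */
    static final class ToyConfig {
        boolean suppressDuplicates = false;
    }

    /** Hypothetical class-keyed context: at most one object per type. */
    static final class ToyContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** What a parser would see: the registered config, or a default one. */
    static boolean effectiveFlag(ToyContext context) {
        return context.get(ToyConfig.class, new ToyConfig()).suppressDuplicates;
    }

    public static void main(String[] args) {
        ToyConfig config = new ToyConfig();
        config.suppressDuplicates = true;

        ToyContext fresh = new ToyContext();
        fresh.set(ToyConfig.class, new ToyConfig()); // brand-new object: the flag set on config is lost

        ToyContext fixed = new ToyContext();
        fixed.set(ToyConfig.class, config);          // the configured object is registered

        System.out.println(effectiveFlag(fresh));            // prints false
        System.out.println(effectiveFlag(fixed));            // prints true
        System.out.println(effectiveFlag(new ToyContext())); // prints false (empty context, default used)
    }
}
```

In this model, the settings only take effect when the same object that was configured is both registered in the context and handed to the parser.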

Likewise, this:

try {
    // Run the parse.
    parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());

should pass the context you prepared instead of a new, empty ParseContext, shouldn't it?

try {
    // Run the parse.
    parser.parse(new FileInputStream(document), handler, metadata, context);
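Putting both changes together, the question's code might look roughly like the sketch below. It assumes the Tika core and PDF parser modules are on the classpath and that the PDF path is passed as the first command-line argument (the class name `TikaPdfExtract` and the use of `args[0]` are placeholders, not part of the original code). I have not checked it against your PDF.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class TikaPdfExtract {
    public static void main(String[] args) throws Exception {
        // Placeholder: the PDF path is taken from the command line.
        File document = new File(args[0]);

        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
        Metadata metadata = new Metadata();

        // Configure PDF-specific behaviour.
        PDFParserConfig config = new PDFParserConfig();
        config.setSuppressDuplicateOverlappingText(true);
        config.setExtractAnnotationText(false);

        // Register the configured object itself, not a new PDFParserConfig.
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);

        // Pass the same context to parse(); try-with-resources closes the stream.
        try (InputStream in = new FileInputStream(document)) {
            parser.parse(in, handler, metadata, context);
        }

        System.out.println("handler:[" + handler + "]");
    }
}
```

Whether the repeated characters actually disappear still depends on how the PDF draws its bold text, so treat this as a starting point.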

I haven't used these settings much myself, though...
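For background on what the setting is meant to address: as the comment in the question says, some PDFs express bold by drawing the same characters on top of each other, so a naive extractor emits each character several times. The toy sketch below shows the general idea of dropping such duplicates by position. It is not how Tika or PDFBox implement it; `Glyph`, `extract`, and `tolerance` are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only; all names are hypothetical.
final class OverlapSuppressionSketch {

    /** A character together with the position where it was drawn. */
    static final class Glyph {
        final char ch;
        final float x;
        final float y;

        Glyph(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Skips a glyph if the same character was already kept within `tolerance` units. */
    static String extract(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        StringBuilder out = new StringBuilder();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.ch == g.ch
                        && Math.abs(k.x - g.x) <= tolerance
                        && Math.abs(k.y - g.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
                out.append(g.ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "DB" drawn twice with a small horizontal offset to fake bold.
        List<Glyph> glyphs = List.of(
                new Glyph('D', 10.0f, 50.0f), new Glyph('D', 10.3f, 50.0f),
                new Glyph('B', 18.0f, 50.0f), new Glyph('B', 18.3f, 50.0f));

        System.out.println(extract(glyphs, 0.0f)); // prints DDBB (nothing suppressed)
        System.out.println(extract(glyphs, 1.0f)); // prints DB
    }
}
```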


2022-09-30 21:17
