Duplicated characters when loading a PDF with Apache Tika in Java
When I open the PDF in a viewer, the following text is displayed (in bold, with an underline):
⇒ (1) DB for Firin Volcanic Mountain
However, when I extract the text with Tika, I get the following:
⇒ (1) (1) (1) (1) (1) (1) (1) Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind, Wind
Each character in the sentence is repeated many times in a row.
The other day, I was advised to use the following Tika setting:
config.setSuppressDuplicateOverlappingText(true);
However, I don't know how to associate it with the parser.
(I could not find an example on Japanese sites, so I have been searching overseas sites, but I have not found one there either.)
How should I rewrite my code?
File document = new File(strFile_path);
Parser parser = new AutoDetectParser();
// PDFParser parser = new PDFParser();
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
PDFParserConfig config = new PDFParserConfig();
// Suppress duplicate overlapping characters (bold text is sometimes drawn by overlaying the same characters)
config.setSuppressDuplicateOverlappingText(true);
// Do not extract annotation text (intended here to skip underlines and similar markup)
config.setExtractAnnotationText(false);
ParseContext context = new ParseContext();
context.set (PDFParserConfig.class, new PDFParserConfig());
try {
// Run the parser.
parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
}
catch (FileNotFoundException e) {
:
}
:
catch (Exception e) {
}
// Print the text extracted from the PDF
System.out.println("handler:[" + handler.toString() + "]");
context.set (PDFParserConfig.class, new PDFParserConfig());
↓
context.set (PDFParserConfig.class,config);
Shouldn't this line register the config object you actually configured, rather than a new default one?
try {
// Run the parser.
parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
Likewise, this call should pass the context you set up instead of a new, empty one:
try {
// Run the parser.
parser.parse(new FileInputStream(document), handler, metadata, context);
Isn't that right?
(I haven't used these settings much myself, though...)
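To see why both changes matter, here is a minimal sketch that does not call Tika at all. It assumes the parser looks up its config in the context by class and falls back to a default config when none is registered. `SketchConfig`, `SketchContext`, and `effectiveSuppressFlag` are hypothetical stand-ins for `PDFParserConfig`, `ParseContext`, and the parser's lookup step. They are not Tika classes.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextLookupSketch {

    /** Hypothetical stand-in for PDFParserConfig: a single flag, off by default. */
    static class SketchConfig {
        private boolean suppressDuplicates = false;

        void setSuppressDuplicates(boolean value) {
            suppressDuplicates = value;
        }

        boolean isSuppressDuplicates() {
            return suppressDuplicates;
        }
    }

    /** Hypothetical stand-in for ParseContext: objects stored and looked up by class. */
    static class SketchContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Hypothetical parser step: returns the flag value the parser would end up using. */
    static boolean effectiveSuppressFlag(SketchContext context) {
        // Assumption: an unregistered config silently falls back to defaults.
        SketchConfig config = context.get(SketchConfig.class, new SketchConfig());
        return config.isSuppressDuplicates();
    }

    public static void main(String[] args) {
        SketchConfig config = new SketchConfig();
        config.setSuppressDuplicates(true);

        // Mistake 1: a fresh default config is registered instead of the tuned one.
        SketchContext freshConfig = new SketchContext();
        freshConfig.set(SketchConfig.class, new SketchConfig());

        // Mistake 2: a brand-new, empty context is handed to the parser.
        SketchContext emptyContext = new SketchContext();

        // Fix: register the tuned config and pass that same context along.
        SketchContext fixedContext = new SketchContext();
        fixedContext.set(SketchConfig.class, config);

        System.out.println("fresh config registered: " + effectiveSuppressFlag(freshConfig));   // false
        System.out.println("empty context passed:    " + effectiveSuppressFlag(emptyContext));  // false
        System.out.println("tuned config registered: " + effectiveSuppressFlag(fixedContext));  // true
    }
}
```

In this model, either mistake on its own is enough to lose the setting, which is why both the `context.set(...)` line and the `parser.parse(...)` call need to change.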