Characters get garbled when Java reads a text file that is not UTF-8

In Java, I wrote the code below to load a text file.
If the text file being read is in an encoding other than UTF-8, such as EUC, Shift-JIS, or GB2312, the String that is read in is garbled.

Is there a way to convert the text to Unicode (UTF-8) as it is loaded?

Also, I don't know how to determine the character encoding of the file being read, so
UTF-8 is currently hard-coded, as shown below.

★ If I determine the character encoding in advance and add to or modify this source,
would it be possible to load the text into a String through some conversion process in Java,
as long as it is an encoding that Java supports?

InputStreamReader fr=null;
BufferedReader br=null;

try{
    fr = new InputStreamReader (new FileInputStream(strFile_name), "UTF-8");
    br = new BufferedReader (fr);

    // readLine() treats \r, \n, and \r\n as line terminators.
    String strLine = null;
    int iCount = 0;
    while ((strLine = br.readLine()) != null) { // null means end of file
        file_text_line.add(strLine);
        iCount++;
    }

    br.close();
    fr.close();

    return iCount; // Read successfully (note: may be 0)
}
catch (FileNotFoundException e) {

java

2022-09-30 19:57

2 Answers

This is old knowledge, but for EUC-JP or Shift_JIS you can specify "JISAutoDetect" as the encoding.

I thought it might no longer exist in the Java 8 era, but it looks like it is still there.

2022-09-30 19:57

First of all, the internal representation of a String is fixed to UTF-16 (it is not something that can be changed with JVM options). The second argument of InputStreamReader specifies the encoding of the byte sequence that the Reader is going to read, so you must specify the encoding of the text file there.
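As a minimal sketch of that point, the helper below passes an explicit encoding name to InputStreamReader. The class name EncodedFileReader and the method readLines are hypothetical, and whether a name such as "Shift_JIS" or "EUC-JP" is accepted depends on the charsets shipped with your JVM.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original post.
class EncodedFileReader {

    /** Reads all lines of the file, decoding its bytes with the named charset. */
    static List<String> readLines(Path file, String charsetName) throws IOException {
        // Throws an unchecked exception if this JVM does not support the name.
        Charset charset = Charset.forName(charsetName);
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader (and the underlying stream).
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            String line;
            while ((line = br.readLine()) != null) { // null means end of file
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Once decoded, the resulting String values no longer carry the file's encoding, so a call such as EncodedFileReader.readLines(path, "Shift_JIS") is only correct if the file really is Shift_JIS.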
Unfortunately, the Java standard library has no function that automatically determines the encoding of a given byte sequence and turns it into a String, so you have to decide manually, write your own detection function, or use an external library.
Some Java libraries are listed below, but keep in mind that there is no way to guess the encoding of a byte sequence with 100% accuracy when no additional information is available (web browsers often show garbled text too, don't they?), so I think it is best to prepare a fallback for when the guess is wrong.

ICU4J (the one used in Google Chrome until very recently)
juniversalcardet (old Firefox algorithm migrated to Java)
jChardet (same as above)

To be honest, the accuracy of all of them is so-so, but here is an example using ICU4J.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.Reader;
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

// Byte stream of the file (buffered, so the detector can mark/reset it)
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(strFile_name));

// The encoding can be inferred with ICU4J's CharsetDetector class
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();

// cm.getConfidence() returns the confidence that the guess is correct (0-100)
// You can decide whether to accept the result based on this value.
// With only a few input bytes, the confidence is generally low.
if (cm.getConfidence() > 70) {
    // Confidence above 70
} else {
    // Confidence of 70 or less
}

// Get the name of the detected charset
String charset = cm.getName();
// or get a Reader directly
Reader reader = cm.getReader();
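
For the "write your own detection function" option mentioned above, one simple approach that needs only the standard library is to try candidate encodings in order with a strict decoder and keep the first one that decodes without errors. This is a rough sketch, not a real detector: the class name StrictCharsetGuesser and the method firstStrictMatch are hypothetical, and it can only rule encodings out. Byte sequences that happen to be valid in several candidates (which is common between Shift_JIS and EUC-JP) are attributed to whichever candidate comes first.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration; not part of the original answer.
class StrictCharsetGuesser {

    /**
     * Returns the first candidate charset that decodes the data without any
     * malformed or unmappable input, or an empty Optional if none does.
     */
    static Optional<Charset> firstStrictMatch(byte[] data, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                  // REPORT makes the decoder throw instead of substituting U+FFFD.
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(data));
                return Optional.of(cs);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; try the next candidate.
            }
        }
        return Optional.empty();
    }
}
```

Putting UTF-8 first in the candidate list is a reasonable default, because text in other multi-byte encodings rarely forms valid UTF-8 by accident. An empty result is the cue to fall back to a default encoding or to ask the user.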

2022-09-30 19:57
