Regular expression to verify that a file is plain text

Asked 2 years ago, Updated 2 years ago, 94 views

I want a Perl program to determine whether a file contains only plain text. I'd like to make the decision with a regular expression, but I'm not sure how to write a pattern that does it.
The Linux file command can tell text from binary.
Can something like that be written as a regular expression?

regular-expression perl

2022-09-29 22:33

3 Answers

I think it is better to validate Shift_JIS and UTF-8 by treating the input as binary data rather than by using regular expressions, but I will describe how the check looks in regular-expression form anyway.

In UTF-8, a multi-byte sequence starts with a byte of the form 11******. If that byte has n consecutive high-order 1 bits (2 ≤ n ≤ 6), it is followed by n-1 bytes of the form 10******. In other words, you classify bytes C0 and above into five categories and then check how many 80-BF bytes follow each one. For example, for n=3, anything matching [\xE0-\xEF]([^\x80-\xBF]|[\x80-\xBF][^\x80-\xBF]|[\x80-\xBF]{3}) is invalid.

This is not practical, so it is better to perform the same check without regular expressions.
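
The same structural check can be done by walking the bytes directly. The following is only a minimal sketch, not a complete validator. The helper name looks_like_utf8 is hypothetical. It follows the five lead-byte categories described above (current UTF-8, as defined in RFC 3629, only goes up to four-byte sequences), and it does not reject overlong encodings or surrogates.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if every lead byte is followed by the
# right number of 80-BF continuation bytes, 0 otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my @b = unpack 'C*', $bytes;    # one integer per byte
    my $i = 0;
    while ($i < @b) {
        my $c = $b[$i++];
        next if $c < 0x80;          # single-byte (ASCII)
        # n = total length of the sequence implied by the lead byte
        my $n = $c < 0xC0 ? 0       # stray continuation byte
              : $c < 0xE0 ? 2
              : $c < 0xF0 ? 3
              : $c < 0xF8 ? 4
              : $c < 0xFC ? 5
              : $c < 0xFE ? 6
              :             0;      # FE and FF are never lead bytes
        return 0 unless $n;
        for (2 .. $n) {
            return 0 if $i >= @b;   # sequence cut off at end of input
            my $t = $b[$i++];
            return 0 if $t < 0x80 || $t > 0xBF;
        }
    }
    return 1;
}
```

Because it only looks at byte structure, this accepts any ASCII input and rejects, for example, Shift_JIS text whose first non-ASCII byte falls in the 80-BF range.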


2022-09-29 22:33

If "plain text" is defined as "contains only visible character data", then \p{Print} may be sufficient. See perluniprops(1) for which characters count as Print.

For example, if the input character encoding is UTF-8, it works like this:

$ perl -MEncode -e 'undef $/; $i = <>; eval { $i = Encode::decode("UTF-8", $i, Encode::FB_CROAK); }; print(((!$@ && $i !~ /[^\p{Print}\t\n]/) ? "Text" : "Binary"), "\n")' /etc/passwd
Text
$ perl -MEncode -e 'undef $/; $i = <>; eval { $i = Encode::decode("UTF-8", $i, Encode::FB_CROAK); }; print(((!$@ && $i !~ /[^\p{Print}\t\n]/) ? "Text" : "Binary"), "\n")' /bin/ls
Binary

Two points matter here. The variable being tested must have the UTF8 flag set, that is, it must hold decoded characters (see perlunitut(1) or the Encode module documentation, which you can read by running perldoc Encode). Also, \t and \n must be included in the regular expression character class if you want tabs and newlines to count as text.

If the input is Shift_JIS, change the first argument to Encode::decode.

$ perl -MEncode -e 'undef $/; $i = <>; eval { $i = Encode::decode("Shift_JIS", $i, Encode::FB_CROAK); }; print(((!$@ && $i !~ /[^\p{Print}\t\n]/) ? "Text" : "Binary"), "\n")' sjis.txt
Text
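
For use inside a script rather than on the command line, the same decode-then-match idea might be wrapped in a function. This is a sketch under the same assumptions as the one-liners. The name is_plain_text is hypothetical, and carriage returns are treated as non-text because only \t and \n are allowed in addition to \p{Print}.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $bytes decodes cleanly as $encoding and the
# result contains only printable characters, tabs and newlines.
sub is_plain_text {
    my ($bytes, $encoding) = @_;
    my $text = eval {
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $bytes intact.
        Encode::decode($encoding, $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return 0 unless defined $text;      # decoding failed
    return $text !~ /[^\p{Print}\t\n]/ ? 1 : 0;
}

print is_plain_text("hello\tworld\n", "UTF-8") ? "Text" : "Binary", "\n";
```

Passing the encoding as an argument keeps the Shift_JIS case a one-word change, just as with the one-liners.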


2022-09-29 22:33

You do not need to use regular expressions directly.
In fact, applying a regular expression directly to an undecoded byte string is both difficult and ineffective.
How about using Encode::Guess instead?

Plain text is itself just a sequence of bytes.
Therefore, there is no clear-cut distinction between plain text and binary.
Plain text is simply a byte sequence that a human reader recognizes as text.

Shift_JIS also has byte patterns that are never used, so
if you happen to come across one, you can say the data is not Shift_JIS.
The Linux file command is no exception, of course.

perlfaq6 - How can I match strings with multibyte characters?
http://perldoc.jp/docs/perl/5.14.1/perlfaq6.pod#How32can32I32match32strings32with32multibyte32characters63

To distinguish plain text from binary mechanically, you end up relying on a statistical judgment.

If "plain text" means only ASCII or UTF-8, you can use the -T file test operator without having to open the file yourself.

$ perl -E 'say "TEXT" if -T "utf8.txt";'

-T  File is an ASCII or UTF-8 text file (heuristic guess).
http://perldoc.jp/func/-X
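
To illustrate what such a statistical judgment can look like, here is a rough sketch loosely modeled on how perlfunc describes -T: a zero byte means binary, and otherwise the data is binary if more than a third of the bytes are "odd". The helper name guess_text and the exact set of bytes counted as odd are assumptions made for this example. The real -T test also checks for valid UTF-8 first, which this sketch skips.

```perl
use strict;
use warnings;

# Hypothetical helper: guess whether a byte string looks like text by
# counting "odd" bytes. Returns 1 for text, 0 for binary.
sub guess_text {
    my ($bytes) = @_;
    return 1 if $bytes eq '';       # nothing suspicious in empty input
    return 0 if $bytes =~ /\0/;     # any zero byte counts as binary
    # Odd bytes: everything except tab, LF, CR and printable ASCII.
    my $odd = () = $bytes =~ /[^\x09\x0A\x0D\x20-\x7E]/g;
    return $odd * 3 > length($bytes) ? 0 : 1;
}

print guess_text("just some text\n") ? "TEXT" : "BINARY", "\n";
```

Since every high-bit byte is counted as odd here, Shift_JIS or UTF-8 text consisting mostly of Japanese characters would be reported as binary, which is exactly why the encoding has to be taken into account.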

If you want to take the file names from find(1) or ls(1) on standard input, you can do it like this.

$ ls | perl -nlE 'say $_ if -T;'

If Shift_JIS also has to be accepted, you do have to open the file after all.

Please confirm beforehand whether Shift_JIS really has to be accepted as plain text.
For historical reasons the encoding is commonly called Shift_JIS, but what is meant may actually be Windows-31J (cp932). If you treat characters such as ㈱ ("Co., Ltd.") and Ⅲ (Roman numeral III) as valid and do not need to worry about IBM extended characters, cp932 can be used as the candidate encoding:

$ ls | perl -MEncode::Guess -nlE 'open $fh, "<", $_; say $_ if ref guess_encoding(do { local $/; <$fh> }, qw/utf8 cp932/);'
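
The one-liner tests the result with ref because guess_encoding returns an encoding object when exactly one candidate fits and a plain error string otherwise. A sketch of the same check as a script, assuming the same utf8 and cp932 candidates (the helper name guess_name is hypothetical):

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical helper: returns the name of the guessed encoding, or undef
# when Encode::Guess cannot settle on exactly one candidate.
sub guess_name {
    my ($bytes) = @_;
    my $enc = guess_encoding($bytes, qw/utf8 cp932/);
    return ref $enc ? $enc->name : undef;
}

my $name = guess_name("plain ASCII\n");
print defined $name ? "text ($name)\n" : "undetermined\n";
```

An undef result covers both "no candidate fits" and "more than one candidate fits", so data that is ambiguous between utf8 and cp932 is not reported as text by this approach.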


2022-09-29 22:33


© 2024 OneMinuteCode. All rights reserved.