Regular expression to check whether a file is plain text

In a Perl program, I want to determine whether a file consists only of plain text. I would like to do this with a regular expression, but I am not sure what pattern to write.
The Linux file command can tell text from binary. Can something like that be written as a regular expression?

regular-expression perl

2022-09-29 22:33

3 Answers

If Shift_JIS and UTF-8 have to be checked at the same time, I think it is better to examine the data as bytes in ordinary code rather than with regular expressions, but I will write down the detection method in regular-expression form anyway.

In UTF-8, a multi-byte sequence starts with a byte of the form 11xxxxxx. If that byte has n consecutive high 1 bits (2 ≤ n ≤ 6), it is followed by n-1 bytes of the form 10xxxxxx. In other words, you classify the bytes C0 and above into five categories by n and then check how many 80-BF bytes follow. For example, for n=3, anything matching [\xE0-\xEF]([^\x80-\xBF]|[\x80-\xBF][^\x80-\xBF]|[\x80-\xBF]{3}) is invalid.

Writing this as a regular expression is not practical, so it is better to perform the same check in ordinary code without regular expressions.
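
The byte-by-byte check described above could look roughly like the following. This is only a minimal sketch, not a full validator: the function name utf8_structure_ok is made up for illustration, it only counts continuation bytes (it does not reject overlong forms or surrogates), and it assumes that lead bytes for n=5 and n=6 should be rejected, because current UTF-8 (RFC 3629) allows at most four bytes per character.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if every lead byte is followed by the
# expected number of continuation bytes (0x80-0xBF), 0 otherwise.
# Structural check only; expects a byte string, not a decoded string.
sub utf8_structure_ok {
    my ($bytes) = @_;
    my @b = unpack 'C*', $bytes;
    my $i = 0;
    while ($i < @b) {
        my $lead = $b[$i++];
        my $need;
        if    ($lead < 0x80) { $need = 0 }   # ASCII
        elsif ($lead < 0xC0) { return 0 }    # stray continuation byte
        elsif ($lead < 0xE0) { $need = 1 }   # n = 2
        elsif ($lead < 0xF0) { $need = 2 }   # n = 3
        elsif ($lead < 0xF8) { $need = 3 }   # n = 4
        else                 { return 0 }    # n = 5, 6, 0xFE, 0xFF: assumed invalid
        for (1 .. $need) {
            return 0 if $i >= @b;                # sequence cut off at end of data
            my $c = $b[$i++];
            return 0 if $c < 0x80 || $c > 0xBF;  # not a continuation byte
        }
    }
    return 1;
}
```

For a real file you would read it in binary mode and pass the contents to the function; for large files, checking only the first few kilobytes may be enough.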

2022-09-29 22:33

If "plain text" is defined as "contains only visible character data", then \p{Print} may be sufficient. See perluniprops(1) for which characters have the Print property.

For example, if the input character encoding is UTF-8, the following works:

$ perl -MEncode -e 'undef $/; $i = <>; eval { $i = Encode::decode("UTF-8", $i, Encode::FB_CROAK); }; print((!$@ && $i !~ /[^\p{Print}\t\n]/) ? "Text" : "Binary", "\n")' /etc/passwd
Text
$ perl -MEncode -e 'undef $/; $i = <>; eval { $i = Encode::decode("UTF-8", $i, Encode::FB_CROAK); }; print((!$@ && $i !~ /[^\p{Print}\t\n]/) ? "Text" : "Binary", "\n")' /bin/ls
Binary

It is important that the variable being matched holds a decoded, UTF8-flagged string (see perlunitut(1) or the Encode module documentation, which you can read by running perldoc Encode). Also include \t\n in the regular expression character class if tabs and newlines should count as text.

If the input is Shift_JIS, change the first argument to Encode::decode.

$ perl -MEncode -e 'undef $/; $i = <>; eval { $i = Encode::decode("Shift_JIS", $i, Encode::FB_CROAK); }; print((!$@ && $i !~ /[^\p{Print}\t\n]/) ? "Text" : "Binary", "\n")' sjis.txt
Text
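
The same decode-then-match idea can also be written as a small function, which may be easier to reuse than a one-liner. This is just a sketch under assumptions: the name is_text is made up for illustration, and the caller is expected to pass raw bytes plus an encoding name that Encode knows.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: true if $bytes decode cleanly as $encoding and the
# result contains only printable characters, tabs, and newlines.
sub is_text {
    my ($bytes, $encoding) = @_;
    my $copy = $bytes;  # decode() with a CHECK argument may modify its input
    my $text = eval { Encode::decode($encoding, $copy, Encode::FB_CROAK) };
    return 0 unless defined $text;               # decoding failed
    return $text !~ /[^\p{Print}\t\n]/ ? 1 : 0;  # any other character means binary
}
```

It could be called as is_text($data, "UTF-8") or is_text($data, "Shift_JIS") after reading the file in binary mode. Note that \r is not in the character class, so a file with CRLF line endings would be reported as binary unless you add it.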

2022-09-29 22:33

You do not need to use regular expressions directly.
Or rather, applying a regular expression directly to an undecoded byte string is difficult and not very effective.
How about using Encode::Guess instead?

Plain text is itself nothing more than a sequence of bytes.
Therefore, there is no sharp boundary between plain text and binary.
Plain text is simply a byte sequence that a human reader recognizes as text.

Shift_JIS does have byte patterns that are never used, so
if you happen to find one of them, you can say the data is not Shift_JIS.
The Linux file command is no exception, of course.

perlfaq6 - How do I match a string that contains multi-byte characters?
http://perldoc.jp/docs/perl/5.14.1/perlfaq6.pod#How32can32I32match32strings32with32multibyte32characters63

In order to mechanically distinguish between plain text and binary, you end up relying on statistical judgment.

If plain text only means ASCII or UTF-8, you can use the -T file test without having to open the file yourself.

$ perl -E 'say "TEXT" if -T "utf8.txt";'

-T: the file is an ASCII or UTF-8 text file (heuristic guess).
http://perldoc.jp/func/-X

If you want to feed file names from find(1) or ls(1) on standard input, you can do it like this.

$ ls | perl -nlE 'say $_ if -T;'
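
Inside a script, the same filter can be written with grep. This is a small sketch; the function name text_files is made up for illustration, and the extra -f test is an assumption added here to skip directories and other non-regular files.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the names of regular files that pass
# Perl's -T heuristic (ASCII or UTF-8 text).
sub text_files {
    my @names = @_;
    return grep { -f $_ && -T $_ } @names;
}
```

For example, text_files(glob "*") would return the text files in the current directory. Keep in mind that -T is a heuristic based on the beginning of the file, and that an empty file passes both -T and -B.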

If Shift_JIS is also to be accepted, the file has to be opened after all.

Please check again beforehand whether Shift_JIS should really be accepted as plain text.
For historical reasons the encoding is commonly called Shift_JIS, but what is actually meant may be Windows-31J (cp932). If characters such as the "Co., Ltd." symbol (㈱) or the Roman numeral Ⅲ should count as text, use cp932; then you do not have to worry about the IBM extended characters either.

$ ls | perl -MEncode::Guess -nlE 'open $fh, "<", $_ or next; say $_ if ref guess_encoding(do { local $/; <$fh> }, qw/utf8 cp932/);'

2022-09-29 22:33