I want to use regular expressions to exclude surrogates and special classification characters.

Asked 2 years ago, Updated 2 years ago, 106 views

I'm creating an app with Unity (C#).
To make the characters that can be entered on Android match those that can be entered on iPhone,
I'm trying to strip the unwanted characters with C#'s Regex.Replace, but it doesn't work.

I wrote the following code, but on an actual Android device
characters such as 0, 1, and 8 were deleted.
All characters are excluded.

0 = 0x30 and 1 = 0x31, so I don't think they fall in any of the excluded ranges.
I don't know why they are excluded.
Could you explain?

■ Unicode reference site
 http://www.asahi-net.or.jp/~ax2s-vmtn/ref/unicode/index_u.html

■ List of regular expressions for the characters I want to exclude

private static readonly List<string> RegexList = new List<string>()
{
    "[\u2600-\u26FF]", // Miscellaneous Symbols
    "[\u0530-\u058F]", // Armenian
    "[\u0A00-\u0A7F]", // Gurmukhi
    "[\uD800-\uDB7F]", // High Surrogates
    "[\uDB80-\uDBFF]", // High Private Use Surrogates
    "[\uDC00-\uDFFF]", // Low Surrogates
    "[\uE000-\uF8FF]", // Private Use Area
    "[¥u{EFF80}-¥u{EFFFF}]", // Unassigned
    "[¥u{F0000}-¥u{FFFFFF}]", // Supplementary Private Use Area-A (plane 15)
    "[¥u{100000} -¥u{10FFFF}]", // Supplementary Private Use Area-B (plane 16)
};

■ Internal processing
·Loop over the regular expression list and replace every match with an empty string.
·inputString contains the string entered from the keyboard in the app.

foreach (var regex in RegexList)
{
    inputString = Regex.Replace(inputString, regex, "");
}

■Development environment
 VisualStudio 2015
 Unity 5.6.3p1

c# unity3d regular-expression

2022-09-30 19:28

3 Answers

It may be hard to tell depending on the font you are using, but the \ in the first patterns is a backslash, U+005C (which is displayed as a yen sign in some fonts, mainly on Windows), while the ¥ in the last three lines is a genuine yen sign, U+00A5.

In C#, the notation \u{...} is not a valid escape sequence either in an ordinary string literal or in a Regex pattern. Did you perhaps replace the backslash with a real yen sign for that reason?

In the first place, "[\uD800-\uDB7F]", "[\uDB80-\uDBFF]", and "[\uDC00-\uDFFF]" in your example already exclude every surrogate, so additional patterns for particular ranges of non-BMP characters are redundant.

The patterns in the last three lines should be deleted for now. (Whether that deletes exactly the characters you want is another matter.)

private static readonly List<string> RegexList = new List<string>()
{
    "[\u2600-\u26FF]", // Miscellaneous Symbols
    "[\u0530-\u058F]", // Armenian
    "[\u0A00-\u0A7F]", // Gurmukhi
    "[\uD800-\uDBFF]", // High Surrogates
    "[\uDC00-\uDFFF]", // Low Surrogates
    "[\uE000-\uF8FF]", // Private Use Area
};

For example, in the regular expression [¥u{EFF80}-¥u{EFFFF}] the ¥ is not treated as a metacharacter, so the text on either side of - is not interpreted as the endpoints of the range you intended.

If you specify the entire high and low surrogate ranges, every non-BMP character is deleted, as noted above. If that turns out to delete too much, you need to specify more precisely which characters to delete.


2022-09-30 19:28

> 0 = 0x30 and 1 = 0x31, so I don't think they fall in any of the excluded ranges. I don't know why they are excluded.

As OOPer has already pointed out, the ¥ in "[¥u{EFF80}-¥u{EFFFF}]" is just a yen sign, U+00A5, not a backslash. The regular expression is therefore interpreted as the character class "[08EFu{}-¥]", which matches 0, 8, E, F, u, {, and every character from } through ¥.
Please keep in mind that a program behaves as written, not as intended.
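
To see this on a desktop before testing on a device, something like the following should reproduce it. This is only a minimal sketch: the class and method names are made up for illustration, and the yen sign is written as \u00A5 so that the source file encoding cannot change it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class YenSignPatternDemo
{
    // The question's first supplementary-plane pattern, with the yen sign
    // spelled as \u00A5. To the regex engine this is an ordinary character
    // class made of literal characters, not a range of code points.
    public const string Pattern = "[\u00A5u{EFF80}-\u00A5u{EFFFF}]";

    public static string Strip(string input)
    {
        return Regex.Replace(input, Pattern, "");
    }

    public static void Main()
    {
        // 0, 8, E and F are literal members of the class, so they are removed.
        Console.WriteLine(Strip("0123456789ABCDEF")); // 12345679ABCD
    }
}
```

The other two patterns contain the literals 0, 1 and F in the same way, which is consistent with 0, 1 and 8 disappearing from the input.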

"[¥u{EFF80}-¥u{EFFFF}]", // Unassigned
"[¥u{F0000}-¥u{FFFFFF}]", // Supplementary Private Use Area-A (plane 15)
"[¥u{100000} -¥u{10FFFF}]" // Supplementary Private Use Area-B (plane 16)

Decomposing these into surrogate pairs according to what you intended gives

  • \uDB7F\uDF80 through \uDB7F\uDFFF
  • \uDB80\uDC00 through \uDBBF\uDFFF
  • \uDBC0\uDC00 through \uDBFF\uDFFF

Of course, in every one of these the first code unit falls within the high surrogates "[\uD800-\uDB7F]" or "[\uDB80-\uDBFF]", and the second within the low surrogates "[\uDC00-\uDFFF]".
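
If you want to check the decomposition yourself, char.ConvertFromUtf32 returns the UTF-16 form of a code point. A minimal sketch (the class and the ToUtf16Hex helper are hypothetical names, not part of the question's code):

```csharp
using System;

public static class SurrogatePairDemo
{
    // Returns the UTF-16 code units of a code point as hex, e.g. "DB7F DF80".
    public static string ToUtf16Hex(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        string[] units = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
        {
            units[i] = ((int)s[i]).ToString("X4");
        }
        return string.Join(" ", units);
    }

    public static void Main()
    {
        Console.WriteLine(ToUtf16Hex(0xEFF80));  // DB7F DF80
        Console.WriteLine(ToUtf16Hex(0xEFFFF));  // DB7F DFFF
        Console.WriteLine(ToUtf16Hex(0xF0000));  // DB80 DC00
        Console.WriteLine(ToUtf16Hex(0xFFFFF));  // DBBF DFFF
        Console.WriteLine(ToUtf16Hex(0x100000)); // DBC0 DC00
        Console.WriteLine(ToUtf16Hex(0x10FFFF)); // DBFF DFFF
    }
}
```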

Rather, as OOPer pointed out, I think excluding every surrogate pair is too broad: some kanji will be deleted as well.
To reduce the range of surrogate pairs you delete, narrow the high-surrogate range. However, if you delete high and low surrogates individually, the string becomes malformed, especially when only the low surrogate is deleted and the high surrogate is left behind. If you want to delete surrogates, restrict the pattern to a high surrogate immediately followed by a low surrogate.
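
The difference can be seen with a small test. The following is a sketch under my own assumptions: the pattern names are made up, U+20BB7 stands in for "a kanji outside the BMP", and the narrowed pattern that keeps only planes 15 and 16 is just one possible way to narrow the high-surrogate range. The patterns are verbatim strings so that the regex engine, not the C# compiler, interprets \uXXXX.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PairedSurrogateDemo
{
    public const string LowOnly = @"[\uDC00-\uDFFF]";
    public const string AnyPair = @"[\uD800-\uDBFF][\uDC00-\uDFFF]";
    // High surrogates DB80-DBFF cover U+F0000 through U+10FFFF only.
    public const string PrivateUsePair = @"[\uDB80-\uDBFF][\uDC00-\uDFFF]";

    public static void Main()
    {
        string kanji = char.ConvertFromUtf32(0x20BB7); // a non-BMP kanji
        string pua = char.ConvertFromUtf32(0xF0000);   // plane 15 private use
        string input = "A" + kanji + pua + "B";        // 6 UTF-16 code units

        // Low surrogates only: two orphaned high surrogates are left behind.
        Console.WriteLine(Regex.Replace(input, LowOnly, "").Length); // 4

        // Any pair: the kanji is deleted along with the private-use character.
        Console.WriteLine(Regex.Replace(input, AnyPair, "") == "AB"); // True

        // Narrowed pair: only the private-use character is deleted.
        Console.WriteLine(Regex.Replace(input, PrivateUsePair, "") == "A" + kanji + "B"); // True
    }
}
```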

private static readonly List<string> RegexList = new List<string>()
{
    "[\u2600-\u26FF]", // Miscellaneous Symbols
    "[\u0530-\u058F]", // Armenian
    "[\u0A00-\u0A7F]", // Gurmukhi
    "[\uD800-\uDBFF][\uDC00-\uDFFF]", // Surrogate pairs
    "[\uE000-\uF8FF]", // Private Use Area
};


2022-09-30 19:28

I was able to resolve it with the following code. However, with a Unity IL2CPP build the regular expression validation does not work, and I am still investigating that.

/// <summary>
/// List of regular expression strings used to restrict input
/// </summary>
private static readonly List<string> InputBlockRegexList = new List<string>()
{
    "[\u23F3]", // (HOURGLASS WITH FLOWING SAND)
    "[\u25FD-\u25FE]", // (WHITE MEDIUM SMALL SQUARE) (BLACK MEDIUM SMALL SQUARE)
    "[\u2600-\u26FF]", // Miscellaneous Symbols
    "[\u2705]", // (WHITE HEAVY CHECK MARK)
    "[\u2714]", // (HEAVY CHECK MARK)
    "[\u2764]", // (HEAVY BLACK HEART)
    "[\u274C]", // (CROSS MARK)
    "[\u274E]", // (NEGATIVE SQUARED CROSS MARK)
    "[\u2753-\u2755]", // (BLACK QUESTION MARK ORNAMENT) (WHITE QUESTION MARK ORNAMENT) (WHITE EXCLAMATION MARK ORNAMENT)
    "[\u2757]", // (HEAVY EXCLAMATION MARK SYMBOL)
    "[\u27BF]", // (DOUBLE CURLY LOOP)
    "[\u2795-\u2797]", // (HEAVY PLUS SIGN) (HEAVY MINUS SIGN) (HEAVY DIVISION SIGN)
    "[\u2B50]", // (WHITE MEDIUM STAR)
    "[\u2B55]", // (HEAVY LARGE CIRCLE)
    "[\u2B1B-\u2B1C]", // (BLACK LARGE SQUARE) (WHITE LARGE SQUARE)
    "[\u0530-\u058F]", // Armenian
    "[\u0A00-\u0A7F]", // Gurmukhi
    "[\uD800-\uDBFF][\uDC00-\uDFFF]", // Surrogate pairs
    "[\uE000-\uF8FF]", // Private Use Area
};

■ Validation implementation

/// <summary>
/// Restricts the characters allowed in the input
/// </summary>
/// <param name="inputString">Input string</param>
/// <returns>String after validation</returns>
public static string InputValueValidate(string inputString)
{
    // Remove every match of each pattern in the regular expression list
    for (int i = 0; i < InputBlockRegexList.Count; i++)
    {
        inputString = Regex.Replace(inputString, InputBlockRegexList[i], "");
    }
    return inputString;
}
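
I have not identified the cause of the IL2CPP failure yet. In case Regex itself turns out to be the problem there, one fallback I am considering is to filter without regular expressions. The following is only a rough sketch of that idea: the class and method names are hypothetical, and IsBlocked lists just the block ranges, not the individual symbols from the list above.

```csharp
using System;
using System.Text;

public static class InputFilterSketch
{
    // Hypothetical helper: true if a single UTF-16 code unit is blocked.
    static bool IsBlocked(char c)
    {
        return (c >= '\u2600' && c <= '\u26FF')  // Miscellaneous Symbols
            || (c >= '\u0530' && c <= '\u058F')  // Armenian
            || (c >= '\u0A00' && c <= '\u0A7F')  // Gurmukhi
            || (c >= '\uE000' && c <= '\uF8FF'); // Private Use Area
    }

    public static string Filter(string input)
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            // Drop a high surrogate followed by a low surrogate as one unit,
            // like the pattern [\uD800-\uDBFF][\uDC00-\uDFFF] does.
            if (char.IsHighSurrogate(c) && i + 1 < input.Length
                && char.IsLowSurrogate(input[i + 1]))
            {
                i++; // skip the low surrogate as well
                continue;
            }
            if (!IsBlocked(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string emoji = char.ConvertFromUtf32(0x1F600);
        Console.WriteLine(Filter("a\u2600b" + emoji + "c")); // abc
    }
}
```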


2022-09-30 19:28
