How can I easily strip Chinese characters (hanja) and Greek letters such as β from a string, the way I would with something like isalnum?
Example
"가나다-aβbc-123 김家네 김밥" => "가나다 abc 123 김네 김밥"
or
"가나다-aβbc-123 김家네 김밥" => "가나다-abc-123 김네 김밥"
or
"가나다-aβbc-123 김家네 김밥" => "가나다abc123김네김밥"
I want to extract only Korean, English letters, and digits from sentences or phrases. My first attempt used filter together with isalnum.
However, isalnum returns True for both Chinese characters and Greek letters, which I had assumed would be classified as special characters.
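A minimal reproduction of that first attempt might look like the following. This is only a sketch of what "filter and isalnum" probably meant; the variable names are illustrative and not taken from the original code.

```python
s = "가나다-aβbc-123 김家네 김밥"

# str.isalnum is True for any Unicode letter or digit, not just Hangul/ASCII,
# so the Greek β and the hanja 家 survive the filter.
result = "".join(filter(str.isalnum, s))
print(result)  # 가나다aβbc123김家네김밥
```

Only the hyphens and spaces are dropped, because they are the only characters here that Unicode does not classify as letters or digits.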
import re
a = "가나다-aβbc-123 김家네 김밥"
# Remove everything except ASCII letters, digits, Hangul jamo, and Hangul syllables.
t = re.sub('[^a-zA-Z0-9ᄀ-ᇂ가-힣]', '', a)
print(t)  # 가나다abc123김네김밥
Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> s = "가나다-aβbc-123 김家네 김밥"
>>> import unicodedata
>>> for c in s:
    print(c, unicodedata.name(c))
가 HANGUL SYLLABLE GA
나 HANGUL SYLLABLE NA
다 HANGUL SYLLABLE DA
- HYPHEN-MINUS
a LATIN SMALL LETTER A
β GREEK SMALL LETTER BETA
b LATIN SMALL LETTER B
c LATIN SMALL LETTER C
- HYPHEN-MINUS
1 DIGIT ONE
2 DIGIT TWO
3 DIGIT THREE
  SPACE
김 HANGUL SYLLABLE GIM
家 CJK UNIFIED IDEOGRAPH-5BB6
네 HANGUL SYLLABLE NE
  SPACE
김 HANGUL SYLLABLE GIM
밥 HANGUL SYLLABLE BAB
>>> for c in s:
    print(c, unicodedata.name(c), c.isalpha())
가 HANGUL SYLLABLE GA True
나 HANGUL SYLLABLE NA True
다 HANGUL SYLLABLE DA True
- HYPHEN-MINUS False
a LATIN SMALL LETTER A True
β GREEK SMALL LETTER BETA True
b LATIN SMALL LETTER B True
c LATIN SMALL LETTER C True
- HYPHEN-MINUS False
1 DIGIT ONE False
2 DIGIT TWO False
3 DIGIT THREE False
  SPACE False
김 HANGUL SYLLABLE GIM True
家 CJK UNIFIED IDEOGRAPH-5BB6 True
네 HANGUL SYLLABLE NE True
  SPACE False
김 HANGUL SYLLABLE GIM True
밥 HANGUL SYLLABLE BAB True
>>> s
'가나다-aβbc-123 김家네 김밥'
>>> def filterout_hanja_greek(s):
    r = ""
    for c in s:
        # Non-letters (digits, spaces, punctuation) are kept as they are.
        if not c.isalpha():
            r += c
            continue
        # Letters are dropped when their Unicode name marks them as Greek or CJK.
        unicodename = unicodedata.name(c)
        if unicodename.startswith("GREEK") or unicodename.startswith("CJK"):
            continue
        r += c
    return r
>>> filterout_hanja_greek(s)
'가나다-abc-123 김네 김밥'
The unicodedata module lets you look up the Unicode name of a character, and the first part of that name tells you which script the character belongs to. You can use this to remove specific kinds of characters.
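The same idea can be turned around into an allow-list, which may be closer to "keep only Korean, English, and digits". The sketch below is an assumption about how that could look, not part of the function shown earlier: the helper name keep_allowed_scripts, the ALLOWED_PREFIXES tuple, and the keep parameter are all made up for illustration.

```python
import unicodedata

# Hypothetical allow-list: name prefixes of the characters we want to keep.
ALLOWED_PREFIXES = ("HANGUL", "LATIN", "DIGIT")

def keep_allowed_scripts(text, keep=" -"):
    """Keep characters listed in `keep` plus those whose Unicode name
    starts with one of ALLOWED_PREFIXES; drop everything else."""
    kept = []
    for ch in text:
        if ch in keep:
            kept.append(ch)
            continue
        # The default "" avoids a ValueError for unnamed characters such as "\n".
        name = unicodedata.name(ch, "")
        if name.startswith(ALLOWED_PREFIXES):
            kept.append(ch)
    return "".join(kept)

print(keep_allowed_scripts("가나다-aβbc-123 김家네 김밥"))           # 가나다-abc-123 김네 김밥
print(keep_allowed_scripts("가나다-aβbc-123 김家네 김밥", keep=""))  # 가나다abc123김네김밥
```

Note that the "LATIN" prefix also matches accented letters such as é, so a stricter filter would need an extra check if only plain a-z and A-Z should pass.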
There is also a package called hanja that can convert Chinese characters (hanja) into Hangul.