How do I filter out Chinese characters (Hanja) and Greek letters in Python?

I'd like to know an easy way to exclude Chinese characters and Greek letters such as β from a string, using something like isalnum.

Example
"가나다-aβbc-123 김家네 김밥" => "가나다 abc 123 김네 김밥"
Or
"가나다-aβbc-123 김家네 김밥" => "가나다 abc123 김네 김밥"

I want to extract only Korean, English letters, and digits from sentences or phrases. My first attempt used filter with isalnum.

However, isalnum also returns True for Chinese characters and Greek letters, which I had assumed would count as special characters.
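A minimal reproduction of the problem, assuming the attempt looked roughly like the following (the variable names are illustrative, not taken from the original post):

```python
s = "가나다-aβbc-123 김家네 김밥"

# str.isalnum() is Unicode-aware: a letter or digit from any script passes,
# so the Greek β and the Hanja 家 survive; only '-' and ' ' are dropped.
kept = "".join(filter(str.isalnum, s))
print(kept)  # 가나다aβbc123김家네김밥
```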

python

2022-09-20 18:58

2 Answers

import re
a = "가나다-aβbc-123 김家네 김밥"
# Remove everything except ASCII letters, digits, Hangul jamo/syllables, and spaces
t = re.sub('[^a-zA-Z0-9ㄱ-ㅣᄀ-ᇂ가-힣 ]', '', a)
print(t)
# 가나다abc123 김네 김밥
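If the hyphens should become spaces rather than disappear, the same whitelist idea can be wrapped in a small helper. This is only a sketch: the function name `keep_korean_english_digits` and the `hyphen_to_space` flag are hypothetical, and the character class is one assumed choice of "allowed" characters.

```python
import re

# Matches anything that is NOT an ASCII letter, ASCII digit,
# Hangul compatibility jamo, Hangul syllable, or space.
_NOT_ALLOWED = re.compile(r"[^a-zA-Z0-9ㄱ-ㅣ가-힣 ]")

def keep_korean_english_digits(text, hyphen_to_space=False):
    """Drop every character outside the whitelist above (hypothetical helper)."""
    if hyphen_to_space:
        # Turn hyphens into spaces first so adjacent words stay separated.
        text = text.replace("-", " ")
    return _NOT_ALLOWED.sub("", text)

sample = "가나다-aβbc-123 김家네 김밥"
print(keep_korean_english_digits(sample))
# 가나다abc123 김네 김밥
print(keep_korean_english_digits(sample, hyphen_to_space=True))
# 가나다 abc 123 김네 김밥
```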

2022-09-20 18:58

Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> s = "가나다-aβbc-123 김家네 김밥"
>>> import unicodedata
>>> for c in s:
    print(c, unicodedata.name(c))


가 HANGUL SYLLABLE GA
나 HANGUL SYLLABLE NA
다 HANGUL SYLLABLE DA
- HYPHEN-MINUS
a LATIN SMALL LETTER A
β GREEK SMALL LETTER BETA
b LATIN SMALL LETTER B
c LATIN SMALL LETTER C
- HYPHEN-MINUS
1 DIGIT ONE
2 DIGIT TWO
3 DIGIT THREE
  SPACE
김 HANGUL SYLLABLE GIM
家 CJK UNIFIED IDEOGRAPH-5BB6
네 HANGUL SYLLABLE NE
  SPACE
김 HANGUL SYLLABLE GIM
밥 HANGUL SYLLABLE BAB

>>> for c in s:
    print(c, unicodedata.name(c), c.isalpha())


가 HANGUL SYLLABLE GA True
나 HANGUL SYLLABLE NA True
다 HANGUL SYLLABLE DA True
- HYPHEN-MINUS False
a LATIN SMALL LETTER A True
β GREEK SMALL LETTER BETA True
b LATIN SMALL LETTER B True
c LATIN SMALL LETTER C True
- HYPHEN-MINUS False
1 DIGIT ONE False
2 DIGIT TWO False
3 DIGIT THREE False
  SPACE False
김 HANGUL SYLLABLE GIM True
家 CJK UNIFIED IDEOGRAPH-5BB6 True
네 HANGUL SYLLABLE NE True
  SPACE False
김 HANGUL SYLLABLE GIM True
밥 HANGUL SYLLABLE BAB True

>>> s
'가나다-aβbc-123 김家네 김밥'
>>> def filterout_hanja_greek(s):
    r = ""
    for c in s:
        if not c.isalpha():
            r += c
            continue
        unicodename = unicodedata.name(c)
        if unicodename.startswith("GREEK") or unicodename.startswith("CJK"):
            continue
        r += c
    return r

>>> filterout_hanja_greek(s)
'가나다-abc-123 김네 김밥'
>>>

The unicodedata module lets you look up the Unicode name of a character, and the first word of that name tells you which script the character belongs to. You can use this to remove specific kinds of characters.

There is also a package called hanja that can convert Chinese characters (Hanja) into Hangul.
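The same name-prefix check can also be turned around into a whitelist, so that any letter outside the scripts you want is dropped without listing each unwanted script. The sketch below is one possible way to do that; `keep_scripts` and `ALLOWED_PREFIXES` are hypothetical names, and the choice of "HANGUL" and "LATIN" is an assumption about what should be kept.

```python
import unicodedata

# Assumed whitelist: letters whose Unicode name starts with one of these are kept.
ALLOWED_PREFIXES = ("HANGUL", "LATIN")

def keep_scripts(text, allowed=ALLOWED_PREFIXES):
    """Keep non-letters as-is; keep letters only from the allowed scripts."""
    out = []
    for ch in text:
        if not ch.isalpha():
            out.append(ch)  # digits, spaces, and punctuation pass through
            continue
        # The "" default avoids a ValueError for characters without a name.
        if unicodedata.name(ch, "").startswith(allowed):
            out.append(ch)
    return "".join(out)

print(keep_scripts("가나다-aβbc-123 김家네 김밥"))
# 가나다-abc-123 김네 김밥
```

Note that the "LATIN" prefix also matches accented letters such as é (LATIN SMALL LETTER E WITH ACUTE), so a stricter check would be needed if only plain A-Z is wanted.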

2022-09-20 18:58
