Python: how do I filter Chinese characters and Roman letters out of a string?

Asked 2 years ago, Updated 2 years ago, 13 views

I'd like to know an easy way to exclude Chinese characters (Hanja) and Roman characters such as β from a string, using something like isalnum.

Example
"가나다-aβbc-123 김家네 김밥" => "가나다 abc 123 김네 김밥"
Or
"가나다-aβbc-123 김家네 김밥" => "가나다-abc-123 김네 김밥"
Or
"가나다-aβbc-123 김家네 김밥" => "가나다abc123김네김밥"

I want to extract only Korean, English letters, and digits from sentences or phrases. My first attempt used filter and isalnum.

However, isalnum returns True for both the Chinese characters and the Roman characters, which I had assumed would be classified as special characters.

python

2022-09-20 18:58

2 Answers

import re
a = "가나다-aβbc-123 김家네 김밥"
# Remove everything except ASCII letters, digits, and Hangul (jamo and syllables)
t = re.sub('[^a-zA-Z0-9ᄀ-ᇂ가-힣]', '', a)
# t is now "가나다abc123김네김밥"

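A variant of the same regex idea, as a rough sketch: the character class below is a whitelist of ASCII letters, digits, whitespace, and precomposed Hangul syllables (가-힣, U+AC00 to U+D7A3), so word spacing survives. The function name keep_korean_english_digits is made up for illustration, and the input string is assumed to be the one from the question.

```python
import re

# Anything NOT in this whitelist gets removed:
# ASCII letters, digits, whitespace, precomposed Hangul syllables.
_DISALLOWED = re.compile(r"[^a-zA-Z0-9\s가-힣]")

def keep_korean_english_digits(text):
    """Drop every character outside the whitelist (hypothetical helper)."""
    return _DISALLOWED.sub("", text)

s = "가나다-aβbc-123 김家네 김밥"
print(keep_korean_english_digits(s))  # 가나다abc123 김네 김밥
```

Because this is a whitelist, the hyphens are removed along with β and 家. Add `-` to the character class if they should be kept.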

2022-09-20 18:58

Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> s = "가나다-aβbc-123 김家네 김밥"
>>> import unicodedata
>>> for c in s:
    print(c, unicodedata.name(c))


가 HANGUL SYLLABLE GA
나 HANGUL SYLLABLE NA
다 HANGUL SYLLABLE DA
- HYPHEN-MINUS
a LATIN SMALL LETTER A
β GREEK SMALL LETTER BETA
b LATIN SMALL LETTER B
c LATIN SMALL LETTER C
- HYPHEN-MINUS
1 DIGIT ONE
2 DIGIT TWO
3 DIGIT THREE
  SPACE
김 HANGUL SYLLABLE GIM
家 CJK UNIFIED IDEOGRAPH-5BB6
네 HANGUL SYLLABLE NE
  SPACE
김 HANGUL SYLLABLE GIM
밥 HANGUL SYLLABLE BAB

>>> for c in s:
    print(c, unicodedata.name(c), c.isalpha())


가 HANGUL SYLLABLE GA True
나 HANGUL SYLLABLE NA True
다 HANGUL SYLLABLE DA True
- HYPHEN-MINUS False
a LATIN SMALL LETTER A True
β GREEK SMALL LETTER BETA True
b LATIN SMALL LETTER B True
c LATIN SMALL LETTER C True
- HYPHEN-MINUS False
1 DIGIT ONE False
2 DIGIT TWO False
3 DIGIT THREE False
  SPACE False
김 HANGUL SYLLABLE GIM True
家 CJK UNIFIED IDEOGRAPH-5BB6 True
네 HANGUL SYLLABLE NE True
  SPACE False
김 HANGUL SYLLABLE GIM True
밥 HANGUL SYLLABLE BAB True

>>> s
'가나다-aβbc-123 김家네 김밥'
>>> def filterout_hanja_greek(s):
    r = ""
    for c in s:
        if not c.isalpha():
            r += c
            continue
        unicodename = unicodedata.name(c)
        if unicodename.startswith("GREEK") or unicodename.startswith("CJK"):
            continue
        r += c
    return r

>>> filterout_hanja_greek(s)
'가나다-abc-123 김네 김밥'
>>> 

unicodedata lets you look up the Unicode name of a character, and the first word of that name tells you which script the character belongs to (for example HANGUL, LATIN, GREEK, or CJK). You can use this to remove characters from specific scripts.
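The same name-prefix check can also be turned around into a whitelist, which may be closer to what the question asks for (keep Korean and English, drop every other script). This is only a sketch: keep_scripts and ALLOWED_PREFIXES are hypothetical names, and it assumes that matching on the first word of the Unicode name is precise enough for your data.

```python
import unicodedata

# Scripts to keep, identified by the first word of the Unicode character name.
ALLOWED_PREFIXES = ("HANGUL", "LATIN")

def keep_scripts(text, allowed=ALLOWED_PREFIXES):
    """Keep non-letters as-is; keep letters only if their name starts with an allowed prefix."""
    kept = []
    for ch in text:
        if not ch.isalpha():
            kept.append(ch)  # digits, spaces, punctuation pass through unchanged
            continue
        name = unicodedata.name(ch, "")  # "" for characters without a name
        if name.startswith(allowed):     # str.startswith accepts a tuple of prefixes
            kept.append(ch)
    return "".join(kept)

print(keep_scripts("가나다-aβbc-123 김家네 김밥"))  # 가나다-abc-123 김네 김밥
```

Note that the LATIN prefix also matches accented letters such as é (LATIN SMALL LETTER E WITH ACUTE), so restrict further if only a-z and A-Z are wanted.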

There is also a package called hanja that can convert Hanja (Chinese characters) into Hangul.


2022-09-20 18:58



© 2024 OneMinuteCode. All rights reserved.