Extracting only the names from a Python string

The goal is to pull only the Korean and English names out of a string that mixes Korean names, English names, special characters, and spaces, and to keep just one copy of any name that appears more than once.

So far I have loaded my text file, removed the unnecessary special characters, and joined everything into a single string. The code is as follows:

import re

# Raw string so the backslashes in the Windows path are kept literally
path = r'E:\Data Science\Personal_Project\Church\Data\original.txt'

def open_text(path):
    with open(path, "r", encoding='euc-kr') as f:
        string = f.read()
    # Delete '.', ';', '*' and newlines
    unicode_line = string.translate({ord(c): None for c in '.;*\n'})
    # Split on dashes and commas
    cleaned = re.split('-|,', unicode_line)

    print(type(unicode_line), type(cleaned))
    return cleaned

Here is my question. I would like to add one of the following to the function above:

1) If there is a word in front of a dash line (for example, 'attendance ----'), I want to remove that word before splitting on the dash.

2) Or I'd like to make a list such as [attendance, early leave, late leave, vacation] and remove every word that appears in it.

I would appreciate it if you could tell me whether there is a better or more Pythonic way to do this!

For your convenience, here is a sample of the text.

 January 20th status


** Attendance
-----------

Kang Dae-se, Kang Bong-sung, Kang Sang-moo, Kang Seok-sim Unknown 4, Daniel Lee, Dong H Kim, Jeannie Oh, Jessica Yi, McAleer Chong, Shiu K Lew, Song H Kwak, Steve Jang, Tok Kim, 







** absence
---------

vacation, unauthorized, unpaid leave, emergency
------------------------------------------------------------------------------------------- 
Go Haeng-ok, Kim Gye-soon

string regex python

2022-09-22 19:31

1 Answer

I solved it, haha. While reading the file line by line, I skip every line that starts with * or -, put the remaining non-name words into a set, and add to the name set only the entries that are not in it.

If you have a better way, please feel free to let me know.

non_names = set("attendance, absence, vacation, unauthorized, unpaid leave, emergency".split(", "))
with open("text.txt") as f:
    next(f)  # skip first line
    names = set()  # a set keeps only one copy of each name
    for line in f:
        if not line.startswith(("*", "-")):
            # split on "," and strip each piece so a trailing comma leaves no residue
            for name in line.split(","):
                name = name.strip()
                if name and name not in non_names:
                    names.add(name)
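
If you would rather follow option 1) from the question and drop each label together with its dash line, here is a minimal regex sketch. It has not been run against the real file. extract_names and LABEL_AND_RULE are names I made up, and the sketch assumes that the line directly above a dash line (or the text in front of the dashes on the same line) is always a label and never a list of names.

```python
import re

# Assumption: the line directly above a dash rule (or the text in front of
# a dash run on the same line) is always a label, never a list of names.
LABEL_AND_RULE = re.compile(r"(?m)^.*\n?-{2,}[ \t]*$")

def extract_names(text):
    """Return the unique names in text, in order of first appearance."""
    # Drop the title line, as next(f) does in the loop above.
    _, _, body = text.partition("\n")
    # Remove every label together with its dash rule.
    body = LABEL_AND_RULE.sub("", body)
    # Split on commas and newlines; hyphens inside names are untouched.
    parts = (p.strip(" \t\r.;*") for p in re.split(r"[,\n]", body))
    # dict.fromkeys removes duplicates while keeping insertion order.
    return list(dict.fromkeys(p for p in parts if p))
```

Because the labels are removed by position rather than by a word list, a new label such as "early leave" needs no code change, but a dash line placed directly under a line of names would delete those names.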

2022-09-22 19:31
