Stopwords are not being removed during natural language preprocessing (edited)

Asked 2 years ago, Updated 2 years ago, 71 views

Hi everyone, I'm asking because stopwords are not being removed during natural language preprocessing in a Jupyter notebook.

import konlpy
import re

def tokenize_korean_text(text):
    # Strip everything except word characters, whitespace, and basic punctuation
    text = re.sub(r'[^,.?!\w\s]', '', text)

    # Part-of-speech tagging with the Okt analyzer
    okt = konlpy.tag.Okt()
    Okt_morphs = okt.pos(text)

    # Keep only verbs and nouns
    words = []
    for word, pos in Okt_morphs:
        if pos == 'Verb' or pos == 'Noun':
            words.append(word)

    return words


tokenized_list = []

for text in df['Keyword']:
    tokenized_list.append(tokenize_korean_text(text))

print(len(tokenized_list))
print(tokenized_list[1800])

With tokenized_list built as above,

stop_words = ["It is", "We", "Hal", "Su", "Do", "Everyone", "Daehan", "Hae", "Su", "It is", "Hae"]  # Korean stopwords, shown here in translation

I specified the stopwords as shown above, and then

clean_words = [i for i in tokenized_list if i not in stop_words]

I did it like this.

Then, to run the text analysis,

from gensim import corpora
from gensim.models import LdaModel

dictionary = corpora.Dictionary(clean_words)                 # map each token to an integer id
dictionary.filter_extremes(no_below=2, no_above=0.05)        # drop rare and overly frequent tokens
corpus = [dictionary.doc2bow(text) for text in clean_words]  # bag-of-words vector per document

ldamodel = LdaModel(corpus, num_topics=8, id2word=dictionary, passes=20, iterations=500)
ldamodel.print_topics(num_words=8)

I executed the code above, but stopwords such as 'Ha' and 'Su' that I designated still appear in the results.

Is the code wrong?

I'd really appreciate your help.

jupyter-notebook

2022-09-20 10:21

2 Answers

# What you should do
clean_words = []
for i in tokenized_list:
    a = 0
    for ii in stop_words:
        if ii in i:            # this stopword appears somewhere in the token list
            a += 1
    if a < 1:                  # keep this token list only if it contains no stopword
        clean_words.append(i)

# What you are actually doing
clean_words = []
for i in tokenized_list:
    if i not in stop_words:    # i is a whole list of tokens, never equal to a stopword string,
        clean_words.append(i)  # so every list passes the test and nothing gets filtered out


2022-09-20 10:21

tokenized_list is not a list of strings; each of its elements is itself a list of the words from one string.
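
In other words, the membership check has to happen one level deeper, on each word rather than on each list. As a minimal sketch, assuming tokenized_list and stop_words as defined in the question, filtering at the word level while keeping the list-of-lists shape that corpora.Dictionary and doc2bow expect could look like this:

clean_words = []
for doc in tokenized_list:
    # keep every token that is not a stopword, one document at a time
    clean_words.append([w for w in doc if w not in stop_words])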


2022-09-20 10:21

