Hi everyone. I'm asking because my stop words are not being removed during natural-language preprocessing in a Jupyter notebook.
import konlpy
import re

def tokenize_korean_text(text):
    # Strip everything except word characters, whitespace, and basic punctuation
    text = re.sub(r'[^,.?!\w\s]', '', text)
    okt = konlpy.tag.Okt()
    Okt_morphs = okt.pos(text)
    words = []
    for word, pos in Okt_morphs:
        # Keep only verbs and nouns
        if pos == 'Verb' or pos == 'Noun':
            words.append(word)
    return words

# df['Keyword'] holds the raw text of each document
tokenized_list = []
for text in df['Keyword']:
    tokenized_list.append(tokenize_korean_text(text))

print(len(tokenized_list))
print(tokenized_list[1800])
I build tokenized_list as above, and then specify the stop words (they are Korean words, shown here in translated form):

stop_words = ["It is", "It is", "We", "Hal", "Su", "Do", "Everyone", "Daehan", "Hae", "Su", "It is", "Hae"]

and remove them like this:

clean_words = [i for i in tokenized_list if i not in stop_words]
Then, to run the text analysis:

from gensim import corpora
from gensim.models import LdaModel

dictionary = corpora.Dictionary(clean_words)
dictionary.filter_extremes(no_below=2, no_above=0.05)
corpus = [dictionary.doc2bow(text) for text in clean_words]
ldamodel = LdaModel(corpus, num_topics=8, id2word=dictionary, passes=20, iterations=500)
ldamodel.print_topics(num_words=8)
I executed the code above, but words such as 'ha' and 'su' that I designated as stop words still appear in the results.
Is the code wrong?
I'd really appreciate your help.
# What to do
clean_words = []
for i in tokenized_list:
    a = 0
    # Count how many stop words appear in this document's word list
    for ii in stop_words:
        if ii in i:
            a += 1
    # Keep only documents that contain none of the stop words
    if a < 1:
        clean_words.append(i)
# What you're doing
clean_words = []
for i in tokenized_list:
    # stop_words is a whole list, so it is never an element of i; nothing gets filtered
    if stop_words not in i:
        clean_words.append(i)
tokenized_list is not a list of strings; it is a list of word lists (one list of tokens per document), so testing a whole document against stop_words never removes individual words.
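For reference, here is a minimal sketch of word-level filtering under that assumption (tokenized_list is a list of token lists and stop_words is the list defined above; stop_set is just an illustrative helper name). It removes stop words inside each document instead of testing the whole document at once:

# Filter stop words out of each document's token list
stop_set = set(stop_words)  # set lookup is faster than list lookup
clean_words = [
    [word for word in doc if word not in stop_set]
    for doc in tokenized_list
]
# clean_words keeps the per-document structure, so it can still be passed
# to corpora.Dictionary and doc2bow as before.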