Hi everyone, I'm asking because my stop words are not being removed during natural language preprocessing in a Jupyter notebook.
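import konlpy
import re

def tokenize_korean_text(text):
    # Strip everything except basic punctuation, word characters, and whitespace
    text = re.sub(r'[^,.?!\w\s]', '', text)
    okt = konlpy.tag.Okt()
    Okt_morphs = okt.pos(text)
    words = []
    # Keep only verbs and nouns
    for word, pos in Okt_morphs:
        if pos == 'Verb' or pos == 'Noun':
            words.append(word)
    return words

tokenized_list = []
for text in df['Keyword']:
    tokenized_list.append(tokenize_korean_text(text))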
That builds tokenized_list. Then
stop_words = ["It is", "It is", "We", "Hal", "Su", "Do", "Everyone", "Daehan", "Hae", "Su", "It is", "Hae"]
I defined the stop words as shown above, and
clean_words = [i for i in tokenized_list if i not in stop_words]
filtered them out like this.
Then, to run the topic analysis,
from gensim import corpora
from gensim.models import LdaModel
dictionary = corpora.Dictionary(clean_words)
dictionary.filter_extremes(no_below=2, no_above=0.05)
corpus = [dictionary.doc2bow(text) for text in clean_words]
ldamodel = LdaModel(corpus, num_topics=8, id2word=dictionary, passes=20, iterations=500)
I ran the code above, but words such as 'ha' and 'su' that I listed as stop words still appear in the results.
Is the code wrong?
I'd really appreciate your help.
# What to do
clean_words = []
for i in tokenized_list:
    a = 0
    # Count how many stop words occur in this token list
    for ii in stop_words:
        if ii in i:
            a += 1
    if a < 1:
        clean_words.append(i)
# What your code is doing
clean_words = []
for i in tokenized_list:
    if stop_words not in i:
        clean_words.append(i)
tokenized_list is not a list of strings but a list of lists: each element i is the list of words extracted from one text, so i can never equal a single stop word.
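If the goal is to keep every document and only drop the stop words inside it, a nested comprehension is one way to do that. This is a minimal sketch, not a drop-in fix: the toy values below are hypothetical stand-ins for the output of tokenize_korean_text, and stop_set and unchanged are names introduced here for illustration.

```python
# Hypothetical toy data standing in for the real tokenized_list (one token list per text).
tokenized_list = [["We", "study", "Su"], ["Hae", "topic", "model"]]
stop_words = ["We", "Su", "Hae"]

# Original approach: i is a whole token list, which never equals a stop-word string,
# so every document passes the test and nothing is removed.
unchanged = [i for i in tokenized_list if i not in stop_words]

# Per-token filtering: iterate over the words inside each document instead.
stop_set = set(stop_words)  # a set makes each membership test fast
clean_words = [[w for w in doc if w not in stop_set] for doc in tokenized_list]

print(unchanged)
print(clean_words)
```

Matching is exact, so each stop word has to be written in the same form that okt.pos returns for it. The result is still a list of token lists, which is the shape corpora.Dictionary and doc2bow expect.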
