Hello.
I have a question about the training data for word2vec.
from gensim.models.word2vec import Word2Vec
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/DoosanB/files/master/test.csv', encoding = 'utf-8')
df = pd.DataFrame(df)
model = Word2Vec(df['corpus'].values, sg=1, window=5, min_count=1, workers=4, iter=100)
model_result1 = model.wv.most_similar ("government")
When I run the code above, I get the following error:
KeyError: "word 'government' not in vocabulary"
I don't know how to fix it, so I'm asking here.
word2vec python
The contents of df['corpus'].values, which your code passes to word2vec as training data, look like this. Note that each element is a single string that merely looks like a list:
# df['corpus'].values
[
"['Housing', 'Apartment', 'Real Estate', 'Price', 'Stability', 'EALUDA']",
"['Common people', 'dream', 'hope', 'Ilsoon', 'other', 'real estate', 'price', 'rise', 'common people', 'economy', 'catch']",
...
]
word2vec takes a collection of sentences as its training data.
Each sentence must be a list of words (tokens), like this:
# Training data format that word2vec expects
[
['Housing', 'Apartment', 'Real Estate', 'Price', 'Stability', 'EALUDA'],
['Common people', 'dream', 'hope', 'Ilsoon', 'other', 'real estate', 'price', 'rise', 'common people', 'economy', 'catch'],
...
]
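Before training, it can help to check that the data really has this shape. The helper below is only a sketch; `looks_like_tokenized_corpus` is a hypothetical name used for illustration and is not part of gensim.

```python
def looks_like_tokenized_corpus(corpus):
    """Return True if every sentence is a list/tuple whose items are all str tokens."""
    for sentence in corpus:
        # A plain string is iterable too, but it is NOT a list of tokens.
        if not isinstance(sentence, (list, tuple)):
            return False
        if not all(isinstance(token, str) for token in sentence):
            return False
    return True


good = [["Housing", "Apartment"], ["dream", "hope"]]
bad = ["['Housing', 'Apartment']", "['dream', 'hope']"]  # strings that only look like lists

print(looks_like_tokenized_corpus(good))  # True
print(looks_like_tokenized_corpus(bad))   # False
```

Running a check like this on df['corpus'].values would flag the problem before any training happens.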
When you pass df['corpus'].values to word2vec as training data, each element is a string rather than a list of words.
word2vec does not reject it. It simply iterates over each string, which yields one character at a time, so every character is treated as a "word".
The data that word2vec actually learns from therefore looks like this:
# Data that word2vec actually learns from
[
['[', "'", 'Ju', 'Taek', "'", ',', ' ', "'", 'A', 'Pa', 'T', "'", ',', ' ', "'", 'Bu', 'Dong', 'San', "'", ...],
['[', "'", 'Seo', 'Min', "'", ...],
...
]
Because training proceeds on this character-level data, the vocabulary contains only single characters, not whole words.
# Vocabulary of the trained word2vec model (excerpt)
["", "", "", "3", "4", "G", "H", ..., "Fold", "Jeong", "Je"]
As a result, looking up 'government' raises an error saying the word is not in the vocabulary.
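The effect is easy to reproduce without gensim. The snippet below is a minimal illustration using a made-up sample row; it shows what any code that iterates over the "sentence" will see.

```python
row = "['Housing', 'Apartment']"  # one CSV cell, read back as a plain string

# Iterating over a string yields one character at a time.
tokens = list(row)
print(tokens[:6])  # ['[', "'", 'H', 'o', 'u', 's']

# A vocabulary built from these tokens contains characters only.
vocab = set(tokens)
print("Housing" in vocab)  # False
```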
To resolve this issue, replace df['corpus'].values with map(eval, df['corpus'].values).
Applying eval to each element converts every string in df['corpus'].values into an actual list of words.
# df['corpus'].values[:2]  (each element is a string)
"['Housing', 'Apartment', 'Real Estate', 'Price', 'Stability', 'EALUDA']"
"['Common people', 'dream', 'hope', 'Ilsoon', 'other', 'real estate', 'price', 'rise', 'common people', 'economy', 'catch']"
# list(map(eval, df['corpus'].values[:2]))  (each element is a list of words)
['Housing', 'Apartment', 'Real Estate', 'Price', 'Stability', 'EALUDA']
['Common people', 'dream', 'hope', 'Ilsoon', 'other', 'real estate', 'price', 'rise', 'common people', 'economy', 'catch']
The complete code with the above workaround is as follows.
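from gensim.models.word2vec import Word2Vec
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/DoosanB/files/master/test.csv', encoding='utf-8')
# Each cell of df['corpus'] is a string that looks like a list; eval turns it into a real list.
# list(...) makes the corpus re-iterable, because Word2Vec reads it more than once.
sentences = list(map(eval, df['corpus'].values))
# Note: gensim 4.0 and later renamed the `iter` argument to `epochs`.
model = Word2Vec(sentences, sg=1, window=5, min_count=1, workers=4, iter=100)
model_result1 = model.wv.most_similar("government")
print(model_result1)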
# Output results: [('Improvement', 0.3173496127128601), ('Power', 0.2743409156799316), ('Bomb', 0.2881121039390564), ('Final', 0.2811095118522644), ('Seoul City', 0.2597326040267604604), ...]
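One detail worth knowing about map(eval, ...): a map object is a one-shot iterator, while Word2Vec reads the corpus more than once (once to build the vocabulary, then again for each training epoch). Materializing the result with list(...) avoids handing it an already exhausted iterator. The snippet below illustrates this with made-up rows and plain Python only.

```python
rows = ["['a', 'b']", "['c']"]

one_shot = map(eval, rows)     # lazy, single-use iterator
first_pass = list(one_shot)    # e.g. the vocabulary-building pass
second_pass = list(one_shot)   # e.g. a training pass: nothing is left

print(len(first_pass), len(second_pass))  # 2 0

reusable = list(map(eval, rows))  # a list can be iterated any number of times
print(len(list(reusable)), len(list(reusable)))  # 2 2
```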
I hope this helps. The explanation ran a bit long, so if anything is unclear, please leave a comment!