I have a question about word2vec training data

Hi, how are you?

I would like to ask about the training data for word2vec.

from gensim.models.word2vec import Word2Vec
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/DoosanB/files/master/test.csv', encoding = 'utf-8')
df = pd.DataFrame(df)

model = Word2Vec(df['corpus'].values, sg=1, window=5, min_count=1, workers=4, iter=100)

model_result1 = model.wv.most_similar("government")

When I run the code above, I get the following error:

KeyError: "word 'government' not in vocabulary"

I'm asking because I don't know how to fix it.

word2vec python

2022-09-21 12:28

1 Answer

The values in df['corpus'].values that your code passes to Word2Vec for training look like this:

# df['corpus'].values (each element is a string, not a list)
[
    "['Housing', 'Apartment', 'Real Estate', 'Price', 'Stability', 'EALUDA']",
    "['Common people', 'dream', 'hope', 'Ilsoon', 'other', 'real estate', 'price', 'rise', 'common people', 'economy', 'catch']",
    ...
]

Word2Vec expects its training data to be a sequence of sentences.

Each sentence must be a list of words (tokens).

Like this:

# Training data format that Word2Vec expects
[
    ["Housing", "Apartment", "Real Estate", "Price", "Stability", "EALUDA"],
    ["Common people", "dream", "hope", "Ilsoon", "other", "real estate", "price", "rise", "common people", "economy", "catch"],
    ...
]
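The expected shape can be checked before training. The following is a minimal sketch in plain Python; the helper name is_tokenized_corpus and the sample rows are made up for illustration and are not part of gensim.

```python
def is_tokenized_corpus(corpus):
    """Return True if every sentence is a list whose items are all strings."""
    return all(
        isinstance(sentence, list)
        and all(isinstance(token, str) for token in sentence)
        for sentence in corpus
    )


# Hypothetical sample data: a correctly tokenized corpus ...
good = [["Housing", "Apartment"], ["Common people", "dream"]]
# ... and the same rows stored as text, as they come out of a CSV column.
bad = ["['Housing', 'Apartment']", "['Common people', 'dream']"]

print(is_tokenized_corpus(good))  # True
print(is_tokenized_corpus(bad))   # False
```

Running a check like this on the corpus right before the Word2Vec call would flag the problem early.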

When you pass df['corpus'].values to Word2Vec as training data, each element is a single string rather than a list of words.

Because a string is not in the expected sentence format, Word2Vec iterates over it character by character and treats every character as a word.

So the data that Word2Vec actually trains on is df['corpus'].values broken up as follows:

# Data that Word2Vec actually trains on: one token per character
# (syllables such as 'Ju' and 'Taek' are single characters in the source data)
[
    ['[', "'", 'Ju', 'Taek', "'", ',', ' ', "'", 'A', 'Pa', 'T', "'", ',', ' ', "'", 'Bu', 'Dong', 'San', "'", ',', ' ', ...],
    ['[', "'", 'Seo', 'Min', "'", ',', ' ', ...],
    ...
]

Because training runs on data in this state, the vocabulary ends up containing only single characters.

# Vocabulary of the trained Word2Vec model (excerpt)
["", "", "", "3", "4", "G", "H", ..., "Fold", "Jeong", "Je"]

This is why you get the error saying that the word 'government' is not in the vocabulary.
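The effect is easy to reproduce without gensim. In this small sketch the row is a made-up example, and list(row) stands in for the way a string gets iterated one character at a time.

```python
# Hypothetical row, stored as text the way a CSV column holds it.
row = "['Housing', 'Apartment', 'Real Estate']"

# Iterating over a str yields single characters, so these become the "words".
tokens_seen = list(row)
print(tokens_seen[:6])  # ['[', "'", 'H', 'o', 'u', 's']

# The resulting vocabulary is the set of distinct characters.
vocab = sorted(set(tokens_seen))
print("Housing" in vocab)  # False
```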

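To resolve this issue, replace df['corpus'].values with list(map(eval, df['corpus'].values)).

Applying eval to each element turns every string in df['corpus'].values into an actual list. Wrapping the result in list(...) matters as well: a bare map object can be read only once, but Word2Vec goes over the corpus more than once.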

# df['corpus'].values[:2]  (two strings)
"['Housing', 'Apartment', 'Real Estate', 'Price', 'Stability', 'EALUDA']"
"['Common people', 'dream', 'hope', 'Ilsoon', 'other', 'real estate', 'price', 'rise', 'common people', 'economy', 'catch']"

# list(map(eval, df['corpus'].values[:2]))  (two lists)
[
    ["Housing", "Apartment", "Real Estate", "Price", "Stability", "EALUDA"],
    ["Common people", "dream", "hope", "Ilsoon", "other", "real estate", "price", "rise", "common people", "economy", "catch"]
]
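As a side note, eval executes whatever expression the string contains, so for data from a file you do not fully control, ast.literal_eval from the standard library is a stricter alternative that only accepts Python literals. A minimal sketch, with made-up rows standing in for the CSV column:

```python
import ast

# Hypothetical rows standing in for df['corpus'].values.
raw_rows = [
    "['Housing', 'Apartment', 'Real Estate']",
    "['Common people', 'dream', 'hope']",
]

# literal_eval parses list/str/number literals only and raises on anything else.
sentences = [ast.literal_eval(row) for row in raw_rows]

print(sentences[0])     # ['Housing', 'Apartment', 'Real Estate']
print(len(sentences))   # 2
```

If this approach fits your data, the same list comprehension could take the place of the map(eval, ...) call.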

The complete code with the above workaround is as follows.

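from gensim.models.word2vec import Word2Vec
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/DoosanB/files/master/test.csv', encoding='utf-8')

# Each row of 'corpus' is a string such as "['a', 'b']"; eval turns it into a real list.
# list(...) makes the corpus re-iterable, since Word2Vec reads it more than once.
sentences = list(map(eval, df['corpus'].values))

# iter=100 is the gensim 3.x parameter name; gensim 4.x calls it epochs.
model = Word2Vec(sentences, sg=1, window=5, min_count=1, workers=4, iter=100)
model_result1 = model.wv.most_similar("government")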

print(model_result1)
# Output: [('Improvement', 0.3173496127128601), ('Power', 0.2743409156799316), ('Bomb', 0.2881121039390564), ('Final', 0.2811095118522644), ('Seoul City', 0.2597326040267604604), ...]
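One thing to watch when handing the result of map() to Word2Vec: a map object is a one-shot iterator, while training reads the corpus once to build the vocabulary and then again for every epoch. The sketch below shows the difference in plain Python; the rows are made up, and ast.literal_eval is used as a stricter stand-in for eval.

```python
import ast

# Hypothetical rows standing in for df['corpus'].values.
raw_rows = ["['Housing', 'Apartment']", "['Common people', 'dream']"]

# A bare map object is exhausted after the first full pass.
one_shot = map(ast.literal_eval, raw_rows)
first_pass = list(one_shot)
second_pass = list(one_shot)
print(len(first_pass), len(second_pass))  # 2 0

# Materializing it as a list gives a corpus that can be read repeatedly.
restartable = list(map(ast.literal_eval, raw_rows))
print(len(list(restartable)), len(list(restartable)))  # 2 2
```

Materializing the corpus with list(...) before passing it to Word2Vec avoids training on an already exhausted iterator.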

Was this helpful? The explanation got rather long. If anything is unclear, please leave a comment!

2022-09-21 12:28
