KeyError in word2vec

Based on the reference URL below, I would like to run word2vec.
The files similars.py and train.py and the procedure that follows are all taken from that site.

After tokenizing Aozora Bunko texts into space-separated words with MeCab, I trained a model on them using the file below.
The generated model is saved as data22.model.

train.py

# -*- coding: utf-8 -*-

from gensim.models import word2vec
import logging
import sys

# Log training progress to the console.
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',
                    level=logging.INFO)

# argv[1]: tokenized corpus (one sentence per line), argv[2]: output model path.
sentences = word2vec.LineSentence(sys.argv[1])
model = word2vec.Word2Vec(sentences,
                          sg=1,
                          size=100,
                          min_count=1,
                          window=10,
                          hs=1,
                          negative=0)
model.save(sys.argv[2])

I ran train.py with Python and saved the resulting model as data22.model.

$ python train.py data22.txt data22.model
2017-04-08 01:49:31,381:INFO:collecting all words and their counts
2017-04-08 01:49:31,382:INFO:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-04-08 01:49:31,389:INFO:collected 1684 word types from a corpus of 9554 raw words and 228 sentences
2017-04-08 01:49:31,389:INFO:Loading a fresh vocabulary
2017-04-08 01:49:31,395:INFO:min_count=1 retains 1684 unique words (100% of original 1684, drops 0)
2017-04-08 01:49:31,395:INFO:min_count=1 leaves 9554 word corpus (100% of original 9554, drops 0)
2017-04-08 01:49:31,405:INFO:deleting the raw counts dictionary of 1684 items
2017-04-08 01:49:31,406:INFO:sample=0.001 downsamples 45 most-common words
2017-04-08 01:49:31,407:INFO:downsampling leaves estimated 5687 word corpus (59.5% of prior 9554)
2017-04-08 01:49:31,407:INFO:estimated required memory for 1684 words and 100 dimensions: 2526000 bytes
2017-04-08 01:49:31,410:INFO:constructing a huffman tree from 1684 words
2017-04-08 01:49:31,496:INFO:built huffman tree with maximum node depth 13
2017-04-08 01:49:31,496:INFO:resetting layer weights
2017-04-08 01:49:31,544:INFO:training model with 3 workers on 1684 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=0 window=10
2017-04-08 01:49:31,544:INFO:expecting 228 sentences, matching count from corpus used for vocabulary survey
2017-04-08 01:49:31,708:INFO:worker thread finished; awaiting finish of 2 more threads
2017-04-08 01:49:31,766:INFO:worker thread finished; awaiting finish of 1 more threads
2017-04-08 01:49:31,767:INFO:worker thread finished; awaiting finish of 0 more threads
2017-04-08 01:49:31,767:INFO:training on 47770 raw words (28489 effective words) took 0.2s, 128642 effective words/s
2017-04-08 01:49:31,767:WARNING:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
2017-04-08 01:49:31,767:INFO:saving Word2Vec object under data22.model, separately None
2017-04-08 01:49:31,767:INFO:not storing attribute syn0norm
2017-04-08 01:49:31,767:INFO:not storing attribute cum_table
2017-04-08 01:49:31,870:INFO:saved data22.model

The script similars.py lists the words most similar to a specified word.

similars.py

# -*- coding: utf-8 -*-

from gensim.models import word2vec
import sys

model=word2vec.Word2Vec.load(sys.argv[1])
results=model.most_similar(positive=sys.argv[2], topn=10)

for result in results:
    print(result[0], '\t', result[1])

I ran similars.py on the model file I had just created, passing the word 本 ("book") as the argument, and got the error below. It looks as though the word given as the argument is not being recognized, but I don't know the cause.

$ python similars.py data22.model 本
Traceback (most recent call last):
  File "similars.py", line 7, in <module>
    results=model.most_similar(positive=sys.argv[2], topn=10)
  File "/usr/local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1285, in most_similar
    return self.wv.most_similar(positive, negative, topn, restrict_vocab, indexer)
  File "/usr/local/lib/python2.7/site-packages/gensim/models/keyedvectors.py", line 97, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word '\xe6\x9c\xac' not in vocabulary"
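
For reference, the escaped bytes in the KeyError message can be decoded to check which word was actually looked up. This is only a minimal sketch, assuming the shell passed the argument as UTF-8. It suggests the key was a raw three-byte string rather than a single decoded character.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Bytes copied from the KeyError message above.
raw = b"\xe6\x9c\xac"

# Assumption: the shell encoded the command-line argument as UTF-8.
word = raw.decode("utf-8")

print(word == u"\u672c")    # True: the bytes spell the single character U+672C (本)
print(len(raw), len(word))  # 3 1
```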

I would appreciate any tips on how to solve this problem. Thank you for your cooperation.

python word2vec mecab

2022-09-30 17:29

1 Answer

I ran into a similar error and resolved it as follows:
word=unicode(sys.argv[2], 'utf-8')
results=model.most_similar(positive=word,topn=10)
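
The likely reason this works: on Python 2, sys.argv holds byte strings, while the model's vocabulary keys are presumably unicode strings (the corpus is decoded when it is read), so a byte-string key never matches. The sketch below illustrates the mismatch without gensim. The `vocab` dictionary is only a stand-in for the model vocabulary, and `to_text` is a hypothetical helper, not part of gensim. It is written to run on both Python 2 and Python 3, since the `unicode` built-in exists only on Python 2.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def to_text(value, encoding="utf-8"):
    """Return `value` as a text (unicode) string.

    Byte strings are decoded with `encoding`; text strings pass through.
    Hypothetical helper for illustration, not part of gensim.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Stand-in for the model vocabulary: keys are text strings.
# U+672C is the character 本; the value is a dummy index.
vocab = {"\u672c": 0}

# What Python 2 places in sys.argv for 本 on a UTF-8 terminal.
raw_arg = b"\xe6\x9c\xac"

print(raw_arg in vocab)           # False: the byte string does not match the text key
print(to_text(raw_arg) in vocab)  # True: the decoded text matches
```

In similars.py this would amount to passing to_text(sys.argv[2]) as the positive argument. On Python 3, sys.argv already contains text strings, so the helper returns its argument unchanged.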

2022-09-30 17:29
