How can I convert word2vec text-mode output to binary-mode output in Python?

Asked 2 years ago, Updated 2 years ago, 85 views

I have a text file that was written when I vectorized words with word2vec, a program written in C. The file was produced by the else branch of the following code:

for (a = 0; a < vocab_size; a++) {
    fprintf(fo, "%s ", vocab[a].word);
    if (binary) for (b = 0; b < layer1_size; b++) fwrite(&syn0[a * layer1_size + b], sizeof(real), 1, fo);
    else for (b = 0; b < layer1_size; b++) fprintf(fo, "%lf ", syn0[a * layer1_size + b]);
    fprintf(fo, "\n");
}

Source

I have a vector file written by the else branch (text mode), but I now need the file that the if (binary) branch would have written. Each line of the text file has the following layout:

<word><space><number><space><number><space>...

The code I'm writing now is as follows:

# -*- coding: utf-8 -*-
import sys
import struct

# Load the text vector data
title = sys.argv[1:]
i = 0
fp = open('binaryVec.bin', 'wb')
odata = ''

for line in open(title[0]):
    chars = list(map(hex, map(ord, line)))
    print line
    print chars
    odata += struct.pack('s', chars)
fp.write(odata)
fp.close()

Because the data is large, an efficient conversion program would be ideal.

Thank you for your cooperation.

Both the text file and the Python program use UTF-8 as the character encoding. Here is a portion of the text vector file:

多い 0.205392 -0.245325 0.240983 0.533283 0.087030 -0.198588 0.395930 0.331363 -0.212541 0.383991 0.391010 0.140275 0.178444 -0.331018 -0.303288 -0.168199 0.227571 -0.133808 -0.583108 -0.004697 -0.068092 -0.057790 0.199027 -0.443492 0.006436 -0.098054 0.221261 -0.413350 -0.274608 -0.266688 0.198686 -0.347939 -0.272542 -0.005835 0.195161 0.255993 -0.435598 0.083113 -0.061061 -0.602378 0.244479 -0.090220 0.053294 0.225144 0.084010 0.150409 -0.078552 0.184509 0.068329 -0.045706 -0.037543 -0.347720 0.363027 -0.251563 -0.293957 0.201196 -0.062295 -0.102561 -0.093551 -0.212615 -0.000832 -0.071720 -0.404002 0.124075 0.283026 0.108321 -0.177551 -0.681601 0.051641 0.324483 0.078215 -0.282532 0.313095 -0.250052 -0.872598 0.035464 -0.266010 -0.389549 -0.120772 0.243341 -0.255850 0.044791 -0.151454 0.159697 -0.320580 -0.663053 0.167484 0.361221 0.185417 0.342295 0.889678 -0.302563 0.289107 -0.102576 0.263508 -0.012531 0.298031 -0.515175 -0.127688 -0.260832 
可能 0.108951 -0.258674 0.629972 0.311664 -0.077146 -0.124886 -0.096122 0.011065 -0.309206 0.867305 0.633274 0.006818 0.267469 -0.119733 0.521135 -0.064882 -0.018288 -0.010180 -0.729432 -0.028794 -0.299309 -0.141295 0.623287 -0.417451 0.007524 0.092700 0.215297 -0.506577 -0.271396 -0.184997 -0.198890 -0.349385 -0.178141 0.230034 0.141386 0.193577 0.223477 0.341060 -0.165425 -0.397568 0.020117 0.154478 0.313013 0.013119 0.172535 0.277345 -0.347708 0.686350 -0.181311 0.344334 -0.119619 -0.433781 0.426598 -0.588145 -0.155892 0.060375 0.023153 0.062405 0.193624 -0.262037 0.259582 0.140148 -0.697635 -0.071356 0.526129 -0.122136 -0.622095 -0.284502 0.130523 0.427264 0.295688 -0.340023 0.310286 -0.043206 -0.201572 -0.319277 0.377619 0.101276 -0.208789 0.099027 0.056171 -0.081605 -0.523134 0.181316 -0.018701 -0.517925 -0.108934 0.514148 0.504512 0.430822 0.481150 -0.165199 0.472695 0.080885 -0.141376 0.324130 0.128912 -0.219854 -0.160605 -0.224664 

python c word2vec natural-language-processing

2022-09-30 17:28

2 Answers

Here are two ways to write numbers to a binary file in Python. Suppose you have already obtained the data as a list of floats. The first way packs each value with the struct module:

import struct

data = [0.1, 0.2, 0.3]
with open('binaryVec.bin', 'wb') as f:
    for x in data:
        four_bytes = struct.pack('f', x)  # 'f' = 32-bit float
        f.write(four_bytes)

The method above loops over the values in Python, so it is slow. With NumPy's tofile method, you can write the whole array to a binary file in one call.

import numpy as np
X=np.array(data,dtype='float32')
X.tofile('binaryVec.bin')

The binaryVec.bin files produced by these two methods are exactly the same: both store the values as 32-bit floats.
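As a quick check of the 32-bit float point, the sketch below packs the same list with a single struct.pack call and reads it back. The explicit little-endian format '<' is an assumption made here for reproducibility; the snippets above use the machine's native byte order.

```python
import struct

data = [0.1, 0.2, 0.3]

# Pack the whole list in one call; '<' = little-endian (assumed), 'f' = 32-bit float.
packed = struct.pack('<%df' % len(data), *data)

# Each value occupies exactly 4 bytes.
size_in_bytes = len(packed)

# Reading the bytes back gives float32 approximations of the original values.
restored = struct.unpack('<%df' % len(data), packed)
```

Because the values are stored as 32-bit floats, the restored numbers match the originals only to about seven significant digits.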

Supplement: About Reading Text Files

Suppose Vec.txt has the following format (entries separated by spaces):

Japanese 0.1 0.2 0.3

One easy way to parse this file is to use csv.reader.

import csv
with open('Vec.txt', 'rt') as g:
    for row in csv.reader(g, delimiter=' '):  # The input file has only one line, so all the data ends up in row.
        pass
data = [float(x) for x in row[1:] if x]  # Skip any empty field left by a trailing space; now data = [0.1, 0.2, 0.3].
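As an alternative to csv.reader, a line can also be split with str.split(). Called with no argument, it splits on any run of whitespace, so the trailing space that word2vec writes at the end of each line does not produce an empty field. This is only a minimal sketch; the helper name parse_line is hypothetical.

```python
def parse_line(line):
    """Split one text-mode line into (word, list of floats)."""
    # split() with no argument drops the trailing space and the newline.
    fields = line.split()
    return fields[0], [float(x) for x in fields[1:]]


word, data = parse_line("Japanese 0.1 0.2 0.3 \n")
```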


2022-09-30 17:28

I deleted the first half of this answer because the question changed substantially.

I don't know the character encoding of your file, so I'm unsure about that part, but I think you can write the output as follows.
As @ywat pointed out, I simply use the standard library.

import struct

def write(out_filename, word, float_strings):
    # Append one record: the word, a space, the packed 32-bit floats, a newline.
    with open(out_filename, "ab") as fp:
        fp.write(word)
        fp.write(struct.pack('b', 0x20))
        for x in float_strings:
            fp.write(struct.pack('f', float(x)))
        fp.write(struct.pack('b', 0x0a))

def convert(from_filename, to_filename):
    # Create or truncate the output file.
    with open(to_filename, "wb"):
        pass

    # Read bytes so the word is copied unchanged, whatever its encoding.
    with open(from_filename, "rb") as f:
        for line in f:
            split = line.split()
            write(to_filename, split[0], split[1:])


if __name__ == '__main__':
    convert('sampleIn.txt', 'binaryVec.bin')

sampleIn.txt contains the text below.

Data 1.1 2.2
Data2 1.1 2.2

For SJIS input, the resulting binary is as follows.
[Image: binary contents of binaryVec.bin]
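Because the write function above reopens the output file for every line, it may be slow on large inputs. The following sketch keeps the same record layout (word, a space, raw 32-bit floats, a newline) but keeps both files open and packs each line with a single struct.pack call. It also includes a small reader for checking the result without a hex viewer. The names convert_fast, read_records, and dim are hypothetical, and the little-endian format '<' is an assumption (the code above uses the machine's native byte order).

```python
import struct


def convert_fast(from_filename, to_filename):
    """Convert text lines (word followed by numbers) to binary records."""
    # Read bytes so the word is copied unchanged, whatever its encoding.
    with open(from_filename, "rb") as src, open(to_filename, "wb") as dst:
        for line in src:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            values = [float(x) for x in fields[1:]]
            dst.write(fields[0] + b" ")
            # One pack call per line; '<' = little-endian (assumed).
            dst.write(struct.pack("<%df" % len(values), *values))
            dst.write(b"\n")


def read_records(filename, dim):
    """Yield (word_bytes, tuple_of_floats); dim is the vector length."""
    # For simplicity this reads the whole file into memory.
    with open(filename, "rb") as f:
        data = f.read()
    pos = 0
    while pos < len(data):
        # The word runs up to the first space byte after pos.
        end = data.index(b" ", pos)
        word = data[pos:end]
        start = end + 1
        vec = struct.unpack("<%df" % dim, data[start:start + 4 * dim])
        yield word, vec
        # Skip the vector and the newline that closes the record.
        pos = start + 4 * dim + 1
```

For example, convert_fast('sampleIn.txt', 'binaryVec.bin') followed by list(read_records('binaryVec.bin', 2)) should return one (word, vector) pair per input line. The original word2vec.c also writes a header line with the vocabulary size and vector size before the vectors; this sketch does not treat such a line specially.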


2022-09-30 17:28
