python - Ensure that gensim generates the same Word2Vec model for different runs on the same data


As discussed in "LDA model generates different topics every time I train on the same corpus", setting np.random.seed(0) ensures that the LDA model is always initialized and trained in exactly the same way.
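For concreteness, this is roughly what that seeding looks like in practice (a minimal sketch; the toy corpus and variable names are purely for illustration):

import numpy as np
from gensim import corpora, models

np.random.seed(0)  # fix numpy's global RNG before building the model

texts = [['human', 'computer', 'interaction'],
         ['graph', 'trees', 'minors'],
         ['human', 'graph', 'computer']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# with the seed fixed above, repeated runs of this script give the same topics
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
print lda.show_topics()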

Is it the same for the Word2Vec models from gensim? By setting the random seed to a constant, would different runs on the same dataset produce the same model?

But strangely, it's already giving me the same vectors across different instances.

>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> word0 = sentences[0][0]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
>>> exit()

alvas@ubi:~$ python
Python 2.7.11 (default, Dec 15 2015, 16:46:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> word0 = sentences[0][0]
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)

Is it true that the default random seed is fixed? If so, what is the default random seed number? Or is this just because I'm testing on a small dataset?

If it's true that the random seed is fixed and different runs on the same data return the same vectors, a link to the canonical code or documentation would be much appreciated.

Yes, the default random seed is fixed to 1, as described by the author in https://radimrehurek.com/gensim/models/word2vec.html. The vectors for each word are initialised using a hash of the concatenation of word + str(seed).
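As a minimal sketch of making that explicit (seed and workers are standard Word2Vec parameters; a single worker is used here because multi-threaded training can introduce small run-to-run differences from thread scheduling):

from nltk.corpus import brown
from gensim.models import Word2Vec

sentences = brown.sents()[:100]

# seed defaults to 1; passing it explicitly just makes the choice visible
model_a = Word2Vec(sentences, size=10, window=5, min_count=5, workers=1, seed=1)
model_b = Word2Vec(sentences, size=10, window=5, min_count=5, workers=1, seed=1)

# should print True when both models are trained on the same machine/interpreter
print (model_a[sentences[0][0]] == model_b[sentences[0][0]]).all()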

The hashing function used, however, is Python's rudimentary built-in hash function, and it can produce different results if two machines differ in:

- 32 vs 64 bit builds
- Python versions
- operating systems / interpreters

The above list is not exhaustive. Does it cover your question though?
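You can see the dependency directly. Going by the word + str(seed) scheme described above, comparing the built-in hash of that string across machines (or Python builds/versions) is a rough check of whether the initial vectors will match:

word = 'The'   # first token of the Brown sample above
seed = 1       # gensim's default

# run this on two machines or Python builds; if the numbers differ,
# the initial vector for 'The' will differ too
print hash(word + str(seed))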

EDIT

If you want to ensure consistency, you can provide your own hashing function as an argument to Word2Vec.

A simple (and bad) example would be:

def hash(astring):
    return ord(astring[0])

model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4, hashfxn=hash)

print model[sentences[0][0]]
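A sturdier alternative (not part of the original suggestion, just one common way to get a hash that is stable across machines, builds and Python versions) is to back hashfxn with hashlib:

import hashlib

def stable_hash(astring):
    # md5 is deterministic across machines and Python builds, unlike hash();
    # mask to 32 bits so the result fits numpy's RandomState seed range
    return int(hashlib.md5(astring.encode('utf-8')).hexdigest(), 16) & 0xffffffff

model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=1,
                 hashfxn=stable_hash, seed=1)

print model[sentences[0][0]]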
