python - Ensure that gensim generates the same Word2Vec model for different runs on the same data

In "LDA model generates different topics every time I train on the same corpus", setting np.random.seed(0) makes the LDA model get initialized and trained in the same way every time.

Is it the same for the Word2Vec models from gensim? If the random seed is set to a constant, will different runs on the same dataset produce the same model?

But strangely, it's already giving me the same vector in different instances.

>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> word0 = sentences[0][0]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
>>> exit()
alvas@ubi:~$ python
Python 2.7.11 (default, Dec 15 2015, 16:46:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> word0 = sentences[0][0]
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)

Is it true that the default random seed is fixed? If so, what is the default random seed number? Or is it because I'm testing on a small dataset?

If it's true that the random seed is fixed and different runs on the same data return the same vectors, a link to the canonical code or documentation would be much appreciated.

Yes, the default random seed is fixed to 1, as described by the author in the gensim documentation. The vectors for each word are initialised using a hash of the concatenation of word + str(seed).
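To make the idea of seeding each word's vector from a hash of word + str(seed) concrete, here is a minimal sketch in plain Python 3. It is not gensim's actual code: the names stable_hash and seeded_vector, the use of SHA-256, and the scaling of the values are all assumptions made for illustration.

```python
import hashlib
import random


def stable_hash(text):
    """Hypothetical hash: first 4 bytes of the SHA-256 digest as an unsigned int."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def seeded_vector(word, seed=1, size=10, hashfxn=stable_hash):
    """Build a small pseudo-random vector that depends only on word and seed."""
    # Seed a private generator with the hash of word + str(seed).
    rng = random.Random(hashfxn(word + str(seed)))
    # Values lie in [-0.5 / size, 0.5 / size); this scaling is an assumption.
    return [(rng.random() - 0.5) / size for _ in range(size)]


first = seeded_vector("The")
second = seeded_vector("The")
print(first == second)  # True: same word and same seed give the same vector
```

Because the generator is seeded only from the word and the seed, nothing else in the process can change the starting vector, as long as the hash function itself is stable.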

The hashing function used, however, is Python's rudimentary built-in hash function, which can produce different results if two machines differ in

The above list is not exhaustive. Does it cover your question, though?
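As one concrete illustration of the built-in hash giving different results, Python 3 randomises string hashes on each interpreter launch unless the PYTHONHASHSEED environment variable is fixed. The sketch below assumes Python 3 (the session in the question uses Python 2.7, where this randomisation is off by default), and the helper name hash_in_fresh_interpreter is made up for the example.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(text, hash_seed):
    """Return hash(text) as computed by a new interpreter with PYTHONHASHSEED set."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    code = "print(hash(%r))" % text
    output = subprocess.check_output([sys.executable, "-c", code], env=env)
    return int(output.decode().strip())


same_a = hash_in_fresh_interpreter("The1", 1)
same_b = hash_in_fresh_interpreter("The1", 1)
other = hash_in_fresh_interpreter("The1", 2)
print(same_a == same_b)  # True: a fixed PYTHONHASHSEED makes str hashes repeatable
print(same_a == other)   # compare against a launch with a different hash seed
```

If word vectors are initialised from such a hash, two launches with different hash seeds would start from different vectors even when the Word2Vec seed is identical.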


If you want to ensure consistency, you can provide your own hashing function as an argument to Word2Vec.

A simple (and bad) example would be:

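# Toy hash: uses only the first character, so many words share a value.
def first_char_hash(astring):
    return ord(astring[0])

# Pass the custom function through the hashfxn argument.
model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4, hashfxn=first_char_hash)

print(model[sentences[0][0]])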
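To see why the first-character hash above is a bad choice, the following sketch compares it with a stable alternative on a handful of words. The CRC32-based function is an assumption chosen for illustration, not something gensim ships or recommends, and the word list is made up.

```python
import zlib


def first_char_hash(text):
    """The 'bad' hash: only looks at the first character of the string."""
    return ord(text[0])


def crc32_hash(text):
    """A stable alternative (an assumption): CRC32 of the UTF-8 bytes."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


words = ["The", "This", "That", "jury", "judge"]
# Hash word + str(seed) with seed 1, mirroring how the seed string is built.
bad = {first_char_hash(w + "1") for w in words}
better = {crc32_hash(w + "1") for w in words}
print(len(bad))     # 2: every word starting with the same letter collides
print(len(better))  # number of distinct CRC32 values for the same words
```

A function like crc32_hash could then be passed as hashfxn=crc32_hash in place of the first-character version, since it does not depend on the interpreter, platform, or launch.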

