python - Ensure that gensim generates the same Word2Vec model for different runs on the same data -
In "LDA model generates different topics everytime i train on the same corpus", setting np.random.seed(0) ensures that the LDA model is always initialized and trained in the same way.
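For context, the np.random.seed(0) trick from that question looks roughly like the sketch below. The toy corpus and variable names are illustrative, not from the original post; seeding numpy's global RNG this way is what older gensim versions relied on, while newer releases also expose a random_state argument on LdaModel:

import numpy as np
from gensim import corpora
from gensim.models.ldamodel import LdaModel

# Toy corpus: a list of tokenised documents (illustrative only).
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["graph", "trees", "minors", "survey"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Seeding numpy's global RNG before training makes the LDA
# initialisation deterministic across runs.
np.random.seed(0)
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)
print(lda.show_topics())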
Is it the same for the Word2Vec models from gensim? By setting the random seed to a constant, would different runs on the same dataset produce the same model?
But strangely, it's already giving me the same vectors at different instances.
>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> word0 = sentences[0][0]
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
>>> exit()
alvas@ubi:~$ python
Python 2.7.11 (default, Dec 15 2015, 16:46:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.corpus import brown
>>> from gensim.models import Word2Vec
>>> sentences = brown.sents()[:100]
>>> model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4)
>>> word0 = sentences[0][0]
>>> model[word0]
array([ 0.04985042,  0.02882229, -0.03625415, -0.03165979,  0.06049283,
        0.01207791,  0.04722737,  0.01984878, -0.03026265,  0.04485954], dtype=float32)
>>> model = Word2Vec(sentences, size=20, window=5, min_count=5, workers=4)
>>> model[word0]
array([ 0.02596745,  0.01475067, -0.01839622, -0.01587902,  0.03079717,
        0.00586761,  0.02367715,  0.00930568, -0.01521437,  0.02213679,
        0.01043982, -0.00625582,  0.00173071, -0.00235749,  0.01309298,
        0.00710233, -0.02270884, -0.01477827,  0.01166443,  0.00283862], dtype=float32)
Is it true that the default random seed is fixed? If so, what is the default random seed number? Or is it because I'm testing on a small dataset?
If it's true that the random seed is fixed and different runs on the same data return the same vectors, a link to the canonical code or documentation would be appreciated.
Yes, the default random seed is fixed to 1, as described by the author in https://radimrehurek.com/gensim/models/word2vec.html. The vectors for each word are initialised using a hash of the concatenation of word + str(seed).
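To make that explicit rather than relying on the default, Word2Vec takes a seed argument. Here is a minimal sketch, assuming the same Brown-corpus setup as the question; seed=42 is an arbitrary illustrative value, and workers=1 is used because multi-threaded training can introduce small ordering differences even with a fixed seed:

from nltk.corpus import brown
from gensim.models import Word2Vec

sentences = brown.sents()[:100]

# Passing the same seed (and a single worker thread) should yield
# identical vectors across runs on the same machine.
model_a = Word2Vec(sentences, size=10, window=5, min_count=5,
                   workers=1, seed=42)
model_b = Word2Vec(sentences, size=10, window=5, min_count=5,
                   workers=1, seed=42)

word0 = sentences[0][0]
print(all(model_a[word0] == model_b[word0]))  # expected: True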
The hashing function used, however, is Python's rudimentary built-in hash function, and it can produce different results if two machines differ in:
- 32 vs 64 bit, reference
- Python versions, reference
- operating systems / interpreters, reference1, reference2
The above list is not exhaustive. Does it cover your question, though?
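You can check directly what your own machine produces for this hash. A quick sanity check (the word and seed here are illustrative; the number printed will be specific to your environment):

# The initialisation described above boils down to something like this:
seed = 1
word = "The"
print(hash(word + str(seed)))
# On CPython 2 this is stable per machine/build; on Python 3.3+,
# string hash randomisation makes it vary per process unless
# PYTHONHASHSEED is pinned.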
EDIT
If you want to ensure consistency, you can provide your own hashing function as the hashfxn argument in Word2Vec.
A simple (and bad) example would be:
def hash(astring):
    return ord(astring[0])

model = Word2Vec(sentences, size=10, window=5, min_count=5, workers=4, hashfxn=hash)
print model[sentences[0][0]]
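If portability across machines matters, an alternative (not from the original answer) is to derive the hash from a cryptographic digest, which is identical on every platform and Python version. stable_hash below is a hypothetical helper name:

import hashlib
from nltk.corpus import brown
from gensim.models import Word2Vec

def stable_hash(astring):
    # md5 yields the same digest on every platform and Python
    # version, unlike the built-in hash().
    return int(hashlib.md5(astring.encode("utf-8")).hexdigest(), 16)

sentences = brown.sents()[:100]
model = Word2Vec(sentences, size=10, window=5, min_count=5,
                 workers=1, hashfxn=stable_hash)
print(model[sentences[0][0]])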