python - Top of the most frequent substrings in file of sequences -

July 15, 2013

i have file many sequences (all of them have different length) these:

5-2-6 5-6-6-6-1-8 4 1-7-6-6-2-6-6-1 18-24-2-6-6-1-8

i need calculate frequent substrings in sequences , result this:

6-6       5 times 2-6       3 times 6-6-1     3 times 6-1       3 times 6-1-8     2 times 1-8       2 times 2-6-6-1   2 times 2-6-6     2 times

substrings can length starting 2 numbers. should provide opportunity in code change length of substring, example, find substrings longer 3 numbers.

i know should use suffix tree, can't find working example case((

language: r or python.

you don't need suffix tree; split sequences , use sliding window iterator produce substrings; loop starting @ minimum size looping number of numbers on line let produce different sizes of substrings.

a collections.counter() object can keep track of counts:

from collections import counter itertools import islice  def window(seq, n=2):     "returns sliding window (of width n) on data iterable"     "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "     = iter(seq)     result = tuple(islice(it, n))     if len(result) == n:         yield result         elem in it:         result = result[1:] + (elem,)         yield result  minimum_size = 2 counts = counter()  open(filename) infh:     line in infh:         parts = line.strip().split('-')         # count substrings of various sizes         counts.update('-'.join(sp)                        size in range(minimum_size, len(parts) + 1)                       sp in window(parts, size))

the counter.most_common() method gives sorted list substrings sorted frequency:

for substring, count in counts.most_common():     print '{:<20} {:>3d} times'.format(substring, count)

for sample data, top 10 entries are:

6-6                    5 times 6-6-1                  3 times 2-6                    3 times 6-1                    3 times 2-6-6                  2 times 1-8                    2 times 6-1-8                  2 times 6-6-1-8                2 times 2-6-6-1                2 times 18-24-2-6-6            1 times

if want focus on different minimum substring lengths, don't have re-count whole file. instead filter counter object; count - characters in keys:

for substr, freq in counts.most_common():     if substr.count('-') < 2:         # skip substrings shorter 3 elements         continue

Search This Blog

Two

python - Top of the most frequent substrings in file of sequences -

Comments

Post a Comment

Popular posts from this blog

get url and add instance to a model with prefilled foreign key :django admin -

android - Keyboard hides my half of edit-text and button below it even in scroll view -

css - Make div keyboard-scrollable in jQuery Mobile? -