python - Top of the most frequent substrings in file of sequences -
i have file many sequences (all of them have different length) these:
5-2-6 5-6-6-6-1-8 4 1-7-6-6-2-6-6-1 18-24-2-6-6-1-8 i need calculate frequent substrings in sequences , result this:
6-6 5 times 2-6 3 times 6-6-1 3 times 6-1 3 times 6-1-8 2 times 1-8 2 times 2-6-6-1 2 times 2-6-6 2 times substrings can length starting 2 numbers. should provide opportunity in code change length of substring, example, find substrings longer 3 numbers.
i know should use suffix tree, can't find working example case((
language: r or python.
you don't need suffix tree; split sequences , use sliding window iterator produce substrings; loop starting @ minimum size looping number of numbers on line let produce different sizes of substrings.
a collections.counter() object can keep track of counts:
from collections import counter itertools import islice def window(seq, n=2): "returns sliding window (of width n) on data iterable" " s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... " = iter(seq) result = tuple(islice(it, n)) if len(result) == n: yield result elem in it: result = result[1:] + (elem,) yield result minimum_size = 2 counts = counter() open(filename) infh: line in infh: parts = line.strip().split('-') # count substrings of various sizes counts.update('-'.join(sp) size in range(minimum_size, len(parts) + 1) sp in window(parts, size)) the counter.most_common() method gives sorted list substrings sorted frequency:
for substring, count in counts.most_common(): print '{:<20} {:>3d} times'.format(substring, count) for sample data, top 10 entries are:
6-6 5 times 6-6-1 3 times 2-6 3 times 6-1 3 times 2-6-6 2 times 1-8 2 times 6-1-8 2 times 6-6-1-8 2 times 2-6-6-1 2 times 18-24-2-6-6 1 times if want focus on different minimum substring lengths, don't have re-count whole file. instead filter counter object; count - characters in keys:
for substr, freq in counts.most_common(): if substr.count('-') < 2: # skip substrings shorter 3 elements continue
Comments
Post a Comment