Python URL string match


My problem is the following. I have a long list of URLs such as:

www.foo.com/davidbobmike1joe
www.foo.com/mikejoe2bobkarl
www.foo.com/joemikebob
www.foo.com/bobjoe

I need to compare all the entries (URLs) in the list with each other, extract the keywords in the subdomains of these URLs (in this case: david, joe, bob, mike, karl), and order them by frequency. I've been reading about several libraries such as nltk. However, the problem here is that there are no spaces to tokenise each word independently. Any recommendations on how to get this job done?

limitations

If you refuse to use a dictionary, your algorithm will require a lot of computation. On top of that, it is impossible to distinguish a keyword that occurs only once (e.g. "karl") from a crappy sequence (e.g. "e2bo"). My solution is best effort and will only work if your list of URLs contains the keywords multiple times.
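To make the second limitation concrete, here is a minimal sketch that counts every 4-letter sequence in the four example strings from the question. The helper name `count_ngrams` is only an illustration and not part of the solution itself.

```python
from collections import Counter

# The path parts of the four example URLs from the question.
sentences = ["davidbobmike1joe", "mikejoe2bobkarl", "joemikebob", "bobjoe"]

def count_ngrams(sentences, n):
    """Count every character sequence of length n over all sentences."""
    counts = Counter()
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    return counts

four_grams = count_ngrams(sentences, 4)

# "karl" (a real keyword) and "e2bo" (noise) both occur exactly once,
# so frequency alone cannot tell them apart; "mike" occurs three times.
print(four_grams["karl"], four_grams["e2bo"], four_grams["mike"])
```

Without a dictionary, only sequences that repeat, such as "mike", carry any signal.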

the basic idea

I assume that a word is a frequently occurring sequence of at least 3 characters. This prevents the letter "o" from being the most popular word.

The basic idea is the following.

  • Count all n-letter sequences and select the ones that occur multiple times.
  • Cut all sequences that are part of a larger sequence.
  • Order them by popularity and you have a solution that comes close to solving your problem. (Left as an exercise for the reader.)

in code

import operator  # unused here; useful for sorting the result by value

sentences = ["davidbobmike1joe", "mikejoe2bobkarl", "joemikebob", "bobjoe",
             "bobbyisawesome", "david", "bobbyjoe"]
counts = {}  # maps each candidate word to its number of occurrences

def countwords(n):
    """Count all possible character sequences/words of length n occurring in all given sentences."""
    for sentence in sentences:
        countwordssentence(sentence, n)

def countwordssentence(sentence, n):
    """Count all possible character sequences/words of length n occurring in a sentence."""
    for i in range(len(sentence) - n + 1):
        word = sentence[i:i + n]
        counts[word] = counts.get(word, 0) + 1

def cropdictionary():
    """Remove all words that occur only once."""
    for key in list(counts):  # iterate over a copy, since entries are deleted
        if counts[key] == 1:
            del counts[key]

def removepartials(word):
    """Remove all partial occurrences of a given word from the dictionary."""
    for i in range(3, len(word)):
        for j in range(len(word) - i + 1):
            part = word[j:j + i]
            # A partial is dropped only if it is exactly as frequent as the full word.
            if part in counts and counts[part] == counts[word]:
                del counts[part]

def removeallpartials():
    """Remove all partial words in the dictionary."""
    for word in list(counts):
        if word in counts:  # the word itself may already have been removed as a partial
            removepartials(word)

for i in range(3, max(len(s) for s in sentences)):
    countwords(i)
    cropdictionary()
    removeallpartials()

print(counts)  # the key order shown depends on the Python version

output

{'mike': 3, 'bobby': 2, 'david': 2, 'joe': 5, 'bob': 6}
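The sentences in the code hold only the part of each URL after the host name. One possible way to get there from the URLs in the question is sketched below with the standard library's `urllib.parse`. The helper name `url_to_sentence` is hypothetical, and the sketch assumes that URLs without a scheme, like the ones in the question, start with the host name.

```python
from urllib.parse import urlparse

def url_to_sentence(url):
    """Return the path of a URL without its slashes, e.g. 'bobjoe'."""
    # Assumption: a URL without "//" starts directly with the host name.
    # urlparse only recognises the host if the string contains "//".
    if "//" not in url:
        url = "//" + url
    return urlparse(url).path.strip("/")

urls = [
    "www.foo.com/davidbobmike1joe",
    "www.foo.com/mikejoe2bobkarl",
    "www.foo.com/joemikebob",
    "www.foo.com/bobjoe",
]
sentences = [url_to_sentence(url) for url in urls]
print(sentences)
```

If the keywords really live in the host name rather than in the path, `urlparse(url).netloc` would be the part to tokenise instead.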

some challenges for the reader

  • Sort the dictionary by value before printing it. (See: sort a Python dictionary by value.)
  • In this example "bob" occurs 6 times, 2 of which are as a partial word of "bobby". Determine whether this is problematic and fix it if necessary.
  • Take capitalization into account.
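For readers who want a starting point, here is one possible sketch of these three challenges, not a definitive answer. It starts from the result printed in the output section; the function names are made up for illustration. The discounting step assumes that a word is contained in at most one longer word, otherwise occurrences can be subtracted more than once.

```python
import operator

# Result copied from the output section.
counts = {'mike': 3, 'bobby': 2, 'david': 2, 'joe': 5, 'bob': 6}

def sort_by_frequency(counts):
    """Return (word, count) pairs, most frequent first."""
    return sorted(counts.items(), key=operator.itemgetter(1), reverse=True)

def discount_partials(counts):
    """Subtract from each word the occurrences that belong to a longer word."""
    adjusted = dict(counts)
    for short in counts:
        for long_word in counts:
            if short != long_word and short in long_word:
                # Every occurrence of the longer word was also counted
                # as an occurrence of the shorter word inside it.
                adjusted[short] -= counts[long_word] * long_word.count(short)
    return adjusted

def normalize(sentences):
    """Lowercase all sentences so that 'Bob' and 'bob' count as one word."""
    return [sentence.lower() for sentence in sentences]

print(sort_by_frequency(counts))
print(sort_by_frequency(discount_partials(counts)))
print(normalize(["BobJoe", "MIKEbob"]))
```

Whether "bob" should really drop from 6 to 4 depends on whether "bobby" counts as a separate keyword in your data; `normalize` would be applied to the sentences before counting.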
