c++ - Closest match algorithm for strings/sentences not working? -


i wrote program takes question user. matches question list of pre-defined questions , returns answer. supposed accurate , match questions close (fuzzy matches) or user entered.

my sssce:

http://ideone.com/jtcf73

the code:

#include <iostream> #include <cstdint> #include <algorithm> #include <numeric> #include <functional>  int min_index(const std::vector<int>& list) {     int index = 0;     int smallest = list[0];      (size_t = 0; < list.size(); ++i) {         if (list[i] < smallest) {             smallest = list[i];             index = i;         }     }     return index; }  std::uint32_t levenshteindistance(const std::string &first, const std::string &second) {     const std::size_t firstlength = first.size();     const std::size_t secondlength = second.size();      std::vector<std::uint32_t> current(secondlength + 1);     std::vector<std::uint32_t> previous(secondlength + 1);      std::size_t = 0;     std::generate(previous.begin(), previous.end(), [&] {return i++; });      (i = 0; < firstlength; ++i)     {         current[0] = + 1;          (std::size_t j = 0; j < secondlength; ++j)         {             auto cost = first[i] == second[j] ? 0 : 1;             current[j + 1] = std::min(std::min(current[j] + 1, previous[j + 1] + 1), previous[j] + cost);         }          current.swap(previous);     }     return previous[secondlength]; }  std::vector<std::string> questions =  {     "what popular program @ gbc?",     "how tuition @ gbc?",     "do have pay fees before can register?",     "what fee payment options?",     "how know when i'm allowed register?",     "how add and/or drop courses timetable?",     "what do if can't find password?",     "how withdraw program?",     "what college policies?",     "how math need know?",     "what program code computer programming?",     "what stu-view?",     "what college best known for?",     "how easy find work after program completion?",     "what if plan continue education after?" };  std::vector<std::string> answers = {     "fashion",     "3000 semester",     "yes have pay fees before registering",     "you may pay online on student account through student portal",     "you may register 2 weeks or more before start of program",     "you may drop courses online through student portal",     "you can call ... , agent assist you",     "you may withdraw using student portal online",     "they located @ following link...",     "it depends on program entering",     "t127 code computer science",     "stu-view student portal manage student account , view marks.",     "the cafeteria food",     "depends on field of work , timing",     "you may within 3 years after program completion" };  int main() {     std::string user_question = "program";      std::vector<int> distances = std::vector<int>(questions.size(), 0);      (size_t = 0; < questions.size(); ++i)     {         int dist = levenshteindistance(user_question, questions[i]);         distances[i] = dist;     }      std::cout<<"distance:      "<<distances[min_index(distances)]<<"\n";     std::cout<<"user-question: "<<user_question<<"\n";     std::cout<<"question-key:  "<<questions[min_index(distances)]<<"\n";     std::cout<<"answer-value:  "<<answers[min_index(distances)]<<"\n";      return 0; } 

so in above, user enters "program".. supposed find closest match list of questions , return corresponding answer..

however, prints:

distance:      17 user-question: program question-key:  stu-view? answer-value:  stu-view student portal manage student account , view marks. 

it should have had way better results or accuracy doesn't seem care whether or not sentence has keywords user entered :s works small cases large database or above there more 5 sentences, has hard time.. little keywords ;l

what doing wrong? ideas how can fix , make more accurate? tried hammingdistance, similar results..

both "program" , "what stu-view?" quite short compared other strings. it's easier transform "program" "what stu-view?" transform "program" "what popular program @ gbc?" despite fact word "program" common.

what doing wrong?

i don't think you're doing wrong. if you're not satisfied result, means current formalism (minimizing levenshtein distance) not 1 want.

you can go more local solutions: e.g. tokenizing strings, compute pairwise levenshtein distance between words merge results (average, sup inf...)

better solutions require doing bibliography (probably unsupervised machine learning topic)


Comments

Popular posts from this blog

get url and add instance to a model with prefilled foreign key :django admin -

android - Keyboard hides my half of edit-text and button below it even in scroll view -

css - Make div keyboard-scrollable in jQuery Mobile? -