c++ - Closest match algorithm for strings/sentences not working? -
i wrote program takes question user. matches question list of pre-defined questions , returns answer. supposed accurate , match questions close (fuzzy matches) or user entered.
my sssce:
the code:
#include <iostream> #include <cstdint> #include <algorithm> #include <numeric> #include <functional> int min_index(const std::vector<int>& list) { int index = 0; int smallest = list[0]; (size_t = 0; < list.size(); ++i) { if (list[i] < smallest) { smallest = list[i]; index = i; } } return index; } std::uint32_t levenshteindistance(const std::string &first, const std::string &second) { const std::size_t firstlength = first.size(); const std::size_t secondlength = second.size(); std::vector<std::uint32_t> current(secondlength + 1); std::vector<std::uint32_t> previous(secondlength + 1); std::size_t = 0; std::generate(previous.begin(), previous.end(), [&] {return i++; }); (i = 0; < firstlength; ++i) { current[0] = + 1; (std::size_t j = 0; j < secondlength; ++j) { auto cost = first[i] == second[j] ? 0 : 1; current[j + 1] = std::min(std::min(current[j] + 1, previous[j + 1] + 1), previous[j] + cost); } current.swap(previous); } return previous[secondlength]; } std::vector<std::string> questions = { "what popular program @ gbc?", "how tuition @ gbc?", "do have pay fees before can register?", "what fee payment options?", "how know when i'm allowed register?", "how add and/or drop courses timetable?", "what do if can't find password?", "how withdraw program?", "what college policies?", "how math need know?", "what program code computer programming?", "what stu-view?", "what college best known for?", "how easy find work after program completion?", "what if plan continue education after?" }; std::vector<std::string> answers = { "fashion", "3000 semester", "yes have pay fees before registering", "you may pay online on student account through student portal", "you may register 2 weeks or more before start of program", "you may drop courses online through student portal", "you can call ... , agent assist you", "you may withdraw using student portal online", "they located @ following link...", "it depends on program entering", "t127 code computer science", "stu-view student portal manage student account , view marks.", "the cafeteria food", "depends on field of work , timing", "you may within 3 years after program completion" }; int main() { std::string user_question = "program"; std::vector<int> distances = std::vector<int>(questions.size(), 0); (size_t = 0; < questions.size(); ++i) { int dist = levenshteindistance(user_question, questions[i]); distances[i] = dist; } std::cout<<"distance: "<<distances[min_index(distances)]<<"\n"; std::cout<<"user-question: "<<user_question<<"\n"; std::cout<<"question-key: "<<questions[min_index(distances)]<<"\n"; std::cout<<"answer-value: "<<answers[min_index(distances)]<<"\n"; return 0; } so in above, user enters "program".. supposed find closest match list of questions , return corresponding answer..
however, prints:
distance: 17 user-question: program question-key: stu-view? answer-value: stu-view student portal manage student account , view marks. it should have had way better results or accuracy doesn't seem care whether or not sentence has keywords user entered :s works small cases large database or above there more 5 sentences, has hard time.. little keywords ;l
what doing wrong? ideas how can fix , make more accurate? tried hammingdistance, similar results..
both "program" , "what stu-view?" quite short compared other strings. it's easier transform "program" "what stu-view?" transform "program" "what popular program @ gbc?" despite fact word "program" common.
what doing wrong?
i don't think you're doing wrong. if you're not satisfied result, means current formalism (minimizing levenshtein distance) not 1 want.
you can go more local solutions: e.g. tokenizing strings, compute pairwise levenshtein distance between words merge results (average, sup inf...)
better solutions require doing bibliography (probably unsupervised machine learning topic)
Comments
Post a Comment