approximation - Algorithm for Matching Hospital Names -

January 15, 2011

i work in health care company , have trouble hospitalization report data. have data coming various sources: excel reports, plain text file, , in cases paper. managed data excel file. running problem each person spelled , referred same hospital.

for example: new york presbyterian hospital, have seen more 10 variation.

new york presbyterian hospital
ny presbyterian hospital
presbyterian hospital
presb hospital
presbhosp
new_york_presb_hosp
nypresbhosp
columbia presbyterian medical center
nyp/columbia university medical center
new york presbyterian hospital columbia university medical
a more more cases hospital name misspelled
a few of different system string limit , cut off string in random places, or maybe copy , pasted incorrectly.
different nurses refer hospital in differently

in effect trying create true database can store membership's information, running wall because each staff/department naming hospital in different way. (there provider id unique each hospital), of reports received included "name". have on 2000 members 100-150 hospitals, 3 or 4 times amount of different names.

i know levenshtein distance in use, in such extreme case, there strategy build match? there data hands (time consuming), since 1 of dozens or reports assigned. suggestion appreciated.

this pretty standard , pretty difficult problem. entire companies exist solve big data.

the usual strategy encode known data domain in heuristic algorithm classify data before putting in database.

a standard classification method create set of pattern strings each hospital. examples gave might go in pattern set initially.

then each incoming string , each pattern, calculate metric that's difference between string , pattern. levenshtein starting point. set containing least distance pattern (in case columbia presbyterian) wins. excessive least distance means pattern set no good. (you tweak "excessive" means.) more 1 low distance (you define "low," too) means pattern set has inadvertent overlaps.

both problems may handled in various ways, involving human intervention either classify data or enhance pattern sets or both.

a second possibility use regexes patterns. match equivalent distance 0 above, , non-match distance infinity. might expect, makes algorithm less flexible. yet kinds of data - not yours though - it's best choice.

Search This Blog

Two

approximation - Algorithm for Matching Hospital Names -

Comments

Post a Comment

Popular posts from this blog

get url and add instance to a model with prefilled foreign key :django admin -

android - Keyboard hides my half of edit-text and button below it even in scroll view -

css - Make div keyboard-scrollable in jQuery Mobile? -