approximation - Algorithm for Matching Hospital Names
I work at a health care company and am having trouble with hospitalization report data. The data comes from various sources: Excel reports, plain text files, and in some cases paper. I have managed to get the data into an Excel file. The problem I am running into is that each person spells, or refers to, the same hospital differently.
For example, for New York Presbyterian Hospital I have seen more than 10 variations:
- new york presbyterian hospital
- ny presbyterian hospital
- presbyterian hospital
- presb hospital
- presbhosp
- new_york_presb_hosp
- nypresbhosp
- columbia presbyterian medical center
- nyp/columbia university medical center
- new york presbyterian hospital columbia university medical
- many more cases where the hospital name is misspelled
- a few of the different systems have a string length limit and cut off the string in random places, or maybe the name was copied and pasted incorrectly
- different nurses refer to the hospital differently
In effect, I am trying to create a true database that can store our members' information, but I am running into a wall because each staff member or department names the hospital in a different way. (There is a provider ID unique to each hospital, but most of the reports I received included only the name.) I have over 2000 members and 100-150 hospitals, with 3 or 4 times that number of different names.
I know Levenshtein distance is in use for this kind of thing, but in such an extreme case, is there a strategy to build a match? I could go through the data by hand, but that is time consuming, since this is 1 of dozens of reports I am assigned. Any suggestion is appreciated.
This is a pretty standard and pretty difficult problem. Entire companies exist to solve it for big data.
The usual strategy is to encode what is known about the data domain in a heuristic algorithm that classifies the data before putting it in the database.
A standard classification method is to create a set of pattern strings for each hospital. The examples you gave might go in the pattern set initially.
Then, for each incoming string and each pattern, calculate a metric that is the difference between the string and the pattern. Levenshtein is a fine starting point. The set containing the least-distance pattern (in this case Columbia Presbyterian) wins. An excessive least distance means the pattern sets are no good. (You get to tweak what "excessive" means.) More than one low distance (you define "low," too) means the pattern sets have inadvertent overlaps.
Both problems may be handled in various ways, usually involving human intervention either to classify the data or to enhance the pattern sets, or both.
A second possibility is to use regexes as patterns. A match is equivalent to distance 0 above, and a non-match to distance infinity. As you might expect, this makes the algorithm less flexible. Yet for some kinds of data, though not yours, it's the best choice.
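A sketch of the regex variant, under the same caveats: the hospital keys and the regexes themselves are illustrative guesses at what a pattern set might look like, not a vetted set. A full match counts as distance 0 and anything else as infinity.

```python
import math
import re

# Hypothetical regex pattern sets; "MSH" is invented to have a second entry.
REGEX_SETS = {
    "NYP": [
        re.compile(r"(new[ _]?york|ny)?[ _]?presb(yterian)?[ _]?hosp(ital)?"),
        re.compile(r"columbia[ _]presbyterian[ _]medical[ _]center"),
    ],
    "MSH": [re.compile(r"(mount|mt)[ _]?sinai[ _]?hosp(ital)?")],
}


def regex_distance(name: str, pattern: re.Pattern) -> float:
    """Distance 0 if the whole string matches the pattern, else infinity."""
    return 0 if pattern.fullmatch(name) else math.inf


def classify_by_regex(name, regex_sets):
    """Return the IDs of every set with a pattern at distance 0.

    An empty list means no set matched; more than one ID means the
    regexes overlap.
    """
    text = name.strip().lower()
    return [
        hospital_id
        for hospital_id, patterns in regex_sets.items()
        if min(regex_distance(text, p) for p in patterns) == 0
    ]
```

With these patterns, "new_york_presb_hosp" and "nypresbhosp" both land in the "NYP" set, but a misspelling such as "new york presbyterian hosptal" matches nothing, which is the loss of flexibility mentioned above.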