In general, disambiguation is a challenge in any data science project where entities are identified by non-unique records, like ordinary names, or where errors are introduced into the data stream through a manual entry point. I looked at some Python libraries that handle disambiguation, but they didn't seem precisely suited for this project. I absolutely need to avoid improperly combining records for different scientists. The data is small enough that I can clean it by hand (with a little computer assistance). And the ambiguities have regular patterns, rather than being the result of a random process that swaps similar characters.
The most common problems are:
- A scientist is inconsistent in use of a middle initial
- A scientist is referred to by a nickname
- A scientist's first name is misspelled in one instance
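
To give a sense of what "a little computer assistance" could look like for these three patterns, here is a minimal sketch that only flags candidate pairs for a human to review and never merges anything on its own. It is not the code from this project: the nickname table, the function names, the sample names, and the 0.8 similarity threshold are all hypothetical, and it assumes names are written as "First [Middle] Last".

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical nickname table; a real one would be built by hand from the data.
NICKNAMES = {"bob": "robert", "bill": "william", "jim": "james"}


def parse_name(name):
    """Reduce 'First [Middle] Last' to a lowercase (first, last) pair.

    Middle names and initials are dropped, and a known nickname is
    replaced by its formal form. Assumes the name has at least one token.
    """
    parts = name.replace(".", " ").lower().split()
    first, last = parts[0], parts[-1]
    return NICKNAMES.get(first, first), last


def candidate_pairs(names, threshold=0.8):
    """Return (name_a, name_b, reason) tuples worth checking by hand."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        first_a, last_a = parse_name(a)
        first_b, last_b = parse_name(b)
        if last_a != last_b:
            continue  # different surnames: never suggest a merge
        if first_a == first_b:
            pairs.append((a, b, "middle initial or nickname"))
        elif SequenceMatcher(None, first_a, first_b).ratio() >= threshold:
            pairs.append((a, b, "possible misspelling"))
    return pairs


if __name__ == "__main__":
    # Made-up sample names, purely for illustration.
    sample = ["John Q. Smith", "John Smith", "Bob Jones", "Robert Jones",
              "Micheal Brown", "Michael Brown", "Alice Wong"]
    for a, b, reason in candidate_pairs(sample):
        print(f"{a!r} <-> {b!r}: {reason}")
```

Because wrongly combining two different scientists is the worst outcome, the sketch only compares names with identical surnames and leaves the final decision to a person. Two distinct people who share a first and last name would still be flagged, which is exactly why the output is a review list rather than an automatic merge.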
And now there are files in my GitHub account! https://github.com/mburnamfink/scientometrics_101_project