Friday, July 22, 2005

ID me

I once did a match between a database of ~25,000 people and one of ~250,000 - the smaller being more-or-less a subset of the larger. It was a few years ago, but I remember being stuck with nearly a thousand cases where my matching rules had failed, and I wasn't being very strict: I expect a match on surname, first initial, date of birth and title/gender would have been enough for that match (you need title or gender because of "John and Jane Smith - twins, resident at same address, known by initial only").

When you do this kind of match, you end up with two distinct but closely related problems:

  • Two records which appear to be two different people in fact relate to one person
  • Two records which appear to be the same person in fact relate to two different people

There can be more than two of course, and these ambiguous or even fictitious (clerks get bored, people lie) records may be in the same database or in different ones - nobody's data is perfect.
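
To make that concrete, here is a rough sketch in Python of the sort of rule I mean: surname, first initial, date of birth and gender, used as a lookup key. It is not the code I actually wrote back then. The record layout, the names (Person, match_key, match) and the sample people are all made up for illustration, and gender stands in for title/gender.

```python
from collections import defaultdict
from typing import NamedTuple


class Person(NamedTuple):
    # Hypothetical record layout, assumed for this sketch only.
    record_id: str
    surname: str
    forename: str
    dob: str     # ISO date string, e.g. "1970-01-31"
    gender: str  # "M" or "F", standing in for title/gender


def match_key(p):
    """Surname, first initial, date of birth and gender, lightly normalised."""
    return (
        p.surname.strip().lower(),
        p.forename.strip()[:1].lower(),
        p.dob,
        p.gender.upper(),
    )


def match(small, large):
    """Map each record id in `small` to the ids in `large` that share its key."""
    index = defaultdict(list)
    for p in large:
        index[match_key(p)].append(p.record_id)
    return {p.record_id: list(index.get(match_key(p), [])) for p in small}


# Invented sample data.
large = [
    Person("L1", "Smith", "John", "1970-01-31", "M"),
    Person("L2", "Smith", "Jane", "1970-01-31", "F"),
    Person("L3", "Jones", "Alan", "1962-05-04", "M"),
    Person("L4", "Jones", "Andrew", "1962-05-04", "M"),
    Person("L5", "Browne", "Mary", "1955-11-09", "F"),
]
small = [
    Person("S1", "SMITH", "J", "1970-01-31", "F"),     # gender separates the twins
    Person("S2", "Jones", "A", "1962-05-04", "M"),     # two candidates share the key
    Person("S3", "Brown", "Mary", "1955-11-09", "F"),  # spelling variant, no candidate
]

results = match(small, large)

# Anything without exactly one candidate goes on the pile for a human to look at.
for_review = {sid: ids for sid, ids in results.items() if len(ids) != 1}
```

In this made-up data S1 finds Jane and not John because of the gender field, S2 comes back with two candidates (the second problem above), and S3 comes back with none because of one letter in the surname (the first problem). The rule can only flag those cases; it cannot settle them.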

I registered my disapproval of the concept of ID cards the other day. I aimed on a whim to be the 10,000th person but missed, because I hadn't sussed that I had to register to confirm the pledge via an email address, and Gmail was being very slow. If you secretly wanted to sign up but were too pussy, you can still contribute.

Anyway, after I had done the automated match I had to sit and go through the list of partial matches, deciding who was who and who wasn't. It's a decision only a human being can make (as a programmer I say that advisedly) and they need as much information as they have or can get to make the best possible match. For an ID card that might mean knowing who you call or email, or where you've been, or who you know - to make the distinction, the analyst uses all the information she has to decide who you are. As long as you're happy with that.
