While working on my 'let's make OSM usable for normal people' project, aka OpenSuperMaps, I've been drawn into OpenAddresses. Coming from the OSM side, a big question about this data has been quality. I'm going to point out bad data I've found and try to draw some conclusions.
The data comes from numerous sources, many with their own idiosyncrasies. Since the bad data sources are currently limiting me, let's start with those. My methods for finding bad data are mostly manual; only illegal XML characters throw errors from my build pipeline. I'll be looking at data quality from those sources especially, plus a few others that show red flags like punctuation and special characters in the data.
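For reference, here's a minimal sketch of the kind of scan I mean, assuming the usual OpenAddresses CSV layout with NUMBER and STREET columns; the specific red-flag characters are my own guess at what to look for, not something baked into the build pipeline.

```python
import csv
import re

# Characters that are illegal in XML 1.0 (most C0 control characters).
ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# Punctuation that rarely belongs in a house number or street name.
RED_FLAGS = re.compile(r"[&#@*\\]|--")

def suspicious_rows(path):
    """Yield rows whose NUMBER or STREET fields look like garbage."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            fields = (row.get("NUMBER", ""), row.get("STREET", ""))
            if any(ILLEGAL_XML.search(v) or RED_FLAGS.search(v) for v in fields):
                yield row

# Hypothetical usage against a downloaded source file:
# for row in suspicious_rows("addresses.csv"):
#     print(row["NUMBER"], row["STREET"])
```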
The city of Rancho Cucamonga, CA decided to put -- in the number and street fields in 4,521 of 56,714 records (8%). Statewide Florida has a record with number=\x02, aka a Unicode control character. Looking further, there are 32 rows where the street starts with & and 13 where the number is &B.
Many of the statewide NY records where the number has non-numeric contents are valid, since 1/2 addresses are a thing, but there are a lot of unit designations in there too.
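A simple pattern check can separate the two cases. This is a rough sketch that accepts only plain integers and fractional house numbers like "107 1/2"; the exact definition of a valid number here is my own assumption, not an OpenAddresses rule.

```python
import re

# Plain integer, optionally followed by a simple fraction like "1/2".
VALID_NUMBER = re.compile(r"^\d+(?: \d/\d)?$")

print(bool(VALID_NUMBER.match("107 1/2")))    # True: fractional addresses are real
print(bool(VALID_NUMBER.match("12 UNIT B")))  # False: a unit designation leaked in
```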
The county of Yakima, WA is a prolific offender: a lack of standardization and normalization, topped with garbage data. It has 326 records where the street starts with &; for example, number=41, street=& 43 DEER COVE LN. Unit numbers are commingled in the street field and not standardized. There's number=109 with street=E 3RD AVE, 113 E 3RD AVE, 115 E 3RD AVE; number=140 with street=PATRIOT LN, (UNIT A & B); number=11 with street=ROCKY RD UNITS 1 - 6; and number=515 with street=515, 517, 519 ELM ST. There's more, but I'll stop there. 20,138 (~20%) of the records have number=0, which isn't a valid address.
| Source | number has non-numeric contents | illegal XML | number=0 |
|---|---|---|---|
| Statewide FL | 13675 | 1 | 0 |
| Statewide NY | 46117 | 0 | 32 |
| Yakima WA | 0 | 0 | 20138 |
As you can see, each dataset tends to mangle the data in unique ways, and the fixes range from trivial to probably impossible. Someone could easily exclude the bad data if needed; trying to fix it would be the labor-intensive part.
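To illustrate the "just exclude it" approach, here's a sketch of a filter that drops rows matching the problems described above, again assuming the standard NUMBER/STREET CSV layout; the conditions are examples and would need tuning per source.

```python
import csv
import re

ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def keep(row):
    """Return True for rows that pass some basic sanity checks."""
    number = row.get("NUMBER", "").strip()
    street = row.get("STREET", "").strip()
    if not number or not street:
        return False
    if number in ("0", "--") or street == "--":
        return False
    if ILLEGAL_XML.search(number) or ILLEGAL_XML.search(street):
        return False
    if number.startswith("&") or street.startswith("&"):
        return False
    return True

def filter_csv(src, dst):
    """Copy src to dst, dropping rows that fail the keep() checks."""
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        writer.writerows(row for row in reader if keep(row))
```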
I'm planning to follow this up with an overall characterization of OpenAddresses data quality.