Address parsing: State of the art, and beyond
Disclaimer: I did not quantifiably beat SOTA, but I claim my parser is dramatically better than SOTA for geocoding purposes, for reasons I elaborate on below.
Street address parsing is a very hard problem. It’s one of those topics that I could write a pages-long “Myths programmers believe about X” think-piece on. For the curious, Al Barrentine’s post on the subject announcing libpostal 1.0 is a must-read. For our purposes, suffice it to say it’s a tough nut to crack.
The state of the art circa 2024
Libpostal is the ultimate address parser for people who don’t make typing errors. It is unmatched, remains unmatched, and after toiling at this problem for a few weekends with modern NLP tools like transformers, I have personally concluded that nobody will be able to beat the novel CRF architecture of libpostal for well-formed queries for a long time. Libpostal’s dominance comes from its ability to simply memorize everything. The vocabulary is enormous, and the CRF weights are very sparse, which allows it to parse a query in close to 100μs. This memorization comes with two big downsides that motivated my search for a better (but also worse) address parser.
Libpostal downside 1: The model is enormous.
The SOTA libpostal model weighs in at ~2.2GiB after post-processing for sparsity, and it all needs to be in RAM for parsing to work. It currently can’t be memory mapped due to some of the complexities of the on-disk format. For the geocoder I’m working on, Airmail, storage and memory requirements like that are a deal-breaker.
Libpostal downside 2: Typos and prefix queries
Unfortunately for those of us out there working on geocoders, libpostal breaks when presented with typos or partial queries. For example, if you ask libpostal to parse the address “1600 pennsylvania ave nw washingto”, it will tell you that 1600 is a house number (great!) and “pennsylvania ave nw washingto” is a street (less great).
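You can reproduce this failure mode yourself with the pypostal Python bindings. A minimal sketch, assuming libpostal and the `postal` package are installed:

```python
# Reproducing the parse described above, assuming libpostal and its
# Python bindings are installed (pip install postal).
from postal.parser import parse_address

print(parse_address("1600 pennsylvania ave nw washingto"))
# Expected, per the parse described above:
# [('1600', 'house_number'), ('pennsylvania ave nw washingto', 'road')]
```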
Libpostal’s CRF-based architecture is approachable and comprehensible, a rarity in the machine learning world. Because of that approachability, I’m willing to claim that I know why this case fails. Simply put, libpostal has an enormous vocabulary that it uses very opportunistically. “Washington” is a highly predictive token, so libpostal’s CRF rightly assigns a very high weight to its presence. Other features that could be associated with that token (for example, various substrings or root words) are largely ignored by libpostal’s training routine once the model is able to classify the token correctly. The net effect of performing this type of training over the entire corpus is that the most predictive features associated with each token are incorporated into the CRF weights, and less predictive features are ignored unless they are required for correct classification. Worse still, sparsifying a libpostal model for a production deployment involves selectively removing non-predictive weights, further reducing the model’s ability to generalize.
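To make that failure mode concrete, here’s a toy linear-model sketch. This is emphatically not libpostal’s real feature set or weights, just an illustration of why a memorized exact-token feature fails to generalize:

```python
# Toy illustration: a linear model scores each label by summing the
# weights of its active features. If training puts nearly all the
# weight on the exact-token feature, a single typo wipes out the signal.
WEIGHTS = {
    ("token=washington", "city"): 9.0,  # memorized exact token
    ("prefix=washin", "city"): 0.05,    # substring feature: nearly ignored
}

def score(token: str, label: str) -> float:
    features = [f"token={token}", f"prefix={token[:6]}"]
    return sum(WEIGHTS.get((f, label), 0.0) for f in features)

print(score("washington", "city"))  # 9.05 -- exact-token feature dominates
print(score("washingto", "city"))   # 0.05 -- typo leaves almost no signal
```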
If you’re building a geocoder around libpostal, this means you can’t parse malformed or prefix queries consistently. To illustrate, I’ve included some selected examples of typos and partial queries. The label after the arrow is how the last token of the query was parsed; “Road” means libpostal included the last token in the portion of the query it identified as a road. I put asterisks around the label in examples it gets right.
# Prefix queries
1600 Pennsylvania Ave NW, W -> Road
1600 Pennsylvania Ave NW, Wa -> State
1600 Pennsylvania Ave NW, Was -> Road
1600 Pennsylvania Ave NW, Wash -> Road
1600 Pennsylvania Ave NW, Washi -> *City*
1600 Pennsylvania Ave NW, Washin -> Road
1600 Pennsylvania Ave NW, Washing -> Road
1600 Pennsylvania Ave NW, Washingt -> Road
1600 Pennsylvania Ave NW, Washingto -> Road
1600 Pennsylvania Ave NW, Washington -> *City*
# Typos
1600 Pennsylvania Ave NW, Ashington -> *City*
1600 Pennsylvania Ave NW, Wshington -> Road
1600 Pennsylvania Ave NW, Wahington -> Road
1600 Pennsylvania Ave NW, Wasington -> Road
1600 Pennsylvania Ave NW, Washngton -> Road
1600 Pennsylvania Ave NW, Washigton -> Road
1600 Pennsylvania Ave NW, Washinton -> Road
1600 Pennsylvania Ave NW, Washingon -> Road
1600 Pennsylvania Ave NW, Washingtn -> Road
1600 Pennsylvania Ave NW, Washingto -> Road
This is unfortunately not a great hit rate for those of us on mobile with clumsy thumbs.
New parser, new beginnings
In February 2024 I began exploring the use of very small BERT models for various geocoding tasks like fuzzy mapping of nightmare-fuel categorical queries like “where can I find falafel” and “I need to go swimming right now otherwise I'm literally going to die” onto OSM tags like “amenity=restaurant” and “leisure=swimming_pool”.
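For a flavor of what that fuzzy category mapping looks like, here’s a minimal sketch using embedding similarity with a small off-the-shelf model via sentence-transformers. To be clear, this is not Airmail’s actual model or training setup, and the tag descriptions are made up for the example:

```python
# A hedged sketch of fuzzy category mapping via embedding similarity.
# Model choice and tag descriptions are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small off-the-shelf model

OSM_TAGS = {
    "amenity=restaurant": "a place to eat a meal, such as falafel or pizza",
    "leisure=swimming_pool": "a place to go swimming",
    "shop=supermarket": "a store selling groceries",
}

tag_embeddings = model.encode(list(OSM_TAGS.values()))

def best_tag(query: str) -> str:
    # Pick the tag whose description is most similar to the query.
    scores = util.cos_sim(model.encode(query), tag_embeddings)[0]
    return list(OSM_TAGS)[int(scores.argmax())]

print(best_tag("where can I find falafel"))          # amenity=restaurant
print(best_tag("I need to go swimming right now"))   # leisure=swimming_pool
```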
Once satisfied with the performance of my categorical search model, I turned to address parsing. Aside from a short stint with a hand-coded address parser in the early days, Airmail has been largely parserless. I was initially surprised at how little this affected geocoding performance. It hasn’t been much of a problem, but I knew we were leaving a little bit of performance and some QPS on the table by throwing every term into a single multi-valued string field, so it’s been on my to-do list to fix.
In March 2024, I decided to try training a BERT model on the libpostal dataset for token classification. As an aside, I owe Al Barrentine an enormous “thank you” for releasing the entire training dataset for libpostal, and another big “thank you” for putting so much effort into ensuring it is as complete as it can be across so many different axes. I may also owe Al an apology for only using part of the dataset in my own efforts.
Because of limited compute, I trained my BERT model on the OpenStreetMap formatted addresses and formatted places datasets to roughly epoch 0.3, i.e. the model saw about 30% of the data once. If you’d like to see it trained against the full dataset, please consider sponsoring my maps work on GitHub to help cover the cost of compute.
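The training loop itself is a bog-standard HuggingFace token-classification fine-tune. A condensed sketch, where the base model, label set, hyperparameters, and dataset plumbing are stand-ins rather than my exact pipeline:

```python
# Condensed sketch of fine-tuning a small BERT for address token
# classification. Base model, labels, and hyperparameters are stand-ins.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["house_number", "road", "unit", "city", "state", "postcode"]

def train(train_dataset):
    """train_dataset must yield pre-padded input_ids, attention_mask, and
    per-token label ids, e.g. via tokenizer(..., is_split_into_words=True)."""
    tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
    model = AutoModelForTokenClassification.from_pretrained(
        "prajjwal1/bert-tiny", num_labels=len(LABELS))
    args = TrainingArguments(
        output_dir="address-parser",
        max_steps=300_000,  # stop partway through the first epoch
        per_device_train_batch_size=64,
        learning_rate=5e-5,
    )
    Trainer(model=model, args=args, train_dataset=train_dataset).train()
    tokenizer.save_pretrained("address-parser")
    model.save_pretrained("address-parser")
```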
To set expectations: my BERT model weighs in at under 80MiB when stored with bfloat16 weights, which is very small for a transformer. It’s fast (for a transformer) on account of its low parameter count, but it still takes about 20ms to parse a query on a CPU, roughly 200x slower than libpostal.
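Storing the weights in bfloat16 is a one-liner with transformers, and the CPU latency is easy to measure yourself. A sketch, with the checkpoint path as a placeholder:

```python
# Sketch: load a fine-tuned checkpoint in bfloat16 and time one parse on
# CPU. The "address-parser" path is a placeholder. bf16 CPU inference
# may require a reasonably recent torch.
import time
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("address-parser")
model = AutoModelForTokenClassification.from_pretrained(
    "address-parser", torch_dtype=torch.bfloat16)
model.save_pretrained("address-parser-bf16")  # roughly halves on-disk size

inputs = tokenizer("1600 Pennsylvania Ave NW, Washington", return_tensors="pt")
start = time.perf_counter()
with torch.no_grad():
    logits = model(**inputs).logits
print(f"parse took {(time.perf_counter() - start) * 1e3:.1f} ms")
print(logits.argmax(-1))  # per-token label ids
```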
Al chose to evaluate libpostal based on full-parse correctness. In other words, you get zero credit towards a correct parse if you mess up a single token of the query. By this metric, Airmail’s new parser achieves a correct parse rate of ~98.72% against withheld data (for the non-ML people: data that the model has never trained on). Quite a difference from libpostal’s ~99.45%. Keep in mind that my train/eval data is from the OpenStreetMap libpostal training data. I’m not testing any of the OpenAddresses data, so this is not a perfect comparison.
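For concreteness, full-parse correctness is just exact match over the whole label sequence:

```python
# Full-parse correctness: a parse only counts if every token's label
# matches the gold labels exactly.
def full_parse_accuracy(predictions, gold):
    """predictions, gold: lists of per-query label sequences."""
    exact = sum(p == g for p, g in zip(predictions, gold))
    return exact / len(gold)

# One of two parses has a single wrong token, so accuracy is 0.5.
gold = [["house_number", "road", "city"], ["house_number", "road", "city"]]
pred = [["house_number", "road", "city"], ["house_number", "road", "road"]]
print(full_parse_accuracy(pred, gold))  # 0.5
```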
Moving the goalposts
Backtracking a bit, recall that Airmail is currently parserless. It indexes all tokens and token phrases associated with a POI in a single multi-valued string field. We achieve 100% parsing accuracy by simply not parsing, and geocoding performance is surprisingly good, especially considering the project is still in its infancy. On account of this, I’m going to claim that for a geocoder, it’s not necessary to use all of libpostal’s 20 token categories. There are really 6 labels that I care deeply about distinguishing from one another; a sketch of the mapping onto them follows the list.
Administrative areas like suburbs, cities, states, etc.
Postal codes
Roads
House numbers
Unit/level/entrance/staircase, etc. Any address component more specific than a house number.
Category or POI name. Notably, I don’t mind if these two are confused with one another.
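Here’s roughly what that collapse looks like as a lookup table. The names on the left are libpostal’s 20 labels; the grouping on the right is my best reconstruction of the scheme above, not necessarily the exact mapping Airmail ships:

```python
# Collapsing libpostal's 20 labels onto 6 coarse categories. The
# grouping (especially po_box) is an illustrative assumption.
LABEL_MAP = {
    "suburb": "admin_area", "city_district": "admin_area",
    "city": "admin_area", "island": "admin_area",
    "state_district": "admin_area", "state": "admin_area",
    "country_region": "admin_area", "country": "admin_area",
    "world_region": "admin_area",
    "postcode": "postcode",
    "road": "road",
    "house_number": "house_number",
    "unit": "sub_house_number", "level": "sub_house_number",
    "staircase": "sub_house_number", "entrance": "sub_house_number",
    "po_box": "sub_house_number",
    "house": "poi_or_category", "category": "poi_or_category",
    "near": "poi_or_category",
}

def coarsen(labels):
    return [LABEL_MAP.get(label, label) for label in labels]
```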
By mapping libpostal labels onto this subset, I achieve a ~99.09% correct parse rate against withheld data, again applying the full-sequence correctness metric. That’s nearly twice as many errors as libpostal, but libpostal’s model is also 25x larger. Moreover, Airmail’s new parser is superior in some areas where libpostal falters.
Revisiting typos and prefix queries
Airmail’s new parser performs surprisingly well against malformed data, considering none is intentionally present in the training set. As before, the label after the arrow is how the last token of the query was classified.
# Prefix queries
1600 Pennsylvania Ave NW, W -> Road
1600 Pennsylvania Ave NW, Wa -> State
1600 Pennsylvania Ave NW, Was -> *City*
1600 Pennsylvania Ave NW, Wash -> Road
1600 Pennsylvania Ave NW, Washi -> *City*
1600 Pennsylvania Ave NW, Washin -> *City*
1600 Pennsylvania Ave NW, Washing -> *City*
1600 Pennsylvania Ave NW, Washingt -> *City*
1600 Pennsylvania Ave NW, Washingto -> *City*
1600 Pennsylvania Ave NW, Washington -> *City*
# Typos
1600 Pennsylvania Ave NW, Ashington -> *City*
1600 Pennsylvania Ave NW, Wshington -> *City*
1600 Pennsylvania Ave NW, Wahington -> *City*
1600 Pennsylvania Ave NW, Wasington -> *City*
1600 Pennsylvania Ave NW, Washngton -> *City*
1600 Pennsylvania Ave NW, Washigton -> *City*
1600 Pennsylvania Ave NW, Washinton -> *City*
1600 Pennsylvania Ave NW, Washingon -> *City*
1600 Pennsylvania Ave NW, Washingtn -> *City*
1600 Pennsylvania Ave NW, Washingto -> *City*
Not bad at all. This is just a small example, mind you, but it fits the trend I’ve noticed in the spot-checking I’ve done. Whether this is a win for a particular application depends on how common malformed data is, but I claim that for a geocoder a human is going to use on a mobile device, Airmail’s new parser is superior to libpostal, despite the gap in eval performance.
Where do we go from here?
I’d like to quantify this parser’s ability to handle malformed data. I’m not sure how I’m going to do that. I would also like to evaluate this parser against the Pelias parser, using both libpostal’s data and the Pelias test suite.
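One idea I’m considering: synthesize malformed variants of withheld queries and measure how often the last token’s label survives. A sketch of the variant generation only, assuming nothing about the eval harness:

```python
# Generate prefix and single-character-deletion variants of a query's
# last token, for measuring label stability under malformed input.
def prefix_variants(query: str):
    head, _, last = query.rpartition(" ")
    return [f"{head} {last[:n]}" for n in range(1, len(last))]

def deletion_variants(query: str):
    head, _, last = query.rpartition(" ")
    return [f"{head} {last[:i] + last[i + 1:]}" for i in range(len(last))]

for q in deletion_variants("1600 Pennsylvania Ave NW, Washington"):
    print(q)  # "... ashington", "... Wshington", ..., "... Washingto"
```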
If you have ideas, want to say hi, ask me a question or anything of the sort, feel free to send me an email.
P.S. I’m still working on cleaning up the training pipeline code. I’ll publish it in a sister repository to the Airmail repo once it’s presentable, and the model itself will be published on HuggingFace in the near future once I’m confident I’ve picked the low hanging fruit for possible improvements. ✨