ellen's disorganized thoughts

i put incoherent things here, read at your own risk

Disclaimer: I did not quantifiably break SOTA, but I claim my parser is dramatically better than SOTA for geocoding purposes, for reasons I elaborate on below.

Street address parsing is a very hard problem. It’s one of those topics that I could write a pages-long “Myths programmers believe about X” think-piece on. For the curious, Al Barrentine’s post on the subject announcing libpostal 1.0 is a must-read. For our purposes, suffice to say it’s a tough nut to crack.

The state of the art circa 2024

Libpostal is the ultimate address parser for people who don’t make typing errors. It is unmatched, remains unmatched, and after toiling at this problem for a few weekends with modern NLP tools like transformers, I have personally concluded that nobody will be able to beat the novel CRF architecture of libpostal for well-formed queries for a long time. Libpostal’s dominance comes from its ability to simply memorize everything. The vocabulary is enormous, and the CRF weights are very sparse, which allows it to parse a query in close to 100μs. This memorization comes with two big downsides that motivated my search for a better (but also worse) address parser.

Libpostal downside 1: The model is enormous.

The SOTA libpostal model weighs in at ~2.2GiB after post-processing for sparsity, and it all needs to be in RAM for parsing to work. It currently can’t be memory mapped due to some of the complexities of the on-disk format. For the geocoder I’m working on, Airmail, storage and memory requirements like that are a deal-breaker.

Libpostal downside 2: Typos and prefix queries

Unfortunately for those of us out there working on geocoders, libpostal breaks when presented with typos or partial queries. For example, if you ask libpostal to parse the address “1600 pennsylvania ave nw washingto”, it will tell you that 1600 is a house number (great!) and “pennsylvania ave nw washingto” is a street (less great).

Libpostal’s CRF-based architecture is approachable and comprehensible, rare in the machine learning world. Because of its approachability, I’m willing to claim that I know why this case fails. Simply put, Libpostal has an enormous vocabulary that it uses very opportunistically. “Washington” is a highly predictive token, so libpostal’s CRF rightly assigns a very high weight to it being present. Other features that it could associate with that token (for example, various substrings or root words) are largely ignored by libpostal’s training routine once the model is able to classify it correctly. The net effect of performing this type of training over the entire corpus is that the most predictive features associated with each token are incorporated into the CRF weights, and less predictive features are ignored unless they are required for correct classification. Worse still, sparsifying a libpostal model for a production deployment involves selective removal of non-predictive weights, further reducing the model’s ability to generalize.
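To make that concrete, here’s a toy sketch of the failure mode. The features and weights below are invented for illustration (this is not libpostal’s actual feature set): a scorer that learned a huge weight on the exact token “washington” and nothing on its substrings gets no signal the moment a character goes missing, and weak context features win instead.

# Toy illustration (invented features and weights, not libpostal's model):
# a memorizing linear scorer keys almost everything on the exact token.
WEIGHTS = {
    ("word=washington", "city"): 9.0,  # exact token: hugely predictive
    ("prefix3=was", "city"): 0.0,      # substring features never earned weight
    ("suffix3=ton", "city"): 0.0,
    ("prev=nw", "road"): 1.5,          # context weakly favors "road"
}

def features(token, prev):
    return [f"word={token}", f"prefix3={token[:3]}", f"suffix3={token[-3:]}", f"prev={prev}"]

def score(token, prev, label):
    return sum(WEIGHTS.get((f, label), 0.0) for f in features(token, prev))

for token in ["washington", "washingto"]:
    best = max(["city", "road"], key=lambda label: score(token, "nw", label))
    print(token, "->", best)
# washington -> city   (the exact-match feature fires)
# washingto  -> road   (nothing fires for "city", so the weak context feature wins)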

If you’re building a geocoder around libpostal, what this means is that you can’t parse malformed or prefix queries consistently. To illustrate, I’ve included some selected examples of typos and partial queries. The label after the arrow is how the last token of the query was parsed. Whenever an example is marked “Road”, that means libpostal includes the last token in the portion of the query it identifies as a road. I include asterisks around the label in examples it gets right.

# Prefix queries
1600 Pennsylvania Ave NW, W -> Road
1600 Pennsylvania Ave NW, Wa -> State
1600 Pennsylvania Ave NW, Was -> Road
1600 Pennsylvania Ave NW, Wash -> Road
1600 Pennsylvania Ave NW, Washi -> *City*
1600 Pennsylvania Ave NW, Washin -> Road
1600 Pennsylvania Ave NW, Washing -> Road
1600 Pennsylvania Ave NW, Washingt -> Road
1600 Pennsylvania Ave NW, Washingto -> Road
1600 Pennsylvania Ave NW, Washington -> *City*

# Typos
1600 Pennsylvania Ave NW, Ashington -> *City*
1600 Pennsylvania Ave NW, Wshington -> Road
1600 Pennsylvania Ave NW, Wahington -> Road
1600 Pennsylvania Ave NW, Wasington -> Road
1600 Pennsylvania Ave NW, Washngton -> Road
1600 Pennsylvania Ave NW, Washigton -> Road
1600 Pennsylvania Ave NW, Washinton -> Road
1600 Pennsylvania Ave NW, Washingon -> Road
1600 Pennsylvania Ave NW, Washingtn -> Road
1600 Pennsylvania Ave NW, Washingto -> Road

This is unfortunately not a great hit rate for those of us on mobile with clumsy thumbs.

New parser, new beginnings

In February 2024 I began exploring the use of very small BERT models for various geocoding tasks, like fuzzily mapping nightmare-fuel categorical queries such as “where can I find falafel” and “I need to go swimming right now otherwise I'm literally going to die” onto OSM tags like “amenity=restaurant” and “leisure=swimming_pool”.
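I won’t claim the sketch below is how my categorical search model actually works; it’s just the simplest way to illustrate the shape of the problem: embed the query and a short description of each candidate tag, then pick the nearest one. The model name and the tag descriptions are stand-ins, not anything Airmail ships.

# Minimal sketch of fuzzy query -> OSM tag mapping via sentence embeddings.
# Not Airmail's actual model; the tag list and descriptions are illustrative.
from sentence_transformers import SentenceTransformer, util

TAGS = {
    "amenity=restaurant": "a place to eat food, restaurant, falafel, takeout",
    "leisure=swimming_pool": "a place to swim, public swimming pool",
    "amenity=pharmacy": "a place to buy medicine",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # small off-the-shelf model
tag_names = list(TAGS)
tag_vectors = model.encode(list(TAGS.values()), convert_to_tensor=True)

def best_tag(query: str) -> str:
    query_vector = model.encode(query, convert_to_tensor=True)
    similarities = util.cos_sim(query_vector, tag_vectors)[0]
    return tag_names[int(similarities.argmax())]

print(best_tag("where can I find falafel"))         # amenity=restaurant
print(best_tag("I need to go swimming right now"))  # leisure=swimming_pool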

Once satisfied with the performance of my categorical search model I turned to address parsing. Aside from a short stint with a hand-coded address parser in the early days, Airmail has been largely parserless. I was initially surprised at how little this affected geocoding performance. It hasn’t been much of a problem, but I knew that we were leaving a little bit of performance and some QPS on the table by throwing every term into a single multi-valued string field, so it’s been on my to-do list to fix.

In March 2024, I decided to try training a BERT model on the libpostal dataset for token classification. As an aside, I owe Al Barrentine an enormous “thank you” for releasing the entire training dataset for libpostal, and another big “thank you” for putting so much effort into ensuring it is as complete as it can be across so many different axes. I may also owe Al an apology for only using part of the dataset in my own efforts.

Because of limited compute, I trained my BERT model on the OpenStreetMap formatted addresses and formatted places datasets until epoch ~0.3, or roughly 30% of the data. If you’d like to see it trained against the full dataset, please consider sponsoring my maps work on GitHub to help cover the cost of compute.
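For the curious, the rough shape of fine-tuning a small BERT for token classification with HuggingFace looks something like the sketch below. This is not my actual training pipeline (that’s still being cleaned up for release); the base model, the label set, and the single example row are placeholders.

# Rough shape of fine-tuning a small BERT for address token classification.
# Placeholder model, labels, and data; not the actual Airmail training pipeline.
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["house_number", "road", "city", "state", "postcode"]
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}

rows = [  # in reality: millions of libpostal-formatted address strings
    {"tokens": ["1600", "pennsylvania", "ave", "nw", "washington"],
     "labels": ["house_number", "road", "road", "road", "city"]},
]

tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")

def encode(row):
    enc = tokenizer(row["tokens"], is_split_into_words=True, truncation=True)
    # Label the first wordpiece of each word; ignore (-100) special tokens
    # and continuation pieces so they don't contribute to the loss.
    labels, prev = [], None
    for word_id in enc.word_ids():
        labels.append(-100 if word_id is None or word_id == prev
                      else LABEL_TO_ID[row["labels"][word_id]])
        prev = word_id
    enc["labels"] = labels
    return enc

dataset = Dataset.from_list(rows).map(encode, remove_columns=["tokens", "labels"])
model = AutoModelForTokenClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=len(LABELS))
Trainer(model=model,
        args=TrainingArguments(output_dir="address-parser", num_train_epochs=1),
        train_dataset=dataset).train()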

To temper expectations, my BERT model weighs in at under 80MiB when stored with bfloat16 weights. Very small for a transformer. It’s fast (for a transformer) on account of its low parameter count, but it still takes about 20ms to parse a query on the CPU. ~200x slower than libpostal.

Al chose to evaluate libpostal based on full-parse correctness. In other words, you get zero credit towards a correct parse if you mess up a single token of the query. By this metric, Airmail’s new parser achieves a correct parse rate of ~98.72% against withheld data (for the non-ML people: data that the model has never trained on). Quite a difference from libpostal’s ~99.45%. Keep in mind that my train/eval data is from the OpenStreetMap libpostal training data. I’m not testing any of the OpenAddresses data, so this is not a perfect comparison.
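To be explicit about the metric: a query only counts as correct if every token’s predicted label matches the reference, so the scoring is as simple as the following (the label names are illustrative).

# Full-parse (exact-match) correctness: a query scores 1 only if every
# token label matches the reference labeling, otherwise 0.
def full_parse_accuracy(predicted, reference):
    assert len(predicted) == len(reference)
    exact_matches = sum(p == r for p, r in zip(predicted, reference))
    return exact_matches / len(reference)

# e.g. one query fully right out of two:
preds = [["house_number", "road", "city"], ["house_number", "road", "road"]]
refs  = [["house_number", "road", "city"], ["house_number", "road", "city"]]
print(full_parse_accuracy(preds, refs))  # 0.5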

Moving the goalposts

Backtracking a bit, recall that Airmail is currently parserless. It indexes all tokens and token phrases associated with a POI in a single multi-valued string field. We achieve 100% parsing accuracy by simply not parsing, and geocoding performance is surprisingly good, especially considering the project is still in its infancy. On account of this, I’m going to claim that for a geocoder, it’s not necessary to use all of libpostal’s 20 token categories. There are really only six labels that I care deeply about distinguishing from one another (a sketch of how libpostal’s labels collapse onto them follows the list).

  1. Administrative areas like suburbs, cities, states, etc.

  2. Postal codes

  3. Roads

  4. House numbers

  5. Unit/level/entrance/staircase, etc. Any address component more specific than a house number.

  6. Category or POI name. Notably, I don’t mind if these two are confused with one another.
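Concretely, collapsing libpostal’s labels down to these six buckets is just a small lookup table, roughly like the sketch below. The exact grouping here is illustrative; the real table will be in the training pipeline code when I publish it.

# Collapse libpostal's fine-grained labels into the six coarse buckets above.
# Illustrative grouping; the exact table lives in the training pipeline.
COARSE = {
    "suburb": "admin", "city_district": "admin", "city": "admin",
    "island": "admin", "state_district": "admin", "state": "admin",
    "country_region": "admin", "country": "admin", "world_region": "admin",
    "postcode": "postcode",
    "road": "road",
    "house_number": "house_number",
    "unit": "sub_building", "level": "sub_building", "staircase": "sub_building",
    "entrance": "sub_building", "po_box": "sub_building",
    "house": "name_or_category", "category": "name_or_category", "near": "name_or_category",
}

def collapse(labels):
    return [COARSE[label] for label in labels]

print(collapse(["house_number", "road", "road", "road", "city"]))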

By mapping libpostal labels onto this subset, I achieve a ~99.09% correct parse rate against withheld data, again applying the full-sequence correctness metric. That’s nearly twice as many errors as libpostal, but libpostal’s model is also 25x larger. Moreover, Airmail’s new parser is superior in some areas where libpostal falters.

Revisiting typos and prefix queries

Airmail’s new parser performs surprisingly well against malformed data, considering none is intentionally present in the training set. As before, the label after the arrow is how the last token of the query was classified.

# Prefix queries
1600 Pennsylvania Ave NW, W -> Road
1600 Pennsylvania Ave NW, Wa -> State
1600 Pennsylvania Ave NW, Was -> *City*
1600 Pennsylvania Ave NW, Wash -> Road
1600 Pennsylvania Ave NW, Washi -> *City*
1600 Pennsylvania Ave NW, Washin -> *City*
1600 Pennsylvania Ave NW, Washing -> *City*
1600 Pennsylvania Ave NW, Washingt -> *City*
1600 Pennsylvania Ave NW, Washingto -> *City*
1600 Pennsylvania Ave NW, Washington -> *City*

# Typos
1600 Pennsylvania Ave NW, Ashington -> *City*
1600 Pennsylvania Ave NW, Wshington -> *City*
1600 Pennsylvania Ave NW, Wahington -> *City*
1600 Pennsylvania Ave NW, Wasington -> *City*
1600 Pennsylvania Ave NW, Washngton -> *City*
1600 Pennsylvania Ave NW, Washigton -> *City*
1600 Pennsylvania Ave NW, Washinton -> *City*
1600 Pennsylvania Ave NW, Washingon -> *City*
1600 Pennsylvania Ave NW, Washingtn -> *City*
1600 Pennsylvania Ave NW, Washingto -> *City*

Not bad at all. This is just a small example, mind you, but it fits the trend I’ve noticed in the spot-checking I’ve done. Whether this is a win for a particular application depends on how common malformed data is, but I claim that for a geocoder a human is going to use on a mobile device, Airmail’s new parser is superior to libpostal, despite the gap in eval performance.

Where do we go from here?

I’d like to quantify this parser’s ability to handle malformed data. I’m not sure how I’m going to do that. I would also like to evaluate this parser against the Pelias parser, using both libpostal’s data and the Pelias test suite.

If you have ideas, want to say hi, ask me a question or anything of the sort, feel free to send me an email.

P.S. I’m still working on cleaning up the training pipeline code. I’ll publish it in a sister repository to the Airmail repo once it’s presentable, and the model itself will be published on HuggingFace in the near future once I’m confident I’ve picked the low hanging fruit for possible improvements. ✨

About a month ago I began work on a new geocoder (demo), a search engine for places and addresses. I wanted to make something very inexpensive to run. A big barrier to entry for hosting a planet-scale headway instance is the geocoder. Right now we’re using Pelias, which is great at what it does, but it runs on Elasticsearch, which doesn’t do well on <8GB of RAM. I’ve been poking at this problem off and on for years and didn’t expect to get anything working, but much to my surprise things shaped up pretty quickly.

I was able to cobble together a mediocre address parser based on nom, drawing an immense amount of inspiration from the Pelias parser. Armed with an okay address parser, I turned to tantivy as a search engine, thinking I’d take advantage of its ability to memory-map the search index. After a bit of digging, I found tantivy-wasm, which runs in the browser and issues range queries to fetch bits of the index as needed. When I saw that, the gears really started turning in my head. I didn’t want to fork a search engine library, so I implemented my own backing store for mainline tantivy using anonymous memory maps and userfaultfd to fetch chunks of the index from object storage on demand via range queries. It worked, and after a bit of tuning the latency is getting into pretty acceptable ranges: around 1-3 seconds generally, and I’ve seen it as fast as 2ms for a simple query when the cache is hot.
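The userfaultfd plumbing is Rust and too long to inline here, but the core fetch-on-miss idea is simple enough to sketch in a few lines of Python: when part of the index is touched and isn’t resident yet, fetch just that byte range from object storage and cache it. The bucket URL and chunk size below are made up.

# Fetch-on-miss sketch: pull only the touched chunk of the index from object
# storage via an HTTP Range request. Illustrative only; the real thing uses
# anonymous mmaps + userfaultfd in Rust. URL and chunk size are made up.
import requests

INDEX_URL = "https://example-bucket.example.com/airmail/index.bin"  # hypothetical
CHUNK = 256 * 1024
cache = {}  # chunk index -> bytes

def read(offset: int, length: int) -> bytes:
    out = bytearray()
    first_chunk = offset // CHUNK
    last_chunk = (offset + length - 1) // CHUNK
    for chunk_idx in range(first_chunk, last_chunk + 1):
        if chunk_idx not in cache:  # the "page fault": chunk not resident yet
            start = chunk_idx * CHUNK
            resp = requests.get(INDEX_URL,
                                headers={"Range": f"bytes={start}-{start + CHUNK - 1}"})
            resp.raise_for_status()
            cache[chunk_idx] = resp.content
        out += cache[chunk_idx]
    begin = offset - first_chunk * CHUNK
    return bytes(out[begin:begin + length])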

Fly.io charges me about $3/mo for the machine it runs on, and just under $7/mo for the 320-ish gigabytes of data in object storage, so this has been a very affordable project. I’m pretty happy with the results for the price. I’m planning on extending it to handle more locales and use-cases over the coming months. I made a demo site where you can play with it, but don’t expect miracles. You get what you pay for, and I haven’t indexed OpenAddresses for the demo site, so if something isn’t in OpenStreetMap I definitely do not have it in the index. The code is all open-source.

Cheers ✨✨

I’ve grown a bit tired of the temperature variation displayed by my pre-millennium La Pavoni. It makes it really difficult to pull a good shot without babysitting it for a while to get everything to the right temperature. I’ve thought about a few different solutions, including upgrading to machines clocking in north of 1000 USD. Ultimately I decided that the repairability and simplicity of the Europiccola isn’t something I’m willing to give up. I don’t drink or steam milk, so maintaining the single boiler at a reasonable temperature is all I need.

I ended up buying a very convenient bang-bang temperature controller (ITC-608T) by Inkbird for greenhouses and homebrewing. It comes with a thermocouple probe that I covered in thermal paste and taped to the side of the boiler with some polyimide tape. I’ve found that regardless of boiler temperature, my machine in thermal equilibrium will either have insufficient steam pressure or produce burnt-tasting espresso. My quest continues. I may try to add some heatsinks to the grouphead next.
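For anyone unfamiliar with the term, “bang-bang” control just means full-on below the setpoint and full-off above it, usually with a bit of hysteresis so the relay isn’t constantly chattering. Roughly like this, with an invented setpoint and deadband:

# Bang-bang (hysteresis) control, the kind of scheme an on/off controller like
# the ITC-608T uses: heater fully on below the band, fully off above it.
SETPOINT_C = 92.0   # illustrative target boiler-surface temperature
DEADBAND_C = 1.5    # illustrative hysteresis band

def heater_should_be_on(temp_c: float, currently_on: bool) -> bool:
    if temp_c <= SETPOINT_C - DEADBAND_C:
        return True
    if temp_c >= SETPOINT_C + DEADBAND_C:
        return False
    return currently_on  # inside the band: keep doing whatever we were doing

print(heater_should_be_on(89.0, currently_on=False))  # True: too cold, turn on
print(heater_should_be_on(94.0, currently_on=True))   # False: too hot, turn off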

The Nokia 3310. It's hard to think of a more iconic cell phone. I never had one as a kid; they were a few years before my time. But their legacy lives on, along with a shocking quantity of devices. If you've never taken one apart, it's a real trip. Nokia designed them for disassembly and reassembly. Compared to the repairability of a modern disposable mobile device, it's staggering how easy these things are to work on. All of the connections to the motherboard use spring contacts. While working on them, you won't encounter FPC cables, finicky connectors, or opportunities to lift solder pads.

The idea of a major manufacturer designing a mobile phone like this today is nearly unthinkable. There are many reasons for this. The two I want to highlight are competitive pressure for miniaturization and cost savings. Spring contacts add thickness to the device and are more expensive than FPC cables. An even cheaper option than an FPC cable is integrating everything onto the same board. This option is correspondingly worse for repairability.

A side effect of Nokia designing their devices like this is that they can continue to benefit society even after 2G network operators shut them out of the cellular network. Rather than building an entirely new hardware platform, makers can build a new mainboard for a device like a 3310 using the existing display, antenna, and I/O capabilities, which I'm doing for one of my upcoming projects.

A few photos of the mainboard layout follow, and I will release a KiCad footprint shortly for those who want to make custom mainboards for this platform.

This is mostly a venting post; it’s not going to be especially coherent, but it feels important to write. As a bit of background, LoRa is a marketing term created by Semtech for a proprietary chirp spread spectrum modulation scheme that they use in some of their radio transceivers. The name is short for “long range,” it’s marketed heavily as low-power, and I will concede that their receivers have impressive sensitivity ratings. But at a very steep cost. Active receive current draw is around 10mA, and there’s no low-duty-cycle mode or really any other mode you can use to reduce that. That’s such a bummer to me. A cellular radio can maintain an association with a tower in eDRX mode drawing under a milliamp on average. Meanwhile, the “low-power” alternative to cellular draws 10x that current and is hailed as revolutionary.

What’s especially frustrating is that there are actual low-power ISM transceivers out there. Notably, the STMicro S2-LP draws around 21 microamps on average in low-duty-cycle receive at 1200 baud (not a typo, nearly 500x less power draw than LoRa).
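If you want to sanity-check that ratio, the back-of-the-envelope math is below. The battery capacity is a made-up round number, and this ignores self-discharge and everything else the device does besides listening.

# Back-of-the-envelope receive-current comparison (battery size is illustrative).
BATTERY_MAH = 1000.0

lora_rx_ma = 10.0    # typical LoRa active receive current, no duty-cycled RX mode
s2lp_rx_ma = 0.021   # S2-LP average current in low-duty-cycle RX at 1200 baud

print(f"ratio: {lora_rx_ma / s2lp_rx_ma:.0f}x")                                  # ~476x
print(f"LoRa RX drains the cell in ~{BATTERY_MAH / lora_rx_ma / 24:.0f} days")
print(f"S2-LP LDC RX drains it in ~{BATTERY_MAH / s2lp_rx_ma / 24 / 365:.1f} years")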

There’s not really much more to this post than that. LoRa is not low power. It’s impressive for other reasons:

  1. Good marketing.

  2. Good link budget.

  3. Extremely approachable for hobbyist use.

That last point is not to be underestimated. I’m working on making the S2-LP a little bit more approachable for hobbyists as a part of one of my current projects. I’d like to get to the point where there are $2 counterfeits of my hardware on AliExpress and a vibrant community of makers building cool stuff with it, with 500x lower power draw than LoRa modules. :)

Every time I send out a new prototype for manufacturing, I spend a lot of time thinking about the possibility of it not working. The reasons for this are twofold: time and energy. It's not a good use of my time to wait for another board revision if the first doesn't work, but the energy I'm concerned with is not my own. Semiconductor manufacturing, and to a lesser extent PCB manufacturing, is an environmentally destructive process. I refrain from sending off broken designs to keep the board house from burning coal to create a useless PCB, and to keep myself from wasting components with far higher embodied energy than their small size may imply.

I've decided to employ more modularity in my workflow to help address this. My usual approach for designing a prototype has been to create a bespoke PCB design and send it to be manufactured and assembled. If that design doesn't work, I may waste ten microcontrollers on PCBA for that iteration. The approach I'm using now is to have a stock of generic modules on hand for assembling prototypes. Instead of paying JLCPCB to put a microcontroller on each new board, I may add a microcontroller module in a compact LGA footprint. I’m also working on an S2-LP transceiver module and an Si2141-based RF frontend module. Most of my new boards still use some surface-mount components, but the ones that are most expensive and carbon-intensive to manufacture are attached as LGA modules. By using modules, I can assemble a single board, test it, and decide whether to assemble more. LGA modules are also easy to desolder with nothing more than a hot plate, and they're usable for a wide variety of designs. I can even cannibalize one prototype to assemble a new one by moving a module from one board to another. What I give up in miniaturization I gain back many times over in flexibility.

I'd like to begin selling these modules and development boards based on them, and the ability to quickly assemble a new board from a shared stock of modules if I run low on inventory will enable me to be more efficient in how I stock products, which is helpful for a low-volume business. All of these designs are also open-source, allowing users to modify them and create derivatives. By making it easier for an end-user to repair or repurpose a product, I can directly impact the amount of e-waste I generate.

I'd love to see the rest of the industry move in this direction. It empowers makers and feels like the right thing to do.
