Wednesday, July 17, 2013

How to parse US addresses without CASS certification.

One of the more interesting problems you can run into when coding, especially for physical mailing, is how to parse physical addresses and return a standardized set of addresses for matching records.

This is a nontrivial, difficult problem.  The short answer is to purchase software, or to have an address service provider consume your raw address data and send back the split-out address data you need in a file, and go from there.  In fact, if your data has to be mailed out for any reason, or comes from a low-integrity source such as a free-form text field on a web site, then this is the only way to do it.  Sites such as listcleanup.com, or marketing agencies such as KBM Group, can do this for you for a price.  Full stop, you can quit reading right now.

BUT, if you have data from a high-integrity source that is generally well formatted (such as a vendor or an internal file source), and you want to spend the least amount of time on the addresses and simply use the address parts for identification, then you can use a bit of address-parsing software.  Initially I was going to describe such a process, but instead I found a pretty darn good implementation for US addresses already on the internet, and figured I would post it.  Sure, it doesn't come from my own hand, but the method I was going to use was a generalized one, and this implementation has had considerably more thought put into it and seems far more accurate.

A nice little open-source C# implementation is available here:

http://usaddress.codeplex.com/
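
To give a feel for what "splitting an address into parts for identification" means, here is a minimal Python sketch of the generalized kind of approach. It is not the method the library above uses, and it is nowhere near CASS-grade. It assumes tidy, comma-separated input of the form "number street, [unit,] city, ST ZIP", and every name in it (ADDRESS_RE, SUFFIXES, parse_address) is made up for illustration, as is the tiny suffix table.

```python
import re

# Assumed layout: "<number> <street>, [<unit>,] <city>, <ST> <ZIP>[-<ZIP4>]".
# Anything that does not fit this shape is rejected rather than guessed at.
ADDRESS_RE = re.compile(
    r"""^\s*
    (?P<number>\d+)\s+
    (?P<street>[^,]+?)\s*,\s*
    (?:(?P<unit>(?:apt|suite|ste|unit)\s*[\w-]+)\s*,\s*)?
    (?P<city>[^,]+?)\s*,\s*
    (?P<state>[A-Za-z]{2})\s+
    (?P<zip>\d{5})(?:-(?P<zip4>\d{4}))?
    \s*$""",
    re.VERBOSE | re.IGNORECASE,
)

# Illustrative subset of street-suffix abbreviations; a real table is much longer.
SUFFIXES = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "ROAD": "RD",
    "DRIVE": "DR",
    "BOULEVARD": "BLVD",
}


def parse_address(raw):
    """Split a well-formatted US address into upper-cased parts, or return None."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    parts = {
        key: (value.strip().upper() if value else None)
        for key, value in match.groupdict().items()
    }
    # Normalize a trailing street suffix so "Main Street" and "Main St" compare equal.
    words = parts["street"].split()
    if words and words[-1] in SUFFIXES:
        words[-1] = SUFFIXES[words[-1]]
    parts["street"] = " ".join(words)
    return parts
```

Calling parse_address("123 Main Street, Springfield, IL 62704") and parse_address("123 main st, springfield, il 62704") yields the same dictionary of parts, which is the point when all you want is a key for matching records. Anything the pattern does not recognize comes back as None, so those rows can be set aside for a proper address service instead of being matched on bad data.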