Thursday, October 22, 2015

Comments on Source Tracking in Critical ETL Processes

This is more of a comment on the current state of ETL in many companies, with a small suggestion at the end.

In the past, I have worked on ETL processes that consume data from a source, perform work on that data, and then pass it further down the line with aggregations and attributes from other sources added to it.  This is your basic Extract, Transform, and Load environment: you take data from one place and move it to another.

Recently I have had the misfortune of working on several ETL processes that were designed without failure in mind, with no way to trace data from the original source it came from to the final destination it arrived at.  In the best-case scenario you can "re-run" the data and, while debugging somewhere three-quarters of the way through the code, determine what record each individual transaction WOULD have created in the destination, so that you can figure out what the destination should contain (if an error is reported).

This, of course, is a terrible way to go about it, especially if you have to process several million records at a time, the system deletes the source data on every single run, and the original data may not even be available in a timely manner (i.e. it could literally exist only on a tape drive).

The best approach is to simply bite the bullet and give yourself the data you need to determine what went wrong, or to identify the original data so that you can isolate it.

In a recent version of this problem, I had sales data coming in from a source, the data went through a considerable amount of processing (think 50 rules or so, all inter-dependent), and the final output had no record id or identifier linking it back to the original record it was generated from.  (I.e. if the input was 5, the output might be 4 records of 2.5, 2, .25, and .25.)

The problem with this system was that it took 4 hours to run, and if you wanted to debug even a single record, the only way to do so was to eat 3-4 hours of processing time. 

The solution?  In my case, I separated the long-running pieces into defined segments of code and made them independent of one another.  When you are consuming 3 million records but only need to identify what happens to the 2.5 millionth record (and order of processing matters), breaking the first 3 hours of processing out into its own chunk meant I could run the process for 3 hours once, and then have unlimited time to assess the single record and how it was flowing through the system.
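As a rough sketch of that approach (the stage names and checkpoint files here are made up, and the real rules were far more involved), each long-running piece reads from and writes to a checkpoint on disk, so the expensive part only has to run once:

```python
import pickle

def run_stage(stage_fn, input_path, output_path):
    """Run one independent pipeline segment against a checkpoint file."""
    with open(input_path, "rb") as f:
        records = pickle.load(f)
    result = stage_fn(records)
    with open(output_path, "wb") as f:
        pickle.dump(result, f)

# The long-running rules execute once and land in a checkpoint...
# run_stage(apply_sales_rules, "raw_extract.pkl", "after_rules.pkl")

# ...and the piece under investigation can then be re-run repeatedly
# against that checkpoint, without repeating the 3-hour portion.
# run_stage(final_aggregation, "after_rules.pkl", "final_output.pkl")
```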

The best solution?  Don't design systems where you get rid of your originating data on a daily basis, and make sure that you can track your data back to its original source.

If you have an input record of 5 and it creates 4 output records, keep a record id or source linkage back to that original record.  Sure, it will take up space, but even if your record id is just a pointer into a physical archived file (i.e. the record's position in the file, if your files are too large to store inline), this is better than having no linkage at all.
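Something like the following is all it takes.  This is only a sketch: the field names, the lambda standing in for the real rule engine, and the file name are all invented for illustration.

```python
def transform(source_records, split_rules):
    """Split each input amount into output rows that keep a link back to their source."""
    outputs = []
    for position, record in enumerate(source_records):
        # split_rules stands in for the real rule engine; e.g. 5 -> [2.5, 2, 0.25, 0.25]
        for amount in split_rules(record["amount"]):
            outputs.append({
                "amount": amount,
                "source_file": record["source_file"],
                "source_position": position,  # record position in the archived file
            })
    return outputs

# Hypothetical usage: every output row can now be traced to position 0 of the archive.
rows = transform(
    [{"amount": 5.0, "source_file": "sales_20151022.dat"}],
    lambda amount: [amount / 2, amount * 0.4, amount * 0.05, amount * 0.05],
)
```

Every output row now points back at a position in the archived file, so even if the live source table is purged daily, you can still find the originating record in the archive.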

Then MAKE SURE YOU TEST IT.  I have seen a situation where someone tried to do this, but the method they were using to read the file performed a sort, and the record count was being taken AFTER the sort had been done, so the positions no longer matched the original file.  The idea was great, but the code was literally in the wrong order.
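A stripped-down version of that mistake (names are hypothetical): the position is captured after the records have been re-ordered, so it no longer matches the record's place in the original file.

```python
def read_with_positions_broken(lines):
    # BUG: sorting first and numbering afterwards, so "position" is the
    # position in the sorted list, not in the original archived file.
    return [(position, line) for position, line in enumerate(sorted(lines))]

def read_with_positions_fixed(lines):
    # Capture the original file position first, then sort however you like.
    numbered = list(enumerate(lines))
    return sorted(numbered, key=lambda pair: pair[1])
```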

If you create something for a purpose such as testing, make sure you actually use it for testing, so you know it fulfills your needs.
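One cheap test, sketched here under the assumption that the rules are supposed to conserve the input amount, is a reconciliation check that the outputs linked to each source record add back up to that record:

```python
from collections import defaultdict

def reconcile(source_records, output_rows):
    """Verify every source amount is fully accounted for in the linked outputs."""
    totals = defaultdict(float)
    for row in output_rows:
        totals[row["source_position"]] += row["amount"]
    for position, record in enumerate(source_records):
        assert abs(totals[position] - record["amount"]) < 0.01, (
            f"record {position} does not reconcile: "
            f"{totals[position]} vs {record['amount']}"
        )
```

If the linkage is broken (as in the sort example above), a check like this blows up immediately instead of hours into a production run.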

In the end, make sure you have the ability to test something, and to link its results back, at every single step of the way.  That way you have an understanding of the code and what it is doing, and you can write good documentation and explain it to others later if need be.