Thursday, October 22, 2015

Comments on Source Tracking in Critical ETL Processes

This is more of a comment on the current state of ETL in many companies, with a small suggestion at the end.

In the past, I have worked on ETL processes that consume data from a source, perform work on that data, and then pass it further down the line with aggregations and attributes from other sources added to it.  This is your basic Extract, Transform, and Load environment: you take data from one place, and you move it to another.

Recently I have had the misfortune to work on many ETL processes that are designed without failure in mind, and with no way of tracking data from the original source it came from to the final destination it arrived at.  In the best case scenario you can "re-run" the data, and during debugging, somewhere three quarters of the way through the code, you can determine what record each individual transaction WOULD have created in the destination, so that you can then figure out what the destination should have been (if there is a reported error).

This, of course, is a terrible way to go about it, especially if you have to process several million records at a time, the system deletes the source data on every single run, and the original data may not even be available in a timely manner (i.e., it could literally exist only on a tape drive).

The best method for all of this is to simply bite the bullet and give yourself the data you need to determine what went wrong, or at least to identify the original data so that you can narrow the problem down.

In a recent version of this problem, I had sales data coming in from a source, the data went through a considerable amount of processing (think 50 or so rules, all inter-dependent), and the final output had no individual record or identifier that linked it back to the original record that generated it.  (I.e., if the input was 5, the output might be four records of 2.5, 2, 0.25, and 0.25.)

The problem with this system was that it took four hours to run, and if you wanted to debug even a single record, the only way to do so was to eat three to four hours of processing time.

Solution?  In my case I separated the long-running pieces into defined segments of code and made them independent of one another.  When you are consuming 3 million records but only need to see what happens to the 2.5 millionth record (and order of processing matters), breaking the first three hours of processing into its own chunk meant I could run that once, and then have unlimited time to assess the single record and how it flowed through the system.
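As a rough sketch of that segmentation idea (the function and file names here are hypothetical, not the actual system): persist the output of the expensive stage once, then iterate on the cheap final stage against that saved intermediate data.

import pickle
from pathlib import Path

CHECKPOINT = Path("stage1_output.pkl")  # hypothetical intermediate file

def expensive_stage(source_rows):
    # Stand-in for the ~3 hours of interdependent rules.
    return [{"id": r["id"], "value": r["value"] * 2} for r in source_rows]

def final_stage(intermediate_rows):
    # Stand-in for the last, cheap step you actually need to debug.
    return [{"id": r["id"], "value": round(r["value"], 2)} for r in intermediate_rows]

def run(source_rows, reuse_checkpoint=True):
    if reuse_checkpoint and CHECKPOINT.exists():
        # Re-runs skip the long stage entirely and reload its saved output.
        intermediate = pickle.loads(CHECKPOINT.read_bytes())
    else:
        intermediate = expensive_stage(source_rows)
        CHECKPOINT.write_bytes(pickle.dumps(intermediate))
    return final_stage(intermediate)

With the checkpoint in place, only the first run pays the three-hour cost; every debugging iteration afterward starts from the saved intermediate data.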

The best solution?  Don't design systems that get rid of your originating data on a daily basis, and make sure you can track your data back to its original source.

If you have an input record of 5 and it creates 4 output records, keep a record ID or source linkage back to that original record.  Sure, it will take up space, but even if your record ID is just a pointer into a physical archived file (i.e., the record's position in the file, if your files are too large to store), that is better than having no linkage at all.
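A minimal sketch of what that linkage can look like (the field names and split percentages are invented for illustration): every output row carries the source record's ID and its position in the file, so the four outputs can always be traced back to the one input.

def split_record(source_record, source_position):
    # Split one input amount into several output rows, each tagged with lineage.
    amount = source_record["amount"]
    parts = [amount * 0.5, amount * 0.4, amount * 0.05, amount * 0.05]
    return [
        {
            "amount": part,
            "source_record_id": source_record["id"],   # link back to the input record
            "source_file_position": source_position,   # or just the row offset in the archive
        }
        for part in parts
    ]

def process_file(records):
    outputs = []
    for position, record in enumerate(records):
        outputs.extend(split_record(record, position))
    return outputs

# One input of 5 becomes four outputs (2.5, 2, 0.25, 0.25), all pointing home.
print(process_file([{"id": "A-1001", "amount": 5.0}]))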

Then MAKE SURE YOU TEST IT.  I have seen a situation where someone tried to do this, but the method they were using to read the file performed a sort, and the record count was being captured AFTER the sort had been done.  The idea was great, but the code was literally misplaced in the order of operations.
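The fix for that particular bug is purely about ordering: capture each record's position while you read, before anything reorders the data.  A small hypothetical example:

def read_with_positions(lines):
    # Tag each record with its position in the source file FIRST...
    tagged = [{"source_position": i, "value": line.strip()} for i, line in enumerate(lines)]
    # ...then sort, group, or do whatever else the load needs.
    return sorted(tagged, key=lambda r: r["value"])

rows = read_with_positions(["banana", "apple", "cherry"])
# The sort changed the order, but each row still knows where it came from.
print(rows)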

If you create something for a purpose such as testing, make sure you actually use it for testing, so you know it fulfills your needs.

In the end, make sure you have the ability to test something, and to link its results back, at every single step of the way.  That means you understand the code, you understand what it is doing, and you can write good documentation and explain it to others later if need be.


Wednesday, July 29, 2015

SSIS vs Informatica

So I have been thinking a lot about Informatica vs SSIS lately, and I have finally formed some opinions about where Informatica is much better and where SSIS is better.

If I could say Informatica is better than SSIS in one major way, it is debugging.  In SSIS, debugging a problem is relatively difficult if you need to get down into the guts of a data flow transformation, but that is relatively easy in Informatica.

SSIS, though, has its own advantages, notably configuration and logging.  SSIS can do package configurations at the database level, and do so relatively dynamically.  The fact that SSIS maintains each package as its own distinct "program" also helps.  Informatica has some issues when it comes to logging, especially with something that can run multiple times but does not error consistently: if you don't configure it right, you lose the instance that errored.  SSIS logs to the database per run and isolates its logging accordingly.
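That per-run isolation can be approximated in any tool by keying every log row to a run identifier, so a later clean run never hides the instance that failed.  A rough sketch of the idea (the table and column names are invented, and SQLite stands in for the real logging database):

import sqlite3
import uuid
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # stand-in for the real logging database
conn.execute("CREATE TABLE run_log (run_id TEXT, logged_at TEXT, level TEXT, message TEXT)")

def start_run():
    # New identifier per execution, so each run's rows stay separate.
    return str(uuid.uuid4())

def log(run_id, level, message):
    conn.execute(
        "INSERT INTO run_log VALUES (?, ?, ?, ?)",
        (run_id, datetime.now(timezone.utc).isoformat(), level, message),
    )

run_id = start_run()
log(run_id, "ERROR", "row 2,500,000 failed a rule")
# A later, successful run gets its own run_id, so this failure stays queryable.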

Informatica also does a much better job of handling large amounts of data in its built-in components.  For example, when you push 10 million or so rows through a lookup transform in Informatica, if you configure it right, it will outperform SSIS hands down.
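Much of what "configure it right" means for a lookup at that volume is caching: load the reference data once and probe it in memory per row, instead of hitting the database ten million times.  A simplified sketch (the reference data and key names are made up):

def build_lookup_cache(reference_rows):
    # Load the lookup table into memory once, keyed by the join column.
    return {row["product_code"]: row["category"] for row in reference_rows}

def enrich(fact_rows, cache):
    # Each fact row is a single dictionary probe, not a database round trip.
    return [dict(row, category=cache.get(row["product_code"], "UNKNOWN")) for row in fact_rows]

cache = build_lookup_cache([{"product_code": "P1", "category": "Hardware"}])
print(enrich([{"product_code": "P1", "amount": 5.0}], cache))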

SSIS beats it in one major way, though: you can leverage SQL Server much more efficiently, especially when pulling data from stored procedures.  Informatica just doesn't have the same SQL Server integration, which means you can't have the full power of SQL Server at your fingertips the entire time without some hokey hacks to get around how Informatica is built.

SSIS and Informatica both have built-in scripting ability, but SSIS far outperforms here because it lets you use C# and import C# libraries.  I know there are those who believe Java is powerful, but there is a lot you can do with the .NET framework, which is easy to install and generally standard, that gives SSIS a big advantage.  Both Informatica and SSIS have issues when it comes to debugging those scripting languages.

Just some thoughts as I sit down.  I have had the opportunity lately to go back and do some difficult SSIS tasks while doing some particularly hard Informatica tasks along the same lines, and I find both to have interesting ups and downs when it comes to development.

Monday, February 23, 2015

Data Dictionary

I found a great product that produces some nice data dictionaries.  SQL code is provided, as well as a document creation tool.

http://www.csvreader.com/posts/data_dictionary.php

I like it quite a bit; it makes it easier to produce a data dictionary for the user to view when requested, and it gives you the ability to manage descriptions at the database level, so generating documentation on an ongoing basis is easy to do.