Wednesday, January 8, 2014

Comment on developing in different technologies.

Over the years I have developed in a number of languages, in multiple environments, for the specific purpose of doing Extract, Transfer, and Load (ETL) operations.  Lately I have been working with Informatica, C#, and SSIS in the same work place at the same time, and I have some comments on the progression of ETL technology from having worked in the field for 10 years.

When I first started doing ETL work, the main focus of my job was to get data into a system by first manually pulling data into an ACCESS database, doing manual data hygiene (manually "fixing" data), and then putting that data into either a file, or into a staging table in a database.  

In this early world of ETL, I worked often with vb script, access, excel, and several other "manual" data manipulation technologies.  I also got to work with microsofts old DTS project.  DTS is a pre-cursor to the SSIS technologies used at many companies today, in fact many naming conventions in SSIS still use "Dts" in their naming.    

When I was working in DTS, we had a lot of crazy standards in the company I was at, because of what developers were comfortable in.  We worked in python and vb script within the dts packages. We would use this to manipulate data flows, alter column data, etc... in the end the dts packages were being used to get data from a file or other database, and move it to another data base.  It had multiple command line options, execution options, and was generally useful all around, if a bit clunky in the interface.

Later developmental focus shifted to SSIS, and manipulating data inside of that environment, in this case because of evangelism for the product through several key developers in the company, and the need for the flexibility of using slightly more powerful scripting and manipulation abilities (expressions, conditional splits based on expressions, easier manageability of success conditions etc...).    

SSIS was not "bad" but it certainly had its quirks, in fact years later I realized that much of my problems with it at the beginning is because I didn't automatically adjust for the quirks in the UI, and internally manage myself around them at the time.  What I like about SSIS, and still like about it today, is the wizard functionality, the ability to easily and quickly prototype things.  What I disliked (at least in 2005 SSIS) is that it had such poor management of its pipeline compared to DTS.  When you changed something, you had to populate that change even when nothing of significance (IE the names of things) had changed, and it could be a massive chore.  At that time we had yet to train ourselves NOT to try to do the most advanced features, such as creating custom components, and custom components never managed the meta data properly, in the end it was really good compared to DTS, but never really realized its full potential.

Later, in the same company, we moved to C#.  This caused a slew of headaches internally, but I had the argument with the C# evangelist internally that C# would be a great ETL tool, as long as we could quickly create code using it, and had the tools that let us do many of the common tasks we had already established in the SSIS code base.  For us these tasks were around basic ETL around the internal products, such as data matching, consistency in data hygiene, etc...  So, a core group internally went ahead and created this library of tools built around a product called DataStreams (www.csvreader.com).  This was a great product, and our internal utilities (things to take file formats, create mappings, and eventually database driven mappings) was a great boon to the company.  We had consistency in our products, the ability to write complex row by row logic, complete freedom of managing data, and it really moved the companies product based forward rapidly.  Things that took SSIS a long time, or simply caused SSIS to run out of space, could now be batched, managed, and loaded into a system without many of the problems of SSIS.

Eventually I left that company, and went on to other endeavors, I even had the opportunity to teach people how to use SSIS, use SQL, and other technologies around ETL.  Now I have begun to work in a new technology, Informatica, and I finally feel that I have the ability to make a few comments about it.

Informatica is like going back to DTS, but without the flexibility of DTS, and wrapped up in licencing that Microsoft products do not have.  It takes a long time to develop, every expert brought in has different ideas on how things should be done, and configuration can be inconsistent.  Even when we have 2 different experts come in, to do essentially the same task, it takes them 3 times longer to get it done, and tested, in Informatica because of the infrastructure issues, configuration issues, and sometimes third party source components having issues.

It is the least stable platform I have seen, the only benefit is that it at least has good error handling from a readability perspective than say SSIS.  When I work with it, I feel transformed back to 2005 using DTS, but without being as straight forward because all of the sources and destinations need to be managed.

In SSIS, when you begin to work with it heavily, you start normalizing around the idea that each package is self contained with its own definitions, and own management around it.  Yes, it has some options to try to make things universal, but in the end you don't use shared data sources, you just use shared configurations, and everyone implicitly agrees upon a standard or things don't get released.

In Informatica, it claims to allow you to share sources among projects, targets among projects etc... but when you deploy, or someone changes these underlying elements, it causes incredible headache.   This, of course, could be because of growing pains in the process, but for a product that has been out so long, it should have some kind of best practices around it that should mitigate these issues, and I have yet to see those surface.

I think the best environment I have had, where everything was tracked, with good configurations, still has to be when we were using C# and a small custom library that matched the needs of the business. I feel Informatica, so far, has been a bit of a step back in its interface, though I feel its underlying technology is solid (it has to be for the cost and how common it is).

I do however like the fact that Informatica is a mature product, in the sense that it does have a lot of options available, I just wish the UI, and developing for it, wasn't quite as tough as it is.

No comments:

Post a Comment