The Mobile Field Base: January 2014

I recently hit this issue while doing some custom script task coding for an SSIS package, and it took me a while to figure out. I found that this specific bug can happen from two situations.

1. The definition of a file has the input definition that is higher than the output definition in the file connection manager. This causes a failure in a Dataflow Source component.

2. The actual buffer inside of a script task is too large.

For the first issue, this can be solved by making sure that the output size is the same as the input size.

For the second part, lets describe what is causing the issue. When using a script task it has a finite amount of bytes that the input buffer can hold. I have no idea what this finite set is, but it is rather large (but not as large as one would hope). Because it has a finite buffer amount, it matters how much data you try to push through it.

In the scenario I hit the issue, I was trying to put 10 NTEXT columns into the script task, manipulate them, and output nvarchar(4000) size columns. The input buffer was getting overflowed because NTEXT is relatively unbounded, and the data source I was using had several of the 10 columns with many thousands of characters in them.

The trick to fixing this is to create multiple script tasks to do your manipulation. Yes, this will look a bit odd, but it is the best way to approach the solution.

So, instead of using one, easily maintained and monolithic script task, use 2 of them and do half the work in one, and half the work in the other. For me, this meant doing 5 columns in one task and 5 columns in the next task.

Over the years I have developed in a number of languages, in multiple environments, for the specific purpose of doing Extract, Transfer, and Load (ETL) operations. Lately I have been working with Informatica, C#, and SSIS in the same work place at the same time, and I have some comments on the progression of ETL technology from having worked in the field for 10 years.

When I first started doing ETL work, the main focus of my job was to get data into a system by first manually pulling data into an ACCESS database, doing manual data hygiene (manually "fixing" data), and then putting that data into either a file, or into a staging table in a database.

In this early world of ETL, I worked often with vb script, access, excel, and several other "manual" data manipulation technologies. I also got to work with microsofts old DTS project. DTS is a pre-cursor to the SSIS technologies used at many companies today, in fact many naming conventions in SSIS still use "Dts" in their naming.

When I was working in DTS, we had a lot of crazy standards in the company I was at, because of what developers were comfortable in. We worked in python and vb script within the dts packages. We would use this to manipulate data flows, alter column data, etc... in the end the dts packages were being used to get data from a file or other database, and move it to another data base. It had multiple command line options, execution options, and was generally useful all around, if a bit clunky in the interface.

Later developmental focus shifted to SSIS, and manipulating data inside of that environment, in this case because of evangelism for the product through several key developers in the company, and the need for the flexibility of using slightly more powerful scripting and manipulation abilities (expressions, conditional splits based on expressions, easier manageability of success conditions etc...).

SSIS was not "bad" but it certainly had its quirks, in fact years later I realized that much of my problems with it at the beginning is because I didn't automatically adjust for the quirks in the UI, and internally manage myself around them at the time. What I like about SSIS, and still like about it today, is the wizard functionality, the ability to easily and quickly prototype things. What I disliked (at least in 2005 SSIS) is that it had such poor management of its pipeline compared to DTS. When you changed something, you had to populate that change even when nothing of significance (IE the names of things) had changed, and it could be a massive chore. At that time we had yet to train ourselves NOT to try to do the most advanced features, such as creating custom components, and custom components never managed the meta data properly, in the end it was really good compared to DTS, but never really realized its full potential.

Later, in the same company, we moved to C#. This caused a slew of headaches internally, but I had the argument with the C# evangelist internally that C# would be a great ETL tool, as long as we could quickly create code using it, and had the tools that let us do many of the common tasks we had already established in the SSIS code base. For us these tasks were around basic ETL around the internal products, such as data matching, consistency in data hygiene, etc... So, a core group internally went ahead and created this library of tools built around a product called DataStreams (www.csvreader.com). This was a great product, and our internal utilities (things to take file formats, create mappings, and eventually database driven mappings) was a great boon to the company. We had consistency in our products, the ability to write complex row by row logic, complete freedom of managing data, and it really moved the companies product based forward rapidly. Things that took SSIS a long time, or simply caused SSIS to run out of space, could now be batched, managed, and loaded into a system without many of the problems of SSIS.

Eventually I left that company, and went on to other endeavors, I even had the opportunity to teach people how to use SSIS, use SQL, and other technologies around ETL. Now I have begun to work in a new technology, Informatica, and I finally feel that I have the ability to make a few comments about it.

Informatica is like going back to DTS, but without the flexibility of DTS, and wrapped up in licencing that Microsoft products do not have. It takes a long time to develop, every expert brought in has different ideas on how things should be done, and configuration can be inconsistent. Even when we have 2 different experts come in, to do essentially the same task, it takes them 3 times longer to get it done, and tested, in Informatica because of the infrastructure issues, configuration issues, and sometimes third party source components having issues.

It is the least stable platform I have seen, the only benefit is that it at least has good error handling from a readability perspective than say SSIS. When I work with it, I feel transformed back to 2005 using DTS, but without being as straight forward because all of the sources and destinations need to be managed.

In SSIS, when you begin to work with it heavily, you start normalizing around the idea that each package is self contained with its own definitions, and own management around it. Yes, it has some options to try to make things universal, but in the end you don't use shared data sources, you just use shared configurations, and everyone implicitly agrees upon a standard or things don't get released.

In Informatica, it claims to allow you to share sources among projects, targets among projects etc... but when you deploy, or someone changes these underlying elements, it causes incredible headache. This, of course, could be because of growing pains in the process, but for a product that has been out so long, it should have some kind of best practices around it that should mitigate these issues, and I have yet to see those surface.

I think the best environment I have had, where everything was tracked, with good configurations, still has to be when we were using C# and a small custom library that matched the needs of the business. I feel Informatica, so far, has been a bit of a step back in its interface, though I feel its underlying technology is solid (it has to be for the cost and how common it is).

I do however like the fact that Informatica is a mature product, in the sense that it does have a lot of options available, I just wish the UI, and developing for it, wasn't quite as tough as it is.

The Mobile Field Base

Friday, January 31, 2014

SSIS Error Message "the value is too long to fit in the data area of the buffer"

Wednesday, January 15, 2014

CTRL+R does not work SQL Server Management Studio 2012

Wednesday, January 8, 2014

Comment on developing in different technologies.