Wednesday, July 29, 2015

SSIS vs Informatica

I have been thinking a lot about Informatica vs SSIS lately, and I have finally settled on some opinions about where each tool is better.

If Informatica is better than SSIS in one major way, it is debugging. In SSIS, debugging a problem is relatively difficult when you need to get down into the guts of a data flow transformation, but this is something that is relatively easy in Informatica.

SSIS, though, has some advantages of its own, notably configuration and logging. SSIS can store package configurations at the database level and apply them relatively dynamically. The fact that SSIS maintains each package as its own distinct "program" also helps. Informatica has some issues when it comes to logging, especially with a job that runs multiple times but does not error consistently: if you don't configure it correctly, you lose the log for the instance that errored. SSIS, by contrast, logs to the database per run and isolates its logging accordingly.

Informatica also does a much better job of handling large amounts of data in its built-in components. For example, when you push 10 million or so rows through a lookup transformation, a properly configured Informatica mapping will outperform SSIS hands down.

SSIS beats it in one major way, though: you can leverage SQL Server much more efficiently, especially when pulling data from stored procedures. Informatica just doesn't have the same SQL Server integration, which means you can't keep the full power of SQL Server at your fingertips without some hokey hacks to get around how Informatica is built.
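As a rough sketch of what that integration looks like, an SSIS OLE DB source can use a stored procedure call directly as its SQL command (dbo.GetCustomerOrders and its parameter are hypothetical names for illustration); on SQL Server 2012 and later, WITH RESULT SETS declares the output shape so SSIS can resolve the data flow metadata:

-- Hypothetical SQL command for an SSIS OLE DB Source component.
-- WITH RESULT SETS declares the proc's output schema up front,
-- which lets SSIS build the data flow metadata at design time.
EXEC dbo.GetCustomerOrders @AsOfDate = '2015-07-01'
WITH RESULT SETS
(
    (
        CustomerID int,
        OrderDate  date,
        OrderTotal money
    )
);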

SSIS and Informatica both have built-in scripting abilities, but SSIS far outperforms here because it lets you use C# and import C# libraries. I know there are those who believe Java is powerful, but there is a lot you can do with the .NET Framework libraries, which are easy to install and generally standard, and that gives SSIS a big advantage. Both Informatica and SSIS have issues when it comes to debugging these scripting languages.

Just some thoughts as I sit down. I have had the opportunity lately to go back and do some difficult SSIS tasks while doing some particularly hard tasks in Informatica along the same lines, and I find both have interesting ups and downs when it comes to development.

Monday, February 23, 2015

Data Dictionary

I found a great product that produces some nice data dictionaries. There is SQL code provided, as well as a document creation tool.

http://www.csvreader.com/posts/data_dictionary.php

I like it quite a bit. It makes it easy to produce a data dictionary for users to view on request, and it gives you the ability to manage descriptions at the database level, so generating documentation on an ongoing basis is easy to do.
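For context, "descriptions at the database level" in SQL Server generally means extended properties; here is a minimal sketch (the table and column names are hypothetical):

-- Attach a description to a column as an extended property.
EXEC sys.sp_addextendedproperty
    @name = N'MS_Description',
    @value = N'Customer surname as entered at signup.',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'Customer',
    @level2type = N'COLUMN', @level2name = N'LastName';

-- Read the descriptions back when generating the data dictionary.
SELECT t.name AS TableName,
       c.name AS ColumnName,
       ep.value AS ColumnDescription
FROM sys.extended_properties ep
JOIN sys.tables t ON ep.major_id = t.object_id
JOIN sys.columns c ON ep.major_id = c.object_id AND ep.minor_id = c.column_id
WHERE ep.class = 1 AND ep.name = 'MS_Description';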

Thursday, December 11, 2014

Heads up for ETL parsing C# library

I just wanted to give a heads up about the product from http://www.csvreader.com/. I have worked with the creator of this product, and I have used the DataStreams product he sells quite a bit; it is hands down the best C# ETL product I have ever worked with. It simplifies all of the "hard" stuff you have to deal with when managing data in C#, from parsing data files to creating correctly formatted outputs.

Since it implements parsing as an IDataReader, it makes importing data into databases dead easy, and it allows for record-by-record manipulation as the data flies by as well.


Wednesday, September 10, 2014

Inserting the results of a stored procedure into a temp table.

I recently had a co-worker who was unable to duplicate the results of a well-known, and assumed working, stored procedure, and who wanted to know if I could help them capture the data it returned.  The problem?  The stored procedure had nearly 100 columns, and many of the discussions on the internet make it very difficult to figure out the best way to get that information.  There are a LOT of methods out there.  One uses OPENROWSET, but that has configuration issues in many environments (see the sketch below).  Others use fantastic methods revolving around XML parsing, or even dropping into C#.  These are OK for their needs, but if you have access to the stored procedure, here is one other method that is MUCH quicker, as long as you are using SQL Server Management Studio.
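For reference, the OPENROWSET approach usually looks something like this sketch (using the example proc from later in this post); it requires the "Ad Hoc Distributed Queries" server option to be enabled, which is exactly the configuration hurdle many environments run into:

-- Requires: sp_configure 'Ad Hoc Distributed Queries', 1; RECONFIGURE;
SELECT *
INTO #procresults
FROM OPENROWSET(
    'SQLNCLI',
    'Server=(local);Trusted_Connection=yes;',
    'EXEC dbo.myexampleproc');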

The method I use is this:

1.  Copy the code for the stored procedure; do this in a test environment where you have full access.
2.  Change the name of the stored proc (creating a new one) by adding a unique word to the end; I name the procs I do this with <procname>_CREATETEMPTABLE.
3.  In the new procedure, find the result set you want to capture, and instead of returning the select statement from the proc, add an insert into a table dbo.<insertclevernamehere>, then run the new proc once.
4.  Open SQL Server Management Studio, find the table dbo.<insertclevernamehere>, right-click it, and get the schema by going to Script Table As->Create->New Query Window.
5.  Copy the column definitions into a create table script for a temp table.
6.  Run the original stored procedure by doing:

Insert into #mytemptable
(
...columns
)
exec <yourstoredprocedure>

And it should work every time.

So, how does this work in real life?

Let's say I have a stored procedure; call it:

dbo.MyExampleProc

And let's say it returns 2 columns, but we will treat it as if it had 100.

The proc is defined as:

create proc dbo.myexampleproc
as
SELECT
    cast(1 as varchar(10)) as col1,
    cast(2 as int) as col2

When we run the proc, it does not return the schema, and in our case we can't figure out what that schema SHOULD be without a lot of manual work or looking at documentation that no longer exists.

The next step is to make the data go into a temporary, but accessible, database table.

So we change the proc and do:

create proc dbo.myexampleproc_CREATETEMPTABLE
as
SELECT
    cast(1 as varchar(10)) as col1,
    cast(2 as int) as col2
into dbo.myexampleproc_CREATETEMPTABLE_OUTPUT
GO
exec dbo.myexampleproc_CREATETEMPTABLE -- run once to create and fill the output table

Now that the data is in the table dbo.myexampleproc_CREATETEMPTABLE_OUTPUT, we then script the table out using Script Table As->Create->New Query Window.

This provides the data definition:

CREATE TABLE [dbo].[myexampleproc_CREATETEMPTABLE_OUTPUT](
       [col1] [varchar](10) NULL,
       [col2] [int] NULL
) ON [PRIMARY]

Then I delete the table myexampleproc_CREATETEMPTABLE_OUTPUT and the stored procedure myexampleproc_CREATETEMPTABLE.
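In script form, that cleanup is just:

-- Remove the scaffolding once the schema has been captured.
drop table dbo.myexampleproc_CREATETEMPTABLE_OUTPUT
drop proc dbo.myexampleproc_CREATETEMPTABLE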

Then we can stick the definition somewhere and build a useful script around it to look at the data, like the one below, and we are done.



USE ETLExamples
/* generate schema step, create a new stored proc exactly like the old one, execute it but make the result set go into a dbo.<temptablename>

For example, I created a temp one with _CREATETEMPTABLE, and in the final result set I had it go to a table dbo.<temptablename>

I ran it with the parameters (example below) that are the same parameters I want to capture.

I then deleted the stored proc, then used SQL Server Management Studio to grab the schema of the table (right-click on the table, script)

Then I took the table column definitions and inserted them into the create table script for the temp table below.

Then when you execute the

INSERT INTO <temptable>
EXEC <stored proc name>

It just inserts the results of the proc into the temp table.


*/

--exec dbo.myexampleproc

IF object_id('tempdb..#myreallyfancytemptable') IS NOT NULL
BEGIN
   DROP TABLE #myreallyfancytemptable
END
-- Step 1: create the definition of the table that exactly matches the result set of your stored procedure

create table #myreallyfancytemptable
(
       [col1] [varchar](10) NULL,
       [col2] [int] NULL

)

-- insert into the temp table.

INSERT INTO #myreallyfancytemptable
EXEC [dbo].[myexampleproc]



select * from #myreallyfancytemptable

The results from the table are:

col1    col2
1       2





Wednesday, July 9, 2014

Are "Normal" humans excluded from programming?

http://developers-beta.slashdot.org/story/14/07/09/131243/normal-humans-effectively-excluded-from-developing-software/

I was reading the above article about how a programmer felt that "normal" humans were blocked from programming, and my initial reaction was "Of course they are!" After that initial flush of personal superiority, I thought of all the examples of how false this is.

The great majority of good programmers I know in the field are not fantastic savants; they are regular people doing a job.  The only distinction I draw in programming is the difference between a programmer and a good programmer.  Your average, run-of-the-mill programmer, who can do most tasks and research basic solutions, is someone who is either self-trained or has an associate's degree in software development.  The good programmer is one who enjoys it, loves discovering new problems, has a good memory for how things have been solved in the past, can see how to use that experience to solve new problems, and can also determine when a new approach SHOULD be used, even though an old one CAN be used.

An average programmer will approach a problem in the "I have a hammer, everything is a nail" fashion.  Let's take something like SSIS.  A lower-level SSIS developer, trying to figure out how to get data from one system to another and needing to do a lookup against a very large data set in the same database, might pull both sources in and use a join or lookup transform to get the required data.  A good programmer, with experience in both (and having found that the SSIS package chokes on such large sets), might take the next step and just use a select statement with a join against the data in the SQL Server database, as in the sketch below.
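A tiny sketch of the second approach (the table names are hypothetical): the lookup simply becomes a join in the source query, so SQL Server does the heavy lifting instead of the SSIS pipeline:

-- Instead of a lookup transform, resolve the lookup inside the source query.
SELECT
    s.OrderID,
    s.CustomerKey,
    c.CustomerName   -- the "looked up" value
FROM dbo.StagingOrders s
JOIN dbo.DimCustomer c
    ON c.CustomerKey = s.CustomerKey;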

It seems obvious that both solutions exist, but in the real world you will see hundreds of examples of people repeating a non-optimal but obvious-to-them solution over and over, simply because it works, not because it is good or easy for anyone else to understand.

In reality everything has a good, better, best solution set, but no one is ever going to hit the "best" solution for a non-trivial problem, especially the first time.  What you pay for in a "good" programmer, versus your regular programmer, is the experience to get close to the better solution, and sometimes even land between the better and best categories.  Your software will be more reliable, your hardware less strained, and your coding time much shorter, and it will be worth every penny; but in the end you rarely have real access to truly good programmers with enough experience in all of your technologies to really shine in this field.

There are so many factors in the idea of "good", "better", and "best" in programming that the kind of elitism toward the entire field shown in the article doesn't make sense.  Programming is a trade; most programming jobs are trade-style jobs with specific scopes of specialty.  You aren't going to take an ETL developer, toss him at a web development problem, and have as good an experience as using a dedicated web developer, even if the ETL developer is "the best ETL developer you have ever had!"

Programming is NOT rocket science.  You can be very competent at programming if you have the ability to visualize your problems and properly define them; at some point it just becomes syntax, plus the competence to know what needs to be created versus reused.

Kind of a ramble, but I just wanted to get my views on it out there on my little informal blog space.

Friday, April 25, 2014

SSIS failed with error code 0xC0209029

failed with error code 0xC0209029 

I am sure you have just come from a long line of internet searches to find this specific post. KUDOS to you, sir, though this post may not have a fix for your error either.

Recently I received the "failed with error code 0xC0209029" error from SSIS while inserting rows into an OLE DB destination. The issue occurred in a package that had never failed in production, but it had been failing in the development environment for several days, and no cause could be found.

After looking around the internet, the variety of things that can cause this are many, but I would like to add one more.

If you are trying to insert explicit identity key values into a destination table, and those key values already exist, it will produce this error.  Check whether your package inserts identity key values for the table, and if it does, verify that those values do not already exist in the destination.
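Here is a minimal T-SQL sketch of the underlying conflict (the table and values are hypothetical); when the OLE DB destination inserts explicit identity values (the "Keep identity" option), the same primary key violation can surface as 0xC0209029:

-- Destination table with an identity primary key.
create table dbo.DestTable
(
    ID   int identity(1,1) primary key,
    Name varchar(50) null
)

insert into dbo.DestTable (Name) values ('existing row')  -- gets ID = 1

-- What the package effectively does when it preserves identity values:
set identity_insert dbo.DestTable on

insert into dbo.DestTable (ID, Name)
values (1, 'incoming row')  -- fails: primary key value 1 already exists

set identity_insert dbo.DestTable off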

The situation that caused this error arose because someone else was doing development in the dev environment without giving anyone notice, and their keys were for entirely different data, so it took a while to hunt the issue down. Hopefully this gives you one more thing to look for.


Friday, January 31, 2014

SSIS Error Message "the value is too long to fit in the data area of the buffer"

I recently hit this issue while doing some custom script task coding for an SSIS package, and it took me a while to figure out.  I found that this specific error can happen in two situations.

1.  The file's input column definition is larger than the output column definition in the file connection manager.  This causes a failure in a data flow source component.

2.  The data being pushed through a script task is too large for its internal buffer.

For the first issue, this can be solved by making sure that the output size is the same as the input size.

For the second situation, let's describe what is causing the issue.  A script task has a finite number of bytes that its input buffer can hold.  I have no idea what that limit is, but it is rather large (though not as large as one would hope).  Because the buffer is finite, it matters how much data you try to push through it.

In the scenario where I hit the issue, I was trying to feed 10 NTEXT columns into the script task, manipulate them, and output nvarchar(4000) columns.  The input buffer was overflowing because NTEXT is relatively unbounded, and several of the 10 columns in my data source had many thousands of characters in them.

The trick to fixing this is to create multiple script tasks to do your manipulation.  Yes, this will look a bit odd, but it is the best way to approach the solution.

So, instead of using one easily maintained, monolithic script task, use 2 of them and do half the work in one and half the work in the other.  For me, this meant doing 5 columns in one task and 5 columns in the next task.