Thursday, December 11, 2014

Heads up for ETL parsing C# library

I just wanted to give a heads up about the product from http://www.csvreader.com/.  I have worked with the creator of this product, and I have worked quite a bit with the DataStreams product he sells, and it is hands down the best C# ETL product I have ever worked with.  It simplifies all of the "hard" parts of managing data in C#, from parsing data files to creating correctly formatted outputs.

Since it implements the parsing as an IDataReader, it makes importing data into databases dead easy, and it also allows for record-by-record manipulation as the data flies by.


Wednesday, September 10, 2014

Inserting the results of a stored procedure into a temp table.

I recently had a co-worker who was unable to duplicate the results of a well known, and assumed working, stored procedure, and wanted to find out if I could help them get the data out of it.  The problem?  The stored procedure returned nearly 100 columns, and many of the discussions on the internet make it very difficult to figure out the best way to get that information into a table.  There are a LOT of methods out there.  One uses OPENROWSET, but that has configuration issues in many environments.  Others use fantastic methods revolving around XML parsing, or even dropping into C#.  These are fine for their needs, but if you have access to the stored procedure itself, here is one other method that is MUCH quicker, as long as you are using SQL Server Management Studio.

The method I use is as follows:

1.  Copy the code for the stored procedure; do this in a test environment where you have full access.
2.  Change the name of the stored proc (creating a new one) by adding a unique word to the end.  I call the procs I do this with <procname>_CREATETEMPTABLE.
3.  Open the procedure, find the result set that you want the results from, and instead of returning that select statement from the proc, have it select into a table dbo.<insertclevernamehere>.
4.  In SQL Server Management Studio, find the table dbo.<insertclevernamehere>, right click it, and get the schema by going to Script Table As->Create->New Query Window.
5.  Copy the column definitions into a create table script for a temp table.
6.  Run the stored procedure by doing

INSERT INTO #mytemptable
(
...columns
)
EXEC <yourstoredprocedure>

And it should work every time.

So, how does this work in real life?

Let's say I have a stored procedure; call it:

dbo.MyExampleProc

And let's say it returns 2 columns, but we will treat it as if it had 100.

The proc is defined as

CREATE PROC dbo.myexampleproc
AS
SELECT
    CAST(1 AS varchar(10)) AS col1,
    CAST(2 AS int) AS col2

When we run the proc, it does not return the schema, and in our case we can't figure out what that schema SHOULD be without a lot of manual work or digging through documentation that no longer exists.

The next step is to make the data go into a temporary, but accessible, database table.

So we change the proc as follows:

CREATE PROC dbo.myexampleproc_CREATETEMPTABLE
AS
SELECT
    CAST(1 AS varchar(10)) AS col1,
    CAST(2 AS int) AS col2
INTO dbo.myexampleproc_CREATETEMPTABLE_OUTPUT
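
The SELECT ... INTO does not create dbo.myexampleproc_CREATETEMPTABLE_OUTPUT until the procedure actually runs, so execute the new proc once (with the same parameters you want to capture, if it takes any):

EXEC dbo.myexampleproc_CREATETEMPTABLE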

Now that the data is in the table dbo.myexampleproc_CREATETEMPTABLE_OUTPUT, we script the table out using Script Table As->Create->New Query Window.

This provides the data definition:

CREATE TABLE [dbo].[myexampleproc_CREATETEMPTABLE_OUTPUT](
       [col1] [varchar](10) NULL,
       [col2] [int] NULL
) ON [PRIMARY]

Then I delete the table myexampleproc_CREATETEMPTABLE_OUTPUT and the stored procedure myexampleproc_CREATETEMPTABLE.
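
That cleanup is just a couple of DROP statements, something like:

DROP TABLE dbo.myexampleproc_CREATETEMPTABLE_OUTPUT
DROP PROCEDURE dbo.myexampleproc_CREATETEMPTABLE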

Then we can stick this somewhere and have a useful script to look at the data with, for example the one below, and we are done.



USE ETLExamples
/* Generate schema step: create a new stored proc exactly like the old one and execute it, but make the result set go into a table dbo.<temptablename>.

For example, I created a temp proc with the _CREATETEMPTABLE suffix, and in the final result set I had it go to a table dbo.<temptablename>.

I ran it with the same parameters (example below) that I want to capture.

I then deleted the stored proc, then used SQL Server Management Studio to grab the schema of the table (right click on the table, script it).

Then I took the table column definitions and inserted them into the create table script for the temp table below.

Then when you execute

INSERT INTO <temptable>
EXEC <stored proc name>

it just inserts the results of the proc into the temp table.
*/

--exec dbo.myexampleproc

IF object_id('tempdb..#myreallyfancytemptable') IS NOT NULL
BEGIN
   DROP TABLE #myreallyfancytemptable
END
-- Step 1: create the definition of the table so that it exactly matches the output of your stored procedure

create table #myreallyfancytemptable
(
       [col1] [varchar](10) NULL,
       [col2] [int] NULL

)

-- insert into the temp table.

INSERT INTO #myreallyfancytemptable
EXEC [dbo].[myexampleproc]



SELECT * FROM #myreallyfancytemptable

The results from the table are:

col1        col2
1              2





Wednesday, July 9, 2014

Are "Normal" humans excluded from programming?

http://developers-beta.slashdot.org/story/14/07/09/131243/normal-humans-effectively-excluded-from-developing-software/

I was reading the above article about how a programmer felt that "normal" humans are blocked from programming, and my first thought was "Of course they are!"  After that initial feeling of personal superiority, I realized how many examples there are of why this is false.

The great majority of good programmers that I know in the field are not fantastic savants; they are regular people doing a job.  The only caveat I have is the difference between a programmer and a good programmer.  Your average, run-of-the-mill programmer, who can do most tasks and do research to find basic solutions, is someone who is either self trained or has an associate's degree in software development.  The good programmer is one who enjoys it, loves discovering new problems, has a good memory for how things have been solved in the past, can see how to use that experience to solve new problems, and can also determine when a new approach SHOULD be used, even though an old one CAN be used.

An average programmer will approach a problem with an "I have a hammer, everything is a nail" solution.  Let's take something like SSIS.  A lower level SSIS developer, trying to figure out how to get data from one system to another, and needing to do a lookup against a very large data set in the same database, might pull both sources in and use a join or lookup transform to get the required data.  A good programmer, with experience in both (and having found that the SSIS package chokes on such large sets), might take the next step and just use a select statement with a join against the data in the SQL Server database.
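
To make that concrete, here is a minimal sketch of the second approach, pushing the lookup into the source query itself rather than streaming both tables through SSIS; the table and column names are hypothetical:

-- hypothetical staging and lookup tables, stand-ins for your own schema
SELECT
    s.OrderID,
    s.CustomerCode,
    c.CustomerName  -- the "lookup" value comes back with each row
FROM dbo.StagingOrders AS s
INNER JOIN dbo.Customers AS c
    ON c.CustomerCode = s.CustomerCode

Used as the query behind an OLE DB source, this lets the database engine do the join instead of pulling both data sets into the package for a lookup transform.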

It seems obvious that both solutions exist, but in the real world you will see hundreds of examples of people using a non-optimal solution, obvious to them, repeated over and over simply because it works, not because it is good or easy for anyone else to understand.

In reality everything has a good, better, best solution set, but no one is ever going to hit the "best" solution for non-trivial problems, especially the first time.  What you pay for in a "good" programmer, versus your regular programmer, is the experience to get "close" to the better solution, and sometimes even to land between the better and best categories.  Your software will be more reliable, your hardware less strained, and your coding time much shorter, and it will be worth every penny; the catch is that most shops never get real access to truly good programmers with enough experience in all of the technologies they use to really shine in this field.

There are so many factors in the idea of "good", "better", "best" in programming that the kind of elitism toward the entire field shown in the article doesn't make sense.  Programming is a trade; most programming jobs are trade-style jobs with specific scopes of specialty.  You aren't going to take an ETL developer, toss him at a web development problem, and have as good an experience as using a dedicated web developer, even if the ETL developer is "the best ETL developer you have ever had!"

Programming is NOT rocket science.  You can be very competent at programming if you have the ability to visualize your problems and properly define them; at some point it just comes down to syntax, and the judgment to know what needs to be created versus re-used.

Kind of a ramble, but I just wanted to get my views on it out there on my little informal blog space.

Friday, April 25, 2014

SSIS failed with error code 0xC0209029

failed with error code 0xC0209029 

I am sure you have just come from a long line of internet searches to land on this specific post.  KUDOS to you, and fair warning: this post may not have the fix for your error either.

Recently I received the "failed with error code 0xC0209029" error from SSIS while inserting rows into an OLE DB destination.  The issue occurred in a package that had never failed in production, but it had been failing in the development environment for several days and no cause could be found.

Looking around the internet, there are many different things that can cause this, but I would like to add one more.

If you are trying to insert identity key values into a destination table, and those key values already exist, it will produce this error.  Check whether identity key values are being inserted for your table, and if they are, whether those values already exist in the destination.
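
A quick way to check is to join the incoming key values against the destination and see if any already exist.  A minimal sketch, using hypothetical staging and destination table names:

-- hypothetical table and column names; substitute your staging source and OLE DB destination
SELECT s.ID
FROM dbo.StagingTable AS s
INNER JOIN dbo.DestinationTable AS d
    ON d.ID = s.ID

If that returns rows, the package is trying to insert identity values that are already in the destination table.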

The situation that caused this error for me was that someone else was doing development in the dev environment without giving anyone notice, and the keys were for entirely different data, so it took a while to hunt the issue down.  Hopefully this gives you one more place to look.


Friday, January 31, 2014

SSIS Error Message "the value is too long to fit in the data area of the buffer"

I recently hit this issue while doing some custom script task coding for an SSIS package, and it took me a while to figure out.  I found that this specific error can come from two situations.

1.  The file's input column definition is larger than the output definition in the file connection manager.  This causes a failure in a Data Flow source component.

2.  The data being pushed through a script task's input buffer is too large for it.

The first issue can be solved by making sure that the output size is the same as the input size.

For the second situation, let's describe what is causing the issue.  A script task has a finite number of bytes that its input buffer can hold.  I have no idea what this limit is, but it is rather large (though not as large as one would hope).  Because the buffer is finite, it matters how much data you try to push through it.

In the scenario where I hit the issue, I was trying to put 10 NTEXT columns into the script task, manipulate them, and output nvarchar(4000) columns.  The input buffer was overflowing because NTEXT is effectively unbounded, and in the data source I was using, several of the 10 columns contained many thousands of characters.

The trick to fixing this is to create multiple script tasks to do your manipulation.  Yes, this will look a bit odd, but it is the best way to approach the solution.

So, instead of using one easily maintained, monolithic script task, use two of them and do half the work in one and half in the other.  For me, this meant handling 5 columns in one task and 5 columns in the next.



Wednesday, January 15, 2014

CTRL+R does not work in SQL Server Management Studio 2012

I found this solution here, and want to bump it up in Google's search results:

http://social.msdn.microsoft.com/Forums/sqlserver/en-US/abc6d428-fe6c-4321-b177-341eee54d807/hide-results-pane?forum=sqlgetstarted 


The answer: in SSMS, do the following (this worked for me):

Go to Tools -> Options -> Keyboard and reset the keyboard scheme to Default; that fixed the problem.

Wednesday, January 8, 2014

Comment on developing in different technologies.

Over the years I have developed in a number of languages, in multiple environments, for the specific purpose of doing Extract, Transform, and Load (ETL) operations.  Lately I have been working with Informatica, C#, and SSIS in the same workplace at the same time, and having worked in the field for 10 years, I have some comments on the progression of ETL technology.

When I first started doing ETL work, the main focus of my job was to get data into a system by first manually pulling it into an Access database, doing manual data hygiene (manually "fixing" data), and then putting that data into either a file or a staging table in a database.

In this early world of ETL, I worked often with VBScript, Access, Excel, and several other "manual" data manipulation technologies.  I also got to work with Microsoft's old DTS product.  DTS is a precursor to the SSIS technology used at many companies today; in fact, many naming conventions in SSIS still use "Dts" in their names.

When I was working in DTS, we had a lot of crazy standards at the company I was at, driven by what the developers were comfortable with.  We worked in Python and VBScript within the DTS packages, using them to manipulate data flows, alter column data, and so on.  In the end the DTS packages were being used to get data from a file or another database and move it into a destination database.  DTS had multiple command line options and execution options, and was generally useful all around, if a bit clunky in the interface.

Later, development focus shifted to SSIS and manipulating data inside that environment, in this case because of evangelism for the product by several key developers in the company, and the need for the flexibility of slightly more powerful scripting and manipulation abilities (expressions, conditional splits based on expressions, easier management of success conditions, etc.).

SSIS was not "bad" but it certainly had its quirks, in fact years later I realized that much of my problems with it at the beginning is because I didn't automatically adjust for the quirks in the UI, and internally manage myself around them at the time.  What I like about SSIS, and still like about it today, is the wizard functionality, the ability to easily and quickly prototype things.  What I disliked (at least in 2005 SSIS) is that it had such poor management of its pipeline compared to DTS.  When you changed something, you had to populate that change even when nothing of significance (IE the names of things) had changed, and it could be a massive chore.  At that time we had yet to train ourselves NOT to try to do the most advanced features, such as creating custom components, and custom components never managed the meta data properly, in the end it was really good compared to DTS, but never really realized its full potential.

Later, at the same company, we moved to C#.  This caused a slew of headaches internally, but I argued with our internal C# evangelist that C# would only be a great ETL tool as long as we could quickly create code with it and had tools that let us do many of the common tasks we had already established in the SSIS code base.  For us these tasks were the basic ETL around the internal products, such as data matching, consistency in data hygiene, and so on.  So a core group internally went ahead and created a library of tools built around a product called DataStreams (www.csvreader.com).  This was a great product, and our internal utilities (things to take file formats, create mappings, and eventually database driven mappings) were a great boon to the company.  We had consistency in our products, the ability to write complex row-by-row logic, and complete freedom in managing data, and it really moved the company's product base forward rapidly.  Things that took SSIS a long time, or simply caused SSIS to run out of space, could now be batched, managed, and loaded into a system without many of the problems of SSIS.

Eventually I left that company and went on to other endeavors; I even had the opportunity to teach people how to use SSIS, SQL, and other technologies around ETL.  Now I have begun to work with a new technology, Informatica, and I finally feel that I am able to make a few comments about it.

Informatica is like going back to DTS, but without the flexibility of DTS, and wrapped up in licensing that Microsoft products do not have.  It takes a long time to develop in, every expert brought in has different ideas on how things should be done, and configuration can be inconsistent.  Even when we have two different experts come in to do essentially the same task, it takes them three times longer to get it done and tested in Informatica because of infrastructure issues, configuration issues, and sometimes issues with third-party source components.

It is the least stable platform I have seen; the only benefit is that it at least has better error handling, from a readability perspective, than, say, SSIS.  When I work with it, I feel transported back to 2005 using DTS, but it is not as straightforward, because all of the sources and destinations need to be managed.

When you begin to work with SSIS heavily, you start normalizing around the idea that each package is self-contained, with its own definitions and its own management.  Yes, it has some options to try to make things universal, but in the end you don't use shared data sources; you just use shared configurations, and everyone implicitly agrees upon a standard or things don't get released.

Informatica claims to allow you to share sources among projects, targets among projects, etc., but when you deploy, or someone changes these underlying elements, it causes incredible headaches.  This, of course, could be growing pains in our process, but for a product that has been out so long, there should be some kind of best practices around it that mitigate these issues, and I have yet to see those surface.

I think the best environment I have had, where everything was tracked and well configured, still has to be when we were using C# and a small custom library that matched the needs of the business.  I feel Informatica, so far, has been a bit of a step back in its interface, though its underlying technology is solid (it has to be, given its cost and how common it is).

I do, however, like the fact that Informatica is a mature product, in the sense that it has a lot of options available; I just wish the UI, and developing for it, weren't quite as tough as they are.