Friday, December 13, 2013

Configuring a package using dtsconfig configuration files

Had someone ask me how to configure a package using .dtsConfig files, since they had only ever configured packages using sysdtsconfig.
Open the package.
Right-click the control flow design surface and choose “Package Configurations…”
Click on “Add”
Click “Next”

Make sure “XML configuration file” is selected in the configuration type drop down.
Then select “Browse”
Then browse out to the dtsconfig file

Once the file is selected, the wizard should show the XML configuration file path you chose.

You will generally choose “Reuse existing” unless you intend to change the configurations already stored in the file.
If you are just trying to read a configuration from the file, choose “Reuse existing” so the file is not changed in any way.

If you want to add a configuration, but not change an old configuration, you still want “Reuse existing”.

If you want to overwrite existing configurations, as well as change or add configurations, then use “Overwrite”.
Now give your configuration a useful name, hopefully one that follows a standard within your company.
Now that you have added the configuration:
Select it, hit Edit, and verify that it configures the elements you care about.
In this example, I am configuring the connection string of the ETLExamples object.
The file looks like:
<?xml version="1.0"?>
<DTSConfiguration>
   <DTSConfigurationHeading>
       <DTSConfigurationFileInfo GeneratedBy="TheMythTheManTheLegend" GeneratedFromPackageName="ExampleUsingDTSConfigurationFiles" GeneratedFromPackageID="{CD0DB029-8B6D-4994-B8C5-D0DE51A611A4}" GeneratedDate="12/13/2013 2:54:52 PM"/>
   </DTSConfigurationHeading>
   <Configuration ConfiguredType="Property" Path="\Package.Connections[ETLExamples].Properties[ConnectionString]" ValueType="String">
       <ConfiguredValue>Data Source=.;Initial Catalog=ETLExamples;Provider=SQLNCLI10.1;Integrated Security=SSPI;Application Name=SSIS-ExampleSimpleCSVExport-{B3A912D4-3037-43F8-AE78-511D48B8BC26}LocalHost.ETLExamples;Auto Translate=False;</ConfiguredValue>
   </Configuration>
</DTSConfiguration>

As you can see, the “Path” contains the exact name of the connection being configured (ETLExamples) and the property (ConnectionString). Because those names match the package, when I hit Edit, the edit screen shows that property selected for configuration.
If it does not show this, it means the name in your package is not the same.  If you are trying to share a common configuration file, then you must give your package items (connections, or whatever else you want to configure) the same names as the items referenced in the file.
Calling your connection “MyETLExample” and expecting it to be configured, or expecting to easily match or select configurations from the file, simply will not work in this case.
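If you ever want to sanity-check which elements a .dtsConfig file actually touches without opening the designer, you can simply read the XML. Here is a minimal sketch in plain C# (the file path is only an example) that lists each Path and ConfiguredValue, which is exactly the name-matching information described above:

using System;
using System.Xml.Linq;

class DtsConfigInspector
{
    static void Main()
    {
        // Example path only; point this at your own .dtsConfig file.
        var doc = XDocument.Load(@"C:\Configs\ExampleUsingDTSConfigurationFiles.dtsConfig");

        foreach (var cfg in doc.Descendants("Configuration"))
        {
            // Path names the exact package object and property being configured,
            // so these names must match the names inside your package.
            Console.WriteLine("Path:  " + (string)cfg.Attribute("Path"));
            Console.WriteLine("Value: " + (string)cfg.Element("ConfiguredValue"));
            Console.WriteLine();
        }
    }
}

Running this against the file above prints the ETLExamples connection string path, so you can see at a glance which connections and properties need matching names in the package.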


SSIS: Writing to log files, but maintaining a single header.

So I had a small issue where I wanted to create a log file from the on-error path of a precedence constraint.  For example, while dynamically pulling from a list of sites, one of the sites does not match the definition of a "valid" site and the data flow component errors.  I don't want to fail the entire package, and I only want to write to the log file IF there is an error with a site; otherwise, no log file should be created at all.

Well, I got this set up with multiple steps writing to the error file, and I set the options so that each write appends to the file rather than overwriting it.

So I get a file that looks like:

Header
Record
Header
Record
Header
Record

But I only want to write the header one time, instead of multiple times, and I have no way of controlling which steps of the SSIS package are going to throw the errors.  The users may have already "fixed" the site I was trying to fix.

So, after a bit of thinking, the solution I came up with seems to work pretty well.

I created a package-level variable to track whether or not I have written an error, called User::HasWrittenError, and set it initially to False.  In the step that writes the errors out, I put a script transform in the data flow and added HasWrittenError as a read/write variable on the script component, so I could set it in its post-execute.



In the script component itself, in the PostExecute method, I did the following:

public override void PostExecute()
{
    base.PostExecute();

    // Flag that at least one error row has been written to the log file,
    // so later writes can skip the header.
    Variables.HasWrittenError = true;
}

So now that the script sets HasWrittenError, I added it to each of the steps that log to the error file.

In the flat file connection manager for the output, I added an expression to set the ColumnNamesInFirstDataRow property to @[User::HasWrittenError] != TRUE.



What this allows is that on the first pass through the logging step, when HasWrittenError is false, the file is created and the header is written; on every later pass, HasWrittenError is true, and no header is written to the file.
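For what it's worth, the same "write the header once, then append" idea is easy to picture outside of SSIS. Here is a minimal sketch in plain C# (the log path and columns are made up for illustration) that mirrors what the HasWrittenError variable and the ColumnNamesInFirstDataRow expression accomplish:

using System;
using System.IO;

class ErrorLogWriter
{
    // Mirrors User::HasWrittenError: has anything been logged yet this run?
    private static bool hasWrittenError = false;

    // Example path only.
    private const string LogPath = @"C:\Logs\SiteErrors.csv";

    public static void WriteError(string siteName, string message)
    {
        // Append so earlier records are kept, just like the "do not overwrite" option.
        using (var writer = new StreamWriter(LogPath, append: true))
        {
            if (!hasWrittenError)
            {
                // First error of the run: write the header exactly once.
                writer.WriteLine("SiteName,ErrorMessage");
                hasWrittenError = true;
            }

            writer.WriteLine(siteName + "," + message);
        }
    }
}

Calling WriteError several times in one run yields a single header followed by one record per error, which is the file layout I was after in the package.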





Monday, September 9, 2013

Informatica development

So I have been working in Informatica recently, and I haven't really had much to post about, as my main specialty is SSIS/SQL/C# ETL, but it is coming along.  I am learning a lot, and I may have some time to get tips and tricks out for things I find tricky in Informatica.  When I have time I think I will write a post about how to pull a single record from a group of records based on sorting, as this is one of the things I think is not well explained in Informatica and is much easier using SQL.  I also have a complete SSIS examples library to post at some point, which was the original point of this whole blog, actually, but real-life programming concerns have somewhat derailed that ambition.  I also have examples of how you can do quick ETL in C# using the SQL bulk loader, which I hope to get up on the site as well.  So just a small update that yes, in fact, I am still alive and kicking and still thinking of things to post here.

Wednesday, July 17, 2013

How to parse US addresses without CASS certification.

One of the more interesting problems to get hit with in coding, especially for physical mailing, is how to parse physical addresses and return a standardized set of addresses for matching records.

This is a nontrivial and difficult problem.  The short answer is to purchase software, or have an address service provider consume your raw address data and send you back a file with the address data split out the way you need it, and go from there.  In fact, if your data has to be mailed out for any reason, or comes from a low-integrity source such as a free-form text field on a web site, then this is the only way to do it.  Several sites, such as listcleanup.com, or marketing agencies such as KBM Group, can do this for you for a price.  Full stop, you can quit reading right now.

BUT, if you have data from a high-integrity source that is generally well formatted (such as from a vendor or an internal file source), and you want to spend the least amount of time on the addresses and simply use the address parts for identification... then you can use a bit of address-parsing software.  Initially I was going to describe such a process, but instead I was able to find a pretty darn good implementation for US addresses already on the internet, and figured I would post it.  Sure, it doesn't come from my own hand, but the method I was going to use was a generalized one, and this has had considerably more thought put into it and seems far more accurate.

A nice little open source C# implementation available here:

http://usaddress.codeplex.com/
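For a rough sense of what my generalized method would have looked like (this is just my own illustration, not the API of the library above), the idea is to split a well-formatted, single-line street address with a handful of patterns. A real parser needs the full USPS suffix list, unit and PO Box handling, and many more rules than this:

using System;
using System.Text.RegularExpressions;

class SimpleStreetAddressParser
{
    // Tiny suffix list purely for illustration; a real parser needs the full USPS suffix table.
    private static readonly Regex StreetPattern = new Regex(
        @"^(?<number>\d+)\s+(?<street>.+?)\s+(?<suffix>Ave|Avenue|St|Street|Rd|Road|Blvd|Dr|Drive|Ln|Lane|Ct|Court)\.?$",
        RegexOptions.IgnoreCase);

    static void Main()
    {
        string input = "123 W Main St";
        Match m = StreetPattern.Match(input);

        if (m.Success)
        {
            Console.WriteLine("Number: " + m.Groups["number"].Value);  // 123
            Console.WriteLine("Street: " + m.Groups["street"].Value);  // W Main
            Console.WriteLine("Suffix: " + m.Groups["suffix"].Value);  // St
        }
        else
        {
            // Anything that does not match goes to a human, or to a real service.
            Console.WriteLine("Could not parse: " + input);
        }
    }
}

That is roughly as far as a home-grown parser gets before the edge cases pile up, which is why the library above (or a CASS-certified service) is the better answer for anything beyond simple identification.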


Thursday, June 27, 2013

Pivoting multiple columns with ordered names and attributes in SQL Server 2008 using the PIVOT function.

A friend of mine came across an issue where they needed to take data from a table that had a label and a value, and pivot it out into a single row, and they were unsure how to accomplish the goal.  So I created this little example to show them what needed to be done.

In the data they have a table of labeled attributes combined with a sort order.  The sort order is the priority of any individual attribute, the attributes have labels, and for any foreign key there is a value associated with each specific label.

Key, AttributeName, AttributeValue, SortOrder

An example would be:

FKeyValue   SortOrder   ItemName   ItemValue
1           2           Item2      2
1           3           Item3      3
1           4           Item4      4
1           5           Item5      5
2           1           Item1      1
2           2           Item2      2
2           3           Item3      3
2           14          Item14     14
2           31          Item31     31
2           51          Item51     51

And they want the data to pivot to:


fkeyvalue   att1name   att1value   att2name   att2value   att3name   att3value   att4name   att4value   att5name   att5value
1           Item2      2           Item3      3           Item4      4           Item5      5           NULL       NULL
2           Item1      1           Item2      2           Item3      3           Item14     14          Item31     31

So how do you approach this problem?  Because SQL's PIVOT doesn't let you pivot multiple value columns at once, you have to approach the problem in parts.  My solution is to pivot the data twice around the foreign key, and to use a row number assigned at the beginning of the operation to control the column names and the order in which the columns are output.

The idea is that you first build a result set that has the data in the order you want; this gives you the attribute number and aligns the data the way you want it to appear in the columns.  After this initial sort step the data will look like:

FKeyValue   SortOrder   ItemName   ItemValue   row_num
1           2           Item2      2           1
1           3           Item3      3           2
1           4           Item4      4           3
1           5           Item5      5           4
2           1           Item1      1           1
2           2           Item2      2           2
2           3           Item3      3           3
2           14          Item14     14          4
2           31          Item31     31          5
2           51          Item51     51          6

Then you do two pivots: one pivot generates the name attribute columns, and the second pivot creates the value attribute columns.  Then you join the pivoted results back together on the foreign key and select the columns in the order you want them.

So the first pivoted table data would look like:

FKeyValue   1        2        3        4        5
1           Item2    Item3    Item4    Item5    NULL
2           Item1    Item2    Item3    Item14   Item31

 And the second pivoted table will look like:


FKeyValue   1    2    3    4     5
1           2    3    4    5     NULL
2           1    2    3    14    31

And after you join both tables together, and do a bit of column renaming you get:


fkeyvalue   att1name   att1value   att2name   att2value   att3name   att3value   att4name   att4value   att5name   att5value
1           Item2      2           Item3      3           Item4      4           Item5      5           NULL       NULL
2           Item1      1           Item2      2           Item3      3           Item14     14          Item31     31

Below is a complete example: one version done with temp tables, and one done with a CTE that also does a little column renaming, so you can see how to build the output column names from the row number in the data.


IF Object_id('tempdb..#tmp_multipivotexample') IS NOT NULL
  BEGIN
      DROP TABLE #tmp_multipivotexample
  END

/* First, create some test data */
CREATE TABLE #tmp_multipivotexample
  (
     fkeyvalue INT NOT NULL,
     sortorder INT,
     itemname  VARCHAR(255),
     itemvalue VARCHAR(255)
  )

INSERT INTO #tmp_multipivotexample
            (fkeyvalue,
             sortorder,
             itemname,
             itemvalue)
SELECT 1       AS FKeyValue,
       2       AS SortOrder,
       'Item2' AS ItemName,
       '2'     AS ItemValue
UNION
SELECT 1       AS FKeyValue,
       4       AS SortOrder,
       'Item4' AS ItemName,
       '4'     AS ItemValue
UNION
SELECT 1       AS FKeyValue,
       5       AS SortOrder,
       'Item5' AS ItemName,
       '5'     AS ItemValue
UNION
SELECT 1       AS FKeyValue,
       3       AS SortOrder,
       'Item3' AS ItemName,
       '3'     AS ItemValue
UNION
SELECT 2       AS FKeyValue,
       1       AS SortOrder,
       'Item1' AS ItemName,
       '1'     AS ItemValue
UNION
SELECT 2       AS FKeyValue,
       3       AS SortOrder,
       'Item3' AS ItemName,
       '3'     AS ItemValue
UNION
SELECT 2        AS FKeyValue,
       14       AS SortOrder,
       'Item14' AS ItemName,
       '14'     AS ItemValue
UNION
SELECT 2        AS FKeyValue,
       51       AS SortOrder,
       'Item51' AS ItemName,
       '51'     AS ItemValue
UNION
SELECT 2        AS FKeyValue,
       31       AS SortOrder,
       'Item31' AS ItemName,
       '31'     AS ItemValue
UNION
SELECT 2       AS FKeyValue,
       2       AS SortOrder,
       'Item2' AS ItemName,
       '2'     AS ItemValue


--select * from #tmp_multipivotexample
/* Second: now that we have data, let's grab it and assign row numbers for each key */
IF Object_id('tempdb..#InitialSortStep') IS NOT NULL
  BEGIN
      DROP TABLE #initialsortstep
  END

SELECT *,
       Row_number()
         OVER(
           partition BY fkeyvalue
           ORDER BY sortorder ASC) row_num
INTO   #initialsortstep
FROM   #tmp_multipivotexample

--select * from #InitialSortStep

/* Now that we have the initial sort step, we can see it sorts the data and properly assigns attribute numbers,
so let's do the first pivot */


IF Object_id('tempdb..#PivotedItemNameByFKeyValue') IS NOT NULL
  BEGIN
      DROP TABLE #pivoteditemnamebyfkeyvalue
  END

SELECT *
INTO   #pivoteditemnamebyfkeyvalue
FROM   (SELECT fkeyvalue,
               itemname,
               row_num
        FROM   #initialsortstep) AS InitPivot
       PIVOT ( Max(itemname)
             FOR row_num IN ([1],
                             [2],
                             [3],
                             [4],
                             [5]) ) AS pivoted

--select * from #pivoteditemnamebyfkeyvalue

/*  Now to produce the properly sorted item value data */

IF Object_id('tempdb..#PivotedItemValuesByFKeyValue') IS NOT NULL
  BEGIN
      DROP TABLE #pivoteditemvaluesbyfkeyvalue
  END

SELECT *
INTO   #pivoteditemvaluesbyfkeyvalue
FROM   (SELECT fkeyvalue,
               itemvalue,
               row_num
        FROM   #initialsortstep) AS InitPivot
       PIVOT ( Max(itemvalue)
             FOR row_num IN ([1],
                             [2],
                             [3],
                             [4],
                             [5]) ) AS pivoted

--select * from #pivoteditemvaluesbyfkeyvalue

/* now to bring all of it together */

SELECT pname.fkeyvalue,
       pname.[1] AS att1name,
       pval.[1]  AS att1value,
       pname.[2] AS att2name,
       pval.[2]  AS att2value,
       pname.[3] AS att3name,
       pval.[3]  AS att3value,
       pname.[4] AS att4name,
       pval.[4]  AS att4value,
       pname.[5] AS att5name,
       pval.[5]  AS att5value
FROM   #pivoteditemnamebyfkeyvalue pname
       JOIN #pivoteditemvaluesbyfkeyvalue pval
         ON pname.fkeyvalue = pval.fkeyvalue
/* Now the same thing written as a CTE. In this version I use the row number to build the column names as well,
and re-designed the pivot; I think this makes it more readable. */
;

WITH cte_initialsortstep
     AS (SELECT *,
                Row_number()
                  OVER(
                    partition BY fkeyvalue
                    ORDER BY sortorder ASC) row_num
         FROM   #tmp_multipivotexample),
     cte_pivoteditemnamebyfkeyvalue
     AS (SELECT *
         FROM   (SELECT fkeyvalue,
                        itemname,
                        'att' + Cast(row_num AS VARCHAR(10)) + 'name' AS Name
                 FROM   cte_initialsortstep) AS InitPivot
                PIVOT ( Max(itemname)
                      FOR name IN ([att1name],
                                   [att2name],
                                   [att3name],
                                   [att4name],
                                   [att5name]) ) AS pivoted),
     cte_pivoteditemvaluesbyfkeyvalue
     AS (SELECT *
         FROM   (SELECT fkeyvalue,
                        itemvalue,
                        'att' + Cast(row_num AS VARCHAR(10))
                        + 'value' AS Value
                 FROM   cte_initialsortstep) AS InitPivot
                PIVOT ( Max(itemvalue)
                      FOR value IN ([att1value],
                                    [att2value],
                                    [att3value],
                                    [att4value],
                                    [att5value]) ) AS pivoted)
SELECT pname.fkeyvalue,
       pname.att1name AS att1name,
       pval.att1value AS att1value,
       pname.att2name AS att2name,
       pval.att2value AS att2value,
       pname.att3name AS att3name,
       pval.att3value AS att3value,
       pname.att4name AS att4name,
       pval.att4value AS att4value,
       pname.att5name AS att5name,
       pval.att5value AS att5value
FROM   cte_pivoteditemnamebyfkeyvalue pname
       JOIN cte_pivoteditemvaluesbyfkeyvalue pval
         ON pname.fkeyvalue = pval.fkeyvalue