Thursday, January 17, 2013

Filter speeds in SSIS

I was reading an SSIS post by Bill Fellows, from the World Of Whatever blog, and I came across an interesting problem dealing with data filtering in SSIS.  Basically, he was running speed tests on different data-filtering methods, using FINDSTRING expressions, cached lookups, and script methods to do a set of lookups and send the rows down different paths in a data flow.

I noticed that in his script implementation he was using a List object to do the lookups.  So I sent a question off to him and asked whether he had tried using sorted dictionaries instead.  He said that he hadn't, but that he would be happy to see what I found if I did.

So I downloaded the SSIS package from his example and implemented it in my own environment.  The edit I made to his package was to copy the list implementation, rename it slightly, and update its script.  My sorted-dictionary lookup builds the dictionary from the lists given in Bill's example, and then uses an enum to determine which path each row should go down.

Download My Example

My script looks like this:

using System;
using System.Collections.Generic;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    // Possible allegiances; Unknown is the fallback for any state code
    // that does not appear in one of the three lists.
    private enum wheretheystood
    {
        Confederate
        , Union
        , Undeclared
        , Unknown
    }

    private SortedDictionary<int, wheretheystood> LookupList = new SortedDictionary<int, wheretheystood>();

    public override void PreExecute()
    {
        base.PreExecute();

        // Load the three lists from the package variables into a single
        // sorted dictionary, keyed by state code.
        foreach (int i in this.Variables.ConfederateList as List<int>)
        {
            LookupList.Add(i, wheretheystood.Confederate);
        }

        foreach (int i in this.Variables.UnionList as List<int>)
        {
            LookupList.Add(i, wheretheystood.Union);
        }

        foreach (int i in this.Variables.UndeclaredList as List<int>)
        {
            LookupList.Add(i, wheretheystood.Undeclared);
        }
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // One dictionary lookup replaces scanning three lists; anything
        // not found falls through to Unknown.
        wheretheystood lookup;
        if (!LookupList.TryGetValue(Row.StateCode, out lookup))
        {
            lookup = wheretheystood.Unknown;
        }

        switch (lookup)
        {
            case wheretheystood.Confederate:
                ConfederatesBuffer.AddRow();
                ConfederatesBuffer.StateCode = Row.StateCode;
                break;
            case wheretheystood.Union:
                UnionBuffer.AddRow();
                UnionBuffer.StateCode = Row.StateCode;
                break;
            case wheretheystood.Undeclared:
                UndeclaredBuffer.AddRow();
                UndeclaredBuffer.StateCode = Row.StateCode;
                break;
            default:
                UnknownBuffer.AddRow();
                UnknownBuffer.StateCode = Row.StateCode;
                break;
        }
    }
}


As you can see, I take his original implementation of each list, use it to fill the sorted dictionary, and then a quick lookup plus a switch statement determines where each row goes in the data flow.

The results I got back surprised me, because even the timings for the originally implemented tasks did not match those from his blog.  That may simply come down to hardware differences between the two test machines, though.

I used the following script to pull the results of each run from the sysssislog table.


DECLARE @PackageSourceID VARCHAR(255) 

SET @PackageSourceID = 'C446ED62-E79C-4AC4-97D6-45E39AEDC2B8' 

DECLARE @listsize INT 

SET @listsize = 60000000; 

-- All log rows for executions of this package
WITH cte_startend 
     AS (SELECT * 
         FROM   sysssislog 
         WHERE  executionid IN (SELECT executionid 
                                FROM   sysssislog 
                                WHERE  sourceid = @PackageSourceID)), 
     preexecute 
     AS (SELECT * 
         FROM   cte_startend a 
         WHERE  event = 'OnPreExecute'), 
     postexecute 
     AS (SELECT * 
         FROM   cte_startend a 
         WHERE  event = 'OnPostExecute'), 
     -- Pair each task's OnPreExecute with its OnPostExecute to get elapsed ms
     executetime 
     AS (SELECT Datediff (ms, pe.starttime, poe.endtime) AS executetime, 
                pe.id, 
                pe.source, 
                pe.starttime, 
                pe.endtime, 
                pe.sourceid, 
                pe.executionid 
         FROM   preexecute pe 
                JOIN postexecute poe 
                  ON pe.sourceid = poe.sourceid 
                     AND pe.executionid = poe.executionid 
         WHERE  pe.[source] IN ( 'Conditional Split list', 
                                 'List implementation', 
                                 'Script task Sorted Dictionaries', 
                                 'Smarter Script', 
                                 'Use lookups' )) 
SELECT executionid, 
       @listsize / executetime AS rowsperms, 
       * 
FROM   executetime et 
ORDER  BY et.executionid DESC, 
          @listsize / executetime DESC 


The results after running it with 60,000,000 records are below; rowsperms is just @listsize divided by the execution time in milliseconds, so for the top row that is 60,000,000 / 226,000 ms ≈ 265 rows per millisecond:

rowsperms (rows/ms) | executetime (ms) | id    | source                          | starttime | endtime | sourceid
265                 | 226000           | 22339 | Script task Sorted Dictionaries | 06:50.0   | 06:50.0 | A417A6BB-BF51-4161-9AC4-FA77F633EC6B
259                 | 231000           | 22337 | List implementation             | 02:59.0   | 02:59.0 | 051FE1E0-7065-4A2E-91AC-B35256B5BAFA
235                 | 255000           | 22345 | Use lookups                     | 15:03.0   | 15:03.0 | F18AEC4A-8108-4866-B3CF-2A1AD9872F5D
224                 | 267000           | 22341 | Smarter Script                  | 10:36.0   | 10:36.0 | BE989566-5921-4A25-9292-8255B5DEF57F
213                 | 281000           | 22335 | Conditional Split list          | 58:18.0   | 58:18.0 | 2EFB80C5-05CA-4198-990D-C0635B0F4AA0

Overall, the results show that the sorted dictionary is fastest, with the list implementation coming in second.  I don't know whether that ordering comes down to a specific difference in the hardware being used, but it was consistent over multiple runs.
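
If you want to see the raw lookup difference outside of SSIS, here is a minimal stand-alone sketch; the key count and probe count are numbers I made up for illustration, and none of it comes from Bill's package.  It times a List<int>.Contains scan against a SortedDictionary TryGetValue lookup:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class LookupBenchmark
{
    static void Main()
    {
        // Hypothetical setup: 50 keys standing in for state codes.
        var list = new List<int>();
        var dict = new SortedDictionary<int, int>();
        for (int i = 0; i < 50; i++)
        {
            list.Add(i);
            dict.Add(i, i);
        }

        // 10 million probes, roughly half of which will miss.
        const int probes = 10000000;
        var rng = new Random(42);
        var probeValues = new int[probes];
        for (int i = 0; i < probes; i++)
            probeValues[i] = rng.Next(100);

        var sw = Stopwatch.StartNew();
        int hits = 0;
        for (int i = 0; i < probes; i++)
            if (list.Contains(probeValues[i])) hits++;           // O(n) scan per probe
        sw.Stop();
        Console.WriteLine("List.Contains:    {0} ms ({1} hits)", sw.ElapsedMilliseconds, hits);

        sw.Restart();
        hits = 0;
        int value;
        for (int i = 0; i < probes; i++)
            if (dict.TryGetValue(probeValues[i], out value)) hits++; // O(log n) tree lookup
        sw.Stop();
        Console.WriteLine("SortedDictionary: {0} ms ({1} hits)", sw.ElapsedMilliseconds, hits);
    }
}

For what it's worth, if the ordering of keys doesn't matter, a hash-based Dictionary<int, int> should beat both, since its average lookup cost is O(1) against the sorted dictionary's O(log n) tree walk.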

Wednesday, January 2, 2013

Just a quickie.

Let's say that someone provided you with a database, but applied the wrong methodology for dates.  In this case, they set the data type of every column that needed to capture a date to datetime, and then decided that they really wanted those columns to be the date data type instead.

So, here is some code to query the schema and generate ALTER statements that convert all datetime columns to date columns.



SELECT DISTINCT 'ALTER TABLE [' + sch.name + '].[' + so.name 
                + '] ALTER COLUMN [' + sc.name + '] date' 
FROM   sys.objects so 
       INNER JOIN sys.schemas sch 
               ON so.schema_id = sch.schema_id 
       INNER JOIN sys.columns sc 
               ON so.object_id = sc.object_id 
WHERE  so.type = 'U'              -- user tables only 
       AND sc.system_type_id = 61 -- 61 = datetime 
       AND sc.user_type_id = 61
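
Each row that comes back is a statement along the lines of ALTER TABLE [dbo].[SomeTable] ALTER COLUMN [SomeDate] date (the table and column names there are just placeholders, not from any real schema), ready to copy into a query window and run.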

You could use the above code, if you wanted, with whatever type you like, and change the code that concatenates the DDL as well.
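
If you would rather not hard-code the type id at all, here is one sketch of doing the same thing by type name; @OldType and @NewType are variable names I made up for illustration, and the extra join to sys.types is what lets you filter on the name instead of the id:

DECLARE @OldType SYSNAME = 'datetime';
DECLARE @NewType SYSNAME = 'date';

-- Sketch: generate the same ALTER statements, but pick the source type
-- by name via sys.types instead of hard-coding system_type_id = 61.
SELECT DISTINCT 'ALTER TABLE [' + sch.name + '].[' + so.name 
                + '] ALTER COLUMN [' + sc.name + '] ' + @NewType 
FROM   sys.objects so 
       INNER JOIN sys.schemas sch 
               ON so.schema_id = sch.schema_id 
       INNER JOIN sys.columns sc 
               ON so.object_id = sc.object_id 
       INNER JOIN sys.types st 
               ON sc.user_type_id = st.user_type_id 
WHERE  so.type = 'U' 
       AND st.name = @OldType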