Thursday, January 17, 2013

Filter speeds in SSIS

I was reading an SSIS blog by Bill Fellows, from the World Of Whatever blog, and I came across an interesting problem dealing with SSIS data filtering.  Basically he was doing speed tests on different data filtering methods in SSIS.  His approach used find string, cache lookup, and script methods to do a set of lookups and send those lookups down different paths in a data flow.

I noticed that in his implementation of the script, that he was using a list object during his script tasks to do the lookups.  So I sent a question off to him and asked if he had tried using sorted dictionaries instead.  He said that he hadn't, but that he would be happy to see what I could find if I used sorted dictionaries.

So I downloaded the SSIS package from his example, and implemented in my own environment.  The edit I made to his package was to copy the list implementation, do a little bit of a naming change, and update his list implementation script.  My implementation of the sorted dictionary lookup involved setting up the data dictionary from the lists given in Bill's example, and then using an ENUM to determine which path each row should go down.

Download My Example

My script looks like this:

using System;
using System.Collections.Generic;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private enum wheretheystood
    {
        Confederate
        , Union
        , Undeclared
        , Unknown
    }

    private List<int> Confederates;
    private List<int> Union;
    private List<int> Undeclared;
    private SortedDictionary<int, wheretheystood> LookupList = new SortedDictionary<int,wheretheystood>();


    public override void PreExecute()
    {
        base.PreExecute();
        //this.Confederates = this.Variables.ConfederateList as List<int>;
        //this.Union = this.Variables.UnionList as List<int>;
        //this.Undeclared = this.Variables.UndeclaredList as List<int>;

        foreach (int i in this.Variables.ConfederateList as List<int>)
        {
            LookupList.Add(i, wheretheystood.Confederate);
        }

        foreach (int i in this.Variables.UnionList as List<int>)
        {
            LookupList.Add(i, wheretheystood.Union);
        }

        foreach (int i in this.Variables.UndeclaredList as List<int>)
        {
            LookupList.Add(i, wheretheystood.Undeclared);
        }
    }


    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {

        wheretheystood lookup = wheretheystood.Unknown;
        if (!LookupList.TryGetValue(Row.StateCode,out lookup))
        {
            lookup = wheretheystood.Unknown;
        }

        switch (lookup)
          {
                case wheretheystood.Confederate:
                ConfederatesBuffer.AddRow();           
                ConfederatesBuffer.StateCode = Row.StateCode;
                break;

            case wheretheystood.Union:
                UnionBuffer.AddRow();
                UnionBuffer.StateCode = Row.StateCode;
             break;
            case wheretheystood.Undeclared:
                UndeclaredBuffer.AddRow();
                UndeclaredBuffer.StateCode = Row.StateCode;
             break;
            case wheretheystood.Unknown:
                UnknownBuffer.AddRow();
                UnknownBuffer.SateCode = Row.StateCode;
                break;
            default:
                UnknownBuffer.AddRow();
                UnknownBuffer.SateCode = Row.StateCode;
                break;
          }


      

    }


}


As you can see, I use his original implementation of each list and use it to fill up the sorted dictionary, and then I simply do a quick lookup and switch statement to determine where the rows go in the data flow.

The results I got back surprised me, because my results for even the originally implemented tasks did not match those from his blog.  This may be because of hardware differences between the two machines being tested on though.

I used the following script to get into the results for each run that I did from the sysdtslog table.


DECLARE @PackageSourceID VARCHAR(255) 

SET @PackageSourceID = 'C446ED62-E79C-4AC4-97D6-45E39AEDC2B8' 

DECLARE @listsize INT 

SET @listsize = 60000000; 

WITH cte_startend 
     AS (SELECT * 
         FROM   sysssislog 
         WHERE  executionid IN (SELECT executionid 
                                FROM   sysssislog 
                                WHERE  sourceid = @PackageSourceID) 
        --where executionid = @ExecutionID 
        ), 
     preexecute 
     AS (SELECT * 
         FROM   cte_startend a 
         WHERE  event = 'OnPreExecute'), 
     postexecute 
     AS (SELECT * 
         FROM   cte_startend a 
         WHERE  event = 'OnPostExecute'), 
     executetime 
     AS (SELECT Datediff (ms, pe.starttime, poe.endtime) AS executetime, 
                pe.id, 
                pe.source, 
                pe.starttime, 
                pe.endtime, 
                pe.sourceid, 
                pe.executionid 
         FROM   preexecute pe 
                JOIN postexecute poe 
                  ON pe.sourceid = poe.sourceid 
                     AND pe.executionid = poe.executionid 
         WHERE  pe.[source] IN ( 'Conditional Split list', 'List implementation' 
                                 , 
                                 'Script task Sorted Dictionaries', 
                                         'Smarter Script', 
                                                        'Use lookups' )) 
--@listsize 
SELECT executionid, 
       @listsize / executetime AS rowsperms, 
       * 
FROM   executetime et 
ORDER  BY et.executionid DESC, 
          @listsize / executetime DESC 


The results I have after running it with 60000000 records is:

rowsperms
executetime
id
source
starttime
endtime
sourceid
265
226000
22339
Script task Sorted Dictionaries
06:50.0
06:50.0
A417A6BB-BF51-4161-9AC4-FA77F633EC6B
259
231000
22337
List implementation
02:59.0
02:59.0
051FE1E0-7065-4A2E-91AC-B35256B5BAFA
235
255000
22345
Use lookups
15:03.0
15:03.0
F18AEC4A-8108-4866-B3CF-2A1AD9872F5D
224
267000
22341
Smarter Script
10:36.0
10:36.0
BE989566-5921-4A25-9292-8255B5DEF57F
213
281000
22335
Conditional Split list
58:18.0
58:18.0
2EFB80C5-05CA-4198-990D-C0635B0F4AA0

The overall results show me that the sorted dictionary is faster, but it also showed that the list implementation itself came in second.  I don't know if this is because of a specific difference in hardware being used, but the results were consistent over multiple runs.

No comments:

Post a Comment