I noticed that in his implementation of the script, that he was using a list object during his script tasks to do the lookups. So I sent a question off to him and asked if he had tried using sorted dictionaries instead. He said that he hadn't, but that he would be happy to see what I could find if I used sorted dictionaries.
So I downloaded the SSIS package from his example, and implemented in my own environment. The edit I made to his package was to copy the list implementation, do a little bit of a naming change, and update his list implementation script. My implementation of the sorted dictionary lookup involved setting up the data dictionary from the lists given in Bill's example, and then using an ENUM to determine which path each row should go down.
Download My Example
My script looks like this:
using System;
using
System.Collections.Generic;
using
System.Data;
using
Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using
Microsoft.SqlServer.Dts.Runtime.Wrapper;
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain :
UserComponent
{
private enum wheretheystood
{
Confederate
, Union
, Undeclared
, Unknown
}
private List<int>
Confederates;
private List<int>
Union;
private List<int>
Undeclared;
private SortedDictionary<int,
wheretheystood> LookupList = new SortedDictionary<int,wheretheystood>();
public override void
PreExecute()
{
base.PreExecute();
//this.Confederates
= this.Variables.ConfederateList as List<int>;
//this.Union
= this.Variables.UnionList as List<int>;
//this.Undeclared
= this.Variables.UndeclaredList as List<int>;
foreach
(int i in this.Variables.ConfederateList as List<int>)
{
LookupList.Add(i, wheretheystood.Confederate);
}
foreach (int i in this.Variables.UnionList
as List<int>)
{
LookupList.Add(i, wheretheystood.Union);
}
foreach
(int i in this.Variables.UndeclaredList as List<int>)
{
LookupList.Add(i, wheretheystood.Undeclared);
}
}
public override void
Input0_ProcessInputRow(Input0Buffer Row)
{
wheretheystood
lookup = wheretheystood.Unknown;
if
(!LookupList.TryGetValue(Row.StateCode,out
lookup))
{
lookup = wheretheystood.Unknown;
}
switch
(lookup)
{
case wheretheystood.Confederate:
ConfederatesBuffer.AddRow();
ConfederatesBuffer.StateCode =
Row.StateCode;
break;
case
wheretheystood.Union:
UnionBuffer.AddRow();
UnionBuffer.StateCode =
Row.StateCode;
break;
case
wheretheystood.Undeclared:
UndeclaredBuffer.AddRow();
UndeclaredBuffer.StateCode =
Row.StateCode;
break;
case
wheretheystood.Unknown:
UnknownBuffer.AddRow();
UnknownBuffer.SateCode =
Row.StateCode;
break;
default:
UnknownBuffer.AddRow();
UnknownBuffer.SateCode =
Row.StateCode;
break;
}
}
}
As you can see, I use his original implementation of each list and use it to fill up the sorted dictionary, and then I simply do a quick lookup and switch statement to determine where the rows go in the data flow.
The results I got back surprised me, because my results for even the originally implemented tasks did not match those from his blog. This may be because of hardware differences between the two machines being tested on though.
I used the following script to get into the results for each run that I did from the sysdtslog table.
DECLARE @PackageSourceID VARCHAR(255)
SET @PackageSourceID = 'C446ED62-E79C-4AC4-97D6-45E39AEDC2B8'
DECLARE @listsize INT
SET @listsize = 60000000;
WITH cte_startend
AS (SELECT *
FROM sysssislog
WHERE executionid IN (SELECT executionid
FROM sysssislog
WHERE sourceid = @PackageSourceID)
--where executionid = @ExecutionID
),
preexecute
AS (SELECT *
FROM cte_startend a
WHERE event = 'OnPreExecute'),
postexecute
AS (SELECT *
FROM cte_startend a
WHERE event = 'OnPostExecute'),
executetime
AS (SELECT Datediff (ms, pe.starttime, poe.endtime) AS executetime,
pe.id,
pe.source,
pe.starttime,
pe.endtime,
pe.sourceid,
pe.executionid
FROM preexecute pe
JOIN postexecute poe
ON pe.sourceid = poe.sourceid
AND pe.executionid = poe.executionid
WHERE pe.[source] IN ( 'Conditional Split list', 'List implementation'
,
'Script task Sorted Dictionaries',
'Smarter Script',
'Use lookups' ))
--@listsize
SELECT executionid,
@listsize / executetime AS rowsperms,
*
FROM executetime et
ORDER BY et.executionid DESC,
@listsize / executetime DESC
SET @PackageSourceID = 'C446ED62-E79C-4AC4-97D6-45E39AEDC2B8'
DECLARE @listsize INT
SET @listsize = 60000000;
WITH cte_startend
AS (SELECT *
FROM sysssislog
WHERE executionid IN (SELECT executionid
FROM sysssislog
WHERE sourceid = @PackageSourceID)
--where executionid = @ExecutionID
),
preexecute
AS (SELECT *
FROM cte_startend a
WHERE event = 'OnPreExecute'),
postexecute
AS (SELECT *
FROM cte_startend a
WHERE event = 'OnPostExecute'),
executetime
AS (SELECT Datediff (ms, pe.starttime, poe.endtime) AS executetime,
pe.id,
pe.source,
pe.starttime,
pe.endtime,
pe.sourceid,
pe.executionid
FROM preexecute pe
JOIN postexecute poe
ON pe.sourceid = poe.sourceid
AND pe.executionid = poe.executionid
WHERE pe.[source] IN ( 'Conditional Split list', 'List implementation'
,
'Script task Sorted Dictionaries',
'Smarter Script',
'Use lookups' ))
--@listsize
SELECT executionid,
@listsize / executetime AS rowsperms,
*
FROM executetime et
ORDER BY et.executionid DESC,
@listsize / executetime DESC
The results I have after running it with 60000000 records is:
rowsperms
|
executetime
|
id
|
source
|
starttime
|
endtime
|
sourceid
|
265
|
226000
|
22339
|
Script task Sorted Dictionaries
|
06:50.0
|
06:50.0
|
A417A6BB-BF51-4161-9AC4-FA77F633EC6B
|
259
|
231000
|
22337
|
List implementation
|
02:59.0
|
02:59.0
|
051FE1E0-7065-4A2E-91AC-B35256B5BAFA
|
235
|
255000
|
22345
|
Use lookups
|
15:03.0
|
15:03.0
|
F18AEC4A-8108-4866-B3CF-2A1AD9872F5D
|
224
|
267000
|
22341
|
Smarter Script
|
10:36.0
|
10:36.0
|
BE989566-5921-4A25-9292-8255B5DEF57F
|
213
|
281000
|
22335
|
Conditional Split list
|
58:18.0
|
58:18.0
|
2EFB80C5-05CA-4198-990D-C0635B0F4AA0
|
The overall results show me that the sorted dictionary is faster, but it also showed that the list implementation itself came in second. I don't know if this is because of a specific difference in hardware being used, but the results were consistent over multiple runs.