Wednesday, March 9, 2016

SQL Tutorial: Capturing changes using update statements using OUTPUT clauses

In the coding world we often get a fun request.  Capture changes to a table into a log table during bulk ETL updates.  When doing this in SQL the examples I have seen of people doing this can often be clunky, slow, and inefficient.  Recently I had to tackle one of these problems and the example code that I was given had multiple select statements comparing the old data and the new data directly from the source and the destination using complicated joins per column which it then inserted. 

This looked incredibly complicated, and required that any transformations that were done in the SQL be re-applied during the data comparison.  It was not a good method to approach the issue, and was very slow because for each column it had to do a join against old data.

Example:

Old Data vs New Data

Select
       mycolum,
       mynewcolumndata,
from old as current
join new as fromsource
 on key = key

Then they inserted all of this into a log table, and then did the update the table.

This is very slow as every single column required its own insert and update statement.  We don't want to do this.  So what I have done is crafted some code using the OUTPUT clause in SQL Server to give the ability to capture the direct differences between two tables.

MSDN OUTPUT CLAUSE: https://msdn.microsoft.com/en-us/library/ms177564.aspx

Here is the example that I created.
/*
Developer: Christopher Harsch
Definition: This is an example of how to do change data capture using unpivot.

This uses a sample table, and updates a record and shows the before and after results 
of the query that could then be later inserted into a per column logging table.

*/
IF Object_id('tempdb..#tmp_OUTPUT_CHG_DATA_CAPTURE_EXAMPLE') IS NOT NULL
  BEGIN
      DROP TABLE #tmp_output_chg_data_capture_example
  END

  -- Create a table to put some sample data
CREATE TABLE #tmp_output_chg_data_capture_example
  (
     id                   INT IDENTITY(1, 1)
     , firstname          VARCHAR(255)
     , lastname           VARCHAR(255)
     , numberofanimals    INT
     , awesomeindicator   CHAR(1)
     , awesomeind_type_cd VARCHAR(255)
  )

  -- Insert some sample data
INSERT INTO #tmp_output_chg_data_capture_example
            (firstname
             , lastname
             , numberofanimals
             , awesomeindicator
             , awesomeind_type_cd)
SELECT 'Chris'        AS FirstName
       , 'Harsch'     AS LastName
       , '0'          AS NUmberOfAnimals
       , 'N'          AS AwesomeIndicator
       , 'NotAwesome' AS AwesomeInd_Type_CD
UNION
SELECT 'Buzz'     AS FirstName
       , 'LightYear'   AS LastName
       , '-1'      AS NUmberOfAnimals
       , 'Y'       AS AwesomeIndicator
       , 'Awesome' AS AwesomeInd_Type_CD
UNION
SELECT 'Woody'    AS FirstName
       , 'AndysToy' AS LastName
       , '-1'       AS NUmberOfAnimals
       , 'Y'        AS AwesomeIndicator
       , 'Awesome'  AS AwesomeInd_Type_CD

IF Object_id('tempdb..#tmp_chng_capture') IS NOT NULL
  BEGIN
      DROP TABLE #tmp_chng_capture
  END

  -- See what is in the sample data.
select * from #tmp_OUTPUT_CHG_DATA_CAPTURE_EXAMPLE


-- Add the columns I want to show are being changed
-- These are the ones you are bringing in, don't capture things changing if you are  
-- not changing them.
-- Note, for this part I am turning things into a static data type as my method of 
-- before
-- and after comparisons require all the columns to have the same data type, 
-- varchar(size) is a good method for this
-- in this case I set the size to 255
CREATE TABLE #tmp_chng_capture
  (
     id                       INT NOT NULL
     , numberofanimals        VARCHAR(255)
     , numberofanimals_old    VARCHAR(255)
     , awesomeindicator       VARCHAR(255)
     , awesomeindicator_old   VARCHAR(255)
     , awesomeind_type_cd     VARCHAR(255)
     , awesomeind_type_cd_old VARCHAR(255)
  )

  -- Run the update, and use the OUTPUT clause to get the inserted/updated values
UPDATE tocdce
SET    tocdce.numberofanimals = '0'
       , awesomeindicator = 'Y'
       , awesomeind_type_cd = 'Awesome'
output inserted.id
       , inserted.numberofanimals
       , deleted.numberofanimals
       , inserted.awesomeindicator
       , deleted.awesomeindicator
       , inserted.awesomeind_type_cd
       , deleted.awesomeind_type_cd
INTO #tmp_chng_capture ( id, numberofanimals, numberofanimals_old,
awesomeindicator, awesomeindicator_old, awesomeind_type_cd,
awesomeind_type_cd_old )
FROM   #tmp_output_chg_data_capture_example TOCDCE
WHERE  lastname = 'Harsch'

-- Make sure we know woody has a dog (slinky dog)
UPDATE tocdce
SET    tocdce.numberofanimals = '1'
output inserted.id
       , inserted.numberofanimals
       , deleted.numberofanimals
       , inserted.awesomeindicator
       , deleted.awesomeindicator
       , inserted.awesomeind_type_cd
       , deleted.awesomeind_type_cd
INTO #tmp_chng_capture ( id, numberofanimals, numberofanimals_old,
awesomeindicator, awesomeindicator_old, awesomeind_type_cd,
awesomeind_type_cd_old )
FROM   #tmp_output_chg_data_capture_example TOCDCE
WHERE  lastname = 'AndysToy'

-- lets look at the output of the data changes
SELECT *
FROM   #tmp_chng_capture

-- now we want to do the unpivot magic. This unpivot is doing a double unpivot.
-- This uses a CTE but could easily have used selects into multiple temp tables for 
-- each step
-- Step 1: Get only the data that we care about from the table, this should only hold 
-- columns that we care
-- about unpivoting, everything else is extraneous.
-- Step 2: unpivot the new columns and capture the column names and the ID numbers 
-- for those columns.
-- Step 3: unpivot the old columns and capture the column names and the id numbers 
-- for those columns.
-- You will notice that in the old column set, I made the naming convention useful so 
-- that the columns
-- are just the name of the "new" columns, but with the name "_old" applied.
-- Step 4: Do a join on ID and column name, and remove the name "Old" from the old 
-- column names so that
-- it joins to the old columns.  This also does a comparison to see if the columns  
-- are actually different
-- and it also checks for the NULL condition, IE two null columns are the same, but 
-- will not be equivalent
-- if you do a straight comparison, so this does an AND NOT (BOTH NULL) clause.
; WITH cte_values_to_unpivot
     AS (SELECT id
                , numberofanimals
                , numberofanimals_old
                , awesomeindicator
                , awesomeindicator_old
                , awesomeind_type_cd
                , awesomeind_type_cd_old
         FROM   #tmp_chng_capture),
     cte_unpivoted_new
     AS (SELECT id
                , [columnvalue] AS [New_Value]
                , columnname    AS columnname
         FROM   cte_values_to_unpivot
                UNPIVOT ( columnvalue
                        FOR columnname IN ( numberofanimals
                                            , awesomeindicator
                                            , awesomeind_type_cd ) ) AS up1),
     cte_unpivoted_old
     AS (SELECT id
                , [columnvalue] AS [Old_Value]
                , columnname    AS columnname
         FROM   cte_values_to_unpivot
                UNPIVOT ( columnvalue
                        FOR columnname IN ( numberofanimals_old
                                            , awesomeindicator_old
                                            , awesomeind_type_cd_old ) ) AS up1)
SELECT cun.id
       , cun.columnname
       , cun.new_value
       , cuo.old_value
FROM   cte_unpivoted_new cun
       JOIN cte_unpivoted_old cuo
         ON cun.id = cuo.id
            AND cun.columnname = Replace(cuo.columnname, '_old', '') -- Make the column names match
WHERE  cun.new_value <> cuo.old_value
       AND NOT ( cun.new_value IS NULL
                 AND cuo.old_value IS NULL ) 
-- Catches for if both values are null and wont evaluate



No comments:

Post a Comment