SSIS Delta Load – CDC Approach SQL Server 2008 R2

I have spoken about CDC in my last post will speak about how to impliment CDC in conjuction with SSIS. This approrach might not be the best approach and you would need to evaluate properly in case you decide to implimenting the same

MS describes 2 ways to impliment the same,

Specified Interval Sample: http://msftisprodsamples.codeplex.com/wikipage?title=SS2008!Change%20Data%20Capture%20for%20Specified%20Interval%20Package%20Sample&ProjectName=msftisprodsamples

Change Data Capture since Last Request Package Sample

http://msftisprodsamples.codeplex.com/wikipage?title=SS2008%21Change%20Data%20Capture%20since%20Last%20Request%20Package%20Sample

i have used  the change data capture since last request package appraoch and cusomised its use, will try to define and simplify the same here

The package at high level looks like

The basic approach is:

  1. There are 4 containers, high level as defined below:
    1. Initialization Container: This step initializes all the connection Managers on the package and logs the start of the package to SQL server via the OnPrexecute Event Handler
    2. LSN assign Container: This step calculates if there are any data changes to be processed for the present run.
    3. Child package Container: This step calls all the child packages required for loading the data.
    4. End Package Container: This logs the end of package and sends Email, this will replaced with the SEND MAIL task provided in SSIS, and based on requirements the email will be sent to ADMIN(Need to figure out whom should Emails be sent to? )

Points 1 and 4 above are specific to your own implimentation, I would shed some light on the 2 and 3 as part of this post

THe basic approach taken requires us to introduce a new table called as LOAD_LOG onto the Destination DB, this table will capture the Start LSN, END LSN, Start Time, End TIme , Load Status.i.e. it defines values S- Sucess F: Failure N:No records Found

The approach above is a incremental data load approach it requires the initial data load to be completed in advance, and ETL_ Load_Log table to be populated manually once with the appropriate LSN values which can be sourced from cdc.lsn_time_mappingtable on your DB

Obtain the Start LSN from the Load_LOG table,  this would be the same as the latest row on this table which has load run= S and picked from EndLsn Column sample query may look like

The query tries to get the lastLSN which was succefully loaded which can then act as a start point for the next load

The next load LSN (both start and END) can be odtained via using the fucntion

StartLSN: sys.fn_Cdc_Incriment: this function will incriment the last sucssfull loaded LSN to the next LSN

EndLSN: sys.fn_cdc_get_max_Lsn(), this will give the END LSN for which load needs to run,

Keep a note the LSN values are binary values and for passing in the package we need to conver them to string and then viceversa for this we need to use functions like sys.fn_varbintoHex() and sys.fn_cdc_hexstrtoBin

An example of how to find out the new points once we know the last Succefull LSN load would look like something in the image below:

The step above sets the varaible dataready which can be used to figure out if the data is ready to be loaded or not, as above compare the LSN’s obtained as stored on the cdec.lsn_time_mapping table if the lsn <= startLSN and >= EndLSN exists that means data is ready to be loaded and we can go forward with the load else the no data for loading exisits

If the data is ready the next step to insert a record onto the load_log table with the status F, which would be updated post all the data load completion with S, if at all the things run fine

Once inserte the next conatiner would be used to call all the child packages whcih would have the code for all the tables required during load

Hope this articale makes sense the next article i would deal with the drawbacks of the appraoch i ahve taken and precautions which needs to be taken for this approach to work correctly

I would like to know the feedback and  questions for the above post, do write in what is good what is bad which has pointers for improvements in future posts.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: