CDC Feature in Microsoft SQL Server

Writing my first BLOG ever, couldn’t think of anything better then the technical stuffs which i do with my Employer. Today would pin down my thoughts for design approach i took for one of the projects where we wanted to capture DELTA loads using the newbie(not so new now) in SQL SERVER 2008 R2. Let me introduce first to some concepts

What is CDC: CDC aka CHANGE DATA CAPTURE is the feature introduced via sql server 2008, CDC is available only on the Enterprise, Developer, and Evaluation editions of SQL Server 2008

What does CDC do: CDC records insert, update, and delete activity that is applied to a SQL Server table making them available as easily consumable relational format tables. Any changes that are made to the specified tables is captured, and the modified rows are stored in change tables that mirror the column structure of the tracked source tables. Post that CDC/ SQL server provides Table-valued functions for systematic access to the change data by consumers.

Explaining CDC: If we try to correlate how Microsoft brought all these things together, and try to correlate, remember the good old concepts of Triggers and writing of log files, MS brought both of them together to give us something which is called as CDC, you might think how are these related to understand its better i would try to correlate and explain things such that they could be understood better, Warning: This is just for better understanding and SQL server might not exactly do all the steps)

As we know any changes made to the Database are written into a forward only .ldf file by the SQL server lazy writer think of it as a page in a book where someone continues writing and cant go back, so whatever changes or things he wants to write he will continue writing, to recognize one will place page numbers or row numbers in between so that things can be tracked, something called as LSN(Logical Sequence Number) for SQL server.

Now once the CDC is implemented the changes records are captured via the triggers magic table(Inserted, deleted) and rows are captured with the LSN from the ldf files and inserted as data to the new tables, this is at high level how CDC works.

ENOUGH OF THEORY SOME CODE NOW: First COde to enable CDC

USE <DBNAME>
GO

–Create FileGroup to Store the CDC TABLES
ALTER DATABASE ADD FILEGROUP fgCDC

–Associate File .ndf for the FileGroup
ALTER DATABASE
ADD FILE (NAME = ‘CDC.ndf’,
FILENAME = ‘C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\NDF\CDC.ndf’,
SIZE = 1024,
MAXSIZE = UNLIMITED ,
FILEGROWTH = 10244)
TO FILEGROUP fgCDC

DECLARE @Error VARCHAR(MAX)
BEGIN TRY
BEGIN TRAN
select name, database_id,is_cdc_enabled
from sys.databases

– Enable CDC
EXEC sys.sp_cdc_enable_db
–EXEC sys.sp_cdc_disable_db

–Check Enabled or Not on DB
select name, database_id,is_cdc_enabled
from sys.databases

– Enable a Table Specifying Filegroup Option
– ===================================================
EXEC sys.sp_cdc_enable_table
@source_schema = N’dbo’,
@source_name = N’Test’,
@role_name = NULL,
@filegroup_name = N’fgCDC’,
@supports_net_changes = 1

END TRY

BEGIN CATCH
SELECT @Error = ERROR_MESSAGE()+ ERROR_LINE()
RAISERROR(16,@Error,1)
ROLLBACK TRAN
RETURN
END CATCH

COMMIT TRAN

Steps Involved

a) Create a file group to store the CDC enabled tables, this file group would store the secondary files defined to store the CDC tables.

b) Associate a secondary file(.ndf) to the filegroup, note that the .ndf should reside on a separate directory other then the primary .mdf file.

c) Enable CDC at database instance level : This is done by using the stored procedure sys.sp_cdc_enable_db.

d) Enable CDC for Table level instance : Source tables can be identified as tracked tables by using the stored procedure sys.sp_cdc_enable_table. When a table is enabled for change data capture, an associated capture instance is created to support the dissemination of the change data in the source table. The capture instance consists of a change table and up to two query functions

All objects that are associated with a capture instance are created in the change data capture schema of the enabled database. By default, the name is <schema name_table name> of the source table. Its associated change table is named by appending _CT to the capture instance name. The function that is used to query for all changes is named by prepending fn_cdc_get_all_changes_ to the capture instance name. If the capture instance is configured to support net changes, the net_changes query function is also created and named by prepending fn_cdc_get_net_changes_ to the capture instance name.

CDC has to be enabled for each Database and then tables individually on the server. once enabled SQL server creates several tables/ functions which are used to help CDC work, most important of them are, assuming we enabled CDC for a TABLEA

a) Table:

CDC.lsn_time_Mapping: this table is database specific and captures all the start times and end times of trassactions with the respective LSN(Logical Sequence Numbers)

CDC.dbo_TABLEA:THis is the table with the same structure as the table you want to capture, with some additional columns like: StartLSN, EndLSN, $UpdateMask(this stores values in binary format and help identifies which column has changed), $Operation(Specifies the type of operation row corrsoponds to, 1: Delete, 2: Inserts, 3: Updated values before, 4: Updated value after)

SSIS Delta Load – CDC Approach SQL Server 2008 R2

I have spoken about CDC in my last post will speak about how to impliment CDC in conjuction with SSIS. This approrach might not be the best approach and you would need to evaluate properly in case you decide to implimenting the same

MS describes 2 ways to impliment the same,

Specified Interval Sample: http://msftisprodsamples.codeplex.com/wikipage?title=SS2008!Change%20Data%20Capture%20for%20Specified%20Interval%20Package%20Sample&ProjectName=msftisprodsamples

Change Data Capture since Last Request Package Sample

http://msftisprodsamples.codeplex.com/wikipage?title=SS2008%21Change%20Data%20Capture%20since%20Last%20Request%20Package%20Sample

i have used  the change data capture since last request package appraoch and cusomised its use, will try to define and simplify the same here

The package at high level looks like

The basic approach is:

  1. There are 4 containers, high level as defined below:
    1. Initialization Container: This step initializes all the connection Managers on the package and logs the start of the package to SQL server via the OnPrexecute Event Handler
    2. LSN assign Container: This step calculates if there are any data changes to be processed for the present run.
    3. Child package Container: This step calls all the child packages required for loading the data.
    4. End Package Container: This logs the end of package and sends Email, this will replaced with the SEND MAIL task provided in SSIS, and based on requirements the email will be sent to ADMIN(Need to figure out whom should Emails be sent to? )

Points 1 and 4 above are specific to your own implimentation, I would shed some light on the 2 and 3 as part of this post

THe basic approach taken requires us to introduce a new table called as LOAD_LOG onto the Destination DB, this table will capture the Start LSN, END LSN, Start Time, End TIme , Load Status.i.e. it defines values S- Sucess F: Failure N:No records Found

The approach above is a incremental data load approach it requires the initial data load to be completed in advance, and ETL_ Load_Log table to be populated manually once with the appropriate LSN values which can be sourced from cdc.lsn_time_mappingtable on your DB

Obtain the Start LSN from the Load_LOG table,  this would be the same as the latest row on this table which has load run= S and picked from EndLsn Column sample query may look like

The query tries to get the lastLSN which was succefully loaded which can then act as a start point for the next load

The next load LSN (both start and END) can be odtained via using the fucntion

StartLSN: sys.fn_Cdc_Incriment: this function will incriment the last sucssfull loaded LSN to the next LSN

EndLSN: sys.fn_cdc_get_max_Lsn(), this will give the END LSN for which load needs to run,

Keep a note the LSN values are binary values and for passing in the package we need to conver them to string and then viceversa for this we need to use functions like sys.fn_varbintoHex() and sys.fn_cdc_hexstrtoBin

An example of how to find out the new points once we know the last Succefull LSN load would look like something in the image below:

The step above sets the varaible dataready which can be used to figure out if the data is ready to be loaded or not, as above compare the LSN’s obtained as stored on the cdec.lsn_time_mapping table if the lsn <= startLSN and >= EndLSN exists that means data is ready to be loaded and we can go forward with the load else the no data for loading exisits

If the data is ready the next step to insert a record onto the load_log table with the status F, which would be updated post all the data load completion with S, if at all the things run fine

Once inserte the next conatiner would be used to call all the child packages whcih would have the code for all the tables required during load

Hope this articale makes sense the next article i would deal with the drawbacks of the appraoch i ahve taken and precautions which needs to be taken for this approach to work correctly

I would like to know the feedback and  questions for the above post, do write in what is good what is bad which has pointers for improvements in future posts.