Late Arriving Dimensions - Design Approach

This post describes how to handle Late Arriving Dimensions in a Data Mart, and how we did it in our project.

Our approach was taken from the design tip published on the Kimball Group website (the guru of all things dimensional). Here is the link; I will try to describe the approach it states and what we did to suit our needs:

http://www.kimballgroup.com/html/designtipsPDF/DesignTips2006/KU78LateArrivingDimensionRows.pdf

What is a Late Arriving Dimension (LAD)?

LADs are dimensions for which fact table records are populated prior to the corresponding dimension records. This statement in itself is very confusing, especially for people from the OLTP world, but in the OLAP world it is something we have to deal with. I will cite an example from a real-life project and try to convince you.

In the insurance industry, consider a policy lifecycle: a policy is issued to a person and he has the POLICYNUMBER with him. There is a 15-day period in which the policy document is delivered to the client and his approval is sought, and the entry for the policy is made/confirmed only when the document is delivered and accepted by the client. So if any claim is made against that POLICY NUMBER during those 15 days, it creates an orphan record, in the sense that the dimension does not yet have an entry for that record.

So the scenario, with the table structures, is as below:

TableDim1: (KeyCol, BusinessKey, Description)

KeyCol: The surrogate key column of the dimension; this is a Type 1 SCD, populated on a daily basis.

BusinessKey: The key column as in the upstream OLTP system (the key value generated by the OLTP).

TableFact1: (KeyColFact, KeyColDim1, KeyColDim2, FactValue)

KeyColDim1: The key column coming from the dimension (primary key -> foreign key relation).

KeyColDim2: The key column, same as above, for a different dimension.

FactValue: The value of the measure.

Now, to define the design approach we took:

TableDim1 would have a default row, with values like:

(0, 0, UNKNOWN)

During the ETL data load, when the fact value comes in and a lookup is done against the dimension table on the basis of the business key, NULL is returned because TableDim1 has not yet received the BusinessKey record. We do an ISNULL on that column of the fact row and replace it with the default 0, so the fact record looks like:

(0, 4, 2111.07). Note that the first column is the KeyColDim1 column, which was looked up, and it has received the default value.
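To make this concrete, here is a minimal T-SQL sketch of that lookup step, assuming a staging table StgFact1 and a second dimension TableDim2 (those names, and the staging column names, are illustrative, not from the actual project):

INSERT INTO TableFact1 (KeyColDim1, KeyColDim2, FactValue)
SELECT ISNULL(d1.KeyCol, 0),   -- fall back to the UNKNOWN row (0) when the business key is missing
       ISNULL(d2.KeyCol, 0),
       s.FactValue
FROM StgFact1 s
LEFT JOIN TableDim1 d1 ON d1.BusinessKey = s.BusinessKey1
LEFT JOIN TableDim2 d2 ON d2.BusinessKey = s.BusinessKey2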

So far so good: we have handled the fact row well, and our data load/cube processing will not fail because the constraints (primary key -> foreign key) are met. But the question is, as and when the dimension value for that specific record arrives, how does the OLAP system know that it has arrived? To solve this we again took the help of the guru, and this is what we did:

a) Introduce a new table which captures all the misses for the business key, with one row for each miss. This table captures the BusinessKey and the fact row key column (KeyColFact).

b) Once the ETL load has finished, traverse all the captured fact rows, look up the business keys again against the dimension table (as the dimension record should have arrived by now), obtain the new KeyColDim1, and update the fact table on the basis of KeyColFact, as sketched below.
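A minimal T-SQL sketch of step (b), assuming the miss-capture table from step (a) is called LateArrivingDim1 with columns BusinessKey and KeyColFact (the names are illustrative):

-- Re-resolve the business keys now that the dimension rows have arrived
UPDATE f
SET f.KeyColDim1 = d.KeyCol
FROM TableFact1 f
INNER JOIN LateArrivingDim1 l ON l.KeyColFact = f.KeyColFact
INNER JOIN TableDim1 d ON d.BusinessKey = l.BusinessKey
WHERE f.KeyColDim1 = 0   -- only rows still pointing at the UNKNOWN member

-- Clean up the misses that have now been resolved
DELETE l
FROM LateArrivingDim1 l
INNER JOIN TableDim1 d ON d.BusinessKey = l.BusinessKey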

This was the easiest and safest way around, and it can easily be enhanced to use a single table for all the late arriving dimensions by adding a new column that records the column name.


MOLAP vs ROLAP vs HOLAP

Today's topic is once again something I was confused about: I always had a high-level idea but never really read up and understood the differences correctly, until a friend enlightened me about an MS white paper on performance optimisation of cubes. Here are the links, and for people too lazy to read through the whole thing, the extract of the differences between the storage modes goes as below:

SQL Server 2005

http://download.microsoft.com/download/8/5/e/85eea4fa-b3bb-4426-97d0-7f7151b2011c/SSAS2005PerfGuide.doc

http://www.sql-server-performance.com/2009/ssas-storage-modes/

 MOLAP:

MOLAP stores the aggregations and the data from the fact tables and dimensions on the Analysis Services instance. The data and aggregations are stored in a compressed multidimensional data structure, which is suited for fast retrieval and occupies very little space. This is precisely why, despite all the fact and dimension data being copied, MOLAP still occupies much less space than the corresponding relational data. On average, as guidance, Analysis Services compresses the source fact data and its aggregations to approximately 30 percent of the size of the same data stored in a relational database; the actual ratio can vary based on the number of duplicate keys and the bit encoding algorithms.

Unlike ROLAP and HOLAP, where Analysis Services relies on the relational engine to perform potentially large table joins against fact data stored in the relational database to resolve some or all queries, MOLAP relies on the multidimensional data alone.

Pros of MOLAP:

Faster data access, good user experience.

All the calculations are pre-generated when the cube is processed and stored locally on the OLAP server, so even complex calculations, as part of the query result, will be performed quickly.

MOLAP does not need a permanent connection to the underlying relational database (only at the time of processing), as it stores the detail and aggregate data on the OLAP server, so the data can be viewed even when there is no connection to the relational database.

Cons of MOLAP

With MOLAP mode, we need frequent processing to pull refreshed data after the last processing, resulting in a drain on system resources.

If the data volume is high, cube processing can take longer, though you can use incremental processing to overcome this.

ROLAP:

This mode requires the least storage: none of the data or aggregations are stored on the Analysis Server; everything resides on the relational DB side. It can be useful in scenarios where constraints are not applied on the relational DB side, but then we need to ensure that the UNKNOWN member is enabled on both dimensions and facts.

Analysis Services relies on the relational engine to resolve all queries against the relational tables, which store both the fact data and the aggregations. The sheer number of queries with large table joins in large or complex cubes frequently overwhelms the relational engine.

Pros

The data available in such a cube is near real time.

Since ROLAP keeps the data in the relational DB, it has lower storage requirements. This is very advantageous for large datasets which are queried infrequently, such as historical data.

Cons

Compared to MOLAP or HOLAP the response is generally slower, as everything lies in the relational DB.

HOLAP:

HOLAP is a combination of the good parts of MOLAP and ROLAP: it stores the aggregations on the Analysis Server, while the data (fact and dimensional) remains in the relational DB.

As a result, whenever Analysis Services needs to resolve a query against fact data stored in a HOLAP partition, it has to query the relational database directly, rather than querying a multidimensional structure as in the case of MOLAP.

 

As far as processing goes, there is no significant difference between MOLAP partitions and HOLAP partitions: in both, all fact data is read from the relational database and the aggregations are calculated; with HOLAP, Analysis Services simply does not store the fact data. This difference has minimal impact on processing performance, but it can have a significant impact on query performance: since HOLAP and MOLAP processing speeds are nearly the same and MOLAP query performance is superior, MOLAP is usually the optimum storage choice.

Pros:

Since the aggregations are stored on the OLAP server in a compressed multidimensional format, queries that can be answered from the aggregations alone perform almost as well as in MOLAP.

The aggregations are pre-calculated when the cube is processed, so summary-level browsing is fast.

Because the detail (fact) data stays in the relational database and only the aggregations are copied, HOLAP has lower storage requirements than MOLAP for the same amount of data.

CONS:

Queries that need detail-level data have to be resolved against the relational database, so they are slower than in MOLAP.

Just after processing, if there are any changes in the relational database they will not be reflected in the aggregations on the OLAP server unless reprocessing is performed.

 

CDC Feature in Microsoft SQL Server

Writing my first blog post ever, I couldn't think of anything better than the technical stuff I do with my employer. Today I will pin down my thoughts on the design approach I took for one of the projects, where we wanted to capture delta loads using a newbie (not so new now) feature in SQL Server 2008 R2. Let me first introduce some concepts.

What is CDC: CDC, aka Change Data Capture, is a feature introduced in SQL Server 2008. CDC is available only on the Enterprise, Developer, and Evaluation editions of SQL Server 2008.

What does CDC do: CDC records insert, update, and delete activity applied to a SQL Server table, making the changes available in easily consumable relational tables. Any change made to the specified tables is captured, and the modified rows are stored in change tables that mirror the column structure of the tracked source tables. SQL Server then provides table-valued functions for systematic access to the change data by consumers.

Explaining CDC: To correlate how Microsoft brought all these things together, remember the good old concepts of triggers and the writing of log files; MS brought both of them together to give us something called CDC. You might wonder how these are related, so I will try to correlate and explain them so they can be understood better. (Warning: this is just for better understanding, and SQL Server might not do exactly all of these steps.)

As we know, any change made to the database is written into a forward-only .ldf file by the SQL Server log writer. Think of it as a page in a book where someone keeps writing and can't go back: whatever changes he wants to record, he continues writing, and to keep track of things he places page or row numbers in between. For SQL Server that marker is the LSN (Log Sequence Number).

Once CDC is enabled, the changed records are captured (much like the triggers' magic tables, inserted and deleted), the rows are picked up along with their LSNs from the .ldf file, and they are inserted as data into new tables. This, at a high level, is how CDC works.

ENOUGH OF THEORY, SOME CODE NOW: first, the code to enable CDC.

USE <DBNAME>
GO

-- Create a filegroup to store the CDC tables
ALTER DATABASE <DBNAME> ADD FILEGROUP fgCDC

-- Associate an .ndf file with the filegroup
ALTER DATABASE <DBNAME>
ADD FILE (NAME = 'CDC.ndf',
FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\NDF\CDC.ndf',
SIZE = 1024,
MAXSIZE = UNLIMITED,
FILEGROWTH = 1024)
TO FILEGROUP fgCDC

DECLARE @Error VARCHAR(MAX)
BEGIN TRY
BEGIN TRAN

-- Check whether CDC is already enabled on the database
SELECT name, database_id, is_cdc_enabled
FROM sys.databases

-- Enable CDC at the database level
EXEC sys.sp_cdc_enable_db
--EXEC sys.sp_cdc_disable_db

-- Check that CDC is now enabled on the database
SELECT name, database_id, is_cdc_enabled
FROM sys.databases

-- Enable a table, specifying the filegroup option
-- ===================================================
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'Test',
@role_name = NULL,
@filegroup_name = N'fgCDC',
@supports_net_changes = 1

END TRY

BEGIN CATCH
SELECT @Error = ERROR_MESSAGE() + ' at line ' + CAST(ERROR_LINE() AS VARCHAR(10))
RAISERROR(@Error, 16, 1)
ROLLBACK TRAN
RETURN
END CATCH

COMMIT TRAN

Steps Involved

a) Create a filegroup to store the CDC-enabled tables; this filegroup will hold the secondary files defined to store the CDC tables.

b) Associate a secondary file (.ndf) with the filegroup; note that the .ndf should reside in a separate directory from the primary .mdf file.

c) Enable CDC at the database level: this is done using the stored procedure sys.sp_cdc_enable_db.

d) Enable CDC at the table level: source tables are identified as tracked tables by using the stored procedure sys.sp_cdc_enable_table. When a table is enabled for change data capture, an associated capture instance is created to support the dissemination of the change data from the source table. The capture instance consists of a change table and up to two query functions.

All objects that are associated with a capture instance are created in the change data capture schema of the enabled database. By default, the capture instance name is <schema name>_<table name> of the source table. Its associated change table is named by appending _CT to the capture instance name. The function used to query for all changes is named by prepending fn_cdc_get_all_changes_ to the capture instance name. If the capture instance is configured to support net changes, the net_changes query function is also created, named by prepending fn_cdc_get_net_changes_ to the capture instance name.
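For example, with the dbo.Test table enabled by the script above (capture instance dbo_Test), the changes can be read through the generated query function; the variable names here are just illustrative:

DECLARE @from_lsn BINARY(10), @to_lsn BINARY(10)

SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Test')  -- lowest LSN available for this capture instance
SET @to_lsn   = sys.fn_cdc_get_max_lsn()            -- highest LSN captured so far

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Test(@from_lsn, @to_lsn, N'all')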

CDC has to be enabled for each database, and then for each table individually, on the server. Once enabled, SQL Server creates several tables/functions which help CDC work. The most important of them, assuming we enabled CDC for a table TABLEA, are:

a) Tables:

cdc.lsn_time_mapping: this table is database specific and captures the start and end times of transactions together with the respective LSNs (Log Sequence Numbers).

cdc.dbo_TABLEA_CT: this is the change table, with the same structure as the table you want to capture, plus some additional columns such as __$start_lsn, __$end_lsn, __$update_mask (stored in binary format; it helps identify which column has changed) and __$operation (specifies the type of operation the row corresponds to: 1 = delete, 2 = insert, 3 = updated value before, 4 = updated value after).
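For a quick look at these columns you can query the change table directly (fine for exploration; for regular consumption the query functions described above are the supported route). This sketch assumes TABLEA lives in the dbo schema:

SELECT __$start_lsn,
       __$operation,
       CASE __$operation
            WHEN 1 THEN 'Delete'
            WHEN 2 THEN 'Insert'
            WHEN 3 THEN 'Update (before)'
            WHEN 4 THEN 'Update (after)'
       END AS OperationName,
       __$update_mask
FROM cdc.dbo_TABLEA_CT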

SSIS Delta Load – CDC Approach SQL Server 2008 R2

I spoke about CDC in my last post; here I will speak about how to implement CDC in conjunction with SSIS. This approach might not be the best one, and you should evaluate it properly before deciding to implement it.

MS describes two ways to implement this:

Change Data Capture for Specified Interval Package Sample: http://msftisprodsamples.codeplex.com/wikipage?title=SS2008!Change%20Data%20Capture%20for%20Specified%20Interval%20Package%20Sample&ProjectName=msftisprodsamples

Change Data Capture since Last Request Package Sample

http://msftisprodsamples.codeplex.com/wikipage?title=SS2008%21Change%20Data%20Capture%20since%20Last%20Request%20Package%20Sample

I have used the Change Data Capture since Last Request package approach and customised it; I will try to define and simplify it here.

At a high level, the package is organised as follows.

The basic approach is:

  1. There are 4 containers, at a high level as defined below:
    1. Initialization Container: This step initialises all the connection managers in the package and logs the start of the package to SQL Server via the OnPreExecute event handler.
    2. LSN Assign Container: This step calculates whether there are any data changes to be processed for the present run.
    3. Child Package Container: This step calls all the child packages required for loading the data.
    4. End Package Container: This logs the end of the package and sends an email; this will be replaced with the Send Mail task provided in SSIS, and based on requirements the email will be sent to the admin (need to figure out whom the emails should be sent to).

Points 1 and 4 above are specific to your own implementation; I will shed some light on points 2 and 3 as part of this post.

The basic approach requires us to introduce a new table called LOAD_LOG in the destination DB. This table captures the start LSN, end LSN, start time, end time and load status, where the status values are S = success, F = failure, N = no records found.
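A possible shape for this table, just a sketch with assumed column names and types, is:

CREATE TABLE dbo.LOAD_LOG (
    LoadLogId   INT IDENTITY(1,1) PRIMARY KEY,
    StartLSN    BINARY(10) NOT NULL,   -- first LSN included in this load
    EndLSN      BINARY(10) NOT NULL,   -- last LSN included in this load
    StartTime   DATETIME   NOT NULL,
    EndTime     DATETIME   NULL,
    LoadStatus  CHAR(1)    NOT NULL    -- S = success, F = failure, N = no records found
)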

This is an incremental data load approach: it requires the initial data load to be completed in advance, and the LOAD_LOG table to be populated manually once with the appropriate LSN values, which can be sourced from the cdc.lsn_time_mapping table in your DB.

Obtain the start LSN from the LOAD_LOG table; this is the EndLSN column of the latest row in that table with load status = S. A sample query may look like the one below.

The query tries to get the last LSN which was successfully loaded, which can then act as the start point for the next load:
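Based on the LOAD_LOG columns assumed above, it might look roughly like this (the original post showed it as a screenshot):

DECLARE @LastLoadedLSN BINARY(10)

-- Last LSN that was successfully loaded; this becomes the starting point for the next run
SELECT TOP 1 @LastLoadedLSN = EndLSN
FROM dbo.LOAD_LOG
WHERE LoadStatus = 'S'
ORDER BY EndTime DESC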

The LSN range for the next load (both start and end) can be obtained via the following functions:

StartLSN: sys.fn_cdc_increment_lsn: this function increments the last successfully loaded LSN to the next LSN.

EndLSN: sys.fn_cdc_get_max_lsn(): this gives the end LSN up to which the load needs to run.

Keep in mind that LSN values are binary; to pass them around in the package we need to convert them to strings and back again. For this we use functions like sys.fn_varbintohexstr() and sys.fn_cdc_hexstrtobin().

An example of how to find the new start and end points, once we know the last successful LSN load, is sketched below (the original post showed this as a screenshot):
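Here is a sketch of that step in T-SQL, using the assumed LOAD_LOG columns from earlier; the variable names are illustrative:

DECLARE @LastLoadedLSN BINARY(10), @StartLSN BINARY(10), @EndLSN BINARY(10), @DataReady BIT

-- Last successfully loaded LSN (same lookup as above)
SELECT TOP 1 @LastLoadedLSN = EndLSN
FROM dbo.LOAD_LOG
WHERE LoadStatus = 'S'
ORDER BY EndTime DESC

SET @StartLSN = sys.fn_cdc_increment_lsn(@LastLoadedLSN)  -- next LSN after the last successful load
SET @EndLSN   = sys.fn_cdc_get_max_lsn()                  -- highest LSN captured so far

-- Data is ready only if at least one captured transaction falls inside the range
IF EXISTS (SELECT 1 FROM cdc.lsn_time_mapping
           WHERE start_lsn >= @StartLSN AND start_lsn <= @EndLSN)
    SET @DataReady = 1
ELSE
    SET @DataReady = 0

-- LSNs are binary, so hand them back to SSIS string variables as hex strings;
-- the child packages convert them back with sys.fn_cdc_hexstrtobin()
SELECT sys.fn_varbintohexstr(@StartLSN) AS StartLSNStr,
       sys.fn_varbintohexstr(@EndLSN)   AS EndLSNStr,
       @DataReady                       AS DataReady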

The step above sets the variable DataReady, which is used to figure out whether data is ready to be loaded. As shown, we compare the LSNs obtained against those stored in the cdc.lsn_time_mapping table: if an LSN exists that is >= StartLSN and <= EndLSN, data is ready to be loaded and we can go forward with the load; otherwise there is no data for loading.

If the data is ready, the next step is to insert a record into the LOAD_LOG table with status F, which is updated to S once all the data loads complete, if everything runs fine.
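In T-SQL terms the bookkeeping might look roughly like this, again using the assumed LOAD_LOG columns (in the package the two LSN values come from the SSIS variables set in the LSN assign container):

DECLARE @StartLSN BINARY(10), @EndLSN BINARY(10)
-- @StartLSN and @EndLSN are supplied by the package, e.g. as Execute SQL task parameters

-- Before the child packages run: log the window with status F (failure until proven otherwise)
INSERT INTO dbo.LOAD_LOG (StartLSN, EndLSN, StartTime, LoadStatus)
VALUES (@StartLSN, @EndLSN, GETDATE(), 'F')

-- After all child packages finish successfully: flip the status to S
UPDATE dbo.LOAD_LOG
SET LoadStatus = 'S',
    EndTime = GETDATE()
WHERE StartLSN = @StartLSN
  AND EndLSN = @EndLSN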

Once the record is inserted, the next container is used to call all the child packages, which contain the code for all the tables required during the load.

I hope this article makes sense. In the next article I will deal with the drawbacks of the approach I have taken and the precautions that need to be taken for it to work correctly.

I would like to know your feedback and questions on the above post; do write in about what is good and what is bad, with pointers for improvement in future posts.

Cube Processing Using SSIS

This has been a long-pending post from me; I always thought of penning down my thoughts but somehow kept missing out. Today I will finally deal with it, or maybe not, let's see how it goes. Pardon the spellings, though.

Now, this topic is something for which I didn't find ready-made code available by googling, so maybe I can solve that issue and give the code here, so that if someone like me wants it, this post can be of quick assistance. The topic for today is cube processing. You might think it is no big deal, just do a full process of the cubes each time; but it becomes a tad complex when we have huge cubes with partitions, where each partition needs to be created dynamically on the basis of some relational table column. For example:

a) Consider a huge retail cube which has 5 million transactions to be processed each day; think of how many that is monthly or yearly, so we partition the cube on the basis of months or even dates.

b) Consider a huge cube for several countries; this requires us to create partitions on the basis of a country identifier.

So let's begin. The SSIS task is a simple Script Task; you can have your own error handling in place, which I am not going to define here, as I will concentrate on just the Script Task.

I will start from what goes inside Main. In this example I am partitioning by monthly periods; it can easily be changed to daily periods as well. I will start with a sample where the partition name and cube name are passed in; this can easily be modified to not pass the cube name and instead process all the cubes in the catalogue.

Things to be passed as variables (NOTE: these can easily be assigned to the Script Task by using a SQL Server package configuration):

Server Name, Cube Catalogue Name, Period, CubeName, ConnectionString

————————————————————————————————————————————–

' Microsoft SQL Server Integration Services Script Task
' Write scripts using Microsoft Visual Basic 2008.
' The ScriptMain is the entry point class of the script.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports Microsoft.AnalysisServices

<System.AddIn.AddIn("ScriptMain", Version:="1.0", Publisher:="", Description:="")> _
<System.CLSCompliantAttribute(False)> _
Partial Public Class ScriptMain
    Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase

    Enum ScriptResults
        Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
        Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
    End Enum

    ' The execution engine calls this method when the task executes.
    ' To access the object model, use the Dts property. Connections, variables, events,
    ' and logging features are available as members of the Dts property.
    ' Before returning from this method, set the value of Dts.TaskResult to indicate success or failure.

    Dim oServer As New Microsoft.AnalysisServices.Server
    Dim oDB As Database

    Public Sub Main()

        Dim oConnection As ConnectionManager
        Dim sCube As String, sMesGroupNm As String, sPartName As String

        ' Read the connection string and the cube/measure group/partition names from package variables
        Dts.Connections("Connections").ConnectionString = Dts.Variables("ConStr").Value.ToString
        sCube = Dts.Variables("CubeName").Value.ToString
        sMesGroupNm = Dts.Variables("MesGroupNm").Value.ToString
        sPartName = Dts.Variables("PartName").Value.ToString

        oConnection = Dts.Connections("Connections")
        ' Assign server name
        Dim sServer As String = CStr(oConnection.Properties("ServerName").GetValue(oConnection))
        ' Assign catalogue (database) name
        Dim sDatabase As String = CStr(oConnection.Properties("InitialCatalog").GetValue(oConnection))

        Try
            ' Connect to the Analysis Services server and locate the catalogue
            oServer.Connect(sServer)
            oDB = oServer.Databases.FindByName(sDatabase)

            ' Start calling sub-functions: create the partition if required, then process it
            CheckndProcessPartitions(sCube, sMesGroupNm, sPartName)

        Catch ex As Exception
            Dts.Events.FireError(0, "Cube Processing Script", ex.Message, String.Empty, 0)
            Dts.TaskResult = ScriptResults.Failure
            Return
        End Try

        Dts.TaskResult = ScriptResults.Success
    End Sub

    Public Sub CheckndProcessPartitions(ByVal sCube As String, ByVal MesGrpName As String, ByVal partName As String)

        Dim sPeriod As String
        sPeriod = Dts.Variables("Period").Value.ToString
        Try
            ' Create the partition for the period if it does not exist yet
            If (Not PartitionExists(sCube, MesGrpName, partName)) Then
                CreateNewPartition(sCube, sPeriod)
            End If

            ' Call the processing code
            ProcessPartitions(sCube, MesGrpName, partName)
        Catch ex As Exception
            Throw ex
        End Try

    End Sub

    Public Function PartitionExists(ByVal sCube As String, ByVal sMeasureGroup As String, ByVal sPartition As String) As Boolean

        Dim oCube As Cube = oDB.Cubes.FindByName(sCube)
        If oCube Is Nothing Then
            Return False
        End If

        Dim oMeasureGroup As MeasureGroup = oCube.MeasureGroups.FindByName(sMeasureGroup)
        If oMeasureGroup Is Nothing Then
            Return False
        End If

        Dim oPartition As Partition = oMeasureGroup.Partitions.FindByName(sPartition)
        Return Not oPartition Is Nothing

    End Function

    Public Function CreateNewPartition(ByVal sCube As String, ByVal PartitionDate As String) As Boolean

        Dim oCube As Cube = oDB.Cubes.FindByName(sCube)
        If oCube Is Nothing Then
            Return False
        End If

        Try
            For Each oMeasureGroup As MeasureGroup In oCube.MeasureGroups

                ' The new partition is named <MeasureGroupName>_<Period>
                Dim sNewPartitionName As String = oMeasureGroup.Name + "_" + PartitionDate

                ' Skip this measure group if the partition already exists
                If Not oMeasureGroup.Partitions.FindByName(sNewPartitionName) Is Nothing Then
                    Continue For
                End If

                ' Clone the last existing partition as a template for the new one
                Dim oPartition As Partition = oMeasureGroup.Partitions(oMeasureGroup.Partitions.Count - 1)
                Dim oPartitionNew As Partition = oPartition.Clone()
                oPartitionNew.ID = sNewPartitionName
                oPartitionNew.Name = sNewPartitionName

                Dim oQueryBinding As QueryBinding = TryCast(oPartitionNew.Source, QueryBinding)
                If oQueryBinding Is Nothing Then
                    Continue For ' No query binding, so nothing to re-point for this measure group
                End If

                ' Re-point the cloned partition's query at the new period and save it
                oQueryBinding.QueryDefinition = GetNewQuery(oQueryBinding.QueryDefinition, PartitionDate)
                oMeasureGroup.Partitions.Add(oPartitionNew)
                oPartitionNew.Update()

            Next
        Catch ex As Exception
            Throw ex
        End Try

        Return True
    End Function

    Public Function GetNewQuery(ByVal SourceQuery As String, ByVal DatePartition As String) As String

        ' Replace the WHERE clause of the template partition's query with one that filters
        ' on the new period. ColName stands for the partitioning column in the fact table.
        Dim mainQ As String
        Dim lastIndex As Integer = SourceQuery.ToUpper().LastIndexOf("WHERE")

        If lastIndex >= 0 Then
            mainQ = SourceQuery.Substring(0, lastIndex)
        Else
            mainQ = SourceQuery + " "
        End If
        Return mainQ + "WHERE ColName = " + DatePartition

    End Function

    Public Sub ProcessPartitions(ByVal sCube As String, ByVal MesGrpName As String, ByVal spartName As String)
        Try
            ProcessPartition(sCube, MesGrpName, spartName)
        Catch ex As Exception
            Throw ex
        End Try

    End Sub

    Public Sub ProcessPartition(ByVal sCube As String, ByVal sMeasureGroup As String, ByVal sPartition As String)

        Dim oCube As Cube = oDB.Cubes.FindByName(sCube)
        If oCube Is Nothing Then
            Return
        End If

        Dim oMeasureGroup As MeasureGroup = oCube.MeasureGroups.FindByName(sMeasureGroup)
        If oMeasureGroup Is Nothing Then
            Return
        End If

        Dim oPartition As Partition = oMeasureGroup.Partitions.FindByName(sPartition)
        If oPartition Is Nothing Then
            Return
        End If

        ' Process all dimensions first, then fully process the partition
        For Each oDim As Dimension In oDB.Dimensions
            oDim.Process(ProcessType.ProcessUpdate)
        Next
        oPartition.Process(ProcessType.ProcessFull)

    End Sub

End Class

—————————————————————————————————–
