Late Arriving Dimensions- Design Approach

This post will describe how to handle Late Arriving Dimensions, for a Data Mart, and how we did it in our project

What we did it was taken via the design approach defined on the GURU webiste for all KIMBALL here is the link i will try to theoretical y define the approach stated and what we did to suit our needs

http://www.kimballgroup.com/html/designtipsPDF/DesignTips2006/KU78LateArrivingDimensionRows.pdf

What is a Late Arriving Dimensions(LAD)

LAD are the dimensions, where in FACT Table records are populated priot to population of the dimension records , this statement in itself is very confusing especially for the people in the OLTP world, but hey in OLAP world this is something we have to deal with will site and example from the real life project and try to convince you

In the INSURANCE industry Site a policy lifecycle, a policy is been issued to a person and a and he has the POLICYNUMBER with him, we have a 15 day period wherein the Policy document will be delivered to the client and his approval sought, the entry for this would be made/ confirmed only when the policy document is delivered/ accepted by the client so in these 15 days if any claim is made on the POLICY NUMBER then it creates a ORPHAN record in terms of Dimension not having the entry for that record.

So the scenario is as below with the table structures

TableDim1: (KeyCol, BusinessKey, Description)

Keycol: Is the Surrogate Key Column on the dimension column, this is a TYPE1 SCD, popualted on a daily basis.

BusinessKey: This is the KeyCol as in Upstream OLTP System.( This is the Key value which has been generated on the OLTP)

TableFact1: (KeyColFact, KeyColDim1, KeyColDim2,FactValue)

KeyColDim1: Is the key column as coming from the Dimension(Primary Key–> Foreign key relation)

KeyColDim2: This is the key column same as above for a different dimension

FactVaue: This is the fact the fact value of the measure.

Now to define the design approach what we did:

The TableDim1, would have a  default row something values like:

0, 0,UNKNOWN.

On the data Load ETL, now when the FACT value comes in, and a lookup happens on the base of BUSINESS KEY with the Dimension table, the value is returned as NULL as the tableDIM1 hasn’t obtained the BusinessKey1 Record in it, we do a ISNULL on the fact row for the column replace it with a default 0 so the fact record looks like:

(0,4,2111.07), Note the first column is kEYDIM1 column which is looked up and we have got the default value there.

So far so good we have handled the FACT row well, and our data load/ cube processing wouldn’t fail as we have the constraints met properly (PrimayKey–> Foreign Key), bit the question is as and when the Dimension value arrives for that specific record how do the OLAP system know that it has arrived? to solve this again we took the help of GURU, and this is what we did,

a) Introduce a new table which capture all the misses for the BUSINESSKEY, and have a row each for them. This table would have the BUSINESSKEY captured, and FACTROW Key column also captured into it(KeyColFact).

b) Once the ETL load has finished traverse through all the fact rows, check for the BUSINESS Keys again doing a Lookup against the Dimension table( as the Dimension table record would have arrived now), obtain the new KeyDim1, and update the fact table on the basis of KeyColFact.

This was the easiest and safest way around, and can be easily enhanced to have a single table for all the lateArrivingdimensions by adding a new column to refer to the column names.

MOLAP vs ROLAP vs HOLAP

Today topic once again something i was confused about, always had an high level idea but never really read and understood the difference correctly until i was enlightened by a friend about a MS white paper on the  performance optimization on the CUBES so here goes the link and for the people who are lazy not wanting to read through whole lots of things in there just the extract of Storage modes difference goes as below:

SQL Server 2005

http://download.microsoft.com/download/8/5/e/85eea4fa-b3bb-4426-97d0-7f7151b2011c/SSAS2005PerfGuide.doc

http://www.sql-server-performance.com/2009/ssas-storage-modes/

 MOLAP:

MOLAP stores the aggregations and the data from the fact tables and dimensions on the Analysis services instance, the data and aggregations are stored in Compressed state Multidimensional Data structure, which is suited for fast retrieval and occupy very less space. precisely  this is the reason despite of all fact and dimension data getting copied still the MOLAP occupies much less space then the corresponding relational data. On a average as a guidance Analysis Services compresses the source fact data and its aggregations to approximately 30 percent of the size of the same data stored in a relational database, actual ration can vary based on number of duplicate keys and bit encoding algorithms.

In MOLAP viz a viz ROLAP and HOLAP where Analysis Services relies on on the relational engine to perform potentially large table joins against fact data stored in the relational database to resolve some or all queries, MOLAP relies on the Multidimensional data alone.

Pros of MOLAP:

Faster data access, good user experience.

All the calculations are pre-generated when the cube is processed and stored locally on the OLAP server hence even the complex calculations, as a part the query result, will be pefromed quickly.

MOLAP does not need to have a permanent connection to the underlying relational database (only at the time of processing) as it stores the detail and aggregate data in the OLAP server so the data can be viewed even when there is connection to the relational database.

Cons of MOLAP

With MOLAP  mode, we need frequent processing to pull refreshed data after last processing resulting in drain on system resources.

If the data volume is high, the cube processing can take longer, though you can use incremental processing to overcome this.

ROLAP:

This mode requires the least storage, none of the data and aggregations are stored on the Analysis Server, everything resides on the Relational DB side, this can be mostly useful in scenarios where the constraints are not applied on the Relational DB side, then we need to ensure that the UNKNOWN ember is enabled both on dimension and facts.

 Analysis Services rely on the relational engine to resolve all queries against the relational tables, storing both fact data and aggregations. The sheer number of queries with large table joins in large or complex cubes frequently overwhelms the relational engine.

Pros

The data available is such cube is near Real time

Since ROLAP keeps the data on the Relational DB,  it has less storage requirements. This is very advantageous for large datasets which are queried infrequently such as historical data.

Cons

Compared to MOLAP or HOLAP the response is generally slower, as all things lies on the Relational DB

HOLAP:

HOLAP is the combination of good things of HOLAP and ROLAP, it stores the aggregations on the Analysis Server and data(fact and dimensional) remains on the Relational DB

As a result, whenever Analysis Services needs to resolve a query against fact data stored in a HOLAP partition, it has to query the relational database directly rather than querying a multidimensional structure as in case of MOLAP

 

As far as processing goes there is no significant difference in processing performance between MOLAP partitions and HOLAP partition, in all fact data is read from the relational database, and aggregations are calculated. With HOLAP, Analysis Services does not store fact data. This difference has minimal impact on processing performance, but can have a significant impact on query performance, as HOLAP and MOLAP processing speeds are nearly same and MOLAP query performance is superior, MOLAP is the optimum storage choice.

Pros:

Stores the detail and aggregate data in the OLAP server in a compressed multidimensional format; as a result the cube browsing is fastest in this mode.

All the calculations are pre-generated when the cube is processed and stored locally on the OLAP server hence even the complex calculations, as a part the query result, will be pefromed quickly.

MOLAP uses compression to store the data on the OLAP server and so has less storage requirements than relational databases for same amount of data.

CONS:

 With MOLAP  mode, you need frequent processing to pull refreshed data after last processing resulting in drain on system resources.

Just after the processing if there is any changes in the relational database it will not be reflected on the OLAP server unless   re-processing is performed.

 

Hello world!

Welcome to WordPress.com! This is your very first post. Click the Edit link to modify or delete it, or start a new post. If you like, use this post to tell readers why you started this blog and what you plan to do with it.

Happy blogging!