SSIS Performance Enhancement - Removing the Sort Transform: Additional Steps

Writing again after a long time. Today, while working on an SSIS package, I came across a situation where I have a Type 2 dimension and I was getting duplicate rows in the table. The only difference was that the business key column was behaving as case sensitive, i.e. if any row on the source changed only in case, the code treated that business key as a new row and inserted it, whereas I needed it to ignore case.

Here is why it was happening, and the resolution.

Cause: I was following the best practice for the Merge Join in the package, wherein I removed the Sort transform and instead ordered the dataset in my source transform (ORDER BY in the source query), marking the IsSorted property and setting the sort keys on the source output. For details on what I mean, read this:

https://technet.microsoft.com/en-us/library/ms141031%28v=sql.105%29.aspx

My data flow (screenshot)

My SQL Server is case insensitive, and SSIS by default uses Windows sorting and comparison, so I was getting the case-changed row as a new row when the key column changed from ABC to abc.
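
You can see the mismatch directly in T-SQL: under a typical case-insensitive collation the two keys compare as equal, while a case-sensitive comparison (which is effectively what was happening on the merge key here until the comparison flag was changed) treats them as different. A small illustration; the collation names are only examples, use whatever your server or column uses:

-- Compare the same key under a case-insensitive and a case-sensitive collation.
DECLARE @a VARCHAR(10) = 'ABC',
        @b VARCHAR(10) = 'abc';

-- Typical SQL Server default (case-insensitive): reports 'equal'
SELECT CASE WHEN @a = @b COLLATE SQL_Latin1_General_CP1_CI_AS
            THEN 'equal' ELSE 'different' END AS case_insensitive_compare;

-- Case-sensitive comparison: reports 'different'
SELECT CASE WHEN @a = @b COLLATE SQL_Latin1_General_CP1_CS_AS
            THEN 'equal' ELSE 'different' END AS case_sensitive_compare;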

To resolve this, set the ComparisonFlags property of the key column to Ignore case, which makes the comparison ignore case for that column. As per my DFT above, this has to be done on both source transforms, so that the setting is identical on both of them.

ComparisonFlags property (screenshot)

The link helped me resolve it:

http://microsoft-ssis.blogspot.com/2011/03/sorting-in-sql-vs-sorting-in-ssis.html

PowerPivot - Dynamic Filtering Using Slicers after Convert to Formulas

NOTE: This post only applies to PowerPivot 2013.

I am not going to explain why we need the Convert to Formulas functionality in Power Pivot and what advantages it provides; that is a topic for a separate discussion in itself. In this post I want to explain how to deal with the limitations posed once we have converted to formulas.

A classic example: consider an active pivot report where the data is driven by a slicer, with things like Product Category (slicer) and Product and Unit Sale Price (data) on the report. If the end users want to play with the report and see only those products which belong to a specific category, nothing special needs to be done; it works out of the box, as screen prints 1 and 2 show. But what happens when we convert the report using the OLAP tools? We seem to lose that functionality, because the data is now referenced key by key, so all the product names would have to be hard-coded if we want to see their data. In this case, how do we replicate the same filtered view which we had working in the pivot report?

Screen print 1: pivot, unfiltered items

Screen print 2: pivot table, items filtered via the slicer

Screen print 3: (screenshot)

So in essence the problem becomes:

Problem statement: once a report is converted to formulas, the slicers suddenly stop working. How do we replicate the dynamic behaviour of the report, with changes in the slicers changing the data?

This is what the report looks like once you convert to formulas and then apply the slicers. Clearly we can still see all products despite applying a filter via the slicer for product subcategory. In the screen print I have selected Bike Racks and should see only Bike Racks products, but that is not happening.

 

Screen print 4: report after Convert to Formulas, slicer applied but not filtering

Now, to make the slicers work again and see only the relevant data, there are two ways:

a) Write an adaptive, fairly complex DAX formula to hide the products. I am pretty sure it could be written, though I didn't try it because, as you know, we all have to save time 🙂 and this would have required some research. I will post it once I have some time to spare.

b) Follow the method I am going to describe below. This method is essentially a cheat and reads from hidden columns 🙂

1. Add a new active pivot, also connected to the same slicer, with the columns you need (we are going to hide this pivot).

Screen print 5: the hidden helper pivot, connected to the same slicer

2. Replace the converted [Product].[Name] formula (or add a new column) with a formula like the one below:

=IFNA(CUBEMEMBER("ThisdataModel","[Product].[Name]."&"["&L7&"]"),"")

Copy and extend the same formula to all rows. To explain the formula: IFNA checks for errors and replaces them with blanks, and the cell reference (L7 above, pointing at the hidden active pivot) is the cheat, since we are reading the data from the active pivot, which is always filtered; see screen print 5.

 

In my workbook, column M is where the hidden active pivot sits. Also ensure that the newly inserted pivot has the properties below unchecked, so that the slicer can filter its data properly.

Screen print 6: PivotTable options to uncheck on the hidden pivot

 

 

Once you follow all the steps you are ready and rocking in very little time. Let me know your comments on the same.

Consume a Web Service via the SSIS Script Component

I have always believed the SSIS Script Component is a brilliant component: it provides access to data from practically any type of source. I used it as a Source component very recently to consume data from a web service, and it required a bit of a tweak to overcome a limitation of the Script Component, namely:

The Script Component doesn't support reading data from a config file; you can have a look at the problem I had published here:

http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/75111c36-3d92-4440-a7b3-5cfe8a648403

The scenario I will explain below consumes a WCF web service targeted at .NET Framework 3.5, using BasicHttpBinding.

We proceed step by step as follows:

1. Create a package and drop a Data Flow Task (DFT) onto it.

2. Inside the DFT, place a Script Component and select Source as the component type.


3. Now open the Script Component to edit the script and look for Add Service Reference in the References section. If you don't see it, change the project to target Framework 3.5; once done, you will be able to see Add Service Reference.


4. Adding the service reference adds an App.config file to the project, which holds the binding and contract settings for the web service. Normally you would call the settings stored in this file from code to access the web service.


THIS IS THE PROBLEM: WE CAN'T CALL THE WEB SERVICE THE NORMAL WAY, AS THE SSIS SCRIPT COMPONENT DOESN'T RECOGNISE THE APP.CONFIG FILE.

The resolution is to define the bindings at runtime, using code like the snippet below, wherein you define the bindings via code rather than depending on the App.config file.

You would need code like the below for BasicHttpBinding:

BasicHttpBinding binding = new BasicHttpBinding(BasicHttpSecurityMode.TransportWithMessageCredential);
binding.Security.Message.ClientCredentialType = BasicHttpMessageCredentialType.UserName;
binding.MaxBufferSize = 2147483647;
binding.MaxBufferPoolSize = 2147483647;
binding.MaxReceivedMessageSize = 2147483647;

// Replace <service Link> with the actual service address.
EndpointAddress remoteAddress = new EndpointAddress("<service Link>");

// ReportPortClient is the proxy class generated by Add Service Reference.
ReportPortClient client = new ReportPortClient(binding, remoteAddress);
client.ClientCredentials.UserName.UserName = "USER NAME";
client.ClientCredentials.UserName.Password = "Password";

BindingElementCollection elements = client.Endpoint.Binding.CreateBindingElements();
elements.Find<SecurityBindingElement>().IncludeTimestamp = false;
client.Endpoint.Binding = new CustomBinding(elements);

// This portion is specific to my call of the web service.
Execute_ReportType request = new Execute_ReportType();
var response = client.Execute_Report(request);

For NetTcpBinding and defining the bindings from code, you could have a look at:

http://www.bibits.co/post/2012/08/15/Consuming-Web-Services-in-SSIS-Script-Component.aspx

Reach out to me at my email if you need clarification or require assistance in consuming a web service via the Script Component.

 

PS: For the SSIS Script Component source to work properly, the account executing the SSIS package must have read/write permissions on the TEMP directory (generally C:\Windows\Temp) of the box where the package executes.

 

Late Arriving Dimensions- Design Approach

This post describes how to handle late arriving dimensions in a data mart, and how we did it in our project.

What we did was based on the design approach defined on the Kimball Group website, the guru of all things dimensional. Here is the link; I will try to describe the approach stated there and what we did to suit our needs.

http://www.kimballgroup.com/html/designtipsPDF/DesignTips2006/KU78LateArrivingDimensionRows.pdf

What is a Late Arriving Dimension (LAD)?

LADs are dimensions where the fact table records are populated prior to the population of the corresponding dimension records. This statement in itself is very confusing, especially for people from the OLTP world, but in the OLAP world this is something we have to deal with. I will cite an example from a real-life project and try to convince you.

In the insurance industry, consider a policy lifecycle: a policy is issued to a person and he has the POLICYNUMBER with him. There is a 15-day period within which the policy document is delivered to the client and his approval sought; the entry for the policy is made/confirmed only when the policy document is delivered and accepted by the client. So if, within those 15 days, any claim is made against that POLICYNUMBER, it creates an orphan record, in the sense that the dimension does not yet have an entry for that key.

So the scenario, with the table structures, is as below:

TableDim1: (KeyCol, BusinessKey, Description)

KeyCol: the surrogate key column of the dimension; this is a Type 1 SCD, populated on a daily basis.

BusinessKey: the key column as it exists in the upstream OLTP system (i.e. the key value generated on the OLTP side).

TableFact1: (KeyColFact, KeyColDim1, KeyColDim2,FactValue)

KeyColDim1: the key column coming from the dimension (primary key --> foreign key relation).

KeyColDim2: the same kind of key column, for a different dimension.

FactValue: the fact value of the measure.

Now, to define the design approach, here is what we did:

TableDim1 has a default row, with values like:

0, 0, UNKNOWN

In the data load ETL, when a fact value comes in and a lookup happens against the dimension table on the basis of the business key, the lookup returns NULL because TableDim1 has not yet received that BusinessKey record. We do an ISNULL on that column of the fact row and replace it with the default 0, so the fact record looks like:

(0, 4, 2111.07). Note that the first column is the KeyColDim1 column, which was looked up and received the default value.
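
In set-based T-SQL, that lookup-with-default step looks roughly like the sketch below. The table and column names are the ones defined above; StagingFact is a hypothetical landing table holding the incoming fact rows (with KeyColDim2 assumed to be already resolved):

-- Load the fact table, defaulting the dimension key to 0 (the UNKNOWN row)
-- whenever the business key has not arrived in TableDim1 yet.
INSERT INTO TableFact1 (KeyColDim1, KeyColDim2, FactValue)
SELECT ISNULL(d.KeyCol, 0) AS KeyColDim1,   -- 0 = surrogate key of the UNKNOWN row
       s.KeyColDim2,
       s.FactValue
FROM StagingFact AS s
LEFT JOIN TableDim1 AS d
       ON d.BusinessKey = s.BusinessKey;    -- lookup on the business key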

So far so good: we have handled the fact row well, and our data load and cube processing won't fail since the constraints are met properly (primary key --> foreign key). But the question is: when the dimension value eventually arrives for that specific record, how does the OLAP system know that it has arrived? To solve this we again took the guru's help, and this is what we did:

a) Introduce a new table which captures all the misses for the business key, with one row per miss. This table captures the BUSINESSKEY and the fact row's key column (KeyColFact).

b) Once the ETL load has finished, traverse the captured fact rows and check their business keys again with a lookup against the dimension table (since the dimension record should have arrived by now), obtain the new KeyDim1, and update the fact table on the basis of KeyColFact, as sketched below.
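
A minimal sketch of step (b), assuming the miss table is called LateArrivingDim1 (BusinessKey, KeyColFact) and is filled during the load, for example from the no-match output of the lookup:

-- Re-resolve the business keys that were missing at load time and repair the
-- fact rows that are still pointing at the default UNKNOWN member (key 0).
UPDATE f
SET    f.KeyColDim1 = d.KeyCol
FROM   TableFact1 AS f
JOIN   LateArrivingDim1 AS m ON m.KeyColFact  = f.KeyColFact
JOIN   TableDim1        AS d ON d.BusinessKey = m.BusinessKey
WHERE  f.KeyColDim1 = 0;

-- Clear out the entries that have now been resolved.
DELETE m
FROM   LateArrivingDim1 AS m
JOIN   TableDim1        AS d ON d.BusinessKey = m.BusinessKey;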

This was the easiest and safest way around, and it can easily be enhanced to use a single table for all the late arriving dimensions by adding a column that records the dimension/column name.

MOLAP vs ROLAP vs HOLAP

Today's topic is once again something I was confused about: I always had a high-level idea but never really read and understood the differences correctly, until a friend pointed me to an MS white paper on performance optimization for cubes. Here are the links, and for people too lazy to read through the whole thing, an extract of the differences between the storage modes follows below:

SQL Server 2005

http://download.microsoft.com/download/8/5/e/85eea4fa-b3bb-4426-97d0-7f7151b2011c/SSAS2005PerfGuide.doc

http://www.sql-server-performance.com/2009/ssas-storage-modes/

 MOLAP:

MOLAP stores the aggregations and the data from the fact tables and dimensions on the Analysis Services instance. The data and aggregations are stored in a compressed multidimensional data structure, which is suited for fast retrieval and occupies very little space; precisely for this reason, despite all fact and dimension data being copied, MOLAP still occupies much less space than the corresponding relational data. On average, as guidance, Analysis Services compresses the source fact data and its aggregations to approximately 30 percent of the size of the same data stored in a relational database; the actual ratio can vary based on the number of duplicate keys and the bit-encoding algorithms.

Unlike ROLAP and HOLAP, where Analysis Services relies on the relational engine to perform potentially large table joins against fact data stored in the relational database to resolve some or all queries, MOLAP relies on the multidimensional data alone.

Pros of MOLAP:

Faster data access, good user experience.

All the calculations are pre-generated when the cube is processed and stored locally on the OLAP server, hence even complex calculations, as part of a query result, are performed quickly.

MOLAP does not need a permanent connection to the underlying relational database (only at processing time), since it stores the detail and aggregate data on the OLAP server, so the data can be viewed even when there is no connection to the relational database.

Cons of MOLAP

With MOLAP mode, we need frequent processing to pull refreshed data after the last processing, resulting in a drain on system resources.

If the data volume is high, the cube processing can take longer, though you can use incremental processing to overcome this.

ROLAP:

This mode requires the least storage: none of the data or aggregations is stored on the Analysis Server; everything resides on the relational DB side. This is mostly useful in scenarios where constraints are not enforced on the relational DB side; then we need to ensure that the UNKNOWN member is enabled on both dimensions and facts.

Analysis Services relies on the relational engine to resolve all queries against the relational tables, which store both the fact data and the aggregations. The sheer number of queries with large table joins in large or complex cubes frequently overwhelms the relational engine.

Pros

The data available in such a cube is near real time.

Since ROLAP keeps the data in the relational DB, it has lower storage requirements. This is very advantageous for large datasets which are queried infrequently, such as historical data.

Cons

Compared to MOLAP or HOLAP the response is generally slower, as everything lives in the relational DB.

HOLAP:

HOLAP is a combination of the good parts of MOLAP and ROLAP: it stores the aggregations on the Analysis Server while the data (fact and dimensional) remains in the relational DB.

As a result, whenever Analysis Services needs to resolve a query against fact data stored in a HOLAP partition, it has to query the relational database directly rather than querying a multidimensional structure, as it would with MOLAP.

 

As far as processing goes, there is no significant difference between MOLAP partitions and HOLAP partitions: in both cases fact data is read from the relational database and aggregations are calculated; with HOLAP, Analysis Services simply does not store the fact data. This difference has minimal impact on processing performance, but it can have a significant impact on query performance. Since HOLAP and MOLAP processing speeds are nearly the same and MOLAP query performance is superior, MOLAP is usually the optimum storage choice.

Pros:

Aggregations are stored on the OLAP server in a compressed multidimensional format, so queries that can be answered from the pre-calculated aggregations perform well.

Because only the aggregations (and not the detail fact data) are stored on the OLAP server, HOLAP has lower storage requirements than MOLAP for the same cube.

CONS:

As with MOLAP, you need frequent processing to refresh the aggregations after the last processing, resulting in a drain on system resources.

Queries that have to reach down to fact-level detail are resolved against the relational database, so they are slower than the same queries against a MOLAP partition; and changes in the relational database are not reflected in the aggregations until re-processing is performed.

 

CDC Feature in Microsoft SQL Server

Writing my first blog ever; I couldn't think of anything better than the technical stuff I do with my employer. Today I will pin down my thoughts on the design approach I took for one of our projects, where we wanted to capture delta loads using the (not so new now) CDC feature in SQL Server 2008 R2. Let me first introduce some concepts.

What is CDC: CDC, aka Change Data Capture, is a feature introduced with SQL Server 2008. CDC is available only on the Enterprise, Developer and Evaluation editions of SQL Server 2008.

What does CDC do: CDC records the insert, update and delete activity applied to a SQL Server table, making the changes available in easily consumable relational tables. Any change made to a tracked table is captured, and the modified rows are stored in change tables that mirror the column structure of the tracked source tables. CDC then provides table-valued functions for systematic access to the change data by consumers.

Explaining CDC: to relate it to things we already know, remember the good old concepts of triggers and of writing log files; conceptually, MS brought the two together to give us what is called CDC. (Warning: this is just for easier understanding; SQL Server does not literally do all these steps.)

As we know, any change made to the database is written into the forward-only .ldf file by SQL Server. Think of it as a page in a book where someone keeps writing and can't go back: whatever he wants to record he keeps appending, and to keep track he places page or row numbers in between. For SQL Server, that marker is the LSN (Log Sequence Number).

Once CDC is enabled, the changed rows are captured (conceptually, much like the inserted/deleted magic tables of a trigger, though in reality a capture job reads them from the log) along with their LSNs from the .ldf file, and inserted as data into new change tables. At a high level, this is how CDC works.

ENOUGH OF THEORY, SOME CODE NOW. First, the code to enable CDC:

USE <DBNAME>
GO

--Create a filegroup to store the CDC tables
ALTER DATABASE <DBNAME> ADD FILEGROUP fgCDC

--Associate an .ndf file with the filegroup
ALTER DATABASE <DBNAME>
ADD FILE (NAME = 'CDC.ndf',
FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\NDF\CDC.ndf',
SIZE = 1024,
MAXSIZE = UNLIMITED,
FILEGROWTH = 10244)
TO FILEGROUP fgCDC

DECLARE @Error VARCHAR(MAX)
BEGIN TRY
BEGIN TRAN

--Check whether CDC is enabled on the DB
SELECT name, database_id, is_cdc_enabled
FROM sys.databases

-- Enable CDC at the database level
EXEC sys.sp_cdc_enable_db
--EXEC sys.sp_cdc_disable_db

--Check again that the DB is now CDC enabled
SELECT name, database_id, is_cdc_enabled
FROM sys.databases

-- Enable a table, specifying the filegroup option
-- ===================================================
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'Test',
@role_name = NULL,
@filegroup_name = N'fgCDC',
@supports_net_changes = 1

END TRY

BEGIN CATCH
SELECT @Error = ERROR_MESSAGE() + ' at line ' + CAST(ERROR_LINE() AS VARCHAR(10))
RAISERROR(@Error, 16, 1)
ROLLBACK TRAN
RETURN
END CATCH

COMMIT TRAN

Steps Involved

a) Create a filegroup to store the CDC-enabled tables; this filegroup holds the secondary files defined to store the CDC change tables.

b) Associate a secondary file (.ndf) with the filegroup; note that the .ndf should ideally reside in a separate directory from the primary .mdf file.

c) Enable CDC at the database level: this is done using the stored procedure sys.sp_cdc_enable_db.

d) Enable CDC at the table level: source tables are marked as tracked tables using the stored procedure sys.sp_cdc_enable_table. When a table is enabled for change data capture, an associated capture instance is created to support the dissemination of the change data from the source table. The capture instance consists of a change table and up to two query functions.

All objects associated with a capture instance are created in the change data capture (cdc) schema of the enabled database. By default, the capture instance name is <schema name>_<table name> of the source table. Its associated change table is named by appending _CT to the capture instance name. The function used to query for all changes is named by prepending fn_cdc_get_all_changes_ to the capture instance name. If the capture instance is configured to support net changes, the net changes query function is also created, named by prepending fn_cdc_get_net_changes_ to the capture instance name. To see what was created, you can run the query below.
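
sys.sp_cdc_help_change_data_capture lists the capture instances along with their change tables and query functions; for the dbo.Test table enabled above, a quick check might look like this:

-- List the capture instance, change table and query functions for dbo.Test
EXEC sys.sp_cdc_help_change_data_capture
     @source_schema = N'dbo',
     @source_name   = N'Test';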

CDC has to be enabled per database and then per table on the server. Once enabled, SQL Server creates several tables and functions which make CDC work; the most important of them, assuming we enabled CDC for a table TABLEA, are:

a) Tables:

cdc.lsn_time_mapping: this table is database specific and captures the start and end times of transactions along with the corresponding LSNs (Log Sequence Numbers).

cdc.dbo_TABLEA_CT: this is the change table, with the same structure as the table you want to capture plus some additional metadata columns such as __$start_lsn, __$end_lsn, __$update_mask (a binary mask that helps identify which columns changed) and __$operation (the type of operation the row corresponds to: 1 = delete, 2 = insert, 3 = update before-image, 4 = update after-image).
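
With the dbo.Test table enabled earlier (capture instance dbo_Test), reading the changes back out is just a call to the generated table-valued function; a minimal sketch:

-- Read every change recorded so far for the dbo_Test capture instance.
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Test'),  -- earliest LSN kept for this instance
        @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();            -- current high-water mark

SELECT *   -- __$start_lsn, __$operation, __$update_mask plus the mirrored source columns
FROM   cdc.fn_cdc_get_all_changes_dbo_Test(@from_lsn, @to_lsn, N'all');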

SSIS Delta Load – CDC Approach SQL Server 2008 R2

I spoke about CDC in my last post; here I will talk about how to implement CDC in conjunction with SSIS. This approach might not be the best one, and you would need to evaluate it properly before deciding to implement it.

MS describes two ways to implement this:

Specified Interval Sample: http://msftisprodsamples.codeplex.com/wikipage?title=SS2008!Change%20Data%20Capture%20for%20Specified%20Interval%20Package%20Sample&ProjectName=msftisprodsamples

Change Data Capture since Last Request Package Sample

http://msftisprodsamples.codeplex.com/wikipage?title=SS2008%21Change%20Data%20Capture%20since%20Last%20Request%20Package%20Sample

I have used the Change Data Capture since Last Request package approach and customised it; I will try to define and simplify it here.

At a high level, the package looks like this:

The basic approach is:

  1. There are 4 containers, at a high level defined as below:
    1. Initialization Container: this step initializes all the connection managers in the package and logs the start of the package to SQL Server via the OnPreExecute event handler.
    2. LSN Assign Container: this step calculates whether there are any data changes to be processed for the present run.
    3. Child Package Container: this step calls all the child packages required for loading the data.
    4. End Package Container: this logs the end of the package and sends an email; this will be replaced with the Send Mail task provided in SSIS, and based on requirements the email will be sent to an admin (need to figure out whom the emails should be sent to).

Points 1 and 4 above are specific to your own implementation; I will shed some light on points 2 and 3 as part of this post.

The basic approach requires us to introduce a new table called LOAD_LOG in the destination DB. This table captures the Start LSN, End LSN, Start Time, End Time and Load Status, where the status takes the values S = success, F = failure, N = no records found. A sketch of such a table is below.
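
A minimal sketch of such a table (the exact names and types are illustrative):

-- Control table that remembers the LSN window each incremental run processed.
CREATE TABLE dbo.LOAD_LOG
(
    LoadLogId  INT IDENTITY(1,1) PRIMARY KEY,
    StartLSN   BINARY(10) NOT NULL,        -- first LSN included in the run
    EndLSN     BINARY(10) NOT NULL,        -- last LSN included in the run
    StartTime  DATETIME   NOT NULL DEFAULT GETDATE(),
    EndTime    DATETIME   NULL,
    LoadStatus CHAR(1)    NOT NULL         -- S = success, F = failure, N = no records found
        CONSTRAINT CK_LOAD_LOG_Status CHECK (LoadStatus IN ('S', 'F', 'N'))
);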

The approach above is an incremental data load approach: it requires the initial (full) data load to be completed in advance, and the LOAD_LOG table to be populated manually once with appropriate LSN values, which can be sourced from the cdc.lsn_time_mapping table in your DB, for example as below.
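
The one-time seed might look like this (the capture instance name dbo_TABLEA is just an example):

-- Record the current CDC high-water mark as the "last successfully loaded" LSN,
-- so that the first incremental run starts from this point onwards.
INSERT INTO dbo.LOAD_LOG (StartLSN, EndLSN, StartTime, EndTime, LoadStatus)
SELECT sys.fn_cdc_get_min_lsn('dbo_TABLEA'),   -- earliest LSN available for the capture instance
       sys.fn_cdc_get_max_lsn(),               -- latest LSN at the time of the initial load
       GETDATE(), GETDATE(), 'S';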

Obtain the Start LSN from the LOAD_LOG table; this is the EndLSN column of the latest row in the table that has load status = S. A sample query may look like the one sketched below.

The query simply gets the last LSN which was successfully loaded, which then acts as the starting point for the next load.
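
Against the LOAD_LOG table sketched above, that query could simply be:

-- EndLSN of the most recent successful run = starting point for the next load.
SELECT TOP (1) EndLSN
FROM   dbo.LOAD_LOG
WHERE  LoadStatus = 'S'
ORDER BY EndTime DESC;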

The next load's LSN window (both start and end) can be obtained using the following functions:

StartLSN: sys.fn_cdc_increment_lsn: this function increments the last successfully loaded LSN to the next LSN.

EndLSN: sys.fn_cdc_get_max_lsn(): this gives the end LSN up to which the load needs to run.

Keep in mind that the LSN values are binary; to pass them around in the package we need to convert them to strings and back again, using functions like sys.fn_varbintohexstr and sys.fn_cdc_hexstrtoBin (CONVERT with style 1 is another option, used in the sketch below).

An example of how to find the new start and end points, once we know the last successfully loaded LSN, is sketched in the T-SQL after the next paragraph.

The step above sets the variable DataReady, which is used to decide whether data is ready to be loaded. Compare the LSNs obtained with those stored in the cdc.lsn_time_mapping table: if any LSN exists that is >= StartLSN and <= EndLSN, data is ready to be loaded and we can go forward with the load; otherwise there is no data to load.
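
Putting the above together, a rough sketch (against the LOAD_LOG table sketched earlier) of how the new window, its string form and the DataReady flag could be derived:

DECLARE @StartLSN  BINARY(10),
        @EndLSN    BINARY(10) = sys.fn_cdc_get_max_lsn(),  -- end of the new window
        @DataReady BIT        = 0;

-- Start of the new window = the LSN immediately after the last one we loaded.
SELECT TOP (1) @StartLSN = sys.fn_cdc_increment_lsn(EndLSN)
FROM   dbo.LOAD_LOG
WHERE  LoadStatus = 'S'
ORDER BY EndTime DESC;

-- Any committed transaction inside the window means there is data to load.
IF EXISTS (SELECT 1
           FROM   cdc.lsn_time_mapping
           WHERE  start_lsn BETWEEN @StartLSN AND @EndLSN)
    SET @DataReady = 1;

-- SSIS variables are easier to handle as strings, so hand the LSNs back as hex
-- strings (CONVERT style 1); convert back with CONVERT(BINARY(10), <string>, 1).
SELECT @DataReady AS DataReady,
       CONVERT(VARCHAR(24), @StartLSN, 1) AS StartLSNString,
       CONVERT(VARCHAR(24), @EndLSN,   1) AS EndLSNString;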

If the data is ready, the next step inserts a record into the LOAD_LOG table with status F, which is updated to S once all the data loads complete, if everything runs fine.

Once that record is inserted, the next container is used to call all the child packages, which contain the code for all the tables required during the load.

Hope this article makes sense. In the next article I will deal with the drawbacks of the approach I have taken, and the precautions which need to be taken for this approach to work correctly.

I would like your feedback and questions on the above post; do write in about what is good and what is bad, with pointers for improvements in future posts.