Tuesday, April 30, 2013

SSIS: Data Profiling task

Data Profiling task

Data profiling task, as names suggests Profile of Data. Here using this task we can get information on data in any table.

If this task is performed for say Employee table, then profile of data may contain

  •          The number of rows in the table.
  •          The number of distinct values in the State column.
  •          The number of null or missing values in the Zip column.
  •          The distribution of values in the City column.
  •          The strength of the functional dependency of the State column

As quality of data is very important when one tries to do analysis on data for business purpose. Then a valid data will always improve the business decision making.

By using these information, quality issues can be minimized that might occur from using the source data.

We can use the output of this task in various scenarios for making decisions on the data; some of them are as below:

  •        Checking data quality before an incremental load. Use the Data Profiling task to compute the Column Null Ratio Profile of new data intended for the Customer Name column in a Customers table. If the percentage of null values is greater than 20%, send an e-mail message that contains the profile output to the operator and end the package. Otherwise, continue the incremental load.

  •          Automating cleanup when the specified conditions are met. Use the Data Profiling task to compute the Value Inclusion Profile of the State column against a lookup table of states, and of the ZIP Code/Postal Code column against a lookup table of zip codes. If the inclusion strength of the state values is less than 80%, but the inclusion strength of the ZIP Code/Postal Code values is greater than 99%, this indicates two things. First, the state data is bad. Second, the ZIP Code/Postal Code data is good. Launch a Data Flow task that cleans up the state data by performing a lookup of the correct state value from the current Zip Code/Postal Code value.

Below are the steps to be followed for Data Profiling task
Step 1: Drag and drop a Data Profiling Task from Control flow item.

Step 2: Double click on the task and click on destination, and create a new file connection, or use the existing one

Step 3: Click on ‘Quick Profile’

Create a new ADO.Net Connection or, use the existing one to one database, and select the table or view, for which data profiling will be done.
and  select all the Compute

Step 4: Execute the Task
Step 5:
Then go to AllPrograms->SqlServer 2008->IntegrationServices ->Data Profile Viewer

And open Data Profile Viewer, and open the text file, which you have used to store the data profile of table.

It looks like below

Now you can go through various properties of the table’s columns, data etc

Please provide your feedback for the post, if you find this post useful. Also Post your query or scenario, i will be happy to help.

No comments:

Post a Comment