MSBI LIVE: SSIS (SQL Server Integration Services) Best Practices

1. Introduction
SSIS – SQL Server Integration Services is an Extract – Transform – Load (ETL) tool of enterprise class, which is available as part of SQL server 2005/2008/2008R2/2012.SSIS enables database administrators and application developers to design, implement, and manage complex, high-performance ETL applications. Using SSIS, we can select data from one or more sources and standardize, join, merge, cleanse, augment, derive, calculate, and perform just about any other function and operation required for our data integration applications.One common use for SSIS is to move data from one data source to another. The reasons for moving data are too numerous to count. Some common business reasons for using SSIS include migrating business data from one application to another, extracting data for distribution to external entities, integrating data from external entities, creating sample test data sources for development environments, and extracting and loading data into business intelligence (BI) application systems.Listed below are some SQL Server Integration Services (SSIS) best practices

2. Use of Package configuration file

Never hard code database names or changeable values like mail server or path in the SSIS packages, always use an xml Package Configuration File.The connection managers also need to be made configurable using expressions. Hence if there is any change in query or server name or output file name, we can quickly make the changes in config file and execute the package without opening and editing the package. For complex packages we can even split the configuration files as this is more modular and easier. Even a SQL Server database and table are created to contain the actual package configurations.

3. Calling a child Package from a Parent package:

If we are using a part of code multiple times in the same package or in multiple packages, we can separate the reused code and make it a child package which is being called from the Parent package. When a child package is executed from a master package, the parameters that are passed from the master Package needs to be configured in the child package. For this, we can use the “Parent Package Configuration” option in thechild package. But, for using the “Parent Package Configuration”, we need to specify the name of the “Parent Package Variable” that is passed to the child package. If we want to call the same child package multiple times(each time with a different parameter value), declare the parent package variables (with the same name as givenin the child package) with a scope limited to “Execute Package Tasks”.

4. Protection.

To avoid most of the package deployment error from one system to another system, set the package protectionlevel to “DontSaveSenstive”.

5. Sequence Containers

Make use of sequence containers to group logical related tasks into a single group for better visibility and understanding.

6. Delay Validation :

Consider a scenario in which first component of the package creates an object i.e. a temporary table, which is being referenced by the second component of the package. During package validation, the first component has not yet executed, so no object has been created causing a package validation failure when validating the second component. SSIS will throw a validation exception and will not start the package execution. To overcome this situation every component has a DelayValidation (default=FALSE) property. If we set it toTRUE, early validation will be skipped and the component will be validated only at the component level (latevalidation) which is during package execution.

7. Logging:

Logs help to gather run-time information about a package and troubleshoot a package every time it is run. For this the Package Level Logging Mode needs to be set as Enabled. We can enable or Disable the Logging for each task separately or can use UseParentSetting option.

We can use any type of log providers given in the provider type list. Also make sure to tick the type of log provider added, otherwise nothing will be written into the log file.

Make log file names dynamic so that you get a new log file for each execution.

8. Package Name Limitations:

When an SSIS package with a package name exceeding 100 chars is deployed into SQL Server, it trims thepackage name to 100 chars, which may cause an execution failure. So, limit the package names to a maximum of100 characters.

9. Annotations:

Please provide annotations in the package wherever necessary for better understanding of the logic.

10. Naming Conventions in SSIS:

When creating connections, do not use the default name. Say for Package Log use "FF_DEST_Conn_SSISPkgLog", for SQL connection use "OLEDB_Connection", and for SMTP use "SMTP_Server_Connection”.

The package variable names used should be meaningful names which describe their contents and use.

The name of the package as well as the tasks used should reflect its function.

11. OLEDB Destination Settings:
There are a couple of settings with OLEDB destination which can impact the performance of data transfer which are listed below.
Data Access Mode – This setting provides the 'fast load' option which internally uses a BULK INSERT statement for uploading data into the destination table instead of a simple INSERT statement (for each single row) as in the case for other options. So unless you have a reason for changing it, don't change this default value of fast load. If you select the 'fast load' option, there are also a couple of other settings which you can use as discussed below.
Keep Identity – By default this setting is unchecked which means the destination table (if it has an identity column) will create identity values on its own. If you check this setting, the dataflow engine will ensure that the source identity values are preserved and same value is inserted into the destination table.

Keep Nulls – Again by default this setting is unchecked which means default value will be inserted (if the default constraint is defined on the target column) during insert into the destination table if NULL value is coming from the source for that particular column. If you check this option then default constraint on the destination table's column will be ignored and preserved NULL of the source column will be inserted into the destination.
Table Lock – By default this setting is checked and the recommendation is to let it be checked unless the same table is being used by some other process at same time. It specifies a table lock will be acquired on the destination table instead of acquiring multiple row level locks, which could turn into lock escalation problems.
Check Constraints – Again by default this setting is checked and recommendation is to un-check it if you are sure that the incoming data is not going to violate constraints of the destination table. This setting specifies that the dataflow pipeline engine will validate the incoming data against the constraints of target table. If you uncheck this option it will improve the performance of the data load.

12. Environment issue for Excel connection:

The Excel Source or Excel Connection manager works only with the 32 bit runtime. When a package using the Excel Source is enabled for 64-bit runtime (by default, it is enabled), it will fail on the production server using the 64-bit runtime. Go to the solution property pages\debugging and set Run64BitRuntime to False. If we are using Batch file for execution, the dtexec command will by default point to the 64- bit driver. In order to avoid package failure, specify the 32- bit dtexec say “E:\Program Files (x86)\Microsoft SQL Server\100\DTS\Binn\DTExec".

13. Retaining Connection in SSIS:

The control flow of an SSIS package threads together various control tasks. It is possible to set a transaction that can span into multiple tasks using the same connection. To enable this, the “retainsameconnection” property of the Connection Manager should be set to “True”.But in some cases say we are creating an Excel file and the next task is to send the file via email to a few intended recipients, the package will fail giving error as the process is already in use. In such situations, the “retainsameconnection” property of the Connection Manager should be set to “False”.

14. Type Casting for Excel destination:

For writing data into excel file, the columns with string data types needs to be type casted to [DT_WSTR] data type.

15. OLEDB connection establishment using C# script task:

While connecting to the DB server using Windows authentication in C# script task, we need to provide the connection string such that the Provider should come in fourth position after Integrated Security Parameter.

16. Excel Provider issues

If the production machine is using BIDS (Business Intelligence Development Studio) tool which is a 32bit process (devenv.exe, you can check it from Task Manager), it cannot reach 64bit ACE

OLEDB Provider. To avoid this we should have 32bit ACE OLEDB Provider installed on your machine. But you cannot install 32bit. So the connection string for Excel needs to be “Provider=Microsoft.Jet.OLEDB.4.0;Data Source=;Extended Properties="EXCEL 8.0;HDR=YES";”.

If we are using Access DB connection managers, the Provider should be “Provider=Microsoft.Jet.OLEDB.4.0;”

17. Sorting:

If you know that data in a source is sorted, set IsSorted=TRUE on the source adapter output. This may save unnecessary SORTs later in the pipeline which can be expensive. Setting this value does not perform a sort operation; it only indicates that the data it sorted.

18. Effective use of queries:

Only select columns that you need in the pipeline to reduce buffer size and reduce OnWarning events at execution time. Use variable on the SQLSourceType property of the Execute SQL Task instead of direct SQL statements. This removes ambiguity when different OLE DB providers are being used. It is also easier as the queries are passed from the configuration file and any modification will be easier.

19. Use of Event Handler:

Whenever there is an error or warning or any event happening on a package, it can be configured to send email and notify the specified users. For this event handlers are used.

20. Folder Structure:

Use the same folder structure for all the packages as the code for validation of folders in the SSIS packages can be reused.

Thursday, 24 April 2014

SSIS (SQL Server Integration Services) Best Practices

No comments:

Post a Comment