SSIS has evolved a lot since the days of its predecessor DTS. Today SSIS is one of the most sought after ETL tool in the market. There are multiple features that are inbuilt in SSIS that can perform varied level of repetitive tasks with little or no manual intervention.
Though the inbuilt architecture of SSIS is quite robust and can flawlessly handle significant workloads, there are a few best practices that need to be followed.
Here are a few best practices that should to be implemented while designing and working with SSIS.
To group related tasks, use Sequence Container
- Limit the scope of the variables and create the variables closer to the shape that would use it.
- Its preferable to usePascal casing for variables configurable in dtsConfig and camel casing for the rest.
Use precedence constraint to handle failure instead of using Event Handlers. The reason being, it’s simpler, cleaner and easier to spot. Also, event handler gives cryptic error that makes no sense at all which is sheer waste of time.
Set the Protection level to DontSaveSensitive and SuppressConfigurationWarning to True.
Setting the protection level will prevent from message like below:
Failed to decrypt protected XML node “DTS:Password” with error 0x8009000B
SuppressConfigurationWarning, will prevent unrelated configuration level warnings that can occur when dtsConfig is moved across different environments.
For parallel data processing, one can use the Balanced Data Distributor for SSIS which can be downloaded separately from Microsoft.
Always use centralized repository for maintaining connectionStrings. If overtime the number of SSIS projects grow, a centralized repository of connectionString will make sure that settings defined in repository/configuration store are reused in other SSIS packages/projects.
Don’t ever use OLEDB command. Unless you have really small data sets and you’re 100% sure they won’t grow. For every row in the buffer the OLE DB command executes the SQL statement. It would mean that if 1 million rows are passed through a transformation, the SSIS package will send 1 million SQL statements to the server. This will blow up the transaction log file on the database server based upon the recovery model the database is in.
Let’s think of a scenario where the first component of your package creates an object for this case consider a temporary table and which is being referenced by the second component of the package. Now it can happen that due to external factors like I/O or Network Issues the temporary table creation is getting delayed which causes a package validation failure when validating the second component. SSIS would then throw a validation exception and package execution would fail. To mitigate such scenario, every component has a DelayValidation property whose value by default is set to FALSE.
Changing the property value to TRUE early validation will be skipped and the validation would occur during the package execution.
It is always recommended to push as many rows as possible into the Buffer.
SSIS consumes a lot of memory. It is very crucial to have a good package design so that the package execution does not lead to memory pressure. Memory buffer can be considered as a temporary hold up area for the data before being sent to actual processing.
For example, consider a flat file with 2 million rows. A simple package is designed to push the two million rows into a database table. All of those 2 million rows will be placed into memory buffer during the package execution. SSIS will not push all of the 2 million rows in the memory buffer at the same time. Internally, SSIS execution engine will use a buffer mechanism to break apart into smaller chunks of the incoming data. It will load maximum of 5 active buffers at a time for every data flow execution. Each buffer can be smaller of either of below two:
- 10,000 rows
- 10 MB size
Memory allocation for Asynchronous tasks in the data components like union all or sort there will be multiple execution trees and execution trees will have their own set of memory buffers. Having multiple execution trees is one of the reasons why data transformations like union all and sort as called blocking operators.
Default buffer settings
- DefaultBufferSize: 10MB.
- DefaultBufferMaxRows: Number of rows by default the data flow will try to put in a buffer. This is set to 10,000 rows.
As explained earlier the SSIS engine will allocate maximum of 5 active buffers for a serial/synchronous operation for each individual execution tree of the component and for asynchronous execution tree, every execution tree will have its own set of memory buffers.
The size of these memory buffers is determined by the values set for DefaultBufferSize and DefaultBufferMaxRows property. The buffer sizes are set through a round robin allocation mechanism. The buffer size of a transformation can never be of uneven size. For example, if Buffer1 reaches the 10MB or 10K rows limit the Buffer2 is used till it reaches the 10MB/10K row limit and then Buffer3 and then the Buffer4 and so on.
A new property was introduced in SSIS 2016: AutoAdjustBufferSize. With this property, if set to True, the customized values for DefaultBufferMaxRows would come into effect.
Please note that this property by default is always set to False
All transformations are categorized either as synchronous or asynchronous. A synchronous transformation is by default non-blocking transformation while an asynchronous transformation is a blocking transformation.
Example of synchronous/non-blocking transformations are:
- Derived Column
- Conditional Split
- Data Type Conversion
Example of asynchronous/blocking transformations are:
- Union All
As explained earlier, since it requires the blocking transformations to create multiple buffers for each execution tree, this operation leads to reading of all these multiple buffers to output even one buffer. This can lead to tremendous memory pressure if the transformations are operating on a huge data set.
It is always recommended to avoid the use of blocking transformations and if such operation is indeed necessary, then setting up values for proper DefaultBufferMaxRows/AutoAdjustBufferSize might help improve performance and scalability.
Though there no extensive settings or properties to enhance overall performance of an SSIS application but by implementing the above steps you would be able to leverage the best of what SSIS has to offer.