SQL Server Integration Services (SSIS) is an ETL (Extraction, Transformation, and Loading) tool designed to move/copy data from various sources to specific destinations. SSIS facilitates this Data Integration and Migration by making the data available to be transformed while in transit so that it conforms to destination-specific data formats. SSIS Packages need to be developed in a conscious way to avoid any bottlenecks during data migration or transformation. Here are some of the performance guidelines to be followed during the development of SSIS packages.
- Blocking Transformations: Avoid using blocking transformations such as Sort, Aggregate, Fuzzy Lookup, Fuzzy Grouping, and Row Sampling. These transformations copy the data into memory to perform the operation which is time-consuming. Wherever possible, avoiding these transformations will improve the efficiency of SSIS package.
- Data Type Conversions: Data type conversions and derived columns create an additional column in the physical buffers. This is memory-intensive and could impact the performance. Try to add the data type conversions in the source query instead of doing as a transformation within the package.
- Optimize Lookup Transformation: Try to optimize the lookup transformation using right Cache Mode (Full, Partial, No Cache). Use cache Connection Manager, when multiple data flows need the same lookup.
- Remove unused columns from data set or the SQL query: Sometimes, columns are pulled from a source, used for a specific purpose and then not needed for the remaining package execution. It is important to remove these columns from the pipeline at the point where they are not necessary and free up this previously allocated memory.
- Avoid using “Table or View” when pulling source data in the data access mode: Don’t use “Table or View” option in the Data Access Mode of the source as it is equivalent to SQL “Select *” which brings every row and every column in the selected table or view whether they are necessary or not. This adds additional burden on memory. The best practice is to use SQL Command to filter both rows and columns in the source query.
- Use Stored Procedures instead of in-line SQL: If T-SQL code is used within SSIS package when changes need to be made to the T-SQL code, the entire package needs to be redeployed. In such cases, using a Stored Procedure would avoid redeploying the SSIS package versus compiling the procedure in the backend.
A benefit of using Stored Procedure is reusability, as it can be reused between other SSIS packages. Stored Procedures cache an execution plan within SQL Server, whereas a new execution plan must be generated each time an inline SQL code is executed causing additional overhead and execution time.
- Use sorting within SQL query rather using SSIS task: Sorting used in SQL query is not the same sorting that is used within SSIS. As SSIS sorts after copying the data into memory and in addition to that, data flows through network between database and SSIS affecting performance. Whenever possible, sort the data during the initial data pull within SQL query.
- Use the “(NOLOCK)” hint: It is a good practice to use the “(NOLOCK)” database hint when pulling source data using SQL Query. This would prevent SSIS package from blocking other queries or processes attempting to use the same data.
- Avoid using OLE DB Command: Using OLE DB Command is very slow as it performs row-by-row operation. Instead, use a stage table to load the data and use set-based operations using SQL. This avoids row-by-row operation resulting in better performance.
- SQL Server Source Query Optimization: Select only columns you need and optimize the query by reviewing at the execution plan and with proper index strategy.
- Disable Event Handler: Event Handlers decrease package performance. So unnecessary Event Handlers should be removed/disabled.
- Maximum Concurrent Executables: MaxConcurrentExecutables control flow property is used to configure the maximum number of executable SSIS threads. By default, the value is set to -1 which is the number of logical machine processors plus 2. Increase this setting if SSIS runs on a dedicated server and if there are lots of operations that run in parallel. Decrease this setting if SSIS runs with other applications to avoid resource conflicts.
- DefaultMaxBufferRows / DefaultMaxBufferSize: The default values of DefaultMaxBufferSize (10MB) and DefaultMaxBufferRows (10k) is to fetch as much number of rows into buffer as possible. Manipulating these default values can achieve performance gain by allowing the package to process more data in the buffers.
- Use the Fast Parsing Mode option to read data from Flat File sources. Fast Parse supports only Integer, Date and Time data types and sensitive to locale-specific data.
- When SQL Server is the target database, use SQL Server destination instead of OLE DB destination. This will improve the package performance.
- Drop and recreate Indexes before any transformation that might have to process huge data.
- When using an OLE DB destination, use ‘fast load’ data access mode option which uses a BULK INSERT for uploading data into the destination table, instead of simple INSERT statements.
If you have any questions on how to incorporate these performance tips in SSIS development, please engage with us via comments on this blog post, or reach out to us.