Snowflake needs no introduction: it is one of the leading cloud data management platforms, and it is used across industry verticals for multiple use cases (data warehousing, data lakes, data science, etc.). It is an open platform that connects to all major ETL and BI tools, supports development in a wide variety of programming languages, and offers unique capabilities such as the Data Marketplace and the Data Cloud.
If resources are configured and used efficiently, users can realize immediate benefits such as better workload performance and easier disaster recovery. As with any platform, you need to keep a close watch on storage, compute, and data transfer credit usage, particularly when tables reach terabyte scale and workloads are complex. The following best practices help run the platform efficiently and get the best performance; short code sketches illustrating several of them follow the list.
- Ensure resource monitors are in place for all user-managed warehouses to set credit-usage thresholds and suspend the warehouses when a threshold is reached (a sample setup is sketched after this list).
- Use the Snowflake-provided views, functions, and web interface to monitor storage, compute, and cloud services credit usage.
- Use transient and temporary tables wherever they suffice and data recoverability is not important (for example, dev environments, sandboxes, and temporary data manipulation).
- Keep in mind that the Time Travel and Fail-safe features consume storage credits, so set data retention periods deliberately.
- Use zero-copy cloning to copy data to multiple environments.
- Load data sorted on the attribute most commonly used to filter queries, so that Snowflake's natural clustering of micro-partitions prunes effectively.
- Size warehouses based on the workload. Simply using larger warehouses for everything is not smart, even though Snowflake charges only for the compute used: if a small warehouse already meets the performance needs, a larger one may not run queries any faster but can consume more credits. Note that resizing a warehouse does not affect queries that are already running; only new and queued queries use the new size.
- Resizing a warehouse can improve query performance, but it does not improve concurrency; to handle more concurrent queries, use a multi-cluster warehouse.
- Design applications to make the most of Snowflake's caching and minimize compute costs. The query result cache lives in the cloud services layer, while each warehouse also keeps a local data cache; routing similar applications/workloads to the same warehouse maximizes reuse of that data cache. Balance this against suspending warehouses when they are idle, because the warehouse data cache is lost when the warehouse is suspended.
- Consider clustering keys only when required, on large tables whose rows do not change frequently and whose queries are running slow. Keep in mind that reclustering the data consumes credits.
- Use materialized views only when the savings offset the cost of creating and maintaining them. Clustering keys can be defined on materialized views, but that additional performance advantage also comes at a cost.
- Keep external storage (S3, Azure Blob Storage, etc.) in a region close to your Snowflake region, and use external tables judiciously depending on data volume, since they are not as performant as native tables. Consider the cost and performance trade-off.
- To bulk-load large data sets, split the source data into multiple gzip-compressed CSV files and use an appropriately sized warehouse so that all load threads are fully utilized.
- Use the Query Profile to analyze queries and identify tuning opportunities; it provides a graphical query processing plan along with execution times and statistics.
- Analyze queries that join more than two large tables with the Query Profile and, if required, change the join order for better performance.
- Set appropriate concurrency levels (the number of queries that can run concurrently on a warehouse) and statement timeouts (the number of seconds after which a running or queued statement is aborted).
- Use UNION ALL instead of UNION when deduplication is not required (or when the rows being combined are already known to be distinct).
- Use Snowflake’s Search Optimization Service to speed up highly selective point-lookup queries on large tables.
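
The sketches below illustrate several of the tips above. They use Python with the snowflake-connector-python package; the credentials and object names (warehouses such as `wh_etl`, databases such as `analytics`, and so on) are hypothetical placeholders, so treat each snippet as a minimal sketch to adapt rather than a drop-in implementation.

First, a possible resource-monitor setup that notifies at 80% of a monthly credit quota and suspends the warehouse at 100%:

```python
# Sketch: create a resource monitor and attach it to a user-managed warehouse.
# Account, credentials, and object names are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password", role="ACCOUNTADMIN"
)
cur = conn.cursor()

# Monthly credit quota with a notification at 80% and automatic suspension at 100%.
cur.execute("""
    CREATE RESOURCE MONITOR rm_etl_monthly
      WITH CREDIT_QUOTA = 100
      FREQUENCY = MONTHLY
      START_TIMESTAMP = IMMEDIATELY
      TRIGGERS ON 80 PERCENT DO NOTIFY
               ON 100 PERCENT DO SUSPEND
""")

# Attach the monitor to the warehouse it should govern.
cur.execute("ALTER WAREHOUSE wh_etl SET RESOURCE_MONITOR = rm_etl_monthly")

cur.close()
conn.close()
```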
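
To keep an eye on credit and storage usage, the SNOWFLAKE.ACCOUNT_USAGE views can be queried directly; a minimal sketch, assuming a role that can read that shared database:

```python
# Sketch: review recent compute and storage consumption from ACCOUNT_USAGE views.
# Note that these views are populated with some ingestion latency.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password", role="ACCOUNTADMIN"
)
cur = conn.cursor()

# Compute credits per warehouse over the last 30 days.
cur.execute("""
    SELECT warehouse_name, SUM(credits_used) AS credits
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
    ORDER BY credits DESC
""")
for warehouse_name, credits in cur:
    print(f"{warehouse_name}: {credits} credits")

# Average daily storage (bytes) over the last 30 days, including stage and Fail-safe bytes.
cur.execute("""
    SELECT AVG(storage_bytes + stage_bytes + failsafe_bytes) AS avg_bytes
    FROM snowflake.account_usage.storage_usage
    WHERE usage_date >= DATEADD(day, -30, CURRENT_DATE())
""")
print("Average daily storage bytes:", cur.fetchone()[0])

cur.close()
conn.close()
```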
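
A sketch of the storage-related tips: transient and temporary tables, a reduced Time Travel retention period, and a zero-copy clone for a dev environment (all object names are hypothetical):

```python
# Sketch: cheaper table types, shorter Time Travel retention, and zero-copy cloning.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()
cur.execute("USE SCHEMA analytics.staging")

# Transient table: no Fail-safe period, which avoids those storage credits.
cur.execute("CREATE TRANSIENT TABLE IF NOT EXISTS stg_orders (order_id NUMBER, amount NUMBER(12,2))")

# Temporary table: exists only for the current session.
cur.execute("CREATE TEMPORARY TABLE tmp_orders_fix AS SELECT * FROM stg_orders WHERE amount < 0")

# Keep Time Travel short where recoverability is not important.
cur.execute("ALTER TABLE stg_orders SET DATA_RETENTION_TIME_IN_DAYS = 0")

# Zero-copy clone: an instant copy that shares the underlying micro-partitions.
cur.execute("CREATE DATABASE IF NOT EXISTS analytics_dev CLONE analytics")

cur.close()
conn.close()
```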
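
For warehouse sizing, concurrency, and auto-suspend, something along these lines (multi-cluster warehouses require Enterprise edition or higher):

```python
# Sketch: right-size a warehouse, enable multi-cluster scaling for concurrency,
# and auto-suspend when idle. Warehouse names and sizes are illustrative only.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password", role="SYSADMIN"
)
cur = conn.cursor()

# Resize for complex workloads; queries already running keep their current size,
# only new and queued queries use the new size.
cur.execute("ALTER WAREHOUSE wh_etl SET WAREHOUSE_SIZE = 'LARGE'")

# Multi-cluster warehouse: extra clusters are added when queries start queuing.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS wh_bi
      WITH WAREHOUSE_SIZE = 'MEDIUM'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4
      SCALING_POLICY = 'STANDARD'
      AUTO_SUSPEND = 300      -- suspend after 5 idle minutes (the warehouse data cache is dropped)
      AUTO_RESUME = TRUE
""")

cur.close()
conn.close()
```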
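
A sketch for clustering keys and materialized views on a hypothetical `fact_orders` table:

```python
# Sketch: add a clustering key to a large, mostly-append table and create a
# materialized view for a frequently repeated aggregation.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()
cur.execute("USE SCHEMA analytics.sales")

# Cluster on the column most queries filter on; reclustering consumes credits.
cur.execute("ALTER TABLE fact_orders CLUSTER BY (order_date)")

# Check how well the table is clustered on the chosen column.
cur.execute("SELECT SYSTEM$CLUSTERING_INFORMATION('fact_orders', '(order_date)')")
print(cur.fetchone()[0])

# Materialized view: pays off only when the maintenance cost is outweighed
# by the queries it accelerates.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS mv_daily_sales AS
    SELECT order_date, SUM(amount) AS total_amount
    FROM fact_orders
    GROUP BY order_date
""")

cur.close()
conn.close()
```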
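
For external storage, a possible external stage and external table over Parquet files in S3 (the bucket, storage integration, and column names are assumptions):

```python
# Sketch: an external stage plus an external table that queries Parquet files in place.
# External tables are slower than native tables, so weigh cost against performance.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()
cur.execute("USE SCHEMA analytics.ext")

# External stage pointing at the bucket, using a pre-configured storage integration.
cur.execute("""
    CREATE STAGE IF NOT EXISTS stg_clickstream
      URL = 's3://my-bucket/clickstream/'
      STORAGE_INTEGRATION = my_s3_integration
""")

# External table: columns are projected out of the VALUE variant for each file row.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS ext_clickstream (
      user_id  STRING        AS (VALUE:user_id::STRING),
      event_ts TIMESTAMP_NTZ AS (VALUE:event_ts::TIMESTAMP_NTZ)
    )
    WITH LOCATION = @stg_clickstream
    FILE_FORMAT = (TYPE = PARQUET)
""")

cur.close()
conn.close()
```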
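
For bulk loading, a sketch that loads split, gzip-compressed CSV files with COPY INTO and then populates a reporting table sorted on the common filter column (stage, pattern, and table names are hypothetical):

```python
# Sketch: parallel bulk load of compressed CSV files, then a sorted insert so that
# natural clustering lines up with the most common query filter.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password", warehouse="wh_load"
)
cur = conn.cursor()
cur.execute("USE SCHEMA analytics.staging")

# Many similarly sized compressed files let the warehouse load threads work in parallel.
cur.execute("""
    COPY INTO stg_orders
    FROM @stg_raw/orders/
    FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1 COMPRESSION = GZIP)
    PATTERN = '.*orders_part_.*[.]csv[.]gz'
""")

# Populate the reporting table ordered by the attribute most queries filter on.
cur.execute("""
    INSERT INTO analytics.sales.fact_orders
    SELECT * FROM stg_orders ORDER BY order_date
""")

cur.close()
conn.close()
```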
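
To find candidates worth opening in the Query Profile, recent long-running queries can be listed from the QUERY_HISTORY table function; a rough sketch:

```python
# Sketch: list recent queries that ran longer than 60 seconds so their query IDs
# can be inspected in the Query Profile (Snowsight) for tuning opportunities.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

cur.execute("""
    SELECT query_id, total_elapsed_time / 1000 AS seconds, query_text
    FROM TABLE(information_schema.query_history(result_limit => 1000))
    WHERE total_elapsed_time > 60000    -- elapsed time is reported in milliseconds
    ORDER BY total_elapsed_time DESC
    LIMIT 20
""")
for query_id, seconds, query_text in cur:
    # Open each query_id in the Query Profile to check join order, spilling,
    # and partition pruning.
    print(f"{query_id}  {seconds:.1f}s  {query_text[:80]}")

cur.close()
conn.close()
```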
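
Concurrency and timeout settings can be applied at the warehouse level; the values below are illustrative only:

```python
# Sketch: cap concurrent queries per cluster, abort long-running statements,
# and cancel statements that sit in the queue too long.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password", role="SYSADMIN"
)
cur = conn.cursor()

cur.execute("""
    ALTER WAREHOUSE wh_bi SET
      MAX_CONCURRENCY_LEVEL = 8
      STATEMENT_TIMEOUT_IN_SECONDS = 1800
      STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 600
""")

cur.close()
conn.close()
```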
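
Finally, a sketch covering UNION ALL and the Search Optimization Service (an Enterprise edition feature; table names are hypothetical):

```python
# Sketch: avoid the deduplication cost of UNION when inputs are already distinct,
# and enable search optimization for highly selective point lookups.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()
cur.execute("USE SCHEMA analytics.sales")

# UNION ALL skips the implicit DISTINCT that UNION performs.
cur.execute("""
    SELECT order_id, amount FROM orders_2023
    UNION ALL
    SELECT order_id, amount FROM orders_2024
""")

# Search optimization speeds up selective lookups on large tables,
# at the cost of additional storage and maintenance credits.
cur.execute("ALTER TABLE fact_orders ADD SEARCH OPTIMIZATION")

cur.close()
conn.close()
```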
If you have questions about the Snowflake platform and its capabilities, need assistance with a migration to Snowflake, want an assessment of your current implementation, or need help implementing data engineering solutions on Snowflake or supporting your current data pipelines, please engage with us in the comments on this blog post or reach out to us here.