Digital transformation and cloud migration have become essential initiatives for modern businesses, offering increased agility, scalability, and efficiency. Today, I’m excited to share insights from a recent migration project I led, moving from an on-premises Teradata data warehouse to Azure Synapse. While this won’t serve as a step-by-step guide, I’ll delve into key considerations, challenges, and valuable lessons learned along the way.
Introduction: Setting the Stage
Our mission was clear: migrate 20TB of compressed data from Teradata to Azure Synapse, a volume that expands to 65TB uncompressed. This article focuses on migrating the source data, but the pipelines for incremental loads after the migration had to be built in parallel.
Planning: Breaking Down the Process
We kicked off our migration by dividing the process into three primary phases:
- Initial Data Load: Utilizing Microsoft Data Box for the bulk transfer of data.
- Delta Load: Managing incremental data gathered post-initial load until the migration cutover.
- Incremental Daily Loads: Implementing ongoing data transfers post-migration to ensure continuity.
The number of environments in the source data warehouse was a critical consideration. Typically, companies maintain DEV, QA/UAT, and PROD environments. Timing data extraction so that it did not disrupt critical processes was paramount.
Data Extraction: Overcoming Hurdles
We opted for PySpark on an on-premises Linux cluster for data extraction. We copied about 4TB of compressed Teradata data, splitting it into 2GB chunks so the transfer could run in parallel. We had initially planned to use Microsoft Data Box, but logistical delays led us to explore alternatives, such as Hadoop DistCp for the final phase.
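To make the chunking idea concrete, here is a minimal sketch of how a file count could be derived from a target chunk size. The function name, the 150 GB example table, and the PySpark call in the closing comment are illustrative assumptions, not the code we ran, and the size estimate would need to reflect the written (compressed) output.

```python
import math

GB = 1024 ** 3


def partitions_for_target_size(total_bytes: int, target_bytes: int = 2 * GB) -> int:
    """Return how many output files keep each file at or below target_bytes."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_bytes))


# Example: a hypothetical 150 GB table split into files of about 2 GB each.
num_files = partitions_for_target_size(150 * GB)
print(num_files)  # 75

# In PySpark the count would typically feed a repartition before the write, e.g.
#   df.repartition(num_files).write.parquet(output_path)
# where df and output_path are hypothetical names.
```

Evenly sized files matter because both the network copy and the later load into Synapse parallelize per file.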
Data Transfer: Choosing the Right Tools
For the final data transfer phase, we relied on Hadoop DistCp to transfer data from the Linux file share to Azure Data Lake Storage (ADLS). Selecting the appropriate file format was crucial: after experimenting with ORC and Avro, we settled on Parquet because of the data type compatibility issues we ran into with the other formats.
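As a rough illustration of the transfer step, the sketch below assembles a DistCp command line that targets ADLS Gen2 through the `abfss://` scheme. The storage account, container, paths, and mapper count are hypothetical, and the cluster is assumed to already have the hadoop-azure driver and storage credentials configured.

```python
import shlex


def build_distcp_command(source_dir: str, account: str, container: str,
                         dest_path: str, max_maps: int = 20) -> list[str]:
    """Build a `hadoop distcp` argument list that copies source_dir into ADLS Gen2."""
    dest = f"abfss://{container}@{account}.dfs.core.windows.net/{dest_path.lstrip('/')}"
    # -m caps the number of parallel map tasks; -update skips files already copied.
    return ["hadoop", "distcp", "-m", str(max_maps), "-update", source_dir, dest]


# Hypothetical names throughout; nothing is executed here.
cmd = build_distcp_command(
    "hdfs:///staging/teradata_export/sales",
    account="examplestorage",
    container="raw",
    dest_path="/teradata/sales",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the copy; `-update` makes a rerun after a partial failure cheaper because files that already match are skipped.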
Schema Conversion and Creation: Ensuring Seamless Transition
Replicating Teradata’s table schemas within Azure Synapse was meticulous work. It laid the groundwork for seamless data transformation and loading, a crucial step in maintaining data integrity across platforms.
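A small sketch of what schema conversion involves: map each Teradata column type to a Synapse equivalent and emit the table DDL. The type map is deliberately partial and illustrative, the table and column names are hypothetical, and a real conversion has to handle many more cases (character sets, PERIOD types, column attributes).

```python
import re

# Illustrative, partial mapping from Teradata base types to Synapse (T-SQL) types.
TYPE_MAP = {
    "BYTEINT": "SMALLINT",  # BYTEINT is signed, while T-SQL TINYINT is unsigned
    "SMALLINT": "SMALLINT",
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "FLOAT": "FLOAT",
    "DECIMAL": "DECIMAL",
    "CHAR": "CHAR",
    "VARCHAR": "VARCHAR",
    "DATE": "DATE",
    "TIMESTAMP": "DATETIME2",
}


def convert_type(teradata_type: str) -> str:
    """Translate e.g. 'TIMESTAMP(6)' to 'DATETIME2(6)', keeping length/precision."""
    match = re.fullmatch(r"\s*([A-Za-z]+)\s*(\([^)]*\))?\s*", teradata_type)
    if not match:
        raise ValueError(f"Unrecognized type syntax: {teradata_type!r}")
    base, args = match.group(1).upper(), match.group(2) or ""
    if base not in TYPE_MAP:
        raise ValueError(f"No mapping defined for Teradata type {base}")
    return TYPE_MAP[base] + args


def build_create_table(schema: str, table: str, columns: list[tuple[str, str]],
                       distribution: str = "ROUND_ROBIN") -> str:
    """Emit a dedicated SQL pool CREATE TABLE statement for (name, teradata_type) pairs."""
    column_lines = ",\n".join(
        f"    [{name}] {convert_type(td_type)}" for name, td_type in columns
    )
    return (
        f"CREATE TABLE [{schema}].[{table}]\n(\n{column_lines}\n)\n"
        f"WITH (DISTRIBUTION = {distribution}, CLUSTERED COLUMNSTORE INDEX);"
    )


print(build_create_table("dbo", "orders", [
    ("order_id", "INTEGER"),
    ("amount", "DECIMAL(18,2)"),
    ("created_at", "TIMESTAMP(6)"),
]))
```

Raising on unmapped types, rather than guessing, keeps unsupported columns visible so they can be reviewed by hand.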
Data Transformation and Loading: Leveraging PolyBase
Once the data resided in ADLS, we used the COPY statement, Synapse’s bulk-load mechanism alongside PolyBase, to load it into Azure Synapse’s Dedicated SQL Pool. Using ADF, we orchestrated pipelines that looped over thousands of files, processing each one and loading it into the Dedicated SQL Pool. This marked the culmination of our migration journey, enabling us to leverage Azure Synapse’s powerful analytics capabilities.
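For readers who have not seen the load step, here is a hedged sketch of generating a `COPY INTO` statement per table, similar in spirit to what a pipeline loop would issue. The account, container, folder, and table names are hypothetical, and managed identity is only one of the authentication options the statement supports.

```python
def build_copy_statement(schema: str, table: str, account: str,
                         container: str, folder: str) -> str:
    """Build a COPY INTO statement that loads every Parquet file under folder."""
    source = (
        f"https://{account}.dfs.core.windows.net/"
        f"{container}/{folder.strip('/')}/*.parquet"
    )
    return (
        f"COPY INTO [{schema}].[{table}]\n"
        f"FROM '{source}'\n"
        "WITH (\n"
        "    FILE_TYPE = 'PARQUET',\n"
        "    CREDENTIAL = (IDENTITY = 'Managed Identity')\n"
        ")"
    )


# Hypothetical tables; in practice the list would come from the migration inventory.
for table_name in ["orders", "customers"]:
    statement = build_copy_statement(
        "dbo", table_name, "examplestorage", "raw", f"teradata/{table_name}"
    )
    print(statement)
    print()
```

Generating one statement per table folder, rather than per file, lets the pool parallelize across the files matched by the wildcard.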
Workload Management: Fine-Tuning for Optimal Performance
Tuning workload management parameters during data transfer and downstream application setup was crucial for performance. We spent significant time customizing Workload Management for the different types of data loads, aiming to maximize utilization of the provisioned DWUs while keeping operations smooth.
Quality Assurance: Upholding Data Integrity
Rigorous quality assurance checks were imperative. We compared the number of files and the file sizes between Hadoop and ADLS, and we compared the Teradata and Synapse schemas. The schema checks covered table names, column counts, column names, data types, varchar lengths, data type precision, indexes, partitions, and distribution types. Whenever a check failed, we re-pulled the affected data to maintain data integrity.
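The file-level checks can be sketched as a comparison of two manifests that map each file path to its size in bytes. How the manifests are collected (for example, from directory listings on each side) is left out, and the paths and sizes below are made up for illustration.

```python
def compare_manifests(source: dict[str, int],
                      target: dict[str, int]) -> dict[str, list[str]]:
    """Compare {path: size_in_bytes} manifests from the source and target stores."""
    missing = sorted(set(source) - set(target))
    unexpected = sorted(set(target) - set(source))
    size_mismatch = sorted(
        path for path in source.keys() & target.keys()
        if source[path] != target[path]
    )
    return {
        "missing": missing,
        "unexpected": unexpected,
        "size_mismatch": size_mismatch,
    }


# Made-up manifests: one file never arrived and one arrived truncated.
hadoop_files = {
    "sales/part-0000.parquet": 2_000,
    "sales/part-0001.parquet": 1_500,
    "customers/part-0000.parquet": 900,
}
adls_files = {
    "sales/part-0000.parquet": 2_000,
    "sales/part-0001.parquet": 1_200,
}

report = compare_manifests(hadoop_files, adls_files)
print(report["missing"])        # ['customers/part-0000.parquet']
print(report["size_mismatch"])  # ['sales/part-0001.parquet']
```

Any path listed under `missing` or `size_mismatch` is a candidate for re-pulling, which matches the remediation described above.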
Navigating Challenges: Lessons Learned
No migration journey is without its obstacles. We encountered data type issues, longer-than-expected Data Box delivery times, and the need to adapt our approach based on the environment. These challenges, however, became valuable lessons, highlighting the importance of:
- Phased migration: Allows for manageable data chunks and effective testing.
- Understanding environments: Identifying data usage patterns to avoid disruption.
- Exploring alternatives: Adapting the approach based on experience.
- Rigorous testing: Ensuring a smooth transition.
Key Takeaways: Your Migration Compass
This migration journey has been a testament to the power of cloud technology and the importance of meticulous planning and adaptability. Here are some key takeaways to serve as your guiding stars on your own cloud migration voyage:
- Divide and conquer.
- Know your environment.
- Explore alternatives.
- Test, test, test.
By sharing my experience, I hope to equip you with the knowledge and confidence to embark on your own successful cloud migration adventure!
Lastly, I would love to hear your thoughts and comments on any cloud migration project you have executed or are planning for your organization.
Two key members of the team who worked with me on this project:
Abhishek N
Amarnath Jagannathan