5 Essentials for Robust ADF Pipelines
Why hello, welcome back! Building pipelines can feel like a delicate balancing act, sometimes effortless and other times downright daunting. Either way, failing to take your time and plan ahead can lead to some ugly consequences: costs spiraling out of control, data integrity issues, or, even worse, pipelines that appear to run successfully while silently failing on important steps. Today, let's investigate some essentials for preventing these outcomes and building robust, reliable Azure Data Factory (ADF) pipelines.
Reusability
Reusability is key to reducing redundant work, progressing quickly, and ensuring speedy troubleshooting when issues arise. In Azure Data Factory (ADF), one of the best ways to achieve this is by extensively using parameters. They are invaluable and should be used everywhere from your pipelines to your linked services and especially in your datasets.
For example, let’s say you have an Azure Data Lake with a container named "work" and five pipelines drop CSV files there in different directories. Instead of creating a dedicated dataset for each directory/file name combo, you can create a single dataset that uses parameters for the directory path and file name. By taking it a step further and parameterizing the container as well, you can nearly future-proof your design, ensuring you’ll only ever need one dataset.
This approach keeps your datasets organized, reduces setup time, and helps speed up troubleshooting.
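As a rough sketch of what that single dataset could look like (the dataset name, linked service reference, and delimiter settings here are placeholders for your own), a parameterized delimited-text dataset in ADF's JSON might be defined roughly like this:

```json
{
  "name": "GenericCsvDataset",
  "properties": {
    "linkedServiceName": {
      "referenceName": "DataLakeLinkedService",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "container": { "type": "string", "defaultValue": "work" },
      "directory": { "type": "string" },
      "fileName": { "type": "string" }
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": { "value": "@dataset().container", "type": "Expression" },
        "folderPath": { "value": "@dataset().directory", "type": "Expression" },
        "fileName": { "value": "@dataset().fileName", "type": "Expression" }
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```

Each pipeline then passes its own container, directory, and file name when it references the dataset, so all five pipelines can share this one definition.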
Preventing Wasted Time on Transient Issues
Using Timeouts and Retries
Nothing is more frustrating than getting assigned a ticket and realizing the error is a non-issue that could have been prevented with a few clicks and by incrementing a few numbers. This is another case of the defaults not making sense, because some steps should almost always have at least one retry.
Connecting to a REST API? One network blip and now you have an email or alert (hopefully you have those) that a pipeline failed, only to go look and see that it simply timed out. Querying a SQL Server database and getting blocked or hitting a deadlock? Having to manually restart that pipeline gets old fast.
This is why I like to set retry attempts to between 3 and 5, with a retry interval between 60 and 500 seconds. The default activity timeout in ADF of 12 hours (it used to be even worse at 7 days, if you can believe that) should also likely be reduced to around 1-2 hours or less; if a single step in the pipeline is taking longer than that, it might be time to investigate. This way, if it is a transient resource or network issue, I don't have to manually restart the pipeline when I come into the office in the morning, and I can continue to sip my coffee in peace and happiness.
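For reference, those knobs live in the policy block of each activity. A minimal sketch (the activity name and type are placeholders) setting a 2-hour timeout with 3 retries spaced 120 seconds apart:

```json
{
  "name": "CopyFromSource",
  "type": "Copy",
  "policy": {
    "timeout": "0.02:00:00",
    "retry": 3,
    "retryIntervalInSeconds": 120,
    "secureInput": false,
    "secureOutput": false
  },
  "typeProperties": {}
}
```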
Error Alerting: Staying Ahead of Pipeline Failures
When it comes to error handling in Azure Data Factory (ADF), I prefer a step-by-step approach. For each activity, I use a Web Activity to send detailed error information to a Logic App, which then handles alerting. The Logic App sends an email to a distribution list and creates a ticket in Jira through an integrated account.
While ADF does provide some built-in error handling options, they often fall short in terms of clarity and detail. They don’t always provide enough actionable information to quickly identify and address critical issues. My approach ensures that I can glance through errors and immediately know which ones require attention first.
By leveraging a Logic App, you gain full control over the information you send. You can:
Customize the recipients of the email.
Dynamically set the subject line (e.g., include the pipeline name).
Include a clean, dynamic email body that provides all the essential details: the error message, the step where the failure occurred, and the pipeline name.
This level of customization not only streamlines troubleshooting but also ensures that critical issues are promptly escalated and visible to the right teams, while the Jira ticket and its comments provide a history of each issue and what was done to resolve it.
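As a hedged sketch of how that can be wired up (the Logic App URL, activity names, and body fields below are placeholders, and the exact body shape depends on the schema your Logic App's HTTP trigger expects), a Web Activity hanging off an activity's failure path might look roughly like this:

```json
{
  "name": "SendErrorToLogicApp",
  "type": "WebActivity",
  "dependsOn": [
    { "activity": "CopyFromSource", "dependencyConditions": [ "Failed" ] }
  ],
  "typeProperties": {
    "url": "https://<your-logic-app-http-trigger-url>",
    "method": "POST",
    "headers": { "Content-Type": "application/json" },
    "body": {
      "pipelineName": "@{pipeline().Pipeline}",
      "failedActivity": "CopyFromSource",
      "errorMessage": "@{activity('CopyFromSource').error.message}",
      "runId": "@{pipeline().RunId}"
    }
  }
}
```

The Logic App then takes over from there: parse the JSON, send the email to the distribution list, and create the Jira ticket.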
Avoiding Data Flows in Loops (and Loops in General)
Loops in Azure Data Factory (ADF) can seem like a convenient way to handle repetitive tasks, but when misused, they can quickly turn into performance and cost nightmares, especially for Data Flows within loops.
Why Avoid Data Flows in Loops?
Performance Overhead: Each iteration of a Data Flow spins up a new Spark cluster, significantly increasing runtime and causing delays.
High Costs: The overhead of repeatedly initializing clusters can inflate your bill unnecessarily.
Complex Debugging: Tracking issues across multiple iterations is tedious, especially when the loop executes many times and leaves a long list of execution steps to dig through.
And even without Data Flows inside them, loops can still cause inefficiencies:
They create bottlenecks when parallel execution isn't configured, or is set too high (see the sketch below).
They make pipeline flow harder to follow and debug.
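If you do need a loop, the parallelism lives on the ForEach activity itself. A minimal sketch, assuming a pipeline parameter named tableList and with the inner activities omitted:

```json
{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": { "value": "@pipeline().parameters.tableList", "type": "Expression" },
    "isSequential": false,
    "batchCount": 10,
    "activities": []
  }
}
```

batchCount controls how many iterations run at once; setting isSequential to true (or batchCount to 1) is the safe-but-slow option.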
When I first started, I used a loop with a Copy activity for each CSV file found in the source FTP directory, dropping each one to a blob location for further processing later on. I then realized this was far slower, since in this case there were many daily CSVs to copy; using parameters and the proper settings on a standalone Copy activity to copy the entire FTP folder, filtered to files ending in .csv, was far more efficient.
So rather than looping, try batch processing instead. As I found out the hard way, with proper parameterization and settings you can copy all the files over with a single Copy activity rather than looping over each file found. If loops are unavoidable, make sure to test and confirm the parallel execution count is set appropriately.
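For reference, here's a hedged sketch of that standalone Copy activity using a wildcard (the dataset names and folder path are placeholders, and the exact store-settings types depend on your connectors):

```json
{
  "name": "CopyAllDailyCsvs",
  "type": "Copy",
  "inputs": [ { "referenceName": "FtpCsvDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "BlobCsvDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "storeSettings": {
        "type": "FtpReadSettings",
        "recursive": true,
        "wildcardFolderPath": "daily-exports",
        "wildcardFileName": "*.csv"
      }
    },
    "sink": {
      "type": "DelimitedTextSink",
      "storeSettings": { "type": "AzureBlobStorageWriteSettings" }
    }
  }
}
```

One activity run, one set of logs, and no per-file loop overhead.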
Version Control & CI/CD
When it comes to Azure Data Factory, implementing version control and CI/CD across multiple tiers of environments is a requirement, and putting them in place at the start is far easier than doing so once you have pipelines deployed and running.
Using version control helps clarify what is being worked on. I like to set my branch names to the ticket number plus a brief description so it's clear what each branch is for; this adds clarity for the rest of the team and helps make sure the wrong branch doesn't get a push.
Using Git or Azure DevOps, depending on your flavor of choice, also adds accountability, since it shows who pushed changes and when, and it enables a far smoother review process if you're using feature branches that merge into your development/work branch, which ADF uses as its collaboration branch.
ADF’s ability to tie into a publish branch, combined with Azure DevOps pipelines, is the icing on the cake of an enterprise setup. You can work on your changes in a feature branch, go through a review process with the team before merging to development, do your testing and create unit tests, and then do a final review into the production environment. The DevOps pipelines will get your changes into production without the hassle of remaking the pipeline, and they can automatically repoint your linked services to their production counterparts (you do have a Dev and Prod environment, right?).
Just don’t forget our first point: make sure the datasets, linked services, and endpoints all have generic names that make sense in both environments. No “FinanceBlobDev” and “FinanceBlobProd”, just “FinanceBlob”, because while DevOps will repoint to the production version, the name itself is expected to stay the same.
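In practice, the repointing usually happens through the ARM template parameters file that ADF generates on publish, with a production override supplied to the DevOps release. A sketch of what that override could look like (the factory name and parameter names here are illustrative; the real ones are generated from your own linked services):

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "factoryName": { "value": "adf-finance-prod" },
    "FinanceBlob_properties_typeProperties_serviceEndpoint": {
      "value": "https://financestorageprod.blob.core.windows.net/"
    }
  }
}
```

Because the linked service is named “FinanceBlob” in both environments, only the endpoint value changes between Dev and Prod.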
In conclusion
By focusing on these essentials: reusability, retry attempts, error alerting, avoiding loops, and version control with CI/CD, you can make your pipelines far more robust. These practices will save you time, reduce costs, and help you avoid headaches when issues inevitably arise.
Well, that's it for today! If you made it this far, thank you so much, and add your comments below with your thoughts! If you enjoyed this post, subscribe down below for updates on new blogs and newsletters!