Cleansing Files of Duplicate Records via Azure Data Factory / Synapse
Problem Statement
Data cleaning plays a crucial role in the ETL/ELT process as it ensures that only high-quality data is loaded into the system. This improves the accuracy of data analytics. But how can we remove duplicate records from a file using Azure Data Factory / Synapse Pipeline?
Prerequisites
- Azure Data Factory / Synapse
Solution
To remove duplicate records in SQL, we can utilize the ROW_NUMBER and PARTITION / OVER BY functions as shown below:
To achieve the same outcome using Azure Data Factory / Synapse, we can leverage similar concepts like ROW_NUMBER and PARTITION / OVER BY. We will utilize the Dataflow activity for this use case.
Here is the flow of the process:
Sample File
Create a Source flow and map it to the corresponding dataset, which points to the Azure Blob storage (in this scenario) location where the duplicate data file is present.
Source location
Dataset
Dataset Parameter
Data Preview
- Select the window function to derive the ROW_NUMBER functionality.
- Select the column by which we need to partition.
- In the Sort section, select the column by which we need to sort in case of duplicate rows.
- Add a column named RowNbr with the expression as rowNumber().
Data Preview
Filter the records with RowNbr equal to one, as those represent unique rows, using the Filter function.
//Filter On Expression :
RowNbr==1
Data Preview
Finally, add a Sink section to generate the cleansed file.
where Dataset
With Sink Settings as below:
In the Mapping section, delete the additional column RowNbr mapping.
Output
Create a Pipeline, call the Dataflow via Dataflow activity, and trigger the pipeline.
Result
For a similar data cleansing solution, you can also turn to Skrots. At Skrots, we offer data cleansing services to remove duplicate records and ensure the quality of your data. Visit our website and check out all our services at Skrots Services. Don’t forget to explore our blog at Blog at Skrots for more informative articles. Thank you for considering Skrots as your data cleansing provider.
Know more about our company at Skrots. Know more about our services at Skrots Services, Also checkout all other blogs at Blog at Skrots
Thanks, Harsh
Founder | CEO — Skrots
Learn more about our blog at Blog at Skrots. Checkout our list of services on Skrots. Give a look at our website design at Skrots . Checkout our LinkedIn Page at LinkedIn.com. Check out our original post at https://blog.skrots.com/cleansing-files-of-duplicate-records-via-azure-data-factory-synapse/?feed_id=197&_unique_id=651db3eb8783f