Ever had to compare the destination and source for missing updates or inserts? well, I had to because of some load issues related to timing or source updates or whatever, we had some data missing in the warehouse and needed to do data quality checks from source to destination.
After going through developing various methods of doing this and see how time consuming it is to go through each and every record (when you have billions of records in a table), the fastest and easiest way I found out is using the cozyrock table difference transformation.
below is an example of how easy it is to use..
(i am using an example of Customers table and I am getting the fields Account and email)
I am sourcing the Account and email fields from the Customers table in Customers database into a warehouse table "WH_customers".
Step 1: Write queries to read the Account and Email fields from Customers table and WH_customers table and use the queries in the 2 sources shown in fig 1.
/* from customers DB, since it is transactional and the record will be updated when the customer updates his email on the frontend application */
SELECT Distinct Account, email
FROM Customers.customers WITH (NOLOCK)
/* from the warehouse table, if you do only inserts and not updates*/
With customers (account, email, srno)
AS
(
select account, email, SRno = RANK ( ) OVER ( partition by account order by insert_date desc )
from dbo.customers
)
select * from customers where srno =1
/* from the warehouse table, if you are doing updates to the emails*/
SELECT Distinct Account, email
FROM WH_customers WITH (NOLOCK)
Step 2: Configure the table difference transformation as shown in the below figures.
you can create a table called exceptions table and load the exception records coming out of the New and Update flows and build reports or send emails to the business or whatever you want to do once you have exceptions in the table.
After going through developing various methods of doing this and see how time consuming it is to go through each and every record (when you have billions of records in a table), the fastest and easiest way I found out is using the cozyrock table difference transformation.
below is an example of how easy it is to use..
(i am using an example of Customers table and I am getting the fields Account and email)
I am sourcing the Account and email fields from the Customers table in Customers database into a warehouse table "WH_customers".
Step 1: Write queries to read the Account and Email fields from Customers table and WH_customers table and use the queries in the 2 sources shown in fig 1.
/* from customers DB, since it is transactional and the record will be updated when the customer updates his email on the frontend application */
SELECT Distinct Account, email
FROM Customers.customers WITH (NOLOCK)
/* from the warehouse table, if you do only inserts and not updates*/
With customers (account, email, srno)
AS
(
select account, email, SRno = RANK ( ) OVER ( partition by account order by insert_date desc )
from dbo.customers
)
select * from customers where srno =1
/* from the warehouse table, if you are doing updates to the emails*/
SELECT Distinct Account, email
FROM WH_customers WITH (NOLOCK)
Fig 1
you can create a table called exceptions table and load the exception records coming out of the New and Update flows and build reports or send emails to the business or whatever you want to do once you have exceptions in the table.