Design a data validation framework whose purpose is to check whether data selected from source for loading target table matches on number of records on daily basis.
Hypothesis:
(Count(DB1[Source])=Count(DB2[Target]) per batch.
Assume source and target DB can be anything and number of tables as x
My approach:
Store a config file over cloud storage which contains key:value pairs as eg: taget_table_name:Source_table_name.
(This way we can store key value pairs for different kind of db as well for scalable solution)
Run a daily batch job and iterate over the config file saved in step 1 and insert data table wise and day wise data in a table. Ex:
date1,table1,count
date1,table2,count
Expose APIs over source table to get data in aggregate structure like below:
date1 : difference, if any
date2: difference, if any
Alternate approach anyone ?
Any feedback or GitHub links are also welcome.