Understand the various types of slowly changing dimensions (SCDs) and implement them in Hadoop Hive and Spark.
SCD Type 1 overwrites old data with new data, and therefore does not track historical data.
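A minimal sketch of this overwrite behaviour, written in plain Python rather than Hive or Spark so that it runs on its own. The function name `scd1_upsert`, the `id` key, and the `city` attribute are illustrative assumptions, not names taken from the project.

```python
def scd1_upsert(target, source, key="id"):
    """Apply source rows to target, overwriting rows that share a key.

    Hypothetical helper: rows are plain dicts, and `key` names the natural key.
    """
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        # Insert a new key or overwrite an existing one; the old values are lost.
        merged[row[key]] = dict(row)
    return list(merged.values())


# Illustrative data only.
target = [{"id": 1, "city": "Mumbai"}, {"id": 2, "city": "Delhi"}]
source = [{"id": 1, "city": "Pune"}, {"id": 3, "city": "Goa"}]
result = scd1_upsert(target, source)
```

After the call, the row for `id` 1 holds only the new city, and nothing in `result` records that it was ever different.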
SCD Type 2 tracks historical data by creating multiple records for a given natural key in the dimension table, each with a separate surrogate key and/or a different version number. Unlimited history is preserved, because every change is inserted as a new record.
In this project we have used the flag method, in which a flag marks whether a record is the current version:
- Copy every record from the source to the temp table (customer_temp) with the flag set to true. This covers new records that are not present in the target, updated records, and records that have not been updated.
- Copy the records from the target that have been updated in the source, with the flag set to false. Copy the records from the target that are not present in the source, with the flag set to true.
- Finally, after steps 1 and 2, overwrite store.customer (the target) with customer_temp.
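The three steps above can be sketched in plain Python (not the project's Hive or Spark code). The function name `scd2_flag_merge`, the `id` key, the `city` attribute, and the `is_current` flag name are assumptions made for illustration. The sketch also assumes that target rows whose flag is already false are carried over unchanged, which the steps do not spell out.

```python
def attrs(row):
    """Return a row without its flag, so it can be compared with a source row."""
    return {k: v for k, v in row.items() if k != "is_current"}


def scd2_flag_merge(source, target):
    """Build the temp table from source and target using a current-row flag."""
    source_by_id = {row["id"]: row for row in source}

    # Step 1: every source row (new, updated, or unchanged) is current.
    temp = [{**row, "is_current": True} for row in source]

    # Step 2: bring over the target rows that step 1 did not already cover.
    for row in target:
        src = source_by_id.get(row["id"])
        if src is None or not row["is_current"]:
            # Key absent from source, or older history (assumption): keep as is.
            temp.append(dict(row))
        elif attrs(row) != src:
            # The source holds a newer version, so this row becomes history.
            temp.append({**row, "is_current": False})
        # Otherwise the row is unchanged and was already copied in step 1.

    # Step 3: the caller overwrites the target with this result.
    return temp


# Illustrative data only.
source = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}, {"id": 4, "city": "Goa"}]
target = [
    {"id": 1, "city": "Mumbai", "is_current": True},
    {"id": 2, "city": "Delhi", "is_current": True},
    {"id": 3, "city": "Agra", "is_current": True},
]
target = scd2_flag_merge(source, target)
```

With this data the merged table keeps both versions of `id` 1, with only the newer one flagged as current. In Hive or Spark the same logic would typically be expressed with joins between the source and target tables.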
SCD Type 4 provides a way to handle rapid changes in dimension tables. The idea is to create a junk dimension, or a small dimension table, that holds all the possible values of the rapidly changing attributes of the dimension.
The Type 4 method is usually referred to as using "history tables": one table keeps the current data, and an additional table keeps a record of some or all changes. The surrogate keys of both tables are referenced in the fact table to enhance query performance.
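A minimal sketch of the history-table idea in plain Python, assuming a current table keyed by `id` and a separate append-only history list. The function name `scd4_apply` and the `archived_on` column are hypothetical; the sketch leaves out surrogate keys and the fact table.

```python
def scd4_apply(current, history, change, changed_on):
    """Apply one change: archive the old row in history, then update current.

    `current` is a dict keyed by id, `history` is a list; both are modified in place.
    """
    key = change["id"]
    old = current.get(key)
    if old is not None and old != change:
        # Keep the superseded version in the history table.
        history.append({**old, "archived_on": changed_on})
    # The current table always holds exactly one row per key.
    current[key] = dict(change)


# Illustrative data only.
current = {1: {"id": 1, "city": "Mumbai"}}
history = []
scd4_apply(current, history, {"id": 1, "city": "Pune"}, "2024-01-01")
```

Queries that only need the latest values read `current`, while `history` grows with each real change.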