Basic Introduction
A Data Warehousing (DW) is process for collecting and managing data from varied sources to provide meaningful business insights. A Data warehouse is typically used to connect and analyze business data from heterogeneous sources. The data warehouse is the core of the BI system which is built for data analysis and reporting. You want a data warehouse to analyze petabytes of historical data that you’ve ingested from your systems, and for the queries to run in minutes.
Staging Layer (Focus on the “E”)
-
Mirror images of the source objects (Get data from source ASAP)
-
Non-Persistence Staging layer : Load and delete after moving to User access layer
-
Persistence Staging layer : Contain the history of data. New and updates are accommodated accordingly.
-
-
Prefer to have Persistence Staging Layer – Exact data as source data / Will need more storage / Archive to S3 after few years
User Access Layer
-
Dimensional data : Structured data as per the requirement of the frontend applications / reports
ETL (Extract, Transform, Load)
-
Initial ETL
-
One time ETL
-
Before go live get all the data from the source
-
Will bring in
-
Data needed for BI and analytics
-
Historical data
-
-
-
Incremental ETL
-
Data that refresh
-
New data
-
Modifications of data (updates, soft deletes)
-
-
Incremental ETL Patterns (Near Real Time, Hourly, Daily, Weekly)
-
Append : Appending new information
-
In-place update : Doing updates in existing rows
-
Complete replacement : Delete all existing data add the new data set
-
Rolling append : Wipe out old data set and add the latest (only have 36 months of data in DW all the time)
Data Transformation
-
Uniform the data : Getting data from different sources will have different representations. We need to unify it.
-
Data values
-
Data types & size
-
De-duplication : remove duplication data (mainly for master data)
-
Dropping columns : remove unwanted columns from the source when we move to DW
-
Value based row filtering : remove unwanted rows based on the values
-
Correcting known errors : data issues to be fixed when moving data to DW
-
-
Restructure the data
-
Design the data structure
-