Data Warehousing for Beginners

Basic Introduction

Data warehousing (DW) is the process of collecting and managing data from varied sources to provide meaningful business insights. A data warehouse is typically used to connect and analyze business data from heterogeneous sources. It is the core of a BI system built for data analysis and reporting. You want a data warehouse when you need to analyze petabytes of historical data ingested from your systems and have the queries run in minutes.

Staging Layer (Focus on the “E”)

  • Mirror images of the source objects (get the data from the source as soon as possible)

    • Non-persistent staging layer : Load the data, then delete it after it has moved to the user access layer

    • Persistent staging layer : Contains the history of the data. New rows and updates are accommodated accordingly.

  • Prefer a persistent staging layer – exact copy of the source data / needs more storage / archive to S3 after a few years
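The difference between the two staging styles can be sketched with an in-memory SQLite database. This is a minimal illustration, not a production loader: the table `stg_customer`, its columns, and the `batch_id` tag are hypothetical names chosen for the example.

```python
import sqlite3


def load_persistent_staging(conn, source_rows, batch_id):
    """Append a mirror copy of the source rows, tagged with the load batch."""
    conn.executemany(
        "INSERT INTO stg_customer (id, name, batch_id) VALUES (?, ?, ?)",
        [(row["id"], row["name"], batch_id) for row in source_rows],
    )


def clear_staging(conn):
    """Non-persistent style: wipe staging once data reached the user access layer."""
    conn.execute("DELETE FROM stg_customer")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customer (id INTEGER, name TEXT, batch_id INTEGER)")

# Two loads of the same source object: the second carries an updated name.
load_persistent_staging(conn, [{"id": 1, "name": "Ann"}], batch_id=1)
load_persistent_staging(conn, [{"id": 1, "name": "Anne"}], batch_id=2)

# Persistent staging keeps both versions, so the history can be replayed later.
history = conn.execute(
    "SELECT id, name, batch_id FROM stg_customer ORDER BY batch_id"
).fetchall()
```

With a persistent layer, `clear_staging` is never called and old batches are archived instead; with a non-persistent layer it would run after every successful load.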

User Access Layer

  • Dimensional data : Data structured as required by the frontend applications / reports

ETL (Extract, Transform, Load)

  • Initial ETL

    • One-time ETL

    • Before go-live, get all the data from the source

    • Will bring in

      • Data needed for BI and analytics

      • Historical data

  • Incremental ETL

    • Data that refreshes

      • New data

      • Modifications of data (updates, soft deletes)
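One common way to pick up only new and modified rows is to compare a last-modified timestamp against the time of the previous run. The sketch below assumes the source exposes an `updated_at` column and an `is_deleted` soft-delete flag; both names are hypothetical, and sources without such columns need a different change-detection approach.

```python
from datetime import datetime


def extract_incremental(source_rows, last_run):
    """Return rows created or modified after the previous ETL run."""
    return [row for row in source_rows if row["updated_at"] > last_run]


source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1), "is_deleted": False},
    {"id": 2, "updated_at": datetime(2024, 1, 5), "is_deleted": False},
    # A soft delete is just another modification: the row stays, the flag flips.
    {"id": 3, "updated_at": datetime(2024, 1, 6), "is_deleted": True},
]

changed = extract_incremental(source, last_run=datetime(2024, 1, 2))
```

The initial ETL is the same call with `last_run` set earlier than any row in the source, which brings in the full history.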

 

Incremental ETL Patterns (Near Real Time, Hourly, Daily, Weekly)

  • Append : Append new information to the existing data

  • In-place update : Update existing rows

  • Complete replacement : Delete all existing data and load the new data set

  • Rolling append : Drop the oldest data and append the latest (for example, keep only the most recent 36 months of data in the DW at all times)
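The four patterns above can be sketched on a plain list of rows. This is an illustration under stated assumptions: the `id` key, the `day` column, and the function names are hypothetical, and a real warehouse would express the same logic in SQL or in its loading tool.

```python
from datetime import date


def append(table, new_rows):
    """Append: keep everything and add the new rows at the end."""
    return table + new_rows


def in_place_update(table, changes, key="id"):
    """In-place update: swap in the changed version of each matching row."""
    by_key = {row[key]: row for row in changes}
    return [by_key.get(row[key], row) for row in table]


def complete_replacement(table, new_rows):
    """Complete replacement: discard the old data set entirely."""
    return list(new_rows)


def rolling_append(table, new_rows, cutoff):
    """Rolling append: drop rows older than the cutoff, then append."""
    kept = [row for row in table if row["day"] >= cutoff]
    return kept + new_rows


sales = [
    {"id": 1, "day": date(2021, 1, 15), "amount": 10},
    {"id": 2, "day": date(2023, 6, 1), "amount": 20},
]
latest = [{"id": 3, "day": date(2024, 2, 1), "amount": 30}]

appended = append(sales, latest)
updated = in_place_update(sales, [{"id": 2, "day": date(2023, 6, 1), "amount": 25}])
replaced = complete_replacement(sales, latest)
# The cutoff is computed by the caller, e.g. 36 months before the load date.
rolled = rolling_append(sales, latest, cutoff=date(2021, 2, 1))
```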

Data Transformation

  • Standardize the data : Data from different sources comes in different representations, so we need to unify it.

    • Data values

    • Data types & size

    • De-duplication : remove duplicate data (mainly for master data)

    • Dropping columns : remove unwanted source columns when moving to the DW

    • Value-based row filtering : remove unwanted rows based on their values

    • Correcting known errors : fix known data issues when moving data to the DW

  • Restructure the data

    • Design the data structure
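The standardization steps listed above can be combined into one pass over the source rows. The sketch below is only an example: the column names, the `COUNTRY_CODES` mapping, and the `KNOWN_FIXES` table are hypothetical stand-ins for whatever rules a real source requires.

```python
COUNTRY_CODES = {"usa": "US", "us": "US", "united states": "US", "lk": "LK"}
KNOWN_FIXES = {"Jhon": "John"}  # known data-entry errors and their corrections


def transform(rows):
    """Standardize source rows before they are loaded into the warehouse."""
    seen_ids = set()
    result = []
    for row in rows:
        # Value-based row filtering: skip rows the warehouse does not need.
        if row["status"] == "test":
            continue
        # Data types: the same key may arrive as text or as a number.
        customer_id = int(row["customer_id"])
        # De-duplication: keep only the first occurrence of each customer.
        if customer_id in seen_ids:
            continue
        seen_ids.add(customer_id)
        # Data values: map the different spellings to one representation.
        country = COUNTRY_CODES.get(row["country"].strip().lower(), row["country"])
        # Correcting known errors.
        name = KNOWN_FIXES.get(row["name"], row["name"])
        # Dropping columns: "status" and "internal_note" are not carried over.
        result.append({"customer_id": customer_id, "name": name, "country": country})
    return result


raw = [
    {"customer_id": "1", "name": "Jhon", "country": "usa", "status": "active", "internal_note": "x"},
    {"customer_id": 1, "name": "John", "country": "US", "status": "active", "internal_note": "dup"},
    {"customer_id": "2", "name": "Mia", "country": "United States", "status": "test", "internal_note": ""},
    {"customer_id": "3", "name": "Lee", "country": " LK ", "status": "active", "internal_note": ""},
]
clean = transform(raw)
```

Keeping each rule on its own line makes it easy to see which source issue every step addresses, and to add a rule when a new source is onboarded.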


Cloud SaaS Security Patterns & How AWS Services Can Address Them

Top-Level Cloud Security Requirements

  • R1: Must provide protection to the system’s components. This requirement concerns protecting both the software (e.g., a piece of code) and the hardware (e.g., sensor devices) that make up the system.

  • R2: Must be able to prevent unauthorized access and intrusion to the system and its resources. This requirement is about ensuring that only genuine users or applications can access the application or the system’s resources.

  • R3: Must be able to monitor network requests. The main goal is to monitor network requests in order to prevent potential attacks on the system and its resources.

  • R4: Must have an auditing option and be able to recover from a breach. This requirement concerns auditing system and resource usage in order to detect anomalies.

  • R5: Must ensure data protection at rest and in transit. This requirement concentrates on how to protect data both in transit and at rest, especially when it resides on a public Cloud platform.

  • R6: Must ensure privacy protection and regulatory compliance. This requirement is about how to ensure privacy protection and regulatory compliance for data processed in the Cloud infrastructure.

  • R7: Must provide secure communication between modules. A system may be made of different modules deployed on the same or different Cloud platforms, so it is important to ensure secure communication between those modules.

  • R8: Must provide protection to the system’s resources. The system’s resources here are the Cloud resources required to run the Cloud application. The concern is how to protect them from excessive and unnecessary use, in order to ensure the economic durability and lasting availability of the application running on the Cloud platform.

 

AWS Services to the Rescue

| Category | Pattern | Description | Required | Approach |
|---|---|---|---|---|
| Compliance and Regulatory | Data Citizenship | How can a Cloud-based solution achieve regulatory compliance with respect to data storage locality? | Yes | AWS Tags (location tags for the resources) |
| Compliance and Regulatory | Cryptographic Erasure | How can a dataset be reliably and securely erased after it was stored in the Cloud? If we replicate the data in multiple regions, this needs to be addressed. | No | AWS KMS (ensure data is encrypted at rest and KMS manages the key for it) |
| Compliance and Regulatory | Shared Responsibility Model | How can a Cloud services consumer effectively manage the legal and regulatory compliance of their Cloud application? | Yes | Usage of AWS managed services |
| Compliance and Regulatory | Compliant Data Transfer | How can data be transferred for processing to other parties in potentially different jurisdictions while staying in compliance with legal and regulatory requirements? | Yes | AWS Tags (location tags for the resources). When we use third-party functionality, which is often exposed through APIs, we need to adhere to data transfer guidelines. |
| Compliance and Regulatory | Data Retention | How long is personal information retained? | Yes | Lambda function to automate the data clearing process |
| Compliance and Regulatory | Data Lifecycle | How to efficiently and securely manage the data lifecycle in the Cloud | Yes | AWS Data Lifecycle Manager |
| Compliance and Regulatory | Intentional Data Remanence | How can data in the Cloud be protected from accidental or malicious deletion? | Yes | RDS data replication/redundancy |
| Identification, Authentication and Authorization | Multi-Factor Authentication | How to simply, yet securely, authenticate physical users of Cloud-based applications | Yes | AWS Cognito with MFA |
| Identification, Authentication and Authorization | Federation (Single Sign-On) | How to authenticate with customer-provided user identities | Yes | AWS Cognito with AWS SSO |
| Identification, Authentication and Authorization | Access Token | How to control human or machine user access to Cloud APIs | Yes | AWS Security Token Service with Cognito |
| Identification, Authentication and Authorization | Mutual Authentication | How to establish the identity of parties in a Cloud communication channel. Without proper authentication between communicating parties, man-in-the-middle attacks are possible. | Yes | AWS Client VPN, TLS/SSL certificates via AWS Certificate Manager |
| Identification, Authentication and Authorization | Secure User Onboarding | How to securely perform the initial registration of Cloud application users | Yes | Define a secure onboarding process / AWS customer onboarding process |
| Identification, Authentication and Authorization | Identity and Access Manager | How to securely and effectively manage a user database and provide authentication and authorization functionality in a Cloud application | Yes | AWS IAM & Cognito |
| Identification, Authentication and Authorization | Per-request Authentication | How to continuously prove the identity of the user when they perform sensitive operations | Yes | CloudWatch with events and notifications. Tools that monitor user activity from the start to the end of a session. JWT validation throughout the request life cycle, with user activities logged. Detect abnormal activity via log analysis. |
| Identification, Authentication and Authorization | Access Control Clearance | How to enforce access and usage control policies for different types of authentication | Yes | Implement a central authorization module and validate access in both the front end (FE) and back end (BE) (role-based access) |
| Secure Development, Operation and Administration | Bastion Server | How to access Cloud resources without exposing them directly to the Internet | Yes | Bastion host outside the firewall |
| Secure Development, Operation and Administration | Automated Threat Detection | How to detect network attacks on Cloud internet endpoints | Yes | Amazon GuardDuty |
| Secure Development, Operation and Administration | Economic Durability | How to establish and maintain availability of the Cloud services in the face of distributed denial-of-service attacks | Yes | AWS WAF & CloudWatch |
| Secure Development, Operation and Administration | Vulnerability Management | How to detect and respond to found vulnerabilities | Yes | Use external tools |
| Privacy and Confidentiality | End-to-End Security | How to communicate a message between two parties so that its confidentiality is protected across all components in the Cloud communication channel | Yes | AWS KMS and Certificate Manager (security guarantees are needed for data in transit and at rest) |
| Privacy and Confidentiality | Computation on Encrypted Data | How to outsource data for computation to a Cloud service without disclosing it in the process | No | The cloud provider maintains the keys, so we need to fully trust the cloud provider |
| Privacy and Confidentiality | Data Anonymization | How to remove personal identifiers from datasets to protect privacy, while keeping the datasets valuable for processing | Yes | AWS Athena, CloudWatch and Lambda to automate a scan |
| Privacy and Confidentiality | Processing Purpose Control | How to ensure data is used or processed in accordance with its original intended purpose | No | Automated tool to trace and audit data usage |
| Secure Architecture | Virtual Network | How to connect components of a Cloud application architecture without unnecessarily exposing them to the Internet | Yes | AWS VPC |
| Secure Architecture | Web Application Firewall | How to protect web API endpoints from unauthorized access and abuse | Yes | AWS WAF |
| Secure Architecture | Secure Element | How to securely provide and strongly protect the identity of IoT devices or external services | Yes | Give each device or service a unique identity; PKI should be the foundation of any IoT / external service security strategy |
| Secure Architecture | Secure Cold Storage | How to protect the availability of large amounts of data securely and cost-effectively | Yes | AWS Glacier for cold storage with encryption |
| Secure Architecture | Certificate and Key Manager | How to securely and effectively create, provision and revoke certificates and keys for securing data at rest and in transit | Yes | AWS KMS and Certificate Manager |
| Secure Architecture | Hardware Security Module | How to best protect the cryptographic secrets owned by Cloud tenants while still enabling Cloud processing infrastructure to compute on the tenant data | No | CloudHSM |
| Secure Architecture | Secure Auditing | How to record and report security-related behavior in an operating Cloud system | Yes | AWS security audit checklist: https://aws.amazon.com/blogs/security/auditing-security-checklist-for-aws-now-available/ |
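As an illustration of the Data Retention pattern, where a Lambda function automates the data clearing process, the sketch below shows the core selection logic. It is a rough outline under stated assumptions: the 365-day period, the `created_at` field, and the `records` key in the event are hypothetical, and the actual deletion against a data store is left out.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 365  # assumed retention period; set this from your own policy


def split_by_retention(records, now, retention_days=RETENTION_DAYS):
    """Split records into (kept, expired) based on their creation time."""
    cutoff = now - timedelta(days=retention_days)
    kept = [r for r in records if r["created_at"] >= cutoff]
    expired = [r for r in records if r["created_at"] < cutoff]
    return kept, expired


def lambda_handler(event, context):
    """Entry point in the (event, context) shape used by Python Lambda handlers."""
    now = datetime.now(timezone.utc)
    # Hypothetical input: a real function would query the data store instead.
    records = event.get("records", [])
    kept, expired = split_by_retention(records, now)
    # A real function would delete the expired records here and log the outcome.
    return {"expired_ids": [r["id"] for r in expired], "kept": len(kept)}
```

Running the handler on a schedule keeps the retention rule in one place, and the returned ids give the audit trail a record of what was cleared.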

 
