Advanced Skills


Session 1

Learning Objectives
  • Understand concepts of Data Integration and ETL techniques
  • Explain the difference between Data Integration and Data Migration
  • Explore different testing strategies for Data Integration
Data Warehouse
'Data Warehousing' is a practice in data management whereby data is copied from various operational systems into a persistent data store, in a consistent format, to be used for analysis, decision making and reporting.

Online Transactional Processing (OLTP)

OLTP supports transaction-oriented applications, administering the day-to-day transactions of an organisation. For example:

Supermarkets

Online banking

Airline ticket booking

Adding items to a shopping cart

Online Analytical Processing (OLAP)

OLAP consists of data analytics tools that are used for making business decisions. It provides an environment to leverage insights from multiple database systems at one time. For example:

Recommendation algorithms (e.g. Spotify suggested, Amazon products)

Virtual assistants (e.g. Alexa, Siri)

Targeted Adverts

Suggested LinkedIn connections

Data Warehouse vs Database

             | Data Warehouse                                                                           | Database
Processing   | OLAP                                                                                     | OLTP
Structure    | Denormalised table containing repeated data                                              | Highly normalised with different tables
Optimisation | Rapid execution of a low number of complex queries on large, multi-dimensional datasets  | Updating, deleting and modifying data
Timeline     | Historical data                                                                          | Current real-time data
Uptime (SLA) | Regular downtime to allow batch uploads                                                  | Approx. 100% uptime
Query Type   | Complex queries for in-depth analysis                                                    | Simple transactional queries
Data Warehouses are examples of Data Integration products
Data Integration
Data Integration is the process of collecting data from a variety of sources into a single target.

Data Integration Sources

Text Files

Databases

Spreadsheets

Applications

Benefits of Data Integration

Increased availability of data

Superior data integrity and quality

Collaboration opportunities

Greater insights and improvements

Improved data consistency

Data Integration vs Data Migration


Data Integration is the process of collecting data from a variety of sources into a unified view for analysis and making data driven business decisions.

Data Migration is when the data is simply moved from one source to another.

Companies will typically migrate data when implementing a new system or merging to a new environment.

Types of Data Integration

Batch


Data is transferred from source to target in groups, periodically

Data formats and layouts must be consistent between source and target

Source and target are 'asynchronous' (the source doesn't wait for the target to process the data)
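
A minimal sketch of batch-style movement with pandas, reading a hypothetical large source file in groups (chunks) and appending each group to a hypothetical target file:

import pandas as pd

first_chunk = True
# read the source in groups of 10,000 rows rather than all at once
for chunk in pd.read_csv("dataset1.csv", chunksize=10_000):
    # append each group to the target file (hypothetical target_batch.csv)
    chunk.to_csv("target_batch.csv", mode="a", index=False, header=first_chunk)
    first_chunk = False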


Real-time


Data is transferred from source to target instantly

Involves a much smaller amount of data and is used when it is necessary to complete a single transaction

Source and target are 'synchronous' (changes in the source are reflected in the target)

Data Integration Life Cycle

1. Scoping

2. Profiling

3. Design

4. Testing

5. Implementation

1. Scoping

Technical Requirements

Business Requirements

Data Requirements

Operational Requirements

2. Profiling

Understand our data

  • Duplicates
  • Null values
  • Format
  • Data Types
  • Values

3. Design

Decide on the architecture of the data warehouse using business, technical and operational metadata

4. Testing

Validation and verification of the coding interface

Test the process works

  • User Acceptance Testing (UAT)
  • Technical Acceptance Testing (TAT)
  • Performance Stress Testing (PST)

5. Implementation

Implement the process at an operational level

Data Profiling
Data Profiling is the process of reviewing and analysing data to be used in an extract to understand the format and content.

Uses of Data Profiling


Develop metadata and documentation

Report data formats, uniqueness, consistency, correctness and null values

Compare field names across data stores/tables

Can be difficult to arrange if it involves personal or sensitive information

Dataset Name | Format | Dataset Type
My_dataset   | RDBMS  | Reference

Author: Multiverse; Last edited: 01/03/2021

Field Name       | Data Type | Count | Null Values | % Nulls | Maximum Value | Minimum Value
customer_surname | string    | 1501  | 0           | 0%      | zabini        | abbots
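
A minimal pandas sketch of this kind of profiling, assuming a hypothetical products.csv source file:

import pandas as pd

# load the dataset to be profiled (hypothetical file name)
data = pd.read_csv("products.csv")

# field names and data types
print(data.dtypes)

# summary statistics for all fields
print(data.describe(include="all"))

# null values per field, as a count and a percentage
print(data.isnull().sum())
print(data.isnull().mean() * 100)

# duplicate rows
print(data.duplicated().sum())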

Activity


  • Open the products Jupyter Notebook
  • Using pandas, profile the data for:
    • Data Format
    • Field Names
    • Field Data Types
    • Summary Statistics of the data
    • Information on Null values
    • Any other information you think is necessary
  • Create a text document to show this information
Data Integration Techniques

Data Integration Techniques

Manual Data Integration

Middleware Data Integration

Uniform Access Integration

Common Storage Integration

Manual Data Integration


Whole process (e.g. data collection and cleaning, connecting sources) done manually by a human

Best for one-time instances

Manual Data Integration

Benefits


Reduced Costs

Greater Freedom

Drawbacks


Difficulty Scaling

Greater Room for Error

Less Access

Middleware Data Integration


Uses software that connects applications and transfers data between them and databases (no coding required)

Acts as an interpreter between systems and enacts an automatic transfer

Examples include Microsoft Dynamics CRM, SAP and Sage

Middleware Data Integration

Benefits


Fast and Efficient

Scalable

Time Saving

Drawbacks


Less Access

Limited Functionality

Uniform Access Integration


Also known as "Virtual Integration"

Data is allowed to stay in its original location when being accessed

Provides a unified view quickly to both customers and across platforms

Uniform Access Integration

Benefits


Simplified View of Data

Easy Access

Lower Storage Requirements

Drawbacks


Data Management can be Difficult

Data Integrity could be Compromised

Common Storage Integration
(Data Warehouse)


Similar to uniform access except it creates and stores a copy of the data

One of the most popular integration methods

Common Storage Integration
(Data Warehouse)

Benefits


Reduced Burden

Cleaner Data Appearance

Enhanced Data Analytics

Drawbacks


Increased Storage Costs

Higher Maintenance Costs

Rules and Policies

You must specify security policies (e.g. who has access?)

Integrated data should be immutable (unchanging)

Validation checks should be carried out during the process (see the sketch after the lists below)


  • Validate the source and target table structure and data types

  • Validate the column names against a mapping document

Verification is also carried out on the Data Warehouse


  • Verify the data is accurate

  • Verify the data is correct

  • Verify the data has not been duplicated in the Data Warehouse
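
A minimal sketch of these validation and verification checks in pandas, assuming a hypothetical mapping document and hypothetical source/target extracts:

import pandas as pd

# hypothetical mapping document: source column -> target column
mapping = {"cust_surname": "customer_surname", "ord_total": "order_total"}

source = pd.read_csv("source_extract.csv")  # hypothetical source extract
target = pd.read_csv("target_table.csv")    # hypothetical copy of the target table

# validate the column names against the mapping document
print("Missing in source:", set(mapping.keys()) - set(source.columns))
print("Missing in target:", set(mapping.values()) - set(target.columns))

# validate the data types of the target table structure
print(target.dtypes)

# verify the data has not been duplicated in the Data Warehouse copy
print("Duplicate rows in target:", target.duplicated().sum())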

If you want to use Business Data...

Get Permission from the Data Owner!

Data owners are given the right to decide who can have access to enterprise data.

The process involved may be something like this:

A person (staff member, contractor, supplier, etc) requests access to information

A business resource (Data owner, manager) will review the request

A technical resource (usually a DBA) physically grants permission to an application, database or other data store containing the data.

Often the permission follows a CRUD schema (create, read, update, delete)

ETL

E

T

L

Extract

Transform

Load

A process of Data Integration from Multiple Sources

It gives businesses the ability to gather data from multiple sources and consolidate it into a single, centralised location

This can be hard-coded or done using a licensed product

Data is accessed from the source

For this stage to be effective, a basic understanding of the data is required

There are two methods:

  • Current system sends out a copy
  • Other system comes in and grabs the data


Commonly done with SQL queries if data is in databases

Extract

Extraction


Can be hard-coded or tool-based

For hard-coding in Python:

  • CSV reader (Python or R)
  • pandas

import pandas as pd

# extract data from CSV and JSON sources into DataFrames
data1 = pd.read_csv('dataset1.csv')
data2 = pd.read_json('dataset1.json')

# inspect the structure, data types and null counts of each DataFrame
data1.info()
data2.info()


Pandas is not just a data analysis library but can also be used for extraction

Extraction


You can connect to a SQL server to extract data

This can be done in python using:

  • psycopg2 (for PostgreSQL)
  • SQLAlchemy
  • sqlite3

import pandas as pd
import psycopg2

# connect to the PostgreSQL database (credentials are placeholders)
conn = psycopg2.connect(dbname='DB_NAME', user='USERNAME', password='PASSWORD')
cur = conn.cursor()

# list the tables available in the public schema
cur.execute("""SELECT table_name
               FROM information_schema.tables
               WHERE table_schema = 'public'
               ORDER BY table_name""")
print(cur.fetchall())

# pull the first 100 rows of the sales table into a DataFrame
query = "SELECT * FROM SALES LIMIT 100"
sales = pd.read_sql_query(query, conn)

Activity


Open Jupyter Notebook ETL_python

Complete Section 1: Extraction

Transform the data to be compatible with the target data structure

Sometimes simple, sometimes nearly impossible

Requires detailed requirements elicitation

Extract

Transform

Transformations could include:

  • Mapping fields from source to target
  • String manipulation and manual data standardisation
  • Aggregation and normalisation
  • Calculations
  • Dealing with duplicate values
  • Data validation

Transformation

Null Values

# replace null values with a placeholder
data.fillna("Missing", inplace=True)

# or drop rows where col_A is null
data.dropna(inplace=True, subset=["col_A"])

Convert Datatypes

# cast each column to the required data type
data["col_A"] = data.col_A.astype("int")
data["col_B"] = data.col_B.astype("float")
data["col_C"] = data.col_C.astype("bool")

Transformation

Deduplication

# check for duplicate rows
data.duplicated().sum()

# remove duplicates
data.drop_duplicates(inplace=True)

Rename Fields

# rename returns a new DataFrame by default, so reassign it
data = data.rename(columns={"col_A": "Col_A"})
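
A short sketch of some of the other transformation types listed earlier (string standardisation, calculations and aggregation), using hypothetical column names:

import pandas as pd

# hypothetical sales extract
data = pd.DataFrame({
    "country": [" uk", "UK ", "France"],
    "unit_price": [2.50, 2.50, 3.00],
    "quantity": [4, 1, 2],
})

# string manipulation / manual standardisation
data["country"] = data["country"].str.strip().str.upper()

# calculated field
data["total"] = data["unit_price"] * data["quantity"]

# aggregation
print(data.groupby("country")["total"].sum())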

Activity


Open Jupyter Notebook ETL_python

Complete Section 2: Transformation

Load the data into the target data structure

Either write code to insert data or make use of application code that already exists

Examples include loading into a database or Data Warehouse

Could involve joining all extracted data into a single table
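
A minimal loading sketch using pandas and SQLAlchemy, assuming a hypothetical local SQLite target called warehouse.db:

import pandas as pd
from sqlalchemy import create_engine

# hypothetical extracted and transformed data
products = pd.DataFrame({"product_id": [1, 2], "name": ["tea", "coffee"]})
prices = pd.DataFrame({"product_id": [1, 2], "price": [2.50, 3.00]})

# join the extracted data into a single table
combined = products.merge(prices, on="product_id", how="left")

# hypothetical target: a local SQLite data warehouse
engine = create_engine("sqlite:///warehouse.db")

# load into the target table, replacing any existing copy
combined.to_sql("products", engine, if_exists="replace", index=False)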

Extract

Transform

Load

Loading

Activity


Open Jupyter Notebook ETL_python

Complete Section 3: Loading

Security

Some tips to help you run your ETL processes more securely:

  • Download the data onto a secure server
  • Run ETL processes on local files or business/enterprise databases
  • If the data owner has not given you the necessary permissions to write data to the target, you will need to hand your script to the development team to implement
Automating the Process

If your ETL process is unlikely to be a one-off, it may be more efficient to automate it.

You will need to assess when new data becomes available to determine how often your scripts need to run.

Task Scheduling

Microsoft Windows has a 'Task Scheduler' which can run batch files or scripts on a schedule

Scripts should be run out of hours so that the performance of operational systems is not slowed down
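
A minimal sketch of how an ETL script might be structured so Task Scheduler (or cron) can run it unattended out of hours; the extract, transform and load steps here are hypothetical placeholders:

# etl_job.py - entry point that a scheduler can call, e.g. "python etl_job.py"
import logging
import pandas as pd

logging.basicConfig(filename="etl_job.log", level=logging.INFO)

def extract():
    # hypothetical source file
    return pd.read_csv("dataset1.csv")

def transform(data):
    # hypothetical transformation step
    return data.drop_duplicates()

def load(data):
    # hypothetical target file
    data.to_csv("warehouse_extract.csv", index=False)

if __name__ == "__main__":
    logging.info("ETL run started")
    load(transform(extract()))
    logging.info("ETL run finished")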

Licenses vs Coding

Advantages

License

  • Company may already have license
  • Friendly GUI
  • Supports various databases and formats
  • Customer support and good documentation
  • Easy scalability for larger datasets

Coding

  • Easy to create if database is small
  • Easy to install

Disadvantages

License

  • License costs
  • Steep learning curve

Coding

  • Challenging to create (especially if schema changes frequently)
  • Developing scripts is time consuming
  • Issues around scaling to larger datasets
  • Requires programming expertise

Data Integration Software

Master Data Management

Important data about items in an organisation is called Master Data

This includes customer and product information as well as organisational structure

In business, master data management (MDM) comprises the processes, governance, policies, standards and tools that consistently define and manage the critical data of an organisation to provide a single point of reference.

Benefits

Redundancy Elimination

Master Data Edits

Data Consistency

Access Based on Role

Testing Strategies
Why do we need testing strategies?

To ensure that unified data sets are:

  • Correct
  • Complete
  • Up to Date

Types of Testing Strategies

Technical Acceptance Testing (TAT)

User Acceptance Testing (UAT)

Performance Stress Testing (PST)


Technical Acceptance Testing (TAT)

Testing scripts to ensure they produce the correct output

Can be done manually or by automation

There are three strategies:

  • Unit tests
  • Integration tests
  • Functional tests

TAT: Unit Tests


Testing individual functions or lines of code

Uses the Python library unittest

Naming convention test_xxx.py

Run on your command line: python -m unittest


import unittest

def fun(x):
    return x + 1

class MyTest(unittest.TestCase):
    def test(self):
        self.assertEqual(fun(3), 4)

Activity


  • Open a file in Jupyter Notebook and write a function that sums a list of numbers
  • Write a test case for the function in the same script and name it test_sum.py
  • Write a new function in a different file that averages a list of numbers
  • Write a test case for the function in the same script and name it test_average.py
  • Open a terminal and run python -m unittest

TAT: Integration Tests


Integration tests verify that different modules or services used by your application work well together.

For example, testing the interaction with a database, i.e. can you write queries against it?
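
A minimal sketch of an integration test with unittest and sqlite3, checking that queries can be written against a database (an in-memory database is used so the example is self-contained):

import sqlite3
import unittest

class TestDatabaseIntegration(unittest.TestCase):
    def setUp(self):
        # create an in-memory SQLite database for the test
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")

    def tearDown(self):
        self.conn.close()

    def test_insert_and_query(self):
        # verify that a row written to the database can be queried back
        self.conn.execute("INSERT INTO sales VALUES (1, 9.99)")
        rows = self.conn.execute("SELECT amount FROM sales WHERE id = 1").fetchall()
        self.assertEqual(rows, [(9.99,)])

if __name__ == "__main__":
    unittest.main()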

TAT: Functional Tests


These focus on the business requirements of an application. They only verify the output of an action and do not check the intermediate states of the system when performing that action.
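
A short sketch of a functional test, checking only the final output of a hypothetical run_etl function against a business requirement rather than any intermediate state:

import unittest
import pandas as pd

def run_etl():
    # hypothetical ETL pipeline that returns the final, cleaned output
    data = pd.DataFrame({"customer_id": [1, 1, 2], "spend": [10.0, 10.0, 5.0]})
    return data.drop_duplicates()

class TestEtlOutput(unittest.TestCase):
    def test_no_duplicate_rows_in_output(self):
        # business requirement: the unified output must not contain duplicates
        self.assertFalse(run_etl().duplicated().any())

if __name__ == "__main__":
    unittest.main()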


User Acceptance Testing (UAT)

Formal tests to verify whether a report or system satisfies its business requirements

Can be done manually or by automation

Answers questions like:

  • Does the report meet the original requirements?
  • Does the report produce sensible information?
  • Is the design and layout acceptable?


Performance Stress Testing (PST)

Focusses on validating performance characteristics of the product, such as scalability and reliability

Checks the behaviour of the system when it is under a significant load

Tests are based on non-functional requirements:

  • Scalability
  • Reliability
  • Stability
  • Availability

Recap
Learning Objectives
  • Understand concepts of Data Integration and ETL techniques
  • Explain the difference between Data Integration and Data Migration
  • Explore different testing strategies for Data Integration
Complete Session Attendance Log and Update Your OTJ