Advanced Skills
Session 1
OLTP (Online Transaction Processing) supports transaction-oriented applications, administering the day-to-day transactions of an organisation. For example:
Supermarkets
Online banking
Airline ticket booking
Adding items to a shopping cart
OLAP (Online Analytical Processing) consists of data analytics tools used for making business decisions. It provides an environment to leverage insights from multiple database systems at once. For example:
Recommendation algorithms (e.g. Spotify suggested, Amazon products)
Virtual assistants (e.g. Alexa, Siri)
Targeted Adverts
Suggested LinkedIn connections
Data Warehouse vs Database
| | Data Warehouse | Database |
|---|---|---|
| Processing | OLAP | OLTP |
| Structure | Denormalised tables containing repeated data | Highly normalised with many separate tables |
| Optimisation | Rapid execution of a small number of complex queries on large, multi-dimensional datasets | Updating, deleting and modifying data |
| Timeline | Historical data | Current real-time data |
| Uptime (SLA) | Regular downtime to allow batch uploads | Approx. 100% uptime |
| Query Type | Complex queries for in-depth analysis | Simple transactional queries |
Text Files
Databases
Spreadsheets
Applications
Increased availability of data
Superior data integrity and quality
Collaboration opportunities
Greater insights and improvements
Improved data consistency
Data Integration is the process of collecting data from a variety of sources into a unified view for analysis and making data-driven business decisions.
Data Migration is when the data is simply moved from one source to another.
Companies will typically migrate data when implementing a new system or merging to a new environment.
Data transferred from source to target in groups, periodically
Data formats and layouts must be consistent between source and target
Source and target are 'asynchronous' (the source doesn't wait for the target to process data)
Data transferred from source to target instantly
Involves a much smaller amount of data; used when it is necessary to complete a single transaction
Source and target are 'synchronous' (changes in the source are reflected in the target)
1. Scoping
Technical Requirements
Business Requirements
Data Requirements
Operational Requirements
2. Profiling
Understand our data
3. Design
Decide on the architecture of the data warehouse using business, technical and operational metadata
4. Testing
Validation and verification of the coding interface
Test the process works
User Acceptance Testing (UAT)
Technical Acceptance Testing (TAT)
Performance Stress Testing (PST)
5. Implementation
Implement the process at an operational level
Develop metadata and documentation
Report data formats, uniqueness, consistency, correctness and null values
Compare field names across data stores/tables
Can be difficult to arrange if it involves personal or sensitive information
| Dataset Name | Format | Dataset Type |
|---|---|---|
| My_dataset | RDBMS | Reference |
Author: Multiverse; Last edited: 01/03/2021
| Field Name | Data Type | Count | Null Values | % Nulls | Maximum Value | Minimum Value |
|---|---|---|---|---|---|---|
| customer_surname | string | 1501 | 0 | 0% | zabini | abbots |
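A report like the one above can be produced programmatically. Below is a minimal profiling sketch using pandas; the file name is a hypothetical placeholder:

```python
import pandas as pd

# Load the dataset to be profiled ('my_dataset.csv' is a placeholder name)
data = pd.read_csv('my_dataset.csv')

# Build a field-by-field profile: data type, count, nulls, % nulls, min and max
profile = pd.DataFrame({
    'Data Type': data.dtypes.astype(str),
    'Count': data.count(),
    'Null Values': data.isna().sum(),
    '% Nulls': (data.isna().mean() * 100).round(1),
    'Minimum Value': data.min(),
    'Maximum Value': data.max(),
})
print(profile)
```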
Manual Data Integration
Middleware Data Integration
Uniform Access Integration
Common Storage Integration
Whole process (e.g. data collection and cleaning, connecting sources) done manually by a human
Best for one-time instances
Reduced Costs
Greater Freedom
Difficulty Scaling
Greater Room for Error
Less Access
Uses software that connects applications and transfers data between them and databases (no coding)
Acts as an interpreter between systems and enacts an automatic transfer
Examples include Microsoft Dynamics CRM, SAP and Sage
Fast and Efficient
Scalable
Time Saving
Less Access
Limited Functionality
Also known as "Virtual Integration"
Data is allowed to stay in its original location when being accessed
Provides a unified view quickly to both customers and across platforms
Simplified View of Data
Easy Access
Lower Storage Requirements
Data Management can be Difficult
Data Integrity could be Compromised
Similar to uniform access except it creates and stores a copy of the data
One of the most popular integration methods
Reduced Burden
Cleaner Data Appearance
Enhanced Data Analytics
Increased Storage Costs
Higher Maintenance Costs
You must specify security policies (e.g. who has access?)
Data integrated should be immutable (unchanging)
Validation checks should be carried out during the process
Verification is also carried out on the Data Warehouse
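As a minimal sketch of the kind of validation check that could run during the process (the connection string, file, table and column names are assumptions used only for illustration):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical source file and target table
engine = create_engine('postgresql://USERNAME:PASSWORD@localhost:5432/DB_NAME')
source = pd.read_csv('dataset1.csv')
target = pd.read_sql_query('SELECT * FROM sales_staging', engine)

# Row-count reconciliation between source and target
assert len(source) == len(target), 'Row counts differ between source and target'

# A key column should never contain nulls after integration
assert target['customer_id'].notna().all(), 'Null keys found in the target table'
```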
Data owners are given the right to decide who can have access to enterprise data.
The process involved may be something like this:
A person (staff member, contractor, supplier, etc) requests access to information
A business resource (Data owner, manager) will review the request
A technical resource (usually a DBA) physically grants permission to an application, database or other data store containing the data.
Often the permission follows a CRUD schema (create, read, update, delete)
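To make the CRUD idea concrete, the sketch below shows how a DBA might grant read-only (the 'R' in CRUD) access to a single PostgreSQL table; the role, table and credentials are hypothetical:

```python
import psycopg2

# Connect as an administrative user (placeholder credentials)
conn = psycopg2.connect(dbname='DB_NAME', user='ADMIN_USER', password='PASSWORD')
cur = conn.cursor()

# Grant read-only access on one table to a hypothetical analyst role
cur.execute("GRANT SELECT ON sales TO analyst_role;")
conn.commit()

cur.close()
conn.close()
```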
A process of Data Integration from Multiple Sources
It gives businesses the ability to gather data from multiple sources and consolidate it into a single, centralised location
This can be hard-coded or done using a licensed product
Data is accessed from the source
For this stage to be effective, a basic understanding of the data is required
There are two methods: hard-coded or tool-based
Commonly done with SQL queries if the data is in databases
For hard-coding in Python:
import pandas as pd

# Read each source into its own DataFrame
data1 = pd.read_csv('dataset1.csv')
data2 = pd.read_json('dataset1.json')

# Inspect the structure of each source
data1.info()
data2.info()
You can connect to a SQL server to extract data
This can be done in Python using psycopg2:
import pandas as pd
import psycopg2

# Connect to the database (credentials are placeholders)
conn = psycopg2.connect(dbname='DB_NAME', user='USERNAME', password='PASSWORD')
cur = conn.cursor()

# List the tables available in the public schema
cur.execute("""SELECT table_name
               FROM information_schema.tables
               WHERE table_schema='public'
               ORDER BY table_name""")
print(cur.fetchall())

# Pull a sample of the SALES table into a DataFrame
query = "SELECT * FROM SALES LIMIT 100"
sales = pd.read_sql_query(query, conn)
Open Jupyter Notebook ETL_python
Complete Section 1: Extraction
Transform the data to be compatible with the target data structure
Sometimes simple, sometimes nearly impossible
Requires detailed requirements elicitation
Transformations could include:
Null Values
# replace nulls with a placeholder value
data.fillna("Missing", inplace=True)
# or drop rows with nulls in a given column
data.dropna(inplace=True, subset=["col_A"])
Convert Datatypes
data["col_A"]=data.col_A.astype("int")
data["col_B"]=data.col_A.astype("float")
data["col_C"]=data.col_A.astype("bool")
Deduplication
# check for duplicates
data.duplicated()
# remove duplicates
data.drop_duplicates(inplace=True)
Rename Fields
data.rename(columns={"col_A": "Col_A"}, inplace=True)
Open Jupyter Notebook ETL_python
Complete Section 2: Transformation
Load the data into the target data structure
Either write code to insert data or make use of application code that already exists
Examples include loading into a database or Data Warehouse
Could involve joining all extracted data into a single table
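A minimal sketch of the load step, assuming the DataFrames from the extraction stage (data1 and data2) have already been transformed and that the target is a PostgreSQL Data Warehouse reached through SQLAlchemy; the connection string and table name are assumptions:

```python
import pandas as pd
from sqlalchemy import create_engine

# Target connection (placeholder credentials)
engine = create_engine('postgresql://USERNAME:PASSWORD@localhost:5432/DB_NAME')

# Join all extracted data into a single table and load it into the warehouse
combined = pd.concat([data1, data2], ignore_index=True)
combined.to_sql('sales_staging', engine, if_exists='replace', index=False)
```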
Open Jupyter Notebook ETL_python
Complete Section 3: Loading
Some tips to help you run your ETL processes efficiently and securely:
If your ETL process is unlikely to be a one-off, it may be more efficient to automate it.
You will need to assess when new data becomes available to determine how often your scripts need to run.
Windows has a 'Task Scheduler' which can run batch files or scripts on a schedule
Scripts should be run out of hours so that the performance of live systems is not affected
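As a sketch (the file names and the placeholder run_etl step are assumptions), the ETL script itself can first check whether new data has actually arrived, so that a scheduler such as Windows Task Scheduler or cron can simply invoke it out of hours:

```python
import os
import time

SOURCE_FILE = 'dataset1.csv'       # hypothetical source file to watch
LAST_RUN_MARKER = 'last_run.txt'   # records when the ETL last ran

def new_data_available():
    """Return True if the source file has changed since the last recorded run."""
    if not os.path.exists(LAST_RUN_MARKER):
        return True
    with open(LAST_RUN_MARKER) as f:
        last_run = float(f.read().strip())
    return os.path.getmtime(SOURCE_FILE) > last_run

if new_data_available():
    # run_etl()  # placeholder for the extract, transform and load steps above
    with open(LAST_RUN_MARKER, 'w') as f:
        f.write(str(time.time()))
```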
Important data about items in an organisation is called Master Data
This includes customer and product information as well as organisational structure
In business, master data management (MDM) comprises the processes, governance, policies, standards and tools that consistently define and manage the critical data of an organisation to provide a single point of reference.
Redundancy Elimination
Master Data Edits
Data Consistency
Access Based on Role
Testing is carried out to ensure that unified data sets are fit for purpose. There are three types:
Technical Acceptance Testing (TAT)
User Acceptance Testing (UAT)
Performance Stress Testing (PST)
Testing scripts to ensure they produce the correct output
Can be done manually or by automation
There are three strategies:
Testing individual functions or lines of code
Uses the Python library unittest
Naming convention test_xxx.py
Run on your command line: python -m unittest
import unittest

def fun(x):
    return x + 1

class MyTest(unittest.TestCase):
    def test(self):
        self.assertEqual(fun(3), 4)
Test files such as test_sum.py and test_average.py can all be run with: python -m unittest
Integration tests verify that different modules or services used by your application work well together.
For example, testing the interaction with a database, i.e. are you able to successfully run queries against it?
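A minimal sketch of such a test, using unittest with an in-memory SQLite database so it stays self-contained (the table and values are made up for illustration):

```python
import sqlite3
import unittest

class TestDatabaseIntegration(unittest.TestCase):
    def setUp(self):
        # An in-memory database stands in for the real target system
        self.conn = sqlite3.connect(':memory:')
        self.conn.execute('CREATE TABLE sales (id INTEGER, amount REAL)')
        self.conn.execute('INSERT INTO sales VALUES (1, 9.99)')

    def test_can_query_sales(self):
        rows = self.conn.execute('SELECT COUNT(*) FROM sales').fetchone()
        self.assertEqual(rows[0], 1)

    def tearDown(self):
        self.conn.close()
```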
End-to-end tests focus on the business requirements of an application. They only verify the output of an action and do not check the intermediate states of the system when performing that action.
Formal tests to verify if a report or system satisfies its business requirements
Can be done manually or by automation
Answers questions like:
Focuses on validating performance characteristics of the product, such as scalability and reliability
They check the behaviour of the system when it is under significant load
Tests are based on non-functional requirements
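A very rough sketch of the idea, firing many concurrent dummy queries and summarising response times (the simulated query is a placeholder; a real test would hit the warehouse or application under test):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(_):
    """Placeholder for a real call, e.g. pd.read_sql_query(...) against the warehouse."""
    start = time.perf_counter()
    time.sleep(0.01)   # simulate query latency
    return time.perf_counter() - start

# Simulate 500 queries from 50 concurrent users and summarise response times
with ThreadPoolExecutor(max_workers=50) as pool:
    timings = list(pool.map(run_query, range(500)))

print(f"avg: {sum(timings)/len(timings):.3f}s  max: {max(timings):.3f}s")
```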