A
Data Warehouse is a electronically stored data, is designed for Reporting and
Analysis.
DWH
Definition:
A
data warehouse is a relational database that is designed for query and analysis
rather than for transaction processing. It usually contains historical data
derived from transaction data, but it can include data from other sources.
Data warehouse is Subject Oriented, Integrated,
Time-Variant and Non-volatile collection of data that support management's
decision making process.
In
addition to a relational database, a data warehouse environment includes an
extraction, transportation, transformation, and loading (ETL) solution, an
Online analytical processing (OLAP) engine, client analysis tools, and
other applications that manage the process of gathering data and delivering it
to business users.
Data
warehousing arises in an organization’s need for reliable, consolidated, unique
and integrated reporting and analysis of its data, at different levels of
aggregation.
The most practical reality of most organization is that their
infrastructure is made-up by a collection of heterogeneous systems. Organization
system handles on customer relationship, a system that handles employees,
system that handles the sales data or production data, yet another finance data
and budgeting data...etc. Dwh efficiently answers to an organization like how much time did a sales rep has interacted with the customer and what quantity of orders he has bagged from that customer within given period of time ? Daywise employees performance within the departments? To all these questions has effectively answered by dwh system.
The term "Data Warehouse" was first
coined by Bill Inmon in 1990. He said that Data warehouse is subject Oriented,
Integrated, Time-Variant and nonvolatile collection of data.This data helps in
supporting decision making process by analyst in an organization
A data warehouse
maintains its functions in three layers: staging, integration, and access.
Staging is used to store raw data for use by developers (analysis and support).
The integration layer is used to integrate data and to have a level of
abstraction from users. The access layer is for getting data out for users.
1. Ralph Kimball's paradigm:
Data warehouse is the conglomerate of all data marts within the enterprise.
Information is always stored in the dimensional model.
Definition as Per Ralph Kimball:
A data warehouse is a copy of transaction data specifically structured for
query and analysis.
His Approach
towards towards the Data warehouse Design is Bottom-Up.In the bottom-up
approach data marts are first created to provide reporting and analytical
capabilities for specific business processes
2.Bill Inmon's paradigm:
Data warehouse is one part of the overall business intelligence system. An
enterprise has one data warehouse, and data marts source their information from
the data warehouse. In the data warehouse, information is stored in 3rd normal
form.
Definition as Per Bill Inmon:
A data warehouse
is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process.
Subject-Oriented:
A data warehouse can be used to analyze a particular subject area. For example,
"sales" can be a particular subject.
Integrated: A data
warehouse integrates data from multiple data sources. For example, source A and
source B may have different ways of identifying a product, but in a data
warehouse, there will be only a single way of identifying a product.
Time-Variant:
Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older data from a data warehouse.
This contrasts with a transactions system, where often only the most recent
data is kept. For example, a transaction system may hold the most recent
address of a customer, where a data warehouse can hold all addresses associated
with a customer.
Non-volatile:
Once data is in the data warehouse, it will not change. So, historical data in
a data warehouse should never be altered.
His
Approach towards towards the Data warehouse Design is Top-Down. In top-down
approach to data warehouse design, in which the data warehouse is designed
using a normalized enterprise data model. "Atomic" data, that is,
data at the lowest level of detail, are stored in the data warehouse.
OLTP
|
OLAP
|
|
Source of data
|
Operational data; OLTPs
are the original source of the data
|
Consolidation data; OLAP
data comes from the various OLTP Databases
|
Purpose of data
|
To control and run
fundamental business tasks
|
To help with planning,
problem solving, and decision support
|
What the data
|
Reveals a snapshot of
ongoing business processes
|
Multi-dimensional views
of various kinds of business activities
|
Inserts and Updates
|
Short and fast inserts
and updates initiated by end users
|
Periodic long-running
batch jobs refresh the data
|
Processing Speed
|
Typically very fast
|
Depends on the amount of
data involved; batch data refreshes and complex queries may take many hours;
query speed can be improved by creating indexes
|
Queries
|
Relatively standardized
and simple queries Returning relatively few records
|
Often complex queries
involving aggregations
|
Space Requirements
|
Can be relatively small
if historical data is archived
|
Larger due to the
existence of aggregation structures and history data; requires more indexes
than OLTP
|
Database Design
|
Highly normalized with
many tables
|
Typically de-normalized
with fewer tables; use of star and/or snowflake schemas
|
Backup and Recovery
|
Backup religiously;
operational data is critical to run the business, data loss is likely to
entail significant monetary loss and legal liability
|
Instead of regular
backups, some environments may consider simply reloading the OLTP data as a
recovery method.
|
- Data Warehouse Architecture
In general, all data warehouse systems
have the following layers:
Data Source Layer
Data Extraction Layer
Staging Area
ETL Layer
Data Storage Layer
Data Logic Layer
Data Presentation Layer
Metadata Layer
System Operations Layer
Data Source Layer
This represents the different data
sources that feed data into the data warehouse. The data source can be of any
format -- plain text file, relational database, other types of database, Excel
file, etc., can all act as a data source.
Many different types of data can be a
data source:
Operations -- such as sales data, HR data, product data, inventory data,
marketing
Data, systems data.
Web server logs with user browsing data.
Internal market research data.
Third-party data, such as census data, demographics data, or survey data.
All these data sources together form
the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source
into the data warehouse system. There is likely some minimal data cleansing,
but there is unlikely any major data transformation.
Staging Area
This is where data sits prior to being
scrubbed and transformed into a data warehouse / data mart. Having one common
area makes it easier for subsequent data processing / integration.
ETL Layer
This is where data gains its
"intelligence", as logic is applied to transform the data from a
transactional nature to an analytical nature. This layer is also where data
cleansing happens. The ETL design phase is often the most
time-consuming phase in a data warehousing project, and an ETL
tool
is often used in this layer.
Data Storage Layer
This is where the transformed and
cleansed data sit. Based on scope and functionality, 3 types of entities can be
found here: data warehouse, data mart, and operational data store (ODS). In any
given system, you may have just one of the three, two of the three, or all
three types.
Data Logic Layer
This is where business rules are
stored. Business rules stored here do not affect the underlying data
transformation rules, but do affect what the report looks like.
Data Presentation Layer
This refers to the information that
reaches the users. This can be in a form of a tabular / graphical report in a
browser, an emailed report that gets automatically generated and sent every day,
or an alert that warns users of exceptions, among others. Usually an OLAP
tool
and/or a reporting tool is used in this
layer.
Metadata Layer
This is where information about the
data stored in the data warehouse system is stored. A logical data model would
be an example of something that's in the metadata layer. A metadata
tool
is often to use to manage metadata.
System Operations Layer
Data Warehouse Applications
As discussed before Data Warehouse helps the business executives
in organize, analyze and use their data for decision making. Data Warehouse
serves as a soul part of a plan-execute-assess "closed-loop" feedback
system for enterprise management. Data Warehouse is widely used in the
following fields:
·
financial services
·
Banking Services
·
Consumer goods
·
Retail sectors.
·
Controlled manufacturing
Data Warehouse Types
Information
processing, Analytical processing and Data Mining are the three types of data
warehouse applications that are discussed below:
·
Information processing - Data Warehouse allow us to
process the information stored in it. The information can be processed by means
of querying, basic statistical analysis, reporting using crosstabs, tables,
charts, or graphs.
·
Analytical Processing - Data Warehouse supports
analytical processing of the information stored in it. The data can be analyzed
by means of basic OLAP operations, including slice-and-dice, drill down, drill
up, and pivoting.
·
Data Mining - Data Mining supports
knowledge discovery by finding the hidden patterns and associations,
constructing analytical models, performing classification and prediction. These
mining results can be presented using the visualization tools.

No comments:
Post a Comment