In today’s competitive market, most successful companies respond quickly to market changes and opportunities. The requirement to respond quickly is by effective and efficient use of data and information. “Data Warehouse” is a central repository of data that is organized by category to support the organization’s decision makers. Once data is stored in a data warehouse, it can be accessed for analysis.
The term "Data Warehouse" was first invented by Bill Inmon in 1990. According to him, “Data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.”
Ralph Kimball provided a definition of data warehouse based on its functionality. He said, “Data warehouse is a copy of transaction data specifically structured for query and analysis.”
Data Warehouse (DW or DWH) is a system used for analysis of data and reporting purposes. They are repositories that saves data from one or more heterogeneous data sources. They store both current and historical data and are used for creating analytical reports. DW can be used to create interactive dashboards for the senior management.
For example, analytic reports can contain data for quarterly comparisons or for annual comparison of sales report for a company.
Data in DW comes from multiple operational systems like sales, human resource, marketing, warehouse management, etc. It contains historical data from different transaction systems but it can also include data from other sources. DW is used to separate data processing and analysis workload from transaction workload and enables to consolidate the data from several data sources.
For example − You have a home loan agency, where data comes from multiple SAP/non-SAP applications such as marketing, sales, ERP, HRM, etc. This data is extracted, transformed and loaded into DW. If you have to do quarterly/annual sales comparison of a product, you cannot use an operational database as this will hang the transaction system. This is where the need for using DW arises.
Some of the key characteristics of DW are −
Following are few differences between Data Warehouse and Operational Database (Transaction System) −
Transactional system is designed for known workloads and transactions like updating a user record, searching a record, etc. However, DW transactions are more complex and present a general form of data.
Transactional system contains the current data of an organization whereas DW normally contains historical data.
Transactional system supports parallel processing of multiple transactions. Concurrency control and recovery mechanisms are required to maintain consistency of the database.
Operational database query allows to read and modify operations (delete and update), while an OLAP query needs only read-only access of stored data (select statement).
DW involves data cleaning, data integration, and data consolidations.
DW has a three-layer architecture − Data Source Layer, Integration Layer, and Presentation Layer. The following diagram shows the common architecture of a Data Warehouse system.
Following are the types of DW system −
Data Mart is the simplest form of DW and it normally focuses on a single functional area, such as sales, finance or marketing. Hence, data mart usually gets data only from few data sources.
Sources could be an internal transaction system, a central data warehouse, or an external data source application. De-normalization is the norm for data modeling techniques in this system.
An OLAP system contains less number of transactions but involves complex calculations like use of Aggregations − Sum, Count, Average, etc.
We save tables with aggregated data like yearly (1 row), quarterly (4 rows), monthly (12 rows) and now we want to compare data, like Yearly only 1 row will be processed. However, in an un-aggregated data, all the rows will be processed.
OLAP system normally stores data in multidimensional schemas like Star Schema, Galaxy schemas (with Fact and Dimensional tables are joined in logical manner).
In an OLAP system, response time to execute a query is an effectiveness measure. OLAP applications are widely used by Data Mining techniques to get data from OLAP systems. OLAP databases store aggregated historical data in multi-dimensional schemas. OLAP systems have data latency of a few hours as compared to Data Marts where latency is normally closer to few days.
An OLTP system is known for large number of short online transactions like insert, update, delete, etc. OLTP systems provide fast query processing and also responsible to provide data integrity in multi-access environment.
For an OLTP systems, effectiveness is measured by the number of transactions processed per second. OLTP systems normally contain only current data. The schema used to store transactional databases is the entity model. Normalization is used for data modeling techniques in OLTP system.
The following illustration shows the key differences between an OLTP and OLAP system.
Indexes − In an OLTP system, there are only few indexes while in an OLAP system there are many indexes for performance optimization.
Joins − In an OLTP system, large number of joins and data is normalized; however, in an OLAP system there are less joins and de-normalized.
Aggregation − In an OLTP system, data is not aggregated while in an OLAP database more aggregations are used.