Operational databases have a fixed set of operations applied to them, and well-defined techniques, such as using normalized data and keeping tables small, are suitable for delivering such a solution. But in the case of decision-support systems, we do not know which queries and operations will need to be executed in the future. Therefore, the techniques applied to operational databases are not suitable for data warehouses.
In this chapter, we will discuss how to build data warehousing solutions on top of open-system technologies like Unix and relational databases.
There are four major processes that contribute to a data warehouse −

- Extract and load the data.
- Clean and transform the data.
- Back up and archive the data.
- Manage queries and direct them to the appropriate data sources.
Data extraction takes data from the source systems. Data load takes the extracted data and loads it into the data warehouse.
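As a quick illustration, here is a minimal sketch of the extract-and-load step in Python, using sqlite3 purely as a stand-in for the source and warehouse systems; the table and column names (customers, stg_customers, updated_at) are illustrative assumptions, not part of the original text.

```python
import sqlite3

# Minimal sketch of extract-and-load; sqlite3 is only a stand-in.
# Table/column names are assumptions for illustration.

def extract(source_conn, since):
    # Pull only the rows changed since the last extraction cut-off.
    cur = source_conn.execute(
        "SELECT customer_id, name, updated_at FROM customers"
        " WHERE updated_at > ?", (since,))
    return cur.fetchall()

def load(warehouse_conn, rows):
    # Append the extracted rows to a staging table in the warehouse.
    warehouse_conn.executemany(
        "INSERT INTO stg_customers (customer_id, name, updated_at)"
        " VALUES (?, ?, ?)", rows)
    warehouse_conn.commit()
```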
Note − Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.
Controlling the process involves determining when to start data extraction and running consistency checks on the data. The controlling process ensures that the tools, logic modules, and programs are executed in the correct sequence and at the correct time.
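To make the sequencing idea concrete, here is a minimal sketch of a controlling process that runs each stage in a fixed order and aborts the run on any failure; the stage names and structure are assumptions for illustration.

```python
# Minimal sketch of a controlling process: each stage is a callable, run
# in a fixed order; any failure aborts the run so the warehouse is never
# left half-updated. Stage names below are illustrative assumptions.

def run_pipeline(stages):
    for name, stage in stages:
        print(f"starting stage: {name}")
        try:
            stage()
        except Exception as exc:
            print(f"stage {name} failed: {exc}; aborting the run")
            raise

# Example sequencing (the stage functions are hypothetical):
# run_pipeline([("extract", extract_all), ("load", load_staging),
#               ("check", run_consistency_checks)])
```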
Data needs to be in a consistent state when it is extracted, i.e., the data warehouse should represent a single, consistent version of the information to the user.
For example, in a customer profiling data warehouse in the telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding customers for whom there are no associated subscriptions.
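One way to avoid exactly this mismatch is to fix a single cut-off timestamp before extraction begins and apply it to every source. The sketch below assumes each source exposes an extractor callable that honors the shared cut-off; the names are hypothetical.

```python
from datetime import datetime, timezone

# Minimal sketch: fix one cut-off timestamp before extraction starts and
# apply it to every source, so customers and subscription events describe
# the same point in time. The extractor callables are hypothetical.

def extract_consistent(sources):
    cutoff = datetime.now(timezone.utc)
    snapshot = {}
    for name, extract_up_to in sources.items():
        # Each extractor returns only rows created or updated at or
        # before the shared cut-off; never "fresher" data from one source.
        snapshot[name] = extract_up_to(cutoff)
    return cutoff, snapshot
```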
After extraction, the data is loaded into a temporary data store, where it is cleaned up and made consistent.
Note − Consistency checks are executed only when all the data sources have been loaded into the temporary data store.
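A minimal sketch of such a gate, assuming two hypothetical sources and dict-shaped staged rows: the cross-source check runs only once every expected source has landed in the temporary store.

```python
# Cross-source checks run only after every expected source has landed in
# the temporary store. Source names and row shapes are assumptions.

EXPECTED_SOURCES = {"customers", "subscriptions"}

def run_consistency_checks(staging, loaded_sources):
    missing = EXPECTED_SOURCES - loaded_sources
    if missing:
        raise RuntimeError(f"not all sources loaded yet, missing: {missing}")
    # Example cross-source check: every subscription must reference a
    # customer extracted in the same run.
    customer_ids = {c["customer_id"] for c in staging["customers"]}
    orphans = [s for s in staging["subscriptions"]
               if s["customer_id"] not in customer_ids]
    if orphans:
        raise ValueError(f"{len(orphans)} subscriptions lack a customer")
```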
Once the data is extracted and loaded into the temporary data store, it is time to perform Cleaning and Transforming. Here is the list of steps involved in Cleaning and Transforming −

- Clean and transform the loaded data into a structure.
- Partition the data.
- Aggregate the data.
Cleaning and transforming the loaded data helps speed up the queries. It can be done by making the data consistent −

- within itself;
- with other data within the same data source;
- with the data in other source systems;
- with the existing data present in the warehouse.
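As an illustration, here is a minimal cleaning sketch that makes staged customer rows consistent with one another by dropping duplicates and normalizing formats; the field names are assumptions.

```python
# Minimal cleaning sketch: drop duplicate rows and normalize formats so
# the staged data is consistent with itself. Field names are assumptions.

def clean_customers(rows):
    seen, cleaned = set(), []
    for row in rows:
        key = row["customer_id"]
        if key in seen:                  # duplicate across extracts
            continue
        seen.add(key)
        cleaned.append({
            "customer_id": key,
            "name": row["name"].strip().title(),        # one name format
            "country": row["country"].strip().upper(),  # one country format
        })
    return cleaned
```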
Transforming involves converting the source data into a structure suited to querying. Structuring the data increases query performance and decreases the operational cost. The data contained in a data warehouse must be transformed to support performance requirements and control the ongoing operational costs.
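For example, a common target structure is a star schema. The sketch below, built on an assumed schema of one fact table and one customer dimension, shows the shape of such a transformation; the text above does not prescribe this particular structure.

```python
# Minimal sketch of transforming cleaned rows into a star-schema shape:
# one fact table plus one customer dimension. The schema is an assumption.

def to_star_schema(sales_rows):
    dim_customer = {}   # customer_id -> dimension record (deduplicated)
    fact_sales = []     # one row per sale, keyed to the dimension
    for row in sales_rows:
        dim_customer.setdefault(row["customer_id"], {
            "customer_id": row["customer_id"],
            "name": row["name"],
        })
        fact_sales.append({
            "customer_id": row["customer_id"],
            "sale_date": row["sale_date"],
            "amount": row["amount"],
        })
    return fact_sales, list(dim_customer.values())
```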
Partitioning will optimize the hardware performance and simplify the management of the data warehouse. Here we partition each fact table into multiple separate partitions.
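A minimal sketch of horizontal partitioning by month, assuming each fact row carries an ISO date string; each monthly partition can then be queried, loaded, or archived independently.

```python
from collections import defaultdict

# Minimal sketch of partitioning a fact table by month; assumes each fact
# row carries an ISO date string such as "2024-03-15".

def partition_by_month(fact_rows):
    partitions = defaultdict(list)
    for row in fact_rows:
        partitions[row["sale_date"][:7]].append(row)  # key: "2024-03"
    return dict(partitions)
```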
Aggregation is required to speed up common queries. Aggregation relies on the fact that most common queries will analyze a subset or an aggregation of the detailed data.
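For instance, if month-level sales totals are a common query, they can be precomputed once, as in this sketch (field names assumed as before):

```python
from collections import defaultdict

# Minimal sketch: precompute month-level sales totals once, so common
# monthly queries read this small summary instead of the detailed facts.

def aggregate_monthly_sales(fact_rows):
    totals = defaultdict(float)
    for row in fact_rows:
        totals[row["sale_date"][:7]] += row["amount"]
    return dict(totals)   # e.g. {"2024-03": 12345.0, ...}
```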
In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to keep regular backups. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required.
For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data being kept online. In such a scenario, there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case, we require some data to be restored from the archive.
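A minimal sketch of this retention scheme, modelling the archive as a plain dict standing in for offline storage: the latest months stay online, older partitions are moved out, and an archived month can be restored on demand.

```python
# Minimal sketch of the retention scheme: keep the latest months online,
# move older partitions to the archive, and restore a month on demand.
# The archive dict is a stand-in for real offline storage.

def archive_old_partitions(online, archive, keep_months=6):
    for month in sorted(online)[:-keep_months]:
        archive[month] = online.pop(month)   # move the partition offline

def restore_partition(online, archive, month):
    if month not in online and month in archive:
        online[month] = archive[month]       # bring it back for the query
```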
This process performs the following functions −
- manages the queries.
- helps speed up the execution time of queries.
- directs the queries to their most effective data sources.
- ensures that all the system sources are used in the most effective way.
- monitors actual query profiles.
The information generated in this process is used by the warehouse management process to determine which aggregations to generate. This process does not generally operate during the regular load of information into the data warehouse.
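Putting the pieces together, here is a minimal sketch of a query manager that routes month-level queries to the precomputed aggregation when one exists, falls back to the detailed partition otherwise, and records a simple query profile for the warehouse management process; all names are illustrative assumptions.

```python
# Minimal sketch of query management: route a month-level query to the
# precomputed aggregation when one exists, fall back to the detailed
# partition otherwise, and record a profile of what was asked. The
# profile list is what warehouse management would later consume.

query_profile = []

def run_monthly_total(month, monthly_totals, fact_partitions):
    query_profile.append({"granularity": "month", "month": month})
    if month in monthly_totals:
        return monthly_totals[month]   # cheapest effective data source
    return sum(r["amount"] for r in fact_partitions.get(month, []))
```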