A data warehouse is a system commonly used to connect and analyze business data from disparate sources and to help an organization make decisions.
Data warehouses are central repositories of integrated data from one or more heterogeneous sources.
A data warehouse is considered a core component of business intelligence.
|Enterprise data warehouse architecture. Data warehouse architectural concepts and components. Image Credit: Wikimedia Commons, I have made some modifications in size and color.|
Data Warehouse (DW) - Characteristics, Architecture and Principles
What is a Data Warehouse (DW)?
In computing, a data warehouse (DW) is often considered a business intelligence system used for data analysis and for the day-to-day reporting needs of a business unit.
A data warehouse is a central repository of integrated data from one or more unequal sources.
The data warehouse contains all the data related to the business. Information systems can be categorized in many aspects that can be long or limited. These include classification by purpose of such systems in enterprises.
The focus here will be on both operational information systems and decision support systems.
Operational information systems are concerned with the management of the day-to-day operations of organizations and are the cornerstone of modern enterprises.
Examples of such systems are Accounting systems, hospital patient management systems, application management systems on a commercial website and student registration systems at colleges and institutes.
All these systems are concerned with the day-to-day operations of enterprises and indispensable in modern institutions.
Obviously, operational information systems store and process a large amount of data every day, and that data increases overtime to reach in some institutions to millions of gigabytes and terabytes.
Sometimes such data exist in insurance companies, universities, hospitals, and public government sectors.
Such data is a real asset and is invaluable if it is exploited by enterprises to study and analyze and then presented to the decision-makers in the establishments in a way that enables them to identify the performance of these facilities and formulate long-term strategies for them.
Hence, there was a need only for Decision Support Systems (DSS). One of the most important applications of these systems, which is the main pillar of these systems, is Data Warehousing.
Characteristics of Data warehouse
Data warehouses are systems that are concerned with studying, analyzing and presenting enterprise data in a way that enables senior management to make decisions.
The data warehouses have some characteristics that distinguish them from any other data and these characteristics are as follows:
Subject-Oriented: For example, we define subjects in a medical facility as doctors, nurses, medicines, diseases, and so on. They are sensual and moral things.
Integrated: The relationship between the data and the way in which the data is extracted and modified is the same regardless of the original source.
None-Volatile: None-Volatile: The data warehouse is not editable when loaded in repositories and not updated in real-time.
This data is updated periodically by data uploading, protecting it from the effect of instant change and only used for analysis, study, and presentation.
Time-Variant: This is the most important point that this data is related to the time factor and organized via time periods (weekly, monthly, yearly, etc.), for example, Student Registration Data at the university for over forty years.
5 Steps to Data Warehousing
Building a data warehouse in an organization is not an easy task, and requires a professional team in business information systems analysis; it is also important to have professional technical skills in that team.
Now, we will present the technical steps to build data warehouses, far from the analysis and design phase that is very close to the business view. These steps are arranged as follows:
1. Creating Data Preparation Area: a database with very high storage capacity; It stores all data coming from different Operational Systems so that data is purified and modified before being loaded into the data warehouse.
At this stage, the design of the database should be compatible with the design of the data warehouse.
2. Creating Data Warehouse: A data warehouse is created where data will be uploaded after it is extracted and purified.
Data warehouses are always designed to allow relationships of different dimensions, such as the relationship between the admissions of students to a specialization in a given year.
3. Data Marts: Some large organizations divide the data warehouse into a set of data marts.
There is a Data Mart for financial management data and another for human resources management or division based on the branches of the institution.
4. Extract, Transform, and Load (ETL) Process: At this stage, we extract data from different sources to the Data Preparation Area and we convert it from one image to another if required.
We also often integrate some data with each other or define new data that did not exist before, in addition to clearing the wrong data and deleting unimportant ones.
At the stage of Loading Data, The data is loaded from the Data Preparation Area to Data Warehouse. During this phase, the data is tested.
5. Creating Decision Support Systems (DSS) Applications: At this stage, implement DSS applications for displaying and analyzing the data warehouse, called Online Analytical Processing (OLAP) Systems.
These applications display data in several dimensions and use complex algorithms to analyze data.
At this stage, data mining tools are also used to recognize patterns in the data sets and analyze big data.
Three-Tier Data Warehouse Architecture
Most organizations adopt Three-Tier Architecture in the work of data warehouses.
In this Architecture, the data warehouse system is divided into three tiers (levels); Bottom Tier, Middle Tier, and Top-Tier.
1. Bottom Tier: It is the data warehouse database server and an RDBMS (Relational Database Management System). This may include many specific data marts and metadata repositories.
You can use back end tools and application program interfaces called a gateway to feed data at the bottom level.
The gateways and back end tools perform the extraction, cleaning, loading, and updating functions.
Examples of gateways contain Open-Linking and Embedding for Databases (OLE-DB), Open Database Connection (ODBC) and Java Database Connection (JDBC).
2. Middle Tier: A middle-tier consists of an online analytical processing (OLAP) server which can be executed for a quick query of the data warehouse. The OLAP server is implemented in the following ways:
I. Using the ROLAP (Relational online analytical processing) model, i.e., an extended relational database management system (DBMS) that assigns tasks to multidimensional data for standard relational operations.
II. Using the MOLAP (multidimensional online analytics processing) model, ie a special-purpose server that directly implements multidimensional databases and operations.
3. Top-Tier: A top-tier consists of front-end tools that contain reporting tools analysis tools and query tools to display the results provided by online analytical processing systems as well as additional tools for data mining.
Principles of Data Warehousing
There are some data warehouse principles that may include:
In a data warehouse, several phases and steps must be taken to load new data and process it including filtering, reformatting, data conversion, indexing, and metadata updates.
Data warehouses need to increase the loading of new data from time to time within narrow time windows.
Load performance should be measured in gigabytes per hour and in hundreds of millions of rows and should not artificially disrupt the data business volume.
Data Quality Management
The data warehouse verifies contextual integrity, local stability, and global consistency despite "dirty" sources and large-scale database size.
This is only useful and valuable to the extent if business stakeholders trust the data and its resources as a fact-based management system requires the highest quality of the data.
Adaptability is critical to the development of business requirements.
The business intelligence tools available in the market must be taken into account to adapt to often unexpected changes in business demands.
It is also very important to keep in mind the pace at which BI tools evolve to include additional functions and features.
In data warehouses, adaptability requires a principle and method to use alternative BI tools in the future such as various back-end or visualization tools.
Reliable database management and the fact-based direction should not be decelerated by the performance of RDBMS (relational database management system). Massive and complex queries should be completed in seconds, not hours or days.
A Terabyte contains more than a trillion bytes of digital information - 1015 bytes.
Today, the size of the data warehouses is evolving at staggering rates. This ranges from a few bytes to hundreds of terabytes and gigabytes sized data warehouses.
So scalability should be taken into account in measurement.
Finally, there are many companies that provide systems and tools for creating data warehouses and extracting data from their sources, most notably Oracle and its Architectures and Solutions such as Oracle Data Warehouse Builder and Oracle Data Mining, etc.