A data warehouse is an organized collection of large amounts of structured data. It is a database designed and intended to support decision making in organizations. It is usually batch updated and structured for rapid online queries and managerial summaries of its contents. According to Bill Inmon (1993), who is often called the “father” of data warehousing, “a data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data.” Ralph Kimball (1996), another data warehousing pioneer, notes, “A data warehouse is a copy of transaction data specifically structured for query and analysis.” In short, a data warehouse, or a single-subject data mart, stores data to support data-driven decisions.
How is the data stored in a data warehouse or data mart?
Data is stored in databases (we cover the details of databases in another chapter). For now, it is enough to say that data is stored in tables in a relational database management system (RDBMS). To analyze data, it must be stored and presented in a way that users can actually work with. Database researchers and practitioners therefore developed data modeling concepts that let users investigate data quickly and easily. These concepts are called modeling schemas, and they simplify the modeling effort for data administrators and modelers.
Data warehouse with normalized data, data marts with denormalized data
Data in a data warehouse is normalized, which means that it is organized into columns and tables to minimize data redundancy. Normalization decomposes a table into smaller, less redundant tables without losing information, and then links the data back together by defining foreign keys in the original table that reference the primary keys of the new ones. The objective is to isolate data so that additions, deletions, and modifications of an attribute can be made in just one table and then propagated through the rest of the database via the defined foreign keys.
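The idea can be sketched with a small, hypothetical example (all table and column names are invented for illustration): customer attributes that would otherwise be repeated on every order row are moved into their own table, and the two tables are linked by a foreign key.

```python
import sqlite3

# Hypothetical normalized design: customer attributes live in exactly
# one place instead of being repeated on every order row.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    city TEXT)""")
cur.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount REAL)""")

cur.execute("INSERT INTO customers VALUES (1, 'Alice', 'Berlin')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, 1, 19.99), (11, 1, 5.50)])

# Updating the customer's city touches one row, not every order row.
cur.execute("UPDATE customers SET city = 'Munich' WHERE customer_id = 1")

# A join restores the combined view without storing anything twice.
rows = cur.execute("""SELECT o.order_id, c.name, c.city, o.amount
                      FROM orders o
                      JOIN customers c ON o.customer_id = c.customer_id""").fetchall()
print(rows)
```

The update statement is the point of the exercise: because the city is stored once, the change propagates to every joined result automatically, which is exactly the anomaly-avoidance the text describes.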
Edgar F. Codd, the inventor of the relational model (RM), introduced the concept of normalization and what we now know as the First normal form (1NF) in 1970. Codd went on to define the Second normal form (2NF) and Third normal form (3NF) in 1971, and Codd and Raymond F. Boyce defined the Boyce-Codd Normal Form (BCNF) in 1974. Informally, a relational database table is often described as “normalized” if it meets Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies.
Data in data marts is denormalized, which means that it is deliberately stored without minimizing data redundancy. The goal of a data mart is to support the analysis process. For data marts, there are three common modeling schemas: the star schema, the snowflake schema, and the galaxy schema.
Stars, Snowflakes and Galaxies
In data marts built using an RDBMS, the most common data model is the star schema; a related model is the snowflake schema, and a more advanced one is the galaxy schema. A star schema is organized around a central fact table that is joined to a number of dimension tables through foreign key references. The fact table contains measures such as price, discount values, number of units sold, and dollar value of sales. It often holds summarized and aggregated data, and it is usually very large, above all in its number of records. The basic premise of a star schema is that information can be classified into two groups: facts and dimensions. Facts are the core data elements being analyzed; for example, units of individual items sold are facts. Dimensions are attributes about the facts. The name star schema comes from the pattern the entities and relationships form when drawn as an entity-relationship diagram: the results of a specific business activity sit at the center of the schema, surrounded by dimension tables holding data on the people, places, and things that come together to perform that activity. These dimension tables are the points of the star.
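A minimal star schema can be sketched as follows (the fact and dimension tables, and all names in them, are invented for illustration): one fact table of sales measures surrounded by product and store dimensions, queried with the typical join-and-aggregate pattern.

```python
import sqlite3

# Hypothetical star schema: a central fact table (fact_sales) joined to
# two dimension tables (dim_product, dim_store) via foreign keys.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)""")
cur.execute("""CREATE TABLE dim_store (
    store_id INTEGER PRIMARY KEY, city TEXT)""")
cur.execute("""CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id INTEGER REFERENCES dim_store(store_id),
    units_sold INTEGER,
    revenue REAL)""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture')])
cur.executemany("INSERT INTO dim_store VALUES (?, ?)",
                [(1, 'Berlin'), (2, 'Hamburg')])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 3, 3000.0), (1, 2, 2, 2000.0), (2, 1, 5, 1000.0)])

# A typical analytical query: join the fact table to a dimension and
# aggregate the facts by a dimension attribute.
rows = cur.execute("""SELECT p.category, SUM(f.units_sold), SUM(f.revenue)
                      FROM fact_sales f
                      JOIN dim_product p ON f.product_id = p.product_id
                      GROUP BY p.category
                      ORDER BY p.category""").fetchall()
print(rows)  # [('Electronics', 5, 5000.0), ('Furniture', 5, 1000.0)]
```

Note that each dimension is a single, flat table joined directly to the fact table; that is what distinguishes the star from the snowflake discussed next.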
How does a snowflake schema differ from a star schema? A snowflake schema extends a star schema with additional secondary dimension tables. In a star schema, each dimension is typically stored in one table; the snowflake design expands a dimension and creates a table for each level of the dimensional hierarchy. For example, a region dimension may contain the levels Street, City, State, and Country. In a star schema, all these attributes would be stored in one table; in a snowflake schema, a designer would expand the schema with secondary city and state tables.
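Using the region example above, a snowflaked dimension might look like this sketch (table and column names are invented for illustration): each hierarchy level becomes its own table, chained together by foreign keys.

```python
import sqlite3

# Hypothetical snowflaked region dimension: instead of one wide table
# holding street, city, state, and country, each level of the hierarchy
# gets its own table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE dim_state (
    state_id INTEGER PRIMARY KEY, state TEXT, country TEXT)""")
cur.execute("""CREATE TABLE dim_city (
    city_id INTEGER PRIMARY KEY, city TEXT,
    state_id INTEGER REFERENCES dim_state(state_id))""")
cur.execute("""CREATE TABLE dim_region (
    region_id INTEGER PRIMARY KEY, street TEXT,
    city_id INTEGER REFERENCES dim_city(city_id))""")

cur.execute("INSERT INTO dim_state VALUES (1, 'Bavaria', 'Germany')")
cur.execute("INSERT INTO dim_city VALUES (1, 'Munich', 1)")
cur.execute("INSERT INTO dim_region VALUES (1, 'Main Street', 1)")

# Resolving a full region now requires walking the chain of secondary
# tables, trading query simplicity for less redundancy in the dimension.
row = cur.execute("""SELECT r.street, c.city, s.state, s.country
                     FROM dim_region r
                     JOIN dim_city c ON r.city_id = c.city_id
                     JOIN dim_state s ON c.state_id = s.state_id""").fetchone()
print(row)  # ('Main Street', 'Munich', 'Bavaria', 'Germany')
```

The extra joins are the practical cost of snowflaking, which is why many designers keep dimensions flat (star-style) unless the hierarchy tables are large or frequently reused.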
And what is a galaxy schema? Like the star and snowflake schemas, it is a data model used in the OLAP world. A galaxy schema contains more than one fact table, each connected to the surrounding dimension tables. Dimension tables that are used by multiple fact tables are called conformed dimensions; the concept of conformed dimensions will be explained later.
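A tiny sketch of the idea (all names are invented for illustration): two fact tables, sales and inventory, share a single product dimension, which therefore acts as a conformed dimension.

```python
import sqlite3

# Hypothetical galaxy schema: two fact tables share the same product
# dimension, making dim_product a conformed dimension.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, name TEXT)""")
cur.execute("""CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER)""")
cur.execute("""CREATE TABLE fact_inventory (
    product_id INTEGER REFERENCES dim_product(product_id),
    units_in_stock INTEGER)""")

cur.execute("INSERT INTO dim_product VALUES (1, 'Laptop')")
cur.execute("INSERT INTO fact_sales VALUES (1, 5)")
cur.execute("INSERT INTO fact_inventory VALUES (1, 12)")

# Because both fact tables reference the same dimension, measures from
# two different business processes can be compared product by product.
row = cur.execute("""SELECT p.name, s.units_sold, i.units_in_stock
                     FROM dim_product p
                     JOIN fact_sales s ON s.product_id = p.product_id
                     JOIN fact_inventory i ON i.product_id = p.product_id""").fetchone()
print(row)  # ('Laptop', 5, 12)
```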
To summarize: a data warehouse is a collection of data stored in a normalized way that minimizes data redundancy. A data mart is a subset of the data warehouse's data and stores it denormalized. There are three common modeling schemas for implementing data models, and each of them has different advantages and disadvantages.
About Tobias Riedner
Tobias Riedner founded WYCDWD.com in 2015. He works and has worked as an innovator, consultant, analyst, and educator in the fields of business intelligence and data warehousing. He learned a lot from the best consultants, managers, and educators over the years and shares his knowledge worldwide. He works for a steadily growing traditional company that is a leader in Industry 4.0.