In this post I briefly compare the data structures and operators of relational databases compared to multidimensional databases, and then discuss why data stored in a multidimensional database can be easier to understand and manipulate than data stored in a relational database. I then go on to describe different ways in which data duplication can occur in a multidimensional star schema and discuss why it can be beneficial to store data in this way whilst avoiding the problems often associated with this sort of duplication in relational databases.
At the core of the dimensional model are Facts and Dimensions. A fact is a numerical measure of a subject of interest, for example: company A bought quantity n of product X on date D. A dimension is a perspective by which a organisation can select, group and view these facts, for example: the organisation can view the facts by time or product or customer – or all three – depending on the dimensions defined in their multidimensional model. The multidimensional model is designed primarily for querying historical data (i.e. facts) in to provide answers to questions of a temporal nature, such as “What are the sales of product X per month this year in comparison with the same months over the last 3 years?”.
The aggregated data forming the summarised facts are presented as an n-D (eg 3D) cube of data, with a dimension (at a given level within the concept hierarchy) along each axis, and a numerical fact (or simply nothing) at each intersection. Data is normally stored as a central large fact table, with a number of smaller satellite dimension tables, associated by foreign keys.
By comparison, the relational model stores it’s data in tables (or more precisely Relations). In most cases the data is normalised so that there is no duplication of data between tables. Compared to the multidimensional model, the relational model usually consists of a larger number of tables, with more complex relationships between them. A primary precept of the relational model is that data is always presented as a two-dimensional table – even where there is only a single value.
Even though the raw data of the multidimensional model is usually stored within a relational database, the multidimensional model lends itself to analysis whereas the relational model is more geared towards operational needs, namely regular insertions, updates or deletions of small amounts of data affecting a small number of tables.
Both the multidimensional model and the relational model support aggregation of data in order to satisfy queries. However, the multidimensional model will often calculate and store all possible aggregations whereas the relational model creates only the aggregates requested, does not store them (unless snapshot tables are being used), and recreates the aggregates each time.
In order to manipulate and aggregate data both the multidimensional and relational models support the use of a range of operators. Both models support the use of distributive aggregate operators including: COUNT, SUM, MIN and MAX. However, the multidimensional model supports a range of operations to manipulate data in a cube including:
- Roll-up and Drill-down to ascend and descend the dimension hierarchy.
- The Slice operation to perform a selection on one dimension of a cube.
- The Dice operation to perform a selection on two or more dimensions of a cube.
By comparison, the relational model uses SELECT … WHERE … GROUP BY operators to perform similar functions. In fact, where multidimensional data is stored in a relational database, the RDBMS uses these operators in the background in the generation of the multidimensional data.
Data in a data warehouse built on a multidimensional database will be easier to understand and manipulate than data stored in a data warehouse built on a relational DBMS because the multidimensional DBMS is designed and built specifically for this purpose. The multidimensional data warehouse pre-computes some (or all) of the possible cubes, allowing for fast and simple queries, whereas the relational data warehouse will store most (if not all) the data in it’s more detailed, non-aggregated format, therefore, queries are slower and more complex as data needs to be selected and aggregated each time.
There is often duplication in Time dimension tables. A week can be calculated from other attributes, but since weeks do not align exactly with months a grouping by months cannot be obtained by a grouping by weeks. The duplication is not an issue as it provides analysts with the flexibility to use either option. The same duplication applies to financial months versus calendar months, for example the financial month 1 can equate to calendar month 5 etc.
A second way duplication can occur in a Star schema is where a value for an attribute (other than a key or calculated value) is stored in both the fact table and the dimension table. For example, a business customer is based in London, and this is recorded as an attribute for the customer’s record in the dimension table. When that customer makes a purchase, the city is recorded in the fact table (along with the item, price, quantity, etc). Hence the city data for that customer is duplicated in both the dimension table, and every record in the fact table related to that customer’s purchases. If the customer later relocates to Cardiff, the customer’s record in the dimension table is updated to reflect the new city. By storing the city value in the fact table, we can get a true value for sales in London and Cardiff both before and after the customer re-located, whereas if we simply select and group by the city attribute in the dimension table, the sales figures would be wrong as all the sales for Cardiff would also include all that customers purchases from when they were based in London, and vice versa.