About David Jones

Profile Picture David is a Software Developer from the UK with over 12 years experience. He holds a bachelors degree in Economics, and a masters degree in Software Development, from which many of the pages on this site have evolved. His interests include all things C#/.NET along with software frameworks, architectures and design patterns.

When not developing software David plays bass in a psychedelic rock band who are currently recording their first album.

Arguments for and against the use of NULL in Relational Database Systems

by David Jones 7. June 2010 05:21

In both the Relational Model, and its practical realisation by SQL, null is used to represent “missing”, “inapplicable” or “unknown” information. For example, a record might exist for a given Part Number, but there is no attribute/value given for its weight – it is not zero grams, it is unknown. In terms of Three-Value Logic, the result of a predictate is one of: True, False or Unknown. In the Relational Model and SQL, null is not a value, but is a flag or marker, independent of any data type, that there is no value. However, in practice it is often incorrectly treated as a value.

Null is represented in the Relational Model and SQL by the keyword NULL, and as it is not a value it cannot be specified as the result of a search condition, so thus requires a mechanism to refer to it specifically: SQL uses IS NULL, a Relational Model tuple containing null values would be: <David, Jones, null, 35, null>.

Codd (1985) published a set of 13 rules to determine whether a DBMS can be considered Relational. Codd’s Rule 3 specified The Systematic Treatment of Null Values, distinct from any empty character string, zero, or any other number.

C.J. Date on the other hand, argues that nulls are a mistake because they are not values, and have no place in the Relational Model, instead he proposes using Special Values in their place, claiming that this is what we do in the real world, and that there is no such thing as null in the real world (Date, 2004). In this context, a special value would be a value (of the correct data type) inserted when no other value is available/appropriate. For example, a record might exist to indicate that a student has submitted an assignment, which records such information as the student’s name and date of submission. However, if the assignment has not yet been marked, the student’s score would be recorded as a value outside of the range of normal values to indicate this, such as -1 for a paper with possible scores of 1 – 100 marks. When using special values, it is imperative that the special value cannot represent any valid value for the attribute being represented.

A distinct disadvantage with the use of special values becomes apparent with SQL functions such as MAX(), MIN() and AVG(). SQL will skip any rows containing null in a field used as part of a search expression, therefore not including the row/value in any mathematical calculations. However, returning to the above example of student assignments, MIN(score) would return the special value of -1 as it’s a legitimate value for the data type. Similarly, AVG(score) would also include the -1, even though it clearly should not as the -1 is being used to indicate that the assignment hasn’t yet been marked. Neither the -1 nor the row itself should be included in the calculation of the average.

This means that the special value must be known and catered for in the formulation of the query and/or interpretation of the results, eg:

SELECT AVG(score)
FROM student_assignments
WHERE score > -1

In contrast to the explicit inclusion of the search condition shown above, SQL would automatically ignore any rows where the score was null. Special values require the end-user to have an intimate knowledge of the database and the special values in use, which can vary from database to database and even table to table.

Special values can lead to inconsistencies and anomalies in the data, especially where they are input by the end user. For example, a record may have fields for FirstName, MiddleName, Surname. Not everyone has a middle name so a special value is required to take the place of the null, this could be: N/A, na, none, etc. This special value must then be specifically taken into account to prevent issues such as letters addressed to “Dear David None Jones”, or worse still “Dear Mr Deceased”, which has happened on numerous occasions where end users have inserted a special value which is meaningful to themselves but not the DBMS! These issues can however be mitigated to a certain extent by the use of SQL Default values, e.g.:

CREATE TABLE customer (
…
MiddleName VARCHAR(30) DEFAULT “None”,
… );

The use of special values in place of nulls also works against standardisation, integration and interoperability. Consider ODBC which provides a standardised interface to any DBMS with a suitable ODBC Driver. The end user application can make a call to the ODBC interface to find out whether a column value is null (eg SQL’s IS NULL), irrespective of the backend DBMS, and the driver will handle the ‘translation’ of the call. The DBMS could be replaced with an entirely different implementation but the call remains unaffected – it’s the job of the DBMS manufacturer’s ODBC driver to handle the communication between the ODBC interface and the DBMS. This DBMS independence would be lost, or at least severely impaired, through the use of special values instead of nulls.

CJ Date argues that nulls have no place in the Relational Model because they do not exist in the real world. However, the Relational Model is conceptual whereas relational databases do exist and play a mission-critical role in the real world and the concept/use of nulls aid DBMS in this role, along with the people who use them – one of the essential components of any Information System.

References

Webopedia, n.d.. What is Codd's Rules? [Online] Available at http://www.webopedia.com/TERM/C/Codds_Rules.html [Accessed 6 June 2010].

Date, C.J., 2004. “An Introduction to Database Systems” Eighth Edition, page 591, Addison-Wesley

Tags:

Database Systems

Comparing Relational Databases and Multidimensional Databases

by David Jones 23. May 2010 06:20

In this post I briefly compare the data structures and operators of relational databases compared to multidimensional databases, and then discuss why data stored in a multidimensional database can be easier to understand and manipulate than data stored in a relational database. I then go on to describe different ways in which data duplication can occur in a multidimensional star schema and discuss why it can be beneficial to store data in this way whilst avoiding the problems often associated with this sort of duplication in relational databases.

At the core of the dimensional model are Facts and Dimensions. A fact is a numerical measure of a subject of interest, for example: company A bought quantity n of product X on date D. A dimension is a perspective by which a organisation can select, group and view these facts, for example: the organisation can view the facts by time or product or customer – or all three – depending on the dimensions defined in their multidimensional model. The multidimensional model is designed primarily for querying historical data (i.e. facts) in to provide answers to questions of a temporal nature, such as “What are the sales of product X per month this year in comparison with the same months over the last 3 years?”.

The aggregated data forming the summarised facts are presented as an n-D (eg 3D) cube of data, with a dimension (at a given level within the concept hierarchy) along each axis, and a numerical fact (or simply nothing) at each intersection. Data is normally stored as a central large fact table, with a number of smaller satellite dimension tables, associated by foreign keys.

By comparison, the relational model stores it’s data in tables (or more precisely Relations). In most cases the data is normalised so that there is no duplication of data between tables. Compared to the multidimensional model, the relational model usually consists of a larger number of tables, with more complex relationships between them. A primary precept of the relational model is that data is always presented as a two-dimensional table – even where there is only a single value.

Even though the raw data of the multidimensional model is usually stored within a relational database, the multidimensional model lends itself to analysis whereas the relational model is more geared towards operational needs, namely regular insertions, updates or deletions of small amounts of data affecting a small number of tables.

Both the multidimensional model and the relational model support aggregation of data in order to satisfy queries. However, the multidimensional model will often calculate and store all possible aggregations whereas the relational model creates only the aggregates requested, does not store them (unless snapshot tables are being used), and recreates the aggregates each time.

In order to manipulate and aggregate data both the multidimensional and relational models support the use of a range of operators. Both models support the use of distributive aggregate operators including: COUNT, SUM, MIN and MAX. However, the multidimensional model supports a range of operations to manipulate data in a cube including:

  • Roll-up and Drill-down to ascend and descend the dimension hierarchy.
  • The Slice operation to perform a selection on one dimension of a cube.
  • The Dice operation to perform a selection on two or more dimensions of a cube.

By comparison, the relational model uses SELECT … WHERE … GROUP BY operators to perform similar functions. In fact, where multidimensional data is stored in a relational database, the RDBMS uses these operators in the background in the generation of the multidimensional data.

Data in a data warehouse built on a multidimensional database will be easier to understand and manipulate than data stored in a data warehouse built on a relational DBMS because the multidimensional DBMS is designed and built specifically for this purpose. The multidimensional data warehouse pre-computes some (or all) of the possible cubes, allowing for fast and simple queries, whereas the relational data warehouse will store most (if not all) the data in it’s more detailed, non-aggregated format, therefore, queries are slower and more complex as data needs to be selected and aggregated each time.

There is often duplication in Time dimension tables. A week can be calculated from other attributes, but since weeks do not align exactly with months a grouping by months cannot be obtained by a grouping by weeks. The duplication is not an issue as it provides analysts with the flexibility to use either option. The same duplication applies to financial months versus calendar months, for example the financial month 1 can equate to calendar month 5 etc.

A second way duplication can occur in a Star schema is where a value for an attribute (other than a key or calculated value) is stored in both the fact table and the dimension table. For example, a business customer is based in London, and this is recorded as an attribute for the customer’s record in the dimension table. When that customer makes a purchase, the city is recorded in the fact table (along with the item, price, quantity, etc). Hence the city data for that customer is duplicated in both the dimension table, and every record in the fact table related to that customer’s purchases. If the customer later relocates to Cardiff, the customer’s record in the dimension table is updated to reflect the new city. By storing the city value in the fact table, we can get a true value for sales in London and Cardiff both before and after the customer re-located, whereas if we simply select and group by the city attribute in the dimension table, the sales figures would be wrong as all the sales for Cardiff would also include all that customers purchases from when they were based in London, and vice versa.

Tags:

Database Systems