About David Jones

Profile Picture David is a Software Developer from the UK with over 12 years experience. He holds a bachelors degree in Economics, and a masters degree in Software Development, from which many of the pages on this site have evolved. His interests include all things C#/.NET along with software frameworks, architectures and design patterns.

When not developing software David plays bass in a psychedelic rock band who are currently recording their first album.

Comparing Relational Databases and Multidimensional Databases

by David Jones 23. May 2010 06:20

In this post I briefly compare the data structures and operators of relational databases compared to multidimensional databases, and then discuss why data stored in a multidimensional database can be easier to understand and manipulate than data stored in a relational database. I then go on to describe different ways in which data duplication can occur in a multidimensional star schema and discuss why it can be beneficial to store data in this way whilst avoiding the problems often associated with this sort of duplication in relational databases.

At the core of the dimensional model are Facts and Dimensions. A fact is a numerical measure of a subject of interest, for example: company A bought quantity n of product X on date D. A dimension is a perspective by which a organisation can select, group and view these facts, for example: the organisation can view the facts by time or product or customer – or all three – depending on the dimensions defined in their multidimensional model. The multidimensional model is designed primarily for querying historical data (i.e. facts) in to provide answers to questions of a temporal nature, such as “What are the sales of product X per month this year in comparison with the same months over the last 3 years?”.

The aggregated data forming the summarised facts are presented as an n-D (eg 3D) cube of data, with a dimension (at a given level within the concept hierarchy) along each axis, and a numerical fact (or simply nothing) at each intersection. Data is normally stored as a central large fact table, with a number of smaller satellite dimension tables, associated by foreign keys.

By comparison, the relational model stores it’s data in tables (or more precisely Relations). In most cases the data is normalised so that there is no duplication of data between tables. Compared to the multidimensional model, the relational model usually consists of a larger number of tables, with more complex relationships between them. A primary precept of the relational model is that data is always presented as a two-dimensional table – even where there is only a single value.

Even though the raw data of the multidimensional model is usually stored within a relational database, the multidimensional model lends itself to analysis whereas the relational model is more geared towards operational needs, namely regular insertions, updates or deletions of small amounts of data affecting a small number of tables.

Both the multidimensional model and the relational model support aggregation of data in order to satisfy queries. However, the multidimensional model will often calculate and store all possible aggregations whereas the relational model creates only the aggregates requested, does not store them (unless snapshot tables are being used), and recreates the aggregates each time.

In order to manipulate and aggregate data both the multidimensional and relational models support the use of a range of operators. Both models support the use of distributive aggregate operators including: COUNT, SUM, MIN and MAX. However, the multidimensional model supports a range of operations to manipulate data in a cube including:

  • Roll-up and Drill-down to ascend and descend the dimension hierarchy.
  • The Slice operation to perform a selection on one dimension of a cube.
  • The Dice operation to perform a selection on two or more dimensions of a cube.

By comparison, the relational model uses SELECT … WHERE … GROUP BY operators to perform similar functions. In fact, where multidimensional data is stored in a relational database, the RDBMS uses these operators in the background in the generation of the multidimensional data.

Data in a data warehouse built on a multidimensional database will be easier to understand and manipulate than data stored in a data warehouse built on a relational DBMS because the multidimensional DBMS is designed and built specifically for this purpose. The multidimensional data warehouse pre-computes some (or all) of the possible cubes, allowing for fast and simple queries, whereas the relational data warehouse will store most (if not all) the data in it’s more detailed, non-aggregated format, therefore, queries are slower and more complex as data needs to be selected and aggregated each time.

There is often duplication in Time dimension tables. A week can be calculated from other attributes, but since weeks do not align exactly with months a grouping by months cannot be obtained by a grouping by weeks. The duplication is not an issue as it provides analysts with the flexibility to use either option. The same duplication applies to financial months versus calendar months, for example the financial month 1 can equate to calendar month 5 etc.

A second way duplication can occur in a Star schema is where a value for an attribute (other than a key or calculated value) is stored in both the fact table and the dimension table. For example, a business customer is based in London, and this is recorded as an attribute for the customer’s record in the dimension table. When that customer makes a purchase, the city is recorded in the fact table (along with the item, price, quantity, etc). Hence the city data for that customer is duplicated in both the dimension table, and every record in the fact table related to that customer’s purchases. If the customer later relocates to Cardiff, the customer’s record in the dimension table is updated to reflect the new city. By storing the city value in the fact table, we can get a true value for sales in London and Cardiff both before and after the customer re-located, whereas if we simply select and group by the city attribute in the dimension table, the sales figures would be wrong as all the sales for Cardiff would also include all that customers purchases from when they were based in London, and vice versa.

Tags:

Database Systems

Are Agile Methods Software Development's next Silver Bullet?

by David Jones 18. May 2010 08:27

In this post I am going to discuss two Software Development articles: "No silver bullet - essence and accident in Software engineering" by F. Brooks from 1987 and the more recent 2002 article "Get ready for agile methods, with care" by B. Boehm. Are Agile Software Development methods one of the "Silver Bullets" whose existence Fred Brooks questioned?

According to Brooks, the difficulties in developing software derive from it's essential complexity, need to conform to other interfaces, need to change and difficulty to visualise. As to whether recent agile methods could be a silver bullet for software development, Brooks says that there can be no silver bullet due to the nature of software, but states that there are “encouraging innovations”. It can be argued that the innovations he talks of are, in fact, essential characteristics of what later became known as agile methods. The following paragraphs compares these innovations with the agile methods in the Boehm article.

Requirements - Brooks argues that the most difficult part of developing a software system is deciding what to develop, ie establishing the requirements, the specification and designing the software, and promotes an iterative approach to eliciting and refining the requirements between the developer and client. This is comparable with the agile approach of working closely with a customer representative to establish and implement their requirements in small iterations.

Change – Brooks writes that during initial development the client often doesn't fully know their requirements for the software, that additional requirements emerge during development, and, once released, all successful software comes under pressure for change. However, Boehm tells us how agile methods welcome change as requirements emerge and change during development. Short development iterations, along with constant testing, allow changes to be incorporated into the software.

Rapid prototyping and incremental development – according to Brooks, an early prototype helps to visualise the software, allows the client to verify that it meets his requirements, allows any emergent requirements to surface, and creates a basic working system which can be gradually expanded. He also claims that getting an early prototype running boosts morale. Boehm quotes the Agile Manifesto principle of releasing software early and regularly. This gives the customer regular returns on his investment (Chromatic, 2003).

Visualisation and modelling – In 1987 Brooks wrote that software is inherently unvisualisable and attacked the usefulness of a flowchart as a design tool. However, UML and current modelling tools allow designers to create detailed models of software from which the code can be auto-generated.

People – Boehm highlights how agile methods lean towards requiring a small number of premium people, using their tacit knowledge rather than detailed plans to develop designs, and that design-by-committee often results in an inferior design. Fifteen years earlier, Brooks was saying exactly the same thing – that great software is often designed by just a few great designers rather than by committee.

Testing – Brooks states that a great deal of the effort in developing software goes into testing and bug fixing, and asks whether a silver bullet can be found which eliminates them in the system design phase, and whether this will lead to improvements in productivity and reliability. Regular testing is fundamental to agile methods. Unit testing is used to test discrete sections of code, often individual methods, whilst Continuous Integration rebuilds and tests the whole system every time a change is checked into source control, in order to check for changes in one part of the system (which might well work fine on their own) affecting other areas of the system (Niemeyer and Poteet, 2003).

Project size – Brooks claims the benefits of incremental development applies to projects of all sizes, however, Boehm's article quotes Larry Constantine as saying that it becomes increasingly difficult to apply agile methods to teams of more than 15 to 20.

In discussing Brooks' No Silver Bullet article, Black (1999) reiterates the argument that searching for a universal cure to the “software problem” is counterproductive. He goes on to reiterate that developments such as RAD, OOP and 4GLs have all had their moment as the next big thing, but ultimately become last year's incremental improvement.

In conclusion it can be argued that agile methods maybe do represent Brooks' silver bullet under certain circumstances, with the right people - designers, developers, managers and customers - and the right project. However, in Boehm's article even proponents of agile methods state that they are not suitable for very large projects or safety-critical systems, so the home grounds of the various approaches need to be matched to the project.

References

Black, R., (1999). Managing the Testing Process, Microsoft Press

Boehm, B. (2002). Get ready for agile methods, with care, IEEE Computer, Vol. 35, No. 1, January, 64–9.

Brooks, F. (1987). No silver bullet – essence and accident in software engineering, IEEE Computer, Vol. 20, No. 4, April, 10–19

Chromatic, (2003). Extreme Programming Pocket Guide, O'Reilly

Niemeyer, G., Poteet, J., (2003), Extreme Programming with Ant , Sams Publishing

Tags: , , ,

Software Development