8.27.2008

Common Mistakes in Database Design

Common mistakes:

1. Poor design/planning
2. Ignoring normalization
3. Poor naming standards
4. Lack of documentation
5. One table to hold all domain values
6. Using identity/guid columns as your only key
7. Not using SQL facilities to protect data integrity
8. Not using stored procedures to access data
9. Lack of testing

Poor design/planning

The database is the cornerstone of pretty much every business project. If you don't take the time to map out the needs of the project and how the database is going to meet them, then the chances are that the whole project will veer off course and lose direction. Furthermore, if you don't take the time at the start to get the database design right, then you'll find that any substantial changes in the database structures that you need to make further down the line could have a huge impact on the whole project, and greatly increase the likelihood of the project timeline slipping.

Admittedly it is impossible to predict every need that your design will have to fulfill and every issue that is likely to arise, but it is important to mitigate potential problems as much as possible through careful planning.

Ignoring Normalization

Normalization defines a set of methods to break down tables to their constituent parts until each table represents one and only one "thing", and its columns serve to fully describe only the one "thing" that the table represents.

The concept of normalization has been around for 30 years and is the basis on which SQL and relational databases are implemented. In other words, SQL was created to work with normalized data structures.

Normalizing your data is essential to good performance and ease of development, but the question always comes up: "How normalized is normalized enough?" As a rule, Third Normal Form is essential, but Fourth and Fifth Normal Forms are also really useful and, once you get a handle on them, quite easy to follow and well worth the time required to implement. In reality, however, it is quite common that not even First Normal Form is implemented correctly.
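
As a rough sketch of what this looks like in practice (all table and column names here are invented for the example), a table that mixes order and customer details can be broken into two tables, each describing one and only one "thing":

-- Unnormalized: customer details repeat on every order row
-- CREATE TABLE OrderWithCustomer
--     (OrderId int, OrderDate datetime,
--      CustomerName varchar(100), CustomerPhone varchar(20), ...)

-- Normalized: each table represents one "thing"
CREATE TABLE Customer
(
    CustomerId    int IDENTITY(1,1) PRIMARY KEY,
    CustomerName  varchar(100) NOT NULL,
    CustomerPhone varchar(20) NULL
);

CREATE TABLE Orders
(
    OrderId    int IDENTITY(1,1) PRIMARY KEY,
    OrderDate  datetime NOT NULL,
    CustomerId int NOT NULL REFERENCES Customer (CustomerId)
);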

Poor naming standards

Names are the first and most important line of documentation for your application. The names you choose are not just there to enable you to identify the purpose of an object, but to allow all future programmers, users, and so on to quickly and easily understand how a component part of your database was intended to be used, and what data it stores. No future user of your design should need to wade through a 500-page document to determine the meaning of some wacky name.

A name such as tblCustomer or colVarcharAddress might seem useful from a development perspective, but to the end user it is just confusing. As a developer, you should rely on being able to determine that a table name is a table name by context in the code or tool, and present to the users clear, simple, descriptive names, such as Customer and Address.

One practice I strongly advise against is the use of spaces and quoted identifiers in object names. You should avoid column names such as "Part Number" or, in Microsoft style, [Part Number], which require your users to include these spaces and delimiters in their code. It is annoying and simply unnecessary.

Acceptable alternatives would be PART_NUMBER, part_number, partNumber or PartNumber. Again, consistency is key. If you choose PartNumber then that's fine – as long as the column containing invoice numbers is called InvoiceNumber, and not one of the other possible variations.
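
To make the point concrete, here is a small hypothetical table declared in the consistent, descriptive style suggested above (the columns are purely illustrative):

-- Clear, consistent PascalCase names; no prefixes, spaces, or quoted identifiers
CREATE TABLE Invoice
(
    InvoiceNumber varchar(20)  NOT NULL,   -- not [Invoice Number] or invoice_no
    PartNumber    varchar(20)  NOT NULL,   -- matches the PartNumber style used elsewhere
    Address       varchar(200) NULL
);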

Lack of documentation

By carefully naming your objects, columns, and so on, you can make it clear to anyone what it is that your database is modeling.

Documentation should contain definitions of the database's tables, columns, relationships, and even default and check constraints, so that it is clear to everyone how they are intended to be used. In many cases, you may want to include sample values, the reason the object was needed, and anything else that you may want to know in a year or two when "future you" has to go back and make changes to the code.

Where this documentation is stored is largely a matter of corporate standards and/or convenience to the developer and end users. It could be stored in the database itself, using extended properties. Alternatively, it might be maintained in the data modeling tools. It could even be in a separate data store, such as Excel or another relational database.
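
For instance, if you keep the documentation in the database itself, SQL Server's extended properties can hold a description directly on a table or column; the object names below are just placeholders:

EXEC sys.sp_addextendedproperty
     @name = N'MS_Description',
     @value = N'Maximum credit, in dollars, extended to the customer',
     @level0type = N'SCHEMA', @level0name = N'dbo',
     @level1type = N'TABLE',  @level1name = N'Customer',
     @level2type = N'COLUMN', @level2name = N'CreditLimit';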

Your goal should be to provide enough information that when you turn the database over to a support programmer, they can figure out your minor bugs and fix them.


I know there is an old joke that poorly documented code is a synonym for "job security." While there is a hint of truth to this, it is also a way to be hated by your coworkers and never get a raise. And no good programmer I know of wants to go back and rework their own code years later. It is best if the bugs in the code can be managed by a junior support programmer while you create the next new thing. Job security along with raises is achieved by being the go-to person for new challenges.

One table to hold all domain values

Relational databases are based on the fundamental idea that every object represents one and only one thing. There should never be any doubt as to what a piece of data refers to. By tracing through the relationships, from column name, to table name, to primary key, it should be easy to examine the relationships and know exactly what a piece of data means.

The big myth is that the more tables there are, the more complex the design will be. So, conversely, shouldn't condensing multiple tables into a single "catch-all" table simplify the design? It sounds like a good idea, but in a large application it is wrong. A single table for all domain values may seem a clean and natural design, but it is just not natural to work with in SQL: the queries against it become convoluted, and performance suffers.
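
A brief sketch of the difference (the names here are invented): the "catch-all" table lumps every domain together, while separate tables keep each domain simple and give you proper foreign key targets:

-- The "catch-all" approach: every domain crammed into one table
CREATE TABLE GenericDomain
(
    DomainId    int IDENTITY(1,1) PRIMARY KEY,
    DomainType  varchar(30) NOT NULL,   -- 'OrderStatus', 'CustomerType', ...
    DomainValue varchar(50) NOT NULL
);

-- Preferable: one small table per domain, each usable as a foreign key target
CREATE TABLE OrderStatus
(
    OrderStatusId int IDENTITY(1,1) PRIMARY KEY,
    StatusName    varchar(50) NOT NULL UNIQUE
);

CREATE TABLE CustomerType
(
    CustomerTypeId int IDENTITY(1,1) PRIMARY KEY,
    TypeName       varchar(50) NOT NULL UNIQUE
);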

The point of this tip is simply that it is better to do the work upfront, making structures solid and maintainable, rather than trying to do the least amount of work to start out a project. By keeping each table down to representing one "thing", most changes will affect only one table, and there will be less rework for you down the road.

Using identity/guid columns as your only key

First Normal Form dictates that all rows in a table must be uniquely identifiable. Hence, every table should have a primary key.

SQL Server allows you to define a numeric column as an IDENTITY column, and then automatically generates a unique value for each row.

Alternatively, you can use NEWID() (or NEWSEQUENTIALID()) to generate a 16-byte, globally unique value for each row. These types of values, when used as keys, are known as surrogate keys. The word surrogate means "something that substitutes for" and in this case, a surrogate key should be the stand-in for a natural key.


The problem is that too many designers use a surrogate key column as the only key column on a given table. The surrogate key values have no actual meaning in the real world; they are just there to uniquely identify each row.

Now, consider the following Part table, whereby PartID is an IDENTITY column and is the primary key for the table:

PartID   PartNumber   Description
------   ----------   -----------
1        XXXXXXXX     The X part
2        XXXXXXXX     The X part
3        YYYYYYYY     The Y part


How many rows are there in this table? Well, there seem to be three, but are rows with PartIDs 1 and 2 actually the same row, duplicated? Or are they two different rows that should be unique but were keyed in incorrectly?

The rule of thumb I use is simple. If a human being could not pick which row they want from a table without knowledge of the surrogate key, then you need to reconsider your design. This is why there should be a key of some sort on the table to guarantee uniqueness, in this case likely on PartNumber.
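
One common fix, assuming PartNumber really is the natural identifier, is to keep the surrogate primary key but also declare the natural key as an alternate key, along these lines:

CREATE TABLE Part
(
    PartID      int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Part PRIMARY KEY,          -- surrogate key
    PartNumber  varchar(20) NOT NULL
        CONSTRAINT AK_Part_PartNumber UNIQUE,    -- natural/alternate key guarantees uniqueness
    Description varchar(100) NULL
);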

In summary: as a rule, each of your tables should have a natural key that means something to the user, and can uniquely identify each row in your table. In the very rare event that you cannot find a natural key (perhaps, for example, a table that provides a log of events), then use an artificial/surrogate key.

Not using SQL facilities to protect data integrity (for SQL Server users)

All fundamental, non-changing business rules should be implemented by the relational engine. The base rules of nullability, string length, assignment of foreign keys, and so on, should all be defined in the database.
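
As a short sketch of what that means in practice (the tables and columns here are invented for the example), nullability, lengths, a check constraint, and a foreign key are all declared in the schema itself, so the engine enforces them no matter which application touches the data:

CREATE TABLE Department
(
    DepartmentId   int IDENTITY(1,1) PRIMARY KEY,
    DepartmentName varchar(50) NOT NULL UNIQUE
);

CREATE TABLE Employee
(
    EmployeeId   int IDENTITY(1,1) PRIMARY KEY,
    LastName     varchar(50) NOT NULL,            -- nullability and string length defined here
    HoursWorked  decimal(5,2) NOT NULL
        CONSTRAINT CHK_Employee_HoursWorked CHECK (HoursWorked >= 0),   -- business rule in the engine
    DepartmentId int NOT NULL
        CONSTRAINT FK_Employee_Department REFERENCES Department (DepartmentId)   -- foreign key enforced
);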


Not using stored procedures to access data

Stored procedures are your friend. Use them whenever possible as a method to protect the database layer from the users of the data. Stored procedures make database development much cleaner, and encourage collaborative development between your database and functional programmers. A few of the other interesting reasons that stored procedures are important include the following.

Maintainability

Stored procedures provide a known interface to the data, and this is probably the largest draw. Stored procedures give the database professional the power to change characteristics of the database code without additional resource involvement, making small changes, or large upgrades (for example, changes to SQL syntax), easier to do.

Encapsulation

Stored procedures allow you to "encapsulate" any structural changes that you need to make to the database so that the knock-on effect on user interfaces is minimized. For example, say you originally modeled one phone number, but now want an unlimited number of phone numbers. You could leave the single phone number in the procedure call, but store it in a different table as a stopgap measure, or even permanently if you have a "primary" number of some sort that you always want to display. Then a stored procedure could be built to handle the other phone numbers. In this manner the impact to the user interfaces could be quite small, while the code of stored procedures might change greatly.
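
A hedged sketch of that phone-number scenario (the procedure and table names are assumed for illustration): callers still pass a single phone number, while the procedure quietly stores it in a separate PhoneNumber table:

CREATE PROCEDURE Customer_Insert
    @CustomerName varchar(100),
    @PhoneNumber  varchar(20)
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO Customer (CustomerName)
    VALUES (@CustomerName);

    -- The structural change is hidden from callers: the "primary" phone number
    -- now lives in its own table, but the procedure's interface is unchanged.
    INSERT INTO PhoneNumber (CustomerId, PhoneNumberValue, IsPrimary)
    VALUES (SCOPE_IDENTITY(), @PhoneNumber, 1);
END;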

Security

Stored procedures can provide specific and granular access to the system. For example, you may have 10 stored procedures that all update table X in some way. If a user needs to be able to update a particular column in a table and you want to make sure they never update any others, then you can simply grant that user permission to execute just the one procedure out of the ten that allows them to perform the required update.
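
For instance (the object and user names below are hypothetical), the user is granted execute rights on the one procedure and is given no UPDATE permission on the table at all:

CREATE PROCEDURE Part_UpdateDescription
    @PartID      int,
    @Description varchar(100)
AS
BEGIN
    SET NOCOUNT ON;

    -- Updates only the one column this caller is allowed to change
    UPDATE Part
    SET    Description = @Description
    WHERE  PartID = @PartID;
END;
GO

-- The user gets no UPDATE permission on the table itself, only this procedure
GRANT EXECUTE ON dbo.Part_UpdateDescription TO SomeUser;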


Lack of testing

As database professionals know, the first thing to get blamed when a business system is running slow is the database. Why? First because it is the central piece of most any business system, and second because it also is all too often true.


The remedy is to gain deep knowledge of the system we have created and to understand its limits through testing. Yet testing is the first thing to go in a project plan when time slips a bit. And what suffers the most from the lack of testing? Functionality? Maybe a little, but users will notice and complain if the "Save" button doesn't actually work and they cannot save changes to a row they spent 10 minutes editing. What really gets the shaft in this whole process is deep system testing to make sure that the design you (presumably) worked so hard on at the beginning of the project is actually implemented correctly.


Initially, major bugs come in thick and fast, especially performance-related ones. If the first time you have tried a full production load of users, background processes, workflow processes, system maintenance routines, ETL, and so on is on your system launch day, you are extremely likely to discover that you have not anticipated all of the locking issues that might be caused by users creating data while others are reading it, or hardware issues caused by poorly set up hardware.


Once the major bugs are squashed, the fringe cases (which are pretty rare, like a user entering a negative amount for hours worked) start to rear their ugly heads. What you end up with at this point is software that fails irregularly in what seem like weird places, since large quantities of fringe bugs show up in ways that aren't very obvious and are really hard to find.


Now, these bugs are far harder to diagnose and correct, because you have to deal with the fact that users are working with live data and trying to get work done. Plus you probably have a manager or two sitting on your back saying things like "when will it be done?" every 30 seconds, even though it can take days and weeks to discover the kinds of bugs that result in minor (yet important) data aberrations. Had proper testing been done, it would never have taken weeks in production to find these bugs, because a proper test plan takes into consideration all possible types of failures, codes them into an automated test, and tries them over and over. Good testing won't find all of the bugs, but it will get you to the point where most of the issues that correspond to the original design are ironed out.

If everyone insisted on a strict testing plan as an integral and immutable part of the database development process, then maybe someday the database won't be the first thing to be fingered when there is a system slowdown.

Summary

Database design and implementation is the cornerstone of any data-centric project (99.9% of business applications) and should be treated as such when you are developing. Some of these tips, like planning properly, using proper normalization, using strong naming standards and documenting your work, are things that even the best DBAs and data architects have to fight to make happen.


Database Design

Certain principles guide the database design process. The first principle is that duplicate information (also called redundant data) is bad, because it wastes space and increases the likelihood of errors and inconsistencies. The second principle is that the correctness and completeness of information is important. If your database contains incorrect information, any reports that pull information from the database will also contain incorrect information. As a result, any decisions you make that are based on those reports will then be misinformed.

A good database design is, therefore, one that:

- Divides your information into subject-based tables to reduce redundant data.

- Provides Access with the information it requires to join the information in the tables together as needed.

- Ensures the accuracy and integrity of your information.

- Accommodates your data processing and reporting needs.
 


The Database Design Process:

The design process consists of the following steps:

1. Determine the purpose of your database
 

This helps prepare you for the remaining steps.

2. Find and organize the information required
 

Gather all of the types of information you might want to record in the database, such as product name and order number.
 

3. Divide the information into tables
 

Divide your information items into major entities or subjects, such as Products or Orders. Each subject then becomes a table.

4. Turn information items into columns
 

Decide what information you want to store in each table. Each item becomes a field, and is displayed as a column in the table. For example, an Employees table might include fields such as Last Name and Hire Date.

5. Specify primary keys
 

Choose each table’s primary key. The primary key is a column that is used to uniquely identify each row. An example might be Product ID or Order ID.

6. Set up the table relationships
 

Look at each table and decide how the data in one table is related to the data in other tables. Add fields to tables or create new tables to clarify the relationships, as necessary.

7. Refine your design
 

Analyze your design for errors. Create the tables and add a few records of sample data. See if you can get the results you want from your tables. Make adjustments to the design, as needed.

8. Apply the normalization rules
 

Apply the data normalization rules to see if your tables are structured correctly. Make adjustments to the tables, as needed.
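
As a compact illustration of steps 3 through 6 above, using the Products and Orders subjects mentioned in the steps (everything else is invented, and deliberately simplified):

CREATE TABLE Product
(
    ProductId   int IDENTITY(1,1) PRIMARY KEY,   -- step 5: a primary key for each table
    ProductName varchar(100) NOT NULL            -- step 4: information items become columns
);

CREATE TABLE Orders
(
    OrderId     int IDENTITY(1,1) PRIMARY KEY,
    OrderNumber varchar(20) NOT NULL,
    ProductId   int NOT NULL REFERENCES Product (ProductId)   -- step 6: the table relationship
    -- Simplified for illustration: a real design would likely add an OrderDetail
    -- table to handle the many-to-many relationship between orders and products.
);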

8.14.2008

Surrogate key

A surrogate key is a substitution for the natural primary key.

It is just a unique identifier or number for each row that can be used for the primary key to the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.

Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. The values can come from an Informatica sequence generator, an Oracle sequence, or SQL Server IDENTITY columns.

It is useful because the natural primary key (e.g. Customer Number in a Customer table) can change, and this makes updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME which the business users regard as the primary keys, but not only can these values change, indexing on a numeric value is usually better as well, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system; as far as the client is concerned, you may display only the AIRPORT_NAME.
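
As a small sketch of that AIRPORT example in SQL Server terms (the column sizes and constraint names are assumptions):

CREATE TABLE Airport
(
    AIRPORT_ID   int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Airport PRIMARY KEY,       -- surrogate key, internal to the system
    AIRPORT_NAME varchar(100) NOT NULL
        CONSTRAINT AK_Airport_Name UNIQUE        -- the name the client actually sees
);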

A surrogate may also be called a

  • surrogate key,
  • entity identifier,
  • system-generated key,
  • database sequence number,
  • synthetic key,
  • technical key or
  • arbitrary unique identifier.