What the Null? Handling Missing Information in Database Systems

In relational database systems, a null represents missing or unknown information at the column level. A null is not the same as 0 (zero) or blank. Null means no entry has been made for the column and it implies that the value is either unknown or inapplicable.

With any relational DBMS that supports nulls you can use them to distinguish between a deliberate entry of 0 (for numerical columns) or a blank (for character columns) and an unknown or inapplicable entry (NULL for both numerical and character columns).

Nulls sometimes are inappropriately referred to as “null values.” Using the term value to describe a null is inaccurate because a null implies the lack of a value. Therefore, simply use the term null or nulls (without appending the term “value” or “values” to it).

Most RDBMSes represent null in a “hidden” column (or storage field) that is associated with each nullable column. A common name for this field is an indicator (such as in DB2). An indicator is defined for each column that can accept nulls. The indicator variable is transparent to the end user, but must be provided for when programming in a host language (such as Java or COBOL).

Every column defined to a table must be designated as either allowing or disallowing nulls. A column is defined as nullable – meaning it can be set to NULL – in the table creation DDL. Null is typically the default if nothing is specified after the column name. To prohibit the column from being set to NULL you must explicitly specify NOT NULL after the column name. In the following sample table, COL1 and COL3 can be set to null, but not COL2, COL4, or COL5:

CREATE TABLE SAMPLE1
   (COL1   INTEGER,
    COL2   CHAR(10) NOT NULL,
    COL3   CHAR(5),
    COL4   DATE     NOT NULL WITH DEFAULT,
    COL5   TIME     NOT NULL);

What Are The Issues with Null?

The way in which nulls are processed usually is not intuitive to folks used to yes/no, on/off, thinking. With null data, answers are not true/false, but true/false/unknown. Remember, a null is not known. So when a null participates in a mathematical expression, the result is always null. That means that the answer to each of the following is NULL:

  • 5 + NULL
  • NULL / 501324
  • 102 – NULL
  • 51235 * NULL
  • NULL**3
  • NULL + NULL
  • NULL/0

Yes, even that last one is null, even though the mathematician in us wants to say “error” because of division by zero. So nulls can be tricky to deal with.

Another interesting aspect of nulls is that the AVG, COUNT DISTINCT, SUM, MAX, and MIN functions omit column occurrences set to null. The COUNT(*) function, however, does not omit columns set to null because it operates on rows. Thus, AVG is not equal to SUM/COUNT(*) when the average is being computed for a column that can contain nulls. To clarify with an example, if the COMM column is nullable, the result of the following query:

    SELECT  AVG(COMM)
     FROM    EMP;

is not the same as for this query:

    SELECT  SUM(COMM)/COUNT(*)
     FROM    EMP;

Instead, we would have to code the following to be equivalent to AVG(COMM):

    SELECT  SUM(COMM)/COUNT(COMM)
     FROM    EMP;

When the column is added to the COUNT function the DBMS no longer counts rows, but instead counts column values (and remember, a null is not a value, but the lack of a value).

But perhaps the more troubling aspect of this treatment of nulls is “What exactly do the results mean?” Shouldn’t a function that processes any NULLs at all return an answer of NULL, or unknown? Does skipping all columns that are NULL return a useful result? I think what is really needed is an option for these functions when they operate on nullable columns. Perhaps a switch that would allow three different modes of operation:

  1. Return a NULL if any columns were null, which would be the default
  2. Operate as it currently does, ignoring NULLs
  3. Treat all NULLs as zeroes

At least that way, users would have an option as to how NULLs are treated by functions. But this is not the case, so to avoid confusion, try to avoid allowing nulls in columns that must be processed using these functions whenever possible.

Here are some additional considerations regarding the rules of operation for nulls:

  • When a nullable column participates in an ORDER BY or GROUP BY clause, the returned nulls are grouped either at the high or low end of the sort order depending on the DBMS. But this treats all nulls as equal when we all know they are not; they are unknown.
  • Nulls are considered to be equal when duplicates are eliminated by SELECT DISTINCT or COUNT (DISTINCT column).
  • Depending on the DBMS, a unique index may consider nulls to be equivalent and disallows duplicate entries because of the existence of nulls. Some DBMSes, such as DB2, provide a clause (in this case, WHERE NOT NULL) that allows multiple nulls in an index.
  • For comparison in a SELECT statement, two null columns are not considered equal. When a nullable column participates in a predicate in the WHERE or HAVING clause, the nulls that are encountered cause the comparison to evaluate to UNKNOWN.
  • When a nullable column participates in a calculation, the result is null.
  • Columns that participate in a primary key cannot be null.
  • To test for the existence of nulls, use the special predicate IS NULL in the WHERE clause of the SELECT statement. You cannot simply state WHERE column = NULL. You must state WHERE column IS NULL.
  • It is invalid to test if a column is <> NULL, or >= NULL. These are all meaningless because null is the absence of a value.

Examine these rules closely. ORDER BY, GROUP BY, DISTINCT, and unique indexes consider nulls to be equal and handle them accordingly. The SELECT statement, however, deems that the comparison of null columns is not equivalence, but unknown. This inconsistent handling of nulls is an anomaly that you must remember when using nulls.

Here are a couple of other issues to consider when nulls are involved.

Did you know it is possible to write SQL that returns a NULL even if you have no nullable columns in your database? Assume that there are no nullable columns in the EMP table (including SALARY) and then consider the following SQL:

SELECT  SUM(SALARY)
 FROM   EMP
 WHERE  DEPTNO > 999;

The result of this query will be NULL if no DEPTNO exists that is greater than 999. So it is not feasible to try to design your way out of having to understand nulls!

Another troubling issue with NULLs is that some developers have incorrect expectations when using the NOT IN predicate with NULLs. Consider the following SQL:

SELECT C.color
 FROM   Colors AS C 
 WHERE  C.color NOT IN (SELECT P.color 
                        FROM   Products AS P);

If one of the products has its color set to NULL, then the result of the SELECT is the empty set, even if there are colors to which no other product is set.

IS [NOT] DISTINCT FROM

Another issue that pops up when dealing with nulls is that a NULL does not equal a NULL, so extra effort is required to treat them as such. SQL provides a method for comparing columns that could be null, which is supported in DB2:

 IS [NOT] DISTINCT FROM,

Before explaining how this clause functions, let’s take a look at the problem it helps to solve. Two columns are not equal if both are NULL, because NULL is unknown and a NULL never equals anything else, not even another NULL. But sometimes you might want to treat NULLs as equivalent. In order to do that, you would have to code something like this in your WHERE clause:

WHERE COL1 = COL2
 OR (COL1 IS NULL AND COL2 IS NULL)

This coding would cause the DBMS to return all the rows where COL1 and COL2 are the same value, as well as all the rows where both COL1 and COL2 are NULL, effectively treating NULLs as equivalent. But this coding, although relatively simply, can be unwieldy and perhaps, at least not at first blush, unintuitive.

Here comes the IS NOT DISTINCT FROM clause to the rescue. The following clause is logically equivalent to the one above, but perhaps simpler to code and understand:

WHERE COL1 IS NOT DISTINCT FROM COL2

The same goes for checking a column against a host variable. You might try to code a clause specifying WHERE COL = :HV :hvind (host variable and indicator variable). But such a search condition would never be true when the value in that host variable is null, even if the host variable contains a null indicator. This is because one null does not ever equal another null. Instead we’d have to code additional predicates: one to handle the non-null values and two others to ensure both COL1 and the :HV are both null. With the introduction of the IS NOT DISTINCT FROM predicate, the search condition could be simplified to just:

WHERE COL1 IS NOT DISTINCT FROM :HV :hvind

Not only is the IS NOT DISTINCT FROM clause simpler and more intuitive, it is also a Stage 1 predicate, so it can perform well, too.

Summary

Nulls are clearly one of the most misunderstood features of SQL database systems development. Although nulls can be confusing, you cannot bury your head in the sand and ignore nulls. Understanding what nulls are, and how best to use them, can help you to create usable databases and design useful and correct queries in your database applications.

Advertisements

About craig@craigsmullins.com

I'm a strategist, researcher, and consultant with nearly three decades of experience in all facets of database systems development.
This entry was posted in NULL, SQL. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s