Have You Thought About the Impact of eDiscovery on Your Data Management Strategy?

When thinking about data management it is unlikely that your first thought is about legal requirements. Nevertheless, the legal side of data management must be considered in this day and age of regulatory compliance.

To some, the sheer volume and nature of all the sweeping regulations are too mind-boggling to fully digest. But with the EU GDPR quickly coming down the tracks, it makes sense to discuss a few of the issues that will impact your databases and data management policies.

First of all, ensuring compliance requires a collaborative effort between business users, IT, and your legal department. This can prove to be a challenge because these three groups are quite distinct and rarely communicate collectively. IT talks to legal only when it has to – and that is usually just to get approval on contract language for a software purchase. IT and business communicate regularly (at least they should), but perhaps not as effectively as they might. But all three are required:

  • Business: must understand the legal requirements imposed on their data and systems as dictated in regulations
  • Legal: must be involved to interpret the legal language of the regulations and ensure that the business is taking proper steps to protect itself
  • IT: must be involved to implement the policies, procedures, and technology that support the regulatory mandates

Organizations need to map and categorize their business data in accordance with how each data element is impacted by regulations. We need to be able to answer questions like: Which data elements are under the control of which regulation? And what does the regulation require in the way we manage that data?

Once mapped, controls and policies need to be enacted that enforce compliance with the pertinent regulations. This can require better protection and security, enforce longer data retention periods, impose stricter privacy sanctions, mandate improved data quality practices, and so on.

One of the issues that should be factored into the equation by data management professionals is preparation for e-discovery. Yes, regulations mandate that we retain data longer, but there are rules and regulations that dictate when and how organizations will need to access and produce data that is retained, too. I mean, why keep that data around if there is no need ever to see it again?

The ability to produce retained data upon request is typically driven by lawsuits. You probably can recall examples of courtroom showdowns on television where truckloads of paper documents were required during the discovery process of the lawsuit. But times have changed. Increasingly, the data required during the discovery process is electronic, not written. That is, the data is stored on a computer, and much of that data is stored in a database management system.

Which brings me to the Federal Rules of Civil Procedure (FRCP), which are the rules used by US district courts to govern legal proceedings. One of the items in this set of rules dictates policies governing discovery. Discovery is the phase of a lawsuit before the trial occurs during which each party can request documents and other evidence from other parties or can compel the production of evidence.

The FRCP has been modernized and one of the key changes focuses on electronic documents: “A party who produces documents for inspection shall produce them . . . as they are kept in the usual course of business…” So clearly this change compels organizations to improve their ability to produce electronic data.

Another aspect of the FRCP deals with safe harbor from sanctions arising from spoliation. According to this section, “absent exceptional circumstances, a court may not impose sanctions under these rules on a party for failing to provide electronically stored information as a result of the routine, good faith operation of an electronic information system.” Basically, this section shines a spotlight on the need for organizations to develop a clearly articulated, well-executed, and uniformly enforced records retention program. And that program should include database data. Instituting policies and procedures for how data is treated for long-term retention can provide some level of protection from “adverse inference” rulings arising from spoliation.

There are likely to be additional implications arising from manipulating your data management standards to comply with the FRCP, especially when coupled with industry trends such as big data causing more and more data to be retained, the growing number of data breaches and the ever-increasing regulations being voted into law by federal and state governments. It means that we will be forced to treat data as the corporate asset that it is — instead of just saying that we treat it that way.

Data governance programs are becoming more popular as corporations work to comply with more and stricter governmental regulations. A data governance program oversees the management of the availability, usability, integrity, and security of enterprise data. A sound data governance program includes a governing body or council, a defined set of procedures, and a plan to execute those procedures.

So an organization with a strong data governance practice will have better control over its information. When data management is instituted as an officially sanctioned mandate of an organization, data is treated as an asset. That means data elements are defined in business terms, data stewards are assigned, data is modeled and analyzed, metadata is defined, captured and managed, and data is archived for long-term data retention.

All of this should be good news to data professionals who have wanted to better define and use data within their organizations. That is, the laws are finally catching up with what we knew our companies should have been doing all along.

Posted in compliance, data, data breach, Data Growth

What the Null? Handling Missing Information in Database Systems

In relational database systems, a null represents missing or unknown information at the column level. A null is not the same as 0 (zero) or blank. Null means no entry has been made for the column and it implies that the value is either unknown or inapplicable.

With any relational DBMS that supports nulls you can use them to distinguish between a deliberate entry of 0 (for numerical columns) or a blank (for character columns) and an unknown or inapplicable entry (NULL for both numerical and character columns).

Nulls sometimes are inappropriately referred to as “null values.” Using the term value to describe a null is inaccurate because a null implies the lack of a value. Therefore, simply use the term null or nulls (without appending the term “value” or “values” to it).

Most RDBMSes represent null in a “hidden” column (or storage field) that is associated with each nullable column. A common name for this field is an indicator (such as in DB2). An indicator is defined for each column that can accept nulls. The indicator variable is transparent to the end user, but must be provided for when programming in a host language (such as Java or COBOL).

Every column defined to a table must be designated as either allowing or disallowing nulls. A column is defined as nullable – meaning it can be set to NULL – in the table creation DDL. Null is typically the default if nothing is specified after the column name. To prohibit the column from being set to NULL you must explicitly specify NOT NULL after the column name. In the following sample table, COL1 and COL3 can be set to null, but not COL2, COL4, or COL5:

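    CREATE TABLE SAMPLE1            -- table name and the COL1/COL4 types are assumed
       (COL1   INTEGER,
        COL2   CHAR(10) NOT NULL,
        COL3   CHAR(5),
        COL4   DATE     NOT NULL,
        COL5   TIME     NOT NULL);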
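To see the nullable and NOT NULL designations enforced, here is a minimal sketch that runs a table of this shape through Python's built-in sqlite3 module. SQLite is only a convenient stand-in for whatever DBMS you use, and the table name SAMPLE1, the COL1 and COL4 data types, and the inserted values are assumptions made for illustration.

```python
import sqlite3

# SQLite (in memory) stands in for the DBMS; SAMPLE1 and the
# COL1/COL4 data types are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SAMPLE1
       (COL1   INTEGER,
        COL2   CHAR(10) NOT NULL,
        COL3   CHAR(5),
        COL4   DATE     NOT NULL,
        COL5   TIME     NOT NULL)
""")

# COL1 and COL3 are nullable, so None (NULL) is accepted for them.
conn.execute(
    "INSERT INTO SAMPLE1 VALUES (?, ?, ?, ?, ?)",
    (None, "ABC", None, "2017-01-01", "12:00:00"),
)

# COL2 is NOT NULL, so a null for it should be rejected.
try:
    conn.execute(
        "INSERT INTO SAMPLE1 VALUES (?, ?, ?, ?, ?)",
        (1, None, "X", "2017-01-01", "12:00:00"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM SAMPLE1").fetchone()[0]
print("null in COL2 rejected:", rejected)
print("rows stored:", row_count)
```

Only the first row is stored; the second insert fails because it tries to put a null into a NOT NULL column.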

What Are The Issues with Null?

The way in which nulls are processed usually is not intuitive to folks used to yes/no, on/off thinking. With null data, answers are not true/false, but true/false/unknown. Remember, a null is not known. So when a null participates in a mathematical expression, the result is always null. That means that the answer to each of the following is NULL:

  • 5 + NULL
  • NULL / 501324
  • 102 - NULL
  • 51235 * NULL
  • NULL**3
  • NULL/0

Yes, even that last one is null, even though the mathematician in us wants to say “error” because of division by zero. So nulls can be tricky to deal with.
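A quick way to convince yourself is to evaluate a few of these expressions directly. The sketch below assumes SQLite, reached through Python's built-in sqlite3 module, as a stand-in DBMS. It leaves out the exponent and the division-by-zero cases because operator support and error handling for those vary from one engine to another.

```python
import sqlite3

# SQLite is an assumed stand-in; the expressions come from the list above.
conn = sqlite3.connect(":memory:")

expressions = ["5 + NULL", "NULL / 501324", "102 - NULL", "51235 * NULL"]

# Evaluate each expression; Python shows a SQL NULL as None.
results = {
    expr: conn.execute(f"SELECT {expr}").fetchone()[0]
    for expr in expressions
}

for expr, value in results.items():
    print(expr, "->", value)
```

Every expression comes back as None, which is how the sqlite3 module represents NULL.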

Another interesting aspect of nulls is that the AVG, COUNT DISTINCT, SUM, MAX, and MIN functions omit column occurrences set to null. The COUNT(*) function, however, does not omit columns set to null because it operates on rows. Thus, AVG is not equal to SUM/COUNT(*) when the average is being computed for a column that can contain nulls. To clarify with an example, if the COMM column is nullable, the result of the following query:

     SELECT  AVG(COMM)
     FROM    EMP;

is not the same as for this query:

     SELECT  SUM(COMM)/COUNT(*)
     FROM    EMP;

Instead, we would have to code the following to be equivalent to AVG(COMM):

     SELECT  SUM(COMM)/COUNT(COMM)
     FROM    EMP;

When the column is added to the COUNT function the DBMS no longer counts rows, but instead counts column values (and remember, a null is not a value, but the lack of a value).
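The difference between counting rows and counting column values is easy to check with a small experiment. This is a sketch only: it assumes SQLite via Python's sqlite3 module, and the four EMP rows (two with a commission, two without) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical EMP table; COMM is nullable and stored as REAL so that
# the divisions below are not integer divisions.
conn.execute("CREATE TABLE EMP (EMPNO INTEGER NOT NULL, COMM REAL)")
conn.executemany(
    "INSERT INTO EMP VALUES (?, ?)",
    [(1, 100.0), (2, 200.0), (3, None), (4, None)],
)

avg_comm, sum_over_rows, sum_over_values = conn.execute(
    """
    SELECT AVG(COMM),
           SUM(COMM) / COUNT(*),
           SUM(COMM) / COUNT(COMM)
    FROM   EMP
    """
).fetchone()

print("AVG(COMM)             =", avg_comm)
print("SUM(COMM)/COUNT(*)    =", sum_over_rows)
print("SUM(COMM)/COUNT(COMM) =", sum_over_values)
```

With this data, AVG(COMM) and SUM(COMM)/COUNT(COMM) agree because both ignore the two nulls, while SUM(COMM)/COUNT(*) divides by all four rows and comes out lower.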

But perhaps the more troubling aspect of this treatment of nulls is “What exactly do the results mean?” Shouldn’t a function that processes any NULLs at all return an answer of NULL, or unknown? Does skipping all columns that are NULL return a useful result? I think what is really needed is an option for these functions when they operate on nullable columns. Perhaps a switch that would allow three different modes of operation:

  1. Return a NULL if any columns were null, which would be the default
  2. Operate as it currently does, ignoring NULLs
  3. Treat all NULLs as zeroes

At least that way, users would have an option as to how NULLs are treated by functions. But this is not the case, so to avoid confusion, try to avoid allowing nulls in columns that must be processed using these functions whenever possible.
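Until such a switch exists, each of the three behaviors can be approximated by hand with CASE and COALESCE. The following is a minimal sketch, assuming SQLite via Python's sqlite3 module and a made-up EMP table; it is one way to get these results, not a feature of any DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: two known commissions and two nulls.
conn.execute("CREATE TABLE EMP (EMPNO INTEGER NOT NULL, COMM REAL)")
conn.executemany(
    "INSERT INTO EMP VALUES (?, ?)",
    [(1, 100.0), (2, 200.0), (3, None), (4, None)],
)

null_if_any, ignore_nulls, nulls_as_zero = conn.execute(
    """
    SELECT
        -- Mode 1: NULL if any COMM is null (CASE without ELSE yields NULL)
        CASE WHEN COUNT(*) = COUNT(COMM) THEN AVG(COMM) END,
        -- Mode 2: the built-in behavior, nulls are skipped
        AVG(COMM),
        -- Mode 3: treat every null as zero
        AVG(COALESCE(COMM, 0))
    FROM EMP
    """
).fetchone()

print("mode 1 (NULL if any null):", null_if_any)
print("mode 2 (ignore nulls):    ", ignore_nulls)
print("mode 3 (nulls as zero):   ", nulls_as_zero)
```

The three answers differ for the same data, which is exactly why it pays to be explicit about which one you mean.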

Here are some additional considerations regarding the rules of operation for nulls:

  • When a nullable column participates in an ORDER BY or GROUP BY clause, the returned nulls are grouped either at the high or low end of the sort order depending on the DBMS. But this treats all nulls as equal when we all know they are not; they are unknown.
  • Nulls are considered to be equal when duplicates are eliminated by SELECT DISTINCT or COUNT (DISTINCT column).
  • Depending on the DBMS, a unique index may consider nulls to be equivalent and disallow duplicate entries because of the existence of nulls. Some DBMSes, such as DB2, provide a clause (in this case, WHERE NOT NULL) that allows multiple nulls in an index.
  • For comparison in a SELECT statement, two null columns are not considered equal. When a nullable column participates in a predicate in the WHERE or HAVING clause, the nulls that are encountered cause the comparison to evaluate to UNKNOWN.
  • When a nullable column participates in a calculation, the result is null.
  • Columns that participate in a primary key cannot be null.
  • To test for the existence of nulls, use the special predicate IS NULL in the WHERE clause of the SELECT statement. You cannot simply state WHERE column = NULL. You must state WHERE column IS NULL.
  • It is invalid to test if a column is <> NULL, or >= NULL. These are all meaningless because null is the absence of a value.

Examine these rules closely. ORDER BY, GROUP BY, DISTINCT, and unique indexes consider nulls to be equal and handle them accordingly. The SELECT statement, however, deems that the comparison of null columns is not equivalence, but unknown. This inconsistent handling of nulls is an anomaly that you must remember when using nulls.
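The inconsistency is easier to remember once you have seen it. Here is a small sketch, assuming SQLite through Python's sqlite3 module and a hypothetical one-column table T holding one value and two nulls. Unique indexes are left out because, as noted above, their treatment of nulls depends on the DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one real value and two nulls.
conn.execute("CREATE TABLE T (C CHAR(1))")
conn.executemany("INSERT INTO T VALUES (?)", [("A",), (None,), (None,)])

def count(sql):
    """Run a single-value query and return that value."""
    return conn.execute(sql).fetchone()[0]

# DISTINCT and GROUP BY treat the two nulls as equal: 'A' plus one null.
distinct_rows = len(conn.execute("SELECT DISTINCT C FROM T").fetchall())
groups = len(conn.execute("SELECT C, COUNT(*) FROM T GROUP BY C").fetchall())

# Comparisons do not: "= NULL" is never true, IS NULL is the right test.
equals_null = count("SELECT COUNT(*) FROM T WHERE C = NULL")
is_null = count("SELECT COUNT(*) FROM T WHERE C IS NULL")

# Joining the table to itself on C matches only 'A' with 'A'.
self_matches = count("SELECT COUNT(*) FROM T a JOIN T b ON a.C = b.C")

print(distinct_rows, groups, equals_null, is_null, self_matches)
```

The nulls collapse into one row for DISTINCT and GROUP BY, yet they never match each other (or anything else) in a comparison.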

Here are a couple of other issues to consider when nulls are involved.

Did you know it is possible to write SQL that returns a NULL even if you have no nullable columns in your database? Assume that there are no nullable columns in the EMP table (including SALARY) and then consider the following SQL:

    SELECT  SUM(SALARY)
    FROM    EMP
    WHERE   DEPTNO > 999;

The result of this query will be NULL if no DEPTNO exists that is greater than 999. So it is not feasible to try to design your way out of having to understand nulls!
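You can try this for yourself with a table that has no nullable columns at all. The sketch below assumes SQLite via Python's sqlite3 module, uses SUM as the aggregate, and invents two EMP rows, neither of which has a DEPTNO greater than 999.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Every column is NOT NULL, so no null can ever be stored in EMP.
conn.execute(
    """
    CREATE TABLE EMP
       (EMPNO   INTEGER NOT NULL,
        DEPTNO  INTEGER NOT NULL,
        SALARY  REAL    NOT NULL)
    """
)
conn.executemany(
    "INSERT INTO EMP VALUES (?, ?, ?)",
    [(1, 10, 50000.0), (2, 20, 60000.0)],
)

# No row qualifies, so SUM has nothing to add up and returns NULL.
total = conn.execute(
    "SELECT SUM(SALARY) FROM EMP WHERE DEPTNO > 999"
).fetchone()[0]

# COUNT(*) over the same empty set returns 0, not NULL.
matching_rows = conn.execute(
    "SELECT COUNT(*) FROM EMP WHERE DEPTNO > 999"
).fetchone()[0]

# One common way to turn the NULL into a number is COALESCE.
total_or_zero = conn.execute(
    "SELECT COALESCE(SUM(SALARY), 0) FROM EMP WHERE DEPTNO > 999"
).fetchone()[0]

print(total, matching_rows, total_or_zero)
```

The SUM comes back as None (NULL) even though the table cannot hold a null, so the application code still has to be prepared for one.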

Another troubling issue with NULLs is that some developers have incorrect expectations when using the NOT IN predicate with NULLs. Consider the following SQL:

SELECT C.color
 FROM   Colors AS C 
 WHERE  C.color NOT IN (SELECT P.color 
                        FROM   Products AS P);

If one of the products has its color set to NULL, then the result of the SELECT is the empty set, even if there are colors that no product uses.
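Here is a minimal sketch of that surprise, assuming SQLite via Python's sqlite3 module and made-up rows for Colors and Products. It also shows one common workaround, filtering the nulls out of the subquery; that is offered as an option rather than the only fix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: three colors, two products, one product color unknown.
conn.execute("CREATE TABLE Colors (color TEXT NOT NULL)")
conn.execute("CREATE TABLE Products (id INTEGER NOT NULL, color TEXT)")
conn.executemany(
    "INSERT INTO Colors VALUES (?)", [("red",), ("green",), ("blue",)]
)
conn.executemany(
    "INSERT INTO Products VALUES (?, ?)", [(1, "red"), (2, None)]
)

# The NOT IN query from the text: the null in Products.color makes
# every comparison unknown, so no color qualifies.
unused = [row[0] for row in conn.execute(
    """
    SELECT C.color
    FROM   Colors AS C
    WHERE  C.color NOT IN (SELECT P.color
                           FROM   Products AS P)
    ORDER BY C.color
    """
)]

# Excluding nulls in the subquery gives the answer most people expect.
unused_filtered = [row[0] for row in conn.execute(
    """
    SELECT C.color
    FROM   Colors AS C
    WHERE  C.color NOT IN (SELECT P.color
                           FROM   Products AS P
                           WHERE  P.color IS NOT NULL)
    ORDER BY C.color
    """
)]

print(unused)
print(unused_filtered)
```

The first query returns nothing at all; the second returns blue and green.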


Another issue that pops up when dealing with nulls is that a NULL does not equal a NULL, so extra effort is required to treat them as such. SQL provides a method for comparing columns that could be null, which is supported in DB2:

    IS [NOT] DISTINCT FROM

Before explaining how this clause functions, let’s take a look at the problem it helps to solve. Two columns are not equal if both are NULL, because NULL is unknown and a NULL never equals anything else, not even another NULL. But sometimes you might want to treat NULLs as equivalent. In order to do that, you would have to code something like this in your WHERE clause:

    WHERE COL1 = COL2 OR (COL1 IS NULL AND COL2 IS NULL)

This coding would cause the DBMS to return all the rows where COL1 and COL2 are the same value, as well as all the rows where both COL1 and COL2 are NULL, effectively treating NULLs as equivalent. But this coding, although relatively simple, can be unwieldy and, at least at first blush, unintuitive.

Here comes the IS NOT DISTINCT FROM clause to the rescue. The following clause is logically equivalent to the one above, but perhaps simpler to code and understand:

    WHERE COL1 IS NOT DISTINCT FROM COL2

The same goes for checking a column against a host variable. You might try to code a clause specifying WHERE COL1 = :HV :hvind (host variable and indicator variable). But such a search condition would never be true when the host variable is null, even if its null indicator is set. This is because one null never equals another null. Instead, we’d have to code additional predicates: one to handle the non-null values and two others to ensure that COL1 and :HV are both null. With the introduction of the IS NOT DISTINCT FROM predicate, the search condition could be simplified to just:

    WHERE COL1 IS NOT DISTINCT FROM :HV :hvind

Not only is the IS NOT DISTINCT FROM clause simpler and more intuitive, it is also a Stage 1 predicate, so it can perform well, too.
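To see what null-safe comparison buys you, here is a sketch that contrasts a plain equality test with the longhand form, both between two columns and against a bound parameter (the closest analogue to a host variable in this setting). It assumes SQLite via Python's sqlite3 module and an invented table T. The longhand predicate is used on purpose, because not every SQLite build bundled with Python accepts IS NOT DISTINCT FROM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows: equal values, unequal values, both null, one null.
conn.execute("CREATE TABLE T (COL1 INTEGER, COL2 INTEGER)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?)",
    [(1, 1), (1, 2), (None, None), (None, 3)],
)

def count(where, params=()):
    """Count the rows of T that satisfy the given WHERE condition."""
    sql = f"SELECT COUNT(*) FROM T WHERE {where}"
    return conn.execute(sql, params).fetchone()[0]

# Column against column.
plain_equal = count("COL1 = COL2")
null_safe = count("COL1 = COL2 OR (COL1 IS NULL AND COL2 IS NULL)")

# Column against a parameter that is null (None in Python).
hv = {"hv": None}
plain_param = count("COL1 = :hv", hv)
null_safe_param = count("COL1 = :hv OR (COL1 IS NULL AND :hv IS NULL)", hv)

print(plain_equal, null_safe, plain_param, null_safe_param)
```

Plain equality finds only the (1, 1) row and nothing at all for the null parameter, while the null-safe form also picks up the rows where both sides are null.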


Nulls are clearly one of the most misunderstood features of SQL database systems development. Although nulls can be confusing, you cannot bury your head in the sand and ignore nulls. Understanding what nulls are, and how best to use them, can help you to create usable databases and design useful and correct queries in your database applications.

Posted in NULL, SQL

Who Owns Data?

Who owns data?

This is a complex question that can’t be answered quickly or easily. It requires some thought and you have to break things down in order to even attempt to answer it.

First of all, there is public data and private data. One could say that public data is in the open domain and available to everyone. But what makes data public? Is data you post on Facebook now public because it is available to anyone with a browser or the Facebook app? Well, probably not. It is available only to those that you have shared the data with. But when you put it up on Facebook then Facebook likely owns it.

What about governmental data that is available freely online like that available at USA.gov and data.gov? Well, you can grab that data and use it, but that doesn’t mean you own it, does it?

Then there are all the data governance and privacy laws and regulations that impact who owns what and how it can be used. It can be difficult to fully understand what all of these laws mean and how and when they apply to you and your organization. This is especially important with GDPR compliance looming before us.

But let’s back it up a minute and think just about corporate data. It is not an uncommon question, when working on a new project or application, to ask “who owns this data?” That is an important question to have an answer for! But owns is probably not the correct word.

In my humble opinion, data belongs to the company and thus, the COMPANY is the owner. Each department within an organization ought to be the custodian of the data it generates and uses to conduct its business. Departments are the custodians because they are the ones who decide who has access to their data, must maintain the integrity of the data they use, and ensure that it is viable for making decisions and influencing executives.

Nevertheless, this answer provides only a part of the answer to the question. You really need named individuals as custodians. These can be from the business unit or the IT group supporting the business unit. Generally speaking, if custodians are appointed in IT, they should probably not be application developers or DBAs, but perhaps data analysts or higher-level IT managers.

Application developers are responsible for writing code and DBAs are responsible for the physical database structures and performance. There needs to be a data professional in charge of the accuracy of the actual data in the databases.

Here are some things to consider as you approach your data ownership/custodian planning:

  • Understand the data requirements of all current systems, those developed in-house and those you bought. Be sure that you know all of the data interdependencies of your applications and how one app can impact another.
  • Assess the quality of your existing data in all of your existing systems. It is probably worse than you think it is. Then work on methods and approaches to improve that quality. There are tools and services that can help here.
  • Redesign and re-engineer your systems if you uncover poor data quality in your current applications and databases. You might choose to change vendors, replatform or rehost apps with poor data quality, but if the old data is still required it must be cleansed before using it in the new system.
  • Work on methods to score the quality of data in your systems and tie the performance and bonuses of custodians to the scores.
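As one hypothetical starting point for scoring, completeness (the share of rows in which a column actually has a value) is simple to compute. The sketch below assumes SQLite via Python's sqlite3 module; the CUSTOMER table, its columns, and its rows are invented, and a real scoring method would likely combine several measures rather than rely on this one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented table and rows, used only to illustrate the calculation.
conn.execute(
    "CREATE TABLE CUSTOMER (ID INTEGER NOT NULL, EMAIL TEXT, PHONE TEXT)"
)
conn.executemany(
    "INSERT INTO CUSTOMER VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "555-0100"),
        (2, "b@example.com", None),
        (3, None, None),
        (4, "d@example.com", None),
    ],
)

def completeness(column):
    """Fraction of rows in which the column is not null."""
    # COUNT(column) skips nulls while COUNT(*) counts every row.
    sql = f"SELECT 1.0 * COUNT({column}) / COUNT(*) FROM CUSTOMER"
    return conn.execute(sql).fetchone()[0]

scores = {column: completeness(column) for column in ("EMAIL", "PHONE")}

for column, score in scores.items():
    print(f"{column}: {score:.0%} complete")
```

A score like this can be tracked over time for each custodian's data, which gives the bonus discussion something concrete to point at.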

What do you think? Does any of this make sense? How does your organization approach data ownership and custodians?

Posted in data, Data Quality

A Look at Data Professionals’ Salaries

The annual ComputerWorld IT Salary Survey for 2017 was recently published and it contains a great wealth of interesting data. So, as I’ve done in the past, this post will summarize its findings and report on what is going on with the data-related positions mentioned in the survey. Of course, please go to ComputerWorld for the nitty-gritty details on other positions, as well as a lot of additional salary and professional information.

Overall, the survey reports 3 percent growth in IT pay, with 50 percent of respondents indicating that they are satisfied or very satisfied with their current compensation. That is down from last year, when the number was 54 percent. Clearly, though, IT as a profession seems to be a sound choice: 43 percent expect their organization’s IT headcount to increase and 49 percent expect it to remain the same, while only 7 percent expect a decrease in their company’s headcount.

But all is not rosy. When looking at the amount of work that needs to be done, 56 percent expect IT workload to increase over the next year. If headcount is not rising commensurate with the additional workload, then organizations will expect more work from their IT staff than they did last year.

Nevertheless, 85 percent say they are satisfied or very satisfied with their career in IT.

Now let’s get to the interesting part for data professionals… and that is the salary outlook for specific data jobs.

If you are the manager of a database or data warehousing group, your total compensation grew faster than the norm last year, at 4.1 percent. Average compensation grew from $110,173 to $114,635.

DBA compensation grew 2.9 percent, which was just about the average. Average compensation for DBAs was $104,860, up from $101,907 in 2016.

Database developer/modeler, which is an interesting grouping, grew 2.5 percent from $96,771 in 2016 to $99,235 in 2017.

So things are looking OK, but not stellar for data professionals. Which IT positions grew their salary at the highest percentage? Well, the top of the heap, somewhat surprisingly, was Web Developer which grew at 6.7 percent (to an average total compensation of $76,446). The next highest growth makes a lot of sense, Chief Security Officer, which grew 6.4 percent year over year.

The common career worries looked familiar with keeping skills up-to-date being the most worrisome, followed by flat salaries and matching skills to a good position. And the biggest workplace woe? Not surprisingly, increased IT workload. But stress levels are about the same with 61 percent of respondents indicating that their level of job stress was the same as last year.

What can you do to help grow your salary this year? Well, you might consider aligning your career with one of the hot specialties called out in the survey. The top three tech functions with the highest average compensation in 2017 are cloud computing, ERP and security.

Overall, though, it looks like an IT career is a good thing to pursue… and working with data in some capacity still makes a lot of sense!

Posted in data, DBA, salary

Why isn’t e-mail more easily queryable?

Today’s blog post is just a short rumination on finding stuff in my vast archive of email…

Do you ever wonder why e-mail systems don’t use database management technology?  Don’t you store some of your e-mails for long periods of time?  Do you group them into folders?  But then, isn’t it hard to find anything later?

Anybody who uses email like I do needs to know which folder is which and which e-mail has the information you need in it.  And it isn’t usually obvious from the folder name you gave it (which made sense at the time) or the subject of the e-mail (which might not have anything to do with the actual content of the e-mail you’re looking for).  And sometimes emails get stored in the wrong folder…

I’d sure love to be able to use SQL against my e-mail system, writing something like:


Or something like that.

Wouldn’t you?

Posted in e-mail, SQL

One of the Top Database Blogs on the Web

Very proud to announce that the Data & Technology Today blog was selected as one of the top 60 database blogs on the web by Feedspot.

You can read all about it here – as well as learn about 59 other great database and data-related blogs that you might want to follow.

Posted in data

News from IBM InterConnect 2017

This week I am in Las Vegas for the annual IBM InterConnect conference. IBM touts the event as a way to tap into the most advanced cloud technology in the market today. And that has merit, but there is much more going on here.

If I had to summarize the theme of InterConnect I would say that it is all about cloud, IBM’s Watson, and DevOps. This is evident in terms of the number of sessions being delivered on these topics, as well as the number of vendors and IBM booths in the concourse area devoted to these topics.

But the highlight of the conference for me, so far, was Ginni Rometty’s keynote address on Tuesday. She was engaging and entertaining as she interacted with IBM customers and partners to weave the story of IBM’s cloud and cognitive computing achievements. The session is available for replay on IBMGO and it is well worth your time to watch it if you are at all interested in how some of the biggest and most innovative organizations are using IBM technology to gain competitive advantage.

And let’s not forget that Will Smith – yes, that Will Smith – was part of the general session on Monday. Not surprisingly, he was intelligent and amusing, calling himself an African-American Watson as he described how he used primitive data analytics to review the types of movies that were most successful as he planned his acting career. My favorite piece of advice he offered was something that he learned as he moved from music to acting. When he was asked if he had ever acted before (he hadn’t) he said “Of course,” and it led to him getting cast in the mega-hit sitcom The Fresh Prince of Bel-Air. His advice? “If someone asks if you have ever done something just say ‘yes’ and figure it out later.” He had a lot more to say, but let me send you here if you are interested in reading more about Will.

Of course, there is a lot more going on here than just what is happening in the keynote and general sessions. Things I’ve learned this week include:

  • DevOps is as much about business change as technology change
  • The largest area of growth for DevOps is now on the mainframe (according to Forrester Research)
  • Some companies are bringing college grads up to proficiency in mainframe COBOL in less than a month using a modern IDE
  • Networking is the hidden lurking problem in many cloud implementations
  • The mainframe is not going away (I knew this, but it was good to hear a Forrester analyst say it)
  • And a lot more

But that is enough for now. So to conclude, I’d like to end with a quote from Ginni Rometty that I think all of us in IT should embrace: “Technology is never good or bad; it is what you do with it that makes a difference.”

Let’s all get to work and do good things with technology!

Posted in cloud, DevOps, IBM, Watson

Inside the Data Reading Room – Analytics Edition

If you are a regular reader of this blog you know that, from time to time, I review data-related books. Of course, it has been over a year since the last book review post, so this post is long overdue.

Today, I will take a quick look at a couple of recent books on analytics that might pique your interest. First up is The Data and Analytics Playbook by Lowell Fryman, Gregory Lampshire and Dan Meers (2017, Morgan Kaufmann, ISBN 978-0-12-802307-5).

This book is written as a guide to proper implementation of data management methods and procedures for modern data usage and exploitation.

The first few chapters lay the groundwork and delve into the need for a new approach to data management that embraces analytics. Then, in Chapter 3, the authors guide the reader through steps to assess their current conditions, controls and capabilities with regard to their data. The material here can be quite helpful in gauging where your organization falls in terms of data maturity. Chapter 4, which chronicles the detailed activities involved in building a data and analytics framework, comprises about a quarter of the book, and this chapter alone can give you a good ROI on your book purchase.

Chapter 8 is also well done. It addresses data governance as an operations process, giving advice and a framework for successful data governance. If you are at all involved in your organization’s data management and analytics practice, do yourself a favor and grab a copy of this book today.

The second book I will cover today is a product-focused book on IBM’s Watson Analytics product. Most people have heard of IBM’s Watson because of the Jeopardy challenge. But if your only knowledge of Watson is how it beat Jeopardy champions at the game several years ago, then you need to update what you know!

So what book can help? How about Learning IBM Watson Analytics by James D. Miller (2016, Packt Publishing, ISBN 978-1-78588-077-3)?

This short book can help you to understand what Watson can do for your organization’s analytics. It shows how to access and configure Watson and to develop use cases to create solutions for your business problems.

If you are a nascent user of Watson, or are just looking to learn more about what Watson can do, then this is a superb place to start. Actually, if you learn best through books, then this is the only place to start because it is currently the only book available on IBM Watson Analytics.

As with any technology book that walks you through examples and screen shots, as the product matures over time, things may look different when you actually use Watson. But that is a small issue that usually won’t cause distraction. And with all of the advice and guidance this book offers in terms of designing solutions with Watson, integrating it with other IBM solutions, and more, the book is a good place to start your voyage with Watson.

Hopefully, you’ll find these two books as interesting and worthwhile as I did!

Posted in analytics, book review, books, DBA, Watson

Time to Plan Your Trip to IBM InterConnect 2017

I am looking forward to attending this year’s IBM InterConnect conference in Las Vegas, NV the week of March 19-23, 2017. And after reading my blog post today I bet you will be interested in attending, too!


The first thing you will notice is that IBM InterConnect covers a plethora of technical topics, including some of the hottest and most important for your business. If you attend the conference you can learn about Hybrid Cloud, Process Transformation, Integration, Internet of Things, DevOps, IT Service Management, Security, Data Management, and more. There are educational presentations as well as hands-on sessions that allow you to build your conference experience how you’d like, using the learning techniques that best suit you.

And there are a lot of learning opportunities! IBM InterConnect has over 2,000 sessions, 200 exhibitors, hundreds of labs, certification opportunities, as well as the ability to network with other IT professionals from all around the world.

For me, there are several sessions that I’m very much looking forward to attending. The Continuous Delivery keynote on March 20th promises to inform and educate on DevOps best practices including a roadmap for IBM’s UrbanCode. On Tuesday I’m excited about the session on How Watson “Really Works”… I mean, who wouldn’t be interested in learning more about the AI, natural language and machine learning capabilities of IBM Watson? And Wednesday offers an intriguing session for mainframers like me – “Why z/OS is a Great Platform for Developing and Hosting APIs.”

Of course, there are a lot of additional sessions that I plan to attend, but I doubt anybody is interested in a rundown of my entire agenda. Especially with so much variety and choice available to attendees this year. And if you get stuck choosing from all the great sessions that are available, this year you can solicit Watson’s help recommending sessions as you build your agenda. I tried it and it was interesting and helpful to see what Watson chose for me.

So take a look at what IBM InterConnect has to offer this year. And if you plan on attending I hope we get a chance to meet and discuss our experiences at the conference. See you in Vegas!

Posted in certification, cloud, education, enterprise computing, IoT

Data Technology Today’s 2016 Year in Review

Well, another year has come and gone and I thought it might be interesting to share a bit about this blog’s activity in 2016. It was an active year that saw 17 new posts, down a bit from 2015, but still averaging more than a post a month.

Posts on the blog were viewed 47,264 times by 36,870 visitors, meaning each visitor averaged 1.28 views.

The most popular post in 2016 was actually first posted in 2011: An Introduction to Database Design: From Logical to Physical was viewed 10,575 times in 2016. Obviously database design is an interesting topic, at least for the readers of this blog!

The second most popular post in 2016 was On The Importance of Database Backup and Recovery, which was first posted in 2014. The most popular post actually written in 2016 was published in December, which is late in the year to lead the year, but evidently people are interested in A Useful Guide to Data Fundamentals from Fabian Pascal. As well they should be!

And the blog gets read all over the world, as shown in the Top Ten Countries visiting in 2016 below:


Yes, most of my readers are from the United States, but I’m proud of the following I have in India (and across the world).

So to end this brief synopsis of 2016, thank you to all of my regular readers – please keep visiting and suggesting more topics for 2017 and beyond. And if this is your first visit to the blog, welcome. Take some time to view the historical content – there are several informative posts that are popular every year… and keep checking back for new content on data, database, and related topics!

Posted in backup & recovery, DBA, review