Happy Holidays 2017!

Just a short post to end the year wishing all of my readers everywhere a very happy holiday season – no matter which holidays you celebrate, I hope they bring you joy, contentment, and let you recharge for an even better year next year!


So enjoy the holidays and come back in January as we continue to explore the world of data and database technology…

Posted in DBA | 1 Comment

SQL Coding and Tuning for Efficiency

Coding and tuning SQL is one of the most time consuming tasks for those involved in coding, managing and administering relational databases and applications. There can be literally thousands of individual SQL statements across hundreds of applications that access your many production databases. Although the DBA is ultimately responsible for ensuring performance of the database environment, there is quite a lot that application developers can do to help out. Frequently, developers are only concerned with getting the right answer (which is, of course, required) but not with getting it in the most efficient way.

When coding SQL statements, the following steps need to occur for each and SQL statement that you write:

  1. Identify the business data requirements
  2. Ensure that the required data is available within existing databases
  3. Translate the business requirements into SQL
  4. Test the SQL for accuracy and results
  5. Review the access paths for performance
  6. Tweak or re-write the SQL for better access paths
  7. Possibly code optimization hints
  8. Repeat steps 4 through 7 until performance is within the required range.
  9. Repeat step 8 whenever performance problems arise or a new DBMS version is installed
  10. Repeat entire process whenever business needs change

SQL tuning is a complex, time consuming, and sometimes error-prone process. Furthermore, it requires cooperation and communication between the business users and application programmers for the first three steps, and between the application programmers and the DBA for the remaining steps.

It is imperative that developers learn more about SQL performance and take steps to be proactive about coding their programs with performance in mind. This is especially the case in the modern DevOps, continuous delivery, agile world where code is moved to production rapidly and numerous times a week.

If developers are not concerned about performance – or only marginally so – then it is a certainty that your organization will experience performance problems in production. There are simply not enough DBAs and performance analysts available to examine every program before it is moved to production these days.

How can you become a performance-focused developer? Here are a few suggestions:

  • Read the manuals for your DBMS of choice (Oracle, Db2, etc.), especially the one that focus on performance. Find the SQL-related items and concentrate there, but the more you understand about all elements of database performance the better coder you will be.
  • Purchase books on SQL performance. There are several good ones that talk about performance in a heterogeneous manner and there are also many books that focus on SQL for each DBMS.
  • Talk to your DBAs about SQL techniques and methods that they have found to be good for performance.
  • Learn how to explain your SQL statements and interpret the access path information either in the plan tables or in a visual explain tool.
  • Use all of the performance tools at your disposal. Again, talk to the DBAs to learn what tools are available at your site.

And always be tuning!

Posted in books, performance, SQL | 1 Comment

IT Through the Looking Glass

Sometimes I look for inspiration in what may seem — at first glance — to be odd places.  For example, I think the Lewis Carroll “Alice in Wonderland” books offer sage advice for the IT industry.  I mean, how many times have you watched a salesman grin as he spoke and then expected him to simply disappear the way the Cheshire Cat does?

Which Way Should We Go?

But perhaps that is a bad metaphor.  The Cheshire Cat was actually a pretty smart cookie (no disrespect to salespeople intended)!   Recall the passage where Alice comes to a fork-in-the-road and first meets the Cheshire Cat up in a tree.  She asks, “Would you tell me, please, which way I ought to go from here?”  And the cat responds, “That depends a good deal on where you want to go.”  Alice, in typical end-user fashion replies “It doesn’t much matter where.”  Causing the cat to utter words that we should all take to heart — “Then it doesn’t matter which way you go!”

Of course, you could follow Yogi Berra’s advice.  He said, “When you come to a fork in the road, take it!”  But, then where would that leave you.  The bottom line is that planning and understanding are both required and go hand in hand with one another.

If you have no plan for where you want to go, then at best you will just be going around in circles; at worst, you’ll be going backward!  Planning and keeping abreast of the latest technology is imperative in the rapidly changing world of information technology (IT).  As Alice might put it, IT just keep getting curiouser and curiouser.

It Means What I Mean!

But a true understanding of the IT industry, which is required to accurately and successfully plan, can be difficult to achieve because, invariably we will stumble across Humpty Dumptys.

Humpty Dumpty

You remember Humpty, don’t you?  He’s that egg who sits on the wall and spouts off about everything under the sun, sometimes without the requisite knowledge to back up his statements.

Humpty Dumpty is famous for saying “When I use a term, it means whatever I choose it to mean — nothing more, and nothing less.”  There are too many Humpty Dumptys out there.  Perhaps they have good intentions, but we all know what road is paved with those, don’t we?

There are too many people in the IT world laboring under false impressions and definitions. Whenever a new trend or technology begins to gain traction, then you can bet that almost every vendor will claim that their product should be a part of the trend. Even if the trend is completely new and the product in question was created 30 years ago!

Of course, products can be adapted and trends can change. So what is the point of this little blog post? I guess it would be to keep up with trends, don’t believe everything you read, always be learning and create your plans based on sound research.

Does anybody out there disagree with that?


Posted in business planning, IT | Leave a comment

Gaining Value from Analytics

Data volume and higher transaction velocities associated with modern applications are driving change into organizations across all industries. This is happening for a number of reasons. Customer and end user expectations for interacting with computerize systems has changed. And technology is changing to accommodate these requirements. Furthermore, larger and larger amounts of data are being generated and made available, both internally and externally to our businesses. Therefore, the desire and capability to store large amounts of data continues to expand.

One clear goal of most organizations is to be able to harness all of this data – regardless of its source or size – and to glean actionable insight from it. This is known as analytics. Advanced analytical capabilities can be used to drive a wide range of applications, from operational applications such as fraud detection to strategic analysis such as predicting patient outcomes. Regardless of the applications, advanced analytics provides intelligence in the form of predictions, descriptions, scores, and profiles that help organizations better understand behaviors and trends.

Furthermore, the desire to move up the time-to-value for analytics projects will result in a move to more real-time event processing. Many use cases can benefit from early detection and response, meaning that identification needs to be as close to real time as possible. By analyzing reams of data and uncovering patterns, intelligent algorithms can make reasonably solid predictions about what will occur in the future. This requires being adept enough to uncover the patterns before changes occur. This does not always have to happen in real time.

Issues in Deploying Advanced Analytics

When implementing an analytics project it is not uncommon to encounter problems along the way. One of the first issues that needs to be addressed when adopting analytics in the cognitive era is having organization leaders who will embrace the ability to make decisions based on data instead of gut feelings based on the illusion of having data. Things change so fast these days that it is impossible for humans to keep up with all of the changes. Cognitive computing applications that rely on analytics can ingest and understand vast amounts of data and keep up with the myriad of changes occurring daily…if not hourly. Armed with advice that is based on a thorough analysis of up-to-date data, executives can make informed decisions instead of what amounts to the guesses they are making today.

However, most managers are used to making decisions based on their experience and intuition without necessarily having all of the facts. When analytics-based decision making is deployed management can feel less involved and might balk. Without the buy-in at an executive level, analytics projects can be very costly without delivering an ROI, because the output (which would deliver the ROI) is ignored.

Another potential difficulty involves managing and utilizing large volumes of data. Businesses today are gathering and storing more data than ever before. New data is created during customer transactions and to support product development, marketing, and inventory. And many times additional data is purchased to augment existing business data. This explosion in the amount of data being stored is one of the driving forces behind analytics. The more data that can be processed and analyzed, the better the advanced analysis can be at finding useful patterns and predicting future behavior.

However, as data complexity and volumes grow, so does the cost of building analytic models. Before real modeling can happen, organizations with large data volumes face the major challenge of getting their data into a form from which they can extract real business information. One of the most time-consuming steps of analytic development is preparing the data. In many cases, data is extracted, and a subset of this data is used to create the analytic data set where these subsets are joined together, merged, aggregated, and transformed. In general, more data is better for advanced analytics.

There are two aspects to “more data”: (1) data can increase in depth (more customers, transactions, etc.), and (2) data can grow in width (where subject areas are added to enhance the analytic model). At any rate, as the amount of data expands, the analytical modeling process can elongate. Clearly performance can be an issue.

Real-time analytics is another interesting issue to consider. The adjective real-time refers to a level of responsiveness that is immediate or nearly immediate. Market forces, customer requirements, governmental regulations, and technology changes collectively conspire to ensure that data that is not up-to-date is not acceptable. As a result, today’s leading organizations are constantly working to improve operations and with access to and analysis of real-time data.

For example, consider the challenge of detecting and preventing fraud. Each transaction must be analyzed to determine its validity. The organization waits for approval while this is done in real-time. But if you err on the side of safety, valid transactions may be declined which will cut into profit and perhaps more importantly, upset your customer. The advanced analytics approach leverages predictive analysis to scrutinize current transactions along with historical data to ensure transactions that may appear suspicious aren’t the norm for this customer. The challenge is doing this in real-time.

Nimble organizations need to assess and respond to events in real-time based on up-to-date and accurate information, rules, and analyses. Real-time analytics is the use of, or the capacity to use, all available enterprise data and resources when they are needed. If, at the moment information is created (or soon thereafter) in operational systems, it is sensed and acted upon by an analytical process, real-time analytics have transpired.

As good as real-time analytics sounds, it is not without its challenges to implement. One such challenge is reducing the latency between data creation and when it is recognized by analytics processes.

Time-to-market issues can be another potential pitfall of an advanced analytics project. A large part of any analytical process is the work involved with gathering, cleansing, and manipulating data required as input to the final model or analysis. As much of 60% to 80% of the man-effort during a project goes toward these steps. This up-front work is essential though to the overall success of any advanced analytics project.

Technology Considerations

From a technology perspective, managing the boatload of data and the performance of operations against that data can be an issue. Larger organizations typically rely on a mainframe computing environment to process their workload. But even in these cases the mainframe is not the only computing platform in use. And the desire to offload analytics to other platforms is often strong. However, for most mainframe users, most of the data resides on the mainframe. If analytics is performed on another platform moving large amounts of data to and from the mainframe can become a bottleneck. Good practices, and good software will be needed to ensure that efficient and effective data movement is in place.

But before investing in a lot of data movement off of the mainframe, consider evaluating the cost of keeping the data where it is and moving the processes to it (the analytics) versus the cost of moving the data to the process. Usually, the former will be more cost effective.

Taking advantage of more in-memory processes can also be an effective approach for managing analytical tasks. Technologies like Spark, which make greater use of memory to store and process data, are gaining in popularity. Of course, there are other in-memory technologies worth pursuing as well.

Another technology that is becoming more popular for analytics is streaming data software. Streaming involves the ingestion of data – structured or unstructured – from arbitrary sources and the processing of it without necessarily persisting it. This is contrary to our common methodology of storing all data on disk.

Although any digitized data is fair game for stream computing, it is most common for analyzing measurements from devices. As the data streams it is analyzed and processed in a problem-specific manner. The “sweet spot” for streaming is situations in which devices produce large amounts of instrumentation data on a regular basis. The data is difficult for humans to interpret easily and is likely to be too voluminous to be stored in a database somewhere. Examples of types of data that are well-suited for stream computing include healthcare, weather, telephony, stock trades, and so on.

By analyzing large streams of data and looking for trends, patterns, and “interesting” data, stream computing can solve problems that were not practical to address using traditional computing methods. To put it in practical terms, think about your home fire detectors. These devices are constantly up and running, waiting for a condition. When fire or smoke is detected, an alarm is sounded. Now if this was to be monitored remotely, you wouldn’t want to store all of the moments in time when there was no fire… but you care a lot about that one piece of data when the fire is detected, right?

Consider a healthcare example. One healthcare organizations is using an IBM stream computing product, InfoSphere Streams, to help doctors detect subtle changes in the condition of critically ill premature babies.  The software ingests a constant stream of biomedical data, such as heart rate and respiration, along with clinical information about the babies.  Monitoring premature babies as a patient group is especially important because certain life-threatening conditions, such as infection, may be detected up to 24 hours in advance by observing changes in physiological data streams. The biomedical data produced by numerous medical instruments cannot be monitored manually nor can a never-ending stream of values for multiple patients be stored long term.

But the stream of healthcare data can be constantly monitored with a stream computing solution. As such, many types of early diagnoses can be made that would take medical professionals much longer to make. For example, a rhythmic heartbeat can indicate problems (like infections); a normal heartbeat is more variable. Analyzing an ECG stream can highlight this pattern and alert medical professionals to a problem that might otherwise go undetected for a long period. Detecting the problem early can allow doctors to treat an infection before it causes great harm.

A stream computing application can get quite complex. Continuous applications, composed of individual operators, can be interconnected and operate on multiple data streams. Again, think about the healthcare example. There can be multiple streams (blood pressure, heart, temperature, etc.), from multiple patients (because infections travel from patient to patient), having multiple diagnoses.

The Bottom Line

There are many new and intriguing possibilities for analytics that require an investment in learning and new technology. But the return on the investment is potentially quite large in terms of gaining heretofore unknown insight into your business, and also in better servicing your customers. After all, that is the raison d’être for coming to work each day!

Posted in analytics, Big Data | 1 Comment

Hey, DBAs… Should You Be a Jack-of-all-Trades?

You know what a Jack-of-All-Trades is, don’t you? It is that person that always seems to have know-how on a topic or project that is helpful. They have a comprehensive knowledge of all things data- and IT-related and they are always helpful to have around. It is my assertion that DBAs need to be Jacks-of-all-Trades (and masters of some trades).

In other words, most successful DBAs have to diversify. Why?

Databases interact with almost every component of the IT infrastructure. They are at the center of modern applications. If the DBMS fails, applications fail, and if applications fail, business can come to a halt. And if business comes to a halt often enough, the entire business can fail. Database administration is therefore critical to the ongoing success of modern business.

The typical IT infrastructure of today comprises many tools:

  • Programming languages and environments such as COBOL, Python, C/C++, Assembler, PHP, Java, and many others
  • Database and process design tools such as ERwin, ER-Studio, and others
  • Database Management Systems like Oracle, Db2, SQL Server, MongoDB, and many others
  • Non-DBMS data set and file storage techniques such as BSAM, VSAM, and B-tree
  • Big Data platforms such as Hadoop, Spark, and many others
  • Transaction processing systems such as CICS and Tuxedo
  • Message queueing software such as MQSeries and MSMQ
  • Networking software and protocols such as SNA, VTAM, and TCP/IP
  • Networking hardware such as bridges, routers, hubs, and cabling
  • Multiple operating systems such as Windows, Mac OS X, z/OS, UNIX, Linux, and perhaps others
  • Data storage hardware and software such as enterprise storage servers, Microsoft SMS, IBM DFHSM, storage area networks (SANs), and NAS
  • Operating system security packages such as RACF, ACF2, and Kerberos
  • Other types of storage hardware such as tape machines, silos, and solid state (memory-based) storage
  • Database administration tools
  • Systems management tools and frameworks
  • Operational control software such as batch scheduling software and job-entry subsystems
  • Virtual machines and containers and their management and orchestration
  • DevOps solutions for continuous software delivery like Chef, Puppet, GitHub, Ansible, and more
  • Software distribution solutions for implementing new versions of system software across the network
  • Internet and Web-enabled databases and applications
  • Object-oriented and component-based development technologies and techniques such as CORBA, COM, OLE DB, ADO, and EJB
  • Tablets and SmartPhones such as the iPad, iPhone, and many Android devices.

And I’m quite sure that this is an incomplete list. But you get the idea. It is impossible to become an expert in all of these technologies, but the DBA should have some knowledge of each of these areas and how they interrelate. Even more importantly, the DBA should have the phone numbers of experts to contact in case any of the associated software and hardware causes database access or performance problems.

So, being a DBA is sorta like structuring a well-invested financial portfolio: to be successful at either, you will need to diversify!

Now all of this is not to say that it is impossible to earn a living as a specialist DBA, such as a performance expert or a backup/recovery specialist. But it will be more difficult to spend a career as a specialist. There are far fewer specialist opportunities than there are for well-rounded, Jacks-of-all-Trades (and Jills-of-all-Trades, too)!

What does all of this mean? It means you should take some time to learn something new today. Pick an area in the above bulleted list where you are not strong and see what you can teach yourself. There is a lot of good (and often free) information available on the web… and you can always augment your learning experience with books.

Good luck rounding out your knowledge… and if you have a moment, share with us here the steps you have taken to keep learning and expanding what you know.

Posted in DBA | Leave a comment

Have You Thought About the Impact of eDiscovery on Your Data Management Strategy?

When thinking about data management it is unlikely that your first thought is about legal requirements. Nevertheless, the legal side of data management must be considered in this day and age of regulatory compliance.

To some, the sheer volume and nature of all the sweeping regulations are too mind-boggling to fully digest. But with the EU GDPR quickly coming down tracks, it makes sense to discuss a few of the issues that will impact your databases and data management policies.

First of all, ensuring compliance requires a collaborative effort between business users, IT, and your legal department. This can prove to be a challenge because these three disparate groups are quite distinct and rarely communicate collectively. IT talks to legal only when they have to – and that is usually just to get approval on contract language for software purchase. IT and business communicate regularly (at least they should), but perhaps not as effectively as they might. But all three are required:

  • Business: must understand the legal requirements imposed on their data and systems as dictated in regulations
  • Legal: must be involved to interpret the legal language of the regulations and ensure that the business is taking proper steps to protect itself
  • IT: must be involved to implement the policies and procedures to enact the technology to support the regulatory mandates

Organizations need to map and categorize their business data in accordance with how each data element is impacted by regulations. We need to be able to answer questions like: Which data elements are under the control of which regulation? And what does the regulation require in the way we manage that data?

Once mapped, controls and policies need to be enacted that enforce compliance with the pertinent regulations. This can require better protection and security, enforce longer data retention periods, impose stricter privacy sanctions, mandate improved data quality practices, and so on.

One of the issues that should be factored into the equation by data management professionals is preparation for e-discovery. Yes, regulations mandate that we retain data longer, but there are rules and regulations that dictate when and how organizations will need to access and produce data that is retained, too. I mean, why keep that data around if there is no need ever to see it again?

The ability to produce retained data upon request is typically driven by lawsuits. You probably can recall examples of courtroom showdowns on television where truckloads of paper documents were required during the discovery process of the lawsuit. But times have changed. Increasingly, the data required during the discovery process is electronic, not written. That is, the data is stored on a computer, and much of that data is stored in a database management system.

Which brings me to the Federal Rules of Civil Procedure (FRCP), which are the rules used by US district courts to govern legal proceedings. One of the items in this set of rules dictates policies governing discovery. Discovery is the phase of a lawsuit before the trial occurs during which each party can request documents and other evidence from other parties or can compel the production of evidence.

The FRCP has been modernized and one of the key changes focuses on electronic documents: “A party who produces documents for inspection shall produce them . . . as they are kept in the usual course of business…” So clearly this change compels organizations to improve their ability to produce electronic data.

Another aspect of the FRCP deals with safe harbor from sanctions arising from spoliation. According to this section, “absent exceptional circumstances, a court may not impose sanctions under these rules on a party for failing to provide electronically stored information as a result of the routine, good faith operation of an electronic information system.” Basically, this section shines a spotlight on the need for organizations to develop a clearly articulated, well-executed, and uniformly enforced records retention program. And that program should include database data. Instituting policies and procedures for how data is treated for long-term retention can provide some level of protection from “adverse inference” rulings arising from spoliation.

There are likely to be additional implications arising from manipulating your data management standards to comply with the FRCP, especially when coupled with industry trends such as big data causing more and more data to be retained, the growing number of data breaches and the ever-increasing regulations being voted into law by federal and state governments. It means that we will be forced to treat data as the corporate asset that it is — instead of just saying that we treat it that way.

Data governance programs are becoming more popular as corporations work to comply with more and stricter governmental regulations. A data governance program oversees the management of the availability, usability, integrity, and security of enterprise data. A sound data governance program includes a governing body or council, a defined set of procedures, and a plan to execute those procedures.

So an organization with a strong data governance practice will have better control over its information. When data management is instituted as an officially sanctioned mandate of an organization data is treated as an asset. That means data elements are defined in business terms, data stewards are assigned, data is modeled and analyzed, metadata is defined, captured and managed, and data is archived for long-term data retention.

All of this should be good news to data professionals who have wanted to better define and use data within their organizations. That is, the laws are finally catching up with what we knew our companies should have been doing all along.

Posted in compliance, data, data breach, Data Growth | Leave a comment

What the Null? Handling Missing Information in Database Systems

In relational database systems, a null represents missing or unknown information at the column level. A null is not the same as 0 (zero) or blank. Null means no entry has been made for the column and it implies that the value is either unknown or inapplicable.

With any relational DBMS that supports nulls you can use them to distinguish between a deliberate entry of 0 (for numerical columns) or a blank (for character columns) and an unknown or inapplicable entry (NULL for both numerical and character columns).

Nulls sometimes are inappropriately referred to as “null values.” Using the term value to describe a null is inaccurate because a null implies the lack of a value. Therefore, simply use the term null or nulls (without appending the term “value” or “values” to it).

Most RDBMSes represent null in a “hidden” column (or storage field) that is associated with each nullable column. A common name for this field is an indicator (such as in DB2). An indicator is defined for each column that can accept nulls. The indicator variable is transparent to the end user, but must be provided for when programming in a host language (such as Java or COBOL).

Every column defined to a table must be designated as either allowing or disallowing nulls. A column is defined as nullable – meaning it can be set to NULL – in the table creation DDL. Null is typically the default if nothing is specified after the column name. To prohibit the column from being set to NULL you must explicitly specify NOT NULL after the column name. In the following sample table, COL1 and COL3 can be set to null, but not COL2, COL4, or COL5:

    COL2   CHAR(10) NOT NULL,
    COL3   CHAR(5),
    COL5   TIME     NOT NULL);

What Are The Issues with Null?

The way in which nulls are processed usually is not intuitive to folks used to yes/no, on/off, thinking. With null data, answers are not true/false, but true/false/unknown. Remember, a null is not known. So when a null participates in a mathematical expression, the result is always null. That means that the answer to each of the following is NULL:

  • 5 + NULL
  • NULL / 501324
  • 102 – NULL
  • 51235 * NULL
  • NULL**3
  • NULL/0

Yes, even that last one is null, even though the mathematician in us wants to say “error” because of division by zero. So nulls can be tricky to deal with.

Another interesting aspect of nulls is that the AVG, COUNT DISTINCT, SUM, MAX, and MIN functions omit column occurrences set to null. The COUNT(*) function, however, does not omit columns set to null because it operates on rows. Thus, AVG is not equal to SUM/COUNT(*) when the average is being computed for a column that can contain nulls. To clarify with an example, if the COMM column is nullable, the result of the following query:

     FROM    EMP;

is not the same as for this query:

     FROM    EMP;

Instead, we would have to code the following to be equivalent to AVG(COMM):

     FROM    EMP;

When the column is added to the COUNT function the DBMS no longer counts rows, but instead counts column values (and remember, a null is not a value, but the lack of a value).

But perhaps the more troubling aspect of this treatment of nulls is “What exactly do the results mean?” Shouldn’t a function that processes any NULLs at all return an answer of NULL, or unknown? Does skipping all columns that are NULL return a useful result? I think what is really needed is an option for these functions when they operate on nullable columns. Perhaps a switch that would allow three different modes of operation:

  1. Return a NULL if any columns were null, which would be the default
  2. Operate as it currently does, ignoring NULLs
  3. Treat all NULLs as zeroes

At least that way, users would have an option as to how NULLs are treated by functions. But this is not the case, so to avoid confusion, try to avoid allowing nulls in columns that must be processed using these functions whenever possible.

Here are some additional considerations regarding the rules of operation for nulls:

  • When a nullable column participates in an ORDER BY or GROUP BY clause, the returned nulls are grouped either at the high or low end of the sort order depending on the DBMS. But this treats all nulls as equal when we all know they are not; they are unknown.
  • Nulls are considered to be equal when duplicates are eliminated by SELECT DISTINCT or COUNT (DISTINCT column).
  • Depending on the DBMS, a unique index may consider nulls to be equivalent and disallows duplicate entries because of the existence of nulls. Some DBMSes, such as DB2, provide a clause (in this case, WHERE NOT NULL) that allows multiple nulls in an index.
  • For comparison in a SELECT statement, two null columns are not considered equal. When a nullable column participates in a predicate in the WHERE or HAVING clause, the nulls that are encountered cause the comparison to evaluate to UNKNOWN.
  • When a nullable column participates in a calculation, the result is null.
  • Columns that participate in a primary key cannot be null.
  • To test for the existence of nulls, use the special predicate IS NULL in the WHERE clause of the SELECT statement. You cannot simply state WHERE column = NULL. You must state WHERE column IS NULL.
  • It is invalid to test if a column is <> NULL, or >= NULL. These are all meaningless because null is the absence of a value.

Examine these rules closely. ORDER BY, GROUP BY, DISTINCT, and unique indexes consider nulls to be equal and handle them accordingly. The SELECT statement, however, deems that the comparison of null columns is not equivalence, but unknown. This inconsistent handling of nulls is an anomaly that you must remember when using nulls.

Here are a couple of other issues to consider when nulls are involved.

Did you know it is possible to write SQL that returns a NULL even if you have no nullable columns in your database? Assume that there are no nullable columns in the EMP table (including SALARY) and then consider the following SQL:


The result of this query will be NULL if no DEPTNO exists that is greater than 999. So it is not feasible to try to design your way out of having to understand nulls!

Another troubling issue with NULLs is that some developers have incorrect expectations when using the NOT IN predicate with NULLs. Consider the following SQL:

SELECT C.color
 FROM   Colors AS C 
 WHERE  C.color NOT IN (SELECT P.color 
                        FROM   Products AS P);

If one of the products has its color set to NULL, then the result of the SELECT is the empty set, even if there are colors to which no other product is set.


Another issue that pops up when dealing with nulls is that a NULL does not equal a NULL, so extra effort is required to treat them as such. SQL provides a method for comparing columns that could be null, which is supported in DB2:


Before explaining how this clause functions, let’s take a look at the problem it helps to solve. Two columns are not equal if both are NULL, because NULL is unknown and a NULL never equals anything else, not even another NULL. But sometimes you might want to treat NULLs as equivalent. In order to do that, you would have to code something like this in your WHERE clause:


This coding would cause the DBMS to return all the rows where COL1 and COL2 are the same value, as well as all the rows where both COL1 and COL2 are NULL, effectively treating NULLs as equivalent. But this coding, although relatively simply, can be unwieldy and perhaps, at least not at first blush, unintuitive.

Here comes the IS NOT DISTINCT FROM clause to the rescue. The following clause is logically equivalent to the one above, but perhaps simpler to code and understand:


The same goes for checking a column against a host variable. You might try to code a clause specifying WHERE COL = :HV :hvind (host variable and indicator variable). But such a search condition would never be true when the value in that host variable is null, even if the host variable contains a null indicator. This is because one null does not ever equal another null. Instead we’d have to code additional predicates: one to handle the non-null values and two others to ensure both COL1 and the :HV are both null. With the introduction of the IS NOT DISTINCT FROM predicate, the search condition could be simplified to just:


Not only is the IS NOT DISTINCT FROM clause simpler and more intuitive, it is also a Stage 1 predicate, so it can perform well, too.


Nulls are clearly one of the most misunderstood features of SQL database systems development. Although nulls can be confusing, you cannot bury your head in the sand and ignore nulls. Understanding what nulls are, and how best to use them, can help you to create usable databases and design useful and correct queries in your database applications.

Posted in NULL, SQL | Leave a comment

Who Owns Data?

Who owns data?

This is a complex question that can’t be answered quickly or easily. It requires some thought and you have to break things down in order to even attempt to answer it.

First of all, there is public data and private data. One could say that public data is in the open domain and available to everyone. But what makes data public? Is data you post on Facebook now public because it is available to anyone with a browser or the Facebook app? Well, probably not. It is available only to those that you have shared the data with. But when you put it up on Facebook then Facebook likely owns it.

What about governmental data that is available freely online like that available at USA.gov and data.gov? Well, you can grab that data and use it, but that doesn’t mean you own it, does it?

Then there are all the data governance and privacy laws and regulations that impact who owns what and how it can be used. It can be difficult to fully understand what all of these laws mean and how and when they apply to you and your organization. This is especially important with GDPR compliance looming before us.

But let’s back it up a minute and think just about corporate data. It is not an uncommon question, when working on a new project or application, to ask “who owns this data?” That is an important question to have an answer for! But owns is probably not the correct word.

In my humble opinion, data belongs to the company and thus, the COMPANY is the owner. Each department within an organization ought to be the custodian of the data it generates and uses to conduct its business.  Departments are the custodian because they are the ones who decide who has access to their data, must maintain the integrity of the data they use, and ensure that it is viable for making decisions and influencing executives.

Nevertheless, this answer provides only a part of the answer to the question. You really need named individuals as custodians. These can be from the business unit or the IT group supporting the business unit. Generally speaking, if custodians are appointed in IT, they should probably not be application developers or DBAs, but perhaps data analysts or higher-level IT managers.

Application developers are responsible for writing code and DBAs are responsible for the physical database structures and performance. There needs to be a data professional in charge of the accuracy of the actual data in the databases.

Here are some things to consider as you approach your data ownership/custodian planning:

  • Understand the data requirements of all current systems, those developed in-house and those you bought. Be sure that you know all of the data interdependencies of your applications and how one app can impact another.
  • Assess the quality of your existing data in all of your existing systems. It is probably worse than you think it is. Then work on methods and approaches to improve that quality. There are tools and services that can help here.
  • Redesign and re-engineer your systems if you uncover poor data quality in your current applications and databases. You might choose to change vendors, replatform or rehost apps with poor data quality, but if the old data is still required it must be cleansed before using it in the new system.
  • Work on methods to score the quality of data in your systems and tie the performance and bonuses of custodians to the scores.

What do you think? Does any of this make sense? How does your organization approach data ownership and custodians?

Posted in data, Data Quality | Leave a comment

A Look at Data Professional’s Salaries

The annual ComputerWorld IT Salary Survey for 2017 was recently published and it contains a great wealth of interesting data. So, as I’ve done in the past, this post will summarize its findings and report on what is going on with the data-related positions mentioned in the survey. Of course, please click on the link above to go to ComputerWorld for the nitty-gritty details on other positions, as well as a lot of additional salary and professional information.

Overall, the survey reports a 3 percent growth in IT pay with 50% of respondents indicating that they are satisfied or very satisfied with their current compensation. That is down from last year when the number was 54%. Clearly, though, IT as a profession seems to be a sound choice. 43% expect their organization’s IT headcount to increase and 49 percent expect it to remain the same, while only 7 percent expect a decrease in their company’s headcount.

But all is not rosy. When looking at the amount of work that needs to be done 56 percent expect IT workload to increase over the next year. But if headcount is not rising commensurate with the amount of additional workload then that means organizations will expect more work from their IT staff than they did last year.

Nevertheless, 85 percent say they are satisfied or very satisfied with their career in IT.

Now let’s get to the interesting part for data professional… and that is the salary outlook for specific data jobs.

If you are the manager of a database or data warehousing group, your total compensation increased greater than the norm last year at 4.1 percent. Average compensation grew from $110,173 to $114,635.

DBAs compensation grew 2.9 percent, which was just about the average. Average compensation for DBAs was $104,860, growing from $101,907 in 2016.

Database developer/modeler, which is an interesting grouping, grew 2.5 percent from $96,771 in 2016 to $99,235 in 2017.

So things are looking OK, but not stellar for data professionals. Which IT positions grew their salary at the highest percentage? Well, the top of the heap, somewhat surprisingly, was Web Developer which grew at 6.7 percent (to an average total compensation of $76,446). The next highest growth makes a lot of sense, Chief Security Officer, which grew 6.4 percent year over year.

The common career worries looked familiar with keeping skills up-to-date being the most worrisome, followed by flat salaries and matching skills to a good position. And the biggest workplace woe? Not surprisingly, increased IT workload. But stress levels are about the same with 61 percent of respondents indicating that their level of job stress was the same as last year.

What can you do to help grow your salary this year? Well, you might consider aligning your career with one of the hot specialties called out in the survey. The top three tech functions with the highest average compensation in 2017 are cloud computing, ERP and security.

Overall, though, it looks like an IT career is a good thing to pursue… and working with data in some capacity still makes a lot of sense!

Posted in data, DBA, salary | Leave a comment

Why isn’t e-mail more easily queryable?

Today’s blog post is just a short rumination on finding stuff in my vast archive of email…

Do you ever wonder why e-mail systems don’t use database management technology?  Don’t you store some of your e-mails for long periods of time?  Do you group them into folders?  But then, isn’t it hard to find anything later?

Anybody who uses email like I do needs to know which folder is which and which e-mail has the information you need in it.  And it isn’t usually obvious from the folder name you gave it (which made sense at the time) or the subject of the e-mail (which might not have anything to do with the actual content of the e-mail you’re looking for).  And sometimes emails get stored in the wrong folder…

I’d sure love to be able to use SQL against my e-mail system, writing something like:


Or something like that.

Wouldn’t you?

Posted in e-mail, SQL | 1 Comment