A Couple of Database Predictions

I usually try to avoid predicting the future because humans, as a general rule, are not very good at it. But when things look repetitive or cyclical, then sometimes a “bold” prediction here and there is worth attempting.

So, with that in mind, I predict widespread consolidation in the NoSQL DBMS market over the course of the next 5 years or so. As of today (February 18, 2015), NoSQL-Database.org lists 150 different NoSQL database systems. Anybody should be able to foresee that such a vast number of NoSQL offerings is not sustainable for very long. Winners will emerge, pushing the laggards out of business (or leaving them to languish).

Why are there so many NoSQL options? Well, IT folks like options. Many did not necessarily want to be tied to the Big Three RDBMS vendors (Oracle, IBM and Microsoft); others were looking for novel ways to solve problems that were not very well served by relational offerings. But nobody wants a sea of incompatible, proprietary DBMSes for very long because it is hard to support, hard to train talent for, and hard to find new employees with experience.

So consolidation will happen.

Additionally, the Big Three RDBMS vendors (and others) will work to incorporate the best features and functionality of the NoSQL database systems into their products. This is already happening (witness the column store capability of IBM DB2 for LUW with BLU Acceleration). We will almost certainly see more NoSQL capabilities being added to the relational world.

So there you have it: two “bold” predictions.

  1. The NoSQL database market will consolidate, with a few winners, some middle-of-the-pack players, and many losers.
  2. The relational database vendors will combat NoSQL by adding NoSQL-style capabilities to their existing portfolios.

What do you think? Do these predictions make sense to you? Add your thoughts and comments below…


Models Should Form the Basis of Database and Application Development

One of the biggest challenges facing today’s organizations is ensuring the right information reaches the right people so data is used appropriately and accurately. In the age of Big Data and analytics, there is even more data being churned out more frequently than ever before. And few organizations treat data as the corporate asset it truly is, so documentation is sparse to non-existent.

Data architects can design the best possible data model, but it does not necessarily follow that developers will use the information as it was intended, or that business analysts will immediately understand the relationships between different data elements.

An integrated solution for modeling can help reduce this problem by streamlining information exchange and contextualizing the data model to minimize errors. By using models as the common language for all data stakeholders – whether application or data warehouse developers, Business Intelligence (BI) professionals, data scientists or business users – organizations can effectively translate otherwise disparate requirements in the context of one another. For example, a data architect passing a logical data model to a development DBA needs to ensure the design is properly implemented in the database so the business’s rules are enforced. With an integrated approach, models – and the underlying metadata – are smoothly and accurately passed across functional boundaries.

The models for the data architect and the DBA remain similar in look and feel, and utilize the same types of notation. Deploying modeling solutions with tight integration across functional areas can reduce the amount of time, effort, and error involved in sharing information. And sharing at the model level simplifies potentially complex processes because models can more readily be consumed and understood by all required parties.

In this context, a model isn’t just a diagram with boxes and arrows. Yes, a data model must include the visual representation of the data (such as an entity/relationship diagram), but it also must be backed up by metadata. A model without metadata might as well be just a quick picture scrawled on a napkin somewhere. It’s the powerful combination of a clear diagram and the metadata definitions for the elements in the diagram that makes a model useful.
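As a purely illustrative sketch (the table, columns, and descriptions here are hypothetical), even a simple physical design can carry its model metadata into the database itself, for example by using DB2’s COMMENT ON statement:

    CREATE TABLE CUSTOMER
      (CUST_ID     INTEGER       NOT NULL,
       CUST_NAME   VARCHAR(100)  NOT NULL,
       CUST_STATUS CHAR(1)       NOT NULL,
       PRIMARY KEY (CUST_ID));

    -- Preserve the business definitions captured in the model
    COMMENT ON TABLE CUSTOMER IS
      'A person or organization that has purchased, or may purchase, our products';
    COMMENT ON COLUMN CUSTOMER.CUST_STATUS IS
      'A = active, I = inactive, P = prospect';

Modeling tools can generate this for you, of course, but the point stands: the definitions belong with the diagram and the database, not on a napkin.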

Facilitating Data Integration

Among other objectives, a good data modeling solution can help you reverse engineer existing schemata into models and forward engineer models into new target schemata. This is basically what you do during an ETL process: you reverse engineer an existing data source (you extract the data along with the accompanying metadata, and it’s the latter that’s described in the schema), transform that data, and then load it into a target whose schema you forward engineer from the model.

For this reason, it makes sense to utilize models as a mechanism to facilitate data movement and data warehouse population. A model-driven approach to data integration can enable data architects and BI professionals to import data models and the associated metadata for use during ETL. This visual approach to data integration helps simplify the process while ensuring the complete and accurate migration and integration of data between systems.
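To make that concrete, here is a minimal sketch of the kind of column-level metadata a modeling or data integration tool reverse engineers from the DB2 for LUW catalog (the SYSCAT view is real; the schema name is hypothetical):

    SELECT TABNAME, COLNAME, COLNO, TYPENAME, LENGTH, SCALE, NULLS, REMARKS
    FROM   SYSCAT.COLUMNS
    WHERE  TABSCHEMA = 'SALES'
    ORDER  BY TABNAME, COLNO;

Any COMMENT ON definitions show up in the REMARKS column, which is one more reason to keep metadata in the model and in the database it describes.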

Simplifying Application Development

Models can also be used to guide application development. The comprehensive data model can be integrated into a process model to streamline the development process. With knowledge of and access to the metadata, developers are better prepared to build the correct processes the first time.

Additionally, application modeling solutions used by development teams enable clear application models to be created that can then be translated into code. Model-based application development enables the deployment of higher quality applications in less time. And with the appropriate tools, you can then generate code directly from models into popular application development technologies.

Data Modeling, NoSQL and Big Data

It is sometimes said that Big Data and NoSQL database implementations benefit from schema flexibility. This means that one record (or row or object or document or…) may contain different data elements than another record within the same database. This would seem to obviate the need for a data model.

But it should not! If you simply decide that developers or users can add any data, at any time, anywhere in the database, then you will end up with a pile of mismatched, unmanaged data – or what I like to call a data outhouse. So sure, the flexibility offered in the NoSQL world can be a useful tool, but without a data model it can turn into a disaster.

NoSQL database systems grew out of the need to resolve certain big data management issues. But running systems without a definition of the data being managed should not be the result. If it is, then things will go from good to bad to worse!

Enhancing Data Integrity Across the Enterprise

Creating data models and collecting metadata that spans the breadth of an organization can deliver significant benefits such as better decision support and business information, or richer customer information and support. However, these benefits are realized only when the information is accessible and accurate. Using integrated modeling solutions to implement model-driven data management and development can help organizations share knowledge effectively and with minimum effort. And it will certainly produce higher quality applications and data, too.

A logical data model should be used as the blueprint for designing and creating physical databases. Modeling can be integrated into your IT infrastructure, thereby helping to exploit the synergy that exists within your databases, IT activities, and application systems.


2014 in review

The WordPress.com stats helper monkeys prepared a 2014 annual report for this blog.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 42,000 times in 2014. If it were a concert at Sydney Opera House, it would take about 16 sold-out performances for that many people to see it.

Click here to see the complete report.


Happy Holidays

Cheers!


Just a short post to end the year wishing all of my readers everywhere a very happy holiday season! Enjoy the holidays and come back next year (2015) as we continue to explore the world of data and database technology…

 


I’m On Toad World… and the datAvail Blog, too!

Just a short post today to let folks know that my DB2portal blog is being syndicated to the Dell Toad World portal. That means that you can follow all of my DB2-specific blog posts either at DB2portal or at Toad World.

If you follow me at DB2portal – and I do want to encourage you to continue to do that – be sure to periodically check in on my posts at Toad World, too. That is because I write a separate, exclusive post monthly for the DB2 channel on Toad World. The latest exclusive Toad World post is titled Controlling the Rolling Four Hour Average and DB2 Costs.

Additional recent posts are available there as well.

And I’d like to alert you to another new outpost for my DB2-related material at the datAvail Blog. You can think of the posts there as being the “Best of DB2 Portal.” I will be reviving some of my past content and updating it so that it is accurate and current. My most recent post there is DB2 Compression: z/OS versus LUW.

Yes, I am keeping quite busy writing about DB2, data, and database issues – but I enjoy writing, so I’ll keep it up. Sorry if it is confusing with all of the outlets for my material.

That’s all for today!


A Data Quality Initiative

As regular readers of this blog know, I sometimes share answers to questions I get via e-mail (and other avenues). By doing so I can hopefully expand the audience participating in the discussion…

Anyway, I received the following question via e-mail:

When trying to get management on board with a data quality initiative, what should I focus on in order to justify the investment? How do I build “the business case?”

This was my response:

The best (short) answer that I can give to this question is to try to quantify the cost of poor quality data to the business. This needs to be written in business language and not technology-speak. For example, what is the cost of a lost sale because product information was incorrect? Do you have a way to identify these cases? Even anecdotal evidence can be powerful if you are talking to the manager of the product line that lost the sale.

A more in-depth example might be an approved (or unapproved) business decision made based upon the wrong data and assumptions. What is the cost in terms of lost business? Or a hiring decision made based on erroneous information. What is the cost in terms of an ineffective work team (plus the cost of removing and replacing the bad hire)?

The cost of poor data quality is pervasive… and it can be difficult to quantify effectively.

I realize that the cost of finding these problems can be enormous, too. It can help to bring in some outside expertise. I would recommend that you purchase and read any of the several excellent books that Thomas C. Redman has written. These books focus on data quality problems and offer some facts and figures on the average cost of poor quality data to businesses. For more in-depth and technical treatments of data quality issues I would direct you to books written by Jack Olson and Larry English.

…I realize that this is a quick and dirty answer to a complex question, but that is usually all I can afford to do with e-mail Q+As. Did I get the basics covered? What would you have answered?

Please share your thoughts and comments…


Database Consistency Models

Just a short blog post today to point folks to a very well-written article on database consistency models titled On the Futility of Custom Consistency Models, posted on the Hacking, Distributed blog.

That article does a very nice job of discussing the current trend toward more and varied consistency models for database systems. The general idea seems to be that you can eliminate some of the overhead of ensuring data consistency for some applications. And, sure, there is some validity there. But how many different consistency models do we really need?

Also, the notion that relational/SQL DBMS products do not allow for flexible consistency is absurd. For example, I’ve written about isolation levels in this blog before, and you can adjust how locking behaves — and therefore the consistency of query results — by adjusting the isolation level. An uncommitted read (or “dirty” read), for instance, can be used to eliminate read locks in DB2, and therefore may return dirty (uncommitted) data. Applications using dirty reads are more efficient, but they might return bad data to the application. For some use cases this might be fine, but I sure wouldn’t want my bank to use dirty reads on my financial transactions!
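Here is a simple sketch of that flexibility using hypothetical tables; the statement-level isolation clause is standard DB2 syntax:

    -- A rough, constantly changing dashboard count: dirty reads are acceptable
    SELECT COUNT(*)
    FROM   ORDERS
    WITH UR;

    -- A balance inquiry: stick with the default isolation (cursor stability or better)
    SELECT ACCT_BALANCE
    FROM   ACCOUNT
    WHERE  ACCT_ID = 12345;

The point is that the relaxation is a deliberate, per-statement choice, not a property you are forced to accept database-wide.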

So the next time you start reading about eventual consistency and how it is revolutionary, step back, reconsider, and think about what you are reading. There is merit to it for some use cases (e.g. my seller account on amazon.com), but not for most (e.g. when the data absolutely must be accurate every time you read it).


Inside the Data Reading Room – Fall 2014 Edition

Welcome to yet another edition of Inside the Data Reading Room, a regular feature of my blog where I take a look at recent database- and data-related books. In today’s post we’ll examine a book on SQL queries and a book on temporal databases.

The first book is the third edition of SQL Queries for Mere Mortals by Michael J. Hernandez and John L. Viescas. If you have read either of the first two editions of this book, you know how useful it can be in helping a new or intermediate SQL coder learn the ins and outs of the language.

The book is designed to be used by people who are not SQL wizards. That is to say, if you can already code circles around your peers and you rarely look things up in the manual, then you are not likely to get much from SQL Queries for Mere Mortals. But if you are one of those “mere mortals” referenced in the title, this book will be quite helpful.

Part 1 of the book consists of the first three chapters, and it sets the stage for the meat of the book by explaining things like relational theory, database basics, design, and a brief history of SQL. This material can safely be skipped if you feel you have the necessary background.

The rest of the book digs into the task of teaching SQL: from the basics in Part 2, through joining, grouping and aggregating data, modifying data, and more. If you have a database environment you can work with, then working through the exercises provided in each chapter will help you to hone your SQL coding skills.

Now it is true that a lot of SQL books have been published… and it can be difficult to choose which ones to buy. I think the authors have done a nice job of serving a well-defined subset of the market – the non-expert SQL coder looking to learn more. If you think that describes you, then you would do well to add this book to your reading list.

The next book we’ll discuss here is Time and Relational Theory: Temporal Databases in the Relational Model and SQL by C.J. Date, Hugh Darwen and Nikos A. Lorentzos. For long-time database professionals, Date and Darwen need no introduction. If you are looking for an expert to explain database theory to you, then Date and Darwen should be at the top of your list.

Temporal database technology is gaining acceptance and being adopted in commercial database management systems, such as IBM’s DB2. As such, learning the theory behind temporal data is a worthwhile endeavor for DBAs and developers alike.

Although listed as a second edition, Time and Relational Theory is based on the authors’ earlier book on the subject, Temporal Data & the Relational Model (Morgan Kaufmann, 2002), with additional new material and research. So it is more a new book than a second edition. The book covers the temporal extensions that were added in SQL:2011, as well as offering insight into temporal functionality still missing from SQL.
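To give a flavor of those SQL:2011 temporal extensions, here is a minimal sketch of a system-time query as implemented in DB2 (the table and column names are hypothetical):

    -- What did this policy look like at mid-year 2014?
    SELECT POLICY_ID, COVERAGE_AMT
    FROM   POLICY FOR SYSTEM_TIME AS OF TIMESTAMP('2014-06-30-00.00.00')
    WHERE  POLICY_ID = 1001;

The DBMS, not the application, is responsible for returning the row versions that were current at that point in time – which is exactly the kind of capability the book grounds in relational theory.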

So, if you are looking for a good introduction to temporal relational database functionality, or an adjunct to the explanation of temporal capabilities in your DBMS manuals, look no further than Time and Relational Theory from Date, Darwen and Lorentzos.


More DBA Rules of Thumb

A couple of years ago I posted a set of DBA Rules of Thumb (ROT) here on the blog. Over time, though, I have thought more about it and collected additional ROTs that I’d like to share with you today.

The first new ROT is to Diversify! A good DBA is a Jack-of-All-Trades. You can’t just master one thing and be successful in this day and age. The DBA maintains production, QA and test environments, monitors application development projects, attends strategy and design meetings, selects and evaluates new products, and connects legacy systems to the Web.

And if all of that is not enough, to add to the chaos, DBAs are expected to know everything about everything. From technical and business jargon to the latest management and technology fads and trends, the DBA is expected to be “in the know.” For example, the DBA must be up on trends like the Big Data and Analytics movement.

And do not expect any private time: A DBA must be prepared for interruptions at any time to answer any type of question… and not just about databases, either.

When application problems occur, the database environment is frequently the first thing blamed. The database is “guilty until proven innocent.” And the DBA is expected to be there to fix things. That means the DBA is often forced to prove that the database is not the source of the problem. The DBA must know enough about all aspects of IT to track down errors and exonerate the DBMS and database structures he has designed. So he must be an expert in database technology, but also have semi-expert knowledge of the IT components with which the DBMS interacts: application programming languages, operating systems, network protocols and products, transaction processors, every type of computer hardware imaginable, and more. The need to understand such diverse elements makes the DBA a very valuable resource. It also makes the job interesting and challenging.

To summarize, the DBA must be a Jack-of-all-Trades and a master of some!

The second new ROT is to Invest in Yourself.  Aren’t you worth investing in?

Most every IT professional continually looks to their company to invest money in ongoing education. Who among us does not want to learn something new – on company time – and with the company’s money? Unless, perhaps, you are self-employed!

Yes, your company should invest some funds to train you on new technology and new capabilities – especially if it is asking you to do new things. And since technology changes so fast, most everyone has to learn something new at some point every year. But the entire burden of learning should not be placed on your company!

Budget some of your own money to invest in your career. After all, you probably won’t be working for the same company your entire career. Why should your company be forced to bankroll your entire ongoing education? Now, I know, a lot depends on your particular circumstances. Sometimes we accept a lower salary than we think we are worth because of the “perks” that are offered. And one of those perks can be training.

But some folks simply abhor spending any of their hard-earned dollars to help advance their careers. Shelling out a couple of bucks to buy some new books, subscribe to a publication, or join a professional organization shouldn’t be out of the reach of most folks in IT, though.

A willingness to spend some money to stay abreast of technology is a trait that should apply to DBAs. Most DBAs I’ve known are insatiably curious and many are willing to invest some of their money to learn something new. Maybe they bought that book on XML before anyone at their company started using it. Perhaps it is just that enviable bookshelf full of useful database books in their cubicle. Or maybe they paid that nominal fee to subscribe to the members-only content of that SQL Server portal. They could even have forked over the $10 fee to attend the local user group.

Don’t get me wrong. I’m not saying that companies should not reimburse for such expenses. They should – it provides for better-rounded, more educated, and more useful employees. But if your employer won’t pay for something that you think will help your career, then why not just buy it yourself?

And be sure to keep a record of such purchases because unreimbursed business expenses can be tax-deductible.

What do you think? Can you come up with any more Rules of Thumb for those of us in the world of Database Administration?


Managing Application Performance… from a database perspective.

Applications that access databases are only as good as the performance they achieve. And every user wants their software to run as fast as possible. As such, performance tuning and management is one of the biggest demands on the DBA’s time. When asked what is the single most important or stressful aspect of their job, DBAs typically respond “assuring optimal performance.” Indeed, analyst surveys regularly indicate that “performance and troubleshooting” tops the list of most challenging DBA tasks.

But when you are dealing with data in a database management system, there are multiple interacting components that must be managed and tuned to achieve optimal performance. That is, every database application, at its core, requires three components to operate:

  • the system (that is, the DBMS itself, the network, and the O/S),
  • the database (that is, the DDL and database schema), and
  • the application (that is, the SQL and program logic).

Each of these components requires care and attention, but today I want to focus on the high-level aspects of performance management from the perspective of the application.

So where do we begin? A logical starting point is with program parameters, such as the package bind parms. There are many parameters and values that must be chosen and specified when you bind an application program. The vast array of options at our disposal can render the whole process extremely confusing – especially if you don’t bind on a daily basis. And even if you do, some of the options still might be confusing if you rarely have to change them. I’m talking about parameters like ACQUIRE, RELEASE, VALIDATE, ISOLATION and DEGREE.

I will not delve into the myriad bind options and give you advice on which to use when. There are many articles and books, as well as the DBMS manuals, that you can use to guide you along that path. Suffice it to say that there are some standard parameters and values that should be chosen “most of the time.” As such, a wise DBA group will set up canned routines for the programmers to use when compiling and binding their applications. Choices such as “transaction,” “batch,” or “analytical query” can be presented to the developer, and then, based on the type of program and environment, the canned script can choose the proper bind options. Doing so can greatly diminish the problems that can be encountered when the “wrong” parameters or values are chosen when preparing programs for production.

Before concluding this short section on parameters, consider one important piece of advice: in production, always Bind your programs specifying EXPLAIN YES. This option externalizes the access path choices made by the optimizer at bind time. Failing to do so means that access paths will still be generated, but you will not know what they are. That is akin to working blind, and it makes application performance tuning much more difficult.
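As a sketch only (DB2 for z/OS BIND PACKAGE syntax; the collection and member names are hypothetical), the “transaction” flavor of such a canned bind might look something like this, with EXPLAIN(YES) always included:

    BIND PACKAGE(TRANCOLL) MEMBER(PROGA) -
         ACTION(REPLACE)   -
         VALIDATE(BIND)    -
         ISOLATION(CS)     -
         RELEASE(COMMIT)   -
         DEGREE(1)         -
         EXPLAIN(YES)

A “batch” or “analytical query” script would vary only a handful of these options, which is exactly why canned routines work so well.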

Access Path Management

There are several common approaches to access path management. The best approach is to Rebind your applications over time as the data changes. This approach involves some form of regular maintenance that keeps database statistics up to date and formulates new access paths as data volumes and patterns change.

Other approaches include Rebinding only when a new version of the DBMS is installed, or, somewhat more ambitiously, whenever new maintenance or fixpacks are applied to the DBMS. Yet another approach is to Rebind automatically after a regular period of time, whether it is days, weeks, months, or whatever period of time you deem significant. This approach can work if the period of time is wisely chosen based on the application data – but it still can pose significant administrative issues.

The final approach – the worst of the bunch – comes from the “if it ain’t broke don’t fix it” school of thought. Basically, it boils down to (almost) never Rebinding your programs. This approach penalizes every program for fear that a single program (or two) might experience a degraded access path. Oh, the possibility of degraded performance is real, and that is why this approach has been adopted by some. And it can be difficult to find which statements may have degraded after a Rebind. The ideal situation would allow us to review the access path changes beforehand to determine whether they are better or worse.

Anyway, let’s go back to the best approach, which is to Rebind regularly as your data changes. This involves what is known as the three Rs: REORG, RUNSTATS, and Rebind. Your goal should be to keep your access paths up to date with the current state of your data. Failing to do this means that the DBMS is accessing data based upon false assumptions.

By Rebinding you will generally improve the overall performance of your applications because the access paths will be better designed based on an accurate view of the data. And as you apply changes to the DBMS, optimizer improvements and new access techniques can be used. If you never Rebind, not only are you forgoing better access paths due to data changes but you are also forgoing better access paths due to changes to the DBMS itself.

To adopt the three Rs you need to determine when to REORG. This means looking at either RUNSTATS or Real-Time Statistics (RTS). So perhaps we really need four Rs (a sketch of the full cycle follows the list):

1) RUNSTATS or RTS

2) REORG

3) RUNSTATS

4) REBIND
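Here is a minimal sketch of that cycle using DB2 for LUW commands (the table and package names are hypothetical, and your statistics options will vary):

    -- 1. Gather current statistics (or consult real-time statistics)
    RUNSTATS ON TABLE SALES.ORDERS WITH DISTRIBUTION AND DETAILED INDEXES ALL
    -- 2. Reorganize if the statistics show disorganization
    REORG TABLE SALES.ORDERS
    -- 3. Refresh the statistics to reflect the reorganized data
    RUNSTATS ON TABLE SALES.ORDERS WITH DISTRIBUTION AND DETAILED INDEXES ALL
    -- 4. Rebind so the optimizer can formulate access paths from accurate statistics
    REBIND PACKAGE SALES.ORDERPKG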

But is this enough? Probably not, because we need to review the access paths after rebinding to make sure that there are no rogue access paths. So let’s add another “R” – to Review the access paths generated by the REBIND. As we mentioned, the optimizer can make mistakes. And, of course, so can you. Users don’t call you when performance is better (or the same), but if performance gets worse, you can bet on getting a call from irate users.

So we need to put best practices in place whereby we test Rebind results to compare the before and after impact of the optimizer’s choices. Only then can we be sure that we are achieving optimal application performance.

Tuning the Code

Of course, everything we’ve discussed so far assumes that the code is written efficiently to begin with – and that is a big assumption. We also need to make sure that we are implementing efficient application code. The application code consists of two parts: the SQL code and the host language code in which the SQL is embedded.

SQL is simple to learn and easy to start using. But SQL tuning and optimization is an art that takes years to master. Some general rules of thumb for creating efficient SQL statements include the following (a brief example follows the list):

  • Let SQL do the work instead of the application program. For example, code an SQL join instead of two cursors and a programmatic join.
  • Simpler is generally better, but complex SQL can be very efficient.
  • Retrieve only the columns required, never more.
  • Retrieve the absolute minimum number of rows by coding every appropriate predicate in the WHERE clause.
  • When joining tables, always provide join predicates. In other words, avoid Cartesian products.
  • Favor using Stage 1 and Indexable predicates.
  • Avoid sorting if possible by creating indexes for ORDER BY, GROUP BY, and DISTINCT operations.
  • Avoid black boxes – that is, avoid I/O routines that are called by programs instead of using embedded SQL.
  • Avoid deadlocks by updating tables in the same sequence in every program.
  • Issue data modification statements (INSERT, UPDATE, DELETE) as close to the COMMIT statement as possible.
  • Be sure to build a COMMIT strategy into every batch program that changes data. Failing to COMMIT can cause locking problems.
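For instance, here is a small sketch (with hypothetical tables) that follows several of these rules: an SQL join with explicit join predicates, only the columns that are needed, all filtering done in WHERE clause predicates, and an ORDER BY that an index on (CUST_ID, ORDER_DATE) could potentially satisfy without a sort:

    SELECT O.ORDER_ID, O.ORDER_DATE, C.CUST_NAME
    FROM   ORDERS O
           INNER JOIN CUSTOMER C
                   ON C.CUST_ID = O.CUST_ID
    WHERE  O.ORDER_STATUS = 'SHIPPED'
      AND  O.ORDER_DATE  >= DATE('2014-01-01')
    ORDER  BY O.CUST_ID, O.ORDER_DATE;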

To tune SQL, the DBA must be able to interpret the access path output produced by EXPLAIN. This information is encoded in the explain tables, the primary one usually being called PLAN_TABLE. There are vendor tools that simplify this process by rendering access paths visually and offering recommendations on tuning approaches – and these tools are well worth considering.
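As a starting point, here is a minimal sketch of the kind of query a DBA might run against PLAN_TABLE after a bind (DB2 for z/OS column names; the program name is hypothetical):

    SELECT QUERYNO, QBLOCKNO, PLANNO, METHOD,
           TNAME, ACCESSTYPE, ACCESSNAME,
           MATCHCOLS, INDEXONLY, PREFETCH
    FROM   PLAN_TABLE
    WHERE  PROGNAME = 'PROGA'
    ORDER  BY QUERYNO, QBLOCKNO, PLANNO;

Scanning ACCESSTYPE and MATCHCOLS for table space scans or non-matching index access is a quick first pass; the visual tools mentioned above simply automate and extend that analysis.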

Finally, some attention must be paid to the host language code. Host language code refers to the application programs written in C, COBOL, Java, or the programming language du jour. SQL statements are usually embedded into host language code and it is quite possible to have finely tuned SQL inside of inefficient host language code. And, of course, that would cause a performance problem.

The Bottom Line

Although DBAs must understand all three aspects of database performance management, concentrating on the application aspects of performance will most likely provide the most bang for the buck. Of course, we have only touched the tip of the application performance iceberg today. But even this high-level view into application performance can serve as a nice starting place for tuning your database applications.

Good luck and happy performance tuning!
