My New Series of Articles on Data Warehousing at TechTarget

Just a short post today to alert you to a new series of articles that I am writing on data warehouse platforms for the SearchDataManagement portal at TechTarget. I will be writing four articles discussing data warehousing architecture, platforms, products, and trends. I will also be authoring ten accompanying product overviews for leading data warehouse DBMSes, DWaaS providers, and data warehouse appliances.

Check out the first article in the series here:

The benefits of deploying a data warehouse platform

Posted in Big Data, data warehouse | Leave a comment

Teradata: Thinking Big with New Wave of Data Offerings

This week I am at the Teradata Partners 2015 conference in Anaheim, California, and not only is there a lot of useful and interesting information being presented here, but Teradata has announced a slew of new products for managing and analyzing data.

The first new offering is Teradata Listener, which is designed to help organizations respond to the challenges presented by the Internet of Things (IoT) and streaming big data sources. Teradata Listener is intelligent, self-service software with real-time “listening” capabilities to follow multiple streams of sensor and IoT data wherever it exists globally, and then propagate the data into multiple platforms in an analytical ecosystem.

The world is full of connected devices generating a massive and constant stream of data. But less than 1 percent of this data is analyzed in any way. And all projections are that this trend will continue with more connected devices generating more data.

With Teradata Listener, “customers can now take full advantage of IoT data generated from nearly an unlimited number and type of devices. In addition, Teradata enables customers to combine IoT data with business operations and human behavioral data to maximize analytic value,” said Hermann Wimmer, co-president, Teradata.

Teradata Listener is now available in beta, and will be generally available globally in the first quarter of 2016.

But this was not the only announcement made by Teradata. The company also announced Teradata Aster Analytics on Hadoop, an integrated analytics solution featuring a set of more than 100 business-ready, distinctly different analytics techniques and seven vertical industry applications to run directly on Hadoop. This allows organizations to seamlessly address business problems with an integrated analytics solution.

The flexibility and simplicity of these capabilities enables everyday business analysts to perform as data scientists by tackling the organization’s most challenging problems. Teradata Aster Analytics on Hadoop allows users to combine machine learning, text, path, pattern, graph, and statistics within a single workflow. Teradata offers flexible Aster Analytics deployment options, including the Teradata Big Analytics Appliance, Hadoop, a software-only version, and the Teradata Cloud.

Teradata Aster Analytics on Hadoop will be shipped globally in the second quarter of 2016.

Also announced was the release (on Monday, October 19, 2015) of the Teradata Integrated Big Data Platform 1800, which is designed to support IoT capabilities. It enables customers to perform complex analytics at scale and is available at a cost-effective price of approximately $1,000 per terabyte of compressed data. The Teradata Database, running on the Teradata Integrated Big Data Platform, provides access to data in many formats, including XML, name-value pair, BSON (Binary JSON) and JSON from web applications, sensors, and Internet of Things-connected machines.

Next up from Teradata is the Teradata Active Enterprise Data Warehouse 6800, a data warehouse platform built to run the Teradata Database to its fullest capabilities including Teradata Virtual Storage and Teradata Active System Management. The massively parallel processing (MPP) architecture of the platform matches the parallel, shared nothing architecture of the Teradata Database.

The overall computational power of the Teradata Active Enterprise Data Warehouse is boosted by 25 percent with the latest technology from the Intel® Xeon® Processor E5-2600v3 Family. It offers up to 15 percent faster in-memory processing with the latest DDR4 SDRAM cards with up to 1.5 terabytes of memory per cabinet. To support tactical queries that require sub-second access to frequently used data, the solid state drive (SSD) size is now four times larger and offers up to 128 terabytes of hot data space per cabinet.

The Teradata Active Enterprise Data Warehouse 6800 and the Teradata Integrated Big Data Platform 1800 are now available worldwide.

The company not only provides the database technology and analytics solutions, but it also provides applications for marketing that bring together the data management, analytics and big data capabilities of Teradata’s solutions. Teradata announced global availability of the newest version of Teradata Integrated Marketing Cloud, a powerful data hub comprising integrated solutions that are already helping more than one-third of the S&P Global 100 drive revenue and improve customer engagement through data-driven integrated marketing.

Teradata’s latest release helps marketers to individualize their marketing campaigns and connect one-to-one with customers by unifying customer-interaction data across paid, earned and owned channels, at scale. Now, from campaign inception through every customer interaction and response, Teradata provides marketers the most agile and comprehensive integrated data-driven marketing platform available on the market.

I warned you up front that there were a lot of announcements! The company also released new versions of its Teradata Unified Data Architecture™ (UDA), and the Teradata QueryGrid™: Teradata Database-to-Presto, which enables cross-platform SQL queries initiated from Apache® Hadoop™ or the Teradata Database. In addition, Teradata streamlines data exploration and analytics by integrating a Teradata Database, Teradata Aster Analytics, and Hadoop in a single Teradata® UDA Appliance.

The Teradata UDA Appliance is an enterprise-class appliance to enable a flexible combination of the Teradata Database, Teradata Aster Analytics, and Hadoop to meet customer workload requirements. All software is installed into one cabinet, providing the advantages of an analytic ecosystem in a smaller data center footprint. It is a fully configurable analytic ecosystem, which can be deployed as a Teradata Unified Data Architecture. The appliance can easily adapt to the customer’s requirements to promote data-driven decisions on large, fast-moving structured, semi-structured, and unstructured data. It supports an open model for data acquisition and discovery, and integrates data from all parts of the organization, while matching a variety of workload requirements to the right data sets.

“Since Teradata introduced the UDA over three years ago, we have seen rapid adoption of this approach,” said Oliver Ratzesberger, president, Teradata Labs. “Our new products will further our customers’ ability to develop the most flexible, integrated, and powerful analytics platform in the world.”

Both the Teradata QueryGrid: Teradata Database-to-Presto software and the Teradata UDA Appliance will be available globally in the first quarter of 2016.

Teradata also announced a new managed services business for Hadoop Data Lakes, and promoted announcements from partners, including the addition of real-time stream processing for the Teradata UDA from SQLstream, a partnership with Stonebranch to automate the data supply chain beyond the warehouse, and a data analytics handbook developed by Celebrus and Teradata.

So, all in all, there has been a lot of activity from Teradata that is well worth looking into if you have big data, analytics, data warehousing or marketing needs.

Posted in DBA | 1 Comment

How Might Big Data Impact the Role of the DBA?

I was recently asked the question that is the title of this blog post and I thought, hmmm, now there’s an interesting topic for my blog. So after answering I wrote down some of my thoughts and I have assembled them here to share with you today. If you have any further thoughts on this topic, please share them in the comments area below!

So, how might life change for DBAs as organizations embrace Big Data? That’s a loaded question. Life is always changing for DBAs! The DBA is at the center of new application development and therefore is always learning new technologies – not always database-related technologies. Big Data will have a similar impact. There is a lot of new technology to learn. Of course, not every DBA will have to learn each and every type of technology.

DBAs should be learning NoSQL DBMS technologies, but not with an eye toward replacing relational. Instead, at least for the time being, NoSQL technologies (key/value, column, document store, and graph) are common in big data and advanced analytics projects. My view is that these products will remain niche solutions, but the technology will be widely adopted. How will that happen? Well, relational DBMSs will add functionality to combat the NoSQL offerings, just like they did to combat the object-oriented DBMS offerings in the 1990s. So instead of just offering a relational engine, a DBMS (such as Oracle or DB2) will offer additional engines, such as key/value or document stores.

That means that DBAs who spend the time to learn what the NoSQL database technologies do today will be well-prepared for the multi-engine DBMS of the future. Not only will the NoSQL-knowledgeable DBA be able to help implement projects where organizations are using NoSQL databases today, but they will also be ahead of their peers when NoSQL functionality is added to their RDBMS product(s).

DBAs should also learn about Hadoop, MapReduce and Spark. Now Hadoop is not a DBMS, but it is likely to be a long-term mainstay for data management, particularly for managing big data. An education in Hadoop and MapReduce will bolster a DBA’s career and make them more employable long-term. And Spark looks like it is here for the long run, too. So learning how Spark can speed up big data requests with in-memory capabilities is also a good career bet.

It would also be a good idea for DBAs to read up on analytics and data science. Although most DBAs will not become data scientists, some of their significant users will be. And learning what your users do – and want to do with the data – will make for a better DBA.

And, of course, a DBA should be able to reasonably discuss what is meant by the term “Big Data.” Big Data is undoubtedly here to stay. Of course, the industry analyst firms have come up with their definitions of what it means to be processing “Big Data”, the most famous of which talks about “V”s. As interesting as these definitions may be, and as much discussion as they create, the definitions don’t really help to define what benefit organizations can glean from Big Data.

So, with that in mind, and if we wanted to be more precise, it would probably make sense to talk about advanced analytics instead of Big Data. Really, the analytics is the motivating factor for Big Data. We don’t just store or access a bunch of data because we can… we do it to learn something that will give us a business advantage… that is what analytics is: discovering nuggets of reality in mounds and mounds of data. But I am not in favor of abandoning the Big Data term. Why?

Well, more than half the battle is getting the attention of decision makers, and the term Big Data has that attention in most organizations. As a data proponent, I think that the data-focused professionals within companies today should be trying to tie all of the data management and exploitation technologies to the Big Data meme in order to get the attention of management and be able to procure funding. C’mon, as a DBA doesn’t it make sense to take advantage of an industry meme with the word “data” right in it? By doing so we can better manage the data (small, medium and big) that we are called upon to manage!

Finally, I would urge DBAs to automate as many data management tasks as possible. The more automated existing management tasks become, the more available DBAs become to learn about, and work on the newer, more sexy projects.

Posted in DBA | Tagged , , | 1 Comment

Automating the DBA Out of Existence?


Automation and autonomics are at the forefront of database and system administration and management these days. And that is a good thing because automation and autonomics can minimize the amount of time, error and human effort involved in assuring and maintaining efficient database systems and applications.

And, yes, there is also a lot of vendor hype about self-managing database systems, from the DBMS vendors themselves, and from ISVs selling performance and maintenance solutions and tools. So I suppose it stands to reason that folks will start to ask questions like this: If the DBMS and databases are going to manage themselves, will anyone need a DBA?

But don’t get too excited about the extinction of the DBA! There are many reasons why DBAs are not going anywhere anytime soon. Self-managing database systems are indeed a laudable goal, but we are very far away from a “lights-out” DBMS environment. Many of the self-managing features require using the built-in tools from the DBMS vendor, such as Oracle Enterprise Manager or IBM Data Studio. But many organizations prefer to use heterogeneous solutions that can administer databases from multiple vendors all from a single console. And many of these tools have had self-managing features for years. And we still need DBAs…

Most performance management solutions allow you to set performance thresholds. But these thresholds are only as good as the variables you set and the actions you define to be taken when the threshold is tripped. Some software is intelligent; that is, it “knows” what to do and when to do it. Furthermore, it may be able to learn from past actions and results. The more intelligence that can be built into a self-managing system, the better the results typically will be. This brings us to autonomics… but what is autonomic computing?

Autonomics is more than mere automation… Autonomic computing refers to the self-managing characteristics of distributed computing resources, adapting to unpredictable changes while hiding intrinsic complexity from operators and users. And yes, the more our systems can manage themselves, the better things should become.

But who among us currently trusts software to work like a grizzled veteran DBA? The management software should be configurable so that it alerts the DBA as to what action it wants to take. The DBA can review the action and give a “thumbs up” or “thumbs down” before the corrective measure is applied. In this way, the software can earn the DBA’s respect and trust. When DBAs trust the software, they can turn it on so that it self-manages “on the fly” without DBA intervention. But today, in most cases, a DBA is required to set up the thresholds, as well as to ensure their on-going viability.
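
To make that idea concrete, here is a minimal sketch (in Python) of a threshold check with a DBA approval step; the metric names, thresholds, and corrective actions are hypothetical, not taken from any particular monitoring product.

```python
# A minimal sketch of threshold-driven automation with a DBA approval step.
# The metric names, thresholds, and corrective actions are hypothetical.

THRESHOLDS = {
    "buffer_pool_hit_ratio": {"min": 0.90, "action": "enlarge the buffer pool"},
    "lock_wait_seconds":     {"max": 5.0,  "action": "notify the blocking application's owner"},
}

def tripped_thresholds(metrics):
    """Return (metric, value, proposed_action) for every threshold that is tripped."""
    results = []
    for name, rule in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue
        too_low = "min" in rule and value < rule["min"]
        too_high = "max" in rule and value > rule["max"]
        if too_low or too_high:
            results.append((name, value, rule["action"]))
    return results

def apply_action(action):
    # Placeholder: real software would run a script, issue commands, etc.
    print(f"Applying corrective action: {action}")

def review_and_apply(metrics, trusted=False):
    for name, value, action in tripped_thresholds(metrics):
        print(f"ALERT: {name} = {value}; proposed action: {action}")
        if trusted:
            apply_action(action)                       # self-managing "on the fly"
        elif input("Approve this action? (y/n) ").lower() == "y":
            apply_action(action)                       # the DBA's "thumbs up"

# Example: review_and_apply({"buffer_pool_hit_ratio": 0.82, "lock_wait_seconds": 1.3})
```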

Furthermore, database backup and recovery will need to be guided by the trained eye of a DBA. Perhaps the DBMS can become savvy enough to schedule a backup when a system process occurs that requires it. Maybe the DBMS of the future will automatically schedule a backup when enough data changes. But sometimes backups are made for other reasons: to propagate changes from one system to another, to build test beds, as part of program testing, and so on. A skilled professional is needed to build the proper backup scripts, run them appropriately, and test the backup files for accuracy.

And what about recovery? How can a damaged database know it needs to be recovered? Because the database is damaged, any self-managed recovery it might attempt is automatically rendered suspect. I mean, if the database was all that smart to begin with, why is it damaged and in need of recovery, right? Here again, we need the wisdom and knowledge of the DBA.

And there are many other DBA duties that cannot be completely automated. Of course, the pure, heads-down systems DBA may eventually become a thing of the past. Instead, the modern DBA will need to understand multiple DBMS products, not just one. And that includes non-relational big data solutions like Hadoop and NoSQL database systems.

DBAs furthermore must have knowledge of the business impact of each database under their care (for more details see Business Eye for the DBA Guy). And DBAs will need better knowledge of logical database design and data modeling — because it will advance their understanding of the meaning of the data in their databases.

So, no, we won’t automate the DBA out of existence. It can’t be done… but we can make the job of DBA more interesting and useful, as we remove much of the mundane and repetitive components through automation and autonomics.

Posted in DBA | Tagged , | 1 Comment

The Lifecycle of Data

Most data is not static. No, data has a life in which it changes, is used for perhaps multiple purposes, and gets moved all over the place. So it makes sense to think about the lifecycle of your data at your organization.

The accompanying diagram helps to demonstrate this lifecycle. Basically, there are three major stages of “life” for any piece of data.


Data is created at some point, usually by means of a transaction: a product is released, an order is processed, a deposit is made, etc. For a period of time after creation, the data enters its first state: it is operational. That is, the data is needed to complete on-going business transactions. This is where it serves its primary business purpose. Transactions are enacted upon data in this state. Most changes occur to data during its operational state.

The operational state is followed by the reference state. This is the time during which the data is still needed for reporting and query purposes, but it is not necessarily driving business transactions. The data may be needed to produce internal reports, external statements, or simply exist in case a customer asks for it.

Then, after some additional period of time, the data moves into an area where it is no longer needed for completing business transactions and the chance of it being needed for querying and reporting is small to none. However, the data still needs to be saved for regulatory compliance and other legal purposes, particularly if it pertains to a financial transaction. This is the archive state.

Finally, after a designated period of time in the archive, the data is no longer needed at all and it can be discarded. This actually should be emphasized much more strongly: the data must be discarded. In most cases the only reason older data is being kept at all is to comply with regulations, many of which help to enable lawsuits. When there is no legal requirement to maintain such data, it is only right and proper for organizations to demand that it be destroyed – why enable anyone to sue you if it is not a legal requirement to do so?

Perhaps a short example would help here. You are out shopping for clothing. You pick out a nice outfit and decide to charge the purchase to your credit card. As part of this transaction, the business captures your credit card data and the items you have purchased. In other words, the data is created and is stored in an operational state.

It remains operational until your monthly billing cycle is complete and you receive your statement in the mail. At some point after this happens the data moves from an operational state to a reference state. The data is not needed to conduct any further business, but it may be needed for reporting purposes. Furthermore, the card processing company determines that there is a period of time – maybe 90 days – during which customers frequently call to get information on recent transactions. But after that time customer requests are rare.

At this point the data can pass into an archive state. It must be kept around until such time as all regulatory requirements have passed. After all need for the data, both for internal business purposes and external legal purposes, has expired it is purged from the system.
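
To make the states a bit more concrete, the simple sketch below classifies a transaction record by its age; the 30-day operational window, 90-day reference window, and seven-year retention period are purely illustrative assumptions, not regulatory guidance.

```python
# A simplified sketch of classifying data into lifecycle states by age.
# The cutoffs used here are illustrative assumptions only.
from datetime import date, timedelta

OPERATIONAL_WINDOW = timedelta(days=30)      # until the billing cycle closes
REFERENCE_WINDOW   = timedelta(days=90)      # customers rarely ask after this
RETENTION_PERIOD   = timedelta(days=7 * 365) # assumed regulatory retention

def lifecycle_state(transaction_date, today=None):
    today = today or date.today()
    age = today - transaction_date
    if age <= OPERATIONAL_WINDOW:
        return "operational"   # still driving business transactions
    if age <= OPERATIONAL_WINDOW + REFERENCE_WINDOW:
        return "reference"     # kept for reporting and customer inquiries
    if age <= RETENTION_PERIOD:
        return "archive"       # kept only for legal/regulatory reasons
    return "purge"             # no remaining business or legal need

# Example: lifecycle_state(date(2015, 1, 15), today=date(2015, 11, 1)) -> "archive"
```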

Don’t think in terms of databases or technologies that you already know when considering these data states. The data could be in three separate databases, a single database, or any combination thereof. Furthermore, don’t think about data warehousing in this context – here we are talking about the single, official store of data – and its production lifecycle.

The operational and reference states have been reasonably well implemented in organizations today, but not so for archived data. Think about how you archive data, if you archive anything today at all. Is it easily accessible? Or would it take weeks or months of work to get the archived data into any reasonable format for querying? Or perhaps, more commonly, data is never archived. Instead it languishes in the production database with operational and reference data, but is never accessed. All it does is take up space and impact the performance of queries against the rest of your data!

As you design your databases, be sure to consider the data lifecycle and plan for each stage accordingly. With increasing regulatory pressures the need to better plan for and implement database archiving will only become more pervasive over time.

Posted in archive, compliance, data, database design, information, operational, reference data, tools | 1 Comment

Database Capacity Planning

Most database implementations are anything but static. Once deployed, databases are queried, updated, loaded, unloaded, reorganized, and data is deleted and inserted from them on an ongoing basis. And data may be replicated to and from the database on a regular basis, too.

As the data composition of a database changes, its storage requirements will change as well. That means that the DBA must be vigilant in terms of planning for future growth. This involves storage and capacity monitoring. The DBA must keep an eye on the amount of data and the number of users accessing the data. When either expands, database storage may have to be modified.

As such, DBAs must be skilled in the discipline of capacity planning. At its highest level, capacity planning is a process whereby the storage required for an entire system is measured and compared against requirements. The goal of this comparison is to adjust the resources available to the system as appropriate. To successfully embark on a capacity planning exercise requires an understanding of new corporate initiatives and how they are likely to impact existing infrastructure – that is, your hardware (CPU, storage, network, and memory) and your software (application and system).

By measuring current capacity, gauging the growth of capacity over time, and factoring in the anticipated capacity requirements of new corporate and IT initiatives, you can determine whether your existing infrastructure can sustain the anticipated workload. If the projected growth outpaces the ability of your computing environment to support it, you will need to evaluate the cost dynamics of modifying and possibly scaling up your computing infrastructure.
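
As a simple illustration of that comparison, the sketch below projects how many months of headroom remain under an assumed linear growth rate; the numbers, and the linearity assumption itself, are illustrative only.

```python
# A back-of-the-envelope capacity projection, assuming roughly linear growth.
# Real planning should also factor in new initiatives, seasonality, and compression.

def months_until_full(current_gb, capacity_gb, growth_gb_per_month,
                      planned_initiative_gb=0.0):
    """Estimate how many months remain before storage is exhausted."""
    headroom = capacity_gb - current_gb - planned_initiative_gb
    if growth_gb_per_month <= 0:
        return float("inf")
    return max(headroom, 0.0) / growth_gb_per_month

# Example: 6 TB used of 10 TB, growing 250 GB/month, with a new project
# expected to add 1 TB up front:
#   months_until_full(6000, 10000, 250, planned_initiative_gb=1000) -> 12.0
```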

From a storage perspective this may involve simply adding more disk devices and assigning them to the DBMS. But it may involve additional tasks to support additional data and users including:

  • Redesigning applications
  • Redesigning databases
  • Modifying DBMS parameters
  • Reconfiguring hardware components
  • Adjusting software interfaces

There are multiple perspectives from which to view storage consumption. Each is valid and provides a different view of storage utilization and growth. A system-wide perspective views the rate at which disk space is being consumed on an aggregate level. At a lower level, viewing capacity by server determines which computers are consuming disk space at the fastest rate. You also can monitor storage consumption at the file system level to determine which file systems are consuming disk space at the fastest rate.

As a DBA, though, you will want to view just those files that are associated with the DBMS. The bottom line is that you will need to be able to answer all of the following questions about storage when performing storage capacity planning:

  • When will more storage be required?
  • How much additional storage is needed?
  • Where is the additional storage needed?
  • What needs to be done to align the additional storage with the DBMS?

Fortunately, there are tools that can be deployed to help DBAs answer these questions correctly. Multiple vendors offer storage management and reorganization products that chart the growth and organization of database files over time. By vigilantly monitoring database storage structures a wise DBA can glean patterns from their databases and determine when more storage will be needed.

There are also capacity planning tools geared especially for database management. These tools go a step further by combining performance profiling and performance planning to allow DBAs to better understand current requirements and predict future expectations. Basically, a proactive capacity planning tool gathers instrumentation details and analyzes the data for trends. Upon recognizing a trend the software makes recommendations to prepare for future capacity needs – or perhaps autonomically deploys changes on its own.

Simulations are also useful for determining future requirements. There are solutions that allow DBAs to play “what if” games with their environment to see what changes in usage, demand, and volume will do to their databases. Based on the results, the DBA can better prepare for multiple future scenarios.

Of course, technology alone is not sufficient. DBAs must be business savvy as well as technology savvy. This is especially true for capacity planning because only by staying on top of business changes will you have any chance of being able to successfully plan for the future of your systems. I mean, let’s face it, if you don’t know about that new product that will increase demand and thereby escalate data growth, the future storage needs of your databases will be beyond your ability to understand. And your databases will not be ready to accommodate the new business.

The bottom line is that capacity planning is an important part of the DBA’s job. And it requires both technical knowledge and business acumen in order to succeed.

Posted in capacity planning, database design, DBA | 1 Comment

An Overview of Today’s Leading DBMS Platforms

Recently I wrote a series of DBMS product descriptions/overviews for the TechTarget web portal. The product descriptions were an adjunct to a seven-part series of articles I wrote reviewing DBMS technology and considerations circa 2015.

Anyway, the purpose of today’s post is to share the DBMS product overviews with readers of this blog. What follows is a series of 14 links, each one taking you to the TechTarget web site for a different DBMS product overview… enjoy.

If your favorite DBMS is missing from this list, don’t fret. It does not mean that I (or TechTarget) deemed it to be inferior… but there was a limit on the number of product overviews that could be produced and published.

Posted in cloud, DB2, DBMS, NoSQL, open source, relational | 1 Comment

Optimizing Database Performance, part 6: Automation

Parts 1, 2, 3, 4, and 5 of this series are also available.

Today’s post, the final one in this series, will briefly discuss automation… and autonomics.

Setting the Stage

Implementing, managing and maintaining complex database applications spread throughout the world is a difficult task. Supporting modern applications requires a vast IT infrastructure that encompasses your databases, desktops, networks, and servers, as well as any networks and servers outside of your environment that you rely upon in today’s connected world. Together, these disparate elements make up your IT infrastructure, and they must function together efficiently for your applications to deliver service to their users.

But these things were not originally designed to work together. So the environment is not only increasingly complex, it is also interrelated, even though it was never really designed to be. When you change one thing, it usually impacts others. What is the impact of this situation on DBAs?

Well, for starters, DBAs are working overtime just to support the current applications and relational features. But new RDBMS releases are being made available faster than ever before. Many organizations are not running with the most current version of the DBMS.

But technology is not the only challenge facing modern DBAs. Knowledge of the business is also mandatory. Companies that implement ERP applications require their DBAs to understand the business as they implement and administer the databases for these systems. Furthermore, failing to understand how technology and business interact can result in lost business. A DBA must know how each database and transaction impacts the business in order to prioritize his workload. Otherwise, confronted with two database outages, how would the DBA know which one to work on first? Obviously, the one that is most critical to the business should be the priority, but this requires knowledge of the business, doesn’t it?

So, the job of database administration is getting increasingly more difficult as database technology rapidly advances adding new functionality, more options, and more complex and complicated capabilities. But DBAs are overworked, under-appreciated, and lack the time to gain the essential skills required to support and administer the latest features of the RDBMS they support. What can be done?

Intelligent Automation

One of the ways to reduce these problems is through intelligent automation. As IT professionals we have helped to deliver systems that automate multiple jobs throughout our organizations. That is what computer applications do: they automate someone’s job to make that job easier. But we have yet to fully intelligently automate our DBA jobs. By automating some of the tedious day-to-day tasks of database administration, we can help to simplify this complex environment and reduce manual errors.

Although simple automation is helpful, it is not likely to be sufficient. Modern administration software is adopting autonomics that can intelligently monitor, analyze, and optimize applications using past, present, and future analysis of collected data. Simply stated, the software mimics the way a consultant works — fulfilling the role of a trusted advisor. But we’ll talk more about autonomics in the next section.

First, let’s extol the virtues of simple automation… starting with utility automation.

Whenever possible, the DBA should look into using database utilities and third-party administration tools to automate database maintenance. Automation tools can be used to query the database statistics and trigger reorganization only for those database objects that have surpassed the high-water mark for a combination of statistics. For example, the DBA may want to automatically reorganize all table spaces where the cluster ratio is below 85 percent and all indexes where the leaf distance has a value greater than 100.
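
A minimal sketch of that kind of statistics-driven trigger might look like the following; the statistics structures and object names are hypothetical stand-ins for whatever your DBMS catalog or monitoring tool actually provides.

```python
# A minimal sketch of statistics-driven REORG scheduling using the thresholds
# mentioned above (cluster ratio below 85 percent, leaf distance above 100).
# The input dictionaries stand in for real catalog statistics; names are hypothetical.

CLUSTER_RATIO_THRESHOLD = 85.0   # percent
LEAF_DISTANCE_THRESHOLD = 100

def reorg_candidates(tablespace_stats, index_stats):
    """Return REORG commands (as strings) for objects past their thresholds."""
    jobs = []
    for ts in tablespace_stats:
        if ts["cluster_ratio"] < CLUSTER_RATIO_THRESHOLD:
            jobs.append(f"REORG TABLESPACE {ts['name']}")
    for ix in index_stats:
        if ix["leaf_distance"] > LEAF_DISTANCE_THRESHOLD:
            jobs.append(f"REORG INDEX {ix['name']}")
    return jobs

# Example:
#   reorg_candidates(
#       [{"name": "DB1.TS_ORDERS", "cluster_ratio": 72.4}],
#       [{"name": "DB1.IX_ORDERS_CUST", "leaf_distance": 180}],
#   )
#   -> ["REORG TABLESPACE DB1.TS_ORDERS", "REORG INDEX DB1.IX_ORDERS_CUST"]
```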

Reorganizations can be costly in terms of downtime and computing resources. And, at times, determining when reorganization will actually create performance gains can be difficult. By automating reorganizations based on actual database statistics, though, your success rate should be higher than simply waiting for a problem to occur before running a reorg.

But utilities and reorganization jobs are not the only things that can be automated! Automation tools can be used to intercept error messages and performance metrics that can improve the visibility of problems — and hopefully the reaction time to their resolution. The more polished the automation, the better the chance that a script or action can be run to resolve the problem based on the identifying information. By automating the response to such conditions, the amount of purely reactive firefighting diminishes.

But What About Autonomics?

So what are autonomics? Autonomic computing refers to the self-managing characteristics of distributed computing resources, adapting to unpredictable changes while hiding intrinsic complexity from operators and users. Autonomics implies much more than simple automation.

The goal of autonomics is to enable computer systems and software capable of managing themselves. It requires an awareness of the environment, knowledge of changing usage patterns and resources, and the ability to adapt to shifting requirements. The primary benefit is to reduce manual tasks and to rein in the growing level of complexity.

And let’s face it, complexity is a significant driving issue in today’s modern organizations. The number of computing devices continues to grow rapidly and the average complexity of each device is increasing.

And as we mentioned earlier, the number of systems that interoperate with each other continues to grow. Connections are made between systems that were never originally intended to share data or processing. Legacy systems are rapidly being transformed to work with modern, distributed applications, adding layers of complexity behind the scenes even as the user interface becomes easier. When you look at all of these trends, you have to admit that it is next to impossible to master all of the interconnected technologies at your shop today, let alone into the future as the trends continue and pick up speed.

So what makes up an autonomic system? At a high-level, there are four aspects of autonomic computing:

  1. It is automatic, meaning it can make decisions on its own, using accumulated performance and usage metrics and high-level policies developed by administrators.
  2. It is adaptive, meaning it can automatically adjust to changing conditions.
  3. It is aware, meaning the system can monitor (or sense) its operational context as well as its current state to determine if it is reaching its specific purpose.
  4. It is self-managing, meaning it can optimize and administer itself without human interaction being required.

Of course, there are many aspects to “self” managing, and autonomic systems can support varying degrees of self-management capabilities, including:

  • Self-configuration: Automatic configuration of the system and its components;
  • Self-healing: Automatic discovery and correction of faults;
  • Self-optimization: Automatic monitoring and control of resources to ensure optimal functioning with respect to the defined requirements;
  • Self-protection: Proactive identification of, and protection from, arbitrary attacks;
  • Self-inspection: Understanding itself and its interactions with other systems in order to make intelligent decisions;
  • Self-organization: Proactive modification of data structures and organization to optimize access.

From the DBA’s perspective, autonomic administration solutions should be able to collect data about the IT environment from the systems (OS, DBMS, OLTP, etc.), objects, and applications. They should require very little initial configuration, so that they are easy to use for novices and skilled users alike. They should detect conditions requiring maintenance actions, advise the user of the problem, and then, most beneficial of all, optionally perform the necessary action to correct the problems identified. Many management tools (without autonomics capabilities) leave this analysis and execution up to the user. But intelligent automation solutions should be smart enough to optimize and streamline your IT environment with minimal, perhaps no, user or DBA interaction.

The end result, software that functions like a consultant, enables the precious human resources of your organization to spend time on research, strategy, planning, and implementing new and advanced features and technologies. Only through intelligent automation with autonomic computing will we be able to deliver on the promise of technology.


Applications and data are constantly changing. Users require instant response time and 24/7 availability. The database structures that support these applications must be properly maintained to ensure optimal application performance. Proper database design, appropriate choice of clustering, and reorganizing databases based on statistics help to deliver efficient databases. Furthermore, DBAs can ensure database performance by automating these processes to reduce the risk and errors associated with manual database maintenance.

As IT tasks get more complex and IT professionals are harder to employ and retain, more and more IT duties should be automated using intelligent management software and autonomics. This is especially true for very complex jobs, such as DBA. Automation through autonomics can reduce the amount of time, effort, and human error associated with managing databases and complex applications.

Portions of this post were excerpted from my book Database administration: The complete guide to DBA practices and procedures (2nd edition).

Posted in DBA | 1 Comment

Optimizing Database Performance, part 5: Database Reorganization

Portions of this tip were excerpted from my book Database administration: The complete guide to DBA practices and procedures (2nd edition).

The first, second, third, and fourth parts of this series are also available.

Today’s topic is reorganization.

Relational technology and SQL make data modification easy. Just issue an INSERT, UPDATE, or DELETE statement with the appropriate WHERE clause and the DBMS takes care of the actual data navigation and modification. In order to provide this level of abstraction, the DBMS handles the physical placement and movement of data on disk. Theoretically, this makes everyone happy. The programmer’s interface is simplified, and the DBMS takes care of the hard part—manipulating the actual placement of data. However, things are not quite that simple. The manner in which the DBMS physically manages data can cause subsequent performance issues.

Every DBA has encountered the situation where a query or application that used to perform well slows down after it has been in production for a while. These slowdowns have many potential causes—perhaps the number of transactions issued has increased, or the volume of data has expanded. However, the performance problem might be due to database disorganization. Database disorganization occurs when a database’s logical and physical storage allocations contain many scattered areas of storage that are too small, not physically contiguous, or too disorganized to be used productively. Let’s review the primary culprits:

  • The first possibility is unclustered data. If the DBMS does not strictly enforce clustering, a clustered table or index can become unclustered as data is added and changed. If the data becomes significantly unclustered, the DBMS cannot rely on the clustering sequence, and queries that were optimized to take advantage of that sequence no longer can. In this case, the performance of queries run against the unclustered table will suffer.
  • Fragmentation is a condition in which there are many scattered areas of storage in a database that are too small to be used productively. It results in wasted space, which can hinder performance because additional I/Os are required to retrieve the same data.
  • Row chaining or row migration occurs when updated data does not fit in the space it currently occupies, and the DBMS must find space for the row. With row chaining, the DBMS moves a part of the new, larger row to a location within the tablespace where free space exists. With row migrations, the full row is placed elsewhere in the tablespace. In each case, a pointer is used to locate either the rest of the row or the full row. Both row chaining and row migration will result in the issuance of multiple I/Os to read a single row. Performance will suffer because multiple I/Os are more expensive than a single I/O.
  • Page splits can cause disorganized databases, too. If the DBMS performs monotonic page splits when it should perform normal page splits, or vice versa, space may be wasted. When space is wasted, fewer rows exist on each page, causing the DBMS to issue more I/O requests to retrieve data. Therefore, once again, performance suffers.
  • File extents can negatively impact performance. An extent is an additional file that is “connected” to the original file and can be used only in conjunction with the original file. When the file used by a tablespace runs out of space, an extent is added for the file to expand. However, file extents are not stored contiguously with the original file. As additional extents are added, data requests will need to track the data from extent to extent, and the additional code this requires is unneeded overhead. Resetting the database space requirements and reorganizing can clean up file extents.

Depending on the DBMS, there may be additional causes of disorganization. For example, if multiple tables are defined within a single tablespace, and one of the tables is dropped, the tablespace may need to be reorganized to reclaim the space.

The need to access a table in a tablespace having any of these issues can cause a previously efficient application to perform worse.

Reorganizing Disorganized Tablespaces

To correct disorganized database structures, the DBA can run a database or tablespace reorganization utility, or REORG, to force the DBMS to restructure the database object, thus removing problems such as unclustered data, fragmentation, row chaining, etc. The primary benefit of reorganization is the resulting speed and efficiency of database functions because the data is organized in a more optimal fashion on disk. In short, reorganization maximizes availability and reliability for databases.

Both tablespaces and indexes can be reorganized. How the DBA runs a REORG utility depends on the specific DBMS in use. Some DBMS products ship with a built-in reorganization utility; others require the customer to purchase separate software or “roll their own.” Still others claim that the customer will not need the utility at all when using their DBMS. I have found the last claim to be untrue. Every DBMS incurs some degree of disorganization as data is added and modified.

To “roll your own,” DBAs can manually reorganize a database by completely rebuilding it. Accomplishing such a reorganization requires a complex series of steps, as depicted below.

Figure 1. Typical steps for a manual reorganization.

Of course, depending upon the DBMS and the specific options chosen for the object, additional steps (such as decompression/recompression and/or sorting) may be necessary.

If a utility is available for reorganizing, from either the DBMS vendor or a third-party vendor, the process is greatly simplified. Sometimes it is as simple as issuing a single command that names the database object to be reorganized (for example, a REORG TABLESPACE utility statement in DB2).


A traditional reorganization requires the database (or at least the database object) to be offline. The high cost of downtime creates pressure both to perform and to delay preventive maintenance—a no-win situation familiar to most DBAs. Most modern REORG utilities can perform the reorganization while the database structures remain online. Such a reorganization is accomplished by making a copy of the data. The online REORG utility reorganizes the copy while the original data remains online. When the copied data has been reorganized, an online REORG uses the database log to “catch up” by applying to the copy any data changes that occurred during the process. When the copy has caught up to the original, the online REORG switches the production tablespace from the original to the copy. Performing an online reorganization requires additional disk storage and a window of low transaction activity. If a large number of transactions occur during the online reorganization, the REORG may have a hard time catching up.
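
The control flow of an online reorganization can be illustrated with a toy, in-memory simulation like the one below; real utilities of course operate on database files and the transaction log, so treat this purely as a sketch of the copy, catch-up, and switch sequence.

```python
# A conceptual, in-memory simulation of the online REORG approach described
# above: reorganize a copy, use a "log" of changes to catch up, then switch.
# This only illustrates the control flow, not any vendor's implementation.

def online_reorg(original, change_log):
    shadow = sorted(original)            # 1. reorganize a copy of the data
    applied = 0
    while applied < len(change_log):     # 2. catch up: apply changes made to the
        change = change_log[applied]     #    original while the copy was being built
        shadow.append(change)
        shadow.sort()                    #    keep the copy "organized"
        applied += 1                     #    a busy system keeps this loop running
    return shadow                        # 3. switch: the copy becomes production

# Example: online_reorg([5, 1, 9], change_log=[3, 7]) -> [1, 3, 5, 7, 9]
```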

Today, there are some next generation utilities being introduced that eliminate steps and adjust the manner in which the reorganization occurs based on the database object being reorged. Such modern software may be required for high availability, large volume applications.

Determining When to Reorganize

System catalog statistics can help to determine when to reorganize a database object. Each DBMS provides a method of reading through the contents of the database and recording statistical information about each database object. Depending on the DBMS, this statistical information is stored either in the system catalog or in special pages within the database object itself.

One statistic that can help a DBA determine when to reorganize is cluster ratio. Cluster ratio is the percentage of rows in a table that are actually stored in clustering sequence. The closer the cluster ratio is to 100 percent, the more closely the actual ordering of the rows on the data pages matches the clustering sequence. A low cluster ratio indicates bad clustering, and a reorganization may be required. A low cluster ratio, however, may not be a performance hindrance if the majority of queries access data randomly instead of sequentially.
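
To illustrate what this statistic measures, here is a simplified calculation over a list of clustering-key values taken in physical row order; actual DBMS formulas are more sophisticated, so treat this as conceptual only.

```python
# A simplified illustration of what a cluster ratio measures: the share of
# rows that are physically stored in (non-descending) clustering key order.

def cluster_ratio(keys_in_physical_order):
    """Percentage of rows whose key is not lower than the preceding row's key."""
    if len(keys_in_physical_order) < 2:
        return 100.0
    in_sequence = sum(
        1 for prev, curr in zip(keys_in_physical_order, keys_in_physical_order[1:])
        if curr >= prev
    )
    # The first row is counted as "in sequence" by definition here.
    return 100.0 * (in_sequence + 1) / len(keys_in_physical_order)

# Example: cluster_ratio([10, 20, 15, 30, 40]) -> 80.0  (one row out of sequence)
```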

Tracking down the other causes of disorganization can sometimes be difficult. Some DBMSs gather statistics on fragmentation, row chaining, row migration, space dedicated to dropped objects, and page splits; others do not.

One example: Oracle provides a plethora of statistics in a special group of objects called the dynamic performance views (commonly known as the V$ views), which can be queried like tables.

Tablespaces are not the only database objects that can be reorganized. Indexes, too, can benefit from reorganization. As table data is added and modified, the index too must be changed. Such changes can cause the index to become disorganized.

A vital index statistic to monitor is the number of levels. Most relational indexes are b-tree structures. As data is added to the index, the number of levels of the b-tree will grow. When more levels exist in the b-tree, more I/O requests are required to move from the top of the index structure to the actual data that must be accessed. Reorganizing an index can cause the index to be better structured and require fewer levels.
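
A rough rule-of-thumb calculation shows why the number of levels matters: levels grow with the logarithm of the number of keys, and each extra level is roughly one more I/O per index probe. The per-page fan-out figure below is an assumption for illustration.

```python
# A rough illustration of b-tree depth: the number of levels grows with the
# logarithm of the number of keys. The fan-out (entries per page) is assumed.
import math

def estimated_btree_levels(num_keys, entries_per_page=500):
    """Approximate number of b-tree levels (root to leaf) for num_keys entries."""
    if num_keys <= entries_per_page:
        return 1
    return math.ceil(math.log(num_keys, entries_per_page))

# Example: with roughly 500 entries per index page,
#   estimated_btree_levels(100_000)    -> 2
#   estimated_btree_levels(50_000_000) -> 3   (one more I/O per probe)
```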

Another index statistic to analyze to determine whether reorganization is required is the distance between the index leaf pages, or leaf distance. Leaf distance is an estimate of the average number of pages between successive leaf pages in the index. Gaps between leaf pages can develop as data is deleted from an index or as a result of page splitting. Of course, the best value for leaf distance is zero, but achieving a leaf distance of zero in practice is not realistic. In general, the lower this value, the better. Review the value over time to determine a high-water mark for leaf distance that will indicate when indexes should be reorganized.


Database reorganization is a very important component in the DBA’s arsenal of performance tools. Taking the time to understand how your DBMS supports reorganization will be an important part of the overall job of maintaining optimal database/application performance.

This series of blog posts is excerpted from Craig Mullins’ classic book Database administration: The complete guide to DBA practices and procedures. Consult the book for more in-depth information on any of these topics – and much, much more.

Posted in DBA | Tagged , , , | 2 Comments

Optimizing Database Performance, part 4: File Placement and Page Sizing

Portions of this tip were excerpted from my book Database administration: The complete guide to DBA practices and procedures (2nd edition).

The first, second, and third parts of this series are also available.

Today we start by discussing file placement and data set allocation.

File Placement and Allocation

The location of the files containing the data for the database can have an impact on performance. A database is very I/O intensive, and the DBA must make every effort to minimize the cost of physical disk reading and writing.

This discipline entails

  • Understanding the access patterns associated with each piece of data in the system
  • Placing the data on physical disk devices in such a way as to optimize performance

The first consideration for file placement on disk is to separate the indexes from the data, if possible. Database queries are frequently required to access data from both the table and an index on that table. If both of these files reside on the same disk device, performance degradation is likely. To retrieve data from disk, an arm moves over the surface of the disk to read physical blocks of data on the disk. If a single operation is accessing data from files on the same disk device, latency will occur; reads from one file will have to wait until reads from the other file are processed. Of course, if the DBMS combines the index with the data in the same file, this technique cannot be used.

Another rule for file placement is to analyze the access patterns of your applications and separate the files for tables that are frequently accessed together. DBAs should do this for the same reason they should separate index files from table files.

A final consideration for placing files on separate disk devices occurs when a single table is stored in multiple files (partitioning). It is wise in this case to place each file on a separate disk device to encourage and optimize parallel database operations. If the DBMS can break apart a query to run it in parallel, placing multiple files for partitioned tables on separate disk devices will minimize disk latency.

With modern disk systems such as RAID devices, precise file placement is often difficult, if not impossible, to achieve. RAID storage consists of arrays of disk drives, and the files stored on the device are split across multiple physical disk drives. Be sure to read the documentation for your specific DBMS and follow the guidance for the placement of critical system files (such as log and configuration files). And keep in mind that a hard disk drive with two partitions, one holding log files and the other partition holding data files, is not the same as two disks, each dedicated to either log or data files.

Database Transaction Log Placement

Placing the transaction log on a separate disk device from the actual data allows the DBA to back up the transaction log independently from the database. It also minimizes dual writes to the same disk. Writing data to two files on the same disk drive at the same time will degrade performance even more than reading data from two files on the same disk drive at the same time. Remember, too, that every database modification (write) is recorded on the database transaction log.

Distributed Data Placement

The goal of data placement is to optimize access by reducing contention on physical devices. Within a client/server environment, this goal can be expanded to encompass the optimization of application performance by reducing network transmission costs.

Data should reside at the database server where it is most likely, or most often, to be accessed. For example, Chicago data should reside at the Chicago database server, Los Angeles–specific data should reside at the Los Angeles database server, and so on. If the decision is not so clear-cut (e.g., San Francisco data if there is no database server in San Francisco), place the data on the database server that is geographically closest to where it will be most frequently accessed (in the case of San Francisco, L.A., not Chicago).

Be sure to take fragmentation, replication, and snapshot tables into account when deciding upon the placement of data in your distributed network.

Disk Allocation

The DBMS may require disk devices to be allocated for database usage. If this is the case, the DBMS will provide commands to initialize physical disk devices. The disk initialization command will associate a logical name with a physical disk partition or OS file. After the disk has been initialized, it is stored in the system catalog and can be used for storing table data.

Before initializing a disk, verify that sufficient space is available on the physical disk device. Likewise, make sure that the device is not already initialized.

Use meaningful device names to facilitate more efficient usage and management of disk devices. For example, it is difficult to misinterpret the usage of a device named DUMP_DEV1 or TEST_DEV7. However, names such as XYZ or A193 are not particularly useful. Additionally, maintain documentation on initialized devices by saving script files containing the actual initialization commands and diagrams indicating the space allocated by device.

Page Size (Block Size)

Sizing database blocks (or pages) is another important method that can be used to control and manage database and application performance.

Most DBMSs provide the ability to specify a page, or block, size. The page size is used to store table rows (or, more accurately, records that contain the row contents plus any overhead) on disk. For example, consider a table requiring rows that are 125 bytes in length with 6 additional bytes of overhead. This makes each record 131 bytes long. To store 25 records on a page, the page size would have to be at least 3,275 bytes. However, each DBMS requires some amount of page overhead as well, so the practical size will be larger. If page overhead is 20 bytes, the page size would be 3,295—that is, 3,275 + 20 bytes of overhead.

This discussion, however, is simplistic. In general practice, most table spaces will require some amount of free space to accommodate new data. Therefore, some percentage of free space will need to be factored into the equation.

To complicate matters, many DBMSs limit the page sizes that can be chosen. For example, DB2 for z/OS limits page size to 4K, 8K, 16K, or 32K; Microsoft SQL Server supports only an 8K page size. In the case of DB2, the DBA will need to calculate the best page size based on row size, the number of rows per page, and free space requirements.

Consider this question: “In DB2 for z/OS, what page size should be chosen if 0 percent free space is required and the record size is 2,500 bytes?”

The simplistic answer is 4K, but it might not be the best answer. A 4K page would hold one 2,500-byte record per page, but an 8K page would hold three 2,500-byte records. The 8K page could provide for more efficient sequential I/O processing, because reading 8K of data would return three rows, whereas reading 8K of data using two 4K pages would return only two rows.
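
For readers who like to check the arithmetic, here is a small sketch of the rows-per-page calculation; the 20-byte page overhead is simply the illustrative figure used above, since actual overhead varies by DBMS, page size, and options chosen.

```python
# A sketch of the rows-per-page arithmetic used in the examples above.
# The 20-byte page overhead is the illustrative figure from the text.

def rows_per_page(page_size, record_length, page_overhead=20, free_space_pct=0.0):
    """How many fixed-length records fit on a page after overhead and free space."""
    usable = (page_size - page_overhead) * (1.0 - free_space_pct / 100.0)
    return int(usable // record_length)

# The 2,500-byte record example, with no free space:
#   rows_per_page(4096, 2500) -> 1   (one row per 4K page)
#   rows_per_page(8192, 2500) -> 3   (three rows per 8K page)
# And the earlier 131-byte record example:
#   rows_per_page(4096, 131)  -> 31
```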

Choosing the proper page size is an important DBA task for optimizing database I/O performance.


Page/block size and file/data set allocation can be important criteria for DBAs as they optimize and manage their database systems. Understanding the options available for your specific DBMS will allow you to become a better DBA and improve the performance of your databases, applications, and systems.

This series of blog posts is excerpted from Craig Mullins’ classic book Database administration: The complete guide to DBA practices and procedures. Consult the book for more in-depth information on any of these topics – and much, much more.


Posted in database design, DBA, performance | 3 Comments