Season's Greetings 2023

We have entered the holiday season 2023. Did you know there are nearly two dozen different holidays and celebrations observed from the end of October through the beginning of January every year? With that in mind, I like to make an annual post wishing well to everyone who celebrates any (or all) of these festivities!

So have a great holiday season… and we'll see you back here again next year!

Posted in Uncategorized

My Latest Book: Mainframe Specialty Processors

Today’s post will be brief… just to announce my latest book. It does have relevance to data, but only for IBM System Z mainframe shops.

The book is titled: Mainframe Specialty Processors: Understanding zIIPs, Licensing, and Cost Savings on the IBM System z. And the title pretty much sums up the content of the book. But here are some more details…

If you are an IT professional who works on IBM Z mainframes, then you’ve probably heard about zIIPs and other “specialty processors.” But you may not really know what they are, what they do, and why they exist. This book will clarify the purpose of specialty processors and how you can best utilize them for cost optimization.

The book provides a high-level overview of pertinent mainframe internals such as control blocks, SRBs, and TCBs, and why they are important for understanding how zIIPs work.

Additionally, because reducing mainframe software cost is essential to the purpose of specialty processors, the book provides a high-level introduction to mainframe licensing and pricing. Mainframe pricing and billing is a complex topic that is understood by few. Nevertheless, this book will help you become conversant on the topic.

Furthermore, the book describes the types of workloads that can take advantage of specialty processors, including advice on how to promote zIIP usage in your applications and systems. One of the biggest beneficiaries of zIIPs is Db2 workload, but only specific types of Db2 processing qualify. And the book will guide you as to what types of Db2 processes qualify!

The book is primarily focused on zIIP usage, but it also takes a look at the other types of specialty processors, including what they are used for and why it can be beneficial to use them. And finally, the book speculates on the future of specialty processors in the realm of the IBM Z.

So, if you work on mainframe systems, or know a loved one who does, why not pick up a copy of the book on Amazon right now?

Posted in architecture, books, capacity planning, DB2, IBM, mainframe, optimization, performance, software cost, zIIP

Q+A: Upgrading the Company DBMS

I regularly receive questions from readers and every now and then I choose one to highlight here on the blog. Today, I want to share a Q+A from a reader/student asking for guidance on things a company should consider when upgrading their DBMS…

Question:

I am writing a company proposal on a technical issue of my choice. Since I took a MySQL class and it may be related to my career field, I thought it would be a good idea to pick “upgrading the company’s DBMS.” So my question would be: can you give me a little insight on the quality of DBMSes? For example, why choose Oracle over MySQL, and are newer brands better than older ones? For the sake of my report I chose MySQL as the brand that I wanted to propose to my company, mainly because it is the one that I am familiar with. Any insight would be very much appreciated.

My Answer:

Those are interesting questions indeed, and not always easy to answer. Here is how I would go about determining the database architecture for an organization.

The first thing to do is to determine the primary relational DBMS that will serve as the backbone of the organization’s transactional systems. This could be MySQL, of course, or you may need to go with a DBMS that has a stronger track record of enterprise support (such as IBM Db2, Microsoft SQL Server, or Oracle). Factors to examine include the O/S supported by the DBMS and your organization’s required O/S support, performance and availability requirements, scalability needs, available technicians, support for the cloud, total cost of ownership, and analyst recommendations.

Keep in mind that an organization may have multiple relational/SQL database systems due to packaged application support and historical needs. Most relational/SQL DBMSes will support transaction processing for most applications.

The next step is to decide on an analytical DBMS for the backbone of analytical, data warehousing, and data science processing. Again, we should choose a relational/SQL DBMS, but it should have the capability to perform OLAP and column-based queries. If the DBMS supports HTAP (Hybrid Transactional/Analytical Processing), then perhaps the same DBMS can be used for both transactions and analytics, but this often requires additional configuration and perhaps an appliance (e.g., the IBM Db2 Analytics Accelerator).

Then you need to review the use cases for all upcoming projects. If a project can be accomplished using the DBMSes already selected, that is ideal. However, life is not always ideal, and sometimes this requires adding a new DBMS, or even a new type of DBMS. For example, consider the NoSQL DBMS variants (graph, wide column, document, and key/value databases). Each of these supports specific use cases and might be required for web, social media, IoT, or other types of projects that are not easily supported by relational/SQL database systems.

And do not forget the legacy DBMSes required to run applications that have been around for a while. This may mean that you will also have to support database systems such as IBM IMS (hierarchical), Broadcom IDMS (network/CODASYL), Software AG Adabas (inverted list), or object-oriented database systems.

Once you have your infrastructure set, which will likely consist of multiple DBMS products, you need to plan for the update cycle of each DBMS product. This means coordinating with the published release cycle for each vendor. You must always ensure that you are running a supported version of the DBMS. Vendors typically release new versions every 18 to 36 months, but support a product for 5 or more years, running from GA (general availability), through EOM (end of marketing, meaning no longer sold), to EOS (end of support, meaning no longer supported). You must plan to move to a new release of each DBMS before the EOS deadline.
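
If you want to automate that planning, here is a minimal Python sketch of the idea. The products, versions, EOS dates, and one-year planning window are hypothetical placeholders; the real dates must come from each vendor's published support lifecycle.

    from datetime import date, timedelta

    # Hypothetical inventory: (product, version) -> vendor-published EOS date.
    # Replace these with the actual dates from each vendor's lifecycle page.
    EOS_DATES = {
        ("Db2 for z/OS", "12.1"): date(2025, 12, 31),
        ("SQL Server", "2014"): date(2024, 7, 9),
        ("MySQL", "5.7"): date(2023, 10, 31),
    }

    WARNING_WINDOW = timedelta(days=365)  # start planning a year before EOS

    def review_support_status(today: date = None) -> None:
        """Flag installed DBMS versions that are past EOS or inside the planning window."""
        today = today or date.today()
        for (product, version), eos in sorted(EOS_DATES.items(), key=lambda kv: kv[1]):
            if today > eos:
                status = "PAST EOS - unsupported, upgrade immediately"
            elif today > eos - WARNING_WINDOW:
                status = "approaching EOS - schedule the upgrade project"
            else:
                status = "supported"
            print(f"{product} {version}: EOS {eos} -> {status}")

    if __name__ == "__main__":
        review_support_status()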

Further complicating matters is that many vendors have moved to an agile, continuous delivery model. This means that smaller, more frequent updates are being made available. You must decide where your organization fits on the Gartner technology adoption framework (https://www.gartner.com/it-glossary/type-a-b-and-c-enterprises) and create a strategy for how frequently you accept new modifications. At any rate, you will need to understand the mechanisms used by your vendor(s) and strategically determine how you will introduce changes into your database environment.

The cloud throws another monkey wrench into this because if you use a cloud database such as Azure SQL Database, the cloud provider (in this case Microsoft) will always be running the most recent version of the DBMS. So you will have to adopt a more aggressive strategy for accepting change into your DBMS environment.

Best of luck with your project and I hope this information has helped.

Posted in architecture, availability, change management, maintenance, relational, scalability

Optimizing Database Performance

Last week I was proud to publish my latest database book, this time on the important topic of achieving and maintaining efficient performance for your databases and the applications that access them. Appropriately enough, the book is titled Optimizing Database Performance, and it is available now in both print and ebook formats on Amazon. So I hope you will click on the links and check it out!

I wrote the book to explain the many aspects of performance management that need to be addressed when dealing with database management systems. You see, even though it is generally accepted that you should tune your systems and that high performance is a “good thing,” rarely does anybody really take the time to define what is meant by the term “performance.” Especially when things are falling apart or on fire!

One of the first things this book does is to define the elements of database performance. After grasping the basic concepts, the book expands to detail the components and issues that contribute to how your database systems and applications perform. It defines the four levels of database performance and offers in-depth guidance on the components of each. It also offers up steps that can be taken not only to review database performance, but to optimize the systems and applications accessing your databases.

Armed with the information in this book you can confidently implement, maintain, and tune your databases and applications to ensure their effectiveness and efficiency. In other words, ensure optimal database performance.

Get your copy today!

Posted in analytics, availability, data availability, DBA, indexing, operational, optimization, performance

Database Performance versus Scalability

Database performance and scalability are both important aspects of managing and optimizing databases, but they refer to different characteristics and considerations.

Database performance refers to how efficiently a database system can respond to and process queries, transactions, and other operations. It is focused on the speed, responsiveness, and overall efficiency of database operations. Factors that influence database performance include:

  • Query Optimization: Ensuring that database queries are structured and written in a way that minimizes resource usage and execution time.
  • Indexing: Creating and maintaining appropriate indexes on tables to speed up data retrieval and query execution (a small sketch of this appears after the list).
  • Data Modeling: Designing the database schema and relationships in a way that minimizes data redundancy and maximizes query performance.
  • Caching: Implementing caching mechanisms to store frequently accessed data in memory for faster retrieval.
  • Hardware Resources: Utilizing sufficient hardware resources, such as CPU, memory, and disk speed, to support the database workload.
  • Tuning: Regularly monitoring and tuning the database system to identify and address performance bottlenecks.
  • Concurrency Control: Managing concurrent access to the database to ensure data consistency and prevent conflicts among multiple users or processes.
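
To make the indexing bullet concrete, here is a small sketch using Python's built-in sqlite3 module (the orders table and its columns are invented for illustration). The access path reported by EXPLAIN QUERY PLAN changes from a full table scan to an index search once the index exists; other DBMSes expose the same information through their own EXPLAIN facilities.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i % 500, i * 1.25) for i in range(10_000)],
    )

    query = "SELECT * FROM orders WHERE customer_id = ?"

    # Without an index on customer_id the optimizer must scan the whole table.
    print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())

    # After creating the index, the same query can use an index search instead.
    conn.execute("CREATE INDEX ix_orders_customer ON orders (customer_id)")
    print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())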

Database scalability refers to the ability of a database system to handle increasing amounts of data, workload, and users without significantly sacrificing performance. It involves the capacity of the database to grow and handle increased demand over time. There are two main types of scalability:

  • Vertical Scalability (Scaling Up): Increasing the capacity of a single server by adding more resources, such as upgrading the CPU, adding more memory, or increasing storage capacity. This approach has limits and can become costly.
  • Horizontal Scalability (Scaling Out): Distributing the database workload across multiple servers or nodes in a cluster. New servers can be added as needed, allowing for more linear scalability. This approach is more flexible and can potentially handle very large workloads.

Scalability considerations include:

  • Partitioning/Sharding: Dividing the data into smaller partitions or shards to distribute the load across multiple nodes in a distributed system (a simple routing sketch follows this list).
  • Replication: Creating copies of data on multiple nodes to improve fault tolerance and allow for read scaling.
  • Load Balancing: Distributing incoming queries and requests evenly across multiple nodes to avoid overloading any single node.
  • Consistency and Coordination: Ensuring data consistency and coordination across distributed nodes while maintaining acceptable performance.
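
As a simple illustration of the partitioning/sharding bullet above, this hedged Python sketch routes each row to a shard by hashing its partitioning key; the shard names and the choice of key are hypothetical.

    import hashlib

    # Hypothetical shard topology: each shard would be a separate database node.
    SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

    def shard_for(key: str) -> str:
        """Route a row to a shard by hashing its partitioning key.

        A stable hash (not Python's randomized built-in hash()) keeps the
        mapping consistent across processes and restarts.
        """
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # Example: customer IDs spread across the shards.
    for customer_id in ("C1001", "C1002", "C1003", "C1004"):
        print(customer_id, "->", shard_for(customer_id))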

In summary, database performance focuses on optimizing the efficiency and responsiveness of individual database operations, while database scalability focuses on the system’s ability to handle increased data and workload demands while maintaining acceptable performance levels. Both aspects are crucial for designing and managing a successful database system.

Posted in performance, scalability

Perhaps The Most Important Data Management Issue Today

Although there are many important issues that organizations are struggling with regarding data and data management these days, few would dispute that securing and protecting data is among the most vital. With the increasing reliance on digital systems and the proliferation of sensitive data, organizations face significant challenges in protecting their databases from unauthorized access, data breaches, and cyber threats.

Data breaches can lead to severe consequences, including financial losses, reputational damage, and legal repercussions. As a result, ensuring the security and privacy of data has become a critical concern for organizations across various industries. It involves implementing robust access control mechanisms, encryption techniques, and adopting best practices for data governance and compliance. Of course, these things are easier said than done.

It is also worth noting that there are many industry and governmental regulations driving the need to improve data protection, management, and administration. One of the more visible governmental regulations is the Sarbanes-Oxley Act (SOX), officially known as the U.S. Public Company Accounting Reform and Investor Protection Act of 2002. The goal of SOX is to regulate corporations in order to reduce fraud and to improve disclosure and financial reporting. The impact and cost of the law on IT and database management are significant. Section 404 of SOX specifies that the CFO must guarantee the processes used to produce financial reports. Those processes are typically computer programs that access data in a database.

Consider also the Health Insurance Portability and Accountability Act, commonly referred to as HIPAA. This legislation mandates that health care providers protect individuals’ health care information, stating that providers must be able to document everyone who even so much as looked at that information. HIPAA audits frequently require the examination of the processes used to create, document, and review exception reports and logs. When confronted with a HIPAA audit, organizations can be required to produce a list of exceptions to policy, such as, “When were patient records accessed during off hours and by whom?” Without database auditing software, it is impossible to produce a list of users who looked at a specific row or set of rows in any database.
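
To illustrate the kind of evidence an auditor might request, here is a hypothetical example: the audit_log table, its columns, and the definition of “off hours” are all assumptions, since real audit repositories vary by product and tool.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE audit_log (            -- hypothetical audit repository
            user_id     TEXT,
            table_name  TEXT,
            action      TEXT,
            accessed_at TEXT                -- ISO-8601 timestamp
        )
    """)
    conn.executemany(
        "INSERT INTO audit_log VALUES (?, ?, ?, ?)",
        [
            ("nurse01",   "patient_records", "SELECT", "2023-08-14 14:05:00"),
            ("contract7", "patient_records", "SELECT", "2023-08-14 23:42:00"),
        ],
    )

    # "Off hours" is assumed here to mean before 08:00 or after 18:00.
    off_hours_access = conn.execute("""
        SELECT user_id, accessed_at
          FROM audit_log
         WHERE table_name = 'patient_records'
           AND (CAST(strftime('%H', accessed_at) AS INTEGER) < 8
                OR CAST(strftime('%H', accessed_at) AS INTEGER) >= 18)
    """).fetchall()

    print(off_hours_access)   # -> [('contract7', '2023-08-14 23:42:00')]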

Other compliance-related legislation includes the Gramm-Leach-Bliley (GLB) Act (also known as the Financial Modernization Act of 1999) and the E-Government Act, passed in 2002 as a response to terrorist threats. Title III of that act is the Federal Information Security Management Act (FISMA), which states that federal agencies, contractors, and any entity that supports them must maintain security commensurate with potential risk.

And we would be remiss without mentioning PCI-DSS, the Payment Card Industry Data Security Standard. PCI-DSS is an industry standard, as opposed to the governmental regulations mentioned previously. Established by the major credit card companies, it dictates the requirements for organizations that accept payment cards, with the goal of preventing credit card fraud, hacking, and other security issues. Failing to comply risks losing the right to accept credit cards as payment. PCI-DSS emphasizes the importance of real-time monitoring and tracking of access to cardholder data, as well as continuous assessment of the security health of the databases storing that data.

Regulatory compliance holds an important sway over upper-level management at most medium- to large-size organizations because of its potent impact. Business executives are keenly aware of the need to comply, although they are not always aware of all the details that compliance involves. This is because failure to comply can result in prosecution, which may involve huge fines and even jail time. Regulatory compliance can impose upon C-level executives the need to be able to prove that corporate data (and therefore, database systems) are protected and that the processes and procedures enacted upon the data are accurate and required.

Of course, there are other important issues in terms of data security and protection including data archiving, database auditing, data access controls, encryption, data masking, and more.

Database archiving protects data by removing it from the database system and storing it for posterity in an archive data store. There are over 150 international, federal, and local laws that establish mandated retention periods for different types of data. Many of these laws greatly expand the duration over which data must be retained. Many organizations deal with retention periods based on business needs only, but this can be short-sighted. Depending on the industry, what was once a short retention period may now expand into decades or even longer. To comply with these laws, corporations must re-evaluate their established methods and policies for managing and retaining data. What worked in the past to retain data for a few years is no longer sufficient over a much longer period.

Database auditing refers to the process of monitoring and recording activities that occur within a database system. It involves capturing and analyzing various actions and events, such as user logins, data modifications, schema changes, and system configuration updates. The primary goal of database auditing is to ensure the integrity, security, and compliance of the database environment. You can think of it as “monitoring who did what to which data when” in the database. Database auditing should be conducted in alignment with applicable legal and privacy requirements. Organizations should define clear auditing policies, specify the scope of auditing, and ensure the appropriate level of data protection and access controls are in place to safeguard the audit logs themselves.
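
As a minimal sketch of the “what happened to which data when” portion of auditing, the following uses a SQLite trigger to capture balance changes into an audit table. The table and column names are invented, and capturing the “who” normally requires session context from the application or the DBMS’s native audit facility, which production-grade auditing tools provide.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);

        CREATE TABLE account_audit (
            account_id  INTEGER,
            old_balance REAL,
            new_balance REAL,
            changed_at  TEXT DEFAULT (datetime('now'))
        );

        -- Record every balance change: which row, what changed, and when.
        CREATE TRIGGER trg_account_update AFTER UPDATE ON account
        BEGIN
            INSERT INTO account_audit (account_id, old_balance, new_balance)
            VALUES (OLD.id, OLD.balance, NEW.balance);
        END;

        INSERT INTO account (balance) VALUES (100.0);
        UPDATE account SET balance = 75.0 WHERE id = 1;
    """)

    print(conn.execute("SELECT * FROM account_audit").fetchall())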

Data access controls refer to mechanisms and procedures enacted to manage the access and usage of data within a database. These controls are designed to ensure that only authorized individuals or entities can access, modify, or retrieve specific data based on predefined permissions and security policies. Data access controls are an essential component of data security and play a crucial role in protecting sensitive information from unauthorized access, misuse, or data breaches. Examples include authentication, authorization, role-based access control (RBAC), data encryption, and database views.
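
By way of illustration, here is a small Python sketch of the role-based access control idea: roles bundle permissions, and a user’s request is checked against the roles granted to that user. The roles and permission names are hypothetical.

    # Hypothetical role -> permission mapping for an RBAC check.
    ROLE_PERMISSIONS = {
        "analyst": {"read:sales", "read:customers"},
        "dba":     {"read:sales", "read:customers", "write:customers", "alter:schema"},
    }

    USER_ROLES = {
        "alice": {"analyst"},
        "bob":   {"dba"},
    }

    def is_authorized(user: str, permission: str) -> bool:
        """Return True if any of the user's roles grants the requested permission."""
        return any(
            permission in ROLE_PERMISSIONS.get(role, set())
            for role in USER_ROLES.get(user, set())
        )

    print(is_authorized("alice", "write:customers"))  # False - analysts cannot write
    print(is_authorized("bob", "alter:schema"))       # True  - DBAs can alter the schema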

It’s important to implement a comprehensive and layered approach to data access controls, combining multiple control mechanisms to ensure the security and integrity of sensitive data. The specific access control measures implemented may vary based on the nature of the data, the system architecture, and the regulatory requirements applicable to the organization.

Another technique to protect database data is data masking, which is the process of protecting sensitive data from inappropriate visibility by replacing it with realistic, but not accurate, data. The goal is for sensitive, personally identifiable information (PII) not to be exposed outside of authorized environments. Data masking can prevent fraud, identity theft, and other criminal activities. It can be done while provisioning test environments so that copies created to support development and testing do not expose sensitive information. Valid production data is replaced with usable, referentially intact, but obfuscated data. After masking, the test data is usable just like production data, but the information content is secure.
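
Here is a simple, hedged sketch of format-preserving masking for test data. Real masking tools also keep masked values consistent across tables to preserve referential integrity, which this toy example ignores.

    import random
    import string

    def mask_ssn(ssn: str) -> str:
        """Replace a Social Security number with a realistic but fake one, keeping the format."""
        return "".join(random.choice(string.digits) if ch.isdigit() else ch for ch in ssn)

    def mask_email(email: str) -> str:
        """Obscure the local part of an email address but keep it usable for testing."""
        local, _, domain = email.partition("@")
        fake_local = "user" + "".join(random.choices(string.digits, k=4))
        return f"{fake_local}@{domain}"

    row = {"name": "Jane Doe", "ssn": "123-45-6789", "email": "jane.doe@example.com"}
    masked = {"name": row["name"], "ssn": mask_ssn(row["ssn"]), "email": mask_email(row["email"])}
    print(masked)   # e.g. {'name': 'Jane Doe', 'ssn': '407-91-2385', 'email': 'user8312@example.com'}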

And let’s not forget about another long-standing security issue, SQL injection, which is a form of web hacking whereby SQL statements are specified in the fields of a web form to cause a poorly designed application to dump database content to the attacker. In order for SQL injection to succeed, the application code used by the website must be vulnerable to an injection attack. SQL injection relies on programs that do not adequately filter for string literal escape characters embedded in SQL statements or where user input is not strongly typed. So instead of inputting data into a form, SQL statements are supplied. The SQL is “injected” from the web form into the database causing it to be executed and access (or even modify) unintended data. This form of attack has been known for decades now, but it still happens to poorly designed applications.
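
To show the difference, here is a minimal Python/sqlite3 sketch contrasting a vulnerable, string-built query with a parameterized one; the table and the injected input are contrived for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('admin', 's3cret')")

    user_input = "' OR '1'='1"   # a classic injection payload typed into a web form

    # VULNERABLE: the input is concatenated directly into the SQL statement,
    # so the injected OR clause matches every row in the table.
    vulnerable = conn.execute(
        "SELECT * FROM users WHERE username = '" + user_input + "'"
    ).fetchall()
    print(vulnerable)   # -> [('admin', 's3cret')] even though no such user was requested

    # SAFER: a parameterized query treats the whole input as a literal value.
    safe = conn.execute(
        "SELECT * FROM users WHERE username = ?", (user_input,)
    ).fetchall()
    print(safe)         # -> [] because no username equals the injected string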

Furthermore, as the volume of data continues to grow exponentially, another important issue is managing the scalability and performance of databases. Ensuring that data is protected, while at the same time being efficiently accessible, is a major data management challenge being faced today.

Posted in auditing, availability, compliance, data, data availability, data breach, data governance, DBA

The Impact of Row Size on Query Performance

As part of my job as a database consultant, I regularly keep my eyes open for interesting topics in relevant blogs and forums. Recently, one interaction caught my eye, where somebody asked what kind of a performance impact could be expected if a query was issued against two similar tables. The first table had (say) 20 columns, and the second table had the same 20 columns, as well as 35 additional columns.

There were a significant number of folks providing input on the topic, but the overwhelming consensus was that as long as the query was going against the same columns then performance should be about the same.

I disagree. Here is why.

You also need to factor in the I/O requests that are required to return the data. Most DBMSes perform I/O operations at the block (or page) level. And this is the case regardless of whether you want to return one row or millions of rows.

For multi-row results, accessing data from the table with the wider row (more columns) will usually take longer than accessing from a narrower row (fewer columns). This is the case because the table with narrower rows will have more rows per page/block than the table with wider rows.

Just a quick calculation should help you to understand what I mean. Assume that the narrower row (say with 100 columns) has an overall row length of 600 bytes. If the block size of the table/tablespace is 4K, then about 6 rows can be stored in each page.

Now assume that the wider row (say with 120 columns) has an overall row length of 900 bytes. Again, with a block size of 4K, this time we can store only 4 rows per page.

So what, you may ask? In each case I only want to return the same 4 columns. Who cares what else those columns are stored with?

Well, remember that the DBMS stores data in blocks/pages and also reads at the block/page level. This is the case no matter how many columns you want to access. So, let’s consider the worst-case scenario: a table scan with no indexes. There will be more pages to read when the rows are wider! Consider the case where there are just 1000 total rows:

  • The wider table – at 4 rows per page – will require at least 1000/4 or 250 pages
  • The narrower table – at 6 rows per page – will require at least 1000/6 or 167 pages

So, with a table scan the wider table requires 250 I/O operations, whereas the narrower table requires only 167 I/O operations – or 83 fewer I/Os! And that is the case regardless of the number of columns being returned.
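
The arithmetic is easy to reproduce. This small sketch uses the example numbers above and ignores page overhead, free space, and compression, so treat it as an approximation rather than a sizing tool.

    import math

    PAGE_SIZE = 4 * 1024   # 4K pages/blocks

    def pages_for_scan(row_length: int, total_rows: int) -> int:
        """Minimum pages a table scan must read, ignoring page overhead and free space."""
        rows_per_page = PAGE_SIZE // row_length
        return math.ceil(total_rows / rows_per_page)

    narrow = pages_for_scan(row_length=600, total_rows=1000)   # 6 rows/page -> 167 pages
    wide   = pages_for_scan(row_length=900, total_rows=1000)   # 4 rows/page -> 250 pages

    print(narrow, wide, wide - narrow)   # 167 250 83 fewer I/Os for the narrower table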

Now which one of these do you think will be more efficient?

Of course, this is just looking at things at a macro level, and we all know that there are many nuances and variations that can impact performance. Nevertheless, speaking in generalities, the fewer rows per page, the less efficient queries against that table will be.

Think about it this way. Is it faster to pull smaller peaches out of a basket than bigger peaches? That is about the same type of question and anybody can envision the process. Say you want 100 peaches. Small peaches fit 25 per basket; big peaches fit ten per basket. To get 100 small peaches you’d need to pull 4 baskets from the shelf. To get 100 big peaches you’d need to pull 10 baskets from the shelf. The second task will clearly take more time.

Of course, the exact performance difference is difficult to calculate – especially without knowledge of the specific DBMS being used, information about in-memory tactics and buffer pools, and the exact type of data and queries being used. But there will, more than likely, be a performance effect on queries when you add columns to a table.

Posted in database design, performance, SQL