Readings in Database Systems

Categories:

Recommended

Chapter 1: Background Introduced by Michael Stonebraker

I am amazed that these two papers were written a mere decade ago! My amazement about the anatomy paper is that the details have changed a lot just a few years later. My amazement about the data model paper is that nobody ever seems to learn anything from history. Lets talk about the data model paper first.

A decade ago, the buzz was all XML. Vendors were intent on adding XML to their relational engines. Industry analysts (and more than a few researchers) were touting XML as “the next big thing”. A decade later it is a niche product, and the field has moved on. In my opinion, (as predicted in the paper) it succumbed to a combination of:

• excessive complexity (which nobody could understand)

• complex extensions of relational engines, which did not seem to perform all that well and

• no compelling use case where it was wildly accepted

It is a bit ironic that a prediction was made in the paper that X would win the Turing Award by success- fully simplifying XML. That prediction turned out to be totally wrong! The net-net was that relational won and XML lost.

Of course, that has not stopped “newbies” from rein- venting the wheel. Now it is JSON, which can be viewed in one of three ways:

• A general purpose hierarchical data format. Anybody who thinks this is a good idea should read the section of the data model paper on IMS.

• A representation for sparse data. Consider attributes about an employee, and suppose we wish to record hobbies data. For each hobby, the data

we record will be different and hobbies are fundamentally sparse. This is straightforward to model in a relational DBMS but it leads to very wide, very sparse tables. This is disasterous for disk- based row stores but works fine in column stores. In the former case, JSON is a reasonable encoding format for the “hobbies” column, and several RDBMSs have recently added support for a JSON data type.

• As a mechanism for “schema on read”. In effect, the schema is very wide and very sparse, and essentially all users will want some projection of this schema. When reading from a wide, sparse schema, a user can say what he wants to see at run time. Conceptually, this is nothing but a projection operation. Hence, ’schema on read” is just a relational operation on JSON-encoded data.

In summary, JSON is a reasonable choice for sparse data. In this context, I expect it to have a fair amount of “legs”. On the other hand, it is a disaster in the making as a general hierarchical data format. I fully expect RDBMSs to subsume JSON as merely a data type (among many) in their systems. In other words, it is a reasonable way to encode spare relational data.

No doubt the next version of the Red Book will trash some new hierarchical format invented by people who stand on the toes of their predecessors, not on their shoulders.

The other data model generating a lot of buzz in the last decade is Map-Reduce, which was purpose-built by Google to support their web crawl data base. A few years later, Google stopped using Map-Reduce for that application, moving instead to Big Table. Now, the rest of the world is seeing what Google figured out earlier; Map-Reduce is not an architecture with any broad scale applicability. Instead the Map-Reduce market has morphed into an HDFS market, and seems poised to become a relational SQL market. For example, Cloudera has recently introduced Impala, which is a SQL engine, built on top of HDFS, not using Map-Reduce.

More recently, there has been another thrust in HDFS land which merit discussion, namely “data lakes”. A reasonable use of an HDFS cluster (which by now most enterprises have invested in and want to find something useful for them to do) is as a queue of data files which have been ingested. Over time, the enterprise will figure out which ones are worth spending the effort to clean up (data curation; covered in Chapter 12 of this book). Hence, the data lake is just a “junk drawer” for files in the meantime. Also, we will have more to say about HDFS, Spark and Hadoop in Chapter 5.

In summary, in the last decade nobody seems to have heeded the lessons in “comes around”. New data models have been invented, only to morph into SQL on tables. Hierarchical structures have been reinvented with failure as the predicted result. I would not be surprised to see the next decade to be more of the same. People seemed doomed to reinvent the wheel!

Category:
Tag:

Attribution

Peter Bailis, Joseph M. Hellerstein, and Michael Stonebraker (2015), Readings in Database Systems, URL: http://www.redbook.io/

This work is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license  (https://creativecommons.org/licenses/by-nc-sa/4.0/).

VP Flipbook Maker

Display your work as digital flipbook and share with others! We can convert our work or create customize flipbook by VP Online Flipbook Maker. Try it now!