Organizations trying to develop a “data warehouse” with Apache Hive, Google BigQuery, or Amazon Redshift may be surprised to discover that these products don’t contain standard data warehouse features.
What do you call a database without foreign keys? What do you call a data warehouse without foreign key relationships or indexing? While both these questions should be the setup to jokes, they aren’t.
Since around 2011, when server and cloud hosting companies (along with their consulting partners) began popularizing the term “big data“, there has been a very deliberate campaign intended to convince companies to retire traditional databases and replace them with something else. What that “something else” is depends on who is selling, but the hype is unavoidable, and has been for years:
“Everyone is switching to Hadoop” ; “Big data is the future; and ‘legacy’ databases can’t do big data” ; “You can do everything in Hive/Spark/Hadoop you can do with traditional databases!”
While the future is here, and “big data” marketing is everywhere, a look into solutions that specifically claim to be a “data warehouse” show that these offerings are not technically data warehouses but something quite different. And their lack of standard data warehouse features may come as a shock to experienced tech professionals who expect data solutions to support features previously taken for granted.
Part 1 – Relational Databases: The Full Monty
Before discussing Hive—a database-like structure that is built on the distributed data lake service Hadoop—it’s worth being upfront about relational databases.
Relational databases likes PostgreSQL, MySQL, SQL Server, and Oracle DB can handle massive amounts of records efficiency, quickly, and without compromising on data integrity. Data structures like this have handled production data successfully for decades and offer some of the most reliable and historically dependable data management solutions out there. However, they do require a few things.
Specifically, relational databases requires:
- Developers who take one or two rigorous college-level courses on relational databases.
- Developers who put deep thought into their application designs.
- Documentation that is maintained and periodically updated, as appropriate.
- Staff to maintain the database, periodically test and update the database software.
- Training programs that educate new team members on the database’s structure and design.
What you get for this time and effort is unambiguous:
- For your day-to-day business operations you get a blazing-fast, logically structured, and high-integrity database solution.
- For your analytics activities you get fast, intelligently-designed, and high-integrity datasets.
The speed, integrity, and value of relational databases comes mostly thanks to their use of standard features like foreign keys, indexing, and views; all which ensure that query results are fast while also being intelligently structured for speed and access (and easy to centrally maintain).
Relational data warehouse solutions allow you to easily pull data that is tailor-designed for reporting and statistical analysis. This allows you to easily access data for analysis and data visualizations; visualization like the one below, which shows some hierarchical relationships in US employment statistics:
(Note: there is an enhanced version of this animation at the end of this article.)
In Praise of the Foreign Key
Relational databases are difficult to properly maintain without foreign keys. So are star-schema data warehouses, which use foreign key relationships to ensure consistency across datamarts while reducing data redundancy. The best-designed data warehouses give your data analysis teams all the data they need in a reliable and well-processed format, and they allow for easy tracking of data changes across time.
Foreign key relationships are easily one of the most useful data structure of the modern era—the “sliced bread” of data models, if you will—and most databases would be unmanageable without them. Most of the web is built on relational databases and use data models that depend on foreign key relationships; virtually all major corporate databases use relational databases to power essential business functions. And unlike in the olden days of mainframe computers, modern database solutions like PostgreSQL are easy to develop on, easy to modify, easy to customize, and can be optimized for data of all types and forms.
Why isn’t your data currently in a relationship?
Given how efficient, secure, and scalable relational databases are it’s always worth asking: why wouldn’t you want to use a relational database in all cases? Turns out there are sometimes good reasons not to be in a relationship.
One reason is that you could work for a search engine/ad company where much or all of you your operational data comes from logs files. Or maybe you are in a business where high levels of data accuracy and integrity may not be especially critical to your day-to-day business operations, or you need an environment where data analysts can mess around and structure their data as needed for ad hocs.
While you will want a relational database for your employee time and pay records, other business data may require less structures. If, for example, your business model involves scraping and storing tens of millions of web pages, caching them, and then periodically purging them, a relational database to store this content may have too much overhead. If your data tables are all independent from one another, and there is no need (or no business value) in trying to connect complex relationships between them, this may be another situation where a non-relational data storage could work for you. Some data really is better off single, or at least doesn’t mind not being in a relationship.
Still, these cases are not especially common. And if you are not currently working with relational data in your organization it’s most likely because—for whatever reason—a decision was made in previous years to jump on the Hadoop bandwagon. You may now uses Spark for your data processing, possibly in combination with other trendy products like Beam or Kafka. While there is nothing inherently wrong with this, and solutions like this can work for certain kind of operational data, it can also cause some complications.
If want to build a data model that combines, slices, and dices your data while ensuring data integrity, you’ll most likely still need a relational database solution. If you want your data model to enforce certain data relationships, and you want to leverage those data relationships to gain business insights with minimal error, you’ll need some kind of a relational database solution. And even if you store much of your operational data in hadoop-like data lakes, odds are good that for business reporting and internal compliance you’ll likely want to move at least some of your data into optimized relational databases or a (relational) data warehouse to meet your organization’s data needs.
Part 2 – Defining a “Data Warehouse”
While a rose by any other name may smell a sweet, if you call cabbage “a rose” it will still smell like cabbage. This is also true for tech branding.
While it’s not widely appreciated, much of the names and buzzwords used by the tech sales community (“quantum”, “AI”, “analytics”, “data science”, “robust”) have no established definition and many sales people use them however they want to. While we may believe these terms have clear meanings, some in tech sales or marketing folks like to pretend that words mean whatever they want them to mean at the time they use them. The result of this culture is that certain entities are calling their data product a “data warehouse” if they think it will make the sale. Google does it for BigQuery. Amazon does it for Redshift. Cloudera/Hortonworks does it too.
These products however—despite branding themselves “data warehouses”—are not data warehouses in the technical and traditional sense. Instead, what they are delivering is really just a modified first normal form database: effectively a collection of disconnected flat files with some datatype constrains added on table fields.
These mock “data warehouse” solutions do not offer built-in support for schema-enforced foreign key relationship; they don’t support foreign key relationships at all. That makes these products look less like a database or data warehouse solution, and more like a kind of enhanced spreadsheet solution.
By comparison here are some key features of real Data Warehouses:
- A Data Warehouse uses a relational data structure, optimized for reporting and data analysis, that allows easy access to key data.
- A Data Warehouse solution supports dimensional modeling, allowing you to construct elegant data models like the star-schema or OLAP cube.
- A Data Warehouse captures key data changes, and stores data in a format that makes historical data easily and quickly accessible.
- A Data Warehouse allows you to link data with integrity using well-defined foreign key relationships between compartmentalized/modular datamarts.
Until recently, few could imagine a data warehouse solution that didn’t support schema-enforced foreign key relationships between data tables. That’s not because of a failure of imagination either: it’s because as recently as 5 years ago calling something like this a “data warehouse” would have violated professional norms and would have been widely mocked in expert communities.
And so we should be honest: the products and services (Hive, BigQuery, Redshift, etc.) are not actually data warehouses. Instead, they are something else. But if none of these are “data warehouse” solutions, what should we call them instead?
As Data Warehouses solutions have sometimes been called “Business Intelligence” solutions or BI, maybe a name for these non-warehouse solutions should follow in the footsteps of the branding of NoSQL, and we should call these data solution NoBI (or “no buy”) solutions?
The simple truth is you can’t effectively implement a data warehouses star-schema or other sophisticated data structures (such as the legendary OLAP Cube) with these non-warehouse solutions: you can’t easily build fact or dimension-based datamarts with them, and you can’t efficiently or securely joins tables across your data warehouse with them. Effective data warehouse solutions require foreign keys. And NoBI services—or whatever you call them—do not support schema-enforced data relationships via foreign keys: they don’t support foreign key relationship; period.
These data services do not allow table-to-table relationships to be enforced through the data model itself: there are no built-in “cascade on delete” options; no indexing. If you want that functionality you have to onerously hard-code it into an application yourself: your data model for these NoBI solutions will not enforce it.
The value of relational data models it that data is rarely disconnected from the data that surrounds it. By dumping all the data in a “lake” with poor documentation or without concern to how it may be used, data lakes effectively require an organization’s analysts to spend their time wading through mounds of low-value data in search of hidden value, instead of giving them prepared data that allows them to spend their time focusing on things that will quickly provided keen insights.
“Big Data” Solutions Optimized for Inefficiency
These “alt-warehouse” solutions, in abandoning traditional database structures and standard features, also effectively adopt a data model that almost appears designed to maximize the amount of disk space used, and maximize the amount of computational power required to get the same results at the same speed as traditional relational databases and data warehouses.
It looks oddly like these products were developed by people renting server space and server resources. And—surprise—they were.
One of the reasons why Data Warehouses were preferred is that they usually try to minimize data redundancy while making the data itself as human-readable as possible. Traditional data warehouses prefer “second normal form” over “third normal form” relational structures so that individual data rows are easier to read and faster to query for reporting purpose. Successful data warehouses are partially de-nomaralized, but they do keep some level of data normalization (in the database sense) to prevent, say, having to store a user’s address appear on every one of the 100,000,000+ records in an event table.
However, it’s absolutely no surprise that the companies selling cloud services are pushing data solutions that don’t readily support even first-normal form data models: even first-normal form data models require foreign keys. While you can try and create foreign-key like entities using these non-database solution, you have to program the functionality yourself.
Since cloud computing companies almost all make money based on how much space and processing power their client use, it should be no surprise that their primary “data warehouse” offerings don’t support data models that would limit the amount of space you have to use. And products like BigQuery almost look like they were optimized to be as inefficient as possible without looking patently ridiculous.
This, to me, looks like the equivalent of a gas company selling you a heating/cooling system for your home: you can safely bet whatever they sell you is going to be as inefficient as they think they can get away with. After all, why would a company that profits off of inefficiency sell you something efficient?
The irony that most cloud computing companies are attempting to brand the slowest and most inefficient data models as the future, while attempting to dismiss fast and efficient data tools as “legacy”, should not be lost on anyone.
A Gateway Drug to “AI”
In addition to maximizing the resources required for search and storage, these non-warehouse/NoBI solution to “big data” have a not-no-secret agenda: by requiring clients to structure their data in this awkward and inefficient manner, they hope to hook business into paying for enormously expensive AI/ML services.
And while what I call “NoBI” solutions (such as BigQuery and Redshift) are not always helpful for storing data in a form readable by humans, it is in a form optimized for training AI data models. Because while non-relational data is awkward and painful for a human to make sense out of, it’s what AI training algorithms prefer.
However, as of early 2021 AI/ML is often an inferior solution compared to traditional low-cost data analysis techniques. “Modern” AI services are expensive to train, widely inaccurate, require near-perfect data, and they replace statistically-informed doubt and confidence intervals with false absolute certainty. While AI appears to be most successful at making deepfakes and playing board games, behind the scenes the real-world AI failures keep piling up.
IBM Health, for example—built on the enormously hyped Watson AI—appears destined to go down in history as one of the biggest tech failures of our era. Because despite the manufactured buzz around the product, the service reportedly gave doctors inaccurate or dangerous recommendations which could have killed patients if they were followed. (And that was if hospitals could get it to work at all.)
On the other end of the divide, Google’s much-hyped AI-driven car service Waymo shows signs of being beset by technical problems and glitches, and the division’s CEO left in early 2021 after years of exaggerated claims. Amazon’s recent venture into AI for internal business uses—such as hiring screenings—reportedly caused the bot to prefer white men over women and made hiring recommendations seemingly by random. Microsoft’s much-buzzed AI Twitter bot was shut down hours after its debut when it started spewing hate speech.
Finding problems with modern AI tech and algorithms is also so easy that you can usually find deal-breaking issues within minutes of preliminary testing. None of these things are signs of a “mature” technology. This tech is completely experimental, and—despite slick marketing campaigns intended to make businesses think otherwise—largely unproven in its usefulness. (AI products are also commonly released using open source licenses, or delivered under licensing terms that offer little to no warranty.)
It’s quite the risk for an organization to adopt a long-term data storage solution that do not support previously standard data modeling features, such as foreign keys. And while abandoning classic data structures in order to ensure to be “ready for AI” may sound forward-thinking it can counter-intuitively make an organization’s data much less accessible or useful to key decision-makers.
If an organization is attempting to modernize their data architecture, the time and money may be best spent making data usable and maintainable for humans, not machines. The “smart” machines of today look more daft than deft, and chances are very good this will continue to be the case for years to come.
Conclusion – Should you use Hive (or its peers) for “Data Warehousing”?
Every organization is different, but there should be a high bar to pass before abandoning traditional relational database solutions. Apache Hive and its comparable products are, in most cases, an inferior substitute to relational databases due to their lack of foreign keys support and associated data-integrity features. The lack of built-in support of foreign key-based relational data models make them ill-suited for scehma-enforced data integrity, data analysis, and human modeling, and they are a clear no-go for those trying to implement a genuine relational data warehouse.
While I use relational data warehouse-esque models in my own personal projects and have a high opinion of these solutions, they are not right for every case. Still, one of the chief advantages of using star schema data warehouse is that it encourages you think deeply about data and how it will be used, and also gives you a clear game plan for cleaning and linking data.
The end goal of any data warehouse project is that you can easy store and quickly access hierarchical data with relatively little cost spent on servers and systems. The payoff for this initial investment is that your end results allow you to some very cool things with your data; things that both impress and offer the promise of intelligent interpretation and insight:
[Built with base R, using Data Warehouse ETL methods and techniques.]
Kimball, R. and Ross, M., The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence, 2016.