The hidden costs of NoSQL
Posted on: Jan 06 - 2016
NoSQL is a powerful data model, but perhaps not enough to justify many independent datastores.
This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter’s approach.
The NoSQL industry developed quickly on the promise of schema-free design, infinitely scalable clusters and breakthrough performance. But there are hidden costs: the added complexity of an endless choice of datastores (now numbering 225), the pain of analytics without SQL, high query latencies that force you to pre-compute results, and inefficient use of hardware that leads to server sprawl.
All of these costs add up to a picture far less rosy than initially presented. However, the data model for NoSQL does make sense for certain workloads, across key-value and document data types. Fortunately, those are now incorporated into multi-mode and multi-model databases representing a simplified and consolidated approach to data management.
Let’s take a closer look at the impetus for the NoSQL movement and the true impact of abandoning SQL.
Dawn and decline of the NoSQL movement
The popularity of NoSQL grew from the need to scale beyond what traditional disk-based relational databases could handle, and because high-performance solutions from large database companies get very expensive very quickly. As data volumes grew, developers also needed a better way to handle the increasing use of simple data structures, such as the user and profile information associated with mobile applications. NoSQL promised an easy path to performance.
Another explanation for NoSQL's popularity comes from the perception that SQL can be hard to learn. But Michael Stahnke, director of engineering at Puppet Labs, claims that is an early and invalid argument, noting that “instead you must learn one query language for each tool you use.”
A few things changed in recent years that have led to the assimilation of NoSQL into the broader database market.
First, in-memory architectures have proven that you can have performance and SQL together, addressing part of the reason for ditching SQL initially.
Second, most NoSQL datastores begin with a limited language for key/value workloads, and then attempt more SQL-like constructs or even try to recreate SQL itself. Starting with SQL means you incorporate core architectural features like multi-version concurrency control (MVCC) or indexes, both critical for real-time analytics on changing data sets.
Finally, relational database vendors have recognized the value of multiple data models by incorporating them into a comprehensive offering.
Perhaps the fading of NoSQL is best summarized by leading analyst firm Gartner: “By 2017, the ‘NoSQL’ label will cease to distinguish DBMSs, which will reduce its value and result in it falling out of use” (as quoted in Dataversity).
The value of SQL
Ironically, following the NoSQL hype, SQL-as-a-layer has become immediately valuable to companies and datastores alike. Witness the SQL-as-a-layer efforts to rescue data from Hadoop with projects like Impala (Cloudera), Drill (MapR), and Hive (Hortonworks), as well as solutions like Presto, developed at Facebook.
And processing frameworks like Spark, with its popular Spark SQL functions, have proven to be a saving grace for document and key-value datastores that left SQL back on the cutting room floor.
Meanwhile in-memory, distributed systems enable the relational model to remain intact, achieve groundbreaking performance and scale for modern workloads, and incorporate NoSQL data types like JSON.
Long live multi-model databases
Of course the death of the NoSQL label does not mean death of the NoSQL model. Rather it points to the use of multiple data models within a single database. This was recently outlined in a webcast by Matt Aslett, research director of Data Platforms and Analytics at 451 Research, on the Internet of Things and Multi-model Data Infrastructure, in which he states:
- The database market has been dominated for 40 years by the relational database model (and SQL) – typically with separate databases for operational and analytics workloads.
- Emerging databases take advantage of in-memory and advanced processing performance to deliver combined operational and analytic processing.
- Polyglot persistence drove the expansion of the database market with NoSQL – specialist databases for specialist purposes and multiple data models.
- The use of multiple databases to support an individual application can lead to operational complexity and inflexibility driven by interdependence.
- Multi-model enables the flexibility of polyglot persistence without the operational complexity by supporting multiple data models.
The presentation showcases how multi-model, multi-mode databases support a combination of the SQL and NoSQL data models, especially JSON and key-value, as well as other workloads.
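To make the multi-model idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for a multi-model database. It is not taken from the webcast. The table, the sample rows and the helper names (get_profile, devices_by_plan) are hypothetical, and the sketch assumes the bundled SQLite build includes the JSON functions, as most current builds do. One table serves a key-value style lookup, stores a JSON document per row, and still answers an aggregate SQL query.

```python
import json
import sqlite3

# In-memory SQLite database standing in for a multi-model store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, profile TEXT)"
)  # profile holds a JSON document as text

# Hypothetical sample data, for illustration only.
conn.executemany(
    "INSERT INTO users (id, name, profile) VALUES (?, ?, ?)",
    [
        (1, "ana", '{"plan": "pro", "devices": 3}'),
        (2, "ben", '{"plan": "free", "devices": 1}'),
        (3, "cho", '{"plan": "pro", "devices": 2}'),
    ],
)


def get_profile(user_id):
    """Key-value style access: fetch one document by its primary key."""
    row = conn.execute(
        "SELECT profile FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def devices_by_plan():
    """Relational access: aggregate over fields inside the JSON documents."""
    rows = conn.execute(
        "SELECT json_extract(profile, '$.plan') AS plan_name, "
        "       SUM(json_extract(profile, '$.devices')) "
        "FROM users GROUP BY plan_name ORDER BY plan_name"
    ).fetchall()
    return dict(rows)


print(get_profile(1))
print(devices_by_plan())
```

The same rows answer both the single-key lookup and the grouped aggregate, so the application does not need a second datastore or a second query language for the analytics side.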
Calculating the hidden costs
So while NoSQL promised scale and performance at lower costs, NoSQL deployments can actually be far costlier than initially imagined. Let’s look at a few hidden cost areas.
* Added complexity. As referenced by Aslett of 451 Research, “use of multiple databases to support an individual application can lead to operational complexity.”
Every new datastore adds to the financial and operational burden of the data team. Having to support more databases that only fill a niche workload adds cost.
* Lack of analytics. By abandoning the relational algebra implicit in SQL, NoSQL stores have an uphill battle when it comes to analytics. Many NoSQL stores implemented SQL-like query layers such as the Cassandra Query Language (CQL) or N1QL for Couchbase. These provide some analytical functionality but they are not the same as ANSI SQL and they disqualify these datastores from natively connecting with the enterprise tools that use SQL. This bifurcation can weigh negatively on an enterprise trying to design around open standards like SQL.
* Query latency. Complex analytics can be challenging for NoSQL datastores, so many companies are forced to pre-compute results. Tapjoy found this to be the case with HBase and outlined its challenges at the In-Memory Computing Conference in San Francisco during its Hitchhiker’s Guide to Building a Data Science Platform presentation. This batch processing workflow introduces system latency and reduces the business value of data. Worse, a batch-oriented workflow means the results are inherently out of date, which rules out real-time analytics.
* Hardware sprawl. While scale, and in particular the number of nodes in a cluster, can be a badge of honor, the goal is not how many nodes can be deployed, but rather how few. Even more important is the efficiency of transactions for each node. When NoSQL solutions need to be coupled with additional SQL layers, or pre-computing must be completed before queries can be run, it adds to hardware sprawl and costs.
* Preserve the model, consolidate workloads. There are other options, recently referred to by Gartner as the “avant-garde” of relational databases, that combine the relational properties of SQL with the performance needed to scale, frequently through the use of in-memory technologies. Many of these avant-garde databases also incorporate capabilities like JSON to provide data models for structured and semi-structured data.
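The query latency point above can be illustrated with a small sketch. It uses Python's standard-library sqlite3 module purely as a stand-in and does not reproduce Tapjoy's pipeline or any specific datastore. The events table, the sample values and the function names are hypothetical. A batch job scans the data and pre-computes totals in application code, and a later write leaves that snapshot stale while a SQL query over the live data stays current.

```python
import sqlite3

# In-memory database with hypothetical event data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (app TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("game", 5), ("game", 7), ("chat", 2)],
)


def run_batch_job():
    """Stand-in for a pre-compute pipeline: scan every row, total in app code."""
    totals = {}
    for app, revenue in conn.execute("SELECT app, revenue FROM events"):
        totals[app] = totals.get(app, 0) + revenue
    return totals


def live_revenue():
    """On-demand SQL aggregate over the current contents of the table."""
    rows = conn.execute(
        "SELECT app, SUM(revenue) FROM events GROUP BY app ORDER BY app"
    ).fetchall()
    return dict(rows)


# The batch job runs once and its output is cached until the next run.
precomputed = run_batch_job()

# A new event arrives after the batch job has finished.
conn.execute("INSERT INTO events VALUES ('chat', 10)")

fresh = live_revenue()
print("precomputed:", precomputed)
print("live query: ", fresh)
```

Until the batch job runs again, anything reading the pre-computed totals misses the new chat revenue, which is the out-of-date result the bullet describes.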
Today customers are discovering that the cost of what appeared to be a novel, lower-cost NoSQL solution is actually much higher than initially thought. Fortunately, those challenges can be solved with a database that provides the performance needed and the ability to perform comprehensive SQL analytics, all in a single solution.
Many big data industry participants have noted that a revolution is underway in the way companies capture and process data. But perhaps the climate is best summarized by Gwen Shapira, a prominent spokesperson on big data