“The single most important lesson I’ve learned is to keep it simple. I find designers sometimes deliver over-complex, generic solutions that could (in theory) do anything, but in reality are remarkably difficult to operate, and often misunderstood.”–John Ryan
I have interviewed John Ryan, Data Warehouse Solution Architect (Director) at UBS.
Q1. You are an experienced Data Warehouse architect, designer and developer. What are the main lessons you have learned in your career?
John Ryan: The single most important lesson I’ve learned is to keep it simple. I find designers sometimes deliver over-complex, generic solutions that could (in theory) do anything, but in reality are remarkably difficult to operate, and often misunderstood. I believe this stems from a lack of understanding of the requirement – the second most important lesson.
Everyone from the senior stakeholders to architects, designers and developers need to fully understand the goal. Not the solution, but the “problem we’re trying to solve”. End users never ask for what they need (the requirement), but instead, they describe a potential solution. IT professionals are by nature delivery focused, and, get frustrated when it appears “the user changed their mind”. I find the user seldom changes their mind. In reality, the requirement was never fully understood.
To summarise. Focus on the problem not the solution. Then (once understood), suggest a dozen solutions and pick the best one. But keep it simple.
Q2. How has the Database industry changed in the last 20 years?
John Ryan: On the surface, not a lot. As recent as 2016 Gartner estimated Oracle, Microsoft and IBM accounted for 80% of the commercial database market, but that hides an underlying trend that’s disrupting this $50 billion industry.
Around the year 2000, the primary options were Oracle, DB2 or SQL Server with data warehouse appliances from Teradata and Netezza. Fast forward to today, and the database engine rankings include over 300 databases of which 50% are open source, and there are over 11 categories including Graph, Wide column, Key-value and Document stores; each suited to a different use-case.
While relational databases are still popular, 4 of the top 10 most popular solutions are non-relational (classed as NoSQL), including MongoDB, Redis and Cassandra. Cross-reference this against a of survey of the most highly sought skills, and we find MongoDB, Redis and Cassandra again in the top ranking, with nearly 40% of respondents seeking MongoDB skills compared to just 12% seeking Oracle expertise.
Likewise open source databases make up 60% of the top ten ranking, with open source database MySQL in second place behind Oracle in the rankings, and Gartner states that “By 2018, more than 70% of new in-house applications will be developed on an [Open Source] DBMS”.
The move towards cloud computing and database-as-a-service is causing further disruption in the data warehouse space with cloud and hybrid challengers including Vertica, Amazon Redshift and Snowflake.
In conclusion, the commercial relational vendors currently dominate the market in revenue terms. However, there has been a remarkable growth of open source alternatives, and huge demand for solutions to handle high velocity unstructured and semi-structured data. These use-cases including social media, and the Internet of Things are ill-suited to the legacy structured databases provided by Oracle, DB2 and SQL Server, and this void has been largely filled by open source NoSQL and NewSQL databases.
Q3. RDBMS vs. NoSQL vs. NewSQL: How do you compare Database Technologies?
John Ryan: The traditional RDBMS solutions from Oracle, IBM and Microsoft implement the relational model on a 1970s hardware architecture, and typically provide a good general purpose database platform which can be applied to most OLTP and Data Warehouse use cases.
However, as Dr. Michael Stonebraker indicated in this 2007 paper, The End of an Architectural Era (It’s Time for a Complete Rewrite), these are no longer fit for purpose, as both the hardware technology, and processing demands have moved on. In particular, the need for real time (millisecond) performance, greater scalability to handle web-scale applications, and the need to handle unstructured and semi-structured data.
The Key-Value NoSQL databases were developed to handle the massive transaction volume, and low latency needed to handle web commerce at Amazon and LinkedIn. Others (eg. MongoDB) where developed to handle semi-structured data, while still others (eg. Neo4J) were built to efficiently model data networks of the type found at Facebook or LinkedIn.
The common thread with NoSQL databases is they tend to use an API interface rather than industry standard SQL, although increasingly that’s changing. They do however, entirely reject the relational model and ACID compliance. They typically don’t support cross-table join operations, and are entirely focused on low latency, trading consistency for scalability.
The so-called NewSQL databases include VoltDB , MemSQL and CockroachDB are a return to the relational model, but re-architected for modern hardware and web scale use cases. Like NoSQL solutions, they tend to run on a shared nothing architecture, and scale to millions of transactions per second, but they also have full transaction support and ACID compliance that are critical for financial operations.
Q4. What are the typical trade-off of performance and consistency when using NoSQL and NewSQL databases to support high velocity OLTP and real time analytics?
John Ryan: The shared nothing architecture is built to support horizontal scalability, and when combined with data replication, can provide high availability and good performance. If one node in the cluster fails, the system continues, as the data is available on other nodes. The NoSQL database is built upon this architecture, and to maximize throughput, ACID compliance is relaxed in favor of Eventual Consistency, and in some cases (eg. Cassandra), it supports tunable consistency, allowing the developer to trade performance for consistency, and durability.
For example, after a write operation, the change cannot be considered durable (the D in ACID) until the change is replicated to at least one, ideally two other nodes , but this would increase latency, and reduce performance. It’s possible however, to relax this constraint, and return immediately, with the risk the change may be lost if the node crashes before the data is successfully replicated. This becomes even more of a potential issue if the node is temporarily disconnected from the network, but is allowed to continue accepting transactions until the connection is restored. In practice, consistency will be eventually be achieved when the connection is reestablished – hence the term Eventual Consistency.
A NewSQL database on the other hand accepts no such compromise, and some databases (eg. VoltDB), even support full serializability, executing transactions as if they were executed serially. Impressively, they manage this impressive feat at a rate of millions of transactions per second, potentially on commodity hardware.
Q5. One of the main challenges for real time systems architects is the potentially massive throughput required which could exceed a million transactions per second. How do handle such a challenge?
John Ryan: The short answer is – with care! The longer answer is described in my article, Big Data – Velocity. I’d break the problem into three components, Data Ingestion, Transformation and Storage.
Data ingestion requires message based middleware (eg. Apache Kafka), with a range of adapters and interfaces, and the ability to smooth out the potentially massive spikes in velocity, with the ability to stream data to multiple targets.
Transformation, typically requires an in-memory data streaming solution to restructure and transform data in near-real time. Options include Spark Streaming, Storm or Flink.
Storage and Analytics is sometimes handled by a NoSQL database, but for application simplicity (avoiding the need to implement transactions or handle eventual consistency problems in the application), I’d recommend a NewSQL database.
All the low-latency, high throughput of the NoSQL solutions, but with the flexibility and ease of a full relational database, and full SQL support.
In conclusion, the solution needs to abandon the traditional batch oriented solution in favour of an always-on streaming solution with all processing in memory.
Q6. Michael Stonebraker introduced the so called “One Size no longer fits all”-concept. Has this concept come true on the database market?
John Ryan: First stated in the paper One Size Fits All – An Idea Whose Time Has Come And Gone Dr. Michael Stonebraker argued that the legacy RDBMS dominance was at an end, and would be replaced by specialized database technology including stream processing, OLTP and Data Warehouse solutions.
Certainly disruption in the Data Warehouse database market has been accelerated with the move towards the cloud, and as this Gigaom Report illustrates, there are at least nine major players in the market, with new specialized tools including Google Big Query, Amazon Redshift and Snowflake, and the column store (in memory or on secondary storage) dominates.
Finally, the explosion of specialized NoSQL and NewSQL databases, each with its own specialty including Key-Value, Document Stores, Text Search and Graph databases lend credence to the statement “One Size no longer fits all”.
I do think however, we’re still in a transformation stage, and the shake-out is not yet complete. I think a lot of large corporations (especially Financial Services) are wary of change, but it’s already happening.
I think the quote Stewart Brand is appropriate: “Once a new technology rolls over you, if you’re not part of the steamroller, your part of the road”.
Q7. Eventual consistency vs ACID. A pragmatic approach or a step too far?
John Ryan: As with so many decisions in IT, it depends. Eventual Consistency was built into the Amazon Dynamo database as a pragmatic decision, because it’s difficult to maintain high throughput and availability at scale. Amazon accepted the relatively minor risk of inconsistencies because of the huge benefits including scalability and availability. In many other web scale applications (eg. Social media), the implications of temporary inconsistency are not important, and it’s therefore an acceptable approach.
Having said that, it does place a huge addition burden on the developer to code for relatively rare and unexpected conditions, and I’d question why would anyone settle for a database which supported eventual consistency, when full ACID compliance and transaction handling is available.
Q8. What is your take on the Lambda Architecture ?
John Ryan: The Lambda Architecture is a an approach to handle massive data volumes, and provide real time results by running two parallel solutions, a Batch Processing and a real time Speed Processing stream.
As a Data Warehouse and ETL designer, I’m aware of the complexity involved in data transformation, and I was immediately concerned about an approach which involved duplicating the logic, probably in two different technologies.
I’d also question the sense of repeatedly executing batch processing on massive data volumes when processing is moving to the cloud, on a pay for what you use basis.
I’ve since found this article on Lambda by Jay Kreps of LinkedIn and more recently Confluent (who developed Apache Kafka), and he describes the challenges from a position of real (and quite painful) experience.
The article recommends an alternative approach, a single Speed Processing stream with slight changes to allow data to be reprocessed (the primary advantage of the Lambda architecture). This solution, named the Kappa Architecture, is based upon the ability of Kafka to retain and re-play data, and it seems an entirely sensible and more logical approach.
Qx Anything else you wish to add?
John Ryan: Thank you for this opportunity. It may be a Chinese curse that “may you live in interesting times”, but I think it’s a fascinating time to be working in the database industry.
John Ryan, Data Warehouse Solution Architect (Director), UBS.
John has over 30 years experience in the IT industry, and has maintained a keen interest in database technology since his University days in the 1980s when he invented a B-Tree index, only to find Oracle (and several others) had got there already.
He’s worked as a Data Warehouse Architect and Designer for the past 20 years in a range of industries including Mobile Communications, Energy and Financial Services. He’s regularly found writing articles on Big Data and database architecture, and you can find him on LinkedIn.
– ODBMS.org: BIG DATA, ANALYTICAL DATA PLATFORMS, DATA SCIENCE
– ODBMS.org: NEWSQL, XML, RDF DATA STORES, RDBMS
– ODBMS.org: NOSQL DATA STORES
– On the InterSystems IRIS Data Platform ,ODBMS Industry Watch, 2018-02-09
– Facing the Challenges of Real-Time Analytics. Interview with David Flower, ODBMS Industry Watch, 2017-12-19
– On the future of Data Warehousing. Interview with Jacque Istok and Mike Waas, ODBMS Industry Watch, 2017-11-09
– On Vertica and the new combined Micro Focus company. Interview with Colin Mahony, ODBMS Industry Watch, 2017-10-25
– On Open Source Databases. Interview with Peter Zaitsev, ODBMS Industry Watch, 2017-09-06
Follow us on Twitter: @odbmsorg