Month: August 2010

The problems with ACID, and how to fix them without going NoSQL

(This post is coauthored by Alexander Thomson and Daniel Abadi)

It is a poorly kept secret that NoSQL is not really about eliminating SQL from database systems (e.g., see these links). Rather, systems such as Bigtable, HBase, Hypertable, Cassandra, Dynamo, SimpleDB (and a host of other key-value stores), PNUTS/Sherpa, etc. are mostly concerned with system scalability. It turns out to be quite difficult to scale traditional, ACID-compliant relational database systems on cheap, shared-nothing scale-out architectures, and thus these systems drop some of the ACID guarantees in order to achieve shared-nothing scalability (letting the application developer handle the increased complexity that programming over a non-ACID compliant system entails). In other words, NoSQL really means NoACID.

Our objective in this post is to explain why ACID is hard to scale. At the same time, we argue that NoSQL/NoACID is the lazy way around these difficulties—it would be better if the particular problems that make ACID hard to scale could be overcome. This is obviously a hard problem, but we have a few new ideas about where to begin.

ACID, scalability and replication

For large transactional applications, it is well known that scaling out on commodity hardware is far cheaper than scaling up on high-end servers. Most of the largest transactional applications therefore use a shared-nothing architecture where data is divided across many machines and each transaction is executed at the appropriate one(s).

The problem is that if a transaction accesses data that is split across multiple physical machines, guaranteeing the traditional ACID properties becomes increasingly complex: ACID’s atomicity guarantee requires a distributed commit protocol (such as two-phase commit) across the multiple machines involved in the transaction, and its isolation guarantee insists that the transaction hold all of its locks for the full duration of that protocol. Since many of today’s OLTP workloads are composed of fairly lightweight transactions (each involving less than 10 microseconds of actual work), tacking a couple of network round trips onto every distributed transaction can easily mean that locks are held for orders of magnitude longer than the time each transaction really spends updating its locked data items. This can result in skyrocketing lock contention between transactions, which can severely limit transactional throughput.
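To make the lock-contention point concrete, here is a minimal sketch (in Python, with made-up timings and class names of our own, not code from any real system) of a transaction that touches two nodes: the actual work takes microseconds, but the locks cannot be released until the two-phase commit round trips finish.

    import time

    NETWORK_RTT = 0.002    # assumed 2 ms round trip between machines
    WORK_TIME = 0.00001    # ~10 microseconds of actual work per node

    class Participant:
        def __init__(self):
            self.locks = set()

        def execute(self, keys):
            self.locks |= set(keys)        # acquire locks and do the real work
            time.sleep(WORK_TIME)

        def prepare(self):
            time.sleep(NETWORK_RTT)        # phase 1: coordinator <-> participant
            return True                    # vote to commit

        def commit(self):
            time.sleep(NETWORK_RTT)        # phase 2: coordinator <-> participant
            self.locks.clear()             # locks are released only here

    def run_distributed_txn(participants, keys_per_node):
        start = time.time()
        for p, keys in zip(participants, keys_per_node):
            p.execute(keys)
        if all(p.prepare() for p in participants):   # phase 1: prepare
            for p in participants:                   # phase 2: commit
                p.commit()
        return time.time() - start

    held = run_distributed_txn([Participant(), Participant()], [["a"], ["b"]])
    print(f"locks held ~{held * 1000:.1f} ms, vs ~{WORK_TIME * 1e6:.0f} us of actual work")

With work measured in microseconds and round trips in milliseconds, lock hold time is dominated almost entirely by the commit protocol, which is exactly the contention problem described above.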

In addition, high availability is becoming ever more crucial in scalable transactional database systems, and is typically accomplished via replication and automatic fail-over in the case of a crash. The developer community has therefore come to expect ACID’s consistency guarantee (originally promising local adherence to user-specified invariants) to also imply strong consistency between replicas (i.e. replicas are identical copies of one another, as in the CAP/PACELC sense of the word consistency).

Unfortunately, strongly consistent replication schemes either come with high overhead or incur undesirable tradeoffs. Early approaches to strongly consistent replication attempted to synchronize replicas during transaction execution. Replicas executed transactions in parallel, but implemented some protocol to ensure agreement about any change in database state before committing any transaction. Because of the latency involved in such protocols (and due to the same contention issue discussed above in relation to scalability), synchronized active replication is seldom used in practice today.

Today’s solution is usually post-write replication, where each transaction is executed first at some primary replica, and updates are propagated to other replicas after the fact. Basic master-slave/log-shipping replication is the simplest example of post-write replication, although other schemes which first execute each transaction at one of multiple possible masters fall under this category. In addition to the possibility of stale reads at slave replicas, these systems suffer a fundamental latency-durability-consistency tradeoff: either the primary replica waits to commit each transaction until it receives acknowledgement of sufficient replication, or it commits as soon as the transaction completes. In the latter case, if the primary replica fails, in-flight transactions are either lost outright (threatening durability) or recovered only once the failed node comes back up, by which time transactions executed on other replicas may have caused the replicas’ states to diverge (threatening consistency).
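A rough sketch of that choice, with illustrative class names of our own (no particular system’s API), is below; the sync flag toggles between paying replication latency on every commit and risking lost or divergent transactions on failure.

    class Replica:
        def __init__(self):
            self.log = []

        def ship(self, txn):
            self.log.append(txn)       # stands in for a network call plus fsync
            return "ack"

    class Primary:
        def __init__(self, replicas):
            self.replicas = replicas
            self.log = []

        def commit(self, txn, sync: bool):
            self.log.append(txn)
            if sync:
                # Wait for replication before acknowledging: durable and
                # consistent, but every commit pays replication latency.
                acks = [r.ship(txn) for r in self.replicas]
                assert all(a == "ack" for a in acks)
            self.ack_client(txn)
            if not sync:
                # Acknowledge first, replicate later: if the primary fails
                # here, the acknowledged txn is either lost (durability) or,
                # once the node recovers, may conflict with work done on a
                # promoted replica in the meantime (consistency).
                for r in self.replicas:
                    r.ship(txn)

        def ack_client(self, txn):
            print(f"committed {txn}")

    Primary([Replica(), Replica()]).commit("T1", sync=True)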

In summary, it is really hard to guarantee ACID across scalable, highly available, shared-nothing systems due to complex and high overhead commit protocols, and difficult tradeoffs in available replication schemes.

The NoACID solution

Designers of NoSQL systems, aware of these issues, carefully relax some ACID guarantees in order to achieve scalability and high availability. There are two ways in which ACID is typically weakened. First, systems like Bigtable, SQL Azure, sharded MySQL, and key-value stores guarantee atomicity and isolation only for transactions that access data within some convenient subset of the database (a single tuple in Bigtable and key-value stores, or a single database partition in SQL Azure and sharded MySQL). This eliminates the need for expensive distributed commit protocols, but at a cost: any logical transaction that spans more than one of these subsets must be broken up at the application level into separate transactions, so the system guarantees neither atomicity nor isolation with respect to arbitrary logical transactions. In the end, the programmer must implement any additional ACID functionality at the application level.
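A toy illustration of what this pushes onto the application (the shard dictionaries and function names here are hypothetical): a logical transfer that spans two partitions has to be issued as two independent single-partition transactions, so the system no longer guarantees that the transfer as a whole is atomic.

    # Illustrative only: a logical "transfer" spanning two shards must be issued
    # as two independent single-shard transactions.
    shard_a = {"alice": 100}
    shard_b = {"bob": 0}

    def single_shard_txn(shard, key, delta):
        # Each call is atomic and isolated *within its shard* only.
        shard[key] += delta

    def transfer(amount):
        single_shard_txn(shard_a, "alice", -amount)   # txn 1 commits here
        # If the application crashes at this point, alice has been debited but
        # bob never credited; detecting and repairing that (compensation,
        # retries, idempotence) is now the application's job.
        single_shard_txn(shard_b, "bob", +amount)     # txn 2

    transfer(30)
    print(shard_a, shard_b)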

Second, lazy replication schemes such as eventual consistency sacrifice strong consistency to get around the tradeoffs of post-write replication (while also allowing for high availability in the presence of network partitions, as specified in the CAP theorem). For all but a handful of well-known and much-publicized Web 2.0 applications, however, giving up consistency all the time (regardless of whether a network partition is actually occurring) is too steep a price to pay, both in added complexity for the application developer and in degraded experience for the end user.

Fixing ACID without going NoSQL

In our opinion, the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues. Responsibility for atomicity, consistency and isolation is simply being pushed onto the developer. What is really needed is a way for ACID systems to scale on shared-nothing architectures, and that is what we address in the research paper that we will present at VLDB this month.

Our view (and yes, this may seem counterintuitive at first) is that the problem with ACID is not that its guarantees are too strong (and that therefore scaling these guarantees in a shared-nothing cluster of machines is too hard), but rather that its guarantees are too weak, and that this weakness is hindering scalability.

The root of these problems lies in the isolation property within ACID. In particular, serializability (the standard isolation level for fully ACID systems) guarantees that execution of a set of transactions occurs in a manner equivalent to some sequential, non-concurrent execution of those transactions, even if what actually happens under the hood is highly threaded and parallelized. So if three transactions (let’s call them A, B and C) are active at the same time on an ACID system, it will guarantee that the resulting database state will be the same as if it had run them one by one. No promises are made, however, about which particular execution order it will be equivalent to: A-B-C, B-A-C, A-C-B, etc.

This obviously causes problems for replication. If a set of (potentially non-commutative) transactions is sent to two replicas of the same system, the two replicas might each execute the transactions in a manner equivalent to a different serial order, allowing the replicas’ states to diverge.
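A two-line example makes the divergence concrete: two non-commutative transactions, with each replica choosing a different (but individually valid) serial order.

    # Two serializable replicas, same transactions, different equivalent serial
    # orders -> divergent states. The transactions are deliberately non-commutative.
    def txn_add_10(state): state["x"] += 10
    def txn_double(state): state["x"] *= 2

    replica_1 = {"x": 5}
    replica_2 = {"x": 5}

    txn_add_10(replica_1); txn_double(replica_1)   # equivalent to A then B: (5 + 10) * 2 = 30
    txn_double(replica_2); txn_add_10(replica_2)   # equivalent to B then A: (5 * 2) + 10 = 20

    print(replica_1, replica_2)   # {'x': 30} {'x': 20} -- the replicas have diverged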

More generally, most of the intra- and inter-replica information exchange that forms the basis of the scalability and replication woes of ACID systems described above occurs when disparate nodes in the system have to forge agreement about (a) which transactions should be executed, (b) which will end up being committed, and (c) with equivalence to which serial order.

If the isolation property were strengthened to guarantee equivalence to a predetermined serial order (while still allowing high levels of concurrency), and if a layer were added to the system that accepts transaction requests, decides on a universal order, and sends the ordered requests to all replicas, then problems (a) and (c) would be eliminated. If the system were also stripped of the right to arbitrarily abort transactions (system aborts typically occur for reasons such as node failure and deadlock), then problem (b) would be eliminated as well.
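Here is a minimal sketch of that ordering layer, under naming of our own (a “Sequencer” in front of the replicas); it shows only the data flow, not the hard part of executing the ordered log with real concurrency.

    class Sequencer:
        def __init__(self, replicas):
            self.replicas = replicas
            self.next_id = 0

        def submit(self, txn):
            txn_id = self.next_id          # the universal order is decided here
            self.next_id += 1
            for r in self.replicas:
                r.enqueue(txn_id, txn)     # same ordered input to every replica

    class Replica:
        def __init__(self):
            self.state = {"x": 5}
            self.queue = []

        def enqueue(self, txn_id, txn):
            self.queue.append((txn_id, txn))

        def run(self):
            # Execution must be equivalent to this predetermined order; doing
            # that while staying highly concurrent is the challenge.
            for _, txn in sorted(self.queue):
                txn(self.state)

    replicas = [Replica(), Replica()]
    seq = Sequencer(replicas)
    seq.submit(lambda s: s.update(x=s["x"] + 10))
    seq.submit(lambda s: s.update(x=s["x"] * 2))
    for r in replicas:
        r.run()
    print([r.state for r in replicas])   # identical state on every replica

Both replicas finish in the same state because they execute the same requests in the same predetermined order, which is exactly problems (a) and (c) going away.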

This kind of strengthening of isolation introduces new challenges (such as deadlock avoidance, dealing with failures without aborting transactions, and allowing highly concurrent execution without any on-the-fly transaction reordering), but also results in a very interesting property: given an initial database state and a sequence of transaction requests, there exists only one valid final state. In other words, determinism.

The repercussions of a deterministic system are broad, but one advantage is immediately clear: active replication is trivial, strongly consistent, and suffers none of the drawbacks described above. There are some less obvious advantages too. For example, the need for distributed commit protocols in multi-node transactions is eliminated, which is a critical step towards scalability. (Why distributed commit protocols can be omitted in distributed systems is non-obvious, and will be discussed in a future blog post; the topic is also addressed at length in our paper.)

A deterministic DBMS prototype

In our paper, entitled “The Case for Determinism in Database Systems”, we propose an architecture and execution model that avoids deadlock, copes with failures without aborting transactions, and achieves high concurrency. The paper contains full details, but the basic idea is to use ordered locking coupled with optimistic lock location prediction, while exploiting deterministic systems’ nice replication properties in the case of failures.
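As a rough sketch of the ordered-locking idea only (not the paper’s actual implementation), suppose every transaction’s lock set is known up front (which is precisely why lock location prediction is needed) and lock requests are granted strictly in the global sequence order:

    from collections import deque

    class LockManager:
        def __init__(self):
            self.holders = {}                   # key -> id of the txn holding it

        def try_grant_all(self, txn_id, keys):
            if any(k in self.holders for k in keys):
                return False                    # would conflict; must wait
            for k in keys:
                self.holders[k] = txn_id
            return True

        def release_all(self, txn_id):
            for k in [k for k, t in self.holders.items() if t == txn_id]:
                del self.holders[k]

    def schedule(txns):
        """txns: list of (txn_id, lock_keys, body), already in global order.
        Lock requests are granted strictly in that order, so no transaction
        ever waits on a lock held by a later one and deadlock cannot arise."""
        lm, waiting = LockManager(), deque(txns)
        while waiting:
            runnable = []
            # Grant locks in order, stopping at the first request that conflicts.
            while waiting and lm.try_grant_all(waiting[0][0], waiting[0][1]):
                runnable.append(waiting.popleft())
            # Everything granted so far could run in parallel worker threads;
            # here we simply run it sequentially.
            for txn_id, _, body in runnable:
                body()
                lm.release_all(txn_id)

    log = []
    schedule([
        (0, ["a"], lambda: log.append("T0")),
        (1, ["a"], lambda: log.append("T1")),   # conflicts with T0, so it waits
        (2, ["b"], lambda: log.append("T2")),   # independent, runs once granted
    ])
    print(log)   # ['T0', 'T1', 'T2']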

We go on in the paper to present measurements and analyses of the performance characteristics of a fully ACID deterministic database prototype based on our execution model, which we implemented alongside a standard (nondeterministic) two-phase locking system for comparison. It turns out that the deterministic scheme performs horribly in disk-based environments, but that as transactions get shorter and less variable in length (thanks to the introduction of flash and the ever-plummeting cost of memory) our scheme becomes more viable. Running the prototype on modern hardware, deterministic execution keeps up with the traditional system implementation on the TPC-C benchmark, and actually shows drastically more throughput and scalability than the nondeterministic system when the frequency of multi-partition transactions increases.

Our prototype system is currently being reworked and extended to include several optimizations which appear to be unique to explicitly deterministic systems (see the Future Work section in our paper’s appendix for details), and we look forward to releasing a stable codebase to the community in the coming months, in hopes that it will spur further dialogue and research on deterministic systems and on the scalability of ACID systems in general.

Ksplice for Fedora!

In response to many requests, Ksplice is proud to announce we’re now providing Uptrack free of charge for Fedora! Fedora will join Ubuntu Desktop among our free platforms, and will give Fedora users rebootless updates as long as Fedora maintains each …

This blog is inactive (at least for the time being)

It’s been more than two and a half years since I last blogged on Oracle performance. I felt I could not carry on this spare-time activity once I started managing a 20-person DBA team and 4 technologies (including 2 Oracle competitors: Microsoft, and Sybase with its ASE and IQ). It would […]

I’ll be a presenter at Oracle Open World 2010

I submitted an abstract to Oracle OpenWorld 2010 that was deemed worthwhile enough to be accepted. Here are the session details: Speaker(s): Christian BILIEN, BNP PARIBAS Corporate Investment Banking, Head of the DBAs. Monday, September 20, 2:00PM | Moscone South, Rm 236, 60 min. Session ID: S314649. Title: Large-Scale Oracle RAC (and […]

“What’s a bind variable?”

It would be reasonable to assume that everything that could ever be written about bind variables has already been written, certainly as regards the concept itself and why they are a good thing. A little while back, I was digging around on a system which was suffering from very high parse rates. I did […]

Essay: 3G and me

In 2002, I got my first cell phone.

June was stuffy in Manhattan, and my summer internship copy-editing the New York Sun, the now-defunct right-wing newspaper, was just about to start. I swam through the humid air past Madison Square Park to get to the store before closing.

“You want this one,” said the salesman at the RadioShack, pointing to a sleek model then on sale. “It’s a 3G phone. It’ll work with Sprint’s new 3G network they’re rolling out later this summer.”

“Ok,” I said. Sure enough, it had 3G:

[Image: Sanyo SCP-6200, labeled "QUALCOMM 3G CDMA"]
Fig. 1: Sprint’s Sanyo 3G phone, circa mid-2002. An orange of more recent vintage looks on.

A few months later — after all the Sun’s editorials casting doubt on whether lead paint can really poison you had been edited and sent off to our eight readers, and I was back at school — Sprint did roll out their 3G network:

Sprint launched nationwide 3G service in the 2002 third quarter. The service, marketed as "PCS Vision", allows consumer and business customers to use their Vision-enabled PCS devices to take and receive pictures, check personal and corporate e-mail, play games with full-color graphics and polyphonic sounds and browse the Internet wirelessly with speeds up to 144 kbps (with average speeds of 50 to 70 kbps).

I called Sprint and tried to subscribe. “Sir, you need a 3G phone to sign up,” they told me.

“I have one!” I said proudly. “It says 3G CDMA right on the back!”

“Oh, I’m sorry sir. We’ve changed the labeling of that model. That phone doesn’t have true 3G. It doesn’t say that on the back any more. If you like I would be happy to sell you the next model, the SCP-6400, which has true 3G.”

“No, thanks,” I said, thinking that 3G was pretty much a crock, while wryly appreciating RadioShack’s ability to make you feel cheated even on a $30 cellphone.

Sure enough, when my phone died and had to be replaced, I saw the new one only said “QUALCOMM CDMA” — no more “3G”. It had been revised downward.

Meanwhile, Sprint’s competitors were busy deploying their own nationwide 3G networks. Cingular, then a joint venture of SBC and BellSouth, trumpeted each step in the process:

June 2003:

ATLANTA, June 30 — Cingular Wireless today announced the world’s first commercial deployment of wireless services using Enhanced Datarate for Global Evolution (EDGE) technology. Cingular’s initial EDGE service offering is in its Indianapolis market, with subsequent deployments expected later in the year.

Building on more than a decade of wireless data experience, Cingular’s EDGE technology enables true "third generation" (3G) wireless data services with data speeds typically three times faster than those available on GSM/GPRS networks.

Or October 2003:

Cingular began offering its 3G service EDGE (Enhanced Datarate for Global Evolution) in Indianapolis in July, becoming the first commercial wireless company in the world to offer the service.

Or June 2004:

This year, further enhancements have been made to the network with the launch of EDGE in Connecticut, a high-speed wireless data service which gives customers true "third generation" (3G) wireless data services with data speeds typically three times faster than what was available on GPRS.

Those of you who care about these things will probably be jumping up and down right now, and/or closing the browser window. “EDGE isn’t 3G!” you are saying. “It’s 2.9G at best! And neither is 1xRTT, which is all the Sanyo SCP-6200 had. That’s barely 2.5G! Maybe 2.75G on a clear day.”

These people, who while enthusiastic sometimes seem to have been born yesterday, would point to the kerfuffle when Apple released the original iPhone for Cingular in 2007 and it supported only EDGE. As the Wall Street Journal wrote:

Detractors and fans are going toe to toe on online forums. Much of the latest criticism is zooming in on Apple’s choice of technologies to use with the new phone and its decision to partner exclusively with AT&T Inc.’s Cingular Wireless, which is being rebranded as AT&T.

For example, the iPhone won’t use the fastest wireless Internet connection available, relying on so-called second-generation, or 2G, rather than faster 3G networks now being rolled out by major wireless carriers. Because of this, industry experts expect features of the iPhone such as Web browsing and downloading not to be very fast.

Tim Cook, Apple’s chief operating officer, said during a conference call with analysts yesterday the company is sold on Cingular’s 2G EDGE network because “it’s much more widespread and widely deployed in the U.S.” Mr. Cook didn’t comment on whether Apple will eventually support 3G but said, “Obviously we would be where the technology is over time.” Some people refer to EDGE as 2.5G.

By 2007, Cingular/AT&T was happy to downgrade its EDGE offerings in favor of a newer kind of 3G (known as W-CDMA or UMTS). From an interview with AT&T’s chief, Randall Stephenson, in the New York Times in June 2007:

"I got to tell you, carrying this thing around and experiencing those kinds of speeds on a wireless handset, your imagination begins to run in terms of what’s possible," he said, "and by the way, there’s not a 3G network available in Ottumwa, Iowa," referring to the so-called third generation of Web-enabled cellphones that require faster networks. "If you want to sell these devices in a variety of places, Edge is the only opportunity you have."

AT&T has invested $16 billion in its network over the last two years, and the network is now designed to handle the expected increase in wireless data users, he said, adding: "Capacity won’t be an issue. The network is ready."

Ok, what are some quick takeaways here?

  • What Sprint sold as "3G" in 2002 (1xRTT voice), it rescinded later that year, relabeling the phones.
  • What counted as "3G" for Sprint in 2003 (1xRTT data) isn’t any more either.
  • What in 2004 constituted “true ‘third generation’ (3G)” to Cingular/AT&T, the company had retroactively downgraded to 2G or 2.5G or 2.9G by 2007.
  • From an engineer’s perspective, if you read a book on telecom engineering, the 3G air interfaces are CDMA2000 (including 1xRTT and EV-DO), EDGE, and W-CDMA (including UMTS, with or without HSUPA and HSDPA). The International Telecommunication Union has published a standard for third-generation wireless communications, known as IMT-2000, that includes those three and a few others.
  • To a first approximation, the first launch of "3G" in the United States, around 2002 and 2003, was a dud. The carriers responded by dusting themselves off, redoubling their efforts, deploying a new thing and retroactively downgrading their old "3G" product to be… some smaller number of G’s. "3G" itself is not a technical term with a whole lot of meaning, especially as it lumps together so many incompatible, competing air interface protocols. The situation for consumers was less confused in Europe, where GSM and W-CDMA are dominant, governments auctioned new frequencies set aside for "3G," and the carrier offerings were more distinct.
  • The same song-and-dance is likely to play out over “4G” — a term that engineers tentatively apply to a forthcoming ITU standard called IMT-Advanced, and carriers apply to whatever they want you to buy now. You might notice that Sprint is currently selling Mobile WiMAX as “4G.” Mobile WiMAX is part of IMT-2000 — the 3G standard. Verizon Wireless is selling something called “LTE” as “4G” — it ain’t in IMT-Advanced either. Today’s “4G” products are like the “3G” of 2002 and 2003 — they will become “3.75G” as soon as the next hot thing comes out.

But the point I really want to make is: this is all a red herring. Focusing on the protocol between your cell phone and the tower — or worse, spending money on that basis — is letting yourself be distracted. It’s like the secret pick-me-up in Geritol, concocted by Madison Avenue instead of a chemist.

A cell phone is essentially sharing a swath of radio spectrum with a bunch of other people within a cell. Think of it like a cable modem or any other ISP. You can have the world’s most sophisticated modem, but if it’s trying to talk in a tiny slice of spectrum shared with everybody else within miles around (because there aren’t enough towers to divide you up into cells), it’ll still be awful.

Consider, for example, the performance I get from a Verizon “3G” USB modem:

3060 packets transmitted, 3007 received, 1% packet loss, time 3061925ms
rtt min/avg/max/mdev = 121.554/404.199/22307.520/1213.055 ms, pipe 23

Pretty sad! But hey, it’s 3G. In truth, a lot of boring factors control the performance of your cell phone data transmissions, principally:

  1. How much spectrum has the carrier licensed in my city, and how much is allocated to this kind of modulation?
  2. How many other people am I sharing the local tower with? In other words, how big is my cell, and how many towers has the carrier built or contracted with?
  3. How much throughput are my cellmates trying to consume?
  4. How much throughput has the carrier built in its back-end network connecting to the tower?

You might notice that all of these meat-and-potatoes factors involve the carrier spending money, and they all involve gradual improvement in behind-the-scenes infrastructure that’s hard to get customers excited about. Persuading you to buy a new cell phone with a sophisticated modem and sign up for a two-year contract is a different story. So they don’t sell you something measurable where they could be held accountable; they sell how sweet it feels to be using a sophisticated radio modem protocol to talk to them.

Don’t get me wrong — UMTS and EV-DO are sophisticated protocols, and a lot of smart people and clever techniques made them legitimate engineering accomplishments. But the boring factors — the raw resources being shared among the nearby customers — dictate your performance just as much as incremental improvements in the air interface. What we really ought to care about is the same as with any Internet service provider — the throughput and latency and reliability you get to the endpoints you want to reach. That’s what matters, not the sophistication of one piece of the puzzle.
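A back-of-envelope calculation (every number below is an assumption chosen for illustration, not carrier data) makes the point: per-user throughput is capped by how much spectrum serves the cell and how many active users share it, regardless of how many G’s the air interface claims.

    def per_user_kbps(spectrum_mhz, spectral_efficiency_bps_per_hz, active_users):
        # Cell capacity is roughly spectrum * spectral efficiency, shared by
        # everyone actively transferring data in the cell.
        cell_capacity_bps = spectrum_mhz * 1e6 * spectral_efficiency_bps_per_hz
        return cell_capacity_bps / active_users / 1e3

    # A hypothetical 5 MHz carrier at ~1 bit/s/Hz shared by 50 active users:
    print(per_user_kbps(5, 1.0, 50))     # ~100 kbps each
    # The same cell with a fancier air interface at 1.5 bit/s/Hz but 150 users:
    print(per_user_kbps(5, 1.5, 150))    # ~50 kbps each, despite the better modem

A better air interface buys a higher spectral-efficiency number, but if the carrier has not licensed more spectrum or built more towers to shrink the cells, additional subscribers simply eat the gain.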

If the carrier sold you "384 kbps Internet access anywhere in the coverage area, outdoors," that would be something you could hold them accountable for. If it made that guarantee, the carrier might even have to put a brake on signing up new customers until it could build new towers or license more spectrum for everybody to share.

Some have proposed even more freely enterprising business models — like having your phone get minute-to-minute bids from the local towers on who will carry your traffic for what price, and accept the lowest bidder who offers acceptable performance.

Selling you “3G” — well, that’s a lot easier to live up to. And it changes every year. So don’t tell me how many G’s your new phone has. We’ve loved and lost so many G’s at this point. Tell me you got a new phone where you pay to get 1 Mbps and 100 ms rtt to major exchange points. When the market moves forward enough to make that a reality, that’ll be a generation worth celebrating.

~keithw
