layout: true

class: title background-image: url(action-balance-fun-305250.jpg) background-size: cover

## Approaching the Unacceptable Workload Boundary

### Baron Schwartz • SREcon18 Americas

class: img-right, bigger

# Logistics & Stuff

.col[ Slides are at xaprb.com/talks/.

Ask questions anytime.

Please get in touch: @xaprb or baron@vividcortex.com. ]

.rc[ ]

class: bigger

# Introduction

What happens as systems get bigger and more heavily loaded?

– * What is a system’s operating domain?

– * How is load defined?

– * Where is the load limit? How can you see it coming?

– * How does the system behave near this limit?

– * Can you measure and model this behavior?

background-image: url(nature-3258924-1280.jpg) class: title

.smokescreen[

# The Operating Domain

]

class: center, img-300h

# Operating Domain and Failure Boundaries

Rasmussen’s model describes an **operating domain** bounded by economic risk, effort, and
safety. The system’s **operating state** is a point within the domain, always moving
around.

background-image: url(rasmussens-model.jpg)

class: img-450h, center

# The Actual Boundaries Are Unknown

class: img-450h, center

# We Draw Limits Where We Think It’s Safe

class: img-450h, center, two-column

# The Buffer Zone Is Nonlinear

.col[ ]

–

.col[ We think the gradient looks like this.

It really looks more like this.

]

class: bigger

# Complex Systems Run In Degraded Mode

Richard Cook lists 18 precepts of system failure in How Complex Systems Fail. Precepts 4) and 5) are especially relevant.

–

4) Complex systems contain changing mixtures of failures latent within them.The complexity of these systems makes it impossible for them to run without multiple flaws being present.

5) Complex systems run in degraded mode.A corollary to the preceding point is that complex systems run as broken systems.

???

Systems can and do function beyond their load limits.

class: title background-image: url(gears-1236578-1280.jpg)

.smokescreen[

# System Load

]

class: bigger

# What Is The Definition Of Load?

There’s no one right answer to this question, but there’s a **useful answer**
for this discussion.

–

Load is the **sum of task residence times** during an observation interval
\(T\). This is equivalent to average **concurrency** of tasks queued or in
service:

\[ N = \frac{\sum_{}^{}{R}}{T} \]

.footnote[ You can prove this with Little’s Law. ]

class: bigger

# Load, Utilization, And Queueing

Load (concurrency) is related to **utilization and queue length**, but it’s not
the same.

– * Concurrency is the number of requests in process simultaneously.

– * Average concurrency is an average over an observation interval \(T\).

– * Utilization is the fraction of \(T\) that was busy.

– * Queue length is the instantaneous or time-averaged number of tasks waiting to be serviced.

class: bigger

# Utilization, Queue Length, & Concurrency

By Little’s Law, utilization and queue length are **types of concurrency**.

- Utilization is the concurrency of in-service tasks.

– * Queue length is the concurrency of queued tasks.

class: two-column, bigger

# What Is The Load Limit?

If the load limit were defined in terms of utilization, queueing theory could
tell us where the **load limit** will be.

– But it can’t: load can be infinite, utilization ranges 0-1.

.col[ Plus it’s impractical: * The “hockey stick” queueing curve is hard to use * The “knee” is unintuitive ]

.col[ ]

??? This is appealing because utilization has a clear limit: it can’t be more than 100%.

So we need to translate the problem to a different domain, where the units match. Scalability is the answer.

class: title background-image: url(snow-3260088-1280.jpg)

.smokescreen[

# Scalability

]

class: bigger

# What’s the Definition of Scalability?

There’s a mathematical definition of scalability **as a function of
concurrency**.

–

I’ll illustrate it in terms of a **parallel processing system** that uses
concurrency to achieve speedup.

??? It’s practical, easy to use, and matches the domain well.

I’ll show how the equation is composed piece by piece, but don’t sweat the math.

class: bigger, img-center

# Linear Scaling

Suppose a clustered system can complete **X tasks per second** with no
parallelism.

–

With parallelism, it divides tasks and executes subtasks
concurrently, **completing tasks faster**.

–

Faster completion also means **increased throughput.**

??? * Tasks per second is throughput. * Throughput is a function of concurrency.

class: bigger, img-center

# Linear Scaling

Ideally, **throughput increases linearly with concurrency**.

??? * Linear scaling is the ideal. * Another way to say this is that the system’s output is a linear function of load.

class: two-column, bigger

# The Linear Scalability Equation

.col[ The equation that describes ideal scaling is

\[ X(N) = \frac{\lambda N}{1} \]

where the slope is \(\lambda=X(1)\). ]

.col[ ]

??? - X is throughput - N is concurrency, which is the workload - Lambda is the system’s output when there’s no parallelism - Really important to note that N is the independent parameter, the driver

class: center, bigger

# But Our Cluster Isn’t Perfect

Linear scaling comes from subdividing tasks **perfectly**.

–

What if a portion isn’t subdividable?

class: two-column,bigger

# Amdahl’s Law Describes Serialization

\[ X(N) = \frac{\lambda N}{1+\sigma(N-1)} \]

.col[
Amdahl’s Law describes throughput when
**a fraction \(\sigma\) can’t be
parallelized**.
]

.col[ ]

class: bigger

# Amdahl’s Law Has An Asymptote

\[ X(N) = \frac{\lambda N}{1+\sigma(N-1)} \]

Parallelism delivers speedup, but there’s a limit:

\[ \lim_{N \to \infty}{X(N)} = \frac{1}{\sigma} \]

–

e.g. a 5% serialized task can’t be sped up more than 20-fold.

??? If 5% of the work is serialized, infinite concurrency will still result in tasks taking 5% as long as non-parallelized tasks.

class: img-center, bigger

# What If Workers Coordinate?

Suppose the parallel workers **also have dependencies** on each other?

–

class: two-column, bigger, img-center, img-300h

# How Bad Is Coordination?

\(N\) workers = \(N(N-1)\) pairs of interactions, which is \(\mathcal{O}(n^2)\) in \(N\).

.col[ ]

.col[ ]

class: two-column, bigger

# The Universal Scalability Law

\[ X(N) = \frac{\lambda N}{1+\sigma(N-1)+\kappa N(N-1)} \]

.col[ The USL adds a term for crosstalk, multiplied by the \(\kappa\) coefficient.

Now there’s a **point of diminishing returns**!
]

.col[ ]

.footnote[ Crosstalk is also called coordination or coherence. ]

class: bigger, img-center

# You Already Know This

You’ve seen lots of benchmarks with diminishing returns.

.footnote[ Source: http://dimitrik.free.fr/blog/ ]

??? By the way, pay attention to the axis scale, it’s log-scaled by powers of two. If you scale the X-axis linearly you’ll get the shape of the curve on the previous slide.

class: img-center, bigger, img-300h

# The USL Describes Behavior Under Load

The USL explains the **highly nonlinear behavior** we know systems exhibit near
their saturation point.
desmos.com/calculator/3cycsgdl0b

??? - Serialization (red) grows slowly, but crosstalk (blue) grows rapidly. - This is why systems get so unpredictable near their limits. - Near and above the point of diminishing returns, systems exhibit high variance and get unpredictable.

class: bigger

# A Summary Of The USL

The Universal Scalability Law defines **throughput as a function of concurrency**.

It explains how and why **systems don’t scale linearly with load**.

class: bigger

# What is the USL Good For?

Armed with the USL, you are ready to:

- Measure and model nonlinear behavior.
- Predict the onset of nonlinearity.
- Design better systems.

It’s easy. Let’s see how!

class: title background-image: url(compass-2946958_1280.jpg)

.smokescreen[

# How To Measure, Model, And Predict

]

class: bigger

# What To Measure

You can’t measure serialization & crosstalk directly.

–

Instead, measure **throughput** and **concurrency**.

–

Then **fit the USL model to the data** to estimate the parameters.

class: center, middle

??? Throughput is so trivially easy to measure in most systems that I won’t talk about it. But there’s two easy ways to measure concurrency.

class: bigger

# How To Measure Concurrency, Pt. 1

Many systems have a metric of concurrency already.
Look for a metric of **things actively working**.

- MySQL:
`SHOW STATUS LIKE 'Threads_running'`

- Apache: active worker count

–

It works well to **poll this e.g. 1x/sec**, then average these into 1- or
3-minute averages.

class: bigger

# How To Measure Concurrency, Pt. 2

If there’s no metric of concurrency, you can **sum up latencies and divide by
the duration**.

\[ N = \frac{\sum_{}^{}{R}}{T} \]

??? - Again, in my experience it’s good to use averages over a moderately long window like 1-5 minutes. - You want to end up with dozens to hundreds of data points.

class: img-center, img-450h

# Plot Your Data

Simply **scatterplot your data** and eyeball it for sanity.

??? Source data in “row16.csv” file. If you’re reading this note and you’re not a VividCortex employee, sorry, I can’t give you access to this data.

class: bigger, img-450h

# Plug The Data Into The Model

Paste the data into the Excel model I built.

??? You can do it in R, or gnuplot, or even with JavaScript in Plotly. Lots of options. This is an easy one.

class: bigger

# Interpreting The Results

What does the output mean?

- Shows whether your system has
**more serialization or crosstalk**.

–
- Shows the **estimated max load** where it’ll stop scaling.

–
- Helps you **predict nonlinearity**.

class: bigger, img-center

# Paypal’s NodeJS vs Java Benchmarks

Paypal’s NodeJS vs Java benchmarks are a good example!

class: bigger, img-300h, img-center

# Bringing It Back To The Operating Domain

The USL is one way to understand **what happens near this boundary**.

class: bigger, two-column

# What Happens Here?

.col[
- When the system approaches **workload limits** it gets twitchy.
- You may be able to **see this approaching** before it gets bad.
- Simply scatterplotting **throughput vs concurrency** is super useful!
]

.col[ ]

class: two-column, bigger

# You Don’t Need To Do Any Modeling!

.col[ Let’s take another look at this data. What jumps out? ]

.col[ ]

class: two-column, bigger

# What If You Had Only The First Part?

.col[
- I would model and project out to the right.
- I’d see “hmm, it’s **leveling off**.”
- I’d say “don’t count on much more than you see now.”
]

.col[ ]

class: two-column, bigger

# Think Differently About Outlying Points

.col[
- Given all the data, I mentally cluster it into two parts.
- If the high-end outliers deviate, **it’s nonlinear already.**
- Those points are evidence that the system is struggling there.
- You don’t need to model anything to see that.
]

.col[ ]

class: center, two-column, bigger

.col[

# Some Resources

I wrote a book.

I created an Excel workbook.

These slides are at xaprb.com/talks. ]