Capacity and Stability Patterns

class: center, middle

# Capacity and Stability Patterns

???

Relax

---

# Me

* Brian Pitts
* Systems Engineer at Eventbrite
* @sciurus on twitter
* https://www.polibyte.com

---
background-image: url(ebtech.svg)
background-size: contain

# Systems

???

* unweighted visualization taken from puppet repo

---

# Eventbrite

* Global Marketplace for Live Events
* 600,000 event organizers
* 150 million tickets sold
* $2+ Billion in ticket sales

???

* Stats are from 2016; through our own growth and acquisitions like Ticketfly, 2017's numbers are much bigger
* interesting thing about the ticketing business

---
background-image: url(graphgoingup.png)
background-size: contain

# Obligatory hockey stick graph

???

* some events are bigger than others
* graph of calls to one service endpoint involved in purchasing a ticket
* in eb vernacular this is what we call an "onsale": a sudden extreme spike in activity due to a popular event going on sale
* we're hardly the only company that deals with this. you could see similar graphs from companies like Parse.ly, Shopify, and even those VAX lovers over at Ticketmaster

---

# Patterns we find helpful
# Examples of how we apply them

???

* big part of the job for operators/developers at eb is designing systems that can cope
* we're far from perfect and this is far from comprehensive, but I hope it will give you concrete ideas for improving your own systems
* patterns fall into two categories

---

# Stability
## Keep processing in face of impulses, stresses, or component failures

???

* definition from Michael Nygard's book Release It!
* impulse is rapid shock to system
* stress is force applied to system over time
* onsale i showed is an impulse, stress would be more subtle, like an increase in your tail latency

---

# Capacity
## Throughput a system can sustain with acceptable response time

???

Acceptable is a key word here. I've had developers show me their load testing charts and brag about the number of requests their system can process, but then you look at the latency distribution and it's taking a minute for the average request to process. We won't have any customers if it takes a minute to purchase a ticket.

---

# Bulkheads
## Partitioning systems to prevent cascading failures

![Bulkheads](bulkheads.jpg)

???

* ship analogy

---
class: center, middle

# Bulkheads
![No bulkhead healthy](no_bulkhead_healthy.png)

???

* walk through scenario of slow reports starving all requests
* initial infrastructure all is well

---
class: center, middle

# Bulkheads
![No bulkhead unhealthy](no_bulkhead_unhealthy.png)

???

* walk through scenario of slow reports starving all requests
* slow report overloading webserver

---
class: center, middle

# Bulkheads
![Bulkhead](bulkhead.png)

???

* walk through scenario of slow reports starving all requests
* add bulkhead by adding dedicated reports web servers
* explain the many different clusters eb has based on urls and user agents
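
If asked how this looks in code: a minimal in-process sketch of the same idea, assuming requests are partitioned by URL prefix and each partition gets its own cap on concurrent work. The `Bulkhead` class, `POOLS` table, and the limits are hypothetical; at eb the partitions are separate web server clusters, not semaphores.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a partition has no free capacity."""


class Bulkhead:
    """Caps concurrent work for one class of request."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args):
        # Fail fast instead of queueing when this partition is saturated.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull()
        try:
            return func(*args)
        finally:
            self._slots.release()


# Hypothetical partitions keyed by URL prefix, each with its own limit.
POOLS = {"/reports": Bulkhead(2), "/": Bulkhead(8)}


def pool_for(path):
    # Longest matching prefix wins, so /reports does not fall through to "/".
    prefix = max((p for p in POOLS if path.startswith(p)), key=len)
    return POOLS[prefix]
```

Saturating the `/reports` partition leaves the `/` partition untouched, which is the whole point: the failure is bounded to the compartment it started in.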

---
class: center, middle

# Bulkheads
![Bulkhead with db](bulkhead_with_db.png)

???

* Explain applies to stateful service too
* scope of a bulkhead can be as large as needed to bound the failure

---

# Canary testing
## Gradual rollout of new code

![Canary](canary_spot_art.jpg)

???

* Explain origin of term via canary in coal mine
* We do this two ways

---

# Baking releases

![Bake Board](bakeboard.png)

???

* Deploy to subset of servers, watch performance and errors

---

# Feature flags

```python
from gargoyle import gargoyle

def my_function(request):
    if gargoyle.is_active('cool_feature', request):
        do_cool_new_thing()
    else:
        do_old_boring_thing()
```

???

* we use a fork of gargoyle from disqus
* new features or many other changes are off by default

---

# Feature flags

```
bot BOT [15:49]
new proposal for `cool_feature` by alice@eventbrite.com (admin page)

bot BOT [16:15]
bob@eventbrite.com approved alice@eventbrite.com's proposal for `cool_feature`

Rho BotBOT [16:15]
prod: `cool_feature` set to *active for IP Address Internal IPs OR User Percent: 50% (0-50)*
```

???

* sample rollout: internal ips, selected users, increasing percentages of users, global
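
For the percentage step, a rough sketch of one way to do it, assuming each user is hashed into a stable bucket from 0 to 99 per flag. The function names and parameters are hypothetical, and this is not necessarily how gargoyle or our fork implements it.

```python
import hashlib


def bucket(user_id, flag_name):
    """Map a user to a stable bucket in [0, 100) for one flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_active(flag_name, user_id, ip_is_internal, percent):
    # Internal IPs always see the feature; everyone else is gated by bucket.
    return ip_is_internal or bucket(user_id, flag_name) < percent
```

Because the bucket is stable, raising `percent` only ever adds users: anyone who saw the feature at 25% still sees it at 50%.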

---

# Graceful degradation
## Turning functionality off in response to failures or load

???
* Some functionality more critical than others
* In some places, we do this through our feature flagging framework
* example of mongodb failing and us turning off pageview tracking and reporting features
* example of turning off recommendations during onsale
* in other places, it's automated
* explain velocity engine and forced caching

---

# Load shedding
## Purposefully not handling some requests in order to reserve resources for others

![Waiting Room](waitingroom.png)

???

* instead of serving a limited experience to all users, users get a completely different experience depending on whether they are shed or not
* Waiting room during big onsales to limit orders in progress
* emergency blocks in haproxy at edge
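
The waiting room idea can be sketched as a cap on orders in progress. This is a simplified single-process illustration; the `WaitingRoom` class and its method names are hypothetical, and a real one would need shared state across servers.

```python
import threading


class WaitingRoom:
    """Admits buyers only while orders in progress are under a cap."""

    def __init__(self, max_in_progress):
        self._max = max_in_progress
        self._in_progress = 0
        self._lock = threading.Lock()

    def try_enter(self):
        # Returns False when the buyer should be sent to the waiting room.
        with self._lock:
            if self._in_progress >= self._max:
                return False
            self._in_progress += 1
            return True

    def leave(self):
        # Called when an order completes or is abandoned.
        with self._lock:
            self._in_progress = max(0, self._in_progress - 1)
```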

---

# Rate limiting
## Controlling the amount of work you accept

???

* understand your capacity and prevent exceeding it
* better to fail fast with limit exceeded than cause cascading failures from overload
* at edge- iptables, nginx
* in the application, where we have more smarts - more focused on preventing abuse, e.g. bots scraping us
* in queues - e.g. in our new services framework, which is roughly based on django channels, we limit queue depth for outstanding calls to a service to provide backpressure
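
The queue depth limit can be sketched with a bounded queue from the standard library. This only illustrates the backpressure idea; `ServiceQueue` and its methods are hypothetical and are not the services framework's API.

```python
import queue


class ServiceQueue:
    """Holds outstanding calls to a service, up to a fixed depth."""

    def __init__(self, max_depth):
        self._calls = queue.Queue(maxsize=max_depth)

    def submit(self, call):
        # Reject immediately when full so the caller can fail fast.
        try:
            self._calls.put_nowait(call)
            return True
        except queue.Full:
            return False

    def next_call(self):
        # Workers pull calls off in FIFO order, freeing a slot.
        return self._calls.get_nowait()
```

A rejected `submit` is the "limit exceeded" signal: the sender finds out right away instead of piling more work onto a service that is already behind.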

---

# Timeouts
## Limiting time you wait for a request to complete

???

* rate limiting was the receiver's side, this is the sender's side
* again, better to fail fast than wait and contribute to cascading failure
* some of you are thinking, aren't circuit breakers a better client analogue to rate limiting? i'd agree, but we sadly don't have a history of implementing those
* at edge- timeouts on requests to www servers
* internally - timeouts in requests to datastores, e.g. second for redis
* remote - timeouts in calls to remote services, e.g. webhooks
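
A generic sketch of bounding how long a caller waits, using only the standard library. `call_with_timeout` and its parameters are hypothetical; in practice most datastore and HTTP clients accept a timeout setting directly, which is preferable to wrapping calls like this.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool; the size is an arbitrary choice for this sketch.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_seconds, fallback=None):
    """Run func, but stop waiting after timeout_seconds and return fallback."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The worker thread keeps running; we only stop waiting for it.
        future.cancel()
        return fallback
```

Note the limitation in the comment: this caps the caller's wait, not the work itself, which is one reason native client timeouts are the better tool when they exist.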

---

# Caching
## Saving and re-serving results to reduce expensive requests

???

* Saving computed values in memcached or redis
* Saving entire HTTP response and serving via varnish
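
The computed-value case can be sketched as a read-through cache with a TTL. A plain dict stands in for memcached or redis here, and the `TTLCache` class is hypothetical.

```python
import time


class TTLCache:
    """Read-through cache; a dict stands in for memcached or redis."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store = {}

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the expensive call
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value
```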

---

# Invalidation strategies
## TTL: Keep it short, stupid
## For service calls, centralized invalidation logic
## Wrapper strategy for dynamic TTL

???

* service invalidation driven off mysql binlogs
* event pages, to get higher hit rate, wrapper strategy. dynamic ttl and standardized cache key.

---

# Wrapper example

## Request from user
https://www.eventbrite.com/e/pytennessee-2018-tickets-35662110332?aff=ehomesaved

## Response from app to varnish

```html
<html>
    
    <esi:include src="/esi/event/35662110332?lang=en&timestamp=1509732000"/>
</html>
```

???

* Incoming request, handled by dedicated pool of web servers running code that is blazing fast compared to code to actually render an event page
* response that code serves to varnish, varnish will serve that url from cache if it has it or otherwise request it
* stripped unnecessary info and constructed url that serves as cache key for a version of event page
* to generate the timestamp, check key in redis that we're updating each time the data needed by an event page changes
* if event is not changing, varnish can continue serving from cache indefinitely
* if event is changing frequently, we won't update timestamp more frequently than 5 seconds

---

# Capacity Planning
## Getting the resources you need in place, before you need them

???

* this is kinda hard; as you might have seen in Eben and Baron's talk, it helps if you have people comfortable with math
* load testing
* collect stats and look at growth and seasonality
* e.g. more api servers for new years eve
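
The simplest version of the math is back-of-the-envelope arithmetic. The function and every number below are made up for illustration; real planning would use measured per-server throughput at an acceptable latency.

```python
import math


def servers_needed(peak_rps, per_server_rps, headroom=0.25):
    """Servers required to carry peak load plus a safety margin."""
    return math.ceil(peak_rps * (1 + headroom) / per_server_rps)


# Hypothetical: 1000 requests/s expected at peak, 120 requests/s per server.
print(servers_needed(1000, 120))  # 11
```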

---

# Recap

* Bulkheads
* Canary testing
* Graceful degradation
* Rate limiting
* Timeouts
* Load shedding
* Caching
* Planning

---

# Further resources

* Release It! - Michael Nygard
* The Art of Scalability - Abbott and Fisher
* The Practice of Cloud System Administration - Limoncelli, Chalup, and Hogan
* Site Reliability Engineering - Beyer, Jones, Petoff, and Murphy
* Production-Ready Microservices - Susan Fowler

---

# Thanks!
# Questions?
# @sciurus / https://www.polibyte.com

???

check time