class: center, middle

# Capacity and Stability Patterns

???

Relax

---

# Me

* Brian Pitts
* Systems Engineer at Eventbrite
* @sciurus on twitter
* https://www.polibyte.com

---

background-image: url(ebtech.svg)
background-size: contain

# Systems

???

* unweighted visualization taken from puppet repo

---

# Eventbrite

* Global Marketplace for Live Events
* 600,000 event organizers
* 150 million tickets sold
* $2+ Billion in ticket sales

???

* Stats are from 2016; through our own growth and acquisitions like Ticketfly, 2017's numbers are much bigger
* interesting thing about the ticketing business

---

background-image: url(graphgoingup.png)
background-size: contain

# Obligatory hockey stick graph

???

* some events are bigger than others
* graph of calls to one service endpoint involved in purchasing a ticket
* in eb vernacular this is what we call an "onsale":
* a sudden extreme spike in activity due to a popular event going on sale
* hardly the only company that deals with this. you could see similar graphs from companies like parse.ly, shopify, and even those VAX lovers over at ticketmaster

---

# Patterns we find helpful

# Examples of how we apply them

???

* big part of the job for operators/developers at eb is designing systems that can cope
* we're far from perfect and this is far from comprehensive, but hope this will give you concrete ideas for improving your own systems
* patterns fall into two categories

---

# Stability

## Keep processing in face of impulses, stresses, or component failures

???

* definition from michael nygard's book Release It!
* impulse is a rapid shock to the system
* stress is a force applied to the system over time
* the onsale i showed is an impulse; stress would be more subtle, like an increase in your tail latency

---

# Capacity

## Throughput a system can sustain with acceptable response time

???

Acceptable is a key word here.
I've had developers show me their load testing charts and brag about the number of requests their system can process, but then you look at the latency distribution and it's taking a minute for the average request to process. We won't have any customers if it takes a minute to purchase a ticket.

---

# Bulkheads

## Partitioning systems to prevent cascading failures

???

* ship analogy

---

class: center, middle

# Bulkheads

???

* walk through scenario of slow reports starving all requests
* initial infrastructure, all is well

---

class: center, middle

# Bulkheads

???

* walk through scenario of slow reports starving all requests
* slow report overloading webserver

---

class: center, middle

# Bulkheads

???

* walk through scenario of slow reports starving all requests
* add bulkhead by adding dedicated reports web servers
* explain the many different clusters eb has based on urls and user agents

---

class: center, middle

# Bulkheads

???

* Explain that this applies to stateful services too
* scope of a bulkhead can be however large is needed to bound failure

---

# Canary testing

## Gradual rollout of new code

???

* Explain origin of term via canary in coal mine
* We do this two ways

---

# Baking releases

???

* Deploy to subset of servers, watch performance and errors

---

# Feature flags

```python
from gargoyle import gargoyle

def my_function(request):
    if gargoyle.is_active('cool_feature', request):
        do_cool_new_thing()
    else:
        do_old_boring_thing()
```

???

* we use a fork of gargoyle from disqus
* new features or many other changes are off by default

---

# Feature flags

```
bot BOT [15:49] new proposal for `cool_feature` by alice@eventbrite.com (admin page)
bot BOT [16:15] bob@eventbrite.com approved alice@eventbrite.com's proposal for `cool_feature`
Rho BotBOT [16:15] prod: `cool_feature` set to *active for IP Address Internal IPs OR User Percent: 50% (0-50)*
```

???
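A condition like `User Percent: 50% (0-50)` can be pictured as deterministic bucketing: each user maps to a stable number from 0 to 99, and the flag is on when that number falls inside the configured range. The sketch below is only an illustration under that assumption; `bucket_for` and `is_active_for` are hypothetical names, and this is not how gargoyle or our fork necessarily implements it.

```python
import hashlib


def bucket_for(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for one flag (hypothetical helper)."""
    # Hashing the flag name together with the user id means the same users
    # are not always the first ones exposed to every new flag.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_active_for(user_id: str, flag_name: str, low: int, high: int) -> bool:
    """Return True if the user's bucket falls in the half-open range [low, high)."""
    return low <= bucket_for(user_id, flag_name) < high
```

Because the bucket is stable, widening the range from 0-50 to 0-75 only adds users; nobody who already had the feature loses it mid-rollout.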
* sample rollout: internal ips, selected users, increasing percentages of users, global

---

# Graceful degradation

## Turning functionality off in response to failures or load

???

* Some functionality is more critical than others
* In some places, we do this through our feature flagging framework
  * example of mongodb failing and us turning off pageview tracking and reporting features
  * example of turning off recommendations during onsale
* in other places, it's automated
  * explain velocity engine and forced caching

---

# Load shedding

## Purposefully not handling some requests in order to reserve resources for others

???

* instead of serving a limited experience to all users, a completely different experience depending on whether you are shed or not
* Waiting room during big onsales to limit orders in progress
* emergency blocks in haproxy at edge

---

# Rate limiting

## Controlling the amount of work you accept

???

* understand your capacity and prevent exceeding it
* better to fail fast with limit exceeded than cause cascading failures from overload
* at edge: iptables, nginx
* in application, where we have more smarts: more focused on preventing abuse, e.g. bots scraping us
* in queues: e.g. in our new services framework, which is roughly based on django channels, we limit queue depth for outstanding calls to a service to provide backpressure

---

# Timeouts

## Limiting time you wait for a request to complete

???

* rate limiter was the receiver's side, this is the sender's side
* again, better to fail fast than wait and contribute to cascading failure
* some of you are thinking, aren't circuit breakers a better client analogue to rate limiting? i'd agree, but we sadly don't have a history of implementing those
* at edge: timeouts on requests to www servers
* internally: timeouts in requests to datastores, e.g. second for redis
* remote: timeouts in calls to remote services, e.g. webhooks

---

# Caching

## Saving and re-serving results to reduce expensive requests

???
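As a rough illustration of "saving and re-serving results", here is a minimal read-through cache with a fixed TTL. It is an in-process stand-in, not our production setup; `TTLCache` and `get_or_compute` are hypothetical names, and the `clock` parameter exists only so expiry can be exercised without sleeping.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory read-through cache whose entries expire after a fixed TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry timestamp, cached value)
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value if still fresh; otherwise compute and store it."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # hit: skip the expensive call
        value = compute()  # miss or expired: do the expensive work once
        self._entries[key] = (now + self._ttl, value)
        return value
```

A shorter TTL bounds how stale a served value can be, at the cost of more calls to `compute`.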
* Saving computed values in memcached or redis
* Saving entire HTTP response and serving via varnish

---

# Invalidation strategies

## TTL: Keep it short, stupid

## For service calls, centralized invalidation logic

## Wrapper strategy for dynamic TTL

???

* service invalidation driven off mysql binlogs
* event pages, to get higher hit rate, wrapper strategy. dynamic ttl and standardized cache key.

---

# Wrapper example

## Request from user

https://www.eventbrite.com/e/pytennessee-2018-tickets-35662110332?aff=ehomesaved

## Response from app to varnish

```html