class: center, middle

# Instrumenting Python For Production

???

Hi everyone, thanks for making it back here bright and early after a long, hard night of board games.

You're at _Instrumenting Python For Production_, a novice-to-intermediate-level talk on just that topic. It was designed for a 30 minute time slot, so we should have plenty of time for questions both during and after.

If at any point you decide this isn't the talk for you, feel free to leave. I won't blockade the door and make you pass a pop quiz on instrumentation before you go. No promises for those of you who stick around for the whole talk, though.

---

# Me

* Brian Pitts
* Operations Engineer at Mozilla
* Previously Eventbrite, Lonely Planet, and more
* Former Nashvillian, current ATLien
* https://www.polibyte.com

---

# Agenda

1. What is instrumentation, and why should I care?
1. What tools should I use?
1. Where do I get started in my codebase?
1. How should I name things?
1. Now what do I do with this?

???

My goal tonight is to walk you through answers to those questions so that someone who has spent little or no time on instrumentation could show up at work tomorrow and start making improvements.

If you're further along on your instrumentation journey, hopefully you'll still pick up a few helpful tips.

---

# What is instrumentation?

Code whose purpose is not to implement the behavior of your application, but to understand it.

1. What operations am I performing?
1. How long are they taking?
1. What errors are occurring?

???

Ever write code to answer questions like these?

Congrats, that was instrumentation!

---

# Why do I care?

Instrumentation provides data that lets you

* know what your application looks like when it's working
* prove or disprove hypotheses about what's gone wrong when it's not

???

Imagine your boss walks up to your desk and says "the app is slow this morning, fix it ASAP". What do you do next? Would you rather guess, or would you rather pull up a dashboard of your key metrics over the last two days and dive in?

---

# Instrumentation Technology

* unstructured logs
* structured logs / events
* profiling
* tracing / APM
* time series / metrics

???

Those earlier questions didn't dictate any particular approach to collecting, storing, or analyzing the data, and there are a lot of approaches out there. Start googling and you'll drown in a sea of vendor blog posts explaining why their chosen approach is best.

I'm not going to argue that any single approach is the one true way, but I am going to focus on time series metrics.

---

# So what's a time series?

Measurements over time

The output of your instrumentation is `identifier(s) + numeric data + timestamp`

E.g. `http_requests_total{method="post",code="200"} 1027 1395066363000`

---

# Why is this a good starting point?

* simple to add to your code
* numerous open source tools for storage and analysis
* wealth of SaaS vendors too
* easy to correlate with non-application metrics

---

# Prometheus

Open source monitoring project encompassing

* metric format
* client libraries
* integrations to pull metrics from misc software and services
* database for storing and querying time series data

There's a generic prometheus_client library, as well as framework integrations like django-prometheus.

The Prometheus server is easy to run, but if you prefer something else, many other tools can ingest Prometheus-formatted metrics.

---

# Metric Types

* Counters
* Gauges
* Histograms

???

Prometheus has four metric types. We'll talk about three of them today and skip the fourth, summaries.
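
---

# Metric Types: A Minimal Sketch

A quick sketch, with a made-up metric name, showing that anything you define with prometheus_client is registered in the client's default registry and can be rendered in the text format we saw earlier.

```
from prometheus_client import Counter, generate_latest

# Hypothetical counter, created purely for illustration; defining it
# registers it in the client library's default registry.
demo_requests = Counter('demo_requests_total', 'Requests handled by a demo app')
demo_requests.inc()

# Render everything in the default registry in the Prometheus text
# exposition format (plus the client's default collectors).
print(generate_latest().decode())
```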
---

# Counters

* Monotonically increasing value
* Use if you want to know how many times something happened but don't care about how long it took
* Examples: number of clients that have ever connected, number of cache hits

```
from prometheus_client import Counter

c = Counter('my_failures', 'Description of counter')
c.inc()     # Increment by 1
c.inc(1.6)  # Increment by given value
```

---

# Gauges

* Value that can go up or down
* Examples: number of clients currently connected, memory usage

```
from prometheus_client import Gauge

g = Gauge('my_inprogress_requests', 'Description of gauge')
g.inc()     # Increment by 1
g.dec(10)   # Decrement by given value
g.set(4.2)  # Set to a given value
```

---

# Histogram

* Counts observations into configurable buckets
* Use when you care about duration or size in addition to count
* Examples: request time, response size

```
from prometheus_client import Histogram

h = Histogram('request_latency_seconds', 'Description of histogram')
h.observe(4.7)
```

There's also a decorator and context manager (`h.time()`).

---

# Histogram continued

* Default buckets are .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, and infinity
* Buckets are cumulative
* In addition to the count per bucket, it also records the sum of all values and the total count of observations
* With this data you can calculate averages (sum / count) and estimate percentiles

An example with buckets of 1, 3, 5 and observations of 0.5, 1.5, and 3.5:

```
http_request_duration_bucket{le="1"} 1
http_request_duration_bucket{le="3"} 2
http_request_duration_bucket{le="5"} 3
http_request_duration_bucket{le="+Inf"} 3
http_request_duration_sum 5.5
http_request_duration_count 3
```

---

# Where to start with instrumentation

* Integration points!
* Where your app spends its own time
* Key business data

---

# Integration Points

For your application

* Histograms for all requests to your application, broken down by success vs error
* Gauge for in-progress requests (if possible)

For each of your dependencies

* Histograms for all requests to them, broken down by success vs error
* Gauge for in-progress requests (if applicable)

???

These overall ideas apply both to an online system such as a webapp and to an offline system like a task queue.

With this information you know the utilization of your application, its correctness, and its performance, and you can attribute changes in correctness and performance to your application or to one of its dependencies.

---

# Where your app spends its own time

For many services, instrumenting your integration points will take care of this.

However, if your service spends significant CPU time on one or more tasks, record histograms for each of them.

???

Ideally you can account for all of your application's non-idle time.

---

# Key Business Data

* accounts created
* orders placed
* messages to customer support
* etc.

???

Your instrumentation shouldn't be the primary source of business analytics, but these are the ultimate check on whether your code is working correctly.

---

# Names vs Labels

The identifier for a Prometheus metric has a name and zero or more key:value pairs called labels. The name identifies the thing being measured, and labels let you drill down into different dimensions of it.

Bad: http_requests_200_total and http_requests_500_total

Good: http_requests_total{code="200"} and http_requests_total{code="500"}

Be careful not to add labels with too many values (e.g. user id). Prometheus and other time series stores have trouble handling this. Generally stay under 100 distinct values per label.
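
---

# Names vs Labels: In Code

A short sketch of the "good" pattern from the previous slide. The metric and label values are illustrative; `labels()` is how prometheus_client selects the child time series for a particular label combination.

```
from prometheus_client import Counter

# One metric name, with method and status code as label dimensions.
http_requests = Counter(
    'http_requests_total',
    'Total HTTP requests handled',
    ['method', 'code'],
)

# Each distinct combination of label values becomes its own time series.
http_requests.labels(method='post', code='200').inc()
http_requests.labels(method='post', code='500').inc()
```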

---

# What's in a name?

Namespaces are one honking great idea! Units are too!

Template: myapp_descriptor_unit

E.g. pytennessee_presentation_duration_seconds

---

# How do I get my data out?

The Prometheus client supports multiple methods:

* Exposing over HTTP
* Writing to a text file
* Pushing to a Prometheus Pushgateway

???

HTTP is the default, and the library makes this easy for standard WSGI applications. If you don't listen on HTTP already, writing to a text file might be simpler. (There's a short code sketch on the appendix slide at the end.)

In addition to Prometheus itself, other software like Telegraf or Datadog can pull Prometheus metrics over HTTP.

---

# So what can I do with all this data?

* Build Dashboards!
* Alerting
* Analyze trends
* Compare across configurations, time, etc.
* Ad-hoc debugging

???

One of the first things you'll want to do is build dashboards displaying this information. If you're using Prometheus, you'd probably use Grafana for this.

Of course, you can't spend all day and night staring at dashboards, so you can use systems like Prometheus's Alertmanager to notify you when measurements of things like latency or errors are worse than you expect.

---

# References

* https://github.com/prometheus/client_python
* https://www.honeycomb.io/blog/instrumentation-the-first-four-things-you-measure/
* https://www.honeycomb.io/blog/instrumentation-measuring-capacity-through-utilization/
* https://prometheus.io/docs/concepts/metric_types/
* https://prometheus.io/docs/practices/instrumentation/
* https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/

Slides will be on https://www.polibyte.com after PyTennessee
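
---

# Appendix: Getting Data Out (Sketch)

A minimal sketch of the first two options from the "How do I get my data out?" slide. The port, file path, and metric name are made up for illustration; `start_http_server` and `write_to_textfile` are real prometheus_client functions.

```
from prometheus_client import (
    Counter,
    REGISTRY,
    start_http_server,
    write_to_textfile,
)

# Hypothetical metric for illustration.
jobs_processed = Counter('myapp_jobs_processed_total', 'Jobs processed by myapp')
jobs_processed.inc()

# Option 1: expose the default registry over HTTP on port 8000 for
# Prometheus (or Telegraf, Datadog, etc.) to scrape.
start_http_server(8000)

# Option 2: write the default registry to a text file, e.g. for the
# node_exporter textfile collector to pick up.
write_to_textfile('/tmp/myapp.prom', REGISTRY)
```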