On Observability

Notes on what is needed to understand a complex system, the open source tooling to consider, and various best practices

Observability

How well the internals, or inner workings, of a system can be understood from its external signals.

Telemetry

Telemetry is the data (or external signals) emitted or pulled from a system (e.g. application, scaling group, cluster) that lets you understand that system. Telemetry you might want:

Logs:
- Application Logs: Outputs from application code. Think print(), fmt.Println or anything that looks like log.info(...
- Request Logs: Outputs from an API/server with data pertaining to a single API request. Often application logs are embedded into a request log.
  - Request logs can also refer to requests made by a server to a dependency (e.g. a database call), depending on the point of view. A dependency request log is just the client-side counterpart of the dependency's own request log (i.e. the database will have a matching request log).
Metrics:
- Application Metrics: Business or logic level metrics. Relate to the inputs, decisions or decision cases of a process/request. Generally included in a request log/object.
- Hardware Metrics: Pertaining to the performance of the hardware (e.g. EC2 instance, container, lambda process) running application code.
- Network Metrics: Similar to hardware metrics, but pertaining to low level metrics & performance of network components (e.g. throughput, requests queued, latency or integration latency)
Traces: The telemetry needed to relate signals from multiple microservices into a unified context (e.g. a single user request that passes through many microservices). Often this takes the form of trace context propagation (i.e. pass a traceID between services and add it to each request log).
- Span: Logically related components within a trace (e.g. group multiprocessing, paginated API calls).
- Context Propagation / Baggage: Join traces/spans with request-level metadata that may not be available to each span/trace in the call stack.
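As a rough illustration of trace context propagation, the sketch below passes a trace ID between two services and writes it into each request log, along with embedded application logs and an application metric. The header name `x-trace-id`, the service names and the log fields are all made up for this example and are not taken from any particular standard or SDK.

```python
import json
import uuid


def new_trace_id() -> str:
    # Hypothetical ID format: a random 32-character hex string.
    return uuid.uuid4().hex


def handle_request(service: str, headers: dict, logs: list) -> dict:
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    trace_id = headers.get("x-trace-id") or new_trace_id()
    request_log = {
        "service": service,
        "trace_id": trace_id,
        # Application logs embedded in the request log.
        "app_logs": [f"{service}: handling request"],
        # Application metrics carried on the same record.
        "metrics": {"items_processed": 1},
    }
    logs.append(json.dumps(request_log))
    # Headers to send to the next service, carrying the same trace ID.
    return {"x-trace-id": trace_id}


logs: list = []
outgoing = handle_request("frontend", {}, logs)  # first hop starts the trace
handle_request("orders", outgoing, logs)         # second hop reuses the trace ID
```

Because both request logs carry the same `trace_id`, a backend can join them into a single view of the user request.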

Instrumentation (and OpenTelemetry)

Instrumentation is how to get, process and store telemetry data.

Collectors: A process on or adjacent to the application, responsible for collecting and sending data to observability backends. Can also be used as a proxy to ETL from a non-OTel system (i.e. vendor X’s system produces telemetry in T_X format. The collector receives T_X, transforms it to T_Y in the OTel standard and then forwards it to backends).
- Receivers: Get telemetry from application/host. E.g. Read from log file, periodically make system calls to get host metrics.
- Processors: Anything that needs to be done before exporting: batching, encryption, compression, augmentation.
- Exporters: Send data to observability backend.
Collectors are deployed in two forms:
- Agent: Sits adjacent to application (e.g. on node, sidecar, container on deployment/statefulset)
- Gateway: Standalone service
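The receiver, processor and exporter stages can be pictured as a small pipeline. This is only a toy sketch in plain Python, not the OpenTelemetry Collector itself: the function names, the record shape and the `node-1` host label are assumptions chosen for illustration.

```python
import json
import zlib
from typing import Callable, Iterable, List

Record = dict


def receive(lines: Iterable[str]) -> List[Record]:
    """Receiver: turn raw log lines into records, skipping blank lines."""
    return [{"message": line.strip()} for line in lines if line.strip()]


def augment(records: List[Record], host: str) -> List[Record]:
    """Processor: attach host metadata to every record."""
    return [{**record, "host": host} for record in records]


def batch_and_compress(records: List[Record]) -> bytes:
    """Processor: batch all records into one payload and compress it."""
    return zlib.compress(json.dumps(records).encode("utf-8"))


def export(payload: bytes, send: Callable[[bytes], None]) -> None:
    """Exporter: hand the payload to whatever transport reaches the backend."""
    send(payload)


# Stand-in for a backend: collect the payloads that were "sent".
sent: List[bytes] = []
records = augment(receive(["started\n", "\n", "finished\n"]), host="node-1")
export(batch_and_compress(records), sent.append)
```

Keeping each stage as a separate function mirrors the point of a collector: batching, compression and augmentation happen outside the application code and can be changed without touching it.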
Instrumentation SDKs often have backend exporters (i.e. direct from application to observability backend). Why use a collector?
- Offload processing out of application resources
- Standardise important ETL (e.g. compression, encryption, retries)
Automatic Instrumentation: Packages/libraries that provide out-of-the-box telemetry for common libraries (logging, HTTP clients + servers). Similar out-of-the-box experience to APM systems.
Observability Backends: Services to receive, store and view telemetry.

Ideally observability backends support the OpenTelemetry Protocol (OTLP).

Applications can just implement OTLP collectors.
No vendor lock-in, and multiple vendors can be supported without affecting application instrumentation.

OpenTelemetry (OTel)

“OTel’s goal is to provide a set of standardized vendor-agnostic SDKs, APIs, and tools for ingesting, transforming, and sending data to an Observability back-end (i.e. open source or commercial vendor).”

Instrument code with the OpenTelemetry SDK.
Integrate with all observability backends. No vendor lock-in.
Observability backends can build support once and then be used by any pre-instrumented software.

Components

API: Define how to generate/correlate telemetry
SDK: Language specific implementation of API
Data: Describe structure of OTel data and protocol.

Prometheus

Open source metrics backend
Traditionally a pull-based metrics backend, i.e. clients expose a Prometheus-compatible endpoint, and Prometheus periodically sends requests to scrape it.
Metrics can also be pushed, which is useful for short-lived jobs/processes/events, and also for OpenTelemetry exporters.
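To make the pull model concrete, here is a minimal sketch of an application exposing a `/metrics` endpoint in the Prometheus text exposition format, using only the Python standard library. The counter names, their values and port 8000 are assumptions for the example. In practice an official client library would normally generate this output.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters that the application would update.
COUNTERS = {"requests_total": 3, "errors_total": 1}


def render_metrics(counters: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the scrape path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrape requests on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus would then be configured to scrape that host and port on an interval.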