Custom Kubernetes Scaling via Envoy Metrics

Louis Vernon

May 20, 2019

A selection of data products returned from our Raster service


The Descartes Labs Platform runs on Kubernetes and scales from hundreds to tens of thousands of cores in response to customer traffic.

Much of this load comes from our Tasks service, which allows users to scale out analytics models with high throughput access to petabytes of geospatial data. The bulk of the heavy lifting behind the retrieval, transformation, and delivery of this data is handled by our Raster service.

Scaling Challenges

Raster can be called directly from our Python client or via a RESTful API, and response times can vary significantly depending on the nature of the request.

Our original approach to scaling Raster used a standard horizontal pod autoscaler (HPA) that tracked CPU utilization per pod. Unfortunately, variation in compute characteristics (requests could be I/O- or CPU-bound) made CPU utilization a poor indicator, and we needed a low threshold to stay ahead of the load.


Raster request latencies at a random moment in time
CPU utilization was not a great indicator of load


Variation in both the nature and duration of requests meant that scaling based on request rate was also not ideal.


At Descartes Labs we are using Managed Istio (1.0.6) on GKE


We have been using Istio for a long time and took note of Istio metrics-based autoscaling, but these higher level metrics (i.e., labeled request counts, duration, rate) were not an obvious fit for our service.

Tapping into Envoy Metrics

Fortunately for us, Envoy, the sidecar proxy used by Istio, allows us to directly measure current saturation of our Raster service. From the Envoy docs:

upstream_rq_active (Gauge) — Total active requests

By summing the upstream_rq_active across all Raster pods we get an effective measure of how many requests are currently being handled by our service.
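In PromQL terms, that sum looks roughly like this (a sketch; the app="raster" label selector matches how our pods are labeled elsewhere in this post, but your labels may differ):

```
sum(envoy_cluster_upstream_rq_active{app="raster"})
```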

To allow Kubernetes to scale on this metric we installed the Zalando kube-metrics-adapter as packaged by Stefan Prodan. We could technically configure the metrics adapter to scrape this metric directly from Envoy's JSON stats endpoint, but it made more sense to let our existing Prometheus infrastructure handle scraping and aggregation.

Update: After upgrading to Istio 1.1.x we found that the specific Envoy metric we leverage is no longer exposed by default. To configure Envoy to provide the metric you must now add the following to the pod template of the deployment you are scaling:
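A sketch of such an annotation, assuming Istio 1.1's sidecar.istio.io/statsInclusionSuffixes proxy annotation (treat the exact suffix value as an assumption — match it to the metric you need):

```yaml
# In the Deployment's pod template (illustrative):
template:
  metadata:
    annotations:
      # Ask the Envoy sidecar to also emit stats ending in this suffix.
      sidecar.istio.io/statsInclusionSuffixes: "upstream_rq_active"
```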




Prometheus Configuration

In Prometheus this metric is surfaced as:

envoy_cluster_upstream_rq_active

(this metric is also exposed via envoy_cluster_upstream_cx_active).

Number of active connections to Raster as reported by Envoy's upstream_rq_active metric

Unfortunately, the Istio-bundled Prometheus configuration scrapes but then drops this metric. To retain it, you must modify or remove the following lines from the Prometheus config:


- source_labels: [ cluster_name ]
  regex: '(outbound|inbound|prometheus_stats).*'
  action: drop


Implementing Our Custom Autoscaler

With the number of active requests per pod being tracked by Prometheus we could then implement our custom HPA:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: raster-release
  namespace: raster
  annotations:
    metric-config.object.raster-release-rq-active.prometheus/per-replica: "true"
    metric-config.object.raster-release-rq-active.prometheus/query: |
      sum(max_over_time(envoy_cluster_upstream_rq_active{app="raster",cluster_name="inbound|8000|http|raster-release.raster.svc.cluster.local",namespace="raster",stage="release"}[1m]))
spec:
  maxReplicas: 1500
  minReplicas: 12
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: raster-release
  metrics:
    - type: Object
      object:
        metricName: raster-release-rq-active
        target:
          apiVersion: v1
          kind: Pod
          name: raster-release # needed for schema consistency
        targetValue: 2


The critical part of this HPA config is the annotation block where we:

  • Provide a consistent label for the metric: raster-release-rq-active
  • Tell the metrics adapter to normalize the metric with respect to the number of pods: per-replica: "true"
  • Provide a PromQL query that returns the sum of the max number of active requests per Raster pod over the last minute.

Raster can handle four concurrent requests per pod, so we set the targetValue to two active requests per pod.
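With per-replica normalization in place, the HPA's ratio rule effectively reduces to dividing the total in-flight requests by the per-pod target, clamped to the replica bounds. A minimal sketch of that arithmetic (the function name and traffic numbers are illustrative, not from our deployment):

```python
import math

def desired_replicas(total_active_requests, target_per_pod, min_r=12, max_r=1500):
    # HPA rule: desired = ceil(currentReplicas * currentValue / targetValue).
    # With per-replica: "true" the adapter reports the per-pod average,
    # so the ratio simplifies to total / target, clamped to min/max replicas.
    desired = math.ceil(total_active_requests / target_per_pod)
    return max(min_r, min(max_r, desired))

print(desired_replicas(500, 2))  # 250 replicas for 500 in-flight requests
```

Targeting two active requests per pod rather than Raster's full capacity of four leaves headroom to absorb spikes while new pods spin up.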

In our small-scale testing we found that simply taking the sum of envoy_cluster_upstream_rq_active yielded accurate numbers, but when we tested with production traffic (yay for Istio traffic mirroring!) and large numbers of pods, we needed to use a window of at least one minute to get consistent numbers.

Does It Work?

As shown below, our rq_active HPA roughly halved the requested resources. Even accounting for the delay introduced between Prometheus scraping Envoy and the metrics adapter querying Prometheus, we still get more responsive scaling than using CPU utilization, resulting in a lower 503 rate overall. We saw these trends continue once we applied our custom HPA in production.

Requested cores for raster-release (using CPU scaling) vs raster-mirror (using rq_active)
Final Thoughts
  • Istio and Envoy made collecting telemetry and safely testing with production traffic (via mirroring) simple.
  • We’re now rolling this methodology out to multiple services throughout our stack.
  • How often do you get to improve quality of service and reduce costs in the process?