Dashboard templates

SRE on-call

  • 4 widgets at the top: error rate / p95 latency / RPS / saturation (the four golden signals)
  • Variables: env=prod (static), service populated by a label_values() query
  • Auto-refresh: 30s
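
The on-call template above could be expressed as a dashboard definition roughly like the following. This is only a sketch: it assumes a Grafana-style dashboard JSON model (the source does not name the tool, it only uses label_values), and the metric names, the service label, and the PromQL expressions are hypothetical placeholders. Field layout may differ between versions of the tool.

```python
import json

# Hypothetical label selector reused by every panel; $env and $service are
# the two template variables from the list above.
SELECTOR = 'env="$env",service="$service"'


def build_oncall_dashboard():
    """Return a dict sketching the SRE on-call template (assumed JSON layout)."""
    # (title, expression) pairs; all metric names are illustrative assumptions.
    signals = [
        ("Error rate",
         f'sum(rate(http_requests_total{{{SELECTOR},code=~"5.."}}[5m]))'
         f' / sum(rate(http_requests_total{{{SELECTOR}}}[5m]))'),
        ("p95 latency",
         f'histogram_quantile(0.95, sum by (le) '
         f'(rate(http_request_duration_seconds_bucket{{{SELECTOR}}}[5m])))'),
        ("RPS", f'sum(rate(http_requests_total{{{SELECTOR}}}[5m]))'),
        ("Saturation", f'max(service_saturation_ratio{{{SELECTOR}}})'),
    ]
    panels = []
    for index, (title, expr) in enumerate(signals):
        panels.append({
            "title": title,
            "type": "stat",
            # Four equal panels across a 24-column grid, all on the top row.
            "gridPos": {"x": index * 6, "y": 0, "w": 6, "h": 4},
            "targets": [{"expr": expr}],
        })
    return {
        "title": "SRE on-call",
        "refresh": "30s",
        "templating": {"list": [
            # Static variable pinned to prod.
            {"name": "env", "type": "constant", "query": "prod"},
            # Dynamic variable filled from label values of an assumed metric.
            {"name": "service", "type": "query",
             "query": "label_values(http_requests_total, service)"},
        ]},
        "panels": panels,
    }


if __name__ == "__main__":
    print(json.dumps(build_oncall_dashboard(), indent=2))
```

Keeping the four signals in one row means the on-call engineer sees them without scrolling; everything below that row can be service-specific.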

Capacity planning

  • 7-day CPU/RAM graphs across the fleet
  • predict_linear for memory: predict_linear(node_memory_used_percent[7d], 86400 * 30) extrapolates the last 7 days of samples to show where usage will be in 30 days
  • Authentication metrics: new users / active / churn
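
To make the memory projection above concrete, here is a small sketch of the idea behind predict_linear: fit a least-squares line to the samples in the window and extrapolate it forward. It illustrates the technique only and is not Prometheus's implementation; the function name, the sample layout, and the numbers in the usage example are assumptions.

```python
def predict_linear_sketch(samples, horizon_seconds):
    """Extrapolate a least-squares line fitted to (timestamp, value) samples.

    samples: list of (unix_seconds, value), oldest first.
    horizon_seconds: how far past the newest sample to project.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    now = samples[-1][0]
    # Shift timestamps so the newest sample sits at x = 0.
    xs = [t - now for t, _ in samples]
    ys = [v for _, v in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all samples share the same timestamp")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x  # fitted value at x = 0 (now)
    return intercept + slope * horizon_seconds


if __name__ == "__main__":
    DAY = 86400
    # Hypothetical data: memory grows one percentage point per day from 50%.
    week = [(day * DAY, 50.0 + day) for day in range(8)]
    print(predict_linear_sketch(week, DAY * 30))
```

With this made-up series ending at 57%, the 30-day projection comes out at about 87%. A straight line ignores weekly cycles and step changes such as deploys, so the result is best read as a trend indicator rather than a forecast.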

Business KPI

  • Revenue (a custom metric)
  • Signups / day
  • Funnel: visits → signup → trial → paid
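
The funnel widget usually shows step-to-step conversion rather than raw counts. A minimal sketch of that calculation follows; the function name and the counts in the usage example are made up for illustration.

```python
def funnel_conversion(stages):
    """Return (step label, conversion rate) for each adjacent pair of stages.

    stages: ordered list of (name, count), e.g. visits first, paid last.
    """
    rates = []
    for (prev_name, prev_count), (name, count) in zip(stages, stages[1:]):
        # Guard against an empty upstream stage to avoid division by zero.
        rate = count / prev_count if prev_count else 0.0
        rates.append((f"{prev_name} -> {name}", rate))
    return rates


if __name__ == "__main__":
    # Hypothetical daily counts for the four stages.
    example = [("visits", 1000), ("signup", 100), ("trial", 50), ("paid", 10)]
    for step, rate in funnel_conversion(example):
        print(f"{step}: {rate:.1%}")
```

Per-step rates make it easier to see which transition is losing users than a single end-to-end visits-to-paid number.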

Embed for customers

  • 1-2 widgets: success rate + latency
  • Public share with a 30-day TTL