Automating cloud processor benchmarks with SiliconBoutique

#benchmarking #kubernetes #gcp #terraform #observability

Comparing cloud processors sounds like a weekend project until you’re three days in and still fighting reproducibility. Every run needs the same workload, the same observability stack, the same teardown guarantee, and somewhere to put the results that doesn’t evaporate after the next run. SiliconBoutique puts all of that into a single pipeline you can trigger from one command, inspect in a live Grafana dashboard or a portable HTML page, and compare across GCP and AWS without opening a spreadsheet.


What the system delivers

SiliconBoutique deploys Google’s Online Boutique microservices demo as the benchmark workload, collects CPU and memory metrics during a measured load window, writes a canonical BenchmarkSummary row to BigQuery for cloud runs and configured local runs, and gives you two ways to look at results: a private Grafana dashboard scoped to the live run, and a portable HTML dashboard generated from stored history.

The full pipeline is built around four concrete outputs:

  • A Terraform-managed execution target: a fresh namespace in the local minikube profile, or a fresh GKE/EKS cluster for cloud runs, destroyed after the run.
  • Structured benchmark artifacts covering CPU utilization, memory working set, CPU throttling, frontend probe latency, pod readiness, and container restarts.
  • A Grafana dashboard scoped to the exact run and benchmark window, accessible via port-forward during local runs or verified through the Grafana API during cloud runs.
  • A BenchmarkSummary row in BigQuery for cloud runs, and for local runs when BigQuery settings are configured, with cost fields, load profile metadata, and quality indicators that persist across teardown.

How to run a benchmark

The main entrypoint is run_benchmark_workflow.py. It handles local runs, GCP dispatches, and AWS dispatches from a single interface. Pick the command that fits your situation.

Local smoke run

python3 automation/scripts/run_benchmark_workflow.py \
  --target local \
  --profile smoke \
  --bigquery-env-file credential.env

This provisions a namespace in the siliconboutique minikube profile, deploys Online Boutique and the monitoring stack, runs the load generator, extracts Prometheus metrics, validates the summary, and tears down. If your credential.env has valid BigQuery settings, it also persists the row and verifies it remotely.

GCP remote benchmark

python3 automation/scripts/run_benchmark_workflow.py \
  --target gcp \
  --project-id "$project_id" \
  --bigquery-env-file credential.env

This dispatches .github/workflows/benchmark.yml through the GitHub Actions API, waits for the workflow to finish, downloads the artifact, and verifies acceptance evidence. GCP Terraform never runs locally. The cloud workflow derives a run_id as gha-<github-run-id>-<attempt> and threads it through run-scoped resources, labels, artifacts, and the BigQuery row.
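
The run_id shape is easy to reproduce when you need to match a downloaded artifact to a workflow run. A minimal sketch, assuming the value is built from the standard GITHUB_RUN_ID and GITHUB_RUN_ATTEMPT environment variables that GitHub Actions sets; the function name derive_run_id is hypothetical, and the real workflow may well compute the value in YAML instead:

```python
import os
from typing import Mapping


def derive_run_id(env: Mapping[str, str] = os.environ) -> str:
    """Build a run_id shaped like gha-<github-run-id>-<attempt>."""
    run_id = env["GITHUB_RUN_ID"]
    # Assumption: treat a missing attempt counter as the first attempt.
    attempt = env.get("GITHUB_RUN_ATTEMPT", "1")
    return f"gha-{run_id}-{attempt}"
```

Passing the environment as a mapping keeps the helper easy to exercise outside GitHub Actions.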

Dashboard from stored results

python3 automation/scripts/launch_metrics_dashboard.py \
  --project-id "$project_id" \
  --dataset-id silicon_boutique \
  --table-id benchmark_summaries \
  --location US \
  --no-browser

This writes artifacts/dashboard/index.html and artifacts/dashboard/dashboard-data.json from BigQuery history. The same command works against a local NDJSON store by swapping in --summary-store artifacts/benchmark-summaries.ndjson in place of the BigQuery options.
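
If you want to inspect the local store yourself, NDJSON is simply one JSON object per line. A minimal sketch for reading it, assuming each line of benchmark-summaries.ndjson is one complete summary; the helper name read_summaries is hypothetical and is not part of the repository:

```python
import json
from pathlib import Path
from typing import Iterator


def read_summaries(path: str) -> Iterator[dict]:
    """Yield one summary dict per non-empty line of an NDJSON file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

For example, list(read_summaries("artifacts/benchmark-summaries.ndjson")) would load every stored run into memory.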


Architecture overview

The system has five layers. Each layer communicates through structured artifacts and never imports the internals of another layer directly.

[Architecture diagram: the five layers and their components]

  • 🔌 MCP Boundary: trigger_benchmark_run, get_benchmark_status, query_historical_metrics.
  • 🔄 Automation Layer: run_benchmark_workflow.py, run_local_benchmark.py, extract_prometheus_metrics.py, generate_benchmark_summary.py, validate_benchmark_comparability.py, load_benchmark_summary_to_bigquery.py, launch_metrics_dashboard.py, and the GitHub Actions workflows benchmark.yml and benchmark-aws.yml.
  • 📊 Observability Layer: the monitoring stack Helm chart, Prometheus with recording rules, the Grafana dashboard, and the Blackbox frontend probe.
  • 📦 Workload Layer: the online-boutique Helm chart, the metadata-injecting post-renderer, and the rendered workload with run metadata.
  • ⚙️ Infrastructure Layer: the Terraform roots local-kubernetes, gcp-gke, gcp-bigquery, and aws-eks, plus the BigQuery benchmark_summaries table.

Infrastructure layer

Terraform manages four roots. Three are ephemeral and scoped to a single run_id: the local Kubernetes namespace, the GCP GKE cluster with its VPC, and the AWS EKS cluster with its node group. The fourth, gcp-bigquery, is durable and never part of teardown. It provisions the BigQuery dataset and table once, and every cloud benchmark run appends a row to it; local runs do the same when BigQuery persistence is configured.

Label-capable or tag-capable resources created by an ephemeral root carry the run_id and provider metadata. Teardown fails the workflow job if Terraform destroy reports a non-zero exit code.

Workload layer

The Online Boutique workload is packaged as a Helm chart pinned to upstream onlineboutique version 0.10.5. A custom post-renderer written in Python injects SiliconBoutique run labels, teardown annotations, and load-generator environment variables into every rendered resource. This keeps the upstream chart untouched while guaranteeing every pod carries the right metadata for Prometheus label matching.
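
A Helm post-renderer is an executable that receives the rendered manifests on stdin and writes the modified manifests to stdout. The core of such a tool is a function that patches each parsed resource. The following is a minimal sketch of that step only, not the project's actual post-renderer: YAML parsing is omitted, and the label keys and the function name inject_run_metadata are hypothetical.

```python
import copy


def inject_run_metadata(resource: dict, run_id: str, provider: str) -> dict:
    """Return a copy of a rendered Kubernetes resource with run labels added."""
    patched = copy.deepcopy(resource)
    labels = {
        "siliconboutique/run-id": run_id,      # hypothetical label key
        "siliconboutique/provider": provider,  # hypothetical label key
    }
    patched.setdefault("metadata", {}).setdefault("labels", {}).update(labels)
    # Workload resources also need the labels on the pod template, so that
    # the pods themselves can be matched by Prometheus label selectors.
    template = patched.get("spec", {}).get("template")
    if isinstance(template, dict):
        template.setdefault("metadata", {}).setdefault("labels", {}).update(labels)
    return patched
```

Working on a deep copy leaves the chart's rendered output untouched, which mirrors the goal of never modifying the upstream chart.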

The monitoring chart installs a full kube-prometheus-stack and a prometheus-blackbox-exporter, defines the silicon_boutique:* recording rules, and provisions the Grafana dashboard through a sidecar ConfigMap.

Automation layer

Each automation script reads from an explicit input file and writes to an explicit output file. The unified coordinator, run_benchmark_workflow.py, calls the lower-level scripts directly for local runs or dispatches the GitHub Actions workflows for cloud runs. The same scripts run in a local terminal or inside GitHub Actions without modification.


The benchmark pipeline sequence

A single benchmark run follows this sequence whether it runs locally, on GCP, or on AWS.

[Pipeline flowchart: six phases, in order]

  1. Provision: terraform apply creates the namespace or the cloud cluster.
  2. Deploy: helm upgrade --install the online-boutique chart and kubectl wait until all deployments are Available, then helm upgrade --install the monitoring chart and kubectl rollout status for Prometheus, Grafana, and the exporters.
  3. Benchmark window: wait for the required recording rules to populate, restart loadgenerator to start the measured window, sleep for test_duration (default 20m), then update the Grafana dashboard with the exact window timestamps.
  4. Extract: kubectl port-forward Prometheus on :9090, run extract_prometheus_metrics.py to query the silicon_boutique:* rules, and parse the Locust aggregate stats from kubectl logs loadgenerator.
  5. Summarize and validate: generate_benchmark_summary.py writes the summary JSON and NDJSON, validate_benchmark_comparability.py runs with --mode summary --run-id, and load_benchmark_summary_to_bigquery.py persists the row (required for cloud runs, optional for local runs).
  6. Teardown: helm uninstall the monitoring and workload releases, terraform destroy all run-scoped resources, and verify the teardown status, failing the job if destroy failed.

Artifacts are uploaded once the run finishes.

The automation waits for the silicon_boutique:* recording rules to produce live data before the clock starts. This reduces startup and warmup noise in the benchmark window, so the measured window is much closer to steady-state behavior.
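
The wait can be pictured as a polling loop over Prometheus instant queries. A minimal sketch, assuming the standard Prometheus HTTP API response shape (status plus data.result); the function names are hypothetical, and the query callable stands in for whatever HTTP client issues the /api/v1/query request:

```python
import time
from typing import Callable, Iterable


def rule_has_data(response: dict) -> bool:
    """True when an instant-query response holds at least one sample."""
    return (
        response.get("status") == "success"
        and len(response.get("data", {}).get("result", [])) > 0
    )


def wait_for_rules(
    rules: Iterable[str],
    query: Callable[[str], dict],
    timeout_seconds: float = 300.0,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every rule returns data, or give up after the timeout."""
    pending = set(rules)
    deadline = time.monotonic() + timeout_seconds
    while pending:
        # Keep only the rules that still return no samples.
        pending = {rule for rule in pending if not rule_has_data(query(rule))}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_seconds)
    return True
```

The timeout and poll interval defaults here are illustrative values, not the ones the automation uses.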

The teardown phase runs with if: always() in GitHub Actions, so it executes even after extraction or summary failures. Terraform apply and destroy stay in the same job to keep the state file local and available for cleanup.


Setting up on Google Cloud

Before dispatching a live GCP benchmark, three things need to be in place: a GCP project with billing enabled, a durable BigQuery destination, and GitHub OIDC credentials.

Provision durable BigQuery storage

Provision once, before the first benchmark. This root is intentionally separate from the GKE root and must not be destroyed as part of teardown.

cd infra/terraform/gcp-bigquery
terraform init
terraform apply -auto-approve \
  -var="project_id=YOUR_GCP_PROJECT_ID" \
  -var="static_validation_mode=false"

This creates the silicon_boutique dataset and benchmark_summaries table, partitioned by benchmark_start and clustered by machine_type, processor_family, architecture, and run_id. To grant the GitHub Actions writer service account access at the same time, pass it as a variable.

terraform apply -auto-approve \
  -var="project_id=YOUR_GCP_PROJECT_ID" \
  -var="summary_writer_service_accounts=[\"your-sa@your-project.iam.gserviceaccount.com\"]" \
  -var="static_validation_mode=false"

Configure GitHub OIDC and secrets

The benchmark workflow authenticates to GCP using GitHub OIDC, so no long-lived service account keys need to be stored. Set two secrets in your repository:

  • GCP_WORKLOAD_IDENTITY_PROVIDER — the full workload identity provider resource name.
  • GCP_SERVICE_ACCOUNT — the service account email.

The service account needs roles/bigquery.jobUser at the project level and roles/bigquery.dataEditor on the durable dataset. The BigQuery Terraform root manages these grants when you pass the service account email to summary_writer_service_accounts.

Dispatch a benchmark

The coordinator dispatches the workflow and waits for it by default:

python3 automation/scripts/run_benchmark_workflow.py \
  --target gcp \
  --project-id "$project_id" \
  --machine-type c3-standard-4 \
  --processor-family c3 \
  --bigquery-env-file credential.env

To dispatch without waiting:

python3 automation/scripts/run_benchmark_workflow.py \
  --target gcp \
  --project-id "$project_id" \
  --bigquery-env-file credential.env \
  --no-wait \
  --dashboard skip

You can also dispatch directly through the gh CLI if you want control over every input:

gh workflow run benchmark.yml \
  -f project_id="YOUR_GCP_PROJECT_ID" \
  -f region=us-central1 \
  -f zone=us-central1-a \
  -f machine_type=c3-standard-4 \
  -f processor_family=c3 \
  -f architecture=x86_64 \
  -f pricing_model=spot \
  -f test_duration=20m \
  -f bigquery_dataset=silicon_boutique \
  -f bigquery_table=benchmark_summaries \
  -f bigquery_location=US \
  -f acceptance_demo=true

Running locally in the devcontainer

The devcontainer is the fastest path to a working local environment. It pins Terraform 1.15.2, kubectl 1.36.0, Helm 4.1.4, and minikube 1.38.1. The post-create script boots the siliconboutique minikube profile automatically when Docker is reachable. Open the repository in VS Code with the Dev Containers extension and the environment is ready.

Local smoke benchmark

python3 automation/scripts/run_benchmark_workflow.py \
  --target local \
  --profile smoke \
  --bigquery-env-file credential.env

For lower-level debugging, you can call the orchestration script directly with a short duration:

python3 automation/scripts/run_local_benchmark.py \
  --test-duration 2m \
  --min-duration-seconds 60

Inspect the Grafana dashboard locally

Add --skip-destroy to leave the namespace alive after extraction, then port-forward Grafana in a second terminal:

python3 automation/scripts/run_local_benchmark.py \
  --test-duration 5m \
  --min-duration-seconds 60 \
  --skip-destroy

# In another terminal:
cd infra/terraform/local-kubernetes
namespace="$(terraform output -raw namespace)"
cd ../../..
kubectl port-forward service/sb-monitoring-grafana \
  3000:80 \
  --namespace "$namespace" \
  --context siliconboutique

Get the admin password:

kubectl get secret sb-monitoring-grafana \
  --namespace "$namespace" \
  --context siliconboutique \
  -o jsonpath='{.data.admin-password}' | base64 -d

Open http://127.0.0.1:3000, log in as admin with that password, and open the SiliconBoutique Online Boutique Benchmark dashboard. The default time range is the last 30 minutes, which covers short local runs.

What the Grafana dashboard shows

The dashboard has seven panels, each scoped to the exact run_id and namespace of the current run.

[Dashboard diagram: 📊 SiliconBoutique Online Boutique Benchmark]

  • CPU Usage (cores, time series)
  • CPU Utilization % (gauge vs node capacity)
  • Memory Working Set (bytes, time series)
  • CPU Throttling (ratio, time series)
  • Frontend Latency (p50 / p95 / p99)
  • Pod Readiness and Restarts
  • Benchmark Metadata (run_id, window, machine)

The panels draw on four data sources: the Prometheus silicon_boutique:* rules, the Blackbox probe of HTTP /, Prometheus rules derived from kube-state-metrics, and the rendered Helm values.

The CPU utilization gauge is worth pointing out specifically. It measures workload CPU usage divided by allocatable CPU cores on nodes running benchmark pods, with node capacity as a fallback when allocatable metrics are unavailable. This normalization against actual node capacity makes it meaningful for cross-machine comparison.
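
The normalization described above amounts to a ratio with a fallback denominator. A minimal sketch of that arithmetic, not the actual recording rule (which lives in PromQL); the function and parameter names are hypothetical:

```python
from typing import Optional


def cpu_utilization_pct(
    usage_cores: float,
    allocatable_cores: Optional[float],
    capacity_cores: float,
) -> float:
    """Workload CPU usage as a percentage of the node CPU available to pods."""
    # Prefer allocatable cores; fall back to raw node capacity when the
    # allocatable metric is missing or zero.
    denominator = allocatable_cores if allocatable_cores else capacity_cores
    if denominator <= 0:
        raise ValueError("node CPU denominator must be positive")
    return 100.0 * usage_cores / denominator
```

For instance, 2.0 cores of workload usage against 4 allocatable cores comes out at 50%, regardless of how large the underlying machine type is on paper.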


What a benchmark summary looks like

After a successful run, artifacts/benchmark-summary.json contains the canonical result. Here’s a representative example from a local smoke run:

{
  "run_id": "local-smoke-20260514-120000",
  "namespace": "silicon-boutique-local-smoke-20260514-120000",
  "environment": "local",
  "cloud_provider": "local",
  "region": "local",
  "zone": "local",
  "machine_type": "local",
  "processor_family": "local",
  "cpu_platform": null,
  "architecture": "x86_64",
  "node_count": 1,
  "pricing_model": "local",
  "benchmark_start": "2026-05-14T12:00:00Z",
  "benchmark_end": "2026-05-14T12:02:00Z",
  "duration_seconds": 120,
  "generated_at": "2026-05-14T12:02:10Z",
  "avg_cpu_usage_cores": 1.87,
  "max_cpu_usage_cores": 2.49,
  "avg_cpu_utilization_pct": 46.8,
  "max_cpu_utilization_pct": 62.3,
  "avg_memory_working_set_bytes": 1432150016,
  "max_memory_working_set_bytes": 1480000000,
  "max_memory_used_gb": 1.48,
  "avg_cpu_throttling_ratio": 0.01,
  "max_cpu_throttling_ratio": 0.03,
  "min_ready_pods": 12,
  "avg_ready_pods": 12,
  "max_ready_pods": 12,
  "max_restarts_total": 0,
  "frontend_latency_p50_ms": 68.4,
  "frontend_latency_p95_ms": 142.7,
  "frontend_latency_p99_ms": 210.3,
  "frontend_latency_max_ms": 340.1,
  "request_count_total": 1823,
  "request_success_count": 1820,
  "request_failure_count": 3,
  "avg_requests_per_second": 15.19,
  "load_concurrent_users": 10,
  "load_users_per_second": 1,
  "load_profile_source": "manual",
  "node_hourly_price_usd": null,
  "benchmark_compute_cost_usd": null,
  "cost_per_1m_requests_usd": null,
  "metrics_coverage_ratio": 1.0,
  "missing_metrics": [],
  "empty_metrics": [],
  "invalid_metric_samples": {},
  "summary_status": "complete"
}

For priced cloud runs, node_hourly_price_usd, benchmark_compute_cost_usd, and cost_per_1m_requests_usd are populated from the machine pricing table in automation/templates/machine-pricing.json.
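
One plausible way the two derived cost fields relate to the hourly price is sketched below. The formulas are an assumption for illustration (price times node count times hours, then scaled per million requests); the project's own calculation may differ, for example in how it treats rounding or multi-node pools. The function name compute_cost_fields is hypothetical.

```python
from typing import Optional


def compute_cost_fields(
    node_hourly_price_usd: float,
    node_count: int,
    duration_seconds: float,
    request_count_total: int,
) -> dict:
    """Derive run cost and cost per million requests (assumed formulas)."""
    # Assumption: cost scales linearly with node count and wall-clock hours.
    cost = node_hourly_price_usd * node_count * duration_seconds / 3600.0
    per_million: Optional[float] = None
    if request_count_total > 0:
        per_million = cost / request_count_total * 1_000_000
    return {
        "benchmark_compute_cost_usd": cost,
        "cost_per_1m_requests_usd": per_million,
    }
```

Under these assumptions, a single node priced at 0.36 USD per hour over a 20-minute window costs about 0.12 USD for the run.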

summary_status is complete only when all required metrics are present, the coverage ratio is at or above 0.95, and the load-generator stats parsed successfully. A partial summary is still written but excluded from comparability validation.
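
The completeness rule can be written down as a small predicate. A minimal sketch under stated assumptions: the 0.95 threshold comes from the text above, while the function name, the load_stats_parsed flag, and the literal "partial" as the non-complete value are illustrative guesses rather than the generator's actual code:

```python
from typing import Sequence

MIN_COVERAGE_RATIO = 0.95  # threshold stated in the text


def summary_status(
    missing_metrics: Sequence[str],
    metrics_coverage_ratio: float,
    load_stats_parsed: bool,
) -> str:
    """Return "complete" only when all three conditions hold."""
    complete = (
        not missing_metrics
        and metrics_coverage_ratio >= MIN_COVERAGE_RATIO
        and load_stats_parsed
    )
    # Assumption: any other outcome is reported as "partial".
    return "complete" if complete else "partial"
```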


Viewing results in the portable dashboard

Once you have benchmark results stored in BigQuery or in a local NDJSON file, you can generate a portable dashboard from them. This dashboard is separate from the live Grafana dashboard: it visualizes stored summaries after runs complete, not live Kubernetes metrics during a run.

python3 automation/scripts/launch_metrics_dashboard.py \
  --project-id "$project_id" \
  --dataset-id silicon_boutique \
  --table-id benchmark_summaries \
  --location US \
  --no-browser

The launcher writes artifacts/dashboard/index.html and artifacts/dashboard/dashboard-data.json, then prints a localhost URL. Use --no-serve when you want the files without starting a local server. The dashboard groups runs by machine type, processor family, architecture, and load profile, ranks them on latency, throughput, memory efficiency, and cost, and lists rejected runs explicitly rather than silently dropping them.
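
The grouping and ranking idea can be sketched in a few lines. This is an illustration, not the dashboard's implementation: the choice of group fields (in particular using load_concurrent_users and load_users_per_second to stand for the load profile), ranking by mean p95 latency, and the function name rank_groups are all assumptions.

```python
from collections import defaultdict
from statistics import mean

# Assumed grouping key: hardware identity plus load profile fields.
GROUP_FIELDS = (
    "machine_type",
    "processor_family",
    "architecture",
    "load_concurrent_users",
    "load_users_per_second",
)


def rank_groups(summaries, metric="frontend_latency_p95_ms"):
    """Group comparable runs, rank groups by mean metric, and list rejects."""
    groups = defaultdict(list)
    rejected = []
    for summary in summaries:
        value = summary.get(metric)
        if summary.get("summary_status") != "complete" or value is None:
            # Keep rejected runs visible instead of dropping them silently.
            rejected.append(summary.get("run_id"))
            continue
        key = tuple(summary.get(field) for field in GROUP_FIELDS)
        groups[key].append(value)
    ranked = sorted(
        ((key, mean(values)) for key, values in groups.items()),
        key=lambda item: item[1],  # lower latency ranks first
    )
    return ranked, rejected
```

Returning the rejected run IDs alongside the ranking reflects the dashboard's habit of listing rejected runs explicitly.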


The end-to-end acceptance demo

The acceptance demo is the fastest way to prove the full local path works from start to finish:

python3 automation/scripts/run_acceptance_demo.py \
  --mode local \
  --run-id local-demo \
  --test-duration 2m \
  --min-duration-seconds 60

This runs the benchmark, verifies that Grafana actually loaded the dashboard, checks the summary, validates summary quality and comparability readiness, and writes artifacts/acceptance-demo-report.json. For the demo to succeed, the report’s checks.dashboard.grafana_load_status.status must be passed. That check queries the Grafana API using the generated admin secret; confirming that the ConfigMap exists is not enough.
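
If you want to gate a script on that field, reading it back takes only a few lines. A minimal sketch, assuming the report is plain JSON with the nested path named above; the helper name grafana_check_passed is hypothetical:

```python
import json
from pathlib import Path


def grafana_check_passed(report_path: str) -> bool:
    """True when checks.dashboard.grafana_load_status.status is "passed"."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    status = (
        report.get("checks", {})
        .get("dashboard", {})
        .get("grafana_load_status", {})
        .get("status")
    )
    return status == "passed"
```

A missing key at any level is treated as a failed check rather than an error.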

For a bounded inspection window before cleanup resumes:

python3 automation/scripts/run_acceptance_demo.py \
  --mode local \
  --dashboard-hold-seconds 120

This gives you two minutes to browse the dashboard before the script tears everything down and writes the final report.


Multi-cloud comparison

Once you have GCP and AWS artifact sets downloaded, the acceptance matrix ties them together:

python3 automation/scripts/run_acceptance_matrix.py \
  --mode verify \
  --gcp-artifacts artifacts/gcp-run \
  --aws-artifacts artifacts/aws-run

This writes artifacts/acceptance-matrix-report.json, acceptance-matrix-comparison.json, and a Markdown comparison table ranking the two providers on latency, throughput, and cost. Each artifact directory must contain matching workflow-trace.json, benchmark-summary.json, acceptance-demo-report.json, comparability-report.json, bigquery-load-report.json, and teardown-status.env for the same run_id.
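
Before running the matrix, it can save time to confirm that each directory is complete. A minimal pre-flight sketch using the file names listed above; the helper name missing_artifacts is hypothetical, and it checks only for presence, not for a matching run_id:

```python
from pathlib import Path
from typing import List

REQUIRED_ARTIFACTS = (
    "workflow-trace.json",
    "benchmark-summary.json",
    "acceptance-demo-report.json",
    "comparability-report.json",
    "bigquery-load-report.json",
    "teardown-status.env",
)


def missing_artifacts(directory: str) -> List[str]:
    """Return the required artifact file names absent from a directory."""
    root = Path(directory)
    return [name for name in REQUIRED_ARTIFACTS if not (root / name).is_file()]
```

An empty list for both artifacts/gcp-run and artifacts/aws-run means the file set is complete.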

After verification, pull up the full picture in the portable dashboard:

python3 automation/scripts/launch_metrics_dashboard.py \
  --project-id "$project_id" \
  --dataset-id silicon_boutique \
  --table-id benchmark_summaries \
  --location US \
  --no-browser

At that point you have reproducible data, teardown evidence, and a visual comparison sitting in one place. That’s the whole point.