Automating cloud processor benchmarks with SiliconBoutique

Comparing cloud processors sounds like a weekend project until you’re three days in and still fighting reproducibility. Every run needs the same workload, the same observability stack, the same teardown guarantee, and somewhere to put the results that doesn’t evaporate after the next run. SiliconBoutique puts all of that into a single pipeline you can trigger from one command, inspect in a live Grafana dashboard or a portable HTML page, and compare across GCP and AWS without opening a spreadsheet.

What the system delivers

SiliconBoutique deploys Google’s Online Boutique microservices demo as the benchmark workload, collects CPU and memory metrics during a measured load window, writes a canonical BenchmarkSummary row to BigQuery for cloud runs and configured local runs, and gives you two ways to look at results: a private Grafana dashboard scoped to the live run, and a portable HTML dashboard generated from stored history.

The full pipeline is built around four concrete outputs:

A Terraform-managed execution target: a fresh namespace in the local minikube profile, or a fresh GKE/EKS cluster for cloud runs, destroyed after the run.
Structured benchmark artifacts covering CPU utilization, memory working set, CPU throttling, frontend probe latency, pod readiness, and container restarts.
A Grafana dashboard scoped to the exact run and benchmark window, accessible via port-forward during local runs or verified through the Grafana API during cloud runs.
A BenchmarkSummary row in BigQuery for cloud runs, and for local runs when BigQuery settings are configured, with cost fields, load profile metadata, and quality indicators that persist across teardown.

How to run a benchmark

The main entrypoint is run_benchmark_workflow.py. It handles local runs, GCP dispatches, and AWS dispatches from a single interface. Pick the command that fits your situation.

Local smoke run

python3 automation/scripts/run_benchmark_workflow.py \
  --target local \
  --profile smoke \
  --bigquery-env-file credential.env

This provisions a namespace in the siliconboutique minikube profile, deploys Online Boutique and the monitoring stack, runs the load generator, extracts Prometheus metrics, validates the summary, and tears down. If your credential.env has valid BigQuery settings, it also persists the row and verifies it remotely.

GCP remote benchmark

python3 automation/scripts/run_benchmark_workflow.py \
  --target gcp \
  --project-id "$project_id" \
  --bigquery-env-file credential.env

This dispatches .github/workflows/benchmark.yml through the GitHub Actions API, waits for the workflow to finish, downloads the artifact, and verifies acceptance evidence. GCP Terraform never runs locally. The cloud workflow derives a run_id as gha-<github-run-id>-<attempt> and threads it through run-scoped resources, labels, artifacts, and the BigQuery row.
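
The run_id format is simple enough to sketch. The snippet below is a minimal illustration, not the workflow's actual code: it assumes the two parts come from the default GitHub Actions variables GITHUB_RUN_ID and GITHUB_RUN_ATTEMPT, and derive_run_id is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def derive_run_id(env: Optional[Mapping[str, str]] = None) -> str:
    """Build a run_id of the form gha-<github-run-id>-<attempt>.

    Assumption: the inputs are the default GitHub Actions variables
    GITHUB_RUN_ID and GITHUB_RUN_ATTEMPT. The helper name is hypothetical.
    """
    env = os.environ if env is None else env
    run_id = env["GITHUB_RUN_ID"]
    # Fall back to attempt 1 when the variable is absent.
    attempt = env.get("GITHUB_RUN_ATTEMPT", "1")
    return f"gha-{run_id}-{attempt}"
```

Including the attempt number means a re-run of the same workflow gets its own identifier, so its resources and BigQuery row cannot collide with the first attempt.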

Dashboard from stored results

python3 automation/scripts/launch_metrics_dashboard.py \
  --project-id "$project_id" \
  --dataset-id silicon_boutique \
  --table-id benchmark_summaries \
  --location US \
  --no-browser

This writes artifacts/dashboard/index.html and artifacts/dashboard/dashboard-data.json from BigQuery history. The same command works against a local NDJSON store if you swap in --summary-store artifacts/benchmark-summaries.ndjson.
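
If you want to inspect the local store yourself, NDJSON is just one JSON object per line. Here is a minimal reader, assuming each non-empty line holds one complete summary; load_summaries is a hypothetical helper, not part of the project's scripts.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_summaries(path: str) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line from an NDJSON file."""
    rows = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():  # tolerate blank lines between records
            rows.append(json.loads(line))
    return rows
```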

Architecture overview

The system has five layers. Each layer communicates through structured artifacts and never imports the internals of another layer directly.

Infrastructure layer

Terraform manages four roots. Three are ephemeral and scoped to a single run_id: the local Kubernetes namespace, the GCP GKE cluster with its VPC, and the AWS EKS cluster with its node group. The fourth, gcp-bigquery, is durable and never part of teardown. It provisions the BigQuery dataset and table once, and every cloud benchmark run appends a row to it; local runs do the same when BigQuery persistence is configured.

Label-capable or tag-capable resources created by an ephemeral root carry the run_id and provider metadata. Teardown fails the workflow job if Terraform destroy reports a non-zero exit code.
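
The fail-on-destroy behavior comes down to propagating the exit code. A rough sketch, assuming teardown is driven from Python; destroy_or_fail is a hypothetical name, and the run parameter exists only so the behavior can be exercised without Terraform installed.

```python
import subprocess
import sys
from typing import Callable, Sequence


def destroy_or_fail(
    command: Sequence[str] = ("terraform", "destroy", "-auto-approve"),
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> None:
    """Run the destroy command and exit non-zero if it fails."""
    result = run(list(command), check=False)
    if result.returncode != 0:
        # Propagating the exit code is what makes the CI job fail.
        sys.exit(result.returncode)
```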

Workload layer

The Online Boutique workload is packaged as a Helm chart pinned to upstream onlineboutique version 0.10.5. A custom post-renderer written in Python injects SiliconBoutique run labels, teardown annotations, and load-generator environment variables into every rendered resource. This keeps the upstream chart untouched while guaranteeing every pod carries the right metadata for Prometheus label matching.
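
To make the label injection concrete, here is a sketch of the core transformation on a single already-parsed resource. It is not the project's post-renderer: the label keys are invented for illustration, and YAML parsing, annotations, and environment variables are left out.

```python
import copy
from typing import Any, Dict

# Hypothetical label keys; the real post-renderer may use different names.
RUN_ID_LABEL = "siliconboutique.example/run-id"
PROVIDER_LABEL = "siliconboutique.example/provider"


def inject_run_labels(resource: Dict[str, Any], run_id: str, provider: str) -> Dict[str, Any]:
    """Return a copy of a rendered Kubernetes resource with run labels added."""
    patched = copy.deepcopy(resource)
    labels = patched.setdefault("metadata", {}).setdefault("labels", {})
    labels[RUN_ID_LABEL] = run_id
    labels[PROVIDER_LABEL] = provider
    # Workload kinds also need the labels on the pod template so pods carry them.
    template = (patched.get("spec") or {}).get("template")
    if isinstance(template, dict):
        pod_labels = template.setdefault("metadata", {}).setdefault("labels", {})
        pod_labels[RUN_ID_LABEL] = run_id
        pod_labels[PROVIDER_LABEL] = provider
    return patched
```

Working on a deep copy keeps the function free of side effects, which makes it easy to test against rendered chart output.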

The monitoring chart installs a full kube-prometheus-stack and a prometheus-blackbox-exporter, defines the silicon_boutique:* recording rules, and provisions the Grafana dashboard through a sidecar ConfigMap.

Automation layer

Each automation script reads from an explicit input file and writes to an explicit output file. The unified coordinator, run_benchmark_workflow.py, calls the lower-level scripts directly for local runs or dispatches the GitHub Actions workflows for cloud runs. The same scripts run in a local terminal or inside GitHub Actions without modification.

The benchmark pipeline sequence

A single benchmark run follows this sequence whether it runs locally, on GCP, or on AWS.

The automation waits for the silicon_boutique:* recording rules to produce live data before the clock starts. This reduces startup and warmup noise in the benchmark window, so the measured window is much closer to steady-state behavior.
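
The wait-before-measuring step is essentially a poll loop with a deadline. A minimal sketch, assuming a caller-supplied has_samples callback that reports whether the recording rules return data; the function name, defaults, and callback are all hypothetical.

```python
import time
from typing import Callable


def wait_for_live_data(
    has_samples: Callable[[], bool],
    timeout_seconds: float = 300.0,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Block until has_samples() is true, then return the clock reading at that moment."""
    deadline = clock() + timeout_seconds
    while True:
        if has_samples():
            return clock()  # the benchmark window starts here
        if clock() >= deadline:
            raise TimeoutError("recording rules produced no data before the timeout")
        sleep(poll_seconds)
```

Raising on timeout rather than starting anyway avoids recording a benchmark window that has no metrics behind it.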

The teardown phase runs with if: always() in GitHub Actions, so it executes even after extraction or summary failures. Terraform apply and destroy stay in the same job to keep the state file local and available for cleanup.

Setting up on Google Cloud

Before dispatching a live GCP benchmark, three things need to be in place: a GCP project with billing enabled, a durable BigQuery destination, and GitHub OIDC credentials.

Provision durable BigQuery storage

Provision once, before the first benchmark. This root is intentionally separate from the GKE root and must not be destroyed as part of teardown.

cd infra/terraform/gcp-bigquery
terraform init
terraform apply -auto-approve \
  -var="project_id=YOUR_GCP_PROJECT_ID" \
  -var="static_validation_mode=false"

This creates the silicon_boutique dataset and benchmark_summaries table, partitioned by benchmark_start and clustered by machine_type, processor_family, architecture, and run_id. To grant the GitHub Actions writer service account access at the same time, pass it as a variable.

terraform apply -auto-approve \
  -var="project_id=YOUR_GCP_PROJECT_ID" \
  -var="summary_writer_service_accounts=[\"your-sa@your-project.iam.gserviceaccount.com\"]" \
  -var="static_validation_mode=false"

Configure GitHub OIDC and secrets

The benchmark workflow authenticates to GCP using GitHub OIDC, so no long-lived service account keys need to be stored. Set two secrets in your repository:

GCP_WORKLOAD_IDENTITY_PROVIDER — the full workload identity provider resource name.
GCP_SERVICE_ACCOUNT — the service account email.

The service account needs roles/bigquery.jobUser at the project level and roles/bigquery.dataEditor on the durable dataset. The BigQuery Terraform root manages these grants when you pass the service account email to summary_writer_service_accounts.

Dispatch a benchmark

The coordinator dispatches the workflow and waits for it by default:

python3 automation/scripts/run_benchmark_workflow.py \
  --target gcp \
  --project-id "$project_id" \
  --machine-type c3-standard-4 \
  --processor-family c3 \
  --bigquery-env-file credential.env

To dispatch without waiting:

python3 automation/scripts/run_benchmark_workflow.py \
  --target gcp \
  --project-id "$project_id" \
  --bigquery-env-file credential.env \
  --no-wait \
  --dashboard skip

You can also dispatch directly through the gh CLI if you want control over every input:

gh workflow run benchmark.yml \
  -f project_id="YOUR_GCP_PROJECT_ID" \
  -f region=us-central1 \
  -f zone=us-central1-a \
  -f machine_type=c3-standard-4 \
  -f processor_family=c3 \
  -f architecture=x86_64 \
  -f pricing_model=spot \
  -f test_duration=20m \
  -f bigquery_dataset=silicon_boutique \
  -f bigquery_table=benchmark_summaries \
  -f bigquery_location=US \
  -f acceptance_demo=true

Running locally in the devcontainer

The devcontainer is the fastest path to a working local environment. It pins Terraform 1.15.2, kubectl 1.36.0, Helm 4.1.4, and minikube 1.38.1. The post-create script boots the siliconboutique minikube profile automatically when Docker is reachable. Open the repository in VS Code with the Dev Containers extension and the environment is ready.

Local smoke benchmark

python3 automation/scripts/run_benchmark_workflow.py \
  --target local \
  --profile smoke \
  --bigquery-env-file credential.env

For lower-level debugging, you can call the orchestration script directly with a short duration:

python3 automation/scripts/run_local_benchmark.py \
  --test-duration 2m \
  --min-duration-seconds 60

Inspect the Grafana dashboard locally

Add --skip-destroy to leave the namespace alive after extraction, then port-forward Grafana in a second terminal:

python3 automation/scripts/run_local_benchmark.py \
  --test-duration 5m \
  --min-duration-seconds 60 \
  --skip-destroy

# In another terminal:
cd infra/terraform/local-kubernetes
namespace="$(terraform output -raw namespace)"
cd ../../..

kubectl port-forward service/sb-monitoring-grafana \
  3000:80 \
  --namespace "$namespace" \
  --context siliconboutique

Get the admin password:

kubectl get secret sb-monitoring-grafana \
  --namespace "$namespace" \
  --context siliconboutique \
  -o jsonpath='{.data.admin-password}' | base64 -d

Open http://127.0.0.1:3000, log in as admin with the password retrieved above, and open the SiliconBoutique Online Boutique Benchmark dashboard. The default time range is the last 30 minutes, which covers short local runs.

What the Grafana dashboard shows

The dashboard has seven panels, each scoped to the exact run_id and namespace of the current run.

The CPU utilization gauge is worth pointing out specifically. It measures workload CPU usage divided by allocatable CPU cores on nodes running benchmark pods, with node capacity as a fallback when allocatable metrics are unavailable. This normalization against actual node capacity makes it meaningful for cross-machine comparison.
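
The normalization can be written as a small function. This is a sketch of the arithmetic only, not the PromQL behind the panel; the function and parameter names are illustrative.

```python
from typing import Optional


def cpu_utilization_pct(
    usage_cores: float,
    allocatable_cores: Optional[float],
    capacity_cores: Optional[float],
) -> Optional[float]:
    """Workload CPU usage as a percentage of allocatable cores, falling back to capacity."""
    denominator = allocatable_cores if allocatable_cores else capacity_cores
    if not denominator:
        return None  # neither metric is available, so the ratio is undefined
    return 100.0 * usage_cores / denominator
```

For example, 2.0 cores of usage on nodes with 4 allocatable cores reads as 50 percent, while the same usage on 8 cores reads as 25 percent.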

What a benchmark summary looks like

After a successful run, artifacts/benchmark-summary.json contains the canonical result. Here’s a representative example from a local smoke run:

{
  "run_id": "local-smoke-20260514-120000",
  "namespace": "silicon-boutique-local-smoke-20260514-120000",
  "environment": "local",
  "cloud_provider": "local",
  "region": "local",
  "zone": "local",
  "machine_type": "local",
  "processor_family": "local",
  "cpu_platform": null,
  "architecture": "x86_64",
  "node_count": 1,
  "pricing_model": "local",
  "benchmark_start": "2026-05-14T12:00:00Z",
  "benchmark_end": "2026-05-14T12:02:00Z",
  "duration_seconds": 120,
  "generated_at": "2026-05-14T12:02:10Z",
  "avg_cpu_usage_cores": 1.87,
  "max_cpu_usage_cores": 2.49,
  "avg_cpu_utilization_pct": 46.8,
  "max_cpu_utilization_pct": 62.3,
  "avg_memory_working_set_bytes": 1432150016,
  "max_memory_working_set_bytes": 1480000000,
  "max_memory_used_gb": 1.48,
  "avg_cpu_throttling_ratio": 0.01,
  "max_cpu_throttling_ratio": 0.03,
  "min_ready_pods": 12,
  "avg_ready_pods": 12,
  "max_ready_pods": 12,
  "max_restarts_total": 0,
  "frontend_latency_p50_ms": 68.4,
  "frontend_latency_p95_ms": 142.7,
  "frontend_latency_p99_ms": 210.3,
  "frontend_latency_max_ms": 340.1,
  "request_count_total": 1823,
  "request_success_count": 1820,
  "request_failure_count": 3,
  "avg_requests_per_second": 15.19,
  "load_concurrent_users": 10,
  "load_users_per_second": 1,
  "load_profile_source": "manual",
  "node_hourly_price_usd": null,
  "benchmark_compute_cost_usd": null,
  "cost_per_1m_requests_usd": null,
  "metrics_coverage_ratio": 1.0,
  "missing_metrics": [],
  "empty_metrics": [],
  "invalid_metric_samples": {},
  "summary_status": "complete"
}
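
Several fields in the example are related by simple arithmetic, which makes for a quick sanity check. The sketch below assumes that requests per second is total requests divided by duration and that gigabytes are decimal (bytes divided by 1e9). Both relations hold for the sample values, but the project's exact derivations are not shown here, so treat them as assumptions.

```python
from typing import Any, Dict, List

# Values copied from the sample summary above.
summary = {
    "duration_seconds": 120,
    "request_count_total": 1823,
    "request_success_count": 1820,
    "request_failure_count": 3,
    "avg_requests_per_second": 15.19,
    "max_memory_working_set_bytes": 1480000000,
    "max_memory_used_gb": 1.48,
}


def check_summary_consistency(s: Dict[str, Any], tolerance: float = 0.005) -> List[str]:
    """Return a list of inconsistencies between related summary fields."""
    problems = []
    if s["request_success_count"] + s["request_failure_count"] != s["request_count_total"]:
        problems.append("success + failure != total requests")
    # Assumed relation: average RPS = total requests / window duration.
    rps = s["request_count_total"] / s["duration_seconds"]
    if abs(rps - s["avg_requests_per_second"]) > tolerance:
        problems.append("avg_requests_per_second does not match count / duration")
    # Assumed relation: decimal gigabytes (1 GB = 1e9 bytes).
    gigabytes = s["max_memory_working_set_bytes"] / 1e9
    if abs(gigabytes - s["max_memory_used_gb"]) > tolerance:
        problems.append("max_memory_used_gb does not match the byte count")
    return problems
```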

For priced cloud runs, node_hourly_price_usd, benchmark_compute_cost_usd, and cost_per_1m_requests_usd are populated from the machine pricing table in automation/templates/machine-pricing.json.
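
One plausible way to derive the two computed cost fields from the hourly price is sketched below. The formulas are an assumption for illustration (hourly price times node count times duration in hours, then scaled to one million requests); the project's actual calculation may differ.

```python
from typing import Optional, Tuple


def compute_costs(
    node_hourly_price_usd: Optional[float],
    node_count: int,
    duration_seconds: float,
    request_count_total: int,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (benchmark_compute_cost_usd, cost_per_1m_requests_usd).

    Assumed formulas, not confirmed by the project.
    """
    if node_hourly_price_usd is None:
        return None, None  # unpriced runs, such as local ones, keep null cost fields
    compute_cost = node_hourly_price_usd * node_count * duration_seconds / 3600
    if request_count_total <= 0:
        return compute_cost, None
    cost_per_1m = compute_cost * 1_000_000 / request_count_total
    return compute_cost, cost_per_1m
```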

summary_status is complete only when all required metrics are present, the coverage ratio is at or above 0.95, and the load-generator stats parsed successfully. A partial summary is still written but excluded from comparability validation.
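
The completeness rule reduces to three conditions. A minimal sketch, assuming missing metrics are reported in the missing_metrics list, that the non-complete status string is "partial", and that load-generator parsing is exposed as a boolean; load_stats_parsed is a hypothetical parameter.

```python
from typing import Sequence

COVERAGE_THRESHOLD = 0.95


def summary_status(
    missing_metrics: Sequence[str],
    metrics_coverage_ratio: float,
    load_stats_parsed: bool,
) -> str:
    """Return "complete" only when all three conditions hold, else "partial"."""
    complete = (
        not missing_metrics
        and metrics_coverage_ratio >= COVERAGE_THRESHOLD
        and load_stats_parsed
    )
    return "complete" if complete else "partial"
```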

Viewing results in the portable dashboard

Once you have benchmark results stored in BigQuery or in a local NDJSON file, you can generate a portable dashboard from them. This dashboard is separate from the live Grafana dashboard: it visualizes stored summaries after runs complete, not live Kubernetes metrics during a run.

python3 automation/scripts/launch_metrics_dashboard.py \
  --project-id "$project_id" \
  --dataset-id silicon_boutique \
  --table-id benchmark_summaries \
  --location US \
  --no-browser

The launcher writes artifacts/dashboard/index.html and artifacts/dashboard/dashboard-data.json, then prints a localhost URL. Use --no-serve when you want the files without starting a local server. The dashboard groups runs by machine type, processor family, architecture, and load profile, ranks them on latency, throughput, memory efficiency, and cost, and lists rejected runs explicitly rather than silently dropping them.
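
The grouping and ranking idea can be sketched on a single metric. The code below is illustrative rather than the launcher's logic: it ranks groups by mean p95 latency only, treats any non-complete summary as rejected, and uses the two load_* fields as a stand-in for the load profile.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List, Tuple

# Assumed grouping key; the load_* fields stand in for "load profile".
GROUP_KEYS = (
    "machine_type",
    "processor_family",
    "architecture",
    "load_concurrent_users",
    "load_users_per_second",
)


def rank_groups(summaries: List[Dict[str, Any]]) -> Tuple[List[Tuple[tuple, float]], List[Any]]:
    """Group complete runs, rank groups by mean p95 latency, and list rejected run_ids."""
    groups = defaultdict(list)
    rejected = []
    for row in summaries:
        if row.get("summary_status") == "complete":
            key = tuple(row.get(name) for name in GROUP_KEYS)
            groups[key].append(row["frontend_latency_p95_ms"])
        else:
            rejected.append(row.get("run_id"))  # kept visible instead of dropped
    ranking = sorted(
        ((key, mean(values)) for key, values in groups.items()),
        key=lambda item: item[1],  # lower latency ranks first
    )
    return ranking, rejected
```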

The end-to-end acceptance demo

The acceptance demo is the fastest way to prove the full local path works from start to finish:

python3 automation/scripts/run_acceptance_demo.py \
  --mode local \
  --run-id local-demo \
  --test-duration 2m \
  --min-duration-seconds 60

This runs the benchmark, verifies through the Grafana API that the dashboard actually loaded, checks the summary, validates summary quality and comparability readiness, and writes artifacts/acceptance-demo-report.json. For the demo to succeed, the report’s checks.dashboard.grafana_load_status.status must be passed. That check queries the Grafana API using the generated admin secret; confirming that the ConfigMap exists is not enough.

For a bounded inspection window before cleanup resumes:

python3 automation/scripts/run_acceptance_demo.py \
  --mode local \
  --dashboard-hold-seconds 120

This gives you two minutes to browse the dashboard before the script tears everything down and writes the final report.

Multi-cloud comparison

Once you have GCP and AWS artifact sets downloaded, the acceptance matrix ties them together:

python3 automation/scripts/run_acceptance_matrix.py \
  --mode verify \
  --gcp-artifacts artifacts/gcp-run \
  --aws-artifacts artifacts/aws-run

This writes artifacts/acceptance-matrix-report.json, acceptance-matrix-comparison.json, and a Markdown comparison table ranking the two providers on latency, throughput, and cost. Each artifact directory must contain matching workflow-trace.json, benchmark-summary.json, acceptance-demo-report.json, comparability-report.json, bigquery-load-report.json, and teardown-status.env for the same run_id.
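
Before running the matrix, it can help to confirm that each artifact directory is complete. This sketch checks file presence only, using the file names listed above. It does not verify that the files share a run_id, and missing_artifacts is a hypothetical helper.

```python
from pathlib import Path
from typing import List

REQUIRED_FILES = (
    "workflow-trace.json",
    "benchmark-summary.json",
    "acceptance-demo-report.json",
    "comparability-report.json",
    "bigquery-load-report.json",
    "teardown-status.env",
)


def missing_artifacts(directory: str) -> List[str]:
    """Return the required artifact file names that are absent from the directory."""
    root = Path(directory)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```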

After verification, pull up the full picture in the portable dashboard:

python3 automation/scripts/launch_metrics_dashboard.py \
  --project-id "$project_id" \
  --dataset-id silicon_boutique \
  --table-id benchmark_summaries \
  --location US \
  --no-browser

At that point you have reproducible data, teardown evidence, and a visual comparison sitting in one place. That’s the whole point.
