Debug Aro Hcp E2e.mdc
How to debug ARO HCP e2e tests using CI artifacts and common workflows
Debugging ARO HCP e2e Tests
Use this rule when a PR or CI job for ARO HCP (Azure) e2e tests fails. It shows where to look in the CI artifacts and prescribes fast triage workflows. See also: docs/content/reference/test-information-debugging/Azure/test-artifacts-directory-structure.md
Quick links to artifacts
- Hosted control plane components
  - Control plane pod deployments: artifacts/e2e-aks/hypershift-azure-run-e2e/artifacts/Test*/namespaces/e2e-clusters-*/apps/deployments/
  - Control plane pod manifests: artifacts/e2e-aks/hypershift-azure-run-e2e/artifacts/Test*/namespaces/e2e-clusters-*/core/pods/
  - Control plane pod logs: artifacts/e2e-aks/hypershift-azure-run-e2e/artifacts/Test*/namespaces/e2e-clusters-*/core/pods/logs/
- HyperShift management cluster (namespace hypershift/)
  - Operator deployment: .../namespaces/hypershift/apps/deployments/operator.yaml
  - External DNS deployment: .../namespaces/hypershift/apps/deployments/external-dns.yaml
  - Operator logs: .../namespaces/hypershift/core/pods/logs/operator-*-operator.log
  - External DNS logs: .../namespaces/hypershift/core/pods/logs/external-dns-*-external-dns.log
- Primary test directory: artifacts/e2e-aks/hypershift-azure-run-e2e/
- Top-level CI files
  - Build log: artifacts/build-log.txt
  - CI operator log: artifacts/ci-operator-*/ci-operator.log
  - JUnit: artifacts/e2e-aks/hypershift-azure-run-e2e/artifacts/junit.xml
  - Job result: finished.json
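As a quick first pass over the JUnit file listed above, the following minimal sketch lists the failed or errored test cases. It assumes the file uses the common JUnit XML layout (testcase elements with failure or error children). The function name failed_testcases is illustrative and not part of any existing tooling.

```python
import xml.etree.ElementTree as ET


def failed_testcases(junit_path: str) -> list[tuple[str, str]]:
    """Return (test name, first line of the failure text) for each failed or errored case."""
    root = ET.parse(junit_path).getroot()
    failures = []
    # iter() finds <testcase> at any depth, so <testsuites> and <testsuite> roots both work.
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is not None:
                # Prefer the message attribute; fall back to the element body.
                text = (node.get("message") or node.text or "").strip()
                first_line = text.splitlines()[0] if text else ""
                failures.append((case.get("name", ""), first_line))
                break
    return failures
```

Calling failed_testcases on a downloaded copy of junit.xml returns the test names to look for under Test*/, which narrows down which scenario directory to open first.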
Start here: critical HyperShift resources
Check the status of these first; their .status often names the failing subsystem:
- HostedCluster: .../namespaces/e2e-clusters-*/hypershift.openshift.io/hostedclusters/*.yaml
- HostedControlPlane: .../namespaces/e2e-clusters-*-{test-name}-*/hypershift.openshift.io/hostedcontrolplanes/*.yaml
- NodePool: .../namespaces/e2e-clusters-*/hypershift.openshift.io/nodepools/*.yaml
Expect to see:
- Overall readiness conditions
- Infra provisioning state
- Control plane component health
- NodePool scaling/readiness
- Failure reasons/messages
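To scan those status fields quickly, a small helper can flatten status.conditions into one line per condition. This is a sketch under assumptions: each dumped file is taken to hold a single resource with the usual Kubernetes condition fields (type, status, reason, message), and the resource is already loaded into a dict, for example with PyYAML's yaml.safe_load if that package is installed. The name summarize_conditions is hypothetical.

```python
def summarize_conditions(resource: dict) -> list[str]:
    """Format each entry of status.conditions as 'Type=Status (Reason): message'."""
    conditions = (resource.get("status") or {}).get("conditions") or []
    lines = []
    for cond in conditions:
        lines.append(
            f"{cond.get('type', '?')}={cond.get('status', '?')} "
            f"({cond.get('reason', '')}): {cond.get('message', '')}"
        )
    return lines
```

The helper deliberately does not decide which conditions are unhealthy, because the healthy value depends on the condition type (some are good when True, others when False). Read the output with that in mind.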
Per-test essentials
Each scenario is under .../hypershift-azure-run-e2e/artifacts/Test*/:
- create.log: hosted cluster creation; start here for provisioning issues
- destroy.log: teardown
- dump.log: comprehensive dump of cluster state
- infrastructure.log: Azure provisioning details
- hostedcluster.tar: full hosted cluster config
- namespaces/: all K8s and HyperShift resources, including control plane pods and logs
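When several scenarios ran, it can help to see at a glance which of these files each Test* directory actually contains. The sketch below assumes the artifacts were downloaded locally; the function name missing_artifacts and the EXPECTED list are illustrative, built only from the names in this section.

```python
from pathlib import Path

# File and directory names listed in this section.
EXPECTED = [
    "create.log",
    "destroy.log",
    "dump.log",
    "infrastructure.log",
    "hostedcluster.tar",
    "namespaces",
]


def missing_artifacts(e2e_artifacts_dir: str) -> dict[str, list[str]]:
    """Map each Test* directory name to the expected entries it lacks."""
    report = {}
    for test_dir in sorted(Path(e2e_artifacts_dir).glob("Test*")):
        if test_dir.is_dir():
            report[test_dir.name] = [
                name for name in EXPECTED if not (test_dir / name).exists()
            ]
    return report
```

Point it at a local copy of .../hypershift-azure-run-e2e/artifacts/. A scenario that lacks dump.log or namespaces/ may have stopped before the dump step, though that inference should be confirmed in the build log.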
Fast triage workflows
When control plane is not healthy
- Open finished.json for the failure type.
- Inspect HostedCluster/HostedControlPlane status for failing conditions.
- Read Test*/create.log for creation errors.
- Examine control plane pods: .../e2e-clusters-*-{test-name}-*/core/pods/.
- Pull component logs: core/pods/logs/{component}-*-{container}.log.
When nodes do not join or scale
- Check NodePool status for replicas/conditions.
- Review CAPI controllers:
  - Cluster API: cluster-api-*.{yaml,log}
  - Azure provider: capi-provider-*.{yaml,log}
- Verify bootstrapping: ignition-server-*.{yaml,log}.
- CSR approvals: machine-approver-*.{yaml,log}.
- Control plane coordination: control-plane-operator-*.{yaml,log}.
When management operator reports errors
- Operator reconciliation: hypershift/core/pods/logs/operator-*-operator.log.
- Operator init: operator-*-init-environment.log.
- DNS issues (Azure DNS): external-dns-*-external-dns.log.
- Cross-check hosted control plane namespace for component-level failures.
Component hotspots
- etcd: etcd-0.yaml, etcd-0-*.log
  - Look for quorum, storage, connectivity
- kube-apiserver: kube-apiserver-*.{yaml,log} and audit logs
  - TLS, etcd connectivity, RBAC/authN/Z
- kube-controller-manager / scheduler: kube-controller-manager-*, kube-scheduler-*
  - Resource reconciliation, scheduling constraints
- OpenShift API server and OAuth server
  - OpenShift API availability and authentication failures
Infrastructure and CI
- AKS provision logs: artifacts/e2e-aks/aks-provision/build-log.txt
- Azure resource actions: Test*/infrastructure.log
- Network: look for cloud-network-config-controller in hosted control plane namespace
- CI operator: artifacts/ci-operator-*/ci-operator.log for high-level pipeline errors
Common failure patterns
- Azure API/quotas: errors in capi-provider-* or infrastructure.log
- DNS propagation/permissions: external-dns-*-external-dns.log
- Certificates/CSR: machine-approver-* and kube-apiserver TLS errors
- etcd health: etcd-0-healthz.log and main etcd logs
Node joining quick checklist
- NodePool health
  - Resource: .../namespaces/e2e-clusters-*/hypershift.openshift.io/nodepools/*.yaml
  - Compare status.replicas vs status.readyReplicas; read status.conditions[*].message for reasons
- CAPI controllers (infrastructure provisioning)
  - Logs: .../core/pods/logs/cluster-api-*-*.log, .../core/pods/logs/capi-provider-*-*.log
  - Look for VM create/delete errors, quota limits, subnet/NSG failures, identity/permissions
- Bootstrap and ignition fetch
  - Logs: .../core/pods/logs/ignition-server-*-*.log
  - Indicators: GET /config 404/401, timeouts, TLS handshake errors, unreachable ignition endpoint
- CSR approval path
  - Logs: .../core/pods/logs/machine-approver-*-*.log
  - Indicators: CSRs Pending/Denied, signer mismatches, cert issuance errors; approvals not processed
- API reachability from nodes
  - Logs: .../core/pods/logs/kube-apiserver-*-kube-apiserver.log
  - Indicators: connection refused/timeouts from node IPs, SNI/certificate errors, authN/Z failures
- Networking readiness
  - Logs: .../core/pods/logs/cloud-network-config-controller-*-*.log
  - Indicators: pod CIDR allocation issues, route programming errors, Azure NIC/subnet problems
- If nodes exist but are NotReady
  - Check kubelet/CRIO hints in events within Test*/dump.log; verify image pulls, CNI init, time sync
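The log checks in this checklist can be roughly automated with a pattern scan. The sketch below pairs some of the log filename globs above with regular expressions. The regexes are assumptions loosely derived from the listed indicators, not exact log messages emitted by these components, and the names INDICATORS and scan_logs are hypothetical. Treat hits as pointers for manual reading.

```python
import re
from pathlib import Path

# Log filename globs from the checklist, paired with illustrative search patterns.
# The regexes are guesses based on the indicators above, not known log formats.
INDICATORS = {
    "capi-provider-*-*.log": r"quota|subnet|forbidden|unauthorized",
    "ignition-server-*-*.log": r"\b(401|404)\b|timed? ?out|tls handshake",
    "machine-approver-*-*.log": r"pending|denied|signer",
    "kube-apiserver-*-kube-apiserver.log": r"connection refused|timed? ?out|certificate",
}


def scan_logs(logs_dir: str, max_hits: int = 5) -> dict[str, list[str]]:
    """Return up to max_hits matching lines for each log file found under logs_dir."""
    hits = {}
    for pattern, regex in INDICATORS.items():
        matcher = re.compile(regex, re.IGNORECASE)
        for log in sorted(Path(logs_dir).glob(pattern)):
            matched = []
            with log.open(errors="replace") as fh:
                for line in fh:
                    if matcher.search(line):
                        matched.append(line.rstrip())
                        if len(matched) >= max_hits:
                            break
            if matched:
                hits[log.name] = matched
    return hits
```

Run it against a local copy of a hosted control plane namespace's core/pods/logs/ directory. Capping the hits per file keeps the output short enough to decide which log deserves a full read.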
Test scenarios reference
Examples under Test*/:
- Autoscaling, CreateCluster, CustomConfig, HA etcd chaos, NodePool lifecycle, Control plane upgrade