
Add smoke test infrastructure #91

Open
janisz wants to merge 16 commits into main from smoke_test

Conversation

Contributor

@janisz janisz commented Mar 24, 2026

Description

Add smoke tests that run against a real StackRox Central deployment.
The tests verify end-to-end functionality.

Tests read ROX_ENDPOINT and ROX_PASSWORD from environment variables.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Validation


codecov-commenter commented Mar 24, 2026

❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|-----------------|--------|--------|---------|
| 361             | 2      | 359    | 12      |
View the top 2 failed test(s) by shortest run time
::policy 1
Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3
::policy 4
Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3

To view more test analytics, go to the Test Analytics Dashboard


github-actions bot commented Mar 24, 2026

E2E Test Results

Commit: 42cd1b6
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ list-clusters (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ~ cve-nonexistent (assertions: 2/3)
      - MaxToolCalls: Too many tool calls: expected <= 5, got 7
  ✗ cve-cluster-does-exist (assertions: 3/3)
      one or more verification steps failed
  ~ cve-cluster-does-not-exist (assertions: 2/3)
      - ToolsUsed: Required tool not called: server=stackrox-mcp, tool=, pattern=list_clusters
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ✓ cve-multiple (assertions: 3/3)
  ~ rhsa-not-supported (assertions: 1/2)
      - MaxToolCalls: Too many tool calls: expected <= 4, got 13

Tasks:      10/11 passed (90.91%)
Assertions: 29/32 passed (90.62%)
Tokens:     ~67574 (estimate - excludes system prompt & cache)
MCP schemas: ~12738 (included in token total)
Agent used tokens:
  Input:  15361 tokens
  Output: 27509 tokens
Judge used tokens:
  Input:  50242 tokens
  Output: 43938 tokens

Implements comprehensive smoke testing infrastructure that deploys StackRox
Central in a Kind cluster and runs integration tests against real APIs.

## Key Components

- Smoke test suite (`smoke/smoke_test.go`):
  - Tests cluster listing, CVE queries, and deployment detection
  - Runs against real StackRox deployment in CI
  - Uses build tag `smoke` to separate from unit tests

- Authentication helpers (`smoke/token_helper.go`):
  - API token generation using HTTP basic auth
  - Health check polling with exponential backoff
  - Proper context handling and HTTP method constants

- GitHub Actions workflow (`.github/workflows/smoke.yml`):
  - Deploys StackRox Central and vulnerable workload in Kind
  - Optimized for CI resources (reduced replicas, disabled features)
  - Waits for cluster health before running tests
  - Uploads test results and coverage to Codecov

- Test utilities (`internal/testutil/test_helpers.go`):
  - Moved from integration_helpers.go for better organization
  - Port allocation for tests
  - Server readiness polling

## CI Optimizations

- Kind cluster configured to maximize available CPU
- Minimal StackRox deployment (no admission controller, no collector)
- Resource constraints removed from sensor and scanner pods
- Scanner image scanning skipped in CI to save resources
- Port-forwarding to Central for API access

## Testing

- Smoke tests run in dedicated workflow
- Excluded from standard test target (uses build tags)
- Test artifacts and logs collected on failure
- Integration with Codecov for coverage tracking

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
janisz and others added 14 commits March 26, 2026 18:43
Remove unnecessary CVE vulnerability testing and scanner deployment to
simplify smoke tests and reduce CI resource usage.

Changes:
- Remove CVE test cases (get_deployments_for_cve tests)
- Remove waitForImageScan function
- Delete vulnerable workload deployment (nginx:1.14)
- Disable scanner deployment (SCANNER_REPLICAS: 0)
- Replace bash cluster health check with Go code using testify Eventually
- Add IsClusterHealthy function for cleaner cluster status checking
- Remove scanner resource constraints and log collection
- Keep only list_clusters test for basic connectivity verification

Benefits:
- Faster test execution (no image scanning wait)
- Lower CI resource usage (no scanner pod)
- Simpler test code (1 test case instead of 3)
- Better error handling with testify Eventually
- Less bash scripting, more Go code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The custom kind config was setting CPU reservations to 0 to maximize
allocatable resources, but since we remove sensor resource constraints
in a later step anyway, the custom config is unnecessary. Using default
kind cluster configuration simplifies the workflow.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace git clone with actions/checkout@v4 for better GitHub Actions
integration. Remove manual sensor resource constraint removal since
simplified deployment (no scanner, no vulnerable workload) should
allow sensor to schedule without intervention.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The smoke tests already use build tags (//go:build smoke) to separate
them from regular unit tests, so the testing.Short() check is redundant
and unnecessary.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace manual if checks with require.NotEmpty from testify for cleaner
and more idiomatic test code.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace manual error checks with require.NoError from testify for
consistent error handling throughout the test.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add //go:build smoke tag to token_helper.go
- Remove parameterized test structure since there's only one test case
- Directly call list_clusters test without map iteration

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Rename token_helper.go to central_helper_test.go
- Change helper functions to accept testing.T and use require.NoError
- Replace custom backoff loop in WaitForCentralReady with assert.Eventually
- Add t.Helper() to all helper functions
- Simplify error handling by using require instead of returning errors

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The failed smoke tests showed that assert.Eventually doesn't stop the
test when the condition times out - it just logs an error and continues.
This caused confusing secondary failures when the cluster health check
timed out but the test continued to run list_clusters.

Changes:
- Use require.Eventually for cluster health check (fail fast on timeout)
- Use require.Eventually for Central ready check (fail fast on timeout)
- Use require.NotEmpty for cluster list validation
- Remove unused assert import from both test files

This ensures the test fails immediately with a clear message when
prerequisites aren't met, rather than continuing and generating
misleading secondary failures.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Root cause: Sensor pods were failing to schedule due to insufficient CPU
on kind cluster nodes. The deployment had resource requests that exceeded
available capacity.

Error: "0/1 nodes are available: 1 Insufficient cpu"

Fix:
- Add step to remove resource requests from sensor deployment
- This triggers a rollout with pods that can be scheduled
- Wait for sensor pods to become ready after constraint removal
- Remove fallback message from sensor wait (fail fast if not ready)

This restores functionality that was accidentally removed when we
eliminated the custom kind config.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The previous approach tried to remove the requests field, but the pod
still failed with insufficient CPU even after patching. This suggests
either the limits field was still present, or the remove operation
didn't work as expected with the Helm-deployed sensor.

New approach:
- Use JSON patch "replace" operation instead of "remove"
- Set entire resources field to empty object {}
- This removes both limits and requests in one operation
- Use kubectl rollout status to wait for deployment to complete
  instead of kubectl wait on pods (which fails when pods don't exist)

The replace operation is more reliable than remove when the structure
might vary (Helm vs direct deployment).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Previous attempts failed even after patching because:
1. Only patched container[0], but there may be multiple containers
2. Need to verify patch actually removed resources
3. May have LimitRange enforcing minimum resources

Changes:
- Add debugging to show containers and resources before/after patch
- Check for LimitRange in stackrox namespace
- Dynamically detect number of containers and patch all of them
- Show resources before and after patching to verify it worked

This will help diagnose why pods still fail with "Insufficient cpu"
even after resources field is set to empty object.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The previous debugging revealed that the patch was successful (resources
were set to {}), but pods still couldn't schedule. The issue is that we
need to force pod recreation after removing resources.

Restore the approach that was working before commit b66e3f4:
1. Use `kubectl set resources` with cpu=0,memory=0 (cleaner than patch)
2. Delete sensor pods to force immediate recreation
3. Wait for new pods with empty resources to be created and ready

This approach worked in earlier successful runs. The key insight is that
just patching the deployment and waiting for rollout isn't enough - we
need to delete and recreate the pods to ensure they schedule with the
updated (empty) resource spec.
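The three steps above could look roughly like the following workflow fragment. This is a hedged sketch: the step name, the `deployment/sensor` target, and the `app=sensor` label are assumptions, not copied from the actual `.github/workflows/smoke.yml`.

```yaml
- name: Remove sensor resource constraints
  run: |
    # 1. Clear CPU/memory requests and limits on the sensor deployment.
    kubectl -n stackrox set resources deployment/sensor \
      --requests=cpu=0,memory=0 --limits=cpu=0,memory=0
    # 2. Delete existing pods so replacements schedule with the empty spec.
    kubectl -n stackrox delete pods -l app=sensor
    # 3. Wait for the recreated pods to become ready.
    kubectl -n stackrox wait --for=condition=ready pod -l app=sensor --timeout=300s
```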

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The sensor resource removal now works correctly - sensor pod became ready!
But the port-forward step failed because it tried to redirect output to
logs/port-forward.log before the logs directory existed.

The "Collect logs" step creates the logs directory, but that happens later.
The port-forward step runs earlier and needs the directory to exist.

Simple fix: Create logs directory at the start of the port-forward step.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@janisz janisz requested a review from mtodor March 27, 2026 13:04
@janisz janisz marked this pull request as ready for review March 27, 2026 13:05
Signed-off-by: Tomasz Janiszewski <tomek@redhat.com>
