Skip to content

test: add checkpoint/restart fault tolerance tests#7

Merged
rophy merged 5 commits intomasterfrom
test/checkpoint-restart-tests
Mar 24, 2026
Merged

test: add checkpoint/restart fault tolerance tests#7
rophy merged 5 commits intomasterfrom
test/checkpoint-restart-tests

Conversation

@rophy
Copy link
Copy Markdown
Owner

@rophy rophy commented Mar 23, 2026

Summary

  • Checkpoint/restart test scripts for both RAC and single-instance
  • RAC debezium config updates — removed flags: 4 (upstream ADAPTIVE_SCHEMA bug), added checkpoint intervals
  • RAC VM IP validationup.sh auto-detects VM IP via virsh domifaddr and fails loudly if config files have stale IPs
  • NLS_LANG fix in RAC driver for correct NCHAR handling through SSH

Test plan

  • 2-hour RAC soak test: 40 iterations, 120 kill/restart cycles, zero failures
  • Single-instance checkpoint/restart: 10 cycles pass
  • SQL e2e: free-23 (36 pass), xe-21 (36 pass), RAC (54 pass)

Summary by CodeRabbit

  • Tests

    • Added end-to-end checkpoint/restart integration tests (single-node and RAC) that run kill/restart cycles, verify checkpoint monotonicity, perform sentinel sync, compare produced outputs for accuracy, scan logs for errors, and emit consolidated PASS/FAIL results.
  • Chores

    • Improved test orchestration: automatic VM/RAC IP detection and SSH sourcing, updated container restart behavior, added checkpoint persistence/tuning and a named checkpoint volume, and aligned test environment connection settings.

rophy added 3 commits March 23, 2026 23:58
RAC version: kill/restart OLR with DML on both nodes, compare
LogMiner vs OLR via Debezium twin-test. Verifies checkpoint
resume, data accuracy, no duplicates, no ASAN errors.

Single-instance version: same logic against local Docker Oracle.

Also adds restart: unless-stopped for Debezium OLR adapter
(crashes when OLR is killed, needs auto-restart) and
olr-checkpoint volume for single-instance test.
- Update VM IP from 192.168.122.248 to 192.168.122.130
- Add checkpoint interval (10s, 1MB) for fault tolerance testing
- Remove flags: 4 (ADAPTIVE_SCHEMA) — causes upstream bug
  ERROR 50022 duplicate SYS.CCOL$ on createSchema()
VM was rebuilt at 192.168.122.130 (was 192.168.122.248).
Update up.sh and .env to match.
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 23, 2026

📝 Walkthrough

Walkthrough

Adds new single-instance and RAC end-to-end checkpoint/restart integration tests, VM environment detection helper, and Docker/OLR configuration changes to exercise OLR kill/restart cycles, checkpoint SCN verification, sentinel synchronization, and output comparison.

Changes

Cohort / File(s) Summary
Single-instance test
tests/debezium/checkpoint-restart-test.sh
New executable Bash integration test that runs OLR kill/restart cycles against local Docker containers, forces log switches, verifies checkpoint SCNs, polls receiver health/status, compares LogMiner vs OLR outputs, and scans OLR logs for ASAN/duplicate/schema errors.
RAC test & helpers
tests/sql/environments/rac/debezium/checkpoint-restart-test.sh, tests/sql/environments/rac/vm-env.sh
New RAC-focused executable test script and vm-env.sh to detect/export VM SSH settings, validate config IPs, orchestrate remote sqlplus on RAC nodes, manage OLR lifecycle via podman, run kill/restart cycles, verify checkpoints, and compare outputs.
Docker Compose / volumes
tests/debezium/docker-compose.yaml, tests/sql/environments/rac/debezium/docker-compose.yaml
Added restart: unless-stopped to dbz-olr service(s) and introduced a named volume olr-checkpoint for checkpoint persistence.
OLR config
tests/sql/environments/rac/debezium/config/olr-config.json
Added checkpoint interval parameters (interval-s: 10, interval-mb: 1) to OLR state config alongside existing disk path.
Host / connection updates
tests/sql/environments/rac/debezium/config/application-logminer.properties, tests/sql/environments/rac/debezium/config/application-olr.properties, tests/sql/environments/rac/.env
Updated Debezium / DB host references from 192.168.122.248192.168.122.130 and adjusted RAC .env DB_CONN accordingly.
RAC scripts & runner
tests/sql/environments/rac/up.sh, tests/sql/scripts/drivers/rac.sh
up.sh now sources vm-env.sh and uses exported VM_HOST/VM_USER/_SSH_OPTS; rac.sh updates default VM host and sets NLS_LANG=AMERICAN_AMERICA.AL32UTF8 for remote sqlplus invocations.

Sequence Diagram(s)

sequenceDiagram
  participant Tester
  participant VM as RAC VM / SSH
  participant Oracle as Oracle DB (RAC)
  participant Receiver as Debezium Receiver
  participant LogMiner as Debezium LogMiner
  participant OLR as OpenLogReplicator
  participant Docker as Container Engine

  Tester->>VM: source vm-env.sh (detect VM_HOST, _SSH_OPTS)
  Tester->>Receiver: GET /health (verify reachable)
  Tester->>Oracle: sqlplus (prepare schema, supplemental logging)
  Tester->>Docker: reset/create `olr-checkpoint` volume
  Tester->>Docker: start LogMiner & OLR services
  Docker->>LogMiner: start streaming
  LogMiner->>Receiver: produce events
  Tester->>OLR: wait for readiness
  loop KILL_COUNT cycles
    Tester->>Oracle: perform DML while OLR running
    Tester->>Oracle: force log switch
    OLR->>Tester: write checkpoint (capture pre-kill SCN)
    Tester->>Docker: kill/remove OLR container
    Tester->>Oracle: DML/DDL while OLR down
    Tester->>Docker: start OLR (restart unless-stopped)
    OLR->>Tester: become ready, restore checkpoint (capture post-restart SCN)
    Tester->>Tester: verify post-SCN >= pre-SCN
  end
  Tester->>Oracle: insert DEBEZIUM_SENTINEL
  loop until timeout
    Receiver->>Tester: /status (check logminer_sentinel, olr_sentinel)
  end
  Tester->>Tester: compare outputs (logminer.jsonl vs olr.jsonl)
  Tester->>Docker: fetch OLR logs (scan ASAN/duplicates/errors)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hopped from VM to container shore,
Killing and starting OLR once more,
I chased SCNs through log and file,
Sentinels smiled after every trial,
Now checkpoints hum — the tests adore! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'test: add checkpoint/restart fault tolerance tests' accurately and concisely describes the main purpose of the changeset—adding comprehensive checkpoint/restart validation tests for both RAC and single-instance environments.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch test/checkpoint-restart-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/debezium/docker-compose.yaml (1)

35-47: ⚠️ Potential issue | 🟠 Major

Mount olr-checkpoint into olr service to persist checkpoint data across restarts.

The olr service creates /olr-data/checkpoint in its entrypoint, but this directory lives under an ephemeral tmpfs mount. The olr-checkpoint named volume is declared but never mounted into the service. A container restart will discard checkpoint files, preventing the restart/resume test from validating durability.

Suggested fix
    volumes:
      - ./config/olr-config.json:/config/olr-config.json:ro
      - oracle-data:/opt/oracle/oradata:ro
+     - olr-checkpoint:/olr-data/checkpoint
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/debezium/docker-compose.yaml` around lines 35 - 47, The olr service
creates /olr-data/checkpoint in its entrypoint but that path is on a tmpfs and
the declared named volume olr-checkpoint is never mounted; update the olr
service spec (service name "olr", its entrypoint/command and volumes block) to
mount the named volume into /olr-data/checkpoint (e.g. add a volumes entry like
olr-checkpoint:/olr-data/checkpoint) and ensure the tmpfs entry does not
override that path (remove or adjust the tmpfs for /olr-data or exclude
/olr-data/checkpoint) so checkpoint files persist across restarts; also confirm
the named volume olr-checkpoint is declared in the compose file.
🧹 Nitpick comments (6)
tests/sql/environments/rac/debezium/checkpoint-restart-test.sh (3)

70-84: Shellcheck SC2155: Declare and assign separately.

Per static analysis, line 73 combines declaration and assignment.

Suggested fix
 _exec_user() {
     local sql_file="$1"
     local node="${2:-$RAC_NODE1}" sid="${3:-$ORACLE_SID1}" conn="${4:-$DB_CONN1}"
-    local remote="/tmp/$(basename "$sql_file")"
+    local remote
+    remote="/tmp/$(basename "$sql_file")"
     _vm_copy_in "$sql_file" "$remote" "$node"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sql/environments/rac/debezium/checkpoint-restart-test.sh` around lines
70 - 84, The ShellCheck warning SC2155 in _exec_user is caused by declaring and
assigning a variable in one statement; change the pattern to declare the
variable first (local output) and then assign to it on the next line
(output=$(_vm_sqlplus "$node" "$sid" "$conn" "$remote")), ensuring you reference
the _exec_user function and the _vm_sqlplus call so the declaration and
assignment are separated.

63-68: Shellcheck SC2155: Declare and assign separately.

Per static analysis, line 65 combines declaration and assignment which can mask return values.

Suggested fix
 _exec_sysdba() {
     local sql_file="$1"
-    local remote="/tmp/$(basename "$sql_file")"
+    local remote
+    remote="/tmp/$(basename "$sql_file")"
     _vm_copy_in "$sql_file" "$remote" "$RAC_NODE1"
     _vm_sqlplus "$RAC_NODE1" "$ORACLE_SID1" "/ as sysdba" "$remote"
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sql/environments/rac/debezium/checkpoint-restart-test.sh` around lines
63 - 68, The function _exec_sysdba uses combined declaration+assignment which
triggers ShellCheck SC2155; to fix, split declarations and assignments in
_exec_sysdba: declare local variables first (e.g., local sql_file; local remote)
then assign sql_file="$1" and remote="/tmp/$(basename "$sql_file")" on separate
lines, leaving the following calls to _vm_copy_in and _vm_sqlplus unchanged.

158-183: Shellcheck SC2155 and redundant condition.

Line 175 has the same SC2155 issue. Additionally, similar to the single-instance version, lines 170-173 are redundant—they're already covered by lines 166-169.

Suggested fix
 _wait_for_checkpoint() {
     echo "  Waiting for OLR checkpoint (up to 30s)..."
     local prev_scn
     prev_scn=$(_read_checkpoint_scn)
     for i in $(seq 1 15); do
         sleep 2
         local cur_scn
         cur_scn=$(_read_checkpoint_scn)
         if [[ "$cur_scn" != "0" && "$cur_scn" != "$prev_scn" ]]; then
             echo "  Checkpoint written: scn=$cur_scn"
             return 0
         fi
-        if [[ "$cur_scn" != "0" && "$prev_scn" == "0" ]]; then
-            echo "  Checkpoint written: scn=$cur_scn"
-            return 0
-        fi
     done
-    local final=$(_read_checkpoint)
+    local final
+    final=$(_read_checkpoint)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sql/environments/rac/debezium/checkpoint-restart-test.sh` around lines
158 - 183, In _wait_for_checkpoint, fix ShellCheck SC2155 by declaring locals
separately (e.g., use "local prev_scn" then "prev_scn=$(_read_checkpoint_scn)"
and likewise "local cur_scn" then assign, and "local final" then assign from
_read_checkpoint) and remove the redundant second conditional block that repeats
the same check (the if that checks "$cur_scn" != "0" && "$prev_scn" == "0");
instead keep a single conditional that detects a non-zero cur_scn different from
prev_scn (if [[ "$cur_scn" != "0" && "$cur_scn" != "$prev_scn" ]] ...) to report
the checkpoint.
tests/debezium/checkpoint-restart-test.sh (3)

20-21: Unused ORACLE_PORT variable.

ORACLE_PORT is defined but never used. DB_CONN hardcodes port 1521 regardless of the ORACLE_PORT value.

Suggested fix

Either remove the unused variable or use it in DB_CONN:

-ORACLE_PORT="${ORACLE_PORT:-1522}"
-DB_CONN="olr_test/olr_test@//localhost:1521/XEPDB1"
+ORACLE_PORT="${ORACLE_PORT:-1521}"
+DB_CONN="olr_test/olr_test@//localhost:${ORACLE_PORT}/XEPDB1"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/debezium/checkpoint-restart-test.sh` around lines 20 - 21, The script
defines ORACLE_PORT but never uses it and DB_CONN hardcodes port 1521; update
DB_CONN to reference the ORACLE_PORT variable (e.g., use the ORACLE_PORT shell
variable in the connection string) so the port is configurable via ORACLE_PORT,
or remove the ORACLE_PORT definition if you intend to always use 1521; modify
the DB_CONN assignment that currently contains "DB_CONN" to interpolate
ORACLE_PORT instead of the hardcoded 1521 (or delete ORACLE_PORT if unused).

475-481: Broad grep pattern may cause false positives.

The pattern "duplicate\|ERROR 50022" could match unrelated log messages containing "duplicate". The RAC version uses a more specific pattern "ERROR 50022\|duplicate SYS\." which targets the actual bug.

Suggested fix for consistency with RAC version
-if grep -q "duplicate\|ERROR 50022" "$WORK_DIR/olr.log"; then
+if grep -q "ERROR 50022\|duplicate SYS\." "$WORK_DIR/olr.log"; then
     echo "  FAIL: OLR duplicate/schema errors detected (Bug 1)"
-    grep "duplicate\|ERROR 50022" "$WORK_DIR/olr.log" | head -5
+    grep "ERROR 50022\|duplicate SYS\." "$WORK_DIR/olr.log" | head -5
     OLR_ERROR_RESULT=1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/debezium/checkpoint-restart-test.sh` around lines 475 - 481, The grep
pattern in the OLR error check is too broad and may yield false positives;
update the pattern used in the conditional that greps "$WORK_DIR/olr.log" (the
line that currently checks "duplicate\|ERROR 50022") to the more specific
pattern used in RAC (e.g., "ERROR 50022\|duplicate SYS\.") so it only matches
the real bug, and ensure the subsequent grep that prints matches (the line
piping to head -5) uses the same narrowed pattern; no other logic changes
needed—just replace the pattern in both grep invocations that set/check
OLR_ERROR_RESULT.

141-148: Redundant condition in _wait_for_checkpoint.

The second condition (lines 145-148) is already covered by the first (lines 141-143). When prev_scn == "0" and cur_scn != "0", the first check's cur_scn != prev_scn is also true.

Suggested simplification
         if [[ "$cur_scn" != "0" && "$cur_scn" != "$prev_scn" ]]; then
             echo "  Checkpoint written: scn=$cur_scn"
             return 0
         fi
-        if [[ "$cur_scn" != "0" && "$prev_scn" == "0" ]]; then
-            echo "  Checkpoint written: scn=$cur_scn"
-            return 0
-        fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/debezium/checkpoint-restart-test.sh` around lines 141 - 148, In
_wait_for_checkpoint remove the redundant second if block that checks 'if [[
"$cur_scn" != "0" && "$prev_scn" == "0" ]]' because its case is already handled
by the first condition 'if [[ "$cur_scn" != "0" && "$cur_scn" != "$prev_scn" ]]'
— simply delete the second block (echo + return) so only the first check remains
to detect a written checkpoint.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/debezium/checkpoint-restart-test.sh`:
- Around line 495-503: The script currently ignores CHECKPOINT_VERIFIED when
computing the final exit code; update the logic to treat checkpoint verification
as a failure when CHECKPOINT_VERIFIED is 0 by introducing or using a
CHKPT_RESULT and including it in both the pass/fail conditional and the final
exit calculation: set CHKPT_RESULT=0/1 based on CHECKPOINT_VERIFIED (or convert
CHECKPOINT_VERIFIED to a numeric result), change the if condition to require
CHKPT_RESULT (e.g., include CHKPT_RESULT in the && list with COMPARE_RESULT,
ASAN_RESULT, OLR_ERROR_RESULT), and add CHKPT_RESULT to the exit sum so a failed
checkpoint produces a non‑zero exit code.
- Around line 114-119: The _read_checkpoint_scn function can emit multiple
concatenated outputs because the initial `docker volume inspect
"$CHECKPOINT_VOL" || echo "0"` echoes "0" but does not stop subsequent commands;
change the control flow so a failing volume inspect short-circuits the function
(for example, after testing CHECKPOINT_VOL with docker volume inspect,
immediately echo "0" and return) and only run the docker run / python pipeline
when the volume exists; reference the _read_checkpoint_scn function and the
CHECKPOINT_VOL variable to locate and update the conditional flow so the
function emits exactly one value.

In `@tests/sql/environments/rac/up.sh`:
- Around line 46-47: up.sh currently discovers the real RAC IP and sets VM_HOST
but none of the entrypoints (setup.sh, checkpoint-restart-test.sh, run.sh,
soak-test.sh) source up.sh, so they fall back to the hardcoded VM_HOST; fix by
moving the discovery logic into a shared sourced helper (e.g., a new
scripts/env.sh) or by having up.sh write the discovered value to a persistent
env file (e.g., .env) and then update all entrypoints (setup.sh,
checkpoint-restart-test.sh, run.sh, soak-test.sh) to source that helper/.env
before they evaluate their fallback VM_HOST="${VM_HOST:-192.168.122.248}" so the
discovered VM_HOST is available and exported for all scripts.
- Around line 9-15: The current ACTUAL_IP assignment from `virsh domifaddr
oracle-rac-vm` can yield multiple IPv4 lines causing a multi-line ACTUAL_IP and
breaking later SSH/mismatch checks; update the probe in up.sh to filter for
IPv4, select exactly one address (e.g., the first matching IPv4) and extract
only the IP portion so ACTUAL_IP is always a single IPv4 string; target the
`ACTUAL_IP` assignment and the `virsh domifaddr` pipeline in up.sh (referenced
symbol: ACTUAL_IP, command: virsh domifaddr oracle-rac-vm) and add a step like
"pick first match" (head -n1 or equivalent) after the awk/cut extraction to
ensure a single-line result.

---

Outside diff comments:
In `@tests/debezium/docker-compose.yaml`:
- Around line 35-47: The olr service creates /olr-data/checkpoint in its
entrypoint but that path is on a tmpfs and the declared named volume
olr-checkpoint is never mounted; update the olr service spec (service name
"olr", its entrypoint/command and volumes block) to mount the named volume into
/olr-data/checkpoint (e.g. add a volumes entry like
olr-checkpoint:/olr-data/checkpoint) and ensure the tmpfs entry does not
override that path (remove or adjust the tmpfs for /olr-data or exclude
/olr-data/checkpoint) so checkpoint files persist across restarts; also confirm
the named volume olr-checkpoint is declared in the compose file.

---

Nitpick comments:
In `@tests/debezium/checkpoint-restart-test.sh`:
- Around line 20-21: The script defines ORACLE_PORT but never uses it and
DB_CONN hardcodes port 1521; update DB_CONN to reference the ORACLE_PORT
variable (e.g., use the ORACLE_PORT shell variable in the connection string) so
the port is configurable via ORACLE_PORT, or remove the ORACLE_PORT definition
if you intend to always use 1521; modify the DB_CONN assignment that currently
contains "DB_CONN" to interpolate ORACLE_PORT instead of the hardcoded 1521 (or
delete ORACLE_PORT if unused).
- Around line 475-481: The grep pattern in the OLR error check is too broad and
may yield false positives; update the pattern used in the conditional that greps
"$WORK_DIR/olr.log" (the line that currently checks "duplicate\|ERROR 50022") to
the more specific pattern used in RAC (e.g., "ERROR 50022\|duplicate SYS\.") so
it only matches the real bug, and ensure the subsequent grep that prints matches
(the line piping to head -5) uses the same narrowed pattern; no other logic
changes needed—just replace the pattern in both grep invocations that set/check
OLR_ERROR_RESULT.
- Around line 141-148: In _wait_for_checkpoint remove the redundant second if
block that checks 'if [[ "$cur_scn" != "0" && "$prev_scn" == "0" ]]' because its
case is already handled by the first condition 'if [[ "$cur_scn" != "0" &&
"$cur_scn" != "$prev_scn" ]]' — simply delete the second block (echo + return)
so only the first check remains to detect a written checkpoint.

In `@tests/sql/environments/rac/debezium/checkpoint-restart-test.sh`:
- Around line 70-84: The ShellCheck warning SC2155 in _exec_user is caused by
declaring and assigning a variable in one statement; change the pattern to
declare the variable first (local output) and then assign to it on the next line
(output=$(_vm_sqlplus "$node" "$sid" "$conn" "$remote")), ensuring you reference
the _exec_user function and the _vm_sqlplus call so the declaration and
assignment are separated.
- Around line 63-68: The function _exec_sysdba uses combined
declaration+assignment which triggers ShellCheck SC2155; to fix, split
declarations and assignments in _exec_sysdba: declare local variables first
(e.g., local sql_file; local remote) then assign sql_file="$1" and
remote="/tmp/$(basename "$sql_file")" on separate lines, leaving the following
calls to _vm_copy_in and _vm_sqlplus unchanged.
- Around line 158-183: In _wait_for_checkpoint, fix ShellCheck SC2155 by
declaring locals separately (e.g., use "local prev_scn" then
"prev_scn=$(_read_checkpoint_scn)" and likewise "local cur_scn" then assign, and
"local final" then assign from _read_checkpoint) and remove the redundant second
conditional block that repeats the same check (the if that checks "$cur_scn" !=
"0" && "$prev_scn" == "0"); instead keep a single conditional that detects a
non-zero cur_scn different from prev_scn (if [[ "$cur_scn" != "0" && "$cur_scn"
!= "$prev_scn" ]] ...) to report the checkpoint.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b386ce39-065d-4eea-b506-174082f77624

📥 Commits

Reviewing files that changed from the base of the PR and between c81e190 and 0650b4b.

📒 Files selected for processing (10)
  • tests/debezium/checkpoint-restart-test.sh
  • tests/debezium/docker-compose.yaml
  • tests/sql/environments/rac/.env
  • tests/sql/environments/rac/debezium/checkpoint-restart-test.sh
  • tests/sql/environments/rac/debezium/config/application-logminer.properties
  • tests/sql/environments/rac/debezium/config/application-olr.properties
  • tests/sql/environments/rac/debezium/config/olr-config.json
  • tests/sql/environments/rac/debezium/docker-compose.yaml
  • tests/sql/environments/rac/up.sh
  • tests/sql/scripts/drivers/rac.sh

Comment on lines +9 to +15
# Auto-detect VM IP from virsh
ACTUAL_IP=$(virsh domifaddr oracle-rac-vm 2>/dev/null | awk '/ipv4/{print $4}' | cut -d/ -f1)
if [[ -z "$ACTUAL_IP" ]]; then
echo "ERROR: Cannot detect RAC VM IP. Is oracle-rac-vm running?" >&2
echo " Start with: virsh start oracle-rac-vm" >&2
exit 1
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the actual file mentioned
if [[ -f "tests/sql/environments/rac/up.sh" ]]; then
    echo "=== File exists, checking lines around 9-15 ==="
    sed -n '1,20p' tests/sql/environments/rac/up.sh
else
    echo "=== File not found, listing directory structure ==="
    find . -name "up.sh" -type f | head -20
fi

Repository: rophy/OpenLogReplicator

Length of output: 757


🏁 Script executed:

# Search for related files that use virsh commands
rg "virsh" --type sh -A 2 -B 2 | head -50

Repository: rophy/OpenLogReplicator

Length of output: 711


🌐 Web query:

virsh domifaddr output format multiple IPv4 addresses documentation

💡 Result:

The virsh domifaddr command displays network interface addresses for a libvirt domain in a tabular format with columns: Name, MAC address, Protocol, and Address. Each row represents one IP address (IPv4 or IPv6) associated with an interface. Example output without --full: Name MAC address Protocol Address ------------------------------------------------------------------------------- vnet1 52:54:00:b9:58:64 ipv4 192.168.130.156/24 For interfaces with multiple IPv4/IPv6 addresses, without --full, subsequent addresses for the same interface show "-" for Name and MAC address: Name MAC address Protocol Address ------------------------------------------------------------------------------- lo 00:00:00:00:00:00 ipv4 127.0.0.1/8 - - ipv6 ::1/128 enp1s0 52:54:00:6b:3b:e4 ipv4 192.168.122.248/24 With --full, the interface Name and MAC address are repeated for every address on that interface. The Address field shows IP/prefix like 192.168.122.248/24 for IPv4. Protocol is "ipv4" or "ipv6". Data source (--source lease|agent|arp) affects what addresses are shown; agent shows guest-view (multiple IPs common), lease shows libvirt DHCP leases (usually one IPv4 per interface). Official documentation: libvirt.org/manpages/virsh.html#domifaddr confirms tabular format and --full behavior for multiple addresses.

Citations:


Make the virsh probe select exactly one IPv4.

virsh domifaddr can output multiple IPv4 rows for an interface. In that case, ACTUAL_IP becomes a multi-line string and later SSH or mismatch checks will fail or target the wrong host.

Suggested fix
-# Auto-detect VM IP from virsh
-ACTUAL_IP=$(virsh domifaddr oracle-rac-vm 2>/dev/null | awk '/ipv4/{print $4}' | cut -d/ -f1)
+# Auto-detect VM IP from virsh
+mapfile -t ACTUAL_IPS < <(virsh domifaddr oracle-rac-vm 2>/dev/null | awk '/ipv4/{print $4}' | cut -d/ -f1)
+ACTUAL_IP="${ACTUAL_IPS[0]:-}"
+if [[ ${`#ACTUAL_IPS`[@]} -gt 1 ]]; then
+    echo "ERROR: Multiple IPv4 addresses detected for oracle-rac-vm: ${ACTUAL_IPS[*]}" >&2
+    exit 1
+fi
 if [[ -z "$ACTUAL_IP" ]]; then
     echo "ERROR: Cannot detect RAC VM IP. Is oracle-rac-vm running?" >&2
     echo "  Start with: virsh start oracle-rac-vm" >&2
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Auto-detect VM IP from virsh
ACTUAL_IP=$(virsh domifaddr oracle-rac-vm 2>/dev/null | awk '/ipv4/{print $4}' | cut -d/ -f1)
if [[ -z "$ACTUAL_IP" ]]; then
echo "ERROR: Cannot detect RAC VM IP. Is oracle-rac-vm running?" >&2
echo " Start with: virsh start oracle-rac-vm" >&2
exit 1
fi
# Auto-detect VM IP from virsh
mapfile -t ACTUAL_IPS < <(virsh domifaddr oracle-rac-vm 2>/dev/null | awk '/ipv4/{print $4}' | cut -d/ -f1)
ACTUAL_IP="${ACTUAL_IPS[0]:-}"
if [[ ${`#ACTUAL_IPS`[@]} -gt 1 ]]; then
echo "ERROR: Multiple IPv4 addresses detected for oracle-rac-vm: ${ACTUAL_IPS[*]}" >&2
exit 1
fi
if [[ -z "$ACTUAL_IP" ]]; then
echo "ERROR: Cannot detect RAC VM IP. Is oracle-rac-vm running?" >&2
echo " Start with: virsh start oracle-rac-vm" >&2
exit 1
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sql/environments/rac/up.sh` around lines 9 - 15, The current ACTUAL_IP
assignment from `virsh domifaddr oracle-rac-vm` can yield multiple IPv4 lines
causing a multi-line ACTUAL_IP and breaking later SSH/mismatch checks; update
the probe in up.sh to filter for IPv4, select exactly one address (e.g., the
first matching IPv4) and extract only the IP portion so ACTUAL_IP is always a
single IPv4 string; target the `ACTUAL_IP` assignment and the `virsh domifaddr`
pipeline in up.sh (referenced symbol: ACTUAL_IP, command: virsh domifaddr
oracle-rac-vm) and add a step like "pick first match" (head -n1 or equivalent)
after the awk/cut extraction to ensure a single-line result.

Comment on lines +46 to +47
export VM_HOST="$ACTUAL_IP"
echo "RAC VM IP: $ACTUAL_IP"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n --type sh 'VM_HOST="\$\{VM_HOST:-192\.168\.122\.[0-9]+\}"' tests/sql/environments/rac tests/sql/scripts/drivers

Repository: rophy/OpenLogReplicator

Length of output: 504


🏁 Script executed:

cat -n tests/sql/environments/rac/up.sh | head -60

Repository: rophy/OpenLogReplicator

Length of output: 2286


🏁 Script executed:

# Find how up.sh is invoked in the codebase
rg -n "up\.sh" tests/sql/environments/rac/ tests/sql/scripts/drivers/ -A 2 -B 2 | head -100

Repository: rophy/OpenLogReplicator

Length of output: 2243


🏁 Script executed:

# Check if up.sh is sourced or executed as subprocess
rg -n "source.*up\.sh|\..*up\.sh|\./up\.sh|bash.*up\.sh" tests/sql/environments/rac/ tests/sql/scripts/drivers/

Repository: rophy/OpenLogReplicator

Length of output: 407


🏁 Script executed:

# Check if any entrypoint sources or calls up.sh
rg -n "source.*up\.sh|\..*up\.sh|\./up\.sh" tests/sql/environments/rac/debezium/

Repository: rophy/OpenLogReplicator

Length of output: 407


🏁 Script executed:

# Examine checkpoint-restart-test.sh to understand VM_HOST usage
sed -n '1,50p' tests/sql/environments/rac/debezium/checkpoint-restart-test.sh

Repository: rophy/OpenLogReplicator

Length of output: 1907


🏁 Script executed:

# Check setup.sh 
sed -n '1,50p' tests/sql/environments/rac/debezium/setup.sh

Repository: rophy/OpenLogReplicator

Length of output: 1668


🏁 Script executed:

# Check run.sh
sed -n '1,50p' tests/sql/environments/rac/debezium/run.sh

Repository: rophy/OpenLogReplicator

Length of output: 1903


Move VM_HOST discovery from up.sh into shared sourced helper or generated env file.

up.sh correctly discovers the actual RAC VM IP via virsh (line 46), but none of the test entrypoints (setup.sh, checkpoint-restart-test.sh, run.sh, soak-test.sh) source or invoke up.sh. Each entrypoint independently has its own hardcoded fallback: VM_HOST="${VM_HOST:-192.168.122.248}". Running any entrypoint directly will therefore ignore the discovered IP and use the stale fallback, defeating the discovery logic entirely.

Either:

  1. Create a shared helper that all entrypoints source to discover and share the actual IP, or
  2. Have up.sh write the discovered IP to a persistent env file (e.g., .env) that all entrypoints source before setting their fallback.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sql/environments/rac/up.sh` around lines 46 - 47, up.sh currently
discovers the real RAC IP and sets VM_HOST but none of the entrypoints
(setup.sh, checkpoint-restart-test.sh, run.sh, soak-test.sh) source up.sh, so
they fall back to the hardcoded VM_HOST; fix by moving the discovery logic into
a shared sourced helper (e.g., a new scripts/env.sh) or by having up.sh write
the discovered value to a persistent env file (e.g., .env) and then update all
entrypoints (setup.sh, checkpoint-restart-test.sh, run.sh, soak-test.sh) to
source that helper/.env before they evaluate their fallback
VM_HOST="${VM_HOST:-192.168.122.248}" so the discovered VM_HOST is available and
exported for all scripts.

- Extract VM IP discovery into shared vm-env.sh sourced by all RAC
  scripts (fixes major: standalone scripts bypassed up.sh discovery)
- Fix _read_checkpoint_scn double output when volume doesn't exist
- Add CHKPT_RESULT to single-instance exit code (consistency with RAC)
- Use head -1 for virsh IP detection (handles multiple IPs)
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/debezium/checkpoint-restart-test.sh`:
- Around line 200-205: The _kill_olr() function currently removes the OLR
container (OLR_CONTAINER) without saving its logs, losing logs from earlier
restart cycles; modify _kill_olr to capture docker logs before removal and write
them to a uniquely named file (include the current cycle index or timestamp and
the output of _read_checkpoint in the filename) under a persistent test
artifacts dir (e.g., TEST_ARTIFACTS or /tmp/olr-logs) so stage 6's checks
against dbz-olr logs can examine logs from all KILL_COUNT runs; ensure the code
uses docker logs "$OLR_CONTAINER" > <file> prior to docker rm and retains
filenames that make it easy to correlate with cycle number/checkpoint.
- Around line 160-166: The docker inspect format currently concatenates network
names (in the command that sets net from ORACLE_CONTAINER), producing invalid
values like "net1net2"; update the format string used in the docker inspect call
inside the tests/debezium/checkpoint-restart-test.sh script so each network name
is printed on its own line (for example by using a template that emits a newline
for each $k or by using println $k), then keep the existing head -1 to select
the first network; refer to the variables/commands net, ORACLE_CONTAINER, and
the docker inspect invocation and ensure the docker run that uses
OLR_CONTAINER/--network reads a single valid network name.

In `@tests/sql/environments/rac/debezium/checkpoint-restart-test.sh`:
- Around line 573-575: The code currently prints a warning when len(unique_ids)
< expected_total allowing the test to pass with missing CHKPT_TEST inserts;
change this to fail the test by replacing the print warning with a non-zero exit
or assertion (e.g., raise SystemExit/exit(1) or assert len(unique_ids) >=
expected_total) that includes a clear message with len(unique_ids) and
expected_total so missing IDs cause test failure.
- Around line 181-186: Define a persistent OLR_LOG path
(OLR_LOG="$WORK_DIR/olr.log") right after WORK_DIR is created, then modify the
_kill_olr() function to capture and append the container logs to $OLR_LOG before
removing the container (i.e., run podman logs $OLR_CONTAINER >> "$OLR_LOG" or
equivalent prior to podman rm), and update the Stage 7 log collection step to
append (>>) instead of overwrite (>) so earlier restart-cycle errors (ASAN/ERROR
50022) are preserved; references: WORK_DIR, OLR_LOG, _kill_olr(), and the Stage
7 log collection redirection.

In `@tests/sql/environments/rac/vm-env.sh`:
- Around line 15-18: The VM auto-detection picks the first IPv4 from `virsh
domifaddr` which can be a non-libvirt NIC; update the `VM_HOST` assignment to
filter returned addresses to the libvirt subnet `192.168.122.*` before taking
the first match so it matches the validation later. Locate the `VM_HOST=$(virsh
domifaddr oracle-rac-vm ... | awk ... | cut ... | head -1)` line and change the
pipeline to only select addresses in the `192.168.122.` range (e.g., via
awk/grep filtering) so `VM_HOST` is set to the libvirt IPv4 if present; keep the
existing fallback behavior if no libvirt address is found.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e89a23b7-14f0-46ce-9354-b82339d39ddd

📥 Commits

Reviewing files that changed from the base of the PR and between 0650b4b and d82bcc7.

📒 Files selected for processing (4)
  • tests/debezium/checkpoint-restart-test.sh
  • tests/sql/environments/rac/debezium/checkpoint-restart-test.sh
  • tests/sql/environments/rac/up.sh
  • tests/sql/environments/rac/vm-env.sh
✅ Files skipped from review due to trivial changes (1)
  • tests/sql/environments/rac/up.sh

Comment on lines +160 to +166
# Get Oracle container's network
local net
net=$(docker inspect "$ORACLE_CONTAINER" --format '{{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{end}}' | head -1)

docker run -d --name "$OLR_CONTAINER" \
--network "$net" \
--network-alias olr \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and examine the context around lines 160-166
find . -type f -name "checkpoint-restart-test.sh" | head -5

Repository: rophy/OpenLogReplicator

Length of output: 175


🏁 Script executed:

# Let's also look for docker-related documentation or examples in the repo
find . -type f \( -name "*.sh" -o -name "*.md" \) -exec grep -l "docker inspect" {} \; | head -10

Repository: rophy/OpenLogReplicator

Length of output: 110


🏁 Script executed:

# Read the specific file around the mentioned lines
if [ -f tests/debezium/checkpoint-restart-test.sh ]; then
  sed -n '155,170p' tests/debezium/checkpoint-restart-test.sh
fi

Repository: rophy/OpenLogReplicator

Length of output: 630


🌐 Web query:

docker inspect format go template multiple networks concatenation

💡 Result:

To use docker inspect with Go templates to handle multiple networks (from .NetworkSettings.Networks, a map) and concatenate their names or details, use the range action to iterate over the map keys and the join function on a collected slice if needed. ## List network names (one per line): docker inspect container_name --format '{{range $k, $v := .NetworkSettings.Networks}}{{println $k}}{{end}}' Output example: bridge br01 Sources: StackOverflow answer confirms this works for listing networks [2], Docker docs show similar range and println usage [1]. ## Concatenate network names with custom separator (e.g., comma): First collect keys into a slice, then join: docker inspect container_name --format '{{$nets := slice}}{{range $k, $v := .NetworkSettings.Networks}}{{$nets = append $nets $k}}{{end}}{{join $nets ","}}' Docker's Go template implementation (sprig library) supports keys, slice, append, and join functions [1]. This extracts map keys ["bridge", "br01"], appends to slice, joins as "bridge,br01". Alternative without variables (simpler, space-separated): docker inspect container_name --format '{{range $k, $v := .NetworkSettings.Networks}}{{$k}},{{end}}' | sed 's/,$//' But pure template preferred. ## Concatenate IP addresses (one per network): docker inspect container_name --format='{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' Docker docs example [7]. ## Full JSON of networks: docker inspect container_name --format '{{json .NetworkSettings.Networks}}' Official Docker formatting docs cover join, range, println [1,7]. StackOverflow provides exact container networks iteration [2]. These work as of Docker 27.x (2026).

Citations:


🌐 Web query:

go template printf %s\n docker inspect format

💡 Result:

In Docker's docker inspect command, the --format flag uses Go text/templates. The printf "%s\n" is a Go template function call to printf, which is an alias for fmt.Sprintf. It formats its arguments according to the format string "%s\n", printing the string value followed by a newline. This is commonly used to output a field value cleanly with a trailing newline, avoiding extra spacing from other functions like println. Example from community usage (e.g., reconstructing docker run commands): docker inspect --format '{{printf "%q" .Name}}' container This quotes the container name safely. Official Docker docs confirm printf availability via Go templates (https://docs.docker.com/engine/cli/formatting/), and Go docs specify printf behavior (https://pkg.go.dev/text/template#hdr-Functions). Shell escaping is needed: use single quotes around the template and escape inner double quotes, e.g.: docker inspect --format '{{printf "%s\n" .Name}}' container_id This prints the container name followed by a newline.

Citations:


🏁 Script executed:

# Search for similar docker inspect patterns in the codebase
rg 'docker inspect.*format.*NetworkSettings' -A 2 -B 2

Repository: rophy/OpenLogReplicator

Length of output: 497


🏁 Script executed:

# Also check if there are examples of properly handling multiple networks
rg 'NetworkSettings|--network' tests/ -A 2 -B 2 | head -50

Repository: rophy/OpenLogReplicator

Length of output: 778


🌐 Web query:

docker inspect go template println vs printf newline function

💡 Result:

In Docker inspect Go templates (used with --format), both println and printf are built-in functions from Go's text/template package. println is an alias for fmt.Sprintln: it formats its arguments with default formatting (spaces between non-string args), appends a newline, and outputs the result as a string. printf is an alias for fmt.Sprintf: it formats its arguments according to a format string (e.g., %s, %d), does NOT append a newline automatically, and outputs the result as a string. Key difference for newlines: Use println to automatically add a newline after output. Use printf '%s\n' or similar to add a newline manually. Docker adds extra functions like json, join, lower, upper, split, title, pad, truncate, table (println listed explicitly in docs). No evidence of Sprig library integration. Examples from Docker docs: - println for multi-line: docker inspect --format='{{range.NetworkSettings.Networks}}{{println .IPAddress}}{{end}}' container (each IP on new line) - printf for custom: docker inspect --format='{{printf "%s\n" .Name}}' container (name followed by newline)

Citations:


Output one network name per line before head -1.

The current template {{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{end}} concatenates all network names without separators. If the Oracle container is on multiple networks, $net becomes an invalid concatenated string like "net1net2", causing docker run --network "$net" to fail.

Suggested fix
-    net=$(docker inspect "$ORACLE_CONTAINER" --format '{{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{end}}' | head -1)
+    net=$(docker inspect "$ORACLE_CONTAINER" --format '{{range $k,$v := .NetworkSettings.Networks}}{{printf "%s\n" $k}}{{end}}' | head -1)

Alternatively, {{println $k}} works as a simpler approach that automatically appends a newline.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Get Oracle container's network
local net
net=$(docker inspect "$ORACLE_CONTAINER" --format '{{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{end}}' | head -1)
docker run -d --name "$OLR_CONTAINER" \
--network "$net" \
--network-alias olr \
# Get Oracle container's network
local net
net=$(docker inspect "$ORACLE_CONTAINER" --format '{{range $k,$v := .NetworkSettings.Networks}}{{printf "%s\n" $k}}{{end}}' | head -1)
docker run -d --name "$OLR_CONTAINER" \
--network "$net" \
--network-alias olr \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/debezium/checkpoint-restart-test.sh` around lines 160 - 166, The docker
inspect format currently concatenates network names (in the command that sets
net from ORACLE_CONTAINER), producing invalid values like "net1net2"; update the
format string used in the docker inspect call inside the
tests/debezium/checkpoint-restart-test.sh script so each network name is printed
on its own line (for example by using a template that emits a newline for each
$k or by using println $k), then keep the existing head -1 to select the first
network; refer to the variables/commands net, ORACLE_CONTAINER, and the docker
inspect invocation and ensure the docker run that uses OLR_CONTAINER/--network
reads a single valid network name.

- Preserve OLR logs before container removal on each kill cycle,
  scan all cycle logs in final error check (Critical + Major)
- Turn missing IDs into test failure, not warning (Major)
- Filter virsh output for 192.168.122.* subnet (Minor)
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
tests/debezium/checkpoint-restart-test.sh (1)

160-166: ⚠️ Potential issue | 🟠 Major

Print one network name per line before head -1.

docker inspect ... '{{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{end}}' still concatenates every attached network into one token. If dbz-oracle is on more than one network, net becomes an invalid value like net1net2, and the docker run --network "$net" here fails before the test starts.

♻️ Proposed fix
-    net=$(docker inspect "$ORACLE_CONTAINER" --format '{{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{end}}' | head -1)
+    net=$(docker inspect "$ORACLE_CONTAINER" --format '{{range $k,$v := .NetworkSettings.Networks}}{{printf "%s\n" $k}}{{end}}' | head -1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/debezium/checkpoint-restart-test.sh` around lines 160 - 166, The code
builds the variable net by concatenating all network names into one token,
causing docker run --network "$net" to fail; update the docker inspect
invocation that sets net (used with ORACLE_CONTAINER and later in the docker run
that creates OLR_CONTAINER) to produce one network name per line (e.g., emit
newline-separated network names) and then keep the head -1 usage to select the
first network, ensuring net contains a single valid network name rather than a
concatenation of multiple names.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/debezium/checkpoint-restart-test.sh`:
- Around line 369-376: The current check compares PRE_KILL_SCN to
POST_RESTART_SCN read from the checkpoint file immediately after _launch_olr,
which can be unchanged and give a false positive; instead, after calling
_launch_olr wait for OLR to report its actual startup SCN (e.g., poll OLR’s
status/metrics endpoint or read the startup SCN from OLR logs) and compare that
reported STARTUP_SCN to PRE_KILL_SCN. Replace the immediate _read_checkpoint_scn
usage for POST_RESTART_SCN with a function that queries the running OLR instance
for the SCN it actually opened (or waits until the checkpoint file is updated by
OLR), and use that reported value in the comparison to ensure a real resume
occurred.

In `@tests/sql/environments/rac/debezium/checkpoint-restart-test.sh`:
- Around line 376-382: The current check uses POST_RESTART_SCN from
_read_checkpoint_scn which reads the persisted file (unchanged until OLR writes
a new checkpoint) so it can falsely pass; instead capture the SCN OLR actually
reported/used at startup (e.g., the value _start_olr prints or a helper that
queries the running OLR process) and use that startup SCN for the comparison
against PRE_KILL_SCN; replace the POST_RESTART_SCN = $(_read_checkpoint_scn)
usage with retrieving the OLR-reported startup SCN after calling _start_olr (or
add a _get_olr_startup_scn helper that parses _start_olr output/logs) and then
compare that startup SCN to PRE_KILL_SCN before incrementing
CHECKPOINT_VERIFIED.
- Around line 55-56: The SSH command in _vm_copy_in() currently chains "podman
cp ...; rm -f ..." which masks podman cp failures because rm returns 0; change
the SSH command so the podman cp exit code is preserved (either by using a
logical AND so rm runs only on success, or by capturing podman cp's exit status
into a variable, running rm, then exiting with that saved code). Apply this same
fix to the identical patterns in tests/sql/scripts/drivers/rac.sh,
tests/sql/environments/rac/debezium/run.sh, and
tests/sql/environments/rac/debezium/soak-test.sh so podman cp failures are
propagated instead of being masked.

---

Duplicate comments:
In `@tests/debezium/checkpoint-restart-test.sh`:
- Around line 160-166: The code builds the variable net by concatenating all
network names into one token, causing docker run --network "$net" to fail;
update the docker inspect invocation that sets net (used with ORACLE_CONTAINER
and later in the docker run that creates OLR_CONTAINER) to produce one network
name per line (e.g., emit newline-separated network names) and then keep the
head -1 usage to select the first network, ensuring net contains a single valid
network name rather than a concatenation of multiple names.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9a3ad45f-9c35-4ca9-890c-368be54dddfe

📥 Commits

Reviewing files that changed from the base of the PR and between d82bcc7 and 78750b1.

📒 Files selected for processing (3)
  • tests/debezium/checkpoint-restart-test.sh
  • tests/sql/environments/rac/debezium/checkpoint-restart-test.sh
  • tests/sql/environments/rac/vm-env.sh
✅ Files skipped from review due to trivial changes (1)
  • tests/sql/environments/rac/vm-env.sh

Comment on lines +369 to +376
# Verify checkpoint resume
POST_RESTART_SCN=$(_read_checkpoint_scn)
if [[ "$PRE_KILL_SCN" != "0" && "$POST_RESTART_SCN" != "0" ]]; then
echo " Checkpoint resume: pre-kill=$PRE_KILL_SCN, post-restart=$POST_RESTART_SCN"
if [[ "$POST_RESTART_SCN" -ge "$PRE_KILL_SCN" ]]; then
echo " PASS: resumed from checkpoint"
CHECKPOINT_VERIFIED=$(( CHECKPOINT_VERIFIED + 1 ))
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This checkpoint check can pass without proving a real resume.

POST_RESTART_SCN is read immediately after _launch_olr, before OLR has written any new state. In the normal case that just rereads the same checkpoint JSON that produced PRE_KILL_SCN, so POST_RESTART_SCN >= PRE_KILL_SCN can succeed even if the restarted process actually reopened from an earlier SCN. Please validate against the startup SCN OLR really uses/reports, not the unchanged checkpoint file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/debezium/checkpoint-restart-test.sh` around lines 369 - 376, The
current check compares PRE_KILL_SCN to POST_RESTART_SCN read from the checkpoint
file immediately after _launch_olr, which can be unchanged and give a false
positive; instead, after calling _launch_olr wait for OLR to report its actual
startup SCN (e.g., poll OLR’s status/metrics endpoint or read the startup SCN
from OLR logs) and compare that reported STARTUP_SCN to PRE_KILL_SCN. Replace
the immediate _read_checkpoint_scn usage for POST_RESTART_SCN with a function
that queries the running OLR instance for the SCN it actually opened (or waits
until the checkpoint file is updated by OLR), and use that reported value in the
comparison to ensure a real resume occurred.

Comment on lines +55 to +56
scp $_SSH_OPTS "$local_path" "${VM_USER}@${VM_HOST}:${staging}"
ssh $_SSH_OPTS "${VM_USER}@${VM_HOST}" "podman cp ${staging} ${node}:${container_path}; rm -f ${staging}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Read the file to see the actual code and context
cat -n tests/sql/environments/rac/debezium/checkpoint-restart-test.sh | sed -n '50,65p'

Repository: rophy/OpenLogReplicator

Length of output: 688


🏁 Script executed:

# Also check if there are other similar patterns in the codebase
rg 'ssh.*\$_SSH_OPTS.*".*&&|;.*rm -f' --multiline -A 2 -B 2

Repository: rophy/OpenLogReplicator

Length of output: 3617


🏁 Script executed:

# Verify the exact SSH exit-status behavior documentation/semantics
cat > /tmp/test_ssh_semantics.sh << 'EOF'
# This tests the claim that SSH returns only the last command's exit status
# when multiple commands are joined by semicolon

# Test 1: What does a local shell do with cmd1; cmd2?
echo "=== Local shell test: false; true ==="
bash -c "false; true"
echo "Exit status: $?"

echo ""
echo "=== Local shell test: true; false ==="
bash -c "true; false"
echo "Exit status: $?"

echo ""
echo "=== Demonstrating the problem ==="
echo "Local behavior with semicolon:"
bash -c "echo 'cmd1 fails' >&2; exit 1; echo 'cleanup'; exit 0"
echo "Exit status after cmd1 fails, cmd2 succeeds: $?"

bash -c "exit 1; exit 0"
echo "Exit status of (exit 1; exit 0): $?"
EOF

chmod +x /tmp/test_ssh_semantics.sh
/tmp/test_ssh_semantics.sh

Repository: rophy/OpenLogReplicator

Length of output: 349


Fix the exit-status masking in _vm_copy_in() that hides failed podman cp operations.

With podman cp ${staging} ${node}:${container_path}; rm -f ${staging}, a failed copy returns a non-zero exit code, but since rm -f unconditionally runs next and succeeds (returning 0), the overall SSH exit status is 0. This masks the copy failure, and the script proceeds with a missing or stale file, causing failures downstream that are harder to debug.

The same pattern exists in tests/sql/scripts/drivers/rac.sh, tests/sql/environments/rac/debezium/run.sh, and tests/sql/environments/rac/debezium/soak-test.sh and should be fixed consistently.

Proposed fix
-    ssh $_SSH_OPTS "${VM_USER}@${VM_HOST}" "podman cp ${staging} ${node}:${container_path}; rm -f ${staging}"
+    ssh $_SSH_OPTS "${VM_USER}@${VM_HOST}" "status=0; podman cp ${staging} ${node}:${container_path} || status=\$?; rm -f ${staging}; exit \$status"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
scp $_SSH_OPTS "$local_path" "${VM_USER}@${VM_HOST}:${staging}"
ssh $_SSH_OPTS "${VM_USER}@${VM_HOST}" "podman cp ${staging} ${node}:${container_path}; rm -f ${staging}"
scp $_SSH_OPTS "$local_path" "${VM_USER}@${VM_HOST}:${staging}"
ssh $_SSH_OPTS "${VM_USER}@${VM_HOST}" "status=0; podman cp ${staging} ${node}:${container_path} || status=\$?; rm -f ${staging}; exit \$status"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sql/environments/rac/debezium/checkpoint-restart-test.sh` around lines
55 - 56, The SSH command in _vm_copy_in() currently chains "podman cp ...; rm -f
..." which masks podman cp failures because rm returns 0; change the SSH command
so the podman cp exit code is preserved (either by using a logical AND so rm
runs only on success, or by capturing podman cp's exit status into a variable,
running rm, then exiting with that saved code). Apply this same fix to the
identical patterns in tests/sql/scripts/drivers/rac.sh,
tests/sql/environments/rac/debezium/run.sh, and
tests/sql/environments/rac/debezium/soak-test.sh so podman cp failures are
propagated instead of being masked.

Comment on lines +376 to +382
# Verify it resumed from checkpoint, not from start SCN
POST_RESTART_SCN=$(_read_checkpoint_scn)
if [[ "$PRE_KILL_SCN" != "0" && "$POST_RESTART_SCN" != "0" ]]; then
echo " Checkpoint resume: pre-kill scn=$PRE_KILL_SCN, post-restart scn=$POST_RESTART_SCN"
if [[ "$POST_RESTART_SCN" -ge "$PRE_KILL_SCN" ]]; then
echo " PASS: OLR resumed from checkpoint (not start SCN)"
CHECKPOINT_VERIFIED=$(( CHECKPOINT_VERIFIED + 1 ))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This resume check is reading persisted state, not the SCN OLR reopened from.

POST_RESTART_SCN comes straight from the checkpoint file immediately after _start_olr. Until OLR writes a new checkpoint, that file still contains the pre-kill SCN, so this branch can increment CHECKPOINT_VERIFIED even when the restarted process actually resumed from some earlier position. Please key this assertion off the startup SCN OLR reports/uses, not the unchanged checkpoint file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/sql/environments/rac/debezium/checkpoint-restart-test.sh` around lines
376 - 382, The current check uses POST_RESTART_SCN from _read_checkpoint_scn
which reads the persisted file (unchanged until OLR writes a new checkpoint) so
it can falsely pass; instead capture the SCN OLR actually reported/used at
startup (e.g., the value _start_olr prints or a helper that queries the running
OLR process) and use that startup SCN for the comparison against PRE_KILL_SCN;
replace the POST_RESTART_SCN = $(_read_checkpoint_scn) usage with retrieving the
OLR-reported startup SCN after calling _start_olr (or add a _get_olr_startup_scn
helper that parses _start_olr output/logs) and then compare that startup SCN to
PRE_KILL_SCN before incrementing CHECKPOINT_VERIFIED.

@rophy rophy merged commit d9747b9 into master Mar 24, 2026
2 checks passed
rophy added a commit that referenced this pull request Mar 24, 2026
- Extract VM IP discovery into shared vm-env.sh sourced by all RAC
  scripts (fixes major: standalone scripts bypassed up.sh discovery)
- Fix _read_checkpoint_scn double output when volume doesn't exist
- Add CHKPT_RESULT to single-instance exit code (consistency with RAC)
- Use head -1 for virsh IP detection (handles multiple IPs)
@rophy rophy deleted the test/checkpoint-restart-tests branch March 24, 2026 03:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant