Skip to content

fix(aggregation): hierarchical/leaf Component routing with provenance and warnings#374

Open
bburda wants to merge 16 commits intomainfrom
fix/aggregation-proxy-logs-and-hosts
Open

fix(aggregation): hierarchical/leaf Component routing with provenance and warnings#374
bburda wants to merge 16 commits intomainfrom
fix/aggregation-proxy-logs-and-hosts

Conversation

@bburda
Copy link
Copy Markdown
Collaborator

@bburda bburda commented Apr 13, 2026

Summary

Make the aggregation layer treat Components with the same symmetry as Areas, so collision-merged peer Components work correctly for every deployment shape the project supports (single peer, multi-peer, hierarchical parent, daisy chain).

  • EntityMerger::merge_components now records a provisional routing entry on collision. Every Component ID refers to at most one ECU when it is a leaf, and the peer owns the runtime state.
  • A new pure classify_component_routing free function (aggregation/classification.hpp) runs after all peers have been merged. A Component is classified as a hierarchical parent if any other Component in the merged set references it via parent_component_id; those are removed from the routing table and served locally from the merged cache. Leaves stay routed to the owning peer.
  • Multi-peer leaf collisions are reported as structured LeafCollisionWarning objects. AggregationManager surfaces them on MergedPeerResult.leaf_warnings, writes an RCLCPP_WARN line listing every claimant, and exposes the list via a new /health.warnings array. Routing falls back to last-writer-wins because rejection does not fix a deployment anomaly.
  • Every merged entity (Area, Component, App, Function) now carries an optional x-medkit.contributors list populated during merge (local seeded before the peer loop, peer:<name> appended by EntityMerger with de-duplication). Clients can distinguish local-only, peer-only, and merged views without relying on the single-valued source field. In daisy-chain topologies each hop surfaces only its direct upstream by design.
  • Pre-existing doc comment referencing a third-party commercial product was stripped from sovd_service_interface.hpp.
  • Aggregation design doc now documents the symmetric classification, the multi-peer warning behaviour, and the contributors provenance, with an updated PlantUML merge diagram plus a new post-merge classification diagram.

SOVD API impact

  • No changes to SOVD-defined request paths, response schemas, or error codes.
  • Additive x-medkit extensions only: new optional x-medkit.contributors field on entity responses and a new optional warnings array on /health. Both are emitted only when non-empty / when aggregation is active, so existing clients ignoring unknown fields are unaffected.

Issue


Type

  • Bug fix
  • New feature or tests (classification seam, /health.warnings, contributors, daisy-chain integration test)
  • Breaking change
  • Documentation only (design doc updates)

Testing

All local runs green:

  • ./scripts/test.sh unit --packages-select ros2_medkit_gateway - 1976 tests, 0 failures. Includes:
    • 6 new AggregationClassification tests covering leaf vs hierarchical, single/multi-peer, subcomponents, and warning emission.
    • 4 new EntityMerger contributor tests (local/peer append, dedup, app-prefix-only-peer).
    • Updated merged_components_route_to_peer plus the synthetic-collision test from the previous iteration.
  • ./scripts/test.sh test_peer_aggregation --packages-select ros2_medkit_integration_tests - prior integration coverage still green, 82 tests.
  • ./scripts/test.sh test_daisy_chain_aggregation --packages-select ros2_medkit_integration_tests - new integration test with 3 gateways in isolated DDS domains (primary -> peer_B -> peer_C). Verifies hierarchical parent served locally, 1-hop and 2-hop leaf forwarding, contributors, and empty /health.warnings.
  • ./scripts/test.sh lint --packages-select ros2_medkit_gateway - clean.
  • Sphinx docs build with no new warnings.

Checklist

  • Breaking changes are clearly described (there are none; only additive x-medkit extensions)
  • Tests were added or updated if needed
  • Docs were updated if behavior or public API changed

Copilot AI review requested due to automatic review settings April 13, 2026 19:58
@bburda bburda self-assigned this Apr 13, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes aggregation routing so that when a peer Component collides (same ID) with a locally-present Component and gets merged, sub-resource requests are forwarded to the peer that owns the runtime state—preventing empty local responses on endpoints like /logs, /hosts, /data, and /operations.

Changes:

  • Add routing table entry for collision-merged Components in EntityMerger::merge_components.
  • Update/extend EntityMerger unit tests to assert merged Components route to the peer (including a hybrid synthetic-collision scenario).
  • Refresh aggregation design documentation and remove an unrelated third-party product reference from a plugin doc comment.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
src/ros2_medkit_gateway/src/aggregation/entity_merger.cpp Route collision-merged Component IDs to the peer in the routing table.
src/ros2_medkit_gateway/test/test_entity_merger.cpp Update routing-table expectations and add a hybrid synthetic-collision test.
src/ros2_medkit_gateway/design/aggregation.rst Align merge/routing documentation with ECU-ownership routing behavior for Components.
src/ros2_medkit_plugins/ros2_medkit_sovd_service_interface/src/sovd_service_interface.hpp Remove unrelated third-party reference from the class doc comment.

@bburda bburda marked this pull request as draft April 14, 2026 11:33
bburda added a commit that referenced this pull request Apr 14, 2026
…ification

Add test_aggregation_classification with 6 failing tests that codify the
target semantics before the implementation exists (TDD red phase):

- leaf_collision_keeps_routing_to_peer
- hierarchical_parent_drops_routing_when_local_subcomponents_exist
- hierarchical_parent_drops_routing_when_remote_subcomponents_exist
- hierarchical_parent_drops_routing_across_multiple_peers
- subcomponents_of_hierarchical_parent_still_route_to_peer
- leaf_collision_across_multiple_peers_emits_warning

The tests call a pure classify_component_routing() free function that
takes the fully merged Component vector plus per-peer entity claims and
returns a filtered routing table (hierarchical parents removed) together
with LeafCollisionWarning entries for leaf IDs claimed by >1 peer.
Build will fail on missing classification.hpp - intentional; the
implementation commit introduces the header and source.

Part of PR #374.
bburda added a commit that referenced this pull request Apr 14, 2026
Introduce a pure free function that takes the fully-merged Component
vector plus per-peer Component claims and returns a routing table with
hierarchical parents removed plus a list of multi-peer leaf-collision
warnings. Hierarchical parent = any Component referenced as
parent_component_id by another Component in the set. Leaves stay routed
to their peer (last-writer-wins for routing on collision; warning
enumerates every claiming peer).

Fixes the 6 red tests added in the previous commit.

Part of PR #374.
bburda added a commit that referenced this pull request Apr 14, 2026
…nager

Track per-peer Component IDs during fetch_and_merge_peer_entities and
run classify_component_routing once the merged Component set is final.
Hierarchical-parent routing entries are stripped (served locally with
the merged view, like Areas/Functions); leaves keep routing to their
peer. Multi-peer leaf collisions surface as RCLCPP_WARN log entries
listing every claiming peer, and the structured warnings are returned
on MergedPeerResult.leaf_warnings for downstream consumers (e.g. the
upcoming /health.warnings field).

Existing unit (1976) and peer-aggregation integration (82) suites pass
unchanged - the no-collision and apps-only paths still match the prior
behaviour because classify_component_routing is a no-op for those.

Part of PR #374.
bburda added a commit that referenced this pull request Apr 14, 2026
…warnings

When more than one peer announces the same leaf Component ID,
AggregationManager stores the collision as a structured warning. The
/health endpoint now surfaces these via a top-level warnings array
(x-medkit extension on our own endpoint, not SOVD core). Empty array
when no warnings are active. Operators can pick up deployment anomalies
without tailing logs; routing still falls back to last-writer-wins.

Each warning carries a machine-readable code, human message, the
affected entity_ids (currently a single-element array for forward
compatibility), and the full list of claiming peer_names.

Part of PR #374.
bburda added a commit that referenced this pull request Apr 14, 2026
Single-string source cannot express that an entity was contributed by
both the local gateway and one or more peers. Add a contributors field
on Area, Component, App, and Function models, serialised in each
entity's x-medkit block (only when non-empty, so the field is a pure
additive extension for existing clients). Single-origin entities get a
one-element list for consistency.

Population:
- AggregationManager seeds every local entity with ["local"] before the
  per-peer merge loop.
- EntityMerger appends "peer:<name>" on collisions (Areas, Components,
  Functions) and on remote-only additions; deduplication protects
  against repeated merges with the same peer.
- App collision-prefixed entities receive only the peer contributor
  since the prefixed ID is a distinct entity.

Daisy-chain note: each hop surfaces only its direct upstream
("peer:<name>") - a primary aggregating peer_B sees peer_B's entities
as contributed by peer_B regardless of whether peer_B itself merged
them from peer_C. This is intentional and can be extended later.

Part of PR #374.
bburda added a commit that referenced this pull request Apr 14, 2026
…enance

Refresh the aggregation design document for the extended merge model:

- Expand the Components merge-rules entry to distinguish hierarchical
  parents (served locally with merged view, like Areas) from leaf ECUs
  (routed to the owning peer on collision), with the classification
  happening post-merge so sub-components arriving from any peer unlock
  parent behaviour.
- Describe multi-peer leaf-collision warnings, last-writer-wins
  routing, and the /health.warnings surface.
- Update the routing-table description to match the new taxonomy.
- Add a new section describing x-medkit.contributors provenance, how
  it is populated, and how it behaves in daisy-chain topologies.
- Add a post-merge Component classification activity diagram alongside
  the per-peer merge rules diagram; the per-peer diagram now shows the
  "provisional routing entry" and contributor-append steps.

Part of PR #374.
bburda added a commit that referenced this pull request Apr 14, 2026
…egation

Launch three gateways in separate DDS domains (primary -> peer_B ->
peer_C), each with a manifest that declares a shared hierarchical
parent 'robot-x' plus a single leaf ECU sub-component (ecu-a / ecu-b /
ecu-c). Manifests are written to tmp files at import time so launch
descriptions reference stable paths.

Validates end-to-end:
- Primary sees all three ECU sub-components under robot-x, including
  ecu-c which arrives via transitive aggregation through peer_B.
- GET /components/robot-x returns a locally-served response with
  x-medkit.contributors listing "local" and "peer:peer_b" (each hop
  surfaces only its direct upstream; peer_c's contribution is opaque
  to the primary).
- GET /components/ecu-b succeeds via 1-hop forwarding to peer_B.
- GET /components/ecu-c succeeds via 2-hop forwarding primary ->
  peer_B -> peer_C.
- GET /health on primary returns a peers list plus an empty warnings
  array (no leaf-id collisions in this topology).

Also surfaces Component contributors in the detail handler's x-medkit
block, which was missing because handle_get_component builds its
x-medkit via XMedkit rather than the Component::to_json() path used by
list/reference endpoints.

Part of PR #374.
@bburda bburda changed the title fix(aggregation): forward sub-resource requests for collision-merged peer components fix(aggregation): hierarchical/leaf Component routing with provenance and warnings Apr 14, 2026
@bburda bburda requested a review from mfaferek93 April 14, 2026 16:08
bburda added a commit that referenced this pull request Apr 14, 2026
…nistic output

Review follow-ups for PR #374 addressing correctness and determinism
gaps found in deep-review:

- Previously x-medkit.contributors was emitted only on GET /components/{id}
  because the Component detail handler builds x-medkit manually via the
  XMedkit builder. Area/App/Function detail handlers silently dropped the
  field despite the models carrying it. Introduce XMedkit::contributors()
  helper and call it from all four detail handlers so list and detail
  endpoints expose provenance consistently.

- Classification output is now deterministic: leaf_warnings are sorted by
  entity_id, each warning's peer_names list is sorted alphabetically,
  and per-peer claim iteration is driven by a sorted copy of the
  unordered_set. Operator logs and /health.warnings no longer flap
  between runs or after hashmap rehashes.

- Contributors are presented in stable order ("local" first when present,
  then peer:<name> entries alphabetically) via a sorted_contributors()
  helper in discovery/models/common.hpp. Called from every entity
  to_json() and from XMedkit::contributors(), so list and detail
  responses agree regardless of how peers were merged.

- Introduce warning_codes.hpp with WARN_LEAF_ID_COLLISION constant
  mirroring the error_codes.hpp pattern. Inline string literal in
  health_handlers.cpp replaced with the constant.

- Leaf-collision warning message now names the colliding peers and
  includes a concrete remediation hint (rename on one side, or model
  as hierarchical parent) instead of only describing the symptom.

- Add aggregation capability flag to handle_root so clients can
  feature-detect that /health.warnings and x-medkit.contributors may
  be present. /health.warnings is always an array when aggregation
  is enabled (empty when no anomalies), so clients avoid tri-state
  absent/empty/populated branching.

- OpenAPI health_schema() now declares top-level peers and warnings
  fields (with structured warning object schema for code/message/
  entity_ids/peer_names). Previously these were emitted but absent
  from the spec - breaking client generators.

Tests: 1980 unit + daisy-chain integration still green. Deeper
integration coverage (leaf collision path, capability assertion,
contributors on non-Component entities) follows in a separate commit.

Part of PR #374.
bburda added a commit that referenced this pull request Apr 14, 2026
… refresh docs

Integration test additions (daisy chain):
- capability flag assertions on primary (aggregation: true) and on
  peer_c (aggregation: false)
- contributors assertion on a leaf Component detail (ordering starts
  with 'local' per the new stable-order rule)
- contributors assertion on an App detail endpoint (regression guard
  for the handler-level fix that went into the previous commit)
- addClassCleanup registers os.unlink for each tmp manifest so CI
  no longer leaks three YAML files per run

Requirement traceability:
- @verifies REQ_INTEROP_003 on all 6 AggregationClassification tests,
  on the 4 new EntityMerger contributors tests, and on the 4 main
  daisy-chain integration assertions (docstring-based for Python so
  launch_testing introspection preserves them)

Design doc refresh:
- Clarify the Provenance section: EntityMerger appends contributors
  unconditionally on Component collisions - classification happens
  later and only rewrites the routing table
- Document contributors stable ordering (local first, then peer:*)
- Expand Health Monitoring subsection with /health.warnings and peers
  shapes, cross-link to classification, point operators at
  warning_codes.hpp
- Add classify_component_routing to the Key Classes section

Part of PR #374.
bburda added a commit that referenced this pull request Apr 14, 2026
The EntityDetail OpenAPI component is shared by Area, Component, App,
and Function detail endpoints. Previously it had no x-medkit field
declared at all, which meant generated clients silently dropped the
new x-medkit.contributors aggregation provenance array.

Add an x-medkit object schema on EntityDetail with an explicit
contributors property (typed array of strings, with a description
covering the "local" vs "peer:<name>" semantics and the stable sort
order). additionalProperties remains true so that existing x-medkit
fields not yet typed at the schema level (entityType, namespace,
source, etc.) continue to round-trip cleanly.

Part of PR #374.
bburda added 13 commits April 14, 2026 18:27
When a peer component collides with a local one (e.g. hybrid mode
creates a synthetic component that matches a peer's real component),
the merged entity was not added to the routing table. This caused
sub-resource requests (logs, hosts, data) to be handled locally
instead of forwarded to the peer, returning empty results.

Add routing_table_[merged.id] = peer_name_ in the collision branch
of merge_components so that validate_entity_for_route forwards
requests to the owning peer.
Replace the test that asserted merged Components stay out of the
routing table (which encoded the prior incorrect "shared ownership"
model) with one that asserts the opposite: merged Components must
route to the peer that owns the ECU's runtime state. Add a dedicated
test for the hybrid-mode synthetic-collision scenario that motivates
the fix.

Also refine the inline comment in merge_components to spell out the
ECU-ownership rationale instead of describing symptoms.
Update the merge-rules summary, the merge-logic PlantUML diagram, and
the routing-table description to reflect that a Component ID
identifies a single physical ECU. On collision the peer is always the
authoritative owner, so every request for the merged Component -
including the detail endpoint and all sub-resources - is forwarded to
that peer. The primary's local view is limited to aggregating the
Component's presence in discovery listings.
…ification

Add test_aggregation_classification with 6 failing tests that codify the
target semantics before the implementation exists (TDD red phase):

- leaf_collision_keeps_routing_to_peer
- hierarchical_parent_drops_routing_when_local_subcomponents_exist
- hierarchical_parent_drops_routing_when_remote_subcomponents_exist
- hierarchical_parent_drops_routing_across_multiple_peers
- subcomponents_of_hierarchical_parent_still_route_to_peer
- leaf_collision_across_multiple_peers_emits_warning

The tests call a pure classify_component_routing() free function that
takes the fully merged Component vector plus per-peer entity claims and
returns a filtered routing table (hierarchical parents removed) together
with LeafCollisionWarning entries for leaf IDs claimed by >1 peer.
Build will fail on missing classification.hpp - intentional; the
implementation commit introduces the header and source.

Part of PR #374.
Introduce a pure free function that takes the fully-merged Component
vector plus per-peer Component claims and returns a routing table with
hierarchical parents removed plus a list of multi-peer leaf-collision
warnings. Hierarchical parent = any Component referenced as
parent_component_id by another Component in the set. Leaves stay routed
to their peer (last-writer-wins for routing on collision; warning
enumerates every claiming peer).

Fixes the 6 red tests added in the previous commit.

Part of PR #374.
…nager

Track per-peer Component IDs during fetch_and_merge_peer_entities and
run classify_component_routing once the merged Component set is final.
Hierarchical-parent routing entries are stripped (served locally with
the merged view, like Areas/Functions); leaves keep routing to their
peer. Multi-peer leaf collisions surface as RCLCPP_WARN log entries
listing every claiming peer, and the structured warnings are returned
on MergedPeerResult.leaf_warnings for downstream consumers (e.g. the
upcoming /health.warnings field).

Existing unit (1976) and peer-aggregation integration (82) suites pass
unchanged - the no-collision and apps-only paths still match the prior
behaviour because classify_component_routing is a no-op for those.

Part of PR #374.
…warnings

When more than one peer announces the same leaf Component ID,
AggregationManager stores the collision as a structured warning. The
/health endpoint now surfaces these via a top-level warnings array
(x-medkit extension on our own endpoint, not SOVD core). Empty array
when no warnings are active. Operators can pick up deployment anomalies
without tailing logs; routing still falls back to last-writer-wins.

Each warning carries a machine-readable code, human message, the
affected entity_ids (currently a single-element array for forward
compatibility), and the full list of claiming peer_names.

Part of PR #374.
Single-string source cannot express that an entity was contributed by
both the local gateway and one or more peers. Add a contributors field
on Area, Component, App, and Function models, serialised in each
entity's x-medkit block (only when non-empty, so the field is a pure
additive extension for existing clients). Single-origin entities get a
one-element list for consistency.

Population:
- AggregationManager seeds every local entity with ["local"] before the
  per-peer merge loop.
- EntityMerger appends "peer:<name>" on collisions (Areas, Components,
  Functions) and on remote-only additions; deduplication protects
  against repeated merges with the same peer.
- App collision-prefixed entities receive only the peer contributor
  since the prefixed ID is a distinct entity.

Daisy-chain note: each hop surfaces only its direct upstream
("peer:<name>") - a primary aggregating peer_B sees peer_B's entities
as contributed by peer_B regardless of whether peer_B itself merged
them from peer_C. This is intentional and can be extended later.

Part of PR #374.
…enance

Refresh the aggregation design document for the extended merge model:

- Expand the Components merge-rules entry to distinguish hierarchical
  parents (served locally with merged view, like Areas) from leaf ECUs
  (routed to the owning peer on collision), with the classification
  happening post-merge so sub-components arriving from any peer unlock
  parent behaviour.
- Describe multi-peer leaf-collision warnings, last-writer-wins
  routing, and the /health.warnings surface.
- Update the routing-table description to match the new taxonomy.
- Add a new section describing x-medkit.contributors provenance, how
  it is populated, and how it behaves in daisy-chain topologies.
- Add a post-merge Component classification activity diagram alongside
  the per-peer merge rules diagram; the per-peer diagram now shows the
  "provisional routing entry" and contributor-append steps.

Part of PR #374.
…egation

Launch three gateways in separate DDS domains (primary -> peer_B ->
peer_C), each with a manifest that declares a shared hierarchical
parent 'robot-x' plus a single leaf ECU sub-component (ecu-a / ecu-b /
ecu-c). Manifests are written to tmp files at import time so launch
descriptions reference stable paths.

Validates end-to-end:
- Primary sees all three ECU sub-components under robot-x, including
  ecu-c which arrives via transitive aggregation through peer_B.
- GET /components/robot-x returns a locally-served response with
  x-medkit.contributors listing "local" and "peer:peer_b" (each hop
  surfaces only its direct upstream; peer_c's contribution is opaque
  to the primary).
- GET /components/ecu-b succeeds via 1-hop forwarding to peer_B.
- GET /components/ecu-c succeeds via 2-hop forwarding primary ->
  peer_B -> peer_C.
- GET /health on primary returns a peers list plus an empty warnings
  array (no leaf-id collisions in this topology).

Also surfaces Component contributors in the detail handler's x-medkit
block, which was missing because handle_get_component builds its
x-medkit via XMedkit rather than the Component::to_json() path used by
list/reference endpoints.

Part of PR #374.
…nistic output

Review follow-ups for PR #374 addressing correctness and determinism
gaps found in deep-review:

- Previously x-medkit.contributors was emitted only on GET /components/{id}
  because the Component detail handler builds x-medkit manually via the
  XMedkit builder. Area/App/Function detail handlers silently dropped the
  field despite the models carrying it. Introduce XMedkit::contributors()
  helper and call it from all four detail handlers so list and detail
  endpoints expose provenance consistently.

- Classification output is now deterministic: leaf_warnings are sorted by
  entity_id, each warning's peer_names list is sorted alphabetically,
  and per-peer claim iteration is driven by a sorted copy of the
  unordered_set. Operator logs and /health.warnings no longer flap
  between runs or after hashmap rehashes.

- Contributors are presented in stable order ("local" first when present,
  then peer:<name> entries alphabetically) via a sorted_contributors()
  helper in discovery/models/common.hpp. Called from every entity
  to_json() and from XMedkit::contributors(), so list and detail
  responses agree regardless of how peers were merged.

- Introduce warning_codes.hpp with WARN_LEAF_ID_COLLISION constant
  mirroring the error_codes.hpp pattern. Inline string literal in
  health_handlers.cpp replaced with the constant.

- Leaf-collision warning message now names the colliding peers and
  includes a concrete remediation hint (rename on one side, or model
  as hierarchical parent) instead of only describing the symptom.

- Add aggregation capability flag to handle_root so clients can
  feature-detect that /health.warnings and x-medkit.contributors may
  be present. /health.warnings is always an array when aggregation
  is enabled (empty when no anomalies), so clients avoid tri-state
  absent/empty/populated branching.

- OpenAPI health_schema() now declares top-level peers and warnings
  fields (with structured warning object schema for code/message/
  entity_ids/peer_names). Previously these were emitted but absent
  from the spec - breaking client generators.

Tests: 1980 unit + daisy-chain integration still green. Deeper
integration coverage (leaf collision path, capability assertion,
contributors on non-Component entities) follows in a separate commit.

Part of PR #374.
… refresh docs

Integration test additions (daisy chain):
- capability flag assertions on primary (aggregation: true) and on
  peer_c (aggregation: false)
- contributors assertion on a leaf Component detail (ordering starts
  with 'local' per the new stable-order rule)
- contributors assertion on an App detail endpoint (regression guard
  for the handler-level fix that went into the previous commit)
- addClassCleanup registers os.unlink for each tmp manifest so CI
  no longer leaks three YAML files per run

Requirement traceability:
- @verifies REQ_INTEROP_003 on all 6 AggregationClassification tests,
  on the 4 new EntityMerger contributors tests, and on the 4 main
  daisy-chain integration assertions (docstring-based for Python so
  launch_testing introspection preserves them)

Design doc refresh:
- Clarify the Provenance section: EntityMerger appends contributors
  unconditionally on Component collisions - classification happens
  later and only rewrites the routing table
- Document contributors stable ordering (local first, then peer:*)
- Expand Health Monitoring subsection with /health.warnings and peers
  shapes, cross-link to classification, point operators at
  warning_codes.hpp
- Add classify_component_routing to the Key Classes section

Part of PR #374.
The EntityDetail OpenAPI component is shared by Area, Component, App,
and Function detail endpoints. Previously it had no x-medkit field
declared at all, which meant generated clients silently dropped the
new x-medkit.contributors aggregation provenance array.

Add an x-medkit object schema on EntityDetail with an explicit
contributors property (typed array of strings, with a description
covering the "local" vs "peer:<name>" semantics and the stable sort
order). additionalProperties remains true so that existing x-medkit
fields not yet typed at the schema level (entityType, namespace,
source, etc.) continue to round-trip cleanly.

Part of PR #374.
@bburda bburda force-pushed the fix/aggregation-proxy-logs-and-hosts branch from 6fecf30 to 0e0aef3 Compare April 14, 2026 16:35
@bburda bburda marked this pull request as ready for review April 14, 2026 19:04
@bburda bburda requested a review from Copilot April 14, 2026 19:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 7 comments.

Comment thread src/ros2_medkit_gateway/src/openapi/schema_builder.cpp
Comment thread src/ros2_medkit_gateway/src/http/handlers/health_handlers.cpp
Comment thread src/ros2_medkit_gateway/src/http/handlers/health_handlers.cpp
Seven inline review comments from Copilot, all valid after verifying
against the actual diff:

- area.hpp relied on sorted_contributors() (defined in common.hpp) from
  its inline to_json() without including the header. component.hpp and
  function.hpp were already fine; add the include to area.hpp so every
  public model header is self-contained regardless of include order.
- get_test_domain_id() previously collapsed every non-zero offset into
  a single derived secondary domain via (base - 140) % 3, so using
  offsets 2 and 3 in the daisy-chain test produced the SAME DDS domain
  for peer_b and peer_c (the earlier docstring was also wrong about
  this). Return 229 + offset so offsets 1..3 map one-to-one to
  domains 230..232, and reject offsets outside that range with a clear
  ValueError - the DDS maximum is 232 so silently exceeding it must
  fail loudly. Pair this with a CTest RESOURCE_LOCK on
  test_peer_aggregation and test_daisy_chain_aggregation so the two
  multi-gateway tests never share a secondary domain in parallel runs.
- test_leaf_component_detail_contributors_single_peer had a leading
  comment that claimed the assertion was about contributors equalling
  ["peer:peer_b"] while the actual check was for "local" first. Rewrite
  the docstring so it matches the forwarded-response semantics the test
  actually pins (peer_b's own view serialised verbatim by the primary).
- docs/api/warning_codes.rst did not exist, yet two header/source
  comments pointed at it. Add the page with the stable warning-code
  table (currently just leaf_id_collision) plus remediation guidance,
  link it from the API reference index, and point warning_codes.hpp +
  schema_builder.cpp at a now-valid canonical location.
- docs/api/rest.rst did not document the new /health.warnings array or
  the capabilities.aggregation flag. Add both: root-endpoint capability
  gets an example and a one-paragraph description; /health grows an
  aggregation-aware example response showing peers and warnings with a
  realistic leaf_id_collision payload.

Tests: 2015 unit + integration still green locally; Sphinx docs build
with -W warnings-as-errors stays clean.

Part of PR #374.
@bburda bburda requested a review from Copilot April 14, 2026 19:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated 2 comments.

Comment thread docs/api/rest.rst Outdated
Comment thread src/ros2_medkit_integration_tests/CMakeLists.txt
…ing RESOURCE_LOCK

Two additional valid findings on top of 73d1433:

- docs/api/rest.rst described capabilities.aggregation as "true when
  the gateway has one or more aggregation peers configured", but the
  implementation gates it on AggregationManager presence (i.e. the
  subsystem is enabled regardless of peer count). A gateway with
  aggregation.enabled=true and zero peers still sets the flag to true
  and still emits peers/warnings on /health. Rephrase the docs so the
  semantics match the code and callers understand that peers may be
  empty while the flag is still true.
- test_cross_ecu_fanout also calls get_test_domain_id(1) and therefore
  claims domain 230, same as test_peer_aggregation. The CTest
  RESOURCE_LOCK medkit_secondary_dds_domains that 73d1433 added only
  covered the two multi-gateway tests I remembered; add
  test_cross_ecu_fanout to the same lock so it is serialised against
  the rest and cannot race for the shared domain under parallel CTest
  runs. Comment updated to remind future additions to list every test
  that calls get_test_domain_id with a non-zero offset here.

Part of PR #374.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated 2 comments.

Comment on lines +142 to +147
set_tests_properties(
test_peer_aggregation
test_cross_ecu_fanout
test_daisy_chain_aggregation
PROPERTIES RESOURCE_LOCK medkit_secondary_dds_domains
)
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RESOURCE_LOCK is likely not being applied because the integration test targets are named from get_filename_component(... NAME_WE) on *.test.py files, which yields names like test_peer_aggregation.test, not test_peer_aggregation. As written, set_tests_properties(test_peer_aggregation ...) won’t match any test, so multi-gateway tests can still run concurrently and collide on the shared secondary DDS domains (230–232). Update the list to use the actual CTest target names (e.g. test_peer_aggregation.test, test_cross_ecu_fanout.test, test_daisy_chain_aggregation.test) or change the test_name derivation to strip both .test and .py consistently (and then use those names here).

Copilot uses AI. Check for mistakes.
Comment on lines +114 to +125
// Stable presentation order: "local" first (when present), then "peer:<name>"
// entries alphabetically. Lets UI badges and snapshot tests rely on a
// consistent order regardless of how peers were merged internally.
std::vector<std::string> sorted(contributors.begin(), contributors.end());
std::sort(sorted.begin(), sorted.end(), [](const std::string & a, const std::string & b) {
const bool a_local = (a == "local");
const bool b_local = (b == "local");
if (a_local != b_local) {
return a_local;
}
return a < b;
});
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XMedkit::contributors() re-implements the same ordering logic as sorted_contributors() in discovery/models/common.hpp. Since list responses use sorted_contributors() while detail handlers use XMedkit::contributors(), keeping two independent implementations risks the ordering drifting and producing inconsistent API output over time. Consider reusing a single helper/comparator for both paths (e.g. have XMedkit::contributors() call sorted_contributors() or move the sort routine to a shared utility header used by both).

Copilot uses AI. Check for mistakes.
std::unordered_set<std::string> hierarchical_parents;
for (const auto & comp : merged_components) {
if (!comp.parent_component_id.empty()) {
hierarchical_parents.insert(comp.parent_component_id);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major - no validation on parent_component_id. This insert fires unconditionally for any non-empty parent ID. Three malformed inputs produce silent misbehavior (round-2 traced end-to-end):

  1. Nonexistent parent (child.parent_component_id = "ghost"): "ghost" lands in hierarchical_parents set, gets excluded from routing_table (line 45-47). No Component object exists. Client GET /components/ghost → 404 via validate_entity_for_route (handler_context.cpp:254-267). The child itself is correctly routed, but the parent reference is a dangling pointer in the response.

  2. Cycle A↔B (A.parent = B, B.parent = A): both IDs enter hierarchical_parents, neither enters routing_table. Both served from local merged cache (both exist there), never forwarded to their owning peer. Semantically wrong: if peer_b owns B, client reading /components/B hits the local stub instead of the real peer data.

  3. Self-parent (A.parent = A): A inserted as its own hierarchical parent, excluded from routing, served locally. Breaks the invariant that hierarchical parents aggregate their children.

Pre-classification sanitization also absent at peer_client.cpp:180 (JSON deserialization) and entity_merger.cpp. Zero guard anywhere.

Suggested fix:

std::unordered_set<std::string> known_ids;
for (const auto & comp : merged_components) known_ids.insert(comp.id);

for (const auto & comp : merged_components) {
  const auto & p = comp.parent_component_id;
  if (p.empty()) continue;
  if (p == comp.id) {
    RCLCPP_WARN(..., "Component '%s' declares itself as parent, ignoring", p.c_str());
    continue;
  }
  if (known_ids.count(p) == 0u) {
    RCLCPP_WARN(..., "Component '%s' parent_component_id='%s' not in merged set", comp.id.c_str(), p.c_str());
    continue;
  }
  hierarchical_parents.insert(p);
}
// BFS/DFS cycle detection pass here.

Plus unit tests: parent_references_nonexistent_component, two_way_parent_cycle, self_parent.

// Per-peer Component IDs each peer claimed, fed into classify_component_routing
// after the loop so that hierarchical-vs-leaf classification sees the full
// merged Component set (including subcomponents from other peers).
std::vector<PeerClaim> peer_component_claims;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major - routing destination flips silently across merge snapshots under peer health churn.

peer_component_claims is built by iterating futures (line 422 context below) in insertion order of the healthy-peer snapshot at line 371-374. is_healthy() filter silently drops failing peers. The "last writer" in classification.cpp:46 (result.routing_table[id] = pc.peer_name;) is whichever peer sits last in this vector.

Concrete scenario (round-2 traced):

  • Merge N: peers [peer_b, peer_c] both claim ecu-shared. Last writer = peer_c. /health.warnings surfaces both claimants. Routing destination = peer_c.
  • Merge N+1: peer_c fails health check, dropped from snapshot. peer_component_claims = [peer_b]. Last writer = peer_b. Only 1 claimant = no collision = no warning. Routing destination silently flipped from peer_c to peer_b.
  • Operator sees /health go from "collision warning present" to "warnings: []" with no signal that routing destination changed.

This is worse than pure nondeterminism: the warning mechanism itself can hide the flip.

Three options:

  1. Sticky routing: remember last-chosen peer across merges, only flip if that peer becomes unhealthy.
  2. Deterministic ordering: std::sort(peer_component_claims.begin(), ..., by peer_name) before classification so the "last writer" is reproducible.
  3. Emit routing-change event: new warning code WARN_ROUTING_DESTINATION_CHANGED whenever a routing entry for a given entity_id differs from the previous merge's entry.

At minimum, update design/aggregation.rst:291 to call out cross-snapshot instability and the warning-disappearance edge case so operators can plan around it.

{"code", WARN_LEAF_ID_COLLISION},
{"message", std::move(message)},
{"entity_ids", json::array({w.entity_id})},
{"peer_names", w.peer_names},
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor - peer names leak on unauthenticated endpoint. Round-2 verified: auth.enabled defaults to false at gateway_node.cpp:169. Anyone reachable to /health enumerates every peer name in the deployment. Even with auth enabled, /health is in VIEWER/OPERATOR/CONFIGURATOR roles so any authenticated user sees the names.

Fine in controlled LAN deployments, leaks topology in shared-infra / multi-tenant setups. Since design intent is "operator-visible", that's OK - but please document explicitly in docs/api/rest.rst so operators know peer names on this endpoint are public-by-default.

If you want to gate it: wrap the warnings array in the same auth config that gates admin-adjacent info, while leaving the top-level status/uptime unauthenticated for classical health probes.

/// Resolve by renaming the Component on one side or by modelling it as a
/// hierarchical parent (declare a child Component with
/// ``parentComponentId`` pointing at the colliding ID on the owning peer).
inline constexpr const char * WARN_LEAF_ID_COLLISION = "leaf_id_collision";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor - no schema version for the warnings array. Codes are stable strings (good), but when the second code ships, clients have no contract to feature-detect "supports new_code_X" without string-matching. capabilities.aggregation is a boolean — too coarse.

Suggested: add response["warning_schema_version"] = 1; in health_handlers.cpp, bump on every new code added, document in warning_codes.rst. Typed clients (MCP, Web UI, Foxglove) can then make the "known codes" set explicit and log+degrade on unknown codes per version.


// Routing exists - last-writer-wins (peer_c processed last here).
ASSERT_EQ(result.routing_table.count("ecu-shared"), 1u);
EXPECT_EQ(result.routing_table.at("ecu-shared"), "peer_c");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor - collision test is unidirectional. Input is hardcoded [peer_b, peer_c] at line 184-185, asserted routing destination is peer_c (last). Flipping input to [peer_c, peer_b] should route to peer_b. That flip case is not tested, so a regression that breaks last-writer semantics (e.g., accidentally sorting claims by peer_name before classification) wouldn't be caught.

Add a parametrized or second test:

TEST(AggregationClassification, leaf_collision_routing_respects_input_order) {
  // [peer_c, peer_b] → routing must go to peer_b
  ...
}

Also missing from this file (round-2 confirmed all 6 test names): two_way_parent_cycle, self_parent_ignored, parent_references_nonexistent_component, empty_merged_set, four_level_nesting. See companion Major comment on classification.cpp:29.

# Sub-components (with parent_component_id) appear under the parent's
# /subcomponents endpoint, not the top-level /components list.
cls._wait_for_subcomponents(
PRIMARY_URL, 'robot-x', {'ecu-a', 'ecu-b', 'ecu-c'}, 'primary',
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor - no end-to-end negative collision scenario in daisy-chain tests. All 10 test methods are positive paths. robot-x collision is intentional hierarchical merge (not an error case), and line 275 asserts body['warnings'] == [] for healthy chain.

Recommend a dedicated test: configure peer_b AND peer_c to both claim a new leaf ecu-shared (no parent_component_id), verify from primary:

  1. /components/ecu-shared returns 200 and routes somewhere (either peer_b or peer_c).
  2. /health.warnings contains a leaf_id_collision entry listing both peer_b and peer_c under peer_names.
  3. x-medkit.contributors for the routed response reflects the resolving peer (not both).

This exercises the actual wire-level warning path under the real daisy-chain topology that unit tests can't simulate. Also covers the nondeterminism concern on aggregation_manager.cpp:419 via a follow-up test that kills one of the colliding peers and re-asserts warnings disappearance + routing flip (per the Major on that line).

}

// @verifies REQ_INTEROP_003
TEST(EntityMerger, contributors_no_duplicate_on_repeat_merge) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor - contributor presentation order not asserted on the user-visible path. This test verifies dedup in the internal contributors_ vector, which is correct. But clients actually see the output of XMedkit::contributors() in x_medkit.cpp:110-128, which sorts ["local", peers alphabetical].

Suggest one additional test that exercises the sorted output: inputs arriving in reverse order (e.g., ["peer:zulu", "peer:alpha", "local"]) should serialize as ["local", "peer:alpha", "peer:zulu"]. Otherwise a regression flipping the sort direction wouldn't be caught by unit tests — only by an integration test that happens to inspect the exact contributor order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Collision-merged peer components return empty runtime data on primary

3 participants