fix(aggregation): hierarchical/leaf Component routing with provenance and warnings#374
fix(aggregation): hierarchical/leaf Component routing with provenance and warnings#374
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes aggregation routing so that when a peer Component collides (same ID) with a locally-present Component and gets merged, sub-resource requests are forwarded to the peer that owns the runtime state—preventing empty local responses on endpoints like /logs, /hosts, /data, and /operations.
Changes:
- Add routing table entry for collision-merged Components in
EntityMerger::merge_components. - Update/extend
EntityMergerunit tests to assert merged Components route to the peer (including a hybrid synthetic-collision scenario). - Refresh aggregation design documentation and remove an unrelated third-party product reference from a plugin doc comment.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/ros2_medkit_gateway/src/aggregation/entity_merger.cpp |
Route collision-merged Component IDs to the peer in the routing table. |
src/ros2_medkit_gateway/test/test_entity_merger.cpp |
Update routing-table expectations and add a hybrid synthetic-collision test. |
src/ros2_medkit_gateway/design/aggregation.rst |
Align merge/routing documentation with ECU-ownership routing behavior for Components. |
src/ros2_medkit_plugins/ros2_medkit_sovd_service_interface/src/sovd_service_interface.hpp |
Remove unrelated third-party reference from the class doc comment. |
…ification Add test_aggregation_classification with 6 failing tests that codify the target semantics before the implementation exists (TDD red phase): - leaf_collision_keeps_routing_to_peer - hierarchical_parent_drops_routing_when_local_subcomponents_exist - hierarchical_parent_drops_routing_when_remote_subcomponents_exist - hierarchical_parent_drops_routing_across_multiple_peers - subcomponents_of_hierarchical_parent_still_route_to_peer - leaf_collision_across_multiple_peers_emits_warning The tests call a pure classify_component_routing() free function that takes the fully merged Component vector plus per-peer entity claims and returns a filtered routing table (hierarchical parents removed) together with LeafCollisionWarning entries for leaf IDs claimed by >1 peer. Build will fail on missing classification.hpp - intentional; the implementation commit introduces the header and source. Part of PR #374.
Introduce a pure free function that takes the fully-merged Component vector plus per-peer Component claims and returns a routing table with hierarchical parents removed plus a list of multi-peer leaf-collision warnings. Hierarchical parent = any Component referenced as parent_component_id by another Component in the set. Leaves stay routed to their peer (last-writer-wins for routing on collision; warning enumerates every claiming peer). Fixes the 6 red tests added in the previous commit. Part of PR #374.
…nager Track per-peer Component IDs during fetch_and_merge_peer_entities and run classify_component_routing once the merged Component set is final. Hierarchical-parent routing entries are stripped (served locally with the merged view, like Areas/Functions); leaves keep routing to their peer. Multi-peer leaf collisions surface as RCLCPP_WARN log entries listing every claiming peer, and the structured warnings are returned on MergedPeerResult.leaf_warnings for downstream consumers (e.g. the upcoming /health.warnings field). Existing unit (1976) and peer-aggregation integration (82) suites pass unchanged - the no-collision and apps-only paths still match the prior behaviour because classify_component_routing is a no-op for those. Part of PR #374.
…warnings When more than one peer announces the same leaf Component ID, AggregationManager stores the collision as a structured warning. The /health endpoint now surfaces these via a top-level warnings array (x-medkit extension on our own endpoint, not SOVD core). Empty array when no warnings are active. Operators can pick up deployment anomalies without tailing logs; routing still falls back to last-writer-wins. Each warning carries a machine-readable code, human message, the affected entity_ids (currently a single-element array for forward compatibility), and the full list of claiming peer_names. Part of PR #374.
Single-string source cannot express that an entity was contributed by
both the local gateway and one or more peers. Add a contributors field
on Area, Component, App, and Function models, serialised in each
entity's x-medkit block (only when non-empty, so the field is a pure
additive extension for existing clients). Single-origin entities get a
one-element list for consistency.
Population:
- AggregationManager seeds every local entity with ["local"] before the
per-peer merge loop.
- EntityMerger appends "peer:<name>" on collisions (Areas, Components,
Functions) and on remote-only additions; deduplication protects
against repeated merges with the same peer.
- App collision-prefixed entities receive only the peer contributor
since the prefixed ID is a distinct entity.
Daisy-chain note: each hop surfaces only its direct upstream
("peer:<name>") - a primary aggregating peer_B sees peer_B's entities
as contributed by peer_B regardless of whether peer_B itself merged
them from peer_C. This is intentional and can be extended later.
Part of PR #374.
…enance Refresh the aggregation design document for the extended merge model: - Expand the Components merge-rules entry to distinguish hierarchical parents (served locally with merged view, like Areas) from leaf ECUs (routed to the owning peer on collision), with the classification happening post-merge so sub-components arriving from any peer unlock parent behaviour. - Describe multi-peer leaf-collision warnings, last-writer-wins routing, and the /health.warnings surface. - Update the routing-table description to match the new taxonomy. - Add a new section describing x-medkit.contributors provenance, how it is populated, and how it behaves in daisy-chain topologies. - Add a post-merge Component classification activity diagram alongside the per-peer merge rules diagram; the per-peer diagram now shows the "provisional routing entry" and contributor-append steps. Part of PR #374.
…egation Launch three gateways in separate DDS domains (primary -> peer_B -> peer_C), each with a manifest that declares a shared hierarchical parent 'robot-x' plus a single leaf ECU sub-component (ecu-a / ecu-b / ecu-c). Manifests are written to tmp files at import time so launch descriptions reference stable paths. Validates end-to-end: - Primary sees all three ECU sub-components under robot-x, including ecu-c which arrives via transitive aggregation through peer_B. - GET /components/robot-x returns a locally-served response with x-medkit.contributors listing "local" and "peer:peer_b" (each hop surfaces only its direct upstream; peer_c's contribution is opaque to the primary). - GET /components/ecu-b succeeds via 1-hop forwarding to peer_B. - GET /components/ecu-c succeeds via 2-hop forwarding primary -> peer_B -> peer_C. - GET /health on primary returns a peers list plus an empty warnings array (no leaf-id collisions in this topology). Also surfaces Component contributors in the detail handler's x-medkit block, which was missing because handle_get_component builds its x-medkit via XMedkit rather than the Component::to_json() path used by list/reference endpoints. Part of PR #374.
…nistic output Review follow-ups for PR #374 addressing correctness and determinism gaps found in deep-review: - Previously x-medkit.contributors was emitted only on GET /components/{id} because the Component detail handler builds x-medkit manually via the XMedkit builder. Area/App/Function detail handlers silently dropped the field despite the models carrying it. Introduce XMedkit::contributors() helper and call it from all four detail handlers so list and detail endpoints expose provenance consistently. - Classification output is now deterministic: leaf_warnings are sorted by entity_id, each warning's peer_names list is sorted alphabetically, and per-peer claim iteration is driven by a sorted copy of the unordered_set. Operator logs and /health.warnings no longer flap between runs or after hashmap rehashes. - Contributors are presented in stable order ("local" first when present, then peer:<name> entries alphabetically) via a sorted_contributors() helper in discovery/models/common.hpp. Called from every entity to_json() and from XMedkit::contributors(), so list and detail responses agree regardless of how peers were merged. - Introduce warning_codes.hpp with WARN_LEAF_ID_COLLISION constant mirroring the error_codes.hpp pattern. Inline string literal in health_handlers.cpp replaced with the constant. - Leaf-collision warning message now names the colliding peers and includes a concrete remediation hint (rename on one side, or model as hierarchical parent) instead of only describing the symptom. - Add aggregation capability flag to handle_root so clients can feature-detect that /health.warnings and x-medkit.contributors may be present. /health.warnings is always an array when aggregation is enabled (empty when no anomalies), so clients avoid tri-state absent/empty/populated branching. - OpenAPI health_schema() now declares top-level peers and warnings fields (with structured warning object schema for code/message/ entity_ids/peer_names). Previously these were emitted but absent from the spec - breaking client generators. Tests: 1980 unit + daisy-chain integration still green. Deeper integration coverage (leaf collision path, capability assertion, contributors on non-Component entities) follows in a separate commit. Part of PR #374.
… refresh docs Integration test additions (daisy chain): - capability flag assertions on primary (aggregation: true) and on peer_c (aggregation: false) - contributors assertion on a leaf Component detail (ordering starts with 'local' per the new stable-order rule) - contributors assertion on an App detail endpoint (regression guard for the handler-level fix that went into the previous commit) - addClassCleanup registers os.unlink for each tmp manifest so CI no longer leaks three YAML files per run Requirement traceability: - @verifies REQ_INTEROP_003 on all 6 AggregationClassification tests, on the 4 new EntityMerger contributors tests, and on the 4 main daisy-chain integration assertions (docstring-based for Python so launch_testing introspection preserves them) Design doc refresh: - Clarify the Provenance section: EntityMerger appends contributors unconditionally on Component collisions - classification happens later and only rewrites the routing table - Document contributors stable ordering (local first, then peer:*) - Expand Health Monitoring subsection with /health.warnings and peers shapes, cross-link to classification, point operators at warning_codes.hpp - Add classify_component_routing to the Key Classes section Part of PR #374.
The EntityDetail OpenAPI component is shared by Area, Component, App, and Function detail endpoints. Previously it had no x-medkit field declared at all, which meant generated clients silently dropped the new x-medkit.contributors aggregation provenance array. Add an x-medkit object schema on EntityDetail with an explicit contributors property (typed array of strings, with a description covering the "local" vs "peer:<name>" semantics and the stable sort order). additionalProperties remains true so that existing x-medkit fields not yet typed at the schema level (entityType, namespace, source, etc.) continue to round-trip cleanly. Part of PR #374.
When a peer component collides with a local one (e.g. hybrid mode creates a synthetic component that matches a peer's real component), the merged entity was not added to the routing table. This caused sub-resource requests (logs, hosts, data) to be handled locally instead of forwarded to the peer, returning empty results. Add routing_table_[merged.id] = peer_name_ in the collision branch of merge_components so that validate_entity_for_route forwards requests to the owning peer.
Replace the test that asserted merged Components stay out of the routing table (which encoded the prior incorrect "shared ownership" model) with one that asserts the opposite: merged Components must route to the peer that owns the ECU's runtime state. Add a dedicated test for the hybrid-mode synthetic-collision scenario that motivates the fix. Also refine the inline comment in merge_components to spell out the ECU-ownership rationale instead of describing symptoms.
Update the merge-rules summary, the merge-logic PlantUML diagram, and the routing-table description to reflect that a Component ID identifies a single physical ECU. On collision the peer is always the authoritative owner, so every request for the merged Component - including the detail endpoint and all sub-resources - is forwarded to that peer. The primary's local view is limited to aggregating the Component's presence in discovery listings.
…e_interface doc comment
…ification Add test_aggregation_classification with 6 failing tests that codify the target semantics before the implementation exists (TDD red phase): - leaf_collision_keeps_routing_to_peer - hierarchical_parent_drops_routing_when_local_subcomponents_exist - hierarchical_parent_drops_routing_when_remote_subcomponents_exist - hierarchical_parent_drops_routing_across_multiple_peers - subcomponents_of_hierarchical_parent_still_route_to_peer - leaf_collision_across_multiple_peers_emits_warning The tests call a pure classify_component_routing() free function that takes the fully merged Component vector plus per-peer entity claims and returns a filtered routing table (hierarchical parents removed) together with LeafCollisionWarning entries for leaf IDs claimed by >1 peer. Build will fail on missing classification.hpp - intentional; the implementation commit introduces the header and source. Part of PR #374.
Introduce a pure free function that takes the fully-merged Component vector plus per-peer Component claims and returns a routing table with hierarchical parents removed plus a list of multi-peer leaf-collision warnings. Hierarchical parent = any Component referenced as parent_component_id by another Component in the set. Leaves stay routed to their peer (last-writer-wins for routing on collision; warning enumerates every claiming peer). Fixes the 6 red tests added in the previous commit. Part of PR #374.
…nager Track per-peer Component IDs during fetch_and_merge_peer_entities and run classify_component_routing once the merged Component set is final. Hierarchical-parent routing entries are stripped (served locally with the merged view, like Areas/Functions); leaves keep routing to their peer. Multi-peer leaf collisions surface as RCLCPP_WARN log entries listing every claiming peer, and the structured warnings are returned on MergedPeerResult.leaf_warnings for downstream consumers (e.g. the upcoming /health.warnings field). Existing unit (1976) and peer-aggregation integration (82) suites pass unchanged - the no-collision and apps-only paths still match the prior behaviour because classify_component_routing is a no-op for those. Part of PR #374.
…warnings When more than one peer announces the same leaf Component ID, AggregationManager stores the collision as a structured warning. The /health endpoint now surfaces these via a top-level warnings array (x-medkit extension on our own endpoint, not SOVD core). Empty array when no warnings are active. Operators can pick up deployment anomalies without tailing logs; routing still falls back to last-writer-wins. Each warning carries a machine-readable code, human message, the affected entity_ids (currently a single-element array for forward compatibility), and the full list of claiming peer_names. Part of PR #374.
Single-string source cannot express that an entity was contributed by
both the local gateway and one or more peers. Add a contributors field
on Area, Component, App, and Function models, serialised in each
entity's x-medkit block (only when non-empty, so the field is a pure
additive extension for existing clients). Single-origin entities get a
one-element list for consistency.
Population:
- AggregationManager seeds every local entity with ["local"] before the
per-peer merge loop.
- EntityMerger appends "peer:<name>" on collisions (Areas, Components,
Functions) and on remote-only additions; deduplication protects
against repeated merges with the same peer.
- App collision-prefixed entities receive only the peer contributor
since the prefixed ID is a distinct entity.
Daisy-chain note: each hop surfaces only its direct upstream
("peer:<name>") - a primary aggregating peer_B sees peer_B's entities
as contributed by peer_B regardless of whether peer_B itself merged
them from peer_C. This is intentional and can be extended later.
Part of PR #374.
…enance Refresh the aggregation design document for the extended merge model: - Expand the Components merge-rules entry to distinguish hierarchical parents (served locally with merged view, like Areas) from leaf ECUs (routed to the owning peer on collision), with the classification happening post-merge so sub-components arriving from any peer unlock parent behaviour. - Describe multi-peer leaf-collision warnings, last-writer-wins routing, and the /health.warnings surface. - Update the routing-table description to match the new taxonomy. - Add a new section describing x-medkit.contributors provenance, how it is populated, and how it behaves in daisy-chain topologies. - Add a post-merge Component classification activity diagram alongside the per-peer merge rules diagram; the per-peer diagram now shows the "provisional routing entry" and contributor-append steps. Part of PR #374.
…egation Launch three gateways in separate DDS domains (primary -> peer_B -> peer_C), each with a manifest that declares a shared hierarchical parent 'robot-x' plus a single leaf ECU sub-component (ecu-a / ecu-b / ecu-c). Manifests are written to tmp files at import time so launch descriptions reference stable paths. Validates end-to-end: - Primary sees all three ECU sub-components under robot-x, including ecu-c which arrives via transitive aggregation through peer_B. - GET /components/robot-x returns a locally-served response with x-medkit.contributors listing "local" and "peer:peer_b" (each hop surfaces only its direct upstream; peer_c's contribution is opaque to the primary). - GET /components/ecu-b succeeds via 1-hop forwarding to peer_B. - GET /components/ecu-c succeeds via 2-hop forwarding primary -> peer_B -> peer_C. - GET /health on primary returns a peers list plus an empty warnings array (no leaf-id collisions in this topology). Also surfaces Component contributors in the detail handler's x-medkit block, which was missing because handle_get_component builds its x-medkit via XMedkit rather than the Component::to_json() path used by list/reference endpoints. Part of PR #374.
…nistic output Review follow-ups for PR #374 addressing correctness and determinism gaps found in deep-review: - Previously x-medkit.contributors was emitted only on GET /components/{id} because the Component detail handler builds x-medkit manually via the XMedkit builder. Area/App/Function detail handlers silently dropped the field despite the models carrying it. Introduce XMedkit::contributors() helper and call it from all four detail handlers so list and detail endpoints expose provenance consistently. - Classification output is now deterministic: leaf_warnings are sorted by entity_id, each warning's peer_names list is sorted alphabetically, and per-peer claim iteration is driven by a sorted copy of the unordered_set. Operator logs and /health.warnings no longer flap between runs or after hashmap rehashes. - Contributors are presented in stable order ("local" first when present, then peer:<name> entries alphabetically) via a sorted_contributors() helper in discovery/models/common.hpp. Called from every entity to_json() and from XMedkit::contributors(), so list and detail responses agree regardless of how peers were merged. - Introduce warning_codes.hpp with WARN_LEAF_ID_COLLISION constant mirroring the error_codes.hpp pattern. Inline string literal in health_handlers.cpp replaced with the constant. - Leaf-collision warning message now names the colliding peers and includes a concrete remediation hint (rename on one side, or model as hierarchical parent) instead of only describing the symptom. - Add aggregation capability flag to handle_root so clients can feature-detect that /health.warnings and x-medkit.contributors may be present. /health.warnings is always an array when aggregation is enabled (empty when no anomalies), so clients avoid tri-state absent/empty/populated branching. - OpenAPI health_schema() now declares top-level peers and warnings fields (with structured warning object schema for code/message/ entity_ids/peer_names). Previously these were emitted but absent from the spec - breaking client generators. Tests: 1980 unit + daisy-chain integration still green. Deeper integration coverage (leaf collision path, capability assertion, contributors on non-Component entities) follows in a separate commit. Part of PR #374.
… refresh docs Integration test additions (daisy chain): - capability flag assertions on primary (aggregation: true) and on peer_c (aggregation: false) - contributors assertion on a leaf Component detail (ordering starts with 'local' per the new stable-order rule) - contributors assertion on an App detail endpoint (regression guard for the handler-level fix that went into the previous commit) - addClassCleanup registers os.unlink for each tmp manifest so CI no longer leaks three YAML files per run Requirement traceability: - @verifies REQ_INTEROP_003 on all 6 AggregationClassification tests, on the 4 new EntityMerger contributors tests, and on the 4 main daisy-chain integration assertions (docstring-based for Python so launch_testing introspection preserves them) Design doc refresh: - Clarify the Provenance section: EntityMerger appends contributors unconditionally on Component collisions - classification happens later and only rewrites the routing table - Document contributors stable ordering (local first, then peer:*) - Expand Health Monitoring subsection with /health.warnings and peers shapes, cross-link to classification, point operators at warning_codes.hpp - Add classify_component_routing to the Key Classes section Part of PR #374.
The EntityDetail OpenAPI component is shared by Area, Component, App, and Function detail endpoints. Previously it had no x-medkit field declared at all, which meant generated clients silently dropped the new x-medkit.contributors aggregation provenance array. Add an x-medkit object schema on EntityDetail with an explicit contributors property (typed array of strings, with a description covering the "local" vs "peer:<name>" semantics and the stable sort order). additionalProperties remains true so that existing x-medkit fields not yet typed at the schema level (entityType, namespace, source, etc.) continue to round-trip cleanly. Part of PR #374.
6fecf30 to
0e0aef3
Compare
Seven inline review comments from Copilot, all valid after verifying against the actual diff: - area.hpp relied on sorted_contributors() (defined in common.hpp) from its inline to_json() without including the header. component.hpp and function.hpp were already fine; add the include to area.hpp so every public model header is self-contained regardless of include order. - get_test_domain_id() previously collapsed every non-zero offset into a single derived secondary domain via (base - 140) % 3, so using offsets 2 and 3 in the daisy-chain test produced the SAME DDS domain for peer_b and peer_c (the earlier docstring was also wrong about this). Return 229 + offset so offsets 1..3 map one-to-one to domains 230..232, and reject offsets outside that range with a clear ValueError - the DDS maximum is 232 so silently exceeding it must fail loudly. Pair this with a CTest RESOURCE_LOCK on test_peer_aggregation and test_daisy_chain_aggregation so the two multi-gateway tests never share a secondary domain in parallel runs. - test_leaf_component_detail_contributors_single_peer had a leading comment that claimed the assertion was about contributors equalling ["peer:peer_b"] while the actual check was for "local" first. Rewrite the docstring so it matches the forwarded-response semantics the test actually pins (peer_b's own view serialised verbatim by the primary). - docs/api/warning_codes.rst did not exist, yet two header/source comments pointed at it. Add the page with the stable warning-code table (currently just leaf_id_collision) plus remediation guidance, link it from the API reference index, and point warning_codes.hpp + schema_builder.cpp at a now-valid canonical location. - docs/api/rest.rst did not document the new /health.warnings array or the capabilities.aggregation flag. Add both: root-endpoint capability gets an example and a one-paragraph description; /health grows an aggregation-aware example response showing peers and warnings with a realistic leaf_id_collision payload. Tests: 2015 unit + integration still green locally; Sphinx docs build with -W warnings-as-errors stays clean. Part of PR #374.
…ing RESOURCE_LOCK Two additional valid findings on top of 73d1433: - docs/api/rest.rst described capabilities.aggregation as "true when the gateway has one or more aggregation peers configured", but the implementation gates it on AggregationManager presence (i.e. the subsystem is enabled regardless of peer count). A gateway with aggregation.enabled=true and zero peers still sets the flag to true and still emits peers/warnings on /health. Rephrase the docs so the semantics match the code and callers understand that peers may be empty while the flag is still true. - test_cross_ecu_fanout also calls get_test_domain_id(1) and therefore claims domain 230, same as test_peer_aggregation. The CTest RESOURCE_LOCK medkit_secondary_dds_domains that 73d1433 added only covered the two multi-gateway tests I remembered; add test_cross_ecu_fanout to the same lock so it is serialised against the rest and cannot race for the shared domain under parallel CTest runs. Comment updated to remind future additions to list every test that calls get_test_domain_id with a non-zero offset here. Part of PR #374.
| set_tests_properties( | ||
| test_peer_aggregation | ||
| test_cross_ecu_fanout | ||
| test_daisy_chain_aggregation | ||
| PROPERTIES RESOURCE_LOCK medkit_secondary_dds_domains | ||
| ) |
There was a problem hiding this comment.
The RESOURCE_LOCK is likely not being applied because the integration test targets are named from get_filename_component(... NAME_WE) on *.test.py files, which yields names like test_peer_aggregation.test, not test_peer_aggregation. As written, set_tests_properties(test_peer_aggregation ...) won’t match any test, so multi-gateway tests can still run concurrently and collide on the shared secondary DDS domains (230–232). Update the list to use the actual CTest target names (e.g. test_peer_aggregation.test, test_cross_ecu_fanout.test, test_daisy_chain_aggregation.test) or change the test_name derivation to strip both .test and .py consistently (and then use those names here).
| // Stable presentation order: "local" first (when present), then "peer:<name>" | ||
| // entries alphabetically. Lets UI badges and snapshot tests rely on a | ||
| // consistent order regardless of how peers were merged internally. | ||
| std::vector<std::string> sorted(contributors.begin(), contributors.end()); | ||
| std::sort(sorted.begin(), sorted.end(), [](const std::string & a, const std::string & b) { | ||
| const bool a_local = (a == "local"); | ||
| const bool b_local = (b == "local"); | ||
| if (a_local != b_local) { | ||
| return a_local; | ||
| } | ||
| return a < b; | ||
| }); |
There was a problem hiding this comment.
XMedkit::contributors() re-implements the same ordering logic as sorted_contributors() in discovery/models/common.hpp. Since list responses use sorted_contributors() while detail handlers use XMedkit::contributors(), keeping two independent implementations risks the ordering drifting and producing inconsistent API output over time. Consider reusing a single helper/comparator for both paths (e.g. have XMedkit::contributors() call sorted_contributors() or move the sort routine to a shared utility header used by both).
| std::unordered_set<std::string> hierarchical_parents; | ||
| for (const auto & comp : merged_components) { | ||
| if (!comp.parent_component_id.empty()) { | ||
| hierarchical_parents.insert(comp.parent_component_id); |
There was a problem hiding this comment.
Major - no validation on parent_component_id. This insert fires unconditionally for any non-empty parent ID. Three malformed inputs produce silent misbehavior (round-2 traced end-to-end):
-
Nonexistent parent (
child.parent_component_id = "ghost"): "ghost" lands inhierarchical_parentsset, gets excluded fromrouting_table(line 45-47). No Component object exists. ClientGET /components/ghost→ 404 viavalidate_entity_for_route(handler_context.cpp:254-267). The child itself is correctly routed, but the parent reference is a dangling pointer in the response. -
Cycle A↔B (A.parent = B, B.parent = A): both IDs enter
hierarchical_parents, neither entersrouting_table. Both served from local merged cache (both exist there), never forwarded to their owning peer. Semantically wrong: if peer_b owns B, client reading/components/Bhits the local stub instead of the real peer data. -
Self-parent (A.parent = A): A inserted as its own hierarchical parent, excluded from routing, served locally. Breaks the invariant that hierarchical parents aggregate their children.
Pre-classification sanitization also absent at peer_client.cpp:180 (JSON deserialization) and entity_merger.cpp. Zero guard anywhere.
Suggested fix:
std::unordered_set<std::string> known_ids;
for (const auto & comp : merged_components) known_ids.insert(comp.id);
for (const auto & comp : merged_components) {
const auto & p = comp.parent_component_id;
if (p.empty()) continue;
if (p == comp.id) {
RCLCPP_WARN(..., "Component '%s' declares itself as parent, ignoring", p.c_str());
continue;
}
if (known_ids.count(p) == 0u) {
RCLCPP_WARN(..., "Component '%s' parent_component_id='%s' not in merged set", comp.id.c_str(), p.c_str());
continue;
}
hierarchical_parents.insert(p);
}
// BFS/DFS cycle detection pass here.Plus unit tests: parent_references_nonexistent_component, two_way_parent_cycle, self_parent.
| // Per-peer Component IDs each peer claimed, fed into classify_component_routing | ||
| // after the loop so that hierarchical-vs-leaf classification sees the full | ||
| // merged Component set (including subcomponents from other peers). | ||
| std::vector<PeerClaim> peer_component_claims; |
There was a problem hiding this comment.
Major - routing destination flips silently across merge snapshots under peer health churn.
peer_component_claims is built by iterating futures (line 422 context below) in insertion order of the healthy-peer snapshot at line 371-374. is_healthy() filter silently drops failing peers. The "last writer" in classification.cpp:46 (result.routing_table[id] = pc.peer_name;) is whichever peer sits last in this vector.
Concrete scenario (round-2 traced):
- Merge N: peers
[peer_b, peer_c]both claimecu-shared. Last writer = peer_c./health.warningssurfaces both claimants. Routing destination = peer_c. - Merge N+1: peer_c fails health check, dropped from snapshot.
peer_component_claims = [peer_b]. Last writer = peer_b. Only 1 claimant = no collision = no warning. Routing destination silently flipped from peer_c to peer_b. - Operator sees /health go from "collision warning present" to "warnings: []" with no signal that routing destination changed.
This is worse than pure nondeterminism: the warning mechanism itself can hide the flip.
Three options:
- Sticky routing: remember last-chosen peer across merges, only flip if that peer becomes unhealthy.
- Deterministic ordering:
std::sort(peer_component_claims.begin(), ..., by peer_name)before classification so the "last writer" is reproducible. - Emit routing-change event: new warning code
WARN_ROUTING_DESTINATION_CHANGEDwhenever a routing entry for a givenentity_iddiffers from the previous merge's entry.
At minimum, update design/aggregation.rst:291 to call out cross-snapshot instability and the warning-disappearance edge case so operators can plan around it.
| {"code", WARN_LEAF_ID_COLLISION}, | ||
| {"message", std::move(message)}, | ||
| {"entity_ids", json::array({w.entity_id})}, | ||
| {"peer_names", w.peer_names}, |
There was a problem hiding this comment.
Minor - peer names leak on unauthenticated endpoint. Round-2 verified: auth.enabled defaults to false at gateway_node.cpp:169. Anyone reachable to /health enumerates every peer name in the deployment. Even with auth enabled, /health is in VIEWER/OPERATOR/CONFIGURATOR roles so any authenticated user sees the names.
Fine in controlled LAN deployments, leaks topology in shared-infra / multi-tenant setups. Since design intent is "operator-visible", that's OK - but please document explicitly in docs/api/rest.rst so operators know peer names on this endpoint are public-by-default.
If you want to gate it: wrap the warnings array in the same auth config that gates admin-adjacent info, while leaving the top-level status/uptime unauthenticated for classical health probes.
| /// Resolve by renaming the Component on one side or by modelling it as a | ||
| /// hierarchical parent (declare a child Component with | ||
| /// ``parentComponentId`` pointing at the colliding ID on the owning peer). | ||
| inline constexpr const char * WARN_LEAF_ID_COLLISION = "leaf_id_collision"; |
There was a problem hiding this comment.
Minor - no schema version for the warnings array. Codes are stable strings (good), but when the second code ships, clients have no contract to feature-detect "supports new_code_X" without string-matching. capabilities.aggregation is a boolean — too coarse.
Suggested: add response["warning_schema_version"] = 1; in health_handlers.cpp, bump on every new code added, document in warning_codes.rst. Typed clients (MCP, Web UI, Foxglove) can then make the "known codes" set explicit and log+degrade on unknown codes per version.
|
|
||
| // Routing exists - last-writer-wins (peer_c processed last here). | ||
| ASSERT_EQ(result.routing_table.count("ecu-shared"), 1u); | ||
| EXPECT_EQ(result.routing_table.at("ecu-shared"), "peer_c"); |
There was a problem hiding this comment.
Minor - collision test is unidirectional. Input is hardcoded [peer_b, peer_c] at line 184-185, asserted routing destination is peer_c (last). Flipping input to [peer_c, peer_b] should route to peer_b. That flip case is not tested, so a regression that breaks last-writer semantics (e.g., accidentally sorting claims by peer_name before classification) wouldn't be caught.
Add a parametrized or second test:
TEST(AggregationClassification, leaf_collision_routing_respects_input_order) {
// [peer_c, peer_b] → routing must go to peer_b
...
}Also missing from this file (round-2 confirmed all 6 test names): two_way_parent_cycle, self_parent_ignored, parent_references_nonexistent_component, empty_merged_set, four_level_nesting. See companion Major comment on classification.cpp:29.
| # Sub-components (with parent_component_id) appear under the parent's | ||
| # /subcomponents endpoint, not the top-level /components list. | ||
| cls._wait_for_subcomponents( | ||
| PRIMARY_URL, 'robot-x', {'ecu-a', 'ecu-b', 'ecu-c'}, 'primary', |
There was a problem hiding this comment.
Minor - no end-to-end negative collision scenario in daisy-chain tests. All 10 test methods are positive paths. robot-x collision is intentional hierarchical merge (not an error case), and line 275 asserts body['warnings'] == [] for healthy chain.
Recommend a dedicated test: configure peer_b AND peer_c to both claim a new leaf ecu-shared (no parent_component_id), verify from primary:
/components/ecu-sharedreturns 200 and routes somewhere (either peer_b or peer_c)./health.warningscontains aleaf_id_collisionentry listing bothpeer_bandpeer_cunderpeer_names.x-medkit.contributorsfor the routed response reflects the resolving peer (not both).
This exercises the actual wire-level warning path under the real daisy-chain topology that unit tests can't simulate. Also covers the nondeterminism concern on aggregation_manager.cpp:419 via a follow-up test that kills one of the colliding peers and re-asserts warnings disappearance + routing flip (per the Major on that line).
| } | ||
|
|
||
| // @verifies REQ_INTEROP_003 | ||
| TEST(EntityMerger, contributors_no_duplicate_on_repeat_merge) { |
There was a problem hiding this comment.
Minor - contributor presentation order not asserted on the user-visible path. This test verifies dedup in the internal contributors_ vector, which is correct. But clients actually see the output of XMedkit::contributors() in x_medkit.cpp:110-128, which sorts ["local", peers alphabetical].
Suggest one additional test that exercises the sorted output: inputs arriving in reverse order (e.g., ["peer:zulu", "peer:alpha", "local"]) should serialize as ["local", "peer:alpha", "peer:zulu"]. Otherwise a regression flipping the sort direction wouldn't be caught by unit tests — only by an integration test that happens to inspect the exact contributor order.
Summary
Make the aggregation layer treat Components with the same symmetry as Areas, so collision-merged peer Components work correctly for every deployment shape the project supports (single peer, multi-peer, hierarchical parent, daisy chain).
EntityMerger::merge_componentsnow records a provisional routing entry on collision. Every Component ID refers to at most one ECU when it is a leaf, and the peer owns the runtime state.classify_component_routingfree function (aggregation/classification.hpp) runs after all peers have been merged. A Component is classified as a hierarchical parent if any other Component in the merged set references it viaparent_component_id; those are removed from the routing table and served locally from the merged cache. Leaves stay routed to the owning peer.LeafCollisionWarningobjects.AggregationManagersurfaces them onMergedPeerResult.leaf_warnings, writes an RCLCPP_WARN line listing every claimant, and exposes the list via a new/health.warningsarray. Routing falls back to last-writer-wins because rejection does not fix a deployment anomaly.x-medkit.contributorslist populated during merge (localseeded before the peer loop,peer:<name>appended byEntityMergerwith de-duplication). Clients can distinguish local-only, peer-only, and merged views without relying on the single-valuedsourcefield. In daisy-chain topologies each hop surfaces only its direct upstream by design.sovd_service_interface.hpp.contributorsprovenance, with an updated PlantUML merge diagram plus a new post-merge classification diagram.SOVD API impact
x-medkit.contributorsfield on entity responses and a new optionalwarningsarray on/health. Both are emitted only when non-empty / when aggregation is active, so existing clients ignoring unknown fields are unaffected.Issue
Type
Testing
All local runs green:
./scripts/test.sh unit --packages-select ros2_medkit_gateway- 1976 tests, 0 failures. Includes:AggregationClassificationtests covering leaf vs hierarchical, single/multi-peer, subcomponents, and warning emission.EntityMergercontributor tests (local/peer append, dedup, app-prefix-only-peer).merged_components_route_to_peerplus the synthetic-collision test from the previous iteration../scripts/test.sh test_peer_aggregation --packages-select ros2_medkit_integration_tests- prior integration coverage still green, 82 tests../scripts/test.sh test_daisy_chain_aggregation --packages-select ros2_medkit_integration_tests- new integration test with 3 gateways in isolated DDS domains (primary -> peer_B -> peer_C). Verifies hierarchical parent served locally, 1-hop and 2-hop leaf forwarding, contributors, and empty /health.warnings../scripts/test.sh lint --packages-select ros2_medkit_gateway- clean.Checklist