Skip to content

refactor: unify duplicate DAG construction (dag.py + ExecutionGraph)#511

Open
przemekboruta wants to merge 5 commits intoNVIDIA-NeMo:mainfrom
przemekboruta:refactor/unify-dag-construction
Open

refactor: unify duplicate DAG construction (dag.py + ExecutionGraph)#511
przemekboruta wants to merge 5 commits intoNVIDIA-NeMo:mainfrom
przemekboruta:refactor/unify-dag-construction

Conversation

@przemekboruta
Copy link
Copy Markdown
Contributor

@przemekboruta przemekboruta commented Apr 8, 2026

Summary

Closes #510

  • Deletes dag.py and moves topologically_sort_column_configs into execution_graph.py as a module-level function
  • Replaces networkx.topological_sort with an inline Kahn's algorithm, consistent with ExecutionGraph.get_topological_order
  • Side-effect resolution is now O(1) via a side_effect_map dict — the previous implementation did a linear scan over sum(side_effect_dict.values(), []) which was O(n²)
  • Restores skip.columns dependency edges: the sync builder executes generators in compile-time sort order (not ExecutionGraph order), so these edges are required to guarantee that skip conditions are evaluated after their referenced columns are generated
  • Refactors edge building into a _add_edge(name, dep, label) helper that logs [required] vs [skip.when] in debug output
  • Renames test_dag.pytest_topological_sort.py to match the new module location
  • Updates imports in config_compiler.py and test_topological_sort.py

Design note

The function is intentionally a module-level function, not a @classmethod on ExecutionGraph. ExecutionGraph is an execution abstraction (requires strategies, manages task scheduling); this function is a compilation step that works on raw ColumnConfigT without strategies. Mixing the two responsibilities would require either dummy strategies or a significant signature change to add_column.

Test plan

  • test_dag_construction and test_circular_dependencies cover the migrated function with updated import path
  • test_duplicate_side_effect_producers_raises verifies that two columns declaring the same side-effect column name raise ConfigCompilationError
  • test_side_effect_column_ordering covers the side_effect_map.get() resolution path — a column depending on a side-effect column is sorted after its producer
  • test_skip_when_column_ordering verifies that skip.columns edges produce the correct ordering when the skip condition references a column not in required_columns
  • ruff check passes on all modified files

@przemekboruta przemekboruta requested a review from a team as a code owner April 8, 2026 16:47
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR unifies the duplicate DAG construction logic that existed in dag.py and ExecutionGraph by deleting dag.py and migrating topologically_sort_column_configs into execution_graph.py as a module-level function. The new implementation replaces networkx with an inline Kahn's algorithm (consistent with ExecutionGraph.get_topological_order) and switches side-effect resolution from a O(n²) linear scan to an O(1) dict lookup. The test file is renamed and extended with three new targeted test cases.

Confidence Score: 5/5

Safe to merge — the refactoring is correct, all edge types are preserved, and the new tests cover the key scenarios.

No P0/P1 findings. The Kahn's algorithm is implemented correctly with proper set-based deduplication, O(1) side-effect resolution, and symmetric upstream/downstream bookkeeping. Cycle detection, self-loop handling, and skip.when ordering all behave identically to the removed dag.py. Tests are meaningful and cover new paths.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py Adds module-level topologically_sort_column_configs using Kahn's algorithm with set-based deduplication and O(1) side-effect resolution; logic is correct and consistent with ExecutionGraph.get_topological_order.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dag.py Deleted — functionality migrated to execution_graph.py; no residual references remain.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/config_compiler.py Import updated from dag to execution_graph; no other changes.
packages/data-designer-engine/tests/engine/dataset_builders/utils/test_topological_sort.py Renamed from test_dag.py, imports updated, ordering assertion correctly relaxed for unordered siblings, and three new tests added covering duplicate producers, side-effect ordering, and skip.when ordering.

Sequence Diagram

sequenceDiagram
    participant CC as config_compiler
    participant TSC as topologically_sort_column_configs
    participant EG as ExecutionGraph

    CC->>TSC: column_configs list
    TSC->>TSC: partition non_dag_cols / dag_col_dict
    TSC->>TSC: build side_effect_map O(1)
    TSC->>TSC: add edges for required_columns
    TSC->>TSC: add edges for skip.columns
    TSC->>TSC: Kahns algorithm produces order
    TSC-->>CC: sorted column configs
    CC->>EG: ExecutionGraph.create(sorted_configs, strategies)
Loading

Reviews (6): Last reviewed commit: "fix(tests): move SkipConfig import to mo..." | Re-trigger Greptile

@przemekboruta przemekboruta force-pushed the refactor/unify-dag-construction branch 2 times, most recently from b25ebf6 to 775f147 Compare April 8, 2026 16:52
@nabinchha
Copy link
Copy Markdown
Contributor

@przemekboruta there's a larger PR in flight that touches these DAG abstractions. Let's wait until that merges before this one can re-base and merge!

przemekboruta and others added 2 commits April 15, 2026 21:37
…ution_graph.py

Eliminates dag.py and its networkx dependency by moving
topologically_sort_column_configs into execution_graph.py as a
module-level function. Side-effect resolution is now O(1) via a
side_effect_map dict (previously O(n²) linear scan). Kahn's algorithm
is reused in-place rather than leaning on networkx.topological_sort.

Closes NVIDIA-NeMo#510

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test_judge and test_code_and_depends_on_validation_reasoning_traces have
no mutual dependency and reach in-degree 0 simultaneously in Kahn's
algorithm. Set iteration order varies with PYTHONHASHSEED, making the
strict list assertion flaky. Assert only the topological invariants.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@przemekboruta przemekboruta force-pushed the refactor/unify-dag-construction branch from 520fbf2 to eddf3a3 Compare April 15, 2026 19:37
…ally_sort_column_configs

ExecutionGraph.create handles skip.when ordering edges in its own
two-pass build; the pre-sort function only needs required_columns
to produce a valid ColumnConfigT ordering for config compilation.
@github-actions
Copy link
Copy Markdown
Contributor

Issue #510 has been triaged. The linked issue check is being re-evaluated.

@andreatgretel andreatgretel added the agent-review Trigger agentic CI review label Apr 16, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Summary

PR #511 consolidates two separate DAG construction implementations by deleting dag.py and moving topologically_sort_column_configs into execution_graph.py as a module-level function. The networkx dependency is replaced with an inline Kahn's algorithm (consistent with ExecutionGraph.get_topological_order), and side-effect resolution is improved from O(n²) linear scan to O(1) dict lookup via side_effect_map. Import paths are updated in config_compiler.py and test_dag.py. Test assertions are relaxed to account for non-deterministic ordering of independent nodes.

Files changed: 4 (1 deleted, 2 modified source files, 1 modified test file)
Net lines: -23 (76 additions, 99 deletions)


Findings

1. Behavioral change: skip.columns edges omitted (Medium)

File: execution_graph.py:366-369

The old dag.py added dependency edges for col.skip.columns; the new implementation intentionally omits them. Commit 3/3 adds a comment explaining that ExecutionGraph.create handles skip ordering in its own two-pass build and this function only needs to produce a valid ColumnConfigT ordering for config compilation.

This is a correct and justified change — topologically_sort_column_configs is only called from config_compiler.py, and the authoritative execution graph (ExecutionGraph.create at line 89-123) already handles skip edges in its second pass. However, there is no test that validates the behavior when skip.columns references create additional ordering constraints. If a future caller relies on this function for execution ordering, the omission could surface as a subtle bug.

Recommendation: Consider adding a brief test case that verifies a config with skip.columns dependencies still compiles correctly end-to-end (through ExecutionGraph.create), to guard against regressions if skip handling changes later.

2. Silent skip of unresolvable dependencies (Low)

File: execution_graph.py:374

The resolve function returns None for columns not in dag_col_dict or side_effect_map, and the caller silently skips them (if resolved is None ... continue). The old code had the same behavior — unknown deps in _add_dependency_edges were simply ignored when not in dag_column_config_dict and not in all_side_effects.

This is acceptable because downstream validation in ExecutionGraph.create (lines 101-104, 115-118) raises ValueError for unresolvable dependencies. The pre-sort function is not the validation boundary. No action needed, just noting for completeness.

3. Test non-determinism fix is correct (Positive)

File: test_dag.py:85-92

The relaxed assertion (set(names[3:5]) instead of strict list equality) correctly reflects that test_judge and test_code_and_depends_on_validation_reasoning_traces have no mutual dependency and their relative order depends on Python's set iteration, which varies by PYTHONHASHSEED. The fixed positions (indices 0-2 and 5) are still strictly asserted where ordering is deterministic. This eliminates flakiness without weakening the test.

4. Clean removal of networkx dependency (Positive)

File: dag.py (deleted)

The deleted file was the only use of lazy.nx (networkx) in this module. The replacement uses collections.deque (already imported in execution_graph.py), so no new dependencies are added. The Kahn's algorithm implementation is consistent with ExecutionGraph.get_topological_order (line 227-258), reducing cognitive overhead for maintainers.

5. Module placement rationale is sound (Positive)

The PR description explains why topologically_sort_column_configs is a module-level function rather than a @classmethod on ExecutionGraph: the function operates on raw ColumnConfigT without strategies, while ExecutionGraph is an execution abstraction requiring strategies and managing task scheduling. This avoids leaking execution concerns into the compilation step.


Verdict

Approve. This is a clean, well-motivated refactoring that eliminates code duplication, removes an unnecessary networkx dependency path, and improves algorithmic complexity for side-effect resolution. The behavioral change around skip.columns edges is intentional, well-documented, and correct given the call site. The test fix for non-deterministic ordering is appropriate. No blocking issues found.

Minor suggestion: a test case exercising skip.columns through the full compilation + ExecutionGraph.create path would strengthen coverage against future regressions.

@github-actions github-actions bot removed the agent-review Trigger agentic CI review label Apr 16, 2026
@andreatgretel
Copy link
Copy Markdown
Contributor

coverage nit: the side-effect resolution path in resolve() (the side_effect_map.get() branch) isn't exercised by any of these tests. test_dag_construction uses test_code__trace but that resolves to the real column test_code directly, not through the side-effect map. a test with a CustomColumnConfig that declares side_effect_columns and an ExpressionColumnConfig that depends on one of them would close the gap.

@andreatgretel
Copy link
Copy Markdown
Contributor

nit: now that dag.py is deleted and the function lives in execution_graph.py, the test file name test_dag.py is a bit orphaned. not blocking, but a rename to test_topological_sort.py (or folding these tests into test_execution_graph.py) would match the new structure.

@andreatgretel
Copy link
Copy Markdown
Contributor

packages/data-designer-engine/tests/engine/dataset_builders/utils/test_dag.py

nit: missing from __future__ import annotations - AGENTS.md requires it in all python files. pre-existing but easy to fix since you're already touching imports here.

@andreatgretel
Copy link
Copy Markdown
Contributor

hey @przemekboruta, thanks for taking this on - really nice cleanup. The networkx removal and O(1) side-effect resolution are solid improvements, and the code is well-documented.

I left a few comments inline. The main one worth looking at is the skip.columns edge omission (execution_graph.py:366) - the sync builder still executes generators in compile-time sort order rather than ExecutionGraph order, so dropping those edges can cause evaluate_skip_when to hit UndefinedError and silently skip rows. The fix should be straightforward (restore the skip.columns loop). The rest are minor nits.

Great work overall, happy to help if you have questions on any of the comments.

…configs

The sync builder executes generators in compile-time sort order (from
_column_configs, populated via this function), not ExecutionGraph order.
Dropping skip.columns edges caused evaluate_skip_when to hit UndefinedError
when the referenced column hadn't been generated yet, silently skipping rows.

Also:
- Refactor edge building into _add_edge() helper with a label parameter to
  distinguish "required" from "skip.when" edges in debug output
- Rename test_dag.py -> test_topological_sort.py to match the new module location
- Add from __future__ import annotations (required by AGENTS.md)
- Add test_side_effect_column_ordering covering the side_effect_map.get() path
- Add test_skip_when_column_ordering covering the skip.columns edge path
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

refactor: unify duplicate DAG construction (dag.py + ExecutionGraph)

3 participants