refactor: unify duplicate DAG construction (dag.py + ExecutionGraph)#511
refactor: unify duplicate DAG construction (dag.py + ExecutionGraph)#511przemekboruta wants to merge 5 commits intoNVIDIA-NeMo:mainfrom
Conversation
Greptile SummaryThis PR unifies the duplicate DAG construction logic that existed in
|
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py | Adds module-level topologically_sort_column_configs using Kahn's algorithm with set-based deduplication and O(1) side-effect resolution; logic is correct and consistent with ExecutionGraph.get_topological_order. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dag.py | Deleted — functionality migrated to execution_graph.py; no residual references remain. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/config_compiler.py | Import updated from dag to execution_graph; no other changes. |
| packages/data-designer-engine/tests/engine/dataset_builders/utils/test_topological_sort.py | Renamed from test_dag.py, imports updated, ordering assertion correctly relaxed for unordered siblings, and three new tests added covering duplicate producers, side-effect ordering, and skip.when ordering. |
Sequence Diagram
sequenceDiagram
participant CC as config_compiler
participant TSC as topologically_sort_column_configs
participant EG as ExecutionGraph
CC->>TSC: column_configs list
TSC->>TSC: partition non_dag_cols / dag_col_dict
TSC->>TSC: build side_effect_map O(1)
TSC->>TSC: add edges for required_columns
TSC->>TSC: add edges for skip.columns
TSC->>TSC: Kahns algorithm produces order
TSC-->>CC: sorted column configs
CC->>EG: ExecutionGraph.create(sorted_configs, strategies)
Reviews (6): Last reviewed commit: "fix(tests): move SkipConfig import to mo..." | Re-trigger Greptile
b25ebf6 to
775f147
Compare
|
@przemekboruta there's a larger PR in flight that touches these DAG abstractions. Let's wait until that merges before this one can re-base and merge! |
…ution_graph.py Eliminates dag.py and its networkx dependency by moving topologically_sort_column_configs into execution_graph.py as a module-level function. Side-effect resolution is now O(1) via a side_effect_map dict (previously O(n²) linear scan). Kahn's algorithm is reused in-place rather than leaning on networkx.topological_sort. Closes NVIDIA-NeMo#510 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
test_judge and test_code_and_depends_on_validation_reasoning_traces have no mutual dependency and reach in-degree 0 simultaneously in Kahn's algorithm. Set iteration order varies with PYTHONHASHSEED, making the strict list assertion flaky. Assert only the topological invariants. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
520fbf2 to
eddf3a3
Compare
…ally_sort_column_configs ExecutionGraph.create handles skip.when ordering edges in its own two-pass build; the pre-sort function only needs required_columns to produce a valid ColumnConfigT ordering for config compilation.
|
Issue #510 has been triaged. The linked issue check is being re-evaluated. |
SummaryPR #511 consolidates two separate DAG construction implementations by deleting Files changed: 4 (1 deleted, 2 modified source files, 1 modified test file) Findings1. Behavioral change:
|
|
coverage nit: the side-effect resolution path in |
|
nit: now that |
nit: missing |
|
hey @przemekboruta, thanks for taking this on - really nice cleanup. The networkx removal and O(1) side-effect resolution are solid improvements, and the code is well-documented. I left a few comments inline. The main one worth looking at is the Great work overall, happy to help if you have questions on any of the comments. |
…configs The sync builder executes generators in compile-time sort order (from _column_configs, populated via this function), not ExecutionGraph order. Dropping skip.columns edges caused evaluate_skip_when to hit UndefinedError when the referenced column hadn't been generated yet, silently skipping rows. Also: - Refactor edge building into _add_edge() helper with a label parameter to distinguish "required" from "skip.when" edges in debug output - Rename test_dag.py -> test_topological_sort.py to match the new module location - Add from __future__ import annotations (required by AGENTS.md) - Add test_side_effect_column_ordering covering the side_effect_map.get() path - Add test_skip_when_column_ordering covering the skip.columns edge path
Summary
Closes #510
dag.pyand movestopologically_sort_column_configsintoexecution_graph.pyas a module-level functionnetworkx.topological_sortwith an inline Kahn's algorithm, consistent withExecutionGraph.get_topological_orderside_effect_mapdict — the previous implementation did a linear scan oversum(side_effect_dict.values(), [])which was O(n²)skip.columnsdependency edges: the sync builder executes generators in compile-time sort order (notExecutionGraphorder), so these edges are required to guarantee that skip conditions are evaluated after their referenced columns are generated_add_edge(name, dep, label)helper that logs[required]vs[skip.when]in debug outputtest_dag.py→test_topological_sort.pyto match the new module locationconfig_compiler.pyandtest_topological_sort.pyDesign note
The function is intentionally a module-level function, not a
@classmethodonExecutionGraph.ExecutionGraphis an execution abstraction (requires strategies, manages task scheduling); this function is a compilation step that works on rawColumnConfigTwithout strategies. Mixing the two responsibilities would require either dummy strategies or a significant signature change toadd_column.Test plan
test_dag_constructionandtest_circular_dependenciescover the migrated function with updated import pathtest_duplicate_side_effect_producers_raisesverifies that two columns declaring the same side-effect column name raiseConfigCompilationErrortest_side_effect_column_orderingcovers theside_effect_map.get()resolution path — a column depending on a side-effect column is sorted after its producertest_skip_when_column_orderingverifies thatskip.columnsedges produce the correct ordering when the skip condition references a column not inrequired_columnsruff checkpasses on all modified files