[SPARK-56510][SQL] Fix ReplaceData DML without metadata attributes not projecting out the operation column #55372

Open
ZiyaZa wants to merge 4 commits into apache:master from ZiyaZa:dml-no-metadata

Conversation

Contributor

@ZiyaZa ZiyaZa commented Apr 16, 2026

What changes were proposed in this pull request?

Previously, all DSv2 tests used an in-memory table that had metadata attributes, so the code path for tables without metadata attributes was never exercised. This PR introduces a new no-metadata property for testing with an InMemoryTable that has no metadata attributes.
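
For context, such a test table might be created roughly like this (a hedged sketch: the catalog name, namespace, and schema are assumptions; only the no-metadata property name comes from this PR):

// Illustrative only: assumes the property is read from the table properties of the test catalog.
spark.sql(
  """CREATE TABLE testcat.ns.tbl (pk INT, salary INT, dep STRING)
    |TBLPROPERTIES ('no-metadata' = 'true')""".stripMargin)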

The previous implementation had a bug for ReplaceData plans: it used DataWritingSparkTask without a projection, so the connector received one extra column (the __row_operation column) in addition to the row data to write. This PR fixes the issue by introducing a new writing task, DataWithProjectionWritingSparkTask, which projects out everything but the row data before handing rows to the connector.
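
A minimal sketch of the idea in plain Scala (not the actual Spark internals; all names and the row layout below are illustrative): the rewrite appends an operation tag to each row, and the writing task must drop that tag before the row reaches the connector.

// Rows are modeled as Seq[Any] with the operation tag in column 0 and the data columns after it,
// mirroring the extra __row_operation column described above.
object OperationProjectionSketch {
  val WriteOperation = 1

  def writeRows(rows: Seq[Seq[Any]], writeToConnector: Seq[Any] => Unit): Unit = {
    rows.foreach { row =>
      row.head match {
        // Project out the operation column so the connector only sees the row data.
        case WriteOperation => writeToConnector(row.tail)
        case other => throw new IllegalStateException(s"unexpected operation: $other")
      }
    }
  }
}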

Additionally, the following changes were made to clean up the code:

  • Renamed WRITE_WITH_METADATA_OPERATION to WRITE_OPERATION and WRITE_OPERATION to WRITE_WITHOUT_METADATA_OPERATION to make the intent clearer. Previously, it was confusing that DataWithProjectionWritingSparkTask used WRITE_WITH_METADATA_OPERATION even though there are no metadata attributes.
  • Created RowLevelWriteExec as a parent of ReplaceDataExec / WriteDeltaExec, which now holds a getMetricValue helper for metric computation.

Why are the changes needed?

To fix a bug where ReplaceData plans on tables without metadata attributes passed the internal operation column through to the connector.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.6

@ZiyaZa ZiyaZa changed the title [SQL] Fix ReplaceData DML without metadata attributes not projecting out the operation column [SPARK-56510][SQL] Fix ReplaceData DML without metadata attributes not projecting out the operation column Apr 16, 2026
Member

@szehon-ho szehon-ho left a comment

Good bug fix for the no-metadata-attributes code path in ReplaceData. The core change (introducing DataWithProjectionWritingSparkTask) is correct and well-targeted — it ensures the __row_operation column is projected out even when there are no metadata attributes. A few suggestions below.

Comment on lines +29 to +30
final val WRITE_WITHOUT_METADATA_OPERATION: Int = 5
final val WRITE_OPERATION: Int = 6
Member

nit: The rename swaps both names AND integer values — WRITE_OPERATION goes from value 5 to 6. Since the values are arbitrary and only consumed by match statements in the same codebase, consider just renaming without swapping the integer values. This would avoid any (unlikely but possible) risk with code that may have hard-coded the integer values.

val operation = row.getInt(0)

operation match {
case WRITE_OPERATION | WRITE_WITHOUT_METADATA_OPERATION =>
Member

nit: Worth a brief comment here explaining why both operation types are handled. This task is only used when there are no metadata attributes, yet WRITE_OPERATION-tagged rows (carryover/update rows in MERGE, all rows in DELETE/UPDATE) still arrive because the rewrite rules tag them with WRITE_OPERATION regardless of whether metadata attributes exist. Without the comment, a reader might wonder why a no-metadata writing task handles WRITE_OPERATION.
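
For instance, the wording could be something along these lines (suggested wording only):

// This task is used when the table has no metadata attributes, but WRITE_OPERATION-tagged
// rows still arrive here: the rewrite rules tag carryover/update rows in MERGE (and all rows
// in DELETE/UPDATE) with WRITE_OPERATION regardless of whether metadata attributes exist.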

package org.apache.spark.sql.connector

class GroupBasedNoMetadataDeleteFromTableSuite extends DeleteFromTableSuiteBase {

Member

suggestion: Six new test files, each ~10 lines of actual code, is significant boilerplate. Consider adding a noMetadata flag to the existing suites and running them with both configurations (e.g., via a shared trait or parameterization). This would avoid class proliferation and keep the test matrix more maintainable.
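
A rough sketch of what that parameterization could look like (purely illustrative ScalaTest shape; the real suites extend Spark's test bases, and the plumbing of the flag into the test catalog is not shown):

import org.scalatest.funsuite.AnyFunSuite

class DeleteFromTableParameterizedSketch extends AnyFunSuite {
  // Run the same assertions against both table flavors instead of adding one class per flavor.
  Seq(false, true).foreach { noMetadata =>
    test(s"delete with predicate (noMetadata=$noMetadata)") {
      // create the InMemoryTable with or without metadata attributes based on the flag,
      // run the DELETE statement, and check the remaining rows
    }
  }
}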

override def build(): Write = new Write with RequiresDistributionAndOrdering {
override def requiredDistribution: Distribution = {
Distributions.clustered(Array(PARTITION_COLUMN_REF))
override def build(): Write = if (noMetadata) {
Member

question: Is it intentional that the noMetadata path bypasses RequiresDistributionAndOrdering? This exercises a different physical plan (no shuffle/sort). If the goal is just to test the no-metadata code path, consider keeping the distribution/ordering requirements so these tests cover the same physical plan shape as the existing suites.
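
If the goal is the same plan shape, the no-metadata branch could still return a distribution-aware Write, along these lines (a sketch mirroring the snippet above; buildWriteWithMetadata is a hypothetical name standing in for the existing branch):

override def build(): Write =
  if (noMetadata) {
    new Write with RequiresDistributionAndOrdering {
      // Keep the same clustering requirement as the metadata path so both test flavors
      // exercise the same shuffle/sort in the physical plan.
      override def requiredDistribution: Distribution =
        Distributions.clustered(Array(PARTITION_COLUMN_REF))
      override def requiredOrdering: Array[SortOrder] = Array.empty[SortOrder]
      // toBatch and the writer factory would stay as in the existing no-metadata path
    }
  } else {
    buildWriteWithMetadata()  // hypothetical helper for the existing metadata-aware Write
  }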

val pk = id.getInt(0)
buffer.deletes += pk
val logEntry = new GenericInternalRow(Array[Any](DELETE, pk, meta.copy(), null))
val metaCopy = if (meta != null) meta.copy() else null
Member

This null guard is needed because DeltaWritingSparkTask passes null for metadata when requiredMetadataAttributes() is empty. However, the DeltaWriter API methods (delete(meta, id), update(meta, id, row), reinsert(meta, row)) don't document that meta can be null. Third-party connectors could hit the same NPE. Consider adding Javadoc on those API methods to clarify the contract.
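
For example, the doc comment on delete could say something like this (suggested wording only):

/**
 * Deletes a row.
 *
 * @param metadata values of the required metadata columns for the row, or null if the table
 *                 reports no required metadata attributes
 * @param id values of the row ID columns for the row to delete
 */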

@aokolnychyi
Contributor

Let me take a look.
