feat: add postpone bucket (bucket=-2) write support for primary-key tables #252

Merged
JingsongLi merged 2 commits into apache:main from JingsongLi:postpone
Apr 16, 2026

@JingsongLi
Contributor

Purpose

Postpone bucket mode writes data in KV format without sorting or deduplication, deferring bucket assignment to background compaction. Files are written to the `bucket-postpone` directory and are invisible to normal reads until compacted.
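A minimal sketch of the directory convention described above. The helper names are hypothetical and are not taken from this PR, and the `bucket-N` naming for regular buckets and the `bucket >= 0` visibility rule are assumptions made for illustration; only `bucket=-2` and `bucket-postpone` come from the description.

```rust
/// Bucket id that marks postpone mode (from the PR title: bucket = -2).
const POSTPONE_BUCKET: i32 = -2;

/// Hypothetical helper: maps a bucket id to its directory name.
/// The `bucket-N` form for regular buckets is an assumption.
fn bucket_dir_name(bucket: i32) -> String {
    if bucket == POSTPONE_BUCKET {
        "bucket-postpone".to_string()
    } else {
        format!("bucket-{bucket}")
    }
}

/// Hypothetical helper: a normal read only scans regular (non-negative)
/// buckets, so postpone files stay invisible until compaction rewrites them.
fn visible_to_normal_reads(bucket: i32) -> bool {
    bucket >= 0
}
```

With these helpers, `bucket_dir_name(POSTPONE_BUCKET)` yields `bucket-postpone`, and `visible_to_normal_reads(POSTPONE_BUCKET)` is `false`.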

Brief change log

Tests

API and Format

Documentation

@jerry-024

I found two behavior-compatibility issues compared with Java's postpone-bucket implementation:

  1. In `crates/paimon/src/table/table_write.rs`, postpone files are always named with `...-s-0-w-...`. This is not compatible with Java. Java assigns a distinct `writeId` per writer and encodes it in the file name; the postpone compaction path later parses that `writeId` and keeps files from the same writer on the same reader so replay order matches writer-local production order. Hard-coding `s-0` removes that writer boundary entirely. Once multiple writers produce files for the same postpone partition, compaction / replay can no longer reconstruct the ordering assumptions used by Java, so conflicting PK records may resolve differently.

  2. In `crates/paimon/src/table/postpone_file_writer.rs`, rolled files are closed asynchronously and `creation_time` is assigned with `Utc::now()` when the async close finishes. This is also incompatible with Java's ordering semantics. Java's postpone compaction sorts files by `DataFileMeta.creationTime` before replaying them, so `creationTime` is part of the effective replay-order contract. Here an earlier file can easily end up with a later `creation_time` than a later file, depending on close timing. That makes replay order nondeterministic and can again change the final result for PK conflicts.

So these are not just implementation differences; they change the behavioral assumptions that Java relies on for postpone-bucket replay / compaction.
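A minimal sketch of the ordering concern in point 2, assuming the replay side groups files by writer and sorts each group by creation time as described above. `FileMeta`, `replay_order`, and the millisecond timestamps are hypothetical stand-ins, not the types used in this PR or in Java.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a data file's metadata.
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    name: String,
    write_id: u32,
    creation_time_ms: i64,
}

/// Groups files by write id, then sorts each group by creation time.
/// This mirrors the replay contract described in the review (assumed shape).
fn replay_order(files: &[FileMeta]) -> BTreeMap<u32, Vec<String>> {
    let mut groups: BTreeMap<u32, Vec<&FileMeta>> = BTreeMap::new();
    for f in files {
        groups.entry(f.write_id).or_default().push(f);
    }
    groups
        .into_iter()
        .map(|(id, mut fs)| {
            // Stable sort: files with equal timestamps keep their input order.
            fs.sort_by_key(|f| f.creation_time_ms);
            (id, fs.into_iter().map(|f| f.name.clone()).collect())
        })
        .collect()
}
```

If file `a` is rolled first but its asynchronous close completes after file `b`'s, stamping the time at close completion gives `a` the later timestamp, and the sketch replays `b` before `a`. Capturing the timestamp when the file is rolled would be one way to keep the order deterministic; that is an illustration, not necessarily the fix applied in this PR.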

@JingsongLi
Contributor Author

Regarding the first question, writeId identifies which worker it is. Since Rust only has one worker, it's hardcoded as 0.

@JingsongLi
Contributor Author

Regarding the second question, there is indeed a problem; I will fix it.

@JingsongLi
Contributor Author

Thanks @jerry-024, comments addressed.


@jerry-024 jerry-024 left a comment


+1

@JingsongLi JingsongLi merged commit d5dd8fc into apache:main Apr 16, 2026
8 checks passed