feat(write): add MERGE & UPDATE with DataEvolutionWriter #241

Merged
JingsongLi merged 2 commits into apache:main from JingsongLi:merge-into on Apr 15, 2026

Conversation

@JingsongLi
Contributor

@JingsongLi JingsongLi commented Apr 13, 2026

Purpose

  • Add shared DataFileWriter extracted from TableWrite
  • Add DataEvolutionWriter: engine-agnostic Table-layer API for partial-column updates via _ROW_ID
  • Add MERGE INTO execution in DataFusion
  • Add UPDATE execution in DataFusion

Brief change log

Tests

API and Format

Documentation

@JingsongLi force-pushed the merge-into branch 2 times, most recently from de603ad to 1c52144, on April 13, 2026 10:04
@JingsongLi changed the title from "[WIP] feat(write): add MERGE INTO support with RowIdUpdateWriter" to "feat(write): add MERGE INTO support with RowIdUpdateWriter" on Apr 13, 2026
@JingsongLi changed the title from "feat(write): add MERGE INTO support with RowIdUpdateWriter" to "feat(write): add MERGE & UPDATE support with RowIdUpdateWriter" on Apr 13, 2026
@JingsongLi force-pushed the merge-into branch 2 times, most recently from 0c496bc to 16c7d0a, on April 13, 2026 15:58

.with_bucket(file_range.bucket)
.with_bucket_path(file_range.bucket_path.clone())
.with_total_buckets(file_range.total_buckets)
.with_data_files(vec![file.clone()])
Contributor

At this point, _ROW_ID has already been narrowed to a single DataFileMeta, and this read only loads that file back. However, the current logical row range may span multiple files with the same first_row_id (base file + partial-column files). Upstream Java/Python paths seem to read the whole first_row_id group.
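
To make the concern concrete, here is a minimal sketch of reading by first_row_id group instead of by single file. The `DataFileMeta` struct below is a simplified, hypothetical stand-in (only `file_name`, `first_row_id`, and `row_count`), and `group_by_first_row_id` / `files_for_row_id` are illustrative helper names, not functions from this PR or from the upstream Java/Python code.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for the real DataFileMeta.
#[derive(Debug, Clone, PartialEq)]
pub struct DataFileMeta {
    pub file_name: String,
    /// First row id covered by this file; None when the file carries no row ids.
    pub first_row_id: Option<i64>,
    pub row_count: i64,
}

/// Group files by first_row_id so that a base file and its partial-column
/// files (which share the same first_row_id) end up in the same group.
pub fn group_by_first_row_id(files: &[DataFileMeta]) -> BTreeMap<i64, Vec<DataFileMeta>> {
    let mut groups: BTreeMap<i64, Vec<DataFileMeta>> = BTreeMap::new();
    for file in files {
        // Files without a first_row_id are skipped in this sketch.
        if let Some(id) = file.first_row_id {
            groups.entry(id).or_default().push(file.clone());
        }
    }
    groups
}

/// Return every file whose row range contains `row_id`, not just one of them.
pub fn files_for_row_id(
    groups: &BTreeMap<i64, Vec<DataFileMeta>>,
    row_id: i64,
) -> Vec<DataFileMeta> {
    // The candidate group is the last one whose first_row_id <= row_id.
    match groups.range(..=row_id).next_back() {
        Some((first, files)) => files
            .iter()
            .filter(|f| row_id < *first + f.row_count)
            .cloned()
            .collect(),
        None => Vec::new(),
    }
}
```

With a base file and one partial-column file that both start at row id 0, `files_for_row_id(&groups, 42)` returns both files, so a reader built on the whole group would see the updated columns as well as the base columns.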

ins.columns
.iter()
.zip(ins.value_exprs.iter())
.map(|(col, expr)| format!("{expr} AS {col}"))
Contributor

This keeps the generated batch in INSERT (...) order. But the write path later reads partition/bucket fields by target schema index, not by column name, so a reordered insert list can mis-map columns on partitioned / fixed-bucket tables.
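
One way to avoid the mis-mapping is to emit the projection in target schema order and look each column up by name. The sketch below illustrates that idea only; `project_in_schema_order` is a hypothetical helper, the schema is reduced to a list of field names, and filling unlisted columns with NULL is an assumption rather than the behavior of this PR.

```rust
/// Build "expr AS col" projections in target-schema order from an
/// INSERT (...) column list that may be in any order.
/// Hypothetical helper; unlisted columns are filled with NULL (an assumption).
pub fn project_in_schema_order(
    target_schema: &[&str],
    columns: &[&str],
    value_exprs: &[&str],
) -> Result<Vec<String>, String> {
    if columns.len() != value_exprs.len() {
        return Err(format!(
            "{} columns but {} value expressions",
            columns.len(),
            value_exprs.len()
        ));
    }
    // Reject columns that do not exist in the target schema.
    for col in columns {
        if !target_schema.contains(col) {
            return Err(format!("unknown column: {col}"));
        }
    }
    // Walk the schema, not the insert list, so positions match the table.
    Ok(target_schema
        .iter()
        .map(|field| match columns.iter().position(|c| c == field) {
            Some(i) => format!("{} AS {}", value_exprs[i], field),
            None => format!("NULL AS {field}"),
        })
        .collect())
}
```

For a target schema `(id, pt, v)` and `INSERT (v, id) VALUES (s.v, s.id)`, the result is ordered as `id, pt, v`, so reading the partition field `pt` by schema index picks the intended column.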

@JingsongLi
Contributor Author

@littlecoder04 Thanks for the review. I have addressed the comments and added e2e tests.

@JingsongLi changed the title from "feat(write): add MERGE & UPDATE support with RowIdUpdateWriter" to "feat(write): add MERGE & UPDATE support with DataEvolutionWriter" on Apr 15, 2026
@JingsongLi changed the title from "feat(write): add MERGE & UPDATE support with DataEvolutionWriter" to "feat(write): add MERGE & UPDATE with DataEvolutionWriter" on Apr 15, 2026
@littlecoder04
Contributor

+1

Contributor

@luoyuxia luoyuxia left a comment

+1


@jerry-024 jerry-024 left a comment

+1

@JingsongLi JingsongLi merged commit f424ded into apache:main Apr 15, 2026
8 checks passed
4 participants