[spark] Support batch union read for lake-enabled primary key tables by Yohahaha · Pull Request #3042 · apache/fluss

Yohahaha · 2026-04-09T07:01:11Z

Purpose

Linked issue: close #2984

This PR adds support for reading lake-enabled primary key tables in Spark SQL.

Brief change log

Introduces Spark lake upsert batch scan/planning/reader implementation (lake snapshot + KV/log tail merge).
Refactors Spark lake test infrastructure and adds primary-key lake read test coverage (Paimon).
Moves/refactors LakeSnapshotAndLogSplitScanner into fluss-client and updates Flink/Spark integrations accordingly.
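
For readers new to union read, the merge semantics behind "lake snapshot + KV/log tail merge" can be sketched roughly as below. This is only an illustration under simplifying assumptions, not the code in this PR: `UnionReadSketch`, `Change`, and `unionRead` are hypothetical names, keys and values are reduced to `int` and `String`, and everything is held in memory instead of being streamed through `LakeSnapshotAndLogSplitScanner`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: every name here is hypothetical, not a Fluss or Spark API.
final class UnionReadSketch {

    /** One change from the log tail: an upsert carries a value, a delete carries null. */
    static final class Change {
        final int key;
        final String value; // null means "delete this key"

        Change(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Starts from the rows in the lake snapshot, then replays the log tail in order
     * so that the latest change per primary key wins. The input map is not modified.
     */
    static Map<Integer, String> unionRead(Map<Integer, String> lakeSnapshot, List<Change> logTail) {
        Map<Integer, String> merged = new LinkedHashMap<>(lakeSnapshot);
        for (Change change : logTail) {
            if (change.value == null) {
                merged.remove(change.key);
            } else {
                merged.put(change.key, change.value);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<Integer, String> lake = new LinkedHashMap<>();
        lake.put(1, "alice");
        lake.put(2, "bob");

        List<Change> tail = new ArrayList<>();
        tail.add(new Change(1, "alice-v2")); // update of a row already tiered to the lake
        tail.add(new Change(2, null));       // delete of a row already tiered to the lake
        tail.add(new Change(3, "carol"));    // insert that exists only in the log tail

        System.out.println(unionRead(lake, tail)); // prints {1=alice-v2, 3=carol}
    }
}
```

The sketch covers the same situations the new tests are described as exercising (lake-only data, union of lake and tail, and updates), but it says nothing about how splits, offsets, or projections are handled in the actual reader.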

Tests

Added SparkLakePrimaryKeyTableReadTestBase, with a concrete Paimon test.

API and Format

No API or format changes.

Documentation

No new feature documentation required (extends existing lake reading capability).

Yohahaha · 2026-04-10T03:29:56Z

CI failed due to known issue #2992.

Yohahaha · 2026-04-10T03:30:17Z

@YannByron would you help review this PR?

fresh-borzoni

@Yohahaha Ty for the PR, looks good, just one comment

fresh-borzoni

@Yohahaha Ty, LGTM 👍

YannByron · 2026-04-14T14:07:39Z

LGTM.

Yohahaha · 2026-04-15T07:37:41Z

@wuchong @luoyuxia would you help take a look?

Copilot

Pull request overview

This PR adds Spark SQL batch support for reading lake-enabled primary key (upsert) tables by unioning lake snapshot data with Fluss KV/log tail, and refactors shared lake-reading components for reuse across connectors.

Changes:

Introduces Spark lake upsert batch scan/planning/reader implementation (lake snapshot + KV/log tail merge).
Refactors Spark lake test infrastructure and adds primary-key lake read test coverage (Paimon).
Moves/refactors LakeSnapshotAndLogSplitScanner into fluss-client and updates Flink/Spark integrations accordingly.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| fluss-spark/fluss-spark-ut/src/test/scala/org/apache/fluss/spark/lake/SparkLakeTableReadTestBase.scala | New shared Spark lake test base, including tier-to-lake helper. |
| fluss-spark/fluss-spark-ut/src/test/scala/org/apache/fluss/spark/lake/SparkLakePrimaryKeyTableReadTestBase.scala | New PK-table lake read tests (fallback, lake-only, union, updates) + Paimon concrete test. |
| fluss-spark/fluss-spark-ut/src/test/scala/org/apache/fluss/spark/lake/SparkLakeLogTableReadTest.scala | Refactors log-table lake tests to reuse the new shared base; inlines concrete Paimon/Iceberg tests. |
| fluss-spark/fluss-spark-ut/src/test/scala/org/apache/fluss/spark/lake/SparkLakePaimonLogTableReadTest.scala | Removes standalone Paimon log-table lake test (now inlined). |
| fluss-spark/fluss-spark-ut/src/test/scala/org/apache/fluss/spark/lake/SparkLakeIcebergLogTableReadTest.scala | Removes standalone Iceberg log-table lake test (now inlined). |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/read/lake/FlussLakeUtils.scala | Moves/renames shared lake utilities and adds split (de)serialization helpers. |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/read/lake/FlussLakeBatch.scala | New shared base for lake batch planning + stopping-offset logic. |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/read/lake/FlussLakeInputPartition.scala | Adds lake-only and lake+tail (upsert) Spark input partition types. |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/read/lake/FlussLakeUpsertBatch.scala | New batch planner for PK lake union read (lake snapshot + log tail). |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/read/lake/FlussLakeUpsertPartitionReader.scala | New reader that drives `LakeSnapshotAndLogSplitScanner` and returns Spark rows. |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/read/lake/FlussLakeUpsertPartitionReaderFactory.scala | New reader factory dispatching between upsert-merge and lake-only readers. |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/read/lake/FlussLakeAppendBatch.scala | Refactors append-lake batch to reuse `FlussLakeBatch` and new utils/package. |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/read/lake/FlussLakeAppendPartitionReaderFactory.scala | Moves to new package and uses renamed utils/reader. |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/read/lake/FlussLakeAppendPartitionReader.scala | Renames lake-only reader class and simplifies reader context creation. |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/read/FlussScanBuilder.scala | Adds Spark DSv2 scan builder for lake-enabled PK tables. |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/read/FlussScan.scala | Adds `FlussLakeUpsertScan` wiring into Spark batch scans. |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/read/FlussInputPartition.scala | Removes lake split partition type (moved) and fixes upsert partition `toString` formatting. |
| fluss-spark/fluss-spark-common/src/main/scala/org/apache/fluss/spark/SparkTable.scala | Routes PK scans to lake upsert scan when datalake is enabled. |
| fluss-flink/fluss-flink-common/src/main/java/org/apache/fluss/flink/lake/LakeSplitReaderGenerator.java | Updates to use the refactored scanner API now located in `fluss-client`. |
| fluss-client/src/main/java/org/apache/fluss/client/table/scanner/batch/LakeSnapshotAndLogSplitScanner.java | Moves/refactors scanner to be connector-agnostic (accepts splits + explicit offsets). |
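
The table mentions a reader factory that dispatches between upsert-merge and lake-only readers. As a rough sketch of that pattern only, with hypothetical stand-in types (`ReaderDispatchSketch`, `LakeOnlyPartition`, `LakeAndTailPartition`, and `Reader` are not the classes added in this PR and do not use Spark's interfaces), the dispatch might look like:

```java
// Illustrative only: these types are hypothetical stand-ins, not the classes added in this PR.
final class ReaderDispatchSketch {

    /** Marker for a unit of work handed to a reader. */
    interface InputPartition {}

    /** A partition whose data lives entirely in the lake snapshot. */
    static final class LakeOnlyPartition implements InputPartition {
        final String bucket;

        LakeOnlyPartition(String bucket) {
            this.bucket = bucket;
        }
    }

    /** A partition that needs lake snapshot rows merged with a log tail. */
    static final class LakeAndTailPartition implements InputPartition {
        final String bucket;

        LakeAndTailPartition(String bucket) {
            this.bucket = bucket;
        }
    }

    /** Minimal reader abstraction; a real reader would iterate rows instead. */
    interface Reader {
        String describe();
    }

    /** Picks a reader by partition type; unknown types fail fast. */
    static Reader createReader(InputPartition partition) {
        if (partition instanceof LakeAndTailPartition) {
            final String bucket = ((LakeAndTailPartition) partition).bucket;
            return () -> "upsert-merge:" + bucket;
        }
        if (partition instanceof LakeOnlyPartition) {
            final String bucket = ((LakeOnlyPartition) partition).bucket;
            return () -> "lake-only:" + bucket;
        }
        throw new IllegalArgumentException(
                "Unsupported partition type: " + partition.getClass().getName());
    }

    public static void main(String[] args) {
        System.out.println(createReader(new LakeOnlyPartition("bucket-0")).describe());    // prints lake-only:bucket-0
        System.out.println(createReader(new LakeAndTailPartition("bucket-1")).describe()); // prints upsert-merge:bucket-1
    }
}
```

Keeping the lake-only case on its own path lets partitions with no log tail skip the merge step entirely, which appears to be the intent of the "optimize lake only split" commit, though that reading is an assumption.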

Yohahaha · 2026-04-16T08:17:47Z

@luoyuxia would you help review again?

luoyuxia · 2026-04-16T09:22:16Z

> @luoyuxia would you help review again?

Thanks, will have a quick review today.

luoyuxia

@Yohahaha Thanks for the PR. LGTM overall. Only one minor comment.
Btw, will you support filter pushdown in a following PR?

luoyuxia · 2026-04-16T23:54:17Z

Seems we can now remove the limitation listed in https://fluss.apache.org/docs/next/engine-spark/reads/#limitations

luoyuxia · 2026-04-16T23:59:53Z

Btw, I think it may be great to have a quickstart for Spark, just as we did for Flink in https://fluss.apache.org/docs/next/quickstart/flink

Yohahaha · 2026-04-17T04:13:13Z

> @Yohahaha Thanks for the PR. LGTM overall. Only one minor comment. Btw, will you support filter pushdown in a following PR?

More read optimizations, such as filter pushdown and fine-grained split generation, are on the way.

The docs update will come in a following PR.

luoyuxia

+1

Yohahaha changed the title ~~[spark] Support reading lake-enabled primary key tables in Spark connector~~ [spark] Support reading lake-enabled primary key tables Apr 9, 2026

Yohahaha changed the title ~~[spark] Support reading lake-enabled primary key tables~~ [spark] Support batch union read for lake-enabled primary key tables Apr 9, 2026

fresh-borzoni reviewed Apr 13, 2026

Comment thread ...luss-spark-common/src/main/scala/org/apache/fluss/spark/read/lake/FlussLakeUpsertBatch.scala Outdated

fresh-borzoni reviewed Apr 13, 2026

fresh-borzoni approved these changes Apr 14, 2026

luoyuxia requested review from Copilot and luoyuxia April 15, 2026 09:17

Copilot started reviewing on behalf of luoyuxia April 15, 2026 09:18

Copilot AI reviewed Apr 15, 2026

Yohahaha added 13 commits April 16, 2026 14:03

add pk table ut and move to lake dir

aa6a894

compact test files

d50589f

remove duplicate code

14f526c

remove duplicate code

f1b662e

fix style

20888fa

ignore iceberg suite

eb2c139

trigger CI

cb429c5

optimize lake only split

582d046

fix comments

8115252

fix build

8b2ff6a

fix rebase

6ef78f4

remove unsued var

6e604d0

add partition table case

4fde215

Yohahaha force-pushed the batch-union-read-pk branch from 8ff9199 to 4fde215 April 16, 2026 07:47

luoyuxia reviewed Apr 16, 2026

Comment thread ...luss-spark-common/src/main/scala/org/apache/fluss/spark/read/lake/FlussLakeUpsertBatch.scala

optimize partition creation when data is empty

ae4b4e3

luoyuxia approved these changes Apr 17, 2026

luoyuxia merged commit 26a4a72 into apache:main Apr 17, 2026
6 checks passed

Yohahaha deleted the batch-union-read-pk branch April 17, 2026 08:28
