perf: fast-path inline strings in ByteViewGroupValueBuilder::vectorized_append by EeshanBembi · Pull Request #21618 · apache/datafusion

EeshanBembi · 2026-04-14T13:05:23Z

Which issue does this PR close?

Closes #21568.

Rationale for this change

ByteViewGroupValueBuilder::vectorized_append was doing unnecessary work for short strings (≤12 bytes): for each row it called array.value(row) to decode the u128 view into a &[u8], then called make_view to re-encode it back into a u128. The input GenericByteViewArray already stores inline values in exactly that u128 format, so the round-trip is redundant.
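
To make the redundancy concrete, here is a minimal stdlib-only sketch of the inline view layout (a 4-byte little-endian length followed by up to 12 zero-padded data bytes). `encode_inline` and `decode_inline` are hypothetical stand-ins for `make_view` and `array.value(row)`, not the arrow-rs functions themselves, and they only model values of at most 12 bytes.

```rust
/// Hypothetical stand-in for `make_view`, restricted to inline (<= 12 byte) values.
/// Layout: bytes 0..4 hold the length (little-endian u32), bytes 4..16 hold the
/// data, zero-padded on the right.
fn encode_inline(data: &[u8]) -> u128 {
    assert!(data.len() <= 12, "only inline values are modelled here");
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}

/// Hypothetical stand-in for `array.value(row)` on an inline view:
/// reads the length, then returns that many data bytes.
fn decode_inline(view: u128) -> Vec<u8> {
    let bytes = view.to_le_bytes();
    let len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    bytes[4..4 + len].to_vec()
}

/// Returns true when decoding a view and re-encoding it yields the same u128,
/// i.e. when the round trip described above does no useful work.
fn round_trip_is_identity(view: u128) -> bool {
    encode_inline(&decode_inline(view)) == view
}
```

For any view produced by `encode_inline`, `round_trip_is_identity` holds because the padding bytes are already zero; that is the property the direct copy relies on.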

This mirrors the existing HAS_BUFFERS specialisation in vectorized_equal_to_inner, which uses the same data_buffers().is_empty() guard to take a direct-view-compare fast path for inline strings.

What changes are included in this PR?

In vectorized_append_inner, the Nulls::None branch now dispatches on arr.data_buffers().is_empty():

Fast path (no data buffers → all values ≤12 bytes inline): copies u128 views directly via self.views.extend(rows.iter().map(|&row| arr.views()[row])). Arrow's validity invariant guarantees inline views are zero-padded, so direct copy is semantically identical to value() → make_view().
Slow path (array has non-inline strings): adds self.views.reserve(rows.len()) before the existing loop to avoid repeated reallocation.
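
The two paths above can be sketched with a toy model that uses only the standard library. Everything here is an assumption made for illustration: `ViewArray`, `Builder`, `append_rows`, and the local `make_view` and `value` helpers are hypothetical simplifications (a single out-of-line buffer, no nulls), not the DataFusion or arrow-rs types.

```rust
/// Toy model of a view array: 16-byte views plus optional out-of-line buffers.
struct ViewArray {
    views: Vec<u128>,
    data_buffers: Vec<Vec<u8>>,
}

/// Toy model of the builder: its own views and one out-of-line buffer.
#[derive(Default)]
struct Builder {
    views: Vec<u128>,
    buffer: Vec<u8>,
}

fn read_u32(bytes: &[u8; 16], at: usize) -> u32 {
    u32::from_le_bytes([bytes[at], bytes[at + 1], bytes[at + 2], bytes[at + 3]])
}

/// Simplified view encoder: inline for <= 12 bytes, otherwise
/// length + 4-byte prefix + buffer index + offset.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= 12 {
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        bytes[4..8].copy_from_slice(&data[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Simplified decoder: resolves a view to its bytes.
fn value(arr: &ViewArray, row: usize) -> Vec<u8> {
    let bytes = arr.views[row].to_le_bytes();
    let len = read_u32(&bytes, 0) as usize;
    if len <= 12 {
        bytes[4..4 + len].to_vec()
    } else {
        let buf = read_u32(&bytes, 8) as usize;
        let off = read_u32(&bytes, 12) as usize;
        arr.data_buffers[buf][off..off + len].to_vec()
    }
}

impl Builder {
    fn append_rows(&mut self, arr: &ViewArray, rows: &[usize]) {
        if arr.data_buffers.is_empty() {
            // Fast path: no data buffers, so every view is inline and can be
            // copied as-is.
            self.views.extend(rows.iter().map(|&row| arr.views[row]));
        } else {
            // Slow path: reserve once, then decode and re-encode each row,
            // copying long values into the builder's own buffer.
            self.views.reserve(rows.len());
            for &row in rows {
                let v = value(arr, row);
                let offset = self.buffer.len() as u32;
                if v.len() > 12 {
                    self.buffer.extend_from_slice(&v);
                }
                self.views.push(make_view(&v, 0, offset));
            }
        }
    }
}
```

In this model the fast path never touches `buffer`, while the slow path has to rewrite long views because their buffer index and offset refer to the source array's buffers rather than the builder's.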

Are these changes tested?

Covered by the existing 6 unit tests in bytes_view::tests, all passing unchanged. test_byte_view_vectorized_operation_special_case exercises the fast path directly (11-byte strings, no data buffers).

Are there any user-facing changes?

No. Internal performance improvement only.

Benchmark

inline_null_0.0_size_1000/vectorized_append (8-byte strings, no nulls, 1 000 rows):

	time
Before	3.37 µs
After	495 ns
Change	−85.3% (6.8× faster)

…on types

Closes apache#21144

Implements DFExtensionType for all remaining canonical Arrow extension types so they are recognized and pretty-printed by the extension type registry:

- Bool8: displays Int8 values as 'true'/'false' instead of raw integers
- Json: uses default string formatter (values are already valid JSON)
- Opaque: uses default formatter
- FixedShapeTensor: uses default formatter, storage_type computed from value_type and list_size
- VariableShapeTensor: uses default formatter, storage_type computed from value_type and dimensions
- TimestampWithOffset: uses default formatter

All six types are registered in MemoryExtensionTypeRegistry::new_with_canonical_extension_types() alongside the existing UUID registration.

…ed_append

When the input StringView/BinaryView array has no data buffers (all values ≤12 bytes, stored inline), skip the value() → make_view() round-trip in do_append_val_inner and instead copy the u128 views directly. Arrow guarantees valid arrays have zero-padded inline views, so the direct copy is semantically identical and lets the compiler vectorize the loop.

Also pre-reserve views capacity in the slow path (non-inline strings) to avoid repeated Vec reallocation.

Closes apache#21568

ebembi-crdb and others added 3 commits April 7, 2026 18:33

chore: sync with upstream apache/datafusion main

b1f8043

github-actions bot added the logical-expr, core, common, and physical-plan labels Apr 14, 2026

EeshanBembi marked this pull request as draft April 16, 2026 15:48

EeshanBembi wants to merge 3 commits into apache:main from EeshanBembi:main
