
GH-40062: [C++][Python] Conversion of Table to Arrow Tensor#41870

Open
AlenkaF wants to merge 22 commits into apache:main from AlenkaF:gh-40062-table-to-tensor

Conversation

@AlenkaF
Member

@AlenkaF AlenkaF commented May 29, 2024

Rationale for this change

There is currently no method to convert an Arrow Table to an Arrow Tensor (a conversion from the columnar format to a contiguous block of memory). This work is a continuation of the RecordBatch::ToTensor work, see #40058.

What changes are included in this PR?

This PR:

  • implements Table::ToTensor conversion
  • adds bindings to Python
  • adds benchmarks in C++
  • removes the code in RecordBatch::ToTensor and uses the Table implementation (RecordBatch::ToTensor benchmarks checked)

Are these changes tested?

Yes, in C++ and Python.

Are there any user-facing changes?

No breaking changes; this is a new feature.

@AlenkaF
Member Author

AlenkaF commented May 29, 2024

Benchmarks for RecordBatch::ToTensor after changing the implementation to use Table::ToTensor:

(pyarrow-dev) alenkafrim@alenka-mac arrow % archery --quiet benchmark diff --benchmark-filter=BatchToTensorSimple
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (7)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                  benchmark      baseline     contender  change %                                                                                                                                                                                                  counters
  BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3 8.540 GiB/sec 8.826 GiB/sec     3.351   {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1545}
  BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3 4.515 GiB/sec 4.583 GiB/sec     1.516    {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 787}
 BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30 5.355 GiB/sec 5.426 GiB/sec     1.320   {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 971}
 BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30 2.113 GiB/sec 2.120 GiB/sec     0.331   {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 380}
BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300 2.009 GiB/sec 1.976 GiB/sec    -1.620  {'family_index': 1, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 363}
    BatchToTensorSimple<Int16Type>/size:65536/num_columns:3 5.391 GiB/sec 5.141 GiB/sec    -4.645    {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 61484}
BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300 7.797 GiB/sec 7.429 GiB/sec    -4.716 {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1374}

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (17)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                  benchmark        baseline       contender  change %                                                                                                                                                                                                 counters
 BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300 698.025 MiB/sec 642.690 MiB/sec    -7.927  {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 117}
  BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30 936.761 MiB/sec 849.504 MiB/sec    -9.315   {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 164}
 BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30   2.943 GiB/sec   2.664 GiB/sec    -9.484  {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 530}
   BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3   1.220 GiB/sec   1.103 GiB/sec    -9.540    {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 226}
BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300   3.350 GiB/sec   3.004 GiB/sec   -10.308 {'family_index': 2, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 603}
     BatchToTensorSimple<Int8Type>/size:65536/num_columns:3   1.343 GiB/sec   1.193 GiB/sec   -11.189    {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15407}
  BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3   6.492 GiB/sec   5.679 GiB/sec   -12.518  {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1170}
    BatchToTensorSimple<Int32Type>/size:65536/num_columns:3   8.703 GiB/sec   7.530 GiB/sec   -13.478   {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 99016}
    BatchToTensorSimple<Int64Type>/size:65536/num_columns:3  17.419 GiB/sec  14.934 GiB/sec   -14.269  {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 198847}
    BatchToTensorSimple<Int8Type>/size:65536/num_columns:30   1.246 GiB/sec   1.013 GiB/sec   -18.692   {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 14331}
   BatchToTensorSimple<Int16Type>/size:65536/num_columns:30   3.813 GiB/sec   3.045 GiB/sec   -20.148  {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 43240}
   BatchToTensorSimple<Int32Type>/size:65536/num_columns:30   5.497 GiB/sec   3.822 GiB/sec   -30.460  {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 63621}
   BatchToTensorSimple<Int8Type>/size:65536/num_columns:300 665.489 MiB/sec 452.284 MiB/sec   -32.037   {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7122}
   BatchToTensorSimple<Int64Type>/size:65536/num_columns:30   7.306 GiB/sec   4.883 GiB/sec   -33.166  {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 83661}
  BatchToTensorSimple<Int16Type>/size:65536/num_columns:300   1.024 GiB/sec 646.927 MiB/sec   -38.317 {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 11642}
  BatchToTensorSimple<Int64Type>/size:65536/num_columns:300   1.208 GiB/sec 711.915 MiB/sec   -42.439 {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13994}
  BatchToTensorSimple<Int32Type>/size:65536/num_columns:300   1.158 GiB/sec 678.147 MiB/sec   -42.812 {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13406}

AlenkaF added a commit that referenced this pull request Jun 5, 2024
…to tensor.cc (#41932)

### Rationale for this change

This is a precursor PR to #41870, intended to make the review of #41870 easier (the code diff will then be visible; currently it isn't, because the code was moved to table.cc. It should also live in tensor.cc).

### What changes are included in this PR?

The code from `RecordBatch::ToTensor` in record_batch.cc is moved to `RecordBatchToTensor` in tensor.cc.

### Are these changes tested?

Existing tests should pass.

### Are there any user-facing changes?

No.

**This PR does not close the linked issue yet, it is just a precursor!**
* GitHub Issue: #40062

Authored-by: AlenkaF <frim.alenka@gmail.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
@AlenkaF AlenkaF force-pushed the gh-40062-table-to-tensor branch from 13c49a7 to 15574f8 on June 5, 2024 12:51
Member

@jorisvandenbossche jorisvandenbossche left a comment


Generally looks good! A few minor comments, and I'm wondering if we can reduce the duplication in the tests a bit.

Comment thread cpp/src/arrow/table.h Outdated
Comment thread cpp/src/arrow/tensor.cc Outdated
Comment thread cpp/src/arrow/tensor.cc Outdated
Comment thread python/pyarrow/table.pxi Outdated
Comment thread python/pyarrow/tests/test_table.py Outdated
@github-actions github-actions bot added the awaiting changes and awaiting review labels and removed the awaiting review label on Jun 6, 2024
@AlenkaF AlenkaF force-pushed the gh-40062-table-to-tensor branch from 15574f8 to 4decf7f on June 10, 2024 15:56
@github-actions github-actions bot added the awaiting review label and removed the awaiting review and awaiting changes labels on Jun 10, 2024
@AlenkaF
Member Author

AlenkaF commented Jun 10, 2024

I have researched the benchmark regression a bit and found that:

  • running the benchmarks for RecordBatch::ToTensor shows regressions of up to 40% in time
  • removing the Table creation but otherwise keeping the code as is (hardcoding the loop over the chunks to a single iteration) reduces the maximum regression to about 20%
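The chunk loop behind that experiment can be illustrated with a small NumPy sketch (a hypothetical illustration, not the C++ implementation): a Table-based implementation must copy each chunk of a column into the destination at its running row offset, while a RecordBatch column has exactly one chunk, so the loop collapses to a single copy.

```python
import numpy as np

def chunked_column_to_dest(dest_col, chunks):
    # Copy each chunk into the contiguous destination at its row offset.
    # For a RecordBatch there is exactly one chunk, so this loop runs once.
    offset = 0
    for chunk in chunks:
        dest_col[offset:offset + len(chunk)] = chunk
        offset += len(chunk)

dest = np.empty(5, dtype=np.int64)
chunked_column_to_dest(dest, [np.array([1, 2]), np.array([3, 4, 5])])
# dest now holds [1, 2, 3, 4, 5]
```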
benchmark diff output
(pyarrow-dev) alenkafrim@alenka-mac build % archery --quiet benchmark diff --benchmark-filter=ToTensorSimple
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (7)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                benchmark       baseline      contender  change %                                                                                                                                                                                                 counters
 BatchToTensorSimple<Int64Type>/size:65536/num_columns:30  7.321 GiB/sec  7.341 GiB/sec     0.275  {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 84665}
  BatchToTensorSimple<Int64Type>/size:65536/num_columns:3 17.341 GiB/sec 17.385 GiB/sec     0.256  {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 197830}
BatchToTensorSimple<Int32Type>/size:65536/num_columns:300  1.153 GiB/sec  1.136 GiB/sec    -1.413 {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13151}
BatchToTensorSimple<Int64Type>/size:65536/num_columns:300  1.221 GiB/sec  1.198 GiB/sec    -1.838 {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13997}
BatchToTensorSimple<Int16Type>/size:65536/num_columns:300  1.027 GiB/sec  1.005 GiB/sec    -2.092 {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 11502}
 BatchToTensorSimple<Int16Type>/size:65536/num_columns:30  3.824 GiB/sec  3.728 GiB/sec    -2.521  {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 43449}
BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3  4.435 GiB/sec  4.322 GiB/sec    -2.550   {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 792}

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (17)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                  benchmark        baseline        contender  change %                                                                                                                                                                                                  counters
 BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30   5.354 GiB/sec    5.078 GiB/sec    -5.159   {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 959}
    BatchToTensorSimple<Int32Type>/size:65536/num_columns:3   8.656 GiB/sec    8.107 GiB/sec    -6.348    {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 96401}
BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300   7.884 GiB/sec    7.371 GiB/sec    -6.506 {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1140}
 BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30   2.109 GiB/sec    1.969 GiB/sec    -6.655   {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 378}
BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300   2.007 GiB/sec    1.869 GiB/sec    -6.878  {'family_index': 1, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 360}
   BatchToTensorSimple<Int32Type>/size:65536/num_columns:30   5.514 GiB/sec    5.116 GiB/sec    -7.218   {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 62798}
BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300   3.346 GiB/sec    3.066 GiB/sec    -8.379  {'family_index': 2, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 601}
   BatchToTensorSimple<Int8Type>/size:65536/num_columns:300 669.230 MiB/sec  598.420 MiB/sec   -10.581    {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7493}
    BatchToTensorSimple<Int16Type>/size:65536/num_columns:3   5.393 GiB/sec    4.745 GiB/sec   -12.015    {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 61699}
 BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300 700.642 MiB/sec  611.987 MiB/sec   -12.653   {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 123}
    BatchToTensorSimple<Int8Type>/size:65536/num_columns:30   1.247 GiB/sec    1.075 GiB/sec   -13.836    {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 14200}
  BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3   6.465 GiB/sec    5.567 GiB/sec   -13.879   {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1156}
  BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30 938.704 MiB/sec  792.766 MiB/sec   -15.547    {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 164}
 BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30   2.944 GiB/sec    2.453 GiB/sec   -16.660   {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 529}
  BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3   8.618 GiB/sec    7.157 GiB/sec   -16.959   {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1521}
   BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3   1.197 GiB/sec 1008.475 MiB/sec   -17.748     {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 227}
     BatchToTensorSimple<Int8Type>/size:65536/num_columns:3   1.314 GiB/sec    1.057 GiB/sec   -19.601     {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15032}

I plan to also try profiling in Python (py-spy doesn't work on macOS; any other suggestions?). Update: installed py-spy with brew and it works; looking at the .svg at the moment.

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label on Jun 11, 2024
Comment thread cpp/src/arrow/table.h Outdated
Comment thread python/pyarrow/table.pxi Outdated
@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label on Jun 11, 2024
@jorisvandenbossche
Member

I have researched the benchmark regression a bit and found that:

Do you see those regressions of up to 40% for both row-major and column-major conversions? And for both uniform types and mixed types with casting?

@AlenkaF
Member Author

AlenkaF commented Jun 11, 2024

Do you see those regressions of up to 40% for both row major and column major conversions? And both for uniform vs mixed type with casting?

Benchmarks for RecordBatch only test the row-major conversion; the newly added Table benchmarks test both. I think that is because we added features to RecordBatch::ToTensor step by step and needed one simple benchmark to check while adding them, and the row-major conversion was the last feature to be added.
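The difference between the two directions can be sketched in NumPy (an illustration of the memory layouts, not the Arrow code): for a column-major destination each source column lands in one contiguous run, while for a row-major destination the writes for one column are strided by the number of columns.

```python
import numpy as np

cols = [np.array([1, 2, 3], dtype=np.int32),
        np.array([4, 5, 6], dtype=np.int32)]

# Column-major ("F") destination: each column copy is one contiguous run,
# essentially a memcpy per column.
f_out = np.empty((3, 2), dtype=np.int32, order="F")
for j, c in enumerate(cols):
    f_out[:, j] = c

# Row-major ("C") destination: writes for one column are strided by
# num_columns, which makes this the costlier direction.
c_out = np.empty((3, 2), dtype=np.int32, order="C")
for j, c in enumerate(cols):
    c_out[:, j] = c

f_mem = f_out.ravel(order="K")  # physical order: [1, 2, 3, 4, 5, 6]
c_mem = c_out.ravel(order="K")  # physical order: [1, 4, 2, 5, 3, 6]
```

Both results hold the same logical values; only the physical layout of the destination buffer differs.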

As for the types, we only test uniform types in C++ benchmarks at the moment.

PS: I haven't been able to extract any useful information with either py-spy or cProfile.

@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label on Jun 11, 2024
@github-actions github-actions bot added the Status: stale-warning label on Nov 18, 2025
@thisisnic
Member

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

@AlenkaF AlenkaF removed the Status: stale-warning label on Nov 19, 2025
@AlenkaF AlenkaF force-pushed the gh-40062-table-to-tensor branch from 32eedbc to 1f12b90 on April 15, 2026 11:09
@AlenkaF
Member Author

AlenkaF commented Apr 15, 2026

I finally took time to improve the benchmarks on this change. It has been clear from #41870 (comment) that creating a Table for the RecordBatch-to-Tensor conversion is the main source of overhead. I consulted Claude Code and GitHub Copilot, which gave me two good ideas to test.

  1. Pre-compute the destination index for the row-major conversion (4a879f9)
  • Number of regressions fell from 17 to 13
  • Max regression fell from -43% to -38%
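The first idea can be sketched in NumPy (a hypothetical illustration of the optimization, not the committed C++ code): in a row-major destination, column j always writes to offsets j, j + num_columns, j + 2*num_columns, ..., so the stride can be set up once per column instead of recomputing the index for every element.

```python
import numpy as np

num_rows, num_columns = 4, 3
cols = [np.arange(num_rows, dtype=np.int64) + 10 * j
        for j in range(num_columns)]

out = np.empty(num_rows * num_columns, dtype=np.int64)  # flat row-major buffer
for j, col in enumerate(cols):
    # Precomputed strided destination: offsets j, j + num_columns, ...
    out[j::num_columns] = col

tensor = out.reshape(num_rows, num_columns)  # row-major (num_rows, num_columns)
```

Each column is written with a single strided copy, which mirrors moving the per-element index computation out of the inner loop.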
Benchmark result 1
$ archery --quiet benchmark diff --benchmark-filter=BatchToTensorSimple
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (11)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                  benchmark        baseline       contender  change %                                                                                                                                                                                                 counters
 BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30   3.400 GiB/sec   4.006 GiB/sec    17.840  {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 609}
  BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3   9.607 GiB/sec  11.257 GiB/sec    17.181  {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1723}
BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300   3.922 GiB/sec   4.246 GiB/sec     8.258 {'family_index': 2, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 711}
   BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3   1.346 GiB/sec   1.360 GiB/sec     1.043    {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 241}
  BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3   5.743 GiB/sec   5.754 GiB/sec     0.193  {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1024}
 BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30   2.329 GiB/sec   2.319 GiB/sec    -0.430  {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 418}
     BatchToTensorSimple<Int8Type>/size:65536/num_columns:3   1.375 GiB/sec   1.365 GiB/sec    -0.702    {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15777}
  BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30 980.412 MiB/sec 959.127 MiB/sec    -2.171   {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 172}
 BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300 724.155 MiB/sec 708.308 MiB/sec    -2.188  {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 126}
    BatchToTensorSimple<Int8Type>/size:65536/num_columns:30   1.255 GiB/sec   1.216 GiB/sec    -3.107   {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 14451}
    BatchToTensorSimple<Int16Type>/size:65536/num_columns:3   5.227 GiB/sec   5.054 GiB/sec    -3.307   {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 59905}

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (13)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                  benchmark        baseline       contender  change %                                                                                                                                                                                                  counters
 BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30   7.150 GiB/sec   6.782 GiB/sec    -5.158  {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1281}
BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300   2.229 GiB/sec   2.017 GiB/sec    -9.482  {'family_index': 1, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 401}
   BatchToTensorSimple<Int16Type>/size:65536/num_columns:30   4.068 GiB/sec   3.445 GiB/sec   -15.303   {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 46683}
  BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3  14.380 GiB/sec  11.985 GiB/sec   -16.657   {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2582}
    BatchToTensorSimple<Int64Type>/size:65536/num_columns:3  17.690 GiB/sec  14.347 GiB/sec   -18.901   {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 203532}
BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300  13.195 GiB/sec  10.688 GiB/sec   -18.999 {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2358}
   BatchToTensorSimple<Int8Type>/size:65536/num_columns:300 746.820 MiB/sec 595.770 MiB/sec   -20.226    {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 8300}
    BatchToTensorSimple<Int32Type>/size:65536/num_columns:3   9.870 GiB/sec   7.690 GiB/sec   -22.088   {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 111961}
   BatchToTensorSimple<Int64Type>/size:65536/num_columns:30   8.216 GiB/sec   6.032 GiB/sec   -26.581   {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 94142}
   BatchToTensorSimple<Int32Type>/size:65536/num_columns:30   6.596 GiB/sec   4.725 GiB/sec   -28.357   {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 59518}
  BatchToTensorSimple<Int16Type>/size:65536/num_columns:300   1.214 GiB/sec 870.740 MiB/sec   -29.978  {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13917}
  BatchToTensorSimple<Int64Type>/size:65536/num_columns:300   1.508 GiB/sec 989.418 MiB/sec   -35.907  {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17500}
  BatchToTensorSimple<Int32Type>/size:65536/num_columns:300   1.421 GiB/sec 901.875 MiB/sec   -38.012  {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16439}
1. Templated the `ToTensor` method for `RecordBatch` and `Table` separately, avoiding the heap allocation from creating a `Table` for each `RecordBatch` (1f12b90)
   - The number of regressions fell to between 3 and 5 in multiple runs
   - The largest regression shrank from -38% to between -15% and -18% in multiple runs
   - Net throughput improves on most shapes; the remaining losses are concentrated in `Int64Type`
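For intuition, here is a rough NumPy sketch of the column-to-contiguous copy these benchmarks measure. This is illustrative only, not the actual Arrow implementation: the real `Table::ToTensor` operates on Arrow `Buffer` memory directly, handles chunked columns, and negotiates the result type across column types.

```python
import numpy as np

def columns_to_tensor(columns):
    """Copy equal-length fixed-width columns into one contiguous,
    row-major 2-D block (num_rows x num_columns), mimicking the
    shape of a Table-to-Tensor conversion."""
    num_rows = len(columns[0])
    # Promote to a common result type, as the Arrow conversion does
    # when columns have mixed integer widths.
    result_type = np.result_type(*[c.dtype for c in columns])
    out = np.empty((num_rows, len(columns)), dtype=result_type)
    for j, col in enumerate(columns):
        out[:, j] = col  # one strided copy per column
    return out

cols = [np.arange(4, dtype=np.int64), np.ones(4, dtype=np.int64)]
tensor = columns_to_tensor(cols)
```

The per-column strided writes are where the benchmark's sensitivity to `num_columns` and element width comes from: narrow types and many columns mean more, smaller copies per byte of output.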
Benchmark result 2:

$ archery --quiet benchmark diff --benchmark-filter=BatchToTensorSimple
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (21)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                  benchmark        baseline       contender  change %                                                                                                                                                                                                 counters
 BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30   3.380 GiB/sec   4.020 GiB/sec    18.919  {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 609}
  BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3   9.570 GiB/sec  11.156 GiB/sec    16.575  {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1718}
  BatchToTensorSimple<Int16Type>/size:65536/num_columns:300   1.199 GiB/sec   1.385 GiB/sec    15.483 {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10000}
  BatchToTensorSimple<Int32Type>/size:65536/num_columns:300   1.424 GiB/sec   1.631 GiB/sec    14.528 {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16702}
  BatchToTensorSimple<Int64Type>/size:65536/num_columns:300   1.517 GiB/sec   1.735 GiB/sec    14.355 {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17535}
BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300   3.912 GiB/sec   4.381 GiB/sec    11.977 {'family_index': 2, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int32Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 602}
   BatchToTensorSimple<Int8Type>/size:65536/num_columns:300 743.539 MiB/sec 800.822 MiB/sec     7.704   {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 8398}
   BatchToTensorSimple<Int64Type>/size:65536/num_columns:30   8.443 GiB/sec   8.652 GiB/sec     2.476  {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 96545}
   BatchToTensorSimple<Int16Type>/size:65536/num_columns:30   4.061 GiB/sec   4.161 GiB/sec     2.461  {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 46607}
    BatchToTensorSimple<Int8Type>/size:65536/num_columns:30   1.250 GiB/sec   1.267 GiB/sec     1.378   {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 14433}
 BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30   2.308 GiB/sec   2.339 GiB/sec     1.378  {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 417}
    BatchToTensorSimple<Int16Type>/size:65536/num_columns:3   5.172 GiB/sec   5.237 GiB/sec     1.257   {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int16Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 59889}
  BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3   5.737 GiB/sec   5.765 GiB/sec     0.493  {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1029}
 BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300 723.279 MiB/sec 725.363 MiB/sec     0.288  {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 126}
  BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30 979.022 MiB/sec 981.447 MiB/sec     0.248   {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 171}
   BatchToTensorSimple<Int32Type>/size:65536/num_columns:30   6.662 GiB/sec   6.678 GiB/sec     0.242  {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 76524}
BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300   2.232 GiB/sec   2.219 GiB/sec    -0.569 {'family_index': 1, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int16Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 400}
   BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3   1.344 GiB/sec   1.302 GiB/sec    -3.108    {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int8Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 238}
 BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30   7.073 GiB/sec   6.852 GiB/sec    -3.124 {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:30', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1281}
    BatchToTensorSimple<Int32Type>/size:65536/num_columns:3   9.812 GiB/sec   9.479 GiB/sec    -3.388  {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int32Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 112938}
     BatchToTensorSimple<Int8Type>/size:65536/num_columns:3   1.372 GiB/sec   1.316 GiB/sec    -4.075    {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int8Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15748}

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (3)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                  benchmark       baseline      contender  change %                                                                                                                                                                                                  counters
    BatchToTensorSimple<Int64Type>/size:65536/num_columns:3 17.550 GiB/sec 16.345 GiB/sec    -6.865   {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'BatchToTensorSimple<Int64Type>/size:65536/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 202920}
BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300 13.096 GiB/sec 11.669 GiB/sec   -10.894 {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:300', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2345}
  BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3 14.224 GiB/sec 12.131 GiB/sec   -14.716   {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'BatchToTensorSimple<Int64Type>/size:4194304/num_columns:3', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2570}

cc @jorisvandenbossche in case you are interested =)

I see there is a failing test in one Python CI build and a C++ build failure on Windows. Will fix them asap.

@AlenkaF AlenkaF marked this pull request as ready for review April 17, 2026 06:39
@AlenkaF AlenkaF requested review from raulcd and rok as code owners April 17, 2026 06:39