GH-40062: [C++][Python] Conversion of Table to Arrow Tensor#41870
GH-40062: [C++][Python] Conversion of Table to Arrow Tensor#41870AlenkaF wants to merge 22 commits intoapache:mainfrom
Conversation
|
Benchmarks for |
…to tensor.cc (#41932) ### Rationale for this change This is a precursor PR to #41870 with the purpose to make the review of #41870 easier (the diff of the code will be visible as it currently isn't because the code was moved to table.cc. I should also live in tensor.cc). ### What changes are included in this PR? The code from `RecordBatch::ToTensor` in record_batch.cc is moved to `RecordBatchToTensor` in tensor.cc. ### Are these changes tested? Existing tests should pass. ### Are there any user-facing changes? No. **This PR does not close the linked issue yet, it is just a precursor!** * GitHub Issue: #40062 Authored-by: AlenkaF <frim.alenka@gmail.com> Signed-off-by: AlenkaF <frim.alenka@gmail.com>
13c49a7 to
15574f8
Compare
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Generally looks good! Some minor comments, and wondering if we can reduce the duplication in testing a bit
15574f8 to
4decf7f
Compare
|
I have researched the benchmark regression a bit and found that:
benchmark diff outputPlan to also try profiling in python ( |
Do you see those regressions of up to 40% for both row major and column major conversions? And both for uniform vs mixed type with casting? |
Benchmarks for RecordBatch only test row-major conversion. The newly added Table benchmarks test both. I think that was due to the fact we were adding features for RecordBatch::ToTensor step by step and we needed one simple benchmark that we could check while adding the features. Row-major conversion was the last to be added. As for the types, we only test uniform types in C++ benchmarks at the moment. ps: haven't been able to find extract any information with neither py-spy nor cProfile. |
|
Thank you for your contribution. Unfortunately, this |
…rk for Arrow Tables
…st for both batch and table
…ary Table creation
32eedbc to
1f12b90
Compare
|
I finally took time to improve the benchmarks on this change. It has been clear from #41870 (comment) that creating a
Benchmark result 1
Benchmark result 2cc @jorisvandenbossche in case you are interested =) I see there is a Python CI build with a failing test and a Windows C++ failure when building. Will fix it asap. |
Rationale for this change
There is currently no method to convert Arrow Table to Arrow Tensor (conversion from columnar format to a contiguous block of memory). This work is a continuation of
RecordBatch::ToTensorwork, see #40058.What changes are included in this PR?
This PR:
Table::ToTensorconversionRecordBatch::ToTensorand uses the Table implementation (RecordBatch::ToTensorbenchmarks checked)Are these changes tested?
Yes, in C++ and Python.
Are there any user-facing changes?
No, it is a new feature.