
Spark is_valid_utf8 function implementation #21627

Open
kazantsev-maksim wants to merge 8 commits into apache:main from kazantsev-maksim:is_valid_utf8

Conversation

@kazantsev-maksim
Contributor

Which issue does this PR close?

N/A

Rationale for this change

Add a new Spark function: https://spark.apache.org/docs/latest/api/sql/index.html#is_valid_utf8

What changes are included in this PR?

  • Implementation
  • SLT tests

Are these changes tested?

Yes, tests added as part of this PR.

Are there any user-facing changes?

No, this is a new function.

@kazantsev-maksim kazantsev-maksim marked this pull request as draft April 14, 2026 16:38
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Apr 14, 2026
@kazantsev-maksim kazantsev-maksim marked this pull request as ready for review April 16, 2026 16:37
fn spark_is_valid_utf8_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
    let [array] = take_function_args("is_valid_utf8", args)?;
    match array.data_type() {
        DataType::Utf8 => Ok(Arc::new(
Contributor


Utf8 types should always be valid, unless there's an edge case I'm missing? We'd just need to return a BooleanArray with all true values, and take the null buffer from the input array
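The reviewer's point can be checked with a std-only sketch (no Arrow dependency; the `Vec<Option<&str>>` here stands in for a Utf8 array): Rust's `str` type guarantees valid UTF-8 by construction, so every non-null entry of a Utf8 array is trivially valid and only the null buffer needs to carry over.

```rust
// Std-only sketch: a Utf8 value can never fail UTF-8 validation,
// so the result is `true` for every non-null slot and null otherwise
// (mirroring "take the null buffer from the input array").
fn main() {
    let values: Vec<Option<&str>> = vec![Some("héllo"), None, Some("")];

    let result: Vec<Option<bool>> = values
        .iter()
        // `str::from_utf8` on bytes that came from a `&str` always succeeds.
        .map(|v| v.map(|s| std::str::from_utf8(s.as_bytes()).is_ok()))
        .collect();

    assert_eq!(result, vec![Some(true), None, Some(true)]);
}
```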

Contributor Author


Good point! Thanks, fixed.

# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

Contributor


Would be a good idea to have a null test too, as well as an array input test (currently these are all scalars)
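A null-input case along the lines below could cover the first suggestion. This is a hypothetical sqllogictest sketch; the exact function registration, `arrow_cast` usage, and expected results should be checked against the .slt files actually added in this PR.

```
# Hypothetical SLT sketch -- a NULL Binary input should yield NULL
query B
SELECT is_valid_utf8(arrow_cast(NULL, 'Binary'))
----
NULL

# Valid UTF-8 bytes should yield true
query B
SELECT is_valid_utf8(arrow_cast('hello', 'Binary'))
----
true
```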

        DataType::Binary => Ok(Arc::new(
            as_binary_array(array)?
                .iter()
                .map(|x| x.map(|y| String::from_utf8(y.into()).is_ok()))
Contributor


Suggested change
-                .map(|x| x.map(|y| String::from_utf8(y.into()).is_ok()))
+                .map(|x| x.map(|y| str::from_utf8(y).is_ok()))

Avoids the need for an allocation
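A std-only sketch of why the suggestion helps: both APIs run the same UTF-8 validation, but `String::from_utf8` consumes an owned `Vec<u8>` (so the `.into()` in the original code copies the byte slice first), while `str::from_utf8` borrows the bytes and allocates nothing.

```rust
// Std-only comparison of the two validation paths.
fn main() {
    let valid: &[u8] = "Søren".as_bytes();
    let invalid: &[u8] = &[0xF0, 0x28, 0x8C, 0x28]; // malformed sequence

    // Borrowing check -- no allocation, just validation.
    assert!(std::str::from_utf8(valid).is_ok());
    assert!(std::str::from_utf8(invalid).is_err());

    // Owning check -- requires copying the slice into a Vec first.
    assert!(String::from_utf8(valid.to_vec()).is_ok());
    assert!(String::from_utf8(invalid.to_vec()).is_err());
}
```

Both branches agree on every input; the borrowed form simply skips the per-row `Vec<u8>` copy.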


Labels

spark sqllogictest SQL Logic Tests (.slt)
