bnorm can fail with out_of_memory on cpu #4964

@Sqvid

Description
Summary

bnorm tests run via benchdnn do not gracefully skip cases where the required scratchpad memory is too large. This causes failures in the AArch64 Nightly pipeline in test_benchdnn_modeC_bnorm_regressions_cpu.

Version

8d4aa8b

Environment

Reproduced on x64 and AArch64.
hash: 8d4aa8b (this commit introduced the failing case, but the underlying issue already existed).

Steps to reproduce

```
$ ./build/tests/benchdnn/benchdnn --bnorm --dt=bf16 --inplace=true mb1ic512ih65536
Error: Function 'create_primitive' at (oneDNN/tests/benchdnn/dnnl_common.hpp:469) returned 'out_of_memory'
Error: Function 'init_prim' at (oneDNN/tests/benchdnn/dnnl_common.hpp:523) returned '1'
Error: Function 'createit' at (oneDNN/tests/benchdnn/bnorm/bnorm.cpp:710) returned '1'
Error: Function 'create' at (oneDNN/tests/benchdnn/utils/task.hpp:57) returned '1'
0:UNTESTED_FAILED (0 ms) __REPRO: --bnorm --dt=bf16 --inplace=true mb1ic512ih65536
===========================================================
= Failed cases summary (--summary=no-failures to disable) =
===========================================================
0:UNTESTED_FAILED (0 ms) __REPRO: --bnorm --dt=bf16 --inplace=true mb1ic512ih65536
============================
tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:1 listed:0
total: 0.00s; create_pd: 0.00s (3%); create_prim: 0.00s (0%); fill: 0.00s (0%); execute: 0.00s (0%); compute_ref: 0.00s (0%); compare: 0.00s (0%);
```

Observed behavior

The case fails with status UNTESTED_FAILED.

Expected behavior

Expected status: SKIPPED. The GPU backend appears to have handling for skipping cases whose scratchpad is too large, though I have not tested it:

```cpp
// The library scratchpad is allocated at create_primitive stage. The memory
// check is moved after the creation stage. It's necessary to check the
// library scratchpad size against gpu_max_alloc, otherwise, out_of_memory
// would be issued by the library.
if (res->mem_size_args.scratchpad_size > 0 && is_gpu()
        && query_scratchpad_mode(query_attr(pdw))
                == dnnl_scratchpad_mode_library) {
    static size_t gpu_device_capacity = 0;
    static size_t gpu_max_alloc_capacity = 0;
    SAFE(get_gpu_ram_sizes(gpu_device_capacity, gpu_max_alloc_capacity),
            WARN);
    const bool fit
            = res->mem_size_args.scratchpad_size < gpu_max_alloc_capacity;
    if (!fit) {
        BENCHDNN_PRINT(1,
                "[CHECK_MEM]: Size of the scratchpad %s "
                "doesn't fit the allocation limit of %s.\n",
                smart_bytes(res->mem_size_args.scratchpad_size).c_str(),
                smart_bytes(gpu_max_alloc_capacity).c_str());
        res->state = SKIPPED;
        res->reason = skip_reason::not_enough_ram;
        return OK;
    }
}
```

Metadata

Assignees

No one assigned

    Labels

    bug — A confirmed library bug
    component:tests — Codeowner: @oneapi-src/onednn-arch
    platform:cpu-aarch64 — Codeowner: @oneapi-src/onednn-cpu-aarch64
    platform:cpu-x64 — Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64
