Senior AI/ML Systems Engineer · on-device GenAI, ML compilers, and NPU deployment
Qualcomm · IIT Madras (MTech, Industrial AI) · NIT Rourkela (BTech, CSE)
ML systems engineer working at the intersection of deep learning compilers, runtime systems, and hardware-aware optimization for edge ML accelerators. My day-to-day work sits closer to the runtime than to the notebook, at the point where model graphs, execution providers, and silicon meet.
My background blends production deployment engineering on Windows-on-Snapdragon with graduate-level study in Industrial AI at IIT Madras.
- On-device and edge AI deployment for transformer and diffusion workloads
- Execution providers and runtime behavior across ONNX Runtime, QNN EP, DirectML, and WinML (a minimal session-setup sketch follows this list)
- Graph-level optimization, fusion, layout, and dtype legality on ONNX graphs
- Kernel-level optimization along Conv / GEMM / attention / activation paths
- Quantization and calibration (INT8 / INT4) for transformer-class models
- Olive + WinML enablement pathways for on-device model delivery
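To ground the execution-provider item above, here is a minimal sketch of standing up an ONNX Runtime session against the QNN EP with CPU fallback. The model path, input, and backend option values are illustrative assumptions, not a fixed recipe.

```python
import onnxruntime as ort

# Ask for the QNN EP first (Hexagon/HTP on Windows-on-Snapdragon); ONNX Runtime
# falls back to CPU for any nodes the NPU backend cannot take.
# "model.onnx" and the backend_path value are placeholders for illustration.
providers = [
    ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)

# Graph partitioning can split a model across EPs, so it is worth checking
# which providers the session actually registered.
print(session.get_providers())
```

In practice the interesting work starts when partitioning splits the graph: which ops fell back to CPU, and why, usually points straight at the fusion, layout, and dtype-legality questions listed above.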
- Contributed to enabling Stable Diffusion v1.5 on Windows-on-Snapdragon, the publicly announced Qualcomm + Microsoft collaboration bringing on-device generative AI to the NPU.
- Work on on-device GenAI and NPU-facing model enablement across Windows platforms.
- Execution-provider and runtime integration across the ONNX Runtime / QNN EP / DirectML / WinML ecosystem.
- Graph-level optimization, kernel tuning, and quantization practice applied to transformer and diffusion workloads (see the quantization sketch after this list).
- Young Technocrat Award (external recognition).
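A minimal sketch of the INT8 calibration flow referenced in the list above, using onnxruntime's static quantization API. The data reader below feeds random tensors purely for illustration; the file names, input name, and shape are assumptions.

```python
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class RandomCalibrationReader(CalibrationDataReader):
    """Stand-in calibration source; a real flow would feed representative data."""
    def __init__(self, input_name, shape, n_batches=8):
        self._batches = iter(
            {input_name: np.random.rand(*shape).astype(np.float32)}
            for _ in range(n_batches)
        )

    def get_next(self):
        # Returning None tells the calibrator it has consumed all batches.
        return next(self._batches, None)

# "fp32.onnx", "int8.onnx", and the input name/shape are placeholders.
quantize_static(
    "fp32.onnx",
    "int8.onnx",
    RandomCalibrationReader("input", (1, 3, 512, 512)),
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)
```

Calibration quality dominates INT8 accuracy for transformer-class models, which is why the choice of representative data matters far more than the API call itself.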
Compilers & runtimes — ONNX Runtime, QNN EP, DirectML, WinML, Olive, ONNX, IR-level graph transformations
Hardware targets — Qualcomm NPUs (Hexagon / HTP), Snapdragon X, ARM64, x64
Models & frameworks — PyTorch, Hugging Face Transformers, diffusion models, quantization toolchains (INT8 / INT4)
Languages — C++, Python, C
Perf & debugging — Profiling, tracing, kernel-level analysis, hardware-in-the-loop benchmarking (see the profiling sketch below)
Platforms — Windows-on-Snapdragon, Linux
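For the perf and debugging line above, a small sketch of ONNX Runtime's built-in operator-level profiler, usually the first stop before lower-level tracing or kernel analysis. The model path and input name are placeholders.

```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.enable_profiling = True  # emit a per-node JSON trace for this session

session = ort.InferenceSession(
    "model.onnx", sess_options=opts, providers=["CPUExecutionProvider"]
)
session.run(None, {"input": np.zeros((1, 3, 224, 224), dtype=np.float32)})

# end_profiling() returns the path to a Chrome-trace-format JSON file with
# per-operator timings; load it in a trace viewer to find hot nodes.
print(session.end_profiling())
```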
- Deployment-time optimization for transformer and diffusion workloads on edge accelerators
- Compile-time vs. runtime tradeoffs across ONNX Runtime execution providers
- Mixed-precision (INT4 / INT8 / FP16) scheduling for transformer blocks
- Reproducible hardware-in-the-loop benchmarking (a bare-bones harness sketch follows this list)
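As referenced in the last item, a sketch of what I mean by a reproducible latency harness: fixed inputs, an explicit warmup so caches and DVFS settle, then percentile stats rather than a single mean. All names here are placeholders.

```python
import time
import numpy as np
import onnxruntime as ort

def bench(session, feeds, warmup=20, runs=200):
    """Warm up the target, then collect wall-clock latencies in milliseconds."""
    for _ in range(warmup):
        session.run(None, feeds)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        session.run(None, feeds)
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return {"p50_ms": samples[len(samples) // 2],
            "p90_ms": samples[int(len(samples) * 0.9)]}

# Placeholder model and input; on device the providers list would name the NPU EP.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
print(bench(session, {"input": np.zeros((1, 3, 224, 224), dtype=np.float32)}))
```

Percentiles over a mean, because on shared mobile silicon the tail is where thermal and scheduling effects show up.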
Most of my production work is proprietary and lives in internal repositories. Public artifacts, writeups, and benchmark harnesses will land here as they become shareable.