Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation
AAAI 2026 Oral
Bo Li, Zhenghua Xu, Rui Xie
Multilingual retrieval-augmented generation (RAG) allows large language models to answer knowledge-intensive questions by using retrieved documents as external evidence. However, when the language of the retrieved evidence differs from the language of the user query or in-context exemplars, the model may generate responses in an unintended language. This phenomenon is referred to as language drift.
This issue becomes especially visible in reasoning-heavy generation, such as chain-of-thought decoding, where intermediate steps can further amplify language instability. Our work systematically studies language drift across multiple multilingual QA datasets, languages, and model backbones, and shows that the problem is not simply caused by comprehension failure. Instead, it is strongly related to decoder-level behavior, where dominant token distributions, especially English, can override the intended target language.
To mitigate this, we propose Soft Constrained Decoding (SCD), a lightweight and training-free decoding strategy that softly penalizes non-target-language tokens during generation. SCD is model-agnostic and can be integrated into standard generation pipelines without modifying model architecture or requiring additional training data. Experiments on three multilingual datasets show consistent improvements in language alignment and downstream task performance.
Related RAG projects from our group: GRIP (ACL 2026 Main) · ETC (AAAI 2026 Oral) · SCD (AAAI 2026 Oral)
- Training-free decoding-time mitigation
- Model-agnostic and easy to integrate
- Focuses on language alignment in multilingual RAG
- Includes released multilingual versions of three QA datasets
- Suitable for analysis and follow-up research on multilingual reasoning and generation drift
The current repository contains the following main files:
.
├── README.md
├── SCD.py
├── data generation.py
├── dureader_MultiLang_1000.json
├── hotpotqa_MultiLang_1000.json
└── musique_MultiLang_1000.json
- `SCD.py`: main script containing the decoding-time mitigation logic for SCD.
- `data generation.py`: script for multilingual data construction / generation.
- `dureader_MultiLang_1000.json`: multilingual DuReader-based dataset.
- `hotpotqa_MultiLang_1000.json`: multilingual HotpotQA-based dataset.
- `musique_MultiLang_1000.json`: multilingual MuSiQue-based dataset.
SCD addresses multilingual generation drift at decoding time.
Instead of retraining the model or introducing an additional controller, SCD adjusts the decoding distribution by softly discouraging tokens that are inconsistent with the intended target language. In this way, the method keeps generation close to the desired language while preserving the flexibility of open-ended reasoning.
This design makes SCD:
- simple to implement
- lightweight in inference
- compatible with standard autoregressive generation
- applicable to multilingual RAG settings with cross-lingual retrieval interference
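As a self-contained sketch of the general idea, a soft penalty can be subtracted from the logits of non-target-language tokens before sampling. The toy vocabulary, the token-to-language partition, and the penalty value below are illustrative assumptions for exposition, not the paper's exact configuration (see `SCD.py` for the released implementation).

```python
import math

def soft_constrain(logits, target_lang_ids, penalty=5.0):
    """Softly discourage tokens outside the target language.

    logits: dict mapping token id -> raw logit
    target_lang_ids: set of token ids treated as target-language (an
        assumption of this sketch; a real system derives this from the
        tokenizer vocabulary)
    penalty: value subtracted from every non-target logit; a soft bias,
        not a hard mask, so non-target tokens remain reachable
    """
    return {
        tok: (logit if tok in target_lang_ids else logit - penalty)
        for tok, logit in logits.items()
    }

def softmax(logits):
    m = max(logits.values())
    exps = {tok: math.exp(l - m) for tok, l in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

# Toy next-token distribution: tokens 0 and 1 are target-language,
# token 2 is a dominant English token that would otherwise win.
logits = {0: 1.0, 1: 0.5, 2: 2.0}
adjusted = soft_constrain(logits, target_lang_ids={0, 1}, penalty=5.0)
probs = softmax(adjusted)
# After the penalty, the target-language token 0 ranks highest, while
# token 2 keeps a small nonzero probability (soft, not hard, constraint).
```

Because the penalty only shifts the distribution rather than masking tokens, code snippets, named entities, or quoted evidence in another language can still surface when their logits are strong enough.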
This repository releases multilingual versions of three QA datasets for multilingual RAG research:
- DuReader
- HotpotQA
- MuSiQue
The released files are:
dureader_MultiLang_1000.json
hotpotqa_MultiLang_1000.json
musique_MultiLang_1000.json
These files can be used for:
- multilingual RAG evaluation
- language drift analysis
- decoding-time intervention experiments
- follow-up work on multilingual reasoning and language control
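The released files are plain JSON and load with the standard library. The record schema in this sketch is a hypothetical placeholder (the snippet writes its own stand-in file so it runs anywhere); inspect the actual keys of the released files before relying on any field names.

```python
import json

# Stand-in file so the snippet is self-contained; in the repository,
# point the path at e.g. "musique_MultiLang_1000.json" instead.
# The record fields below are hypothetical, not the released schema.
records = [{"question": "q", "context": "c", "answer": "a"}]
path = "sample_MultiLang.json"
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)

# Load and inspect the structure before assuming a schema.
with open(path, encoding="utf-8") as f:
    data = json.load(f)

print(len(data), sorted(data[0].keys()))
```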
The repository is currently organized as a lightweight release of the core code and datasets.
- Prepare your multilingual RAG input data
- Load a target model in `SCD.py`
- Configure the target language and decoding settings
- Run generation with the SCD decoding processor
- Evaluate output language consistency and task performance on the released datasets
Notes:
- The main decoding implementation is in `SCD.py`.
- The current code is most suitable for research use and follow-up adaptation.
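The evaluation step can be approximated with a simple script-based consistency check. Using Unicode code-point ranges as a proxy for output language is an assumption of this sketch, not the paper's official metric; the CJK range below illustrates Chinese as the target language.

```python
def target_script_ratio(text, lo, hi):
    """Fraction of alphabetic characters whose code point lies in [lo, hi].

    A crude proxy for "the output is in the target language": digits and
    punctuation are ignored, so only letter characters are counted.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(1 for ch in letters if lo <= ord(ch) <= hi)
    return in_script / len(letters)

CJK = (0x4E00, 0x9FFF)  # CJK Unified Ideographs, a proxy for Chinese

# A drifted output mixes English into a Chinese answer; an aligned
# output stays in the target script.
drifted = "答案是 the capital of France is Paris"
aligned = "答案是巴黎，即法国的首都。"
print(target_script_ratio(drifted, *CJK))  # low ratio: drift toward English
print(target_script_ratio(aligned, *CJK))  # high ratio: language-aligned
```

A full evaluation would pair a check like this (or an off-the-shelf language identifier) with the downstream QA metric, since language alignment alone does not guarantee answer correctness.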
- Depending on your environment, you may need to adjust model paths, device settings, and tokenizer/model loading code before running experiments.
This repository is especially useful for studying:
- output language drift in multilingual RAG
- cross-lingual interference during reasoning
- decoding-time language control
- multilingual chain-of-thought stability
- lightweight mitigation strategies without retraining
This repository is part of our broader research line on controllable and adaptive Retrieval-Augmented Generation (RAG).
- GRIP [ACL 2026 Main Conference]: Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning. A training-based dynamic RAG framework that internalizes retrieval control into token-level decoding.
- ETC [AAAI 2026 Oral Paper]: Modeling Uncertainty Trends for Timely Retrieval in Dynamic RAG. A training-free dynamic RAG method that improves retrieval timing by modeling entropy trends during decoding.
- SCD [AAAI 2026 Oral Paper]: Language Drift in Multilingual Retrieval-Augmented Generation. A training-free multilingual RAG method that mitigates language drift through decoding-time control.
Together, these projects cover three complementary directions in RAG: training-based retrieval planning, training-free retrieval timing, and decoding-time control for multilingual generation.
If you find this repository useful, please cite:
@inproceedings{DBLP:conf/aaai/LiXX26,
author = {Bo Li and
Zhenghua Xu and
Rui Xie},
editor = {Sven Koenig and
Chad Jenkins and
Matthew E. Taylor},
title = {Language Drift in Multilingual Retrieval-Augmented Generation: Characterization
and Decoding-Time Mitigation},
booktitle = {Fortieth {AAAI} Conference on Artificial Intelligence, Thirty-Eighth
Conference on Innovative Applications of Artificial Intelligence,
Sixteenth Symposium on Educational Advances in Artificial Intelligence,
{AAAI} 2026, Singapore, January 20-27, 2026},
pages = {31519--31526},
publisher = {{AAAI} Press},
year = {2026},
}