
docs: clarify limitations of weighted-average embedding for long inputs#2570

Open
alobroke wants to merge 2 commits into openai:main from alobroke:fix/embedding-long-inputs-clarify-averaging

Conversation

@alobroke

Summary

Addresses #2549: clarifies the mathematical limitations of the
weighted-average approach to embedding long inputs.

Problem

OpenAI embedding models return unit-normalized vectors (L2 norm = 1),
so the original embedding magnitude is discarded before the user ever
sees it. Averaging chunk vectors and renormalizing therefore preserves
only the direction of the weighted sum. The notebook previously implied
that weighting chunks by token count produces a sound representation of
the full text, but this is a heuristic, not a reconstruction.
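A small sketch makes the limitation concrete. The chunk embeddings and token counts below are synthetic stand-ins (random unit vectors rather than real API output), but the weighted-average step mirrors what the notebook's `len_safe_get_embedding` does:

```python
import numpy as np

# Stand-ins for chunk embeddings: each is unit-normalized (L2 norm = 1),
# as the OpenAI embeddings endpoint returns them.
rng = np.random.default_rng(0)
chunk_embeddings = [v / np.linalg.norm(v) for v in rng.normal(size=(3, 8))]
token_counts = [512, 512, 128]  # illustrative tokens-per-chunk weights

# Weighted average of the chunk vectors, as in the notebook.
avg = np.average(chunk_embeddings, axis=0, weights=token_counts)

# The weighted average of distinct unit vectors is shorter than unit
# length, so the notebook renormalizes it back to the unit sphere.
avg_norm_before = np.linalg.norm(avg)
avg = avg / avg_norm_before

# After renormalization only the *direction* of the weighted sum
# survives; no magnitude information from the full text is recovered.
print(avg_norm_before < 1.0, np.isclose(np.linalg.norm(avg), 1.0))
```

The renormalized result is a reasonable similarity-search heuristic, but it is not the embedding the model would have produced for the full text.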

Changes

  • Added ⚠️ warning callout explaining the unit-normalization issue
  • Updated len_safe_get_embedding docstring with explicit caveats
  • Updated truncate_text_tokens docstring, recommending truncation as
    the preferred approach for classification tasks
  • Added use-case comparison table at the end of the notebook

References

Fixes #2549

alobroke added 2 commits April 1, 2026 00:24
…ts OpenAI embeddings are unit-normalized (L2 norm = 1), so weighting
chunks by token count does not recover original embedding magnitudes.
Added explicit warning callout, updated docstrings for both functions,
and added a use-case comparison table recommending truncation for
classification tasks.

Fixes openai#2549
