DeepSeek researchers have released a new experimental model, V3.2-exp, designed to significantly reduce inference costs in long-context AI operations.
The model’s key innovation is a system called DeepSeek Sparse Attention, which introduces two core components. First, a “lightning indexer” identifies and prioritizes relevant excerpts within the context window. Then, a “fine-grained token selection system” narrows down specific tokens from those excerpts to feed into the limited attention span of the model. Together, these processes allow the model to handle extended context while consuming far fewer server resources.
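The two-step idea described above can be illustrated with a toy sketch. This is not DeepSeek's implementation: the function names, the array sizes, and the stand-in "indexer" (a dot product over the first 8 dimensions) are assumptions made for illustration, and the sketch collapses excerpt ranking and token selection into a single per-token score followed by a top-k cut.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def sparse_attention(query, keys, values, index_scores, k):
    """Attend only to the k tokens with the highest indexer scores."""
    k = min(k, len(index_scores))
    # Token selection: keep the positions of the k best-scoring tokens.
    selected = np.argpartition(index_scores, -k)[-k:]
    # Ordinary scaled dot-product attention, restricted to those tokens.
    scores = keys[selected] @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)
    return weights @ values[selected], np.sort(selected)

rng = np.random.default_rng(0)
n_tokens, dim, k = 1024, 64, 32  # illustrative sizes, not from the source
keys = rng.standard_normal((n_tokens, dim))
values = rng.standard_normal((n_tokens, dim))
query = rng.standard_normal(dim)

# Hypothetical cheap indexer: score every token using only 8 of 64 dimensions.
index_scores = keys[:, :8] @ query[:8]

output, selected = sparse_attention(query, keys, values, index_scores, k)
print(output.shape, len(selected))
```

In this sketch the indexer still scans every token, but with a much cheaper computation, while the full attention step reads only k key and value rows instead of all n_tokens. That is the general shape of the saving the article describes; how large it is in practice depends on details the sketch does not model.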
According to DeepSeek, early tests show that this approach can cut the cost of a simple API call by up to 50% in long-context scenarios. While more independent evaluations will be needed, the open-weight release on Hugging Face means third-party testing is already within reach.
The development is part of a broader push across the AI industry to tackle high inference costs, meaning the expense of running pre-trained models as opposed to training them. DeepSeek’s research focuses on optimizing the transformer architecture itself, showing that significant efficiency gains are still possible.
Based in China, DeepSeek has been a disruptive player in global AI research. Its R1 model, unveiled earlier this year, attracted attention for using reinforcement learning to achieve competitive results at a fraction of the cost of U.S. rivals. However, R1 did not trigger the anticipated revolution in AI training, and the company has been relatively quiet since.
While the new Sparse Attention system may not spark the same level of controversy as R1, it offers valuable techniques that could influence how U.S. and global providers approach long-context AI efficiency in the future.





