On February 20, 2026, Sogang University’s Department of Computer Science made waves in the artificial intelligence (AI) community by announcing a breakthrough in the performance of Large Vision-Language Models (LVLMs)—a class of AI systems that integrate visual and textual information. The research, led by Professor Junsuk Choi and a team of doctoral and master’s students, zeroed in on a stubborn problem plaguing these models: 'cross-image information leakage,' which occurs when multiple images are processed together.
LVLMs have gained prominence for their impressive ability to handle tasks like visual question answering (VQA) and image captioning, especially when working with single images. But, as anyone who’s tried to compare a series of medical scans or sift through multiple product images can attest, real-world scenarios often require reasoning across several images at once. That’s where things get tricky. According to AITimes, the team discovered that when LVLMs are fed several images, the boundaries between them start to blur. Information from one image can seep into the context of another—a phenomenon known as cross-image information leakage—which undermines the model’s inference accuracy.
To mitigate this, LVLMs typically insert what are called 'delimiter tokens' between images. In theory, these tokens act as signposts, telling the model where one image ends and the next begins. However, Professor Choi’s group found that, in practice, these delimiters weren’t doing their job. As Choi explained, “Delimiter tokens were supposed to block information flow between images, but our analysis showed that they don’t fully prevent cross-image leakage.” The culprit? The self-attention mechanism at the heart of transformer-based models, which allows tokens—even those from different images—to interact more than intended.
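To see why a delimiter alone is not enough, consider a minimal sketch of multi-image self-attention. The token names, scores, and sequence layout below are illustrative assumptions, not the actual tokenization used in the paper: the point is only that plain softmax attention assigns nonzero weight to every position, so a query from one image always draws some probability mass from the other image's tokens, delimiter or not.

```python
import math

# Hypothetical multi-image input: runs of image tokens separated by a
# delimiter token (names and values are illustrative only).
sequence = (
    ["<img1_tok1>", "<img1_tok2>"]    # tokens from image 1
    + ["<image_sep>"]                 # delimiter token
    + ["<img2_tok1>", "<img2_tok2>"]  # tokens from image 2
)

def attention_weights(scores):
    """Softmax over raw attention scores. Without an explicit mask,
    a query token attends to EVERY position, including tokens that
    belong to a different image."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy scores for a query token from image 2 against all five positions.
# Nothing in unmasked self-attention zeroes out positions 0-1 (image 1),
# so some probability mass always "leaks" across the delimiter.
weights = attention_weights([0.9, 0.8, 0.1, 1.2, 1.0])
leak_to_image1 = sum(weights[:2])
print(f"attention mass leaking to image 1: {leak_to_image1:.2f}")
# → attention mass leaking to image 1: 0.40
```

In other words, the delimiter is just one more token competing in the softmax; unless its representation dominates or the attention pattern is otherwise constrained, cross-image weights never reach zero.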
The team’s investigation revealed that delimiter tokens failed as effective barriers because their hidden states weren’t sufficiently distinct from those of the surrounding image tokens. This meant that, during the model’s attention calculations, tokens from one image could still influence those from another, muddying the waters and leading to suboptimal answers in tasks involving multiple images. As Sogang University noted in their announcement, “Existing delimiter tokens used to separate images do not sufficiently block information leakage between images.”
Rather than overhauling the entire model architecture or embarking on costly retraining, the researchers took a more elegant approach. They proposed a method called 'delimiter token hidden state scaling.' In essence, this technique tweaks the representation strength of delimiter tokens, making them stand out more distinctly from the image tokens on either side. By scaling the hidden state of these tokens, the model is nudged to strengthen interactions within each image (intra-image interaction) while dampening unnecessary attention between images (inter-image interaction).
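The intuition can be sketched in a few lines. The vectors, dimensions, and scale factor below are toy assumptions (the paper's actual hidden states are high-dimensional and model-specific); the sketch only shows the mechanism: amplifying the delimiter's hidden state lets it absorb more of the softmax mass, which mechanically shrinks the attention weight an image-2 query places on image-1 tokens.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 3-dim hidden states: two image-1 tokens, one delimiter,
# two image-2 tokens (values are illustrative only).
img1 = [[0.5, 0.2, 0.1], [0.4, 0.3, 0.0]]
delim = [0.3, 0.3, 0.3]
img2 = [[0.1, 0.5, 0.4], [0.2, 0.4, 0.5]]

def cross_image_attention(scale):
    """Attention from an image-2 query over all keys, with the
    delimiter's hidden state multiplied by `scale` — a loose sketch
    of delimiter token hidden state scaling."""
    keys = img1 + [[scale * v for v in delim]] + img2
    query = img2[0]
    weights = softmax([dot(query, k) for k in keys])
    return sum(weights[:2])  # attention mass landing on image-1 tokens

baseline = cross_image_attention(1.0)
scaled = cross_image_attention(4.0)
print(f"leak at scale 1.0: {baseline:.3f}, at scale 4.0: {scaled:.3f}")
# → leak at scale 1.0: 0.355, at scale 4.0: 0.275
```

Because the fix acts only on the delimiter's hidden state at inference time, it leaves the weights, architecture, and training recipe untouched — which is what makes it drop-in for deployed models.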
The results, as detailed in their soon-to-be-presented paper at the International Conference on Learning Representations (ICLR) 2026 in Rio de Janeiro, were impressive. “We confirmed that by scaling the hidden state of delimiter tokens, we could stabilize and improve multi-image inference accuracy without changing the model structure or requiring additional training,” said a member of the research team. The significance of this is hard to overstate: the method can be applied immediately to existing LVLMs, even those already deployed in industrial settings, without the need for resource-intensive modifications or retraining.
But what does this mean for the broader AI landscape? For starters, it addresses a critical bottleneck in the practical deployment of LVLMs in fields such as medicine, law, and industry—where multi-image or multi-document reasoning is often the norm. Think of the challenge faced by a radiologist comparing a series of X-rays, or a legal analyst reviewing multiple scanned documents for a single case. If the AI can’t reliably keep one image’s context separate from another’s, its answers may be muddled or even misleading. According to AITimes, “This method is practical for immediate application to existing large LVLMs and is expected to enhance reliability and accuracy in multi-image and multi-document AI systems in fields such as medical, legal, and industrial diagnostics.”
From a technical standpoint, the research also shines a spotlight on the inner workings of multimodal transformer models—systems that have become the backbone of state-of-the-art AI but whose attention dynamics remain only partly understood. By dissecting the role and limitations of delimiter tokens, Professor Choi’s team has not just improved performance, but also provided a clearer theoretical and experimental understanding of how these models handle complex, multi-image inputs. The team described their work as “re-examining the functional role of delimiter tokens in LVLMs, both theoretically and experimentally.”
Importantly, this isn’t just a marginal gain or a tweak for the sake of novelty. The approach offers a structural solution to a structural problem, all at minimal cost and with high practical value. As the team put it, “This technique does not require model architecture modification or additional fine-tuning, making it highly practical for industrial AI systems.” The proposed solution—simple, effective, and cost-efficient—has the potential to be adopted widely, especially as LVLMs continue to expand rapidly into new application domains.
The academic community has taken notice, too. The acceptance of the team’s paper at ICLR 2026, one of the most prestigious conferences in AI and machine learning, underscores the significance of their findings. The event, set for April 23-27 in Rio de Janeiro, is expected to draw leading experts from around the globe. As Sogang University highlighted, “This work highlights the overlooked role of delimiter tokens in LVLMs and is expected to be a key technology for improving reliability and accuracy in multi-image and multi-document AI systems.”
Looking ahead, the implications stretch far beyond the lab. The method could be applied to next-generation multimodal agent systems, advanced VQA tasks, and even AI-driven diagnostics in high-stakes industries. As more companies and institutions deploy LVLMs in environments where multiple images or documents must be processed together, the need for robust, reliable separation of information will only grow.
For those eager to dive deeper, the team’s paper, 'Enhancing Multi-Image Understanding through Delimiter Token Scaling,' is available for review, and the associated code has been made public on GitHub. This openness signals the researchers’ hope that their work will serve as a foundation for further advancements in the field.
In a rapidly evolving AI landscape, sometimes the most effective solutions are also the most straightforward. By taking a closer look at the humble delimiter token, Professor Choi’s team has offered a practical fix to a persistent problem—one that could shape the future of multimodal AI for years to come.