
Sogang University Team Unveils Breakthrough in AI Models

A novel technique to combat information leakage in vision-language AI models promises more reliable multi-image reasoning without costly upgrades.


On February 20, 2026, Sogang University’s computer engineering department found itself in the global spotlight, thanks to a breakthrough by Professor Choi Joon-seok and his research team. Their work, which delves into the intricacies of Large Vision-Language Models (LVLMs), has not only been accepted for presentation at the prestigious International Conference on Learning Representations (ICLR) 2026 in Rio de Janeiro but also promises to reshape the way artificial intelligence handles complex, real-world data.

LVLMs—those powerful AI systems that combine the strengths of large language models (LLMs) with vision encoders—have been praised for their prowess in tasks like Visual Question Answering (VQA) and image captioning, especially when working with single images. However, as the team at Sogang University discovered, the real world rarely presents information one image at a time. From medical diagnostics that require comparing multiple scans to retail applications analyzing product differences, the need for robust multi-image reasoning is everywhere.

Yet, despite their strengths, LVLMs have a significant Achilles’ heel. According to Sogang University’s announcement, the models suffer from what’s called “cross-image information leakage” when handling multiple images at once. This means that information from one image can inadvertently seep into the context of another, muddying the waters and reducing the accuracy of the AI’s inferences. This problem, while subtle, can have serious consequences in high-stakes environments like healthcare or legal analysis, where precision is paramount.

Traditionally, developers have tried to tackle this issue by inserting delimiter tokens between images—essentially, digital signposts meant to tell the model where one image ends and another begins. In theory, these tokens should act as barriers, keeping information neatly separated. But the research team, which alongside Professor Choi includes PhD candidates Lee Min-young, Park Ye-ji, and Hwang Dong-jun and master’s candidate Kim Ye-jin, found that the reality is more complicated.

Through a systematic analysis, the team discovered that delimiter tokens, as implemented in current LVLMs, don’t fully block the flow of information between images. The culprit? The “hidden state” of these tokens. In the self-attention mechanism that powers modern transformers, these hidden states weren’t distinct enough to prevent unintended attention from leaking across images. Instead of acting as robust barriers, delimiter tokens sometimes allowed tokens from different images to interact, leading to the very leakage they were meant to prevent.

“Delimiter tokens were not sufficient in fully blocking cross-image attention due to hidden state overlap,” the team reported. This insight, according to the researchers, suggests that the performance drop in multi-image tasks isn’t just a matter of data or model size—it’s rooted in the very structure of how tokens interact within the model.

Determined to find a solution that wouldn’t require overhauling existing models or retraining them from scratch, the team proposed a deceptively simple yet effective fix: scaling the hidden state of delimiter tokens. By adjusting the strength of these tokens, they could reinforce interactions within the same image (intra-image interactions) and suppress unnecessary interactions between different images (inter-image interactions). The result? A clearer separation of image representation spaces and, crucially, a stable improvement in the model’s multi-image reasoning accuracy.
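The intuition can be sketched with a toy self-attention computation. The following minimal NumPy example is not the team’s implementation: the identity query/key projections, the nonnegative toy hidden states, and the amplification factor of 3 are all assumptions made purely for illustration. It shows how scaling up a delimiter token’s hidden state draws attention toward the delimiter and away from tokens of the other image:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 16
# nonnegative toy hidden states, so every query-key score is nonnegative
img1 = np.abs(rng.normal(size=(4, d)))   # tokens of image 1
delim = np.abs(rng.normal(size=(1, d)))  # delimiter token between the images
img2 = np.abs(rng.normal(size=(4, d)))   # tokens of image 2

def cross_image_attention(scale):
    """Mean attention mass that image-2 queries place on image-1 keys."""
    h = np.concatenate([img1, delim * scale, img2])  # rows 0-3 | 4 | 5-8
    # one attention head with identity Q/K projections, for illustration only
    weights = softmax(h @ h.T / np.sqrt(d))
    return weights[5:, :4].sum(axis=-1).mean()

baseline = cross_image_attention(1.0)  # unscaled delimiter
scaled = cross_image_attention(3.0)    # amplified delimiter
print(f"cross-image attention: {baseline:.3f} -> {scaled:.3f}")
```

Because every score against the amplified delimiter grows, softmax normalization shrinks the weight each image-2 token places on image-1 tokens, so `scaled` comes out strictly below `baseline` in this toy setup.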

What makes this approach particularly appealing is its practicality. There’s no need to modify the underlying architecture of the LVLM or embark on additional fine-tuning. As the research team emphasized, “This method requires no model architecture modification or additional fine-tuning and can be applied to existing large LVLMs, making it practical for industry use.” For companies and organizations that have already deployed AI systems based on LVLMs, this means they can reap the benefits of improved performance without costly or disruptive upgrades.
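In practice, a train-free intervention of this kind could be attached to a deployed model at inference time with a hook, without touching weights or architecture. The sketch below is a hypothetical PyTorch illustration, not the paper’s code: the stand-in layer, the delimiter position, and the scale factor are all invented for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# stand-in for one transformer block of a deployed LVLM
block = nn.Linear(8, 8)

DELIM_POSITIONS = [4]  # hypothetical: sequence index of each delimiter token
SCALE = 2.0            # hypothetical strength; the real value would be tuned

def scale_delimiters(module, inputs):
    # forward pre-hook: rescale delimiter hidden states on the way into the
    # block, leaving the model's weights and architecture untouched
    (hidden,) = inputs
    hidden = hidden.clone()
    hidden[:, DELIM_POSITIONS, :] *= SCALE
    return (hidden,)

handle = block.register_forward_pre_hook(scale_delimiters)

h = torch.randn(1, 9, 8)   # one batch of 9 token hidden states
out_scaled = block(h)      # delimiter position amplified by the hook
handle.remove()
out_plain = block(h)       # original behavior restored
```

Removing the hook restores the original model exactly, which is what makes an inference-time intervention like this attractive for systems already in production.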

The implications are far-reaching. According to the research, this technique could boost the reliability and accuracy of AI systems in a wide range of fields: multi-image VQA, multi-document Q&A, medical and legal diagnostics, industrial inspections, and even next-generation multimodal agent systems. As LVLMs become more integral to critical decision-making processes, ensuring that they can handle complex, mixed-input environments without stumbling over information leakage is more important than ever.

Beyond the immediate performance gains, the research also shines a light on the often-overlooked functional role of delimiter tokens in LVLMs. By providing both theoretical and experimental analyses of attention dynamics within multimodal transformers, the team’s work offers valuable insights for AI researchers and engineers worldwide. As Sogang University explained, “The research highlights the functional role of delimiter tokens in LVLMs and provides a theoretical and experimental analysis of attention dynamics in multimodal transformers.”

The academic community has taken notice. The team’s paper, titled “Enhancing Multi-Image Understanding through Delimiter Token Scaling,” was accepted for presentation at ICLR 2026, widely regarded as one of the top conferences in artificial intelligence and machine learning. The event, taking place from April 23 to 27 in Rio de Janeiro, will provide a global platform for the team to share their findings and engage with leading experts in the field.

In the spirit of open science, the researchers have also made their code publicly available via GitHub, inviting other scientists and practitioners to test, adapt, and build upon their work. This move not only accelerates the pace of innovation but also ensures that the benefits of their discovery can be widely shared.

Of course, the journey doesn’t end here. As LVLMs continue to evolve and find new applications, the need for robust, scalable solutions to multi-image reasoning challenges will only grow. The delimiter token scaling technique represents a significant step forward, but it also opens up new questions about how best to manage attention and representation in ever-more complex AI systems.

For now, though, Professor Choi Joon-seok and his team can take pride in having tackled a subtle yet critical limitation of modern AI. Their work not only advances the state of the art but also exemplifies the kind of practical, elegant problem-solving that will shape the future of artificial intelligence.
