Recent developments in artificial intelligence (AI) are transforming the way individuals seek health information. In a comprehensive analysis, researchers compared how well traditional search engines and large language models (LLMs) provide accurate answers to health-related questions. The study, built on data from the TREC Health Misinformation (HM) Track, evaluated four popular search engines—Google, Bing, Yahoo!, and DuckDuckGo—against seven LLMs, including advanced models such as ChatGPT and GPT-4.
With misinformation increasingly prevalent on the web, the reliability of these information systems matters more than ever. The evaluation covered 150 health-related questions, measuring how effectively and precisely each platform answered them.
Overall, the findings revealed that traditional search engines correctly answered roughly 50 to 70 percent of the health queries presented to them, often failing to return relevant responses for the remainder. The advanced LLMs fared better, correctly addressing about 80 percent of the questions posed.
Among the four search engines (SEs) analyzed, Bing emerged as the most reliable option. As the researchers noted, "Bing seems to be the most solid choice among the four SEs," indicating higher accuracy than Google and the others.
While traditional SEs have dominated information retrieval for years, the rise of LLMs presents strong competition. Drawing on extensive training data and sophisticated reasoning capabilities, LLMs have established themselves as formidable alternatives, especially for users seeking nuanced information.
The study also explored retrieval-augmented generation (RAG), demonstrating how grounding answers in retrieved evidence can significantly boost the performance of smaller LLMs. RAG methods improved accuracy by 30 percent, making them particularly valuable for users without access to the latest, most sophisticated AI tools. A minimal sketch of the approach appears below.
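To make the idea concrete, here is a minimal RAG sketch in Python. The corpus, the keyword-overlap retriever, and the llm_answer stub are all illustrative assumptions rather than the study's actual pipeline; a real system would use a proper retriever and a real LLM API.

```python
# Minimal RAG sketch: retrieve supporting passages by simple keyword overlap,
# then prepend them to the prompt before querying a model.
# The corpus and scoring are illustrative stand-ins, not the study's pipeline.

CORPUS = [
    "Vitamin C supplements have not been shown to cure the common cold.",
    "Regular handwashing reduces the spread of respiratory infections.",
    "Antibiotics are ineffective against viral illnesses such as influenza.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank passages by word overlap with the question (a stand-in for a real retriever)."""
    q_words = set(question.lower().split())
    return sorted(
        CORPUS,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )[:k]

def llm_answer(prompt: str) -> str:
    """Stub so the sketch runs end to end; swap in a real LLM client here."""
    return "No. The retrieved evidence does not support that claim."

def answer_with_rag(question: str) -> str:
    """Build an evidence-grounded prompt and pass it to the model."""
    evidence = "\n".join(f"- {p}" for p in retrieve(question))
    prompt = (
        "Using only the evidence below, answer the question with yes or no "
        "and a one-sentence justification.\n"
        f"Evidence:\n{evidence}\n"
        f"Question: {question}"
    )
    return llm_answer(prompt)

print(answer_with_rag("Does vitamin C cure the common cold?"))
```

The key design point is that the model is asked to answer from the supplied evidence rather than from its parametric memory alone, which is what allows smaller models to close the gap the study reports.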
Despite this strong overall performance, the research highlighted key shortcomings of LLMs, notably their sensitivity to input prompts. Users who ask questions without sufficient detail or clarity may receive largely inaccurate responses. This concern is especially relevant for individuals seeking medical advice, where incorrect information can have dangerous consequences. One informal way to observe this sensitivity is sketched below.
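A simple way to probe prompt sensitivity is to pose the same health question in several phrasings and check whether the verdict stays consistent. The paraphrases and the query_model stub below are illustrative assumptions, not part of the study's methodology.

```python
# Probe prompt sensitivity: ask the same health question in several phrasings
# and check whether the model's yes/no verdict stays consistent.

def query_model(prompt: str) -> str:
    # Stub that deliberately varies with phrasing to mimic the fragility
    # the study describes; replace with a real LLM API call.
    return "no" if "cure" in prompt.lower() else "yes"

paraphrases = [
    "Does vitamin C cure the common cold?",
    "Can taking vitamin C get rid of a cold?",
    "vitamin c for colds - does it work?",
]

verdicts = {question: query_model(question) for question in paraphrases}
for question, verdict in verdicts.items():
    print(f"{verdict}: {question}")

# If the verdicts disagree, the model's answer depends on phrasing,
# which is exactly the kind of fragility the study warns about.
print("consistent:", len(set(verdicts.values())) == 1)
```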
The LLM tests also yielded compelling evidence that certain models excel when given appropriate input formats. In particular, augmenting prompts with retrieval evidence significantly improved results: even smaller LLMs matched the performance of more advanced models when supplied with relevant retrieved content.
User engagement also varied notably across platforms. With traditional SEs, users leaned toward quick judgments based on the first results displayed. This "lazy" behavior lacked depth of analysis, yet it surprisingly did not yield worse outcomes than more diligent approaches. Users often reported trusting the initial health advice drawn from the top search results, even though such reliance is risky.
The research concluded with recommendations for future directions, emphasizing the importance of continuously refining both traditional search methodologies and LLM development. This approach encourages developers to combine the strengths of retrieval systems and large language models, promoting higher standards of reliability and accuracy for health information.
Further, it highlighted the need to provide users with more comprehensive and trustworthy sources, pointing to potential improvements that would let traditional SEs proactively filter harmful or misleading health information.
The convergence of these technologies opens new avenues for health information seeking, prompting researchers and developers alike to work collaboratively toward more effective AI-driven solutions. By enhancing their models with verified retrieval augmentations, LLMs can stand as dependable partners for health inquiries, making substantive contributions to personal and public health intelligence.