HTML is better than TXT

Nov 07, 2024

RAG using HTML responds better than RAG using TXT

Paper: HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems (14 Pages)

Researchers from Beijing propose HtmlRAG, a novel approach to Retrieval-Augmented Generation (RAG) that leverages the rich structure and semantic information embedded in HTML documents.

Hmm... What’s the background?

Traditional RAG systems typically rely on plain text extracted from web pages, which leads to a loss of crucial information. HtmlRAG aims to address this limitation by directly using HTML as the format for retrieved knowledge, capitalizing on the inherent ability of LLMs to understand HTML.
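To make the information loss concrete, here is a minimal sketch using only Python's standard library. The snippet, the `TextExtractor` class, and the `html_to_text` helper are all invented for this illustration and are not taken from the paper. It flattens a small HTML table to plain text, the way a text-only pipeline might.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only text nodes, roughly what a plain-text pipeline keeps."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Hypothetical helper: drop every tag and join the remaining text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical retrieved snippet, made up for this example.
SNIPPET = (
    "<h1>Model sizes</h1>"
    "<table>"
    "<tr><th>Model</th><th>Parameters</th></tr>"
    "<tr><td>Small</td><td>1B</td></tr>"
    "<tr><td>Large</td><td>70B</td></tr>"
    "</table>"
)

plain = html_to_text(SNIPPET)
print(plain)    # headings, header cells and data cells all run together
print(SNIPPET)  # the HTML still says which cell is a header and which row it sits in
```

In the flattened string, nothing distinguishes the page heading from a column header or a table cell. The HTML version keeps those roles explicit, and that is the kind of signal HtmlRAG tries to preserve.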

Current RAG systems predominantly use plain text, discarding valuable structural and semantic information present in HTML.


So what is proposed in the research paper?

The research paper makes two main points:

HtmlRAG retains the structural and semantic richness of HTML, enabling LLMs to better understand context and relationships within the retrieved knowledge.
LLMs are pre-trained on vast datasets that include HTML, giving them an inherent ability to process and interpret HTML documents.

Experiments conducted on six diverse QA datasets (ASQA, HotpotQA, NQ, TriviaQA, MuSiQue, ELI5) demonstrate HtmlRAG's superior performance compared to plain-text-based baselines. It achieves higher scores on metrics like Exact Match, Hit@1, ROUGE-L, and BLEU, indicating more accurate and relevant answers.

What’s next?

The researchers propose several avenues for future research:

Developing better HTML processing techniques: Refining the HTML cleaning and pruning algorithms to further improve efficiency and reduce computational costs.
Exploring different LLM architectures: Investigating the effectiveness of various LLMs for HTML-based RAG, especially with the emergence of models capable of handling longer input sequences.
Expanding to other structured data formats: Adapting HtmlRAG's principles to utilize other structured formats like LaTeX, PDF, and Word documents, broadening the scope of external knowledge integration.
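For a rough feel of what "HTML cleaning" can mean, here is a small sketch built on Python's standard library. It is not the paper's algorithm, which this post does not describe in detail. It assumes that cleaning means dropping scripts, styles, comments, and tag attributes while keeping the tag structure and the text. The `HtmlCleaner` class, the `clean_html` helper, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these elements carry no knowledge useful for answering questions.
SKIP_TAGS = {"script", "style"}


class HtmlCleaner(HTMLParser):
    """Rebuilds the document without scripts, styles, comments or attributes."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(f"<{tag}>")  # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept, again without attributes.
        if self.skip_depth == 0 and tag not in SKIP_TAGS:
            self.out.append(f"<{tag}/>")

    def handle_data(self, data):
        # Keep visible text; ignore whitespace-only runs and skipped content.
        if self.skip_depth == 0 and data.strip():
            self.out.append(data)

    # Comments are dropped because the default handle_comment does nothing.


def clean_html(html: str) -> str:
    """Hypothetical helper: return a smaller HTML string with the same structure."""
    cleaner = HtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


# Hypothetical page fragment, made up for this example.
RAW = (
    '<div class="post" id="a1" style="color:red">'
    '<script>track("x");</script>'
    "<style>.post{margin:0}</style>"
    "<!-- ad slot -->"
    '<h2 class="title">HtmlRAG</h2>'
    '<p data-id="7">HTML keeps <b>structure</b>.</p>'
    "</div>"
)

cleaned = clean_html(RAW)
print(cleaned)
print(len(RAW), "->", len(cleaned), "characters")
```

Even this crude pass shrinks the input while keeping headings, paragraphs, and emphasis intact. Pruning, which would go further and discard blocks that are irrelevant to the question, is not attempted here.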

So essentially,

RAG using HTML responds better than RAG using TXT
