DocLLM: The Machines Are Reading Your Receipts!
What does it mean for the future of Document Understanding?
So essentially,
DocLLM supports visually rich documents like forms, invoices, receipts, and contracts!
Paper: DocLLM: A layout-aware generative language model for multimodal document understanding (16 pages)
Researchers from JPMorgan AI Research want to process documents properly. Current models either rely on expensive image encoders or accept only text input, which makes them unsuitable for complex documents with rich visual elements.
Hmm.. What’s the background?
Most documents have complex layouts and irregular formatting which poses challenges for current language models.
Traditional LLMs struggle to capture the relationships between text semantics and spatial structure in visually rich documents.
Ok, so what is proposed in the research paper?
DocLLM leverages bounding box information to incorporate spatial layout structure without the need for image encoders.
A disentangled attention mechanism captures the cross-alignment between the text and spatial modalities (a rough sketch of this idea follows below).
A pre-training objective focused on infilling text blocks handles irregular layouts and heterogeneous content. Additional instruction tuning on a comprehensive dataset then adapts the model to document intelligence tasks.
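To make the spatial idea concrete, here is a minimal, single-head PyTorch sketch: OCR bounding boxes are projected into the model's hidden space, and the attention score mixes text-to-text, text-to-spatial, spatial-to-text, and spatial-to-spatial terms with scalar weights. The class name, dimensions, and learnable lambdas are illustrative assumptions on my part; the paper uses this kind of mechanism inside a multi-head, decoder-only transformer, not exactly this code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledSpatialAttention(nn.Module):
    """Single-head sketch of DocLLM-style disentangled attention.

    Text tokens and their bounding boxes get separate query/key
    projections; the four score terms are mixed with scalar weights.
    Hyper-parameters and names here are illustrative, not the paper's.
    """

    def __init__(self, d_model: int, d_box: int = 4):
        super().__init__()
        # Text projections
        self.q_t = nn.Linear(d_model, d_model)
        self.k_t = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Spatial projections: bounding box (x1, y1, x2, y2) -> hidden space
        self.box_embed = nn.Linear(d_box, d_model)
        self.q_s = nn.Linear(d_model, d_model)
        self.k_s = nn.Linear(d_model, d_model)
        # Scalar mixing weights for the cross-modal score terms
        self.lambda_ts = nn.Parameter(torch.tensor(1.0))
        self.lambda_st = nn.Parameter(torch.tensor(1.0))
        self.lambda_ss = nn.Parameter(torch.tensor(1.0))

    def forward(self, text_h: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # text_h: (batch, seq, d_model); boxes: (batch, seq, 4) normalized coords
        s = self.box_embed(boxes)
        qt, kt = self.q_t(text_h), self.k_t(text_h)
        qs, ks = self.q_s(s), self.k_s(s)

        scale = text_h.size(-1) ** 0.5
        # Four disentangled score components:
        # text/text + text/spatial + spatial/text + spatial/spatial
        scores = (
            qt @ kt.transpose(-2, -1)
            + self.lambda_ts * (qt @ ks.transpose(-2, -1))
            + self.lambda_st * (qs @ kt.transpose(-2, -1))
            + self.lambda_ss * (qs @ ks.transpose(-2, -1))
        ) / scale

        # Causal mask for autoregressive decoding
        seq = text_h.size(1)
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))

        attn = F.softmax(scores, dim=-1)
        return attn @ self.v(text_h)


# Toy usage: 3 tokens with normalized bounding boxes from an OCR pass
layer = DisentangledSpatialAttention(d_model=32)
tokens = torch.randn(1, 3, 32)
boxes = torch.tensor([[[0.1, 0.1, 0.3, 0.15],
                       [0.4, 0.1, 0.6, 0.15],
                       [0.1, 0.5, 0.9, 0.6]]])
out = layer(tokens, boxes)
print(out.shape)  # torch.Size([1, 3, 32])
```

The point of the separate projections is that layout can attend on its own terms: two tokens far apart in reading order but aligned on the page (say, a label and its value on an invoice) can still score highly through the spatial terms, without requiring an image encoder at all.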
As a result, DocLLM outperforms existing LLMs on 14 out of 16 datasets for visual question answering, natural language inference, key information extraction, and document classification tasks. DocLLM generalizes well to previously unseen datasets, demonstrating its robustness and adaptability.
And what’s next?
The researchers plan to infuse vision into DocLLM to improve its understanding of the visual elements of documents, such as images, tables, and charts. They also plan to investigate more accurate OCR engines to obtain cohesive blocks of text from documents, which should help on tasks such as visual question answering (VQA). Finally, they intend to construct a visually rich document understanding (VRDU) instruction-tuning dataset with crowdsourced instructions and preferences, providing a more comprehensive and diverse dataset for fine-tuning the model on a variety of document intelligence tasks.