VeriFact: Ensuring Factual Accuracy of LLM-Generated Text in Clinical Medicine

In the rapidly evolving field of artificial intelligence (AI) applications in healthcare, one of the greatest challenges lies in ensuring that AI-generated content, especially in clinical settings, is factually accurate. As large language models (LLMs) become more prevalent in generating clinical documentation and summaries, the risk of factual inaccuracies — or "hallucinations" — in the text they produce is a significant concern. These inaccuracies can have serious implications, such as misdiagnosis or incorrect treatment plans, which may jeopardize patient safety. To address this gap, the VeriFact system introduces a new method of validating LLM-generated clinical text against a patient’s electronic health record (EHR), ensuring the generated content is supported by the patient’s actual medical history.

Introducing VeriFact: A Novel Approach to Fact-Checking Clinical Text

VeriFact is an AI system designed to automatically verify the factual accuracy of text generated by LLMs. The system integrates retrieval-augmented generation (RAG) with an LLM-as-a-Judge to cross-check generated content against the patient's EHR. The core of VeriFact’s functionality lies in its ability to break down long-form clinical text — such as discharge summaries — into simple, verifiable statements (propositions). Each proposition is then checked for consistency with the patient’s clinical records.

To facilitate this, VeriFact employs several components:

  1. Proposition Decomposition: Long-form input texts are parsed into simple, discrete statements. These statements can either be complete sentences or atomic claim propositions, each representing a logical assertion.
  2. Reference Context Retrieval: For each statement, relevant facts are retrieved from the patient’s EHR using vector-based search, ensuring the context is patient-specific.
  3. Fact Verification with LLM: The LLM-as-a-Judge evaluates whether each proposition is Supported, Not Supported, or Not Addressed by the facts retrieved from the EHR, providing explanations for its verdicts.
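The first component, proposition decomposition, can be illustrated with a minimal sketch. The example below uses naive regex-based sentence splitting as a stand-in for VeriFact's sentence-level decomposition; the paper's atomic-claim variant instead uses an LLM to further break each sentence into subject-predicate-object claims. The sample text and function name are illustrative, not from the paper.

```python
import re

def decompose_to_sentences(text: str) -> list[str]:
    """Split long-form clinical text into sentence-level propositions.

    A simplified stand-in for VeriFact's sentence-level decomposition:
    each returned sentence is treated as one independently verifiable
    proposition.
    """
    # Naive boundary detection: split after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

bhc = ("Patient admitted with chest pain. Troponin was negative. "
       "Started on aspirin and discharged home.")
props = decompose_to_sentences(bhc)
# props now holds three propositions, each checked separately against the EHR.
```

Each resulting proposition is small enough to be matched against individual EHR facts, which is what makes the downstream retrieval and judging steps tractable.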

VeriFact-BHC: A New Dataset for Clinical Fact-Checking

To assess the performance of VeriFact, the authors introduced VeriFact-BHC, a specialized dataset derived from the MIMIC-III database of de-identified real patient records. This dataset decomposes the Brief Hospital Course (BHC) section of discharge summaries into 13,290 simple statements across 100 patients. These statements are annotated by human clinicians, who classify each as supported, not supported, or not addressed by the patient’s EHR.

The annotations from clinicians serve as the gold standard for fact-checking performance. The results showed that while the highest inter-clinician agreement was 88.5%, VeriFact achieved up to 92.7% agreement when compared to a denoised and adjudicated average human clinician ground truth. This demonstrates that VeriFact can exceed the performance of human clinicians in verifying the accuracy of LLM-generated clinical content, suggesting it is a viable tool for ensuring the factual accuracy of AI-generated medical texts.
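The agreement figures above are percent agreement: the fraction of propositions on which two raters assign the same verdict. A minimal sketch (the verdict labels and example arrays are hypothetical, not drawn from the dataset):

```python
def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of propositions on which two raters give the same verdict."""
    assert len(labels_a) == len(labels_b), "raters must label the same items"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical verdicts over five propositions
# (S = Supported, N = Not Supported, A = Not Addressed).
gold = ["S", "S", "N", "A", "S"]
system = ["S", "N", "N", "A", "S"]
score = percent_agreement(gold, system)  # 4 of 5 verdicts match -> 0.8
```

On this metric, a system agreeing with the adjudicated ground truth 92.7% of the time exceeds the 88.5% ceiling observed between individual human clinicians.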

How VeriFact Works: Decomposition and Evaluation

VeriFact works by splitting long-form input clinical texts, like BHC narratives, into individual propositions that can be independently verified. The process follows these steps:

  1. Decomposition of Clinical Text: Clinical text is broken down into logical propositions that can be evaluated for truth value. This can be done by parsing complete sentences or by decomposing text into atomic claims, which capture subject-object-predicate relationships.
  2. Fact Extraction from EHR: Each patient’s EHR is also decomposed into individual facts (e.g., diagnostic data, treatment history) that represent truths about the patient. These facts are stored in a vector database for efficient retrieval.
  3. Contextual Retrieval: When evaluating a proposition, VeriFact dynamically retrieves the most relevant facts from the EHR to provide context for the proposition. This reference context ensures that the proposition is evaluated against accurate, patient-specific data.
  4. LLM-as-a-Judge: An LLM, prompted to act as a judge, compares each proposition with its reference context to determine whether it is Supported, Not Supported, or Not Addressed. The system also generates explanations for the verdicts, ensuring transparency and providing valuable insights into the verification process.
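Steps 2–4 above can be sketched end to end. The example below is a toy illustration, not the paper's implementation: it substitutes word-overlap (Jaccard) scoring for real vector-embedding similarity, and the `judge` function only builds the prompt that would be sent to the LLM-as-a-Judge rather than calling a model. All facts, names, and the prompt template are hypothetical.

```python
import re

def tokens(s: str) -> set[str]:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def jaccard(a: str, b: str) -> float:
    """Toy relevance score standing in for vector similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve_context(proposition: str, ehr_facts: list[str], top_k: int = 2) -> list[str]:
    """Return the top-k EHR facts most relevant to a proposition."""
    ranked = sorted(ehr_facts, key=lambda f: jaccard(proposition, f), reverse=True)
    return ranked[:top_k]

def judge(proposition: str, context: list[str]) -> str:
    """Build the prompt for the LLM-as-a-Judge verdict.

    In VeriFact, an LLM receives the retrieved facts and the proposition
    and returns Supported / Not Supported / Not Addressed with an
    explanation; here we only sketch the interface.
    """
    return ("Facts:\n" + "\n".join(context) +
            f"\n\nProposition: {proposition}\nVerdict:")

ehr_facts = [
    "Troponin I negative on admission.",
    "Aspirin 81 mg started on hospital day 1.",
    "History of hypertension.",
]
ctx = retrieve_context("Troponin was negative.", ehr_facts)
prompt = judge("Troponin was negative.", ctx)
```

A production system would replace `jaccard` with embeddings stored in a vector database and `judge` with an actual LLM call, but the data flow (decompose, retrieve, verdict) is the same.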

Performance Evaluation and Results

VeriFact was tested on both human-written and LLM-generated BHC summaries, with both sentence-level and atomic claim-level decompositions evaluated for each type of input text. The results showed that VeriFact significantly outperforms previous methods for fact-checking long-form clinical text. Notably, it exceeded the highest agreement among human clinicians by 4.2 percentage points, demonstrating its efficacy in verifying the factual consistency of LLM-generated clinical content.

Implications for LLM Applications in Healthcare

The potential impact of VeriFact extends far beyond just verifying discharge summaries. As LLMs become more integrated into clinical workflows — from patient risk prediction to clinical decision support — ensuring the factual accuracy of AI-generated content is critical. VeriFact addresses this challenge by providing an automated and scalable solution for verifying patient-specific information, reducing the reliance on human clinicians for manual fact-checking.

Moreover, by grounding its verdicts in patient-specific EHR data and explaining each one, VeriFact enhances the credibility of LLMs in healthcare. This could accelerate the adoption of AI-driven clinical applications by alleviating concerns about the accuracy and reliability of AI-generated content.

Conclusion: A Step Forward in AI-Driven Clinical Applications

VeriFact presents a robust solution to the problem of fact-checking LLM-generated text in clinical medicine. By combining retrieval-augmented generation with an LLM-as-a-Judge, it can automatically verify the factual consistency of clinical narratives against a patient’s EHR. The VeriFact-BHC dataset provides a valuable resource for further development and evaluation of patient-specific fact-checking systems. With its ability to achieve human-level accuracy and transparency, VeriFact holds the potential to play a key role in the future of AI applications in healthcare, helping ensure that AI-driven clinical decision-making is both accurate and reliable.
