Detecting LLM hallucinations and overgeneration mistakes @ SemEval 2024

May 12, 2024

The modern NLG landscape is plagued by two interlinked problems. On the one hand, current neural models have a propensity to produce fluent but inaccurate outputs; on the other hand, our metrics are better at describing fluency than correctness. As a result, neural networks “hallucinate”, i.e., they produce fluent but incorrect outputs that we currently struggle to detect automatically. For many NLG applications, however, the correctness of an output is mission-critical. For instance, a plausible-sounding translation that is inconsistent with the source text jeopardizes the usefulness of a machine translation pipeline. For this reason, SHROOM, the Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, aims to foster the community’s growing interest in this topic.

In this competition, participants were asked to perform binary classification to identify cases of fluent overgeneration hallucinations in two setups: a model-aware track and a model-agnostic track. To do this, they had to detect grammatically sound outputs that contain incorrect or unsupported semantic information inconsistent with the source input, with or without access to the model that produced the output.

The approach evaluated here, a simple linear combination of features from reference models, ranked 3rd in the model-agnostic track with an accuracy of 0.826.

Related work

Hallucination in AI means the AI makes up things that sound real, but are either wrong or not related to the context. This often happens because the AI has built-in biases, doesn’t fully understand the real world, or its training data isn’t complete. In these instances, the AI comes up with information it wasn’t specifically taught, leading to responses that can be incorrect or misleading.

The following link (https://www.rungalileo.io/blog/deep-dive-into-llm-hallucinations-across-generative-tasks) provides a good analysis of hallucination types, which are reproduced below:

“Intrinsic Hallucinations:” These are made-up details that directly conflict with the original information. For example, if the original content says “The first Ebola vaccine was approved by the FDA in 2019,” but the summarized version says “The first Ebola vaccine was approved in 2021,” then that’s an intrinsic hallucination.

“Extrinsic Hallucinations:” These are details added to the summarized version that can’t be confirmed or denied by the original content. For instance, if the summary includes “China has already started clinical trials of the COVID-19 vaccine,” but the original content doesn’t mention this, it’s an extrinsic hallucination. Even though it may be true and add useful context, it’s seen as risky as it’s not verifiable from the original information.

Source of Image 1: Survey of Hallucination in Natural Language Generation

Approach

Since I only took part in the model-agnostic track, I had neither access to nor knowledge of the source models used for generation. For this reason, the following models were considered for feature generation:

COMET: Developed by Rei et al., COMET is a neural quality estimation metric that has been validated as a state-of-the-art reference-based method [Kocmi et al.].
Vectara HHEM: an open-source model created by Vectara for detecting hallucinations in LLMs. It is particularly useful when building retrieval-augmented generation (RAG) applications, where a set of facts is summarized by an LLM, but the model can also be used in other contexts.
LaBSE: a measure that evaluates the cosine similarity of the source and translation sentence embeddings [Feng et al.]. It’s a dual-encoder approach that relies on pretrained transformers and is fine-tuned for translation ranking with an additive margin softmax loss. Two features were extracted with this approach, depending on which pair of texts is compared via cosine similarity:
labse1: hypothesis vs. target
labse2: hypothesis vs. source
SelfCheckGPT QA (MQAG) [Manakul et al.]: facilitates consistency assessment by creating multiple-choice questions that a separate answering system can answer for each passage. If the same questions are asked about consistent passages, the answering system is expected to predict the same answers. The MQAG framework consists of three main components: a question-answer generation system (G1), a distractor generation system (G2), and an answering system (A). Two features were extracted with this approach, one using GPT3.5 Turbo and another using GPT4.
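
To make the feature definitions above more concrete, here is a minimal sketch of how the two cosine-similarity features and an MQAG-style consistency score could be computed. It is not the code used in the competition: the function names (cosine_similarity, labse_features, answer_disagreement) are hypothetical, the 3-dimensional vectors are toy stand-ins for real LaBSE sentence embeddings, and the disagreement rate is only one simple way to turn multiple-choice answers into a number. Which pair of passages the original system compared for MQAG is not stated, so the answers below are made up.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for degenerate all-zero vectors
    return dot / (norm_u * norm_v)


def labse_features(
    hyp_emb: Sequence[float],
    tgt_emb: Sequence[float],
    src_emb: Sequence[float],
) -> Dict[str, float]:
    """Build the two similarity features from precomputed sentence embeddings."""
    return {
        "labse1": cosine_similarity(hyp_emb, tgt_emb),  # hypothesis vs. target
        "labse2": cosine_similarity(hyp_emb, src_emb),  # hypothesis vs. source
    }


def answer_disagreement(answers_a: Sequence[str], answers_b: Sequence[str]) -> float:
    """Fraction of multiple-choice questions answered differently for two passages.

    Assumption: the same questions were asked against both passages, so the
    i-th answers in each list belong to the same question.
    """
    if len(answers_a) != len(answers_b):
        raise ValueError("both passages must be asked the same questions")
    if not answers_a:
        raise ValueError("at least one question is required")
    mismatches = sum(a != b for a, b in zip(answers_a, answers_b))
    return mismatches / len(answers_a)


# Toy vectors standing in for real sentence embeddings (assumption).
features = labse_features(
    hyp_emb=[1.0, 0.0, 0.0],
    tgt_emb=[1.0, 0.0, 0.0],
    src_emb=[0.0, 1.0, 0.0],
)
# Made-up answers to four questions, answered once per passage.
features["mqag"] = answer_disagreement(["A", "C", "B", "D"], ["A", "C", "D", "D"])
```

In this toy case the hypothesis matches the target exactly and is orthogonal to the source, and one of the four answers differs between the passages. A higher disagreement rate would hint at an inconsistent, possibly hallucinated, hypothesis.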

Finally, a logistic regression model was trained using the main competition dataset and the features described above. The feature importance is quite insightful as shown below, where we can observe that Vectara and GPT4 models produce the strongest features overall:

Image 2: Logistic regression weights of the final model.
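
For readers who want to see what this last step looks like in code, below is a minimal sketch of a logistic regression trained with plain gradient descent on two features. It is an illustration under stated assumptions, not the competition model: the feature values and labels are made up, the helper names (train_logistic_regression, predict) are hypothetical, and in practice a library implementation such as scikit-learn’s would be the more common choice. The block only uses the standard library so it can be run as is.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.5,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n_samples for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return 1 (hallucination) if the predicted probability is at least 0.5."""
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)


# Made-up rows: [consistency score, similarity score]; label 1 = hallucination.
X = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.7], [0.2, 0.3], [0.1, 0.4], [0.3, 0.2]]
y = [0, 0, 0, 1, 1, 1]

w, b = train_logistic_regression(X, y)
train_accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)
```

The learned weights in w play the same role as the bars in Image 2: the larger the absolute weight, the more that feature moves the prediction. This comparison is only meaningful when the features are on comparable scales.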

Results

Despite the simplicity of the approach (it relies mostly on pre-trained resources and needs no GPU infrastructure), it can be considered a strong baseline, with an accuracy of 0.826. The ranking table from Codalab is included as follows: