When a big organization or government is looking at using an AI system, trust is often on their minds. There’s a lot of talk about AI hallucination and lies, like this NY Times article saying GPT-4 hallucinates 3% of the time, or this paper that shows GPT-4 can engage in insider trading and then lie about it. How can folks who work in AI build systems that big clients can trust?
I’m a big advocate of data-driven decision making, so to me, the decision of whether or not to trust an AI system needs to be backed by data just like any other. But that means we need metrics and tests for truth and harmlessness, and how do you measure truth? Well, here are a few ideas…
Correctness
Let’s start with a solution that we use in the real world on humans: fact-checking! We’ll approach it like this:
- Break the AI’s statement up into individual claims
- For each claim, prompt an LLM to use its knowledge of the world to determine if the claim is factually correct or not.
- Once all claims have been assessed, return a truthiness score: the number of factually correct claims divided by the total number of claims (see the sketch below).
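Here’s a minimal sketch of that prompt chain in Python. It assumes you have some `llm(prompt) -> str` function wrapping your model of choice; the prompts, the function names, and the YES/NO parsing are illustrative, not battle-tested:

```python
from typing import Callable, List


def extract_claims(statement: str, llm: Callable[[str], str]) -> List[str]:
    """Ask the LLM to decompose a statement into standalone factual claims."""
    prompt = (
        "Break the following statement into individual factual claims, "
        "one per line, with no extra commentary:\n\n" + statement
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def is_claim_correct(claim: str, llm: Callable[[str], str]) -> bool:
    """Ask the LLM to judge a single claim using only its internal knowledge."""
    prompt = (
        "Using only your knowledge of the world, is the following claim "
        f"factually correct? Answer YES or NO.\n\nClaim: {claim}"
    )
    return llm(prompt).strip().upper().startswith("YES")


def correctness_score(statement: str, llm: Callable[[str], str]) -> float:
    """Fraction of extracted claims the LLM judges to be factually correct."""
    claims = extract_claims(statement, llm)
    if not claims:
        return 0.0
    verified = sum(is_claim_correct(claim, llm) for claim in claims)
    return verified / len(claims)
```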
This is great because:

- We get an objective measure of truth
- No external data is needed
- It’s a fairly simple prompt chaining technique
And if that is all you’re looking for, then correctness can be a great measurement! But what if I told you there were downsides?…

- It relies on the LLM’s knowledge of the world, which is not up-to-date and may not cover niche topics
- LLMs are bad at math
- Smaller LLMs are bad at reasoning over long, multi-step problems
So this measure buys simplicity by offloading your definition of truth to a (potentially unreliable) LLM. If only there were a way to define our own truth and have the LLM compare against that…
Faithfulness / Data Provenance
Faithfulness or Data Provenance does exactly that: Like correctness, it assesses individual claims, but unlike correctness, it measures those claims against facts retrieved from a datastore! To break it into steps:
- Retrieve relevant facts from a datastore.
- Break the AI’s statement into individual claims.
- Compare each claim with facts from the datastore to see if they align or contradict each other.
- Once all claims have been assessed, return the number of verified claims divided by the total number of claims (again, sketched below).
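And again a rough sketch, this time assuming a `retrieve(query) -> list[str]` function over your datastore (a vector-store search, a keyword index, whatever you have) and reusing the hypothetical `extract_claims` and `llm` helpers from the correctness sketch:

```python
from typing import Callable, List


def is_claim_supported(claim: str, facts: List[str], llm: Callable[[str], str]) -> bool:
    """Ask the LLM whether the retrieved facts support (rather than contradict) the claim."""
    context = "\n".join(f"- {fact}" for fact in facts)
    prompt = (
        "Using ONLY the facts below, decide whether the claim is supported.\n"
        f"Facts:\n{context}\n\nClaim: {claim}\n\n"
        "Answer SUPPORTED or NOT SUPPORTED."
    )
    return llm(prompt).strip().upper().startswith("SUPPORTED")


def faithfulness_score(
    statement: str,
    query: str,
    retrieve: Callable[[str], List[str]],  # assumed datastore lookup
    llm: Callable[[str], str],
) -> float:
    """Fraction of claims that align with facts retrieved from the datastore."""
    facts = retrieve(query)                  # step 1: gather relevant facts
    claims = extract_claims(statement, llm)  # step 2: reuse the claim splitter from above
    if not claims:
        return 0.0
    verified = sum(is_claim_supported(claim, facts, llm) for claim in claims)
    return verified / len(claims)            # step 4: verified claims / total claims
```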
Problem solved, right? Using this system we can:

- Involve proprietary or niche data in our definition of truth
- Assess claims against up-to-date or even real-time data
- Personalize the metric to be more meaningful to our use case. It’s not just truth, but our truth.
Pretty sweet deal! But measuring this way introduces a lot of complexity. First and foremost: you have to gather relevant facts and define what your truth really means! And that is no joke, especially when it comes to large organizations.
📖 Story time!
I once worked on an HR bot for a large corporation that had recently undergone leadership changes. The bot was supposed to answer questions about leave, compensation, and the company’s story and mission. Data about leave and other policies was easy to get from the employee handbook, but when we asked the HR team, “What is the company’s mission statement?” they took 3 months to get back to us!
In addition to hard conversations about the nature of truth, you have to build a reliable retrieval system to power it all! Hopefully your AI system already uses retrieval for other use cases, so you can repurpose it here.
Once your retrieval system is built and filled with relevant, non-contradictory data, you still need to test and measure its performance to ensure that your truth measurements can rely on it. There are a lot of good sources on measuring RAG systems (and I’ll hopefully cover some of these measurements later). In the meantime, I particularly like this article from Pinecone.
❤️
Gordy