How OpenAI is Teaching AI Models to Avoid Logical Mistakes: The Power of Process Supervision

Have you ever wondered how AI models can solve complex problems that require logical thinking and reasoning? For example, how can a model solve a math problem like this:
1/2 + 1/3 = ?
You probably know the answer is 5/6, but how did you get there? Most likely, you followed a series of steps like these:
Find the least common multiple of 2 and 3, which is 6
Rewrite 1/2 with 6 as the denominator: 1/2 = 3/6
Rewrite 1/3 with 6 as the denominator: 1/3 = 2/6
Add the numerators: 3 + 2 = 5
Write the answer as a fraction over 6: 5/6.
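To make the chain of thought concrete, here is a minimal Python sketch that mirrors those steps explicitly. It leans on Python's built-in math and fractions modules; the step-by-step structure is purely for illustration.

```python
from fractions import Fraction
from math import lcm

# Step 1: find the least common multiple of the denominators 2 and 3.
common_denominator = lcm(2, 3)               # 6

# Steps 2 and 3: rewrite each fraction over the common denominator.
numerator_a = common_denominator // 2        # 1/2 -> 3/6, numerator 3
numerator_b = common_denominator // 3        # 1/3 -> 2/6, numerator 2

# Step 4: add the numerators.
numerator_sum = numerator_a + numerator_b    # 3 + 2 = 5

# Step 5: write the answer as a fraction over 6 (Fraction keeps it exact).
answer = Fraction(numerator_sum, common_denominator)
print(answer)                                # 5/6
```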
This is an example of a chain of thought, a sequence of steps that leads to a correct solution. A chain of thought is not only useful for solving problems but also for explaining how you solved them.
If someone asks you how you got the answer, you can show them your chain of thought and they can follow your reasoning.
But what about AI models? How do they solve problems like this? And how do they explain their solutions?
The Challenge of Hallucinations

AI models are very good at generating words based on some rules and patterns. They can learn these rules and patterns from large amounts of text data, such as books, articles, or websites.
This is how models like ChatGPT or Google Bard can talk about almost anything, from sports to politics to philosophy. They can even write stories or poems or jokes for you.
But these models are not really smart or creative.
They don’t actually think or understand anything. They are massive word generators: machines that predict the next word based on patterns learned from text. Their fluency can make them seem human, but they are not.
One of the biggest problems with these models is that they often make things up or produce logical mistakes, failures commonly called hallucinations. A hallucination is when the model invents facts or states things that are not true or do not make sense.
For example, a model might say that 2 + 2 = 5, or that the capital of France is Berlin, or that cats can fly. These are obvious hallucinations that anyone can spot and correct.
But what about more subtle hallucinations? What if the model says something that sounds plausible, but is actually wrong or misleading?
For example, a model might say that the moon is made of granite, or that Albert Einstein was born in 1956, or that the Pythagorean theorem is a^2 + b^2 = c^3.
These are not-so-obvious hallucinations that might fool some people or cause confusion.
Hallucinations are especially problematic in domains that require multi-step reasoning, such as mathematics, science, or logic.
In these domains, a single logical error can derail a much larger solution. For example, if the model makes a mistake in the first step of solving a math problem, it will likely get the wrong answer at the end.
And if the model does not show its chain of thought, it will be hard to find and fix the mistake.
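To see how fragile multi-step reasoning is, consider a toy version of the same fraction problem in which only the first step is wrong, say the common denominator is taken to be 4 instead of 6. Every later step can be carried out faithfully and the final answer is still wrong. This snippet is only an illustration of error propagation, not of how a language model actually computes.

```python
from fractions import Fraction

def add_half_and_third(common_denominator: int) -> Fraction:
    """Follow the same chain of thought, trusting whatever the first step chose."""
    numerator_a = common_denominator // 2    # rewrite 1/2 over the chosen denominator
    numerator_b = common_denominator // 3    # rewrite 1/3 over the chosen denominator
    return Fraction(numerator_a + numerator_b, common_denominator)

print(add_half_and_third(6))   # 5/6 -- correct first step, correct answer
print(add_half_and_third(4))   # 3/4 -- wrong first step, wrong final answer
```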
The Solution of Process Supervision
How can we train more reliable models that can avoid hallucinations and produce correct and explainable solutions? One possible solution is to use process supervision.
Process supervision is a method of training AI models by rewarding each individual correct step of reasoning on the way to an answer, instead of only rewarding a correct final conclusion.
This way, the model learns to follow a human-approved chain of thought that leads to a correct and explainable solution.
For example, let’s go back to the math problem we saw earlier:
1/2 + 1/3 = ?
Instead of just giving the model the final answer (5/6), we can also give it feedback for each step in the chain-of-thought:
Find the least common multiple of 2 and 3, which is 6 -> Good!
Rewrite 1/2 with 6 as the denominator: 1/2 = 3/6 -> Good!
Rewrite 1/3 with 6 as the denominator: 1/3 = 2/6 -> Good!
Add the numerators: 3 + 2 = 5 -> Good!
Write the answer as a fraction over 6: 5/6 -> Good!
By giving the model feedback for each step, we are teaching it how to solve the problem in a logical and understandable way. We are also making it easier to spot and correct any mistakes along the way.
Process supervision has several advantages over outcome supervision, which only provides feedback based on a final result.
Process supervision directly rewards the model for following an aligned chain-of-thought, since each step in the process receives precise supervision. Process supervision is also more likely to produce interpretable reasoning since it encourages the model to follow a human-approved process.
In contrast, outcome supervision may reward an unaligned process, and it is generally harder to scrutinize.
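As a rough illustration of the difference, here is a hedged sketch of how the two reward signals might be computed for a single graded solution. The per-step labels stand in for human ratings or a learned reward model, and the simple averaging is an assumption made for clarity, not OpenAI's actual implementation.

```python
from typing import List

def outcome_reward(final_answer: str, correct_answer: str) -> float:
    """Outcome supervision: one sparse signal, based only on the final result."""
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(step_is_correct: List[bool]) -> float:
    """Process supervision: dense feedback, one label per reasoning step.
    Averaging the labels is a simplification; in practice a reward model is
    trained on human step-level ratings."""
    if not step_is_correct:
        return 0.0
    return sum(step_is_correct) / len(step_is_correct)

# A solution whose third step is flawed but whose final answer happens to be right:
step_labels = [True, True, False, True, True]
print(outcome_reward("5/6", "5/6"))   # 1.0 -- the flawed step goes unnoticed
print(process_reward(step_labels))    # 0.8 -- the flawed step is penalized
```

The point of the sketch is the shape of the signal: outcome supervision can give full credit to a solution that reached the right answer through flawed reasoning, while process supervision penalizes the flawed step directly.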
The Results of OpenAI
OpenAI, the Microsoft-funded creator of GPT-3 and other advanced AI models, has recently conducted a detailed comparison of outcome supervision and process supervision using the MATH dataset, a challenging collection of math problems that require multi-step reasoning.
They found that process supervision leads to significantly better performance, even when judged by outcomes. Their process-supervised model solves 78% of problems from a representative subset of the MATH test set.
In addition to boosting performance relative to outcome supervision, process supervision also has an important alignment benefit: it directly trains the model to produce a chain-of-thought that is endorsed by humans.
This means that the model is more likely to produce reliable and explainable solutions, and less likely to produce hallucinations or logical errors.
To encourage related research, OpenAI has also released PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train their best reward model. This dataset can be used by other researchers and developers who want to experiment with process supervision and improve mathematical reasoning in AI models.
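For anyone who wants to explore such data, the sketch below shows one way a file of step-level labels in JSON Lines format might be read. The field names used here ("steps", "rating") are placeholders chosen for illustration; the actual PRM800K schema should be taken from the dataset's own documentation.

```python
import json
from collections import Counter

def summarize_step_ratings(path: str) -> Counter:
    """Count hypothetical per-step ratings in a JSONL file of labeled solutions.
    Field names are illustrative, not the real PRM800K schema."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            for step in record.get("steps", []):
                counts[step.get("rating")] += 1
    return counts

# Example (the file name is illustrative):
# print(summarize_step_ratings("step_labels.jsonl"))
```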
The Future of Reasoning
Process supervision is a promising technique for training more reliable and explainable AI models that can perform complex multi-step reasoning.
By rewarding each correct step of reasoning, instead of just the final answer, process supervision teaches the model how to follow a human-approved chain of thought that leads to a correct and understandable solution.
OpenAI has shown that process supervision can significantly improve mathematical reasoning in AI models, and has released a large dataset of step-level feedback labels to support further research.
However, there are still many open questions and challenges in this domain. For example:
How can we scale up process supervision to more complex and diverse problems and domains?
How can we generate or collect high-quality feedback labels for each step of reasoning in a cost-effective and scalable way?
How can we ensure that the model does not deviate from the intended chain of thought or introduce new hallucinations or errors along the way?
How can we evaluate and compare the quality and reliability of different chains of thought and solutions?
These are some of the questions that OpenAI and other researchers are working on to advance the field of AI reasoning.
By developing more reliable and explainable AI models, we can hope to unlock new possibilities and applications for AI in various domains, such as education, science, engineering, and more.