Paper: Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4
Researchers at Zhejiang University (among other contributors) are testing the limits of logical reasoning in AI models.
Mainly, they wanted to measure whether today's language models can truly harness logical reasoning, which has long been an endeavor for the field.
Logical reasoning has been an almost unreachable goal for Natural Language Understanding (NLU) for decades.
In the past, researchers relied on symbolic approaches such as First-Order Logic (FOL) and Natural Logic, which use symbols to encode meaning, as well as rule-based models to define logical reasoning.
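For a concrete flavor of that symbolic style (a standard textbook example, not one from the paper): the sentence "All humans are mortal" is encoded in FOL as ∀x (Human(x) → Mortal(x)), and given the fact Human(socrates), a rule-based reasoner can mechanically derive Mortal(socrates).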
In the latest era of AI, language models such as ChatGPT and GPT-4 are instead trained on large internet datasets to "learn concepts" through language patterns.
The researchers put the models through several kinds of logical reasoning tests (a minimal evaluation sketch follows the list):
Multiple Choice Reading Comprehension (like SAT or ACT questions)
Natural Language Inference (deciding whether a hypothesis follows from a premise)
Out-of-distribution tests (like Chinese Civil Servant exams)
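To make the evaluation setup concrete, here is a minimal sketch of a multiple-choice scoring loop in the style of LogiQA/ReClor items. This is not the paper's actual harness: `query_model()` is a hypothetical stand-in for whatever chat-model API you use, and the item fields (`context`, `question`, `options`, `answer`) are assumptions about the data format.

```python
import re

def query_model(prompt: str) -> str:
    """Hypothetical helper: send the prompt to a chat model and return its raw reply."""
    raise NotImplementedError("wire this up to your model API of choice")

def format_prompt(item: dict) -> str:
    """Render one reading-comprehension item as a single multiple-choice prompt."""
    options = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", item["options"]))
    return (
        f"Passage: {item['context']}\n"
        f"Question: {item['question']}\n"
        f"{options}\n"
        "Answer with a single letter (A, B, C, or D)."
    )

def evaluate(items: list[dict]) -> float:
    """Accuracy = fraction of items where the model's letter matches the gold answer."""
    correct = 0
    for item in items:
        reply = query_model(format_prompt(item))
        # Crude heuristic: take the first standalone A-D letter as the model's choice.
        match = re.search(r"\b([ABCD])\b", reply.upper())
        choice = match.group(1) if match else None
        correct += choice == item["answer"]
    return correct / len(items)
```

Accuracy figures like the ones quoted below are simply this correct/total ratio computed over a benchmark's dev or test set.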
The researchers had some interesting takeaways:
ChatGPT performed consistently across the Chinese and English tests; GPT-4 did not.
ChatGPT performs well on well-known logical reasoning benchmarks like LogiQA and ReClor. However, on the newly released AR-LSAT dataset and on the LogiQA 2.0 out-of-distribution set, its performance declines significantly.
On the ReClor dev set, GPT-4 reaches 92.00% accuracy, which is remarkable! However, on the AR-LSAT test set, GPT-4 performs surprisingly poorly, with only 18.27% accuracy.
So essentially,
"Language AI models do well on known logical tests but struggle with out-of-distribution tests immensely!"