In-Depth Evaluations of the Primality Testing Capabilities of Large Language Models: with a Focus on ChatGPT and PaLM 2
http://doi.org/10.5626/JOK.2024.51.8.699
This study thoroughly evaluates the primality testing capabilities of two large language models, ChatGPT and PaLM 2. For each given number, we pose two yes/no questions: whether it is prime and whether it is composite. A model is deemed successful only if it answers both questions correctly while also avoiding any division errors in its generated response. Analyzing inference results on a dataset of 664 prime and 1,458 composite numbers, we find that testing accuracy decreases as the difficulty of the target numbers increases. When calculation errors are taken into account, both models show a further drop in accuracy, with PaLM 2 failing primality testing on all high-difficulty four-digit composite numbers. These findings highlight how simple questions can yield misleading evaluations of language models' reasoning abilities, emphasizing the need for comprehensive assessments.
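An evaluation like this requires a ground-truth label (prime or composite) for every number in the dataset. The sketch below is a hypothetical illustration of such a labeler using deterministic trial division, which is exact and fast for the four-digit range the abstract mentions; it is not the authors' actual code, and the function names `is_prime` and `label` are assumptions for illustration.

```python
def is_prime(n: int) -> bool:
    """Deterministic trial division; exact for the small (e.g. four-digit) range studied."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2  # 2 is the only even prime
    d = 3
    while d * d <= n:  # only need divisors up to sqrt(n)
        if n % d == 0:
            return False
        d += 2  # skip even candidates
    return True


def label(n: int) -> str:
    """Ground-truth label a model's two yes/no answers would be checked against."""
    return "prime" if is_prime(n) else "composite"
```

A model's answers to "Is N prime?" and "Is N composite?" can then both be compared against `label(N)`, so an inconsistent pair of answers counts as a failure.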

Journal of KIISE
- ISSN : 2383-630X(Print)
- ISSN : 2383-6296(Electronic)
- KCI Accredited Journal
Editorial Office
- Tel. +82-2-588-9240
- Fax. +82-2-521-1352
- E-mail. chwoo@kiise.or.kr