Systematic testing of three Language Models reveals low language accuracy, absence of response stability, and a yes-response bias

Authors:

Vittoria Dentella1, Fritz Günther2, Evelina Leivada3,4

Affiliations:

1. Departament d'Estudis Anglesos i Alemanys, Universitat Rovira i Virgili, Tarragona 43002, Spain

2. Institut für Psychologie, Humboldt-Universität zu Berlin, Berlin 10099, Germany

3. Departament de Filologia Catalana, Universitat Autònoma de Barcelona, Barcelona 08193, Spain

4. Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona 08010, Spain

Abstract

Humans are universally good at providing stable and accurate judgments about what forms part of their language and what does not. Large Language Models (LMs) are claimed to possess human-like language abilities; hence, they are expected to emulate this behavior by providing both stable and accurate answers when asked whether a string of words complies with or deviates from their next-word predictions. This work tests whether stability and accuracy are shown by GPT-3/text-davinci-002, GPT-3/text-davinci-003, and ChatGPT, using a series of judgment tasks that tap into 8 linguistic phenomena: plural attraction, anaphora, center embedding, comparatives, intrusive resumption, negative polarity items, order of adjectives, and order of adverbs. For every phenomenon, 10 sentences (5 grammatical and 5 ungrammatical) are tested, each randomly repeated 10 times, totaling 800 elicited judgments per LM (total n = 2,400). Our results reveal variable above-chance accuracy in the grammatical condition, below-chance accuracy in the ungrammatical condition, significant instability of answers across phenomena, and a yes-response bias for all the tested LMs. Furthermore, we find no evidence that repetition helps the models converge on a processing strategy that culminates in stable answers, whether accurate or inaccurate. We demonstrate that the LMs' performance in identifying (un)grammatical word patterns stands in stark contrast to what is observed in humans (n = 80, tested on the same tasks) and argue that adopting LMs as theories of human language is not warranted at their current stage of development.
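As an illustrative aside (not part of the study materials), the sketch below reproduces the design arithmetic stated in the abstract (8 phenomena × 10 sentences × 10 repetitions = 800 judgments per LM, × 3 LMs = 2,400) and shows one plausible way response stability and a yes-response bias could be summarized from a table of elicited yes/no judgments. All variable names and the simulated responses are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' code): design arithmetic of the study
# plus simple summaries of response stability and yes-response bias,
# computed here from randomly simulated placeholder judgments.

import random
from collections import Counter

random.seed(0)

PHENOMENA = 8      # plural attraction, anaphora, center embedding, ...
SENTENCES = 10     # per phenomenon: 5 grammatical + 5 ungrammatical
REPETITIONS = 10   # each sentence is judged 10 times
MODELS = 3         # text-davinci-002, text-davinci-003, ChatGPT

per_model = PHENOMENA * SENTENCES * REPETITIONS   # 800 judgments per LM
total = per_model * MODELS                        # 2,400 judgments overall
print(f"judgments per LM: {per_model}, total: {total}")

# Simulated yes/no judgments for one model (placeholder data, not real results).
judgments = {
    (sent_id, rep): random.choice(["yes", "no"])
    for sent_id in range(PHENOMENA * SENTENCES)
    for rep in range(REPETITIONS)
}

# Yes-response bias: overall proportion of "yes" answers.
yes_rate = sum(answer == "yes" for answer in judgments.values()) / len(judgments)

# Stability: proportion of sentences that receive the same answer on all 10 runs.
stable_sentences = 0
for sent_id in range(PHENOMENA * SENTENCES):
    answers = Counter(judgments[(sent_id, rep)] for rep in range(REPETITIONS))
    if len(answers) == 1:
        stable_sentences += 1
stability = stable_sentences / (PHENOMENA * SENTENCES)

print(f"yes-rate = {yes_rate:.2f}, fully stable sentences = {stability:.2f}")
```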

Funder

Horizon Europe 2020 | Marie Skłodowska-Curie Actions

Universitat Rovira i Virgili

Deutsche Forschungsgemeinschaft

Spanish Ministry of Science and Innovation

Publisher

Proceedings of the National Academy of Sciences

Subject

Multidisciplinary


Cited by 1 article.
