XPrag 2025, 19/09/2025
Vague quantifiers display a huge amount of variability. Despite this, psycholinguists and theoretical linguists agree that human quantification is regulated by:
- Typicality
(Tiel and Geurts 2014; Ramotowska et al. 2024, a.o.)
- Ordered Scales: <some, many, most, all>
(Horn 1972; Katsos et al. 2016; Pagliarini et al. 2018; Gotzner and Romoli 2022, a.o.)
- Cognitive biases: clustering, size, etc.
(Krueger 1972; Pauw 2013; Bertamini et al. 2018, a.o.)
A study comparing the language used by humans and by Large Vision-Language Models (LVLMs) when describing images.
Conclusion: models deviated most from humans in the quantifier task.
Hypothesis: Poor counting skills of the models (Testoni, Sprott, and Pezzelle 2024).
[…] all models struggle to successfully count how many animals appear in the image. We hypothesize that the reason for the poor performance in assigning quantifiers lies in the quantity estimation and comparison skills of the models. (Testoni, Sprott, and Pezzelle 2024, 5)
However, quantifier use has been argued to rely on an approximate number system that is independent of the symbolic counting system (Lidz 2016; Dolscheid et al. 2017; Szymanik, Kochari, and Bremnes 2023).
There might be other reasons (internal or external) that explain the poor performance of the models with quantifiers.
GPT-4o (non-reasoning model) and o4-mini (reasoning model)
| Participants | N | Method |
|---|---|---|
| Humans | 40 | PCIbex |
| GPT-4o | 40 | OpenAI API |
| o4-mini | 40 | OpenAI API |
Ordered Scales:
<a few, some, many, most>\(^1\)
• GPT-4o: ✘
• o4-mini\(^2\): ✔
Typicality (Mode) and range (IQR):
• GPT-4o: ✘
• o4-mini: ✘
Multinomial Regression \(\rightarrow\) significant
Models seem to have problems with the internal representation of quantification.
1. Catalan data does not follow this exact pattern. 2. Greek deviates from this.
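The multinomial regression mentioned above models the probability of each quantifier choice from one or more predictors. A minimal self-contained sketch of this model family, trained by plain stochastic gradient descent on invented data (the predictor, labels, and hyperparameters below are illustrative only, not the study's actual analysis):

```python
import math

# Softmax (multinomial) regression: P(class k | x) ∝ exp(w_k * x + b_k).
QUANTIFIERS = ["a few", "some", "many", "most"]  # illustrative 4-way choice

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def fit(xs, ys, k, epochs=3000, lr=0.1):
    """Fit a per-class weight and bias on a single predictor via SGD."""
    w, b = [0.0] * k, [0.0] * k
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            probs = softmax([w[j] * x + b[j] for j in range(k)])
            for j in range(k):
                grad = probs[j] - (1.0 if j == y else 0.0)
                w[j] -= lr * grad * x
                b[j] -= lr * grad
    return w, b

def predict(w, b, x):
    scores = [w[j] * x + b[j] for j in range(len(w))]
    return QUANTIFIERS[scores.index(max(scores))]

# Invented data: proportion of target items (0-1) mapped to a quantifier.
xs = [i / 20 for i in range(21)]
ys = [0 if x < 0.25 else 1 if x < 0.5 else 2 if x < 0.75 else 3 for x in xs]
w, b = fit(xs, ys, k=len(QUANTIFIERS))
```

On this toy data the fitted model maps low proportions to "a few" and high ones to "most", e.g. `predict(w, b, 0.05)` vs. `predict(w, b, 0.95)`; the real analysis would instead regress quantifier choice on the study's experimental predictors.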
Cosine similarity
\(\cos(\vec{q},\vec{p}) = \dfrac{\vec{q}\cdot \vec{p}}{||\vec{q}|| \cdot||\vec{p}||}\)
Gives a number between -1 and 1.
The closer to 1, the higher the similarity.
Model: text-embedding-3-large \(\rightarrow\) the most capable embedding model accessible through OpenAI’s API
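The cosine similarity defined above can be computed directly. A minimal pure-Python sketch; the toy vectors here are stand-ins for real embedding vectors, which in the study would come from text-embedding-3-large:

```python
import math

def cosine_similarity(q, p):
    """cos(q, p) = (q · p) / (||q|| ||p||); returns a value in [-1, 1]."""
    dot = sum(qi * pi for qi, pi in zip(q, p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    return dot / (norm_q * norm_p)

# Toy vectors standing in for quantifier embeddings (illustrative only).
q = [1.0, 0.0, 1.0]
p = [1.0, 0.0, 0.0]
print(round(cosine_similarity(q, p), 4))  # → 0.7071
```

Identical vectors yield 1, orthogonal vectors 0, and opposite vectors -1, which is why values closer to 1 indicate higher similarity between two embeddings.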
Some models face problems with the internal representation of quantification.
Reasoning models are trained via Reinforcement Learning (on verifiable problems) to create long chains of thought before providing a final answer (OpenAI 2025).
Hypothesis: given the inter-speaker variability in quantifier usage, these techniques may not be enough to replicate typicality values and ranges of usage.
Models have problems with the internal representation of typicality, range of usage, and the ordering of scales, and they differ from humans in numerosity estimation.
GPT-4o struggles with typicality, range of usage and ordering of scales, while o4-mini seems at least to have captured the ordering of quantifiers in some languages.
There are slight differences, particularly with lower-resource languages (e.g. Greek), and the models do not capture cross-linguistic variability (e.g. in Catalan).
Many thanks to M. Teresa Espinal and the members of CLT at UAB for their useful feedback and suggestions.
We also gratefully acknowledge funding from the Spanish National Research Agency (CNS2023-144415); MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR (Grant RYC2021-033969-I).
| English | Greek | Russian | Spanish | Italian | Catalan |
|---|---|---|---|---|---|
| Most | Τα περισσότερα | большинство | La mayoría | La maggior parte | La majoria |
| Many | Πολλά | многие | Muchos | Molti | Molts |
| Some | Κάποια | некоторые | Algunos | Alcuni | Alguns |
| A few | Λίγα | несколько | Unos pocos | Pochi | Uns quants |
| (Kearns 2017) | (Giannakidou 2012) | (Krasikova 2011) | (Martı́ 2015) | (Crisma 2012) | (Brucart and Rigau 2006) |