Vague quantification
and object representation
in humans vs. Large Vision Language Models

Raquel Montero, Natalia Moskvina, Paolo Morosi, Elena Pagliarini and Evelina Leivada

Universitat Autònoma de Barcelona

XPrag 2025, 19/09/2025

Vague quantification in human language

Vague quantifiers display considerable inter-speaker variability. Despite this, psycholinguists and theoretical linguists agree that human quantification is regulated by:

 

- Ordered Scales: <some, many, most, all>
(Horn 1972; Katsos et al. 2016; Pagliarini et al. 2018; Gotzner and Romoli 2022, a.o.)

 

- Cognitive biases: clustering, size, etc.
(Krueger 1972; Pauw 2013; Bertamini et al. 2018, a.o.)

Variation in the quantifier few (Ramotowska et al. 2024)

Regular-random illusion (Bertamini et al. 2018)

Quantification in LVLMs: Testoni, Sprott, and Pezzelle (2024)

Study comparing the language used by humans and Large Vision Language Models (LVLMs) when describing images.

Conclusion: models deviated most from humans in the quantifier task.

Why might models struggle with quantification?

 

Hypothesis: Poor counting skills of the models (Testoni, Sprott, and Pezzelle 2024).

[…] all models struggle to successfully count how many animals appear in the image. We hypothesize that the reason for the poor performance in assigning quantifiers lies in the quantity estimation and comparison skills of the models. (Testoni, Sprott, and Pezzelle 2024, 5)

However, quantifiers have been argued to use an approximate number system that is independent from the symbolic counting system (Lidz 2016; Dolscheid et al. 2017; Szymanik, Kochari, and Bremnes 2023).

 

There might be other reasons (internal or external) that explain the poor performance of the models with quantifiers.

Research Questions

  1. Why might models struggle with quantification?
  • Failure to replicate human cognitive mechanisms/biases for numerosity estimation
  • Models’ internal representations of quantification (ordered scales, typicality, etc.)
  2. Are there differences across different types of models?

GPT-4o (non-reasoning model) and o4-mini (reasoning model)

  3. Do the same patterns of behavior emerge cross-linguistically?

Methodology

  • Languages: English, Greek, Russian, Spanish, Italian and Catalan
  • Data collection per language:
    Group     Participants  Method
    Humans    40            PCIbex
    GPT-4o    40            OpenAI API
    o4-mini   40            OpenAI API
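As a rough illustration of the model arm of the design, the sketch below builds one Chat Completions request payload pairing a quantifier prompt with a stimulus image. The prompt wording, image URL, and helper name are invented for illustration, not the study's actual materials.

```python
# Hypothetical sketch of one data-collection trial via the OpenAI API.
# The prompt text and stimulus URL are assumptions, not the study's materials.

def build_trial_request(model: str, image_url: str, n_responses: int = 1) -> dict:
    """Build a Chat Completions payload asking a model to quantify objects in an image."""
    return {
        "model": model,
        "n": n_responses,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Complete the sentence with one quantifier: "
                             "'___ of the objects in the picture are red.'"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

req = build_trial_request("gpt-4o", "https://example.com/stimulus_01.png")
```

One such payload per trial, repeated 40 times per model and language, would yield a sample parallel to the 40 human participants per language.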

Results: internal representations

 

Ordered Scales:
<a few, some, many, most>\(^1\)

• GPT-4o: ✘
• o4-mini\(^2\): ✔

Typicality (Mode) and range (IQR):
• GPT-4o: ✘
• o4-mini: ✘

Multinomial Regression \(\rightarrow\) significant
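The two measures above can be made concrete with a small sketch: the mode of the proportions at which a quantifier is chosen captures its typicality, and the interquartile range captures its range of usage. The response values below are invented for illustration.

```python
import statistics

# Hypothetical proportions (% of target objects) at which respondents
# chose the quantifier "many". Invented data, for illustration only.
responses = [50, 55, 60, 60, 60, 60, 60, 65, 70, 75]

mode = statistics.mode(responses)                  # typicality: most frequent value
q1, _, q3 = statistics.quantiles(responses, n=4)   # lower and upper quartiles
iqr = q3 - q1                                      # range of usage

print(mode, iqr)
```

Comparing these per-quantifier statistics between human and model responses is one way to quantify the mismatch reported above.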

 

Models seem to have problems with the internal representation of quantification.

 

1. Catalan data does not follow this exact pattern. 2. Greek deviates from this.

Internal representations in LVLM - embeddings

 

Cosine similarity

\(\cos(\vec{q},\vec{p}) = \dfrac{\vec{q}\cdot \vec{p}}{||\vec{q}|| \cdot||\vec{p}||}\)

Gives a number between -1 and 1.
The closer to 1, the higher the similarity.
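The definition above can be sketched in a few lines; the toy vectors stand in for quantifier embeddings and are invented for illustration.

```python
import math

def cosine(q, p):
    """Cosine similarity: dot product of q and p over the product of their norms."""
    dot = sum(qi * pi for qi, pi in zip(q, p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    return dot / (norm_q * norm_p)

# Toy 3-d vectors standing in for quantifier embeddings (invented).
print(cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # identical vectors -> 1 (up to rounding)
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal vectors -> 0.0
```

Applied to the embeddings of quantifier expressions, pairwise cosine similarities reveal whether the model's vector space respects the ordering of the scale.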

 

Results: internal representations - embeddings

Model: text-embedding-3-large \(\rightarrow\) the embedding model accessible through OpenAI’s API

 

 

Some models face problems with the internal representation of quantification.

Internal representations in LVLM - reasoning

Reasoning models are trained via Reinforcement Learning (on verifiable problems) to create long chains of thought before providing a final answer (OpenAI 2025).

Image adapted from Wolfe (2025) and Wei et al. (2022). Simplified explanation of (non)-reasoning models.

Hypothesis: given the inter-speaker variability in quantifier usage, these techniques may not be enough to replicate typicality values and ranges of usage.

Recap

Results: approximate number system

 

Conclusions

  1. Why might models struggle with quantification? \(\rightarrow\) multi-causal

Models have problems with the internal representation of typicality, range of usage, and scale ordering, and they differ from humans in numerosity estimation.

  2. Are there differences across different types of models? \(\rightarrow\) Yes

GPT-4o struggles with typicality, range of usage and ordering of scales, while o4-mini seems at least to have captured the ordering of quantifiers in some languages.

  3. Do the same patterns of behavior emerge cross-linguistically? \(\rightarrow\) Mostly

Slight differences appear, particularly for lower-resource languages (e.g. Greek), and the models do not capture cross-linguistic variability (e.g. Catalan).

Acknowledgements

 

Many thanks to M. Teresa Espinal and the members of CLT at UAB for their useful feedback and suggestions.

 

We also gratefully acknowledge funding from the Spanish National Research Agency (CNS2023-144415); MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR (Grant RYC2021-033969-I).

 

Extra Slides

Results: overall patterns II

Catalan: quantifiers’ variation

Quantifiers Selection

English Greek Russian Spanish Italian Catalan
Most Τα περισσότερα большинство La mayoría La maggior parte La majoria
Many Πολλά многие Muchos Molti Molts
Some Κάποια некоторые Algunos Alcuni Alguns
A few Λίγα несколько Unos pocos Pochi Uns quants
(Kearns 2017) (Giannakidou 2012) (Krasikova 2011) (Martı́ 2015) (Crisma 2012) (Brucart and Rigau 2006)

References

Bertamini, Marco, Martin Guest, Giorgio Vallortigara, Rosa Rugani, and Lucia Regolin. 2018. “The Effect of Clustering on Perceived Quantity in Humans (Homo Sapiens) and in Chicks (Gallus Gallus).” Journal of Comparative Psychology 132 (3): 280.
Brucart, Josep M, and Gemma Rigau. 2006. “La Quantificació.” In Gramàtica Del Català Contemporani, 1517–89. Editorial Empuries.
Crisma, Paola. 2012. “Quantifiers in Italian.” Handbook of Quantifiers in Natural Language, 467–534.
Dolscheid, Sarah, Christina Winter, Lea Ostrowski, and Martina Penke. 2017. “The Many Ways Quantifiers Count: Children’s Quantifier Comprehension and Cardinal Number Knowledge Are Not Exclusively Related.” Cognitive Development 44: 21–31.
Giannakidou, Anastasia. 2012. “The Landscape of Greek Quantifiers.” Handbook of Quantifiers in Natural Language, 285–346.
Gotzner, Nicole, and Jacopo Romoli. 2022. “Meaning and Alternatives.” Annual Review of Linguistics 8: 213–34.
Horn, Laurence Robert. 1972. On the Semantic Properties of Logical Operators in English. University of California, Los Angeles.
Katsos, Napoleon, Chris Cummins, Maria-José Ezeizabarrena, Anna Gavarró, Jelena Kuvač Kraljević, Gordana Hrzica, Kleanthes K Grohmann, et al. 2016. “Cross-Linguistic Patterns in the Acquisition of Quantifiers.” Proceedings of the National Academy of Sciences 113 (33): 9244–49.
Kearns, Kate. 2017. Semantics. Bloomsbury Publishing.
Krasikova, Sveta. 2011. “On Proportional and Cardinal ‘Many’.” Generative Grammar in Geneva 7: 93–114.
Krueger, Lester E. 1972. “Perceived Numerosity.” Perception & Psychophysics 11 (1): 5–9.
Lidz, Jeffrey. 2016. “Quantification in Child Language.”
Martı́, Luisa. 2015. “The Morphosemantics of Spanish Indefinites.” In Semantics and Linguistic Theory, 576–94.
OpenAI. 2025. “Learning to Reason with LLMs.” https://openai.com/index/learning-to-reason-with-llms/.
Pagliarini, Elena, Cory Bill, Jacopo Romoli, Lyn Tieu, and Stephen Crain. 2018. “On Children’s Variable Success with Scalar Inferences: Insights from Disjunction in the Scope of a Universal Quantifier.” Cognition 178: 178–92.
Pauw, Simon. 2013. Size Matters: Grounding Quantifiers in Spatial Perception. University of Amsterdam.
Pezzelle, Sandro, Raffaella Bernardi, and Manuela Piazza. 2018. “Probing the Mental Representation of Quantifiers.” Cognition 181: 117–26.
Ramotowska, Sonia, Julia Haaf, Leendert Van Maanen, and Jakub Szymanik. 2024. “Most Quantifiers Have Many Meanings.” Psychonomic Bulletin & Review 31 (6): 2692–2703.
Szymanik, Jakub, Arnold Kochari, and Heming Strømholt Bremnes. 2023. “Questions about Quantifiers: Symbolic and Nonsymbolic Quantity Processing by the Brain.” Cognitive Science 47 (10): e13346.
Testoni, Alberto, Juell Sprott, and Sandro Pezzelle. 2024. “Naming, Describing, and Quantifying Visual Objects in Humans and LLMs.” arXiv Preprint arXiv:2403.06935.
Tiel, Bob van, and Bart Geurts. 2014. “Truth and Typicality in the Interpretation of Quantifiers.” In Proceedings of Sinn Und Bedeutung, 18:451–68.
Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Advances in Neural Information Processing Systems 35: 24824–37.
Wolfe, Cameron R. 2025. “Demystifying Reasoning Models.” https://cameronrwolfe.substack.com/p/demystifying-reasoning-models.