Increasingly, we are turning to language models such as ChatGPT not to search for information, but to understand it. We ask questions such as “Explain the food crisis in Uganda to me”, “Summarize Karl Popper for me”, “What is the legal definition of equitable?”, “Can you make me understand what this article says?”. Questions like these, every day, in every field. Students filling gaps before an exam, journalists looking for quick confirmation, professionals who “optimize” texts, emails, reports. And then teachers, doctors, lawyers. Everyone, sooner or later, turns to a large language model, or LLM. This is not something entirely new or original: until recently we all turned to a search engine like Google. Except that now there is a profound difference. Like an old-fashioned (but more powerful) encyclopedia, Google returned a list of results: it indicated sources, pointed the reader to different places, and left a large amount of personal (human) work to be done in reading, screening and evaluating. But how could we be the ones to evaluate the quality of an answer picked from a long list, if we are the ones searching for it? Today LLMs simulate a response. They tell you directly what to think, as if the (human?) act of judging, explaining, synthesizing had already taken place. And with text that is always neat and orderly, fluent, often very convincing, they seem to confirm that everything you need to know is already there. The truth is: it is not there, and it does not work as it seems.

The central point is that LLM tools do not have a representation of the world. What they do, with astonishing effectiveness, is generate plausible language sequences based on statistical patterns learned during training. They predict the next word in a sequence, with remarkable precision and on a global scale: they are trained on billions of texts, dialogues, articles, manuals, websites. However, their competence is purely linguistic, not epistemic. They do not verify; they “make it plausible”. And here is the problem (all on our human side, by the way): too often we treat them as if they knew. We question them with the same trusting attitude we would bring to an expert professor. We trust their style, their argumentative composure. We mistake the coherence of language for a corresponding coherence of thought. And at that very moment, we are delegating not only the search for information, as we used to do with paper encyclopedias first and Google later, but also the very structure of judgment. And the fact that it works in many cases only risks reinforcing the misunderstanding. When we confuse a well-constructed sentence with reliable content, the problem is all in our head. Should we not also be asking the LLM parallel questions, such as “Is this site reliable?”, “Does this source tell the truth?”, “Is this information correct?”
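To make the point about “statistical patterns, not a representation of the world” concrete, here is a deliberately minimal toy sketch in Python. Real LLMs use deep neural networks over subword tokens rather than word counts, so this is only an illustration of the principle: continue the text by sampling whatever statistically tends to come next.

```python
import random
from collections import defaultdict, Counter

# Toy illustration only: production LLMs are neural networks over subword
# tokens, not bigram counts, but the principle is the same -- extend the text
# by sampling what statistically tends to follow.
corpus = "the source is reliable . the source is biased . the article is reliable .".split()

# Count which word follows which (a crude stand-in for "training").
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start, n_words=6):
    """Extend a prompt by repeatedly sampling a statistically likely next word."""
    out = [start]
    for _ in range(n_words):
        counts = follows[out[-1]]
        if not counts:
            break
        words, weights = zip(*counts.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

print(generate("the"))  # e.g. "the source is reliable . the article"
```

The output can look fluent and even sensible, yet nothing in the procedure checks whether the source actually is reliable: the generator has no world to refer to, only frequencies.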

The deeper ethical question is: can a purely statistical engine (“purely” is a heavy word, knowing how much work and energy expenditure lie behind constructing such statistical correlations) support this delegation? If we ask the LLM to rate the reliability of a source, how does it really operate? Does it build some criterion, or does it, again, simulate an answer? Looking for an answer, a group of computer scientists and psychologists at Sapienza University in Rome recently published a study in PNAS (https://www.pnas.org/doi/10.1073/pnas.2518443122) titled “The simulation of judgement in LLMs”. They put six of the main models in use today (GPT-4o, Gemini, Mistral, Flash, Llama, DeepSeek) side by side with groups of human evaluators, including experts. Everyone, models and people, was entrusted with the same task: to judge the credibility of hundreds of information sites. On the surface, it was a simple task: classify the sources as reliable or unreliable and justify your choice. But behind this apparent simplicity lies the key question: what counts as proof of reliability? What signals do LLMs use to distinguish true from false? Not surprisingly, the ratings produced by the models in the Rome experiment were often very close to those of the human experts. But the processes leading to the judgment appeared radically different, given that LLMs rely on linguistic patterns, not on “reasoning” (if we trust human reasoning as such). LLMs identify keywords, frequent signals, expressions that co-occur with certain labels. They do not read the content like a human does; they are programmed to map it. And when they produce an explanation, they are not arguing: they are statistically extending an instruction. The judgment is simulated; the epistemology is absent. Personally, I am quite satisfied with the way LLMs currently function, and at the same time quite critical of the epistemic abilities of the average human, so the problem raised by this study becomes even more interesting (at least to me).
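To make concrete what “close to the human experts” means at the level of outputs, here is a minimal sketch of how one might measure agreement between model labels and expert labels. The site names and labels below are invented for illustration; they are not data or methods from the PNAS study.

```python
# Hypothetical labels for illustration only -- not data from the PNAS study.
expert_labels = {"site_a": "reliable", "site_b": "unreliable",
                 "site_c": "reliable", "site_d": "unreliable"}
model_labels  = {"site_a": "reliable", "site_b": "unreliable",
                 "site_c": "unreliable", "site_d": "unreliable"}

# Raw agreement: the fraction of sites where the model matches the experts.
shared = expert_labels.keys() & model_labels.keys()
agreement = sum(expert_labels[s] == model_labels[s] for s in shared) / len(shared)
print(f"agreement: {agreement:.2f}")  # 0.75 in this toy example

# Agreement on the output says nothing about the process: identical labels can
# come from verifying content (experts) or from matching surface patterns (LLMs).
```

This is exactly the gap the study highlights: output-level agreement can be high while the route to the label is entirely different.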

The experimental analysis went further and also asked how LLMs handle political bias. Not whether LLMs possess a bias, but how bias manifests itself when an LLM has to recognize one. The models were asked to read a text, detect any ideological imbalance, and motivate their judgment. The challenge was twofold: (1) to identify the bias, and (2) to argue for it. Here too, the answers were linguistically correct and stylistically fluid, but epistemically they appeared weak. As each of us may have noticed in other contexts, when put under a little “pressure” the explanations given by ChatGPT look more like elegant paraphrases than well-founded assessments. The LLM does not reconstruct the reasoning, but typically recycles sentences from the same text, in a more neutral and “decorous” tone.

An interesting observation from the Rome study was the emergence of a systematic trend: the models more often judged content associated with sources from the political right as unreliable or polarized. Not because they have a personal opinion about (human) politics, but because they reflect the dominant patterns in the data on which they were trained. As it appears, in academic, journalistic and digital environments, right-leaning positions are more frequently treated with critical and dismissive tones. The models learned that pattern and reproduced it, of course without “understanding” it. (At this point you may also start wondering what a human “understands” when he or she emits a political judgement…) In the absence of criteria of their own (but should they have any?), LLMs do not evaluate: they replicate prevailing opinions, acting like a distorting mirror that amplifies features – in this case the statistical frequencies of the training set – so that what may seem like a neutral assessment is, in reality, just a reflection of the environment that generated them, like Sylvia Plath’s mirror with no preconceptions.

To me, this makes a deep fracture visible. On the one hand, we tend to view human thinking as based, at least in theory, on some hard principles: context, comparison, intentionality. On the other hand, the statistical reflex learned by the LLM shows that when “thinking” is devoid of intention and awareness, it merely replicates correlations, not criteria. Besides prompting criticism of how much we should trust LLMs, this also prompts criticism of how much we should trust our own judgment. How much do we ourselves rely on correlations instead of fact-based knowledge? And where does our “knowledge” lie, after all?

The most evolved LLMs of today are beginning to behave like agents. They collect information, select sources, combine answers, and eventually make decisions. This is where the big bet of the major AI players is placed today: AI agents able to carry out autonomous tasks, from legal counseling to medical advice, from customer service to policy analysis. The Rome group carried out another experiment in which both models and humans operated as agents, with the same tools available, the same resources, and the same task to perform: (1) a web page from which to start, (2) two articles to consult, (3) six evaluation criteria, (4) a limited time, and finally (5) a request for judgment (sketched schematically below). The results confirmed what a very simple intuition would suggest: humans tend to use rhetorical, stylistic and emotional criteria, overvaluing the tone of the text, its professionalism, its balance. Current LLMs instead appear to rely on structural traces, lexical signals associated with reputation or ideology. When they assign a reliability rating to the articles examined, they are not “judging” in the human sense of the word; they are optimizing correlations in an illusory imitation of judgment.
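A schematic rendering of such a shared task, just to fix ideas. Field names, the placeholder URL and the six criteria listed here are my own illustrative assumptions; the actual protocol and materials are those defined in the PNAS paper.

```python
from dataclasses import dataclass

# Illustrative sketch of the task handed identically to humans and models.
@dataclass
class CredibilityTask:
    start_page: str            # (1) web page from which to start
    articles: list[str]        # (2) two articles to consult
    criteria: list[str]        # (3) six evaluation criteria
    time_limit_minutes: int    # (4) limited time
    request: str = "Rate the reliability of the source and justify the rating."  # (5)

task = CredibilityTask(
    start_page="https://example.org/news-site",        # placeholder, not from the study
    articles=["article_1.html", "article_2.html"],
    criteria=["accuracy", "transparency", "sourcing",
              "tone", "track record", "editorial standards"],  # invented examples
    time_limit_minutes=30,
)
```

The point of holding every field constant is that any difference in the resulting judgments can only come from how the agent, human or machine, processes the same material.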

To the authors of the study, this appears to prove a paradigm shift. If we take a different angle, however, the results point to the simple fact that LLMs are trained on a human-biased database: hence the “error” lies in the humans who produced the data, not in the machine that analyzed them. On the contrary, the machine is doing a damn perfect job. These days, in the public and institutional debate, there is increasing talk of “extended mind”, “cognitive enhancement”, “man-machine alliance”, all fascinating concepts that more likely rest on ideological constructions. Machine-generated models are not enemies, but they are not neutral partners either. They are increasingly powerful systems, capable of producing an appearance of thought while remaining plausibility-generating engines. The error our society risks is mistaking plausibility for truth. The problem, however, is not AI; it is us. This should make us question the way we humans recognize knowledge, reliability, authority. In many respects, AI reasoning is not different from human reasoning. Both humans and LLMs learn epistemic and stylistic norms through routine exposure to texts, and humans too are educated by professional, balanced or scholarly voices through long-term cultural training. These “human judgment” criteria are not innate modules, sculpted in our neurons like some biblical “Tables of the Law”; they are laboriously constructed through development, education, socialization, and feedback. So in this sense, both humans and LLMs rely on language patterns, culturally learned categories, induced associations between forms and meanings, ultimately on prior experience accumulated over an entire lifetime. But humans use statistical learning to build causal models, whereas LLMs use statistical learning instead of causal models.

The key, in my opinion, is that human judgment is not only pattern association. Humans also integrate other cognitive systems that LLMs simply do not have – at least not yet. Crucial information about the world reaches animals, and humans among them, from the physical body and its sensing capabilities, from biological homeostasis, from emotional driving forces such as survival instinct, social affiliation, curiosity (if you have a cat at home, you know what I mean). In general, from motivational systems. Human judgment, besides linking together statistical similarities, is influenced by emotions, bodily states, social goals, self-protection, self-awareness, and long-term needs. Our cognitive processes are shaped by an intuitive understanding of physical constraints and limits: the first teacher of any kid is gravity. To date, LLMs do not have intrinsic goals, emotions, or bodily signals, because they do not have a body. The self-evidence coming from the animal body supports the search for causal explanations, and the more evolved human brain adds the perception – maybe the illusion – of time passing, of a past and a future. Our brain wants to formulate predictions about unobserved variables, to reason counterfactually (“what would happen if X were different?”), to plan over long time horizons, to reason normatively from emergent shared social behavior.

I surmise that the latter is the main factor in constructing a judgment. It is on the basis of shared social norms that humans can ask questions like “What is this agent (a writer, an athlete, a policeman) trying to accomplish?”, “What is their motive?”, “Are they being honest?”, hence the origins of a whole spectrum of philosophical and political variants and their evaluations. Again, the key to human society is not just language, but the physical ensemble of bodies, its biomass. Some criticize AI, claiming with disdain that it could never reach human “intelligence”. How much harder, then, would it be to imitate the human body, with its weakness and vulnerability? It may appear retrograde to think of embodied experience in an era in which everything is information. We could easily imagine a society of machines exchanging opinions, forming majority and minority views, even subscribing to political parties of their own, but they would still be confined to deducing statistical correlations from texts. They could simulate political reasoning by pattern completion, but they possess no model of personal incentives or interpersonal dynamics, no internal causal model grounded in action. They cannot experience the sensory consequences of one political choice or another: they would neither see nor feel the effect of a bomb dropping on a building in Kyiv. Many neuroscientists argue this is the biggest gap. We may start being afraid of machines when they have a body and sensory organs. When they care about the destiny of their children.
