4 The Last Thirty Years: Data, Models, and Behavioural Science at Scale

4.1 Three Modes of Automated Persuasion

The computational literature on persuasion can be organised into three capability classes, ordered by the direction of information flow relative to the persuasion act:

Capability	Direction	Goal
Descriptive	Observation → Explanation	Explain and measure persuasion in existing content
Simulative	Content → Predicted Response	Predict human persuasive response before it occurs
Generative	Target → Optimised Content	Create maximally persuasive content for a given audience

Descriptive systems explain the mechanics of persuasion in observed content: why some arguments land and others do not, what emotional valence predicts engagement, what visual features drive ad memorability. The core output is understanding, a post-hoc account of what made existing content persuasive. Argumentation mining, sentiment analysis, and multimodal content analysis belong here.

Simulative systems predict how a given person or population will respond to a message before it is deployed. This closes the loop from observation to prediction. Single-person micro-targeting, opinion dynamics modelling, propaganda propagation forecasting, and audience selection all require simulative capacity. Large agent-based simulators now make it possible to test a message’s trajectory through synthetic populations before any human sees it.
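To make the idea of testing a message's trajectory through a synthetic population concrete, the sketch below uses the classical DeGroot averaging model of opinion dynamics. This is a deliberately simple stand-in, not the method of any simulator discussed in this chapter. The ring network, the `self_weight` parameter, and the function names are all assumptions introduced for illustration.

```python
import random


def degroot_step(opinions, neighbours, self_weight=0.5):
    """One synchronous DeGroot update: each agent mixes its own opinion
    with the mean opinion of its neighbours."""
    updated = []
    for i, own in enumerate(opinions):
        peers = neighbours[i]
        peer_mean = sum(opinions[j] for j in peers) / len(peers) if peers else own
        updated.append(self_weight * own + (1.0 - self_weight) * peer_mean)
    return updated


def simulate_message(n_agents=50, n_steps=20, n_seeded=5, self_weight=0.5, seed=0):
    """Expose a few agents to a message (opinion 1.0) and track how the
    opinion spreads through a hypothetical ring-shaped social network."""
    rng = random.Random(seed)
    # Assumption: each agent listens only to its two immediate neighbours.
    neighbours = [[(i - 1) % n_agents, (i + 1) % n_agents] for i in range(n_agents)]
    opinions = [0.0] * n_agents
    for i in rng.sample(range(n_agents), n_seeded):
        opinions[i] = 1.0  # agents who received the message
    history = [opinions]
    for _ in range(n_steps):
        opinions = degroot_step(opinions, neighbours, self_weight)
        history.append(opinions)
    return history


history = simulate_message()
final = history[-1]
print(f"mean opinion: {sum(final) / len(final):.3f}, "
      f"spread: {max(final) - min(final):.3f}")
```

On this symmetric network the population mean is preserved while the spread between the most and least persuaded agents narrows over time. LLM-based simulators replace the fixed averaging rule with agents whose responses are generated by a language model.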

Generative systems produce persuasive content optimised for a particular configuration of audience, time, channel, sender, and topic. Large language models have dramatically lowered the cost of generating fluent, targeted text at scale. This is the most powerful, and ethically fraught, capability class. If an AI system can produce a message more persuasive than any human writer could, tailored to a single individual, delivered via the channel and at the moment of greatest receptivity, the governance questions become pressing in ways they have not been before.

These three modes are concurrent capabilities, not successive stages. The most consequential deployed systems combine all three: observe at scale, simulate response, generate optimised output. The history of how that integration became possible is the subject of this chapter.

Chapter Overview

This chapter traces the most consequential structural change in the study of persuasion since Aristotle: the emergence of digital data and machine learning as instruments for observing, analysing, and ultimately generating persuasive communication at population scale.

4.2 A Before and After

For most of human history, the study of persuasion was limited by what could be directly observed: a speech in the agora, a pamphlet, an advertisement, a laboratory experiment with dozens of undergraduate participants. The mechanisms of attitude change were inferred from small samples, self-reports, and controlled but artificial stimuli. The reach of any persuasive act was difficult to measure, and the diversity of human responses to the same message was largely invisible.

The last thirty years have fundamentally altered this situation. Three developments, each building on the last, have together created a new scientific infrastructure for understanding and producing persuasion at scale:

The internet as a data-collection instrument — the digitisation of everyday behaviour into machine-readable traces.
Machine learning at scale — the development of methods capable of finding structure in datasets orders of magnitude larger than anything previously studied.
Large Language Models — systems trained on the accumulated textual output of humanity, capable of generating fluent, contextually appropriate, and persuasive communication.

4.3 The Internet and the Instrumentation of Behaviour

The internet created a new kind of scientific instrument that also happened to be a communication channel. Every search query, every click, every purchase, every moment of attention leaves a digital trace: a record of human behaviour generated as a byproduct of ordinary life rather than constructed for scientific purposes [13]. By the early 2010s, the aggregate volume of these traces (what came to be called big data) exceeded anything the social sciences had previously encountered.

[13] describe this as enabling observation of “populations and phenomena that were previously difficult or impossible to study.” Location data from mobile phones can reveal urban mobility patterns. Search engine queries can serve as real-time proxies for public health trends, most famously (and cautionarily) in the case of Google Flu Trends (GFT) [14]. GFT overshot actual influenza-like illness levels in the 2011–2012 flu season by more than 50%, and in February 2013 it predicted more than double the proportion of doctor visits for influenza-like illness that the CDC reported. Its estimates were too high in 100 of 108 weeks starting in August 2011. Social media posts have been used to infer psychological traits, predict economic indicators, and track the spread of misinformation. In each case, the underlying scientific move is the same: substitute scarce, expensive, purpose-built data collection (surveys, experiments, ethnography) with abundant, cheap, incidentally generated digital traces.

The implications for persuasion research are profound. Rather than measuring attitude change in a laboratory sample of 80 participants, researchers can observe the response of millions of people to the same message in the wild. Rather than constructing artificial stimuli, researchers can study the persuasive effects of actual political advertisements, actual product descriptions, actual public health communications, on their actual audiences.

4.3.1 Digital Traces and the Limits of Measurement

[13] are also careful to note the dangers. Digital data sources were not designed for scientific measurement and bring their own distortions: platform algorithms determine which content is seen, changing affordances alter what traces are generated, and the populations that produce digital data are not representative of humanity [16]. Most fundamentally, human behaviour is not stable under observation: the act of measuring changes what is measured, as platforms that optimise for engagement discover when they shape the very behaviour they are trying to predict.

The engagement-optimisation literature makes this tension vivid. [10] show that maximising user engagement, the primary metric of most digital platforms, does not necessarily maximise user welfare. Platforms can increase the time users spend on a service while simultaneously making them less happy, because the behaviours that are most measurable (clicks, time-on-site, shares) are the ones most subject to impulsive rather than reflective choice. The persuasion effects of platform design are therefore built into the measurement infrastructure itself.

4.4 The Wikipedia Moment: Big Data Meets Social Science

The availability of large digital corpora created an early opportunity to test whether the social sciences could make use of them. Wikipedia, with its comprehensive edit histories and contributor metadata, became a proving ground. [15] showed that large-scale analysis of Wikipedia contributions could reveal geographic imbalances in knowledge production, patterns of conflict and consensus among editors, and the dynamics of public goods contributions at a scale unattainable by any previous method.

Yet [15] also found that this research remained largely siloed by discipline: studies of Wikipedia tended to answer questions motivated by the structure of the available data rather than by established theoretical questions, and findings were rarely integrated across fields. Exploiting big data well turned out to require more and better theory, not less.

4.5 The ImageNet Inflection: Machine Learning at Scale

The second development was the maturation of machine learning into a tool capable of finding structure in datasets of unprecedented size. The pivotal moment was the 2012 ImageNet Large Scale Visual Recognition Challenge, in which Krizhevsky, Sutskever, and Hinton demonstrated that a deep convolutional neural network (AlexNet) could reduce top-5 image classification error from roughly 26% for the next-best entry to about 15%, a margin so large that it immediately shifted the direction of the entire field [6, 11].

ImageNet itself, a manually labelled database of over 14 million images across more than 20,000 categories assembled by Fei-Fei Li and colleagues, made this possible by providing a training set large enough to fit high-capacity models without severe overfitting. The lesson generalised immediately: the combination of large, well-labelled datasets and high-capacity models trained with sufficient compute produced qualitative improvements in performance on tasks previously considered intractable.

The implications for behavioural research took a decade to become clear. If models trained on millions of images could learn to recognise objects, scenes, and faces, could models trained on millions of human interactions learn to recognise, and produce, persuasion?

4.6 Large Language Models: Persuasion Industrialised

The third development was the emergence of Large Language Models (LLMs): neural networks trained on massive corpora of human-generated text to predict tokens from their context, most commonly the next token in a sequence (as in GPT-3) but also masked tokens (as in BERT). The training corpora are themselves extraordinary artefacts, aggregations of the textual output of human civilisation at a scale never before assembled for scientific or engineering purposes.

BERT [7], introduced by Google in 2018, was pre-trained on the English Wikipedia and the BooksCorpus, roughly 3.3 billion words in total.
GPT-3 [5], introduced by OpenAI in 2020, was trained on a filtered version of Common Crawl (roughly 570 GB after filtering), WebText2, two book corpora, and English Wikipedia, amounting to hundreds of billions of tokens of mostly internet text.
T5 [19] introduced a text-to-text framing that casts text classification, translation, summarisation, and question-answering as a single sequence-to-sequence task, handled by one encoder-decoder model trained on the Colossal Clean Crawled Corpus (C4), a roughly 750 GB filtered web corpus.
Subsequent models, including LLaMA [20], Falcon, and BLOOM, extended the scale to trillions of tokens and dozens of languages.

The effect of training on text at this scale is qualitative, not merely quantitative. These models acquire world knowledge, cultural context, and an implicit model of how humans communicate, argue, and persuade. A model trained to predict the next word in a persuasive political speech has, in some sense, learned something about persuasion, through statistical exposure to millions of examples.
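The notion of learning "through statistical exposure" can be illustrated with a deliberately tiny stand-in for next-token prediction. The sketch below is a bigram count model, not a neural network, and the three-sentence corpus is invented for illustration. It shows only the core idea that a predictor's output distribution is shaped by the regularities of the text it has seen.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Turn raw follow-counts for `prev` into a probability distribution."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # token never seen in training
    return {tok: c / total for tok, c in followers.items()}


# Hypothetical toy corpus of campaign slogans (not real data).
toy_corpus = [
    "vote for change today",
    "vote for progress today",
    "vote for change now",
]

model = train_bigram(toy_corpus)
dist = next_token_distribution(model, "for")
for token, prob in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(f"P({token!r} | 'for') = {prob:.2f}")
```

Because "change" follows "for" in two of the three slogans, the model assigns it twice the probability of "progress". An LLM applies the same training signal with vastly more context, parameters, and data.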

LLMs can generate persuasive messages, adapt linguistic style to specific audiences, construct arguments, anticipate counter-arguments, and calibrate emotional tone [3]. Studies have shown that GPT-4-generated persuasive essays are rated as more persuasive than human-written essays by blind evaluators [21], and that AI-generated messages can change opinions on contested political topics [8].

4.7 Large Content and Behaviour Models

The most recent development extends the LLM paradigm to behavioural data, the full range of human digital traces: clicks, views, purchases, engagement patterns, and search queries. Models in this class, which we term Large Content and Behaviour Models (LCBMs) following [12], are trained on joint corpora of content and the behavioural responses that content elicited from real users at scale.

Where LLMs learn what people say, LCBMs learn what people do in response to what they see. This distinction is significant for persuasion research. A model that has been trained on the engagement patterns of millions of users across millions of pieces of content has, in some sense, learned a generalised model of persuasive effectiveness: what kinds of content, in what contexts, for what audiences, produce what behavioural changes.

[12] demonstrate that LCBMs trained on large-scale content-behaviour corpora significantly outperform content-only models on downstream behavioural prediction tasks, including engagement prediction, sentiment forecasting, and audience response modelling. The intuition is straightforward: knowing that a message was watched, shared, or acted upon by millions of people provides training signal that text alone cannot supply.

4.8 LLMs as Simulators of Human Behaviour

A striking recent trend has been the use of LLMs not as generators of persuasion but as simulators of human responses to persuasion, in effect as synthetic subjects for social science experiments.

[2] showed that GPT-3 can be “conditioned” on demographic profiles (age, education, political affiliation, geographic region) to produce responses that closely mirror the survey responses of those groups, a property they term “algorithmic fidelity.” Working with three waves of the American National Election Studies (2012, 2016, and 2020) and thousands of socio-demographic backstories, they created “silicon subjects” that completed the same survey tasks as their human counterparts. The simulated subjects reproduced complex patterns of relationships between political attitudes, demographic identities, and contextual cues that previously required large probability samples to detect. The implication is that LLMs internalise statistical representations of group-level opinion distributions that can be queried at near-zero marginal cost.
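The conditioning step described above can be sketched as prompt construction: a demographic profile is rendered as a first-person backstory and prepended to a survey question. The sketch is illustrative only. The `Profile` fields, the backstory wording, and the `complete` callable (a placeholder for whatever text-completion function the reader supplies) are assumptions, not the prompts or interface used in [2].

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Profile:
    """Hypothetical demographic profile for one synthetic respondent."""
    age: int
    education: str
    party: str
    region: str


def build_backstory_prompt(profile: Profile, question: str) -> str:
    """Render a profile as a first-person backstory followed by a question."""
    backstory = (
        f"I am {profile.age} years old. "
        f"My highest level of education is {profile.education}. "
        f"Politically, I identify as {profile.party}. "
        f"I live in the {profile.region} of the United States."
    )
    return f"{backstory}\n\nQuestion: {question}\nAnswer:"


def simulate_responses(
    profiles: Iterable[Profile],
    question: str,
    complete: Callable[[str], str],
) -> List[str]:
    """Ask the same question once per profile.

    `complete` is any function mapping a prompt to a model completion;
    no particular model or API is assumed here.
    """
    return [complete(build_backstory_prompt(p, question)) for p in profiles]


profiles = [
    Profile(67, "a high school diploma", "a Republican", "Midwest"),
    Profile(24, "a bachelor's degree", "a Democrat", "Northeast"),
]
prompt = build_backstory_prompt(profiles[0], "Who did you vote for in 2016?")
print(prompt)
```

Passing the model in as a callable keeps the sketch independent of any provider and makes it easy to substitute a stub when testing the prompt logic.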

[1] extended this finding, showing that LLMs can replicate the results of classic human-subject experiments, including Milgram’s obedience study and the Ultimatum Game, without any task-specific training. In these replications the simulated responses reproduce the direction of the original effects, although the authors also report systematic distortions in some tasks.

More directly relevant to persuasion, [4] found that GPT-4 could predict the results of randomised controlled experiments on political persuasion with high quantitative accuracy. Across 476 experimental effect sizes drawn from 70 U.S. survey experiments, GPT-4’s predicted treatment effects correlated with actual outcomes at r = 0.85 (adjusted r = 0.91). For unpublished studies that could not have appeared in GPT-4’s training data, the correlation was even higher at r = 0.90 (adjusted r = 0.94), suggesting that the model has learned regularities about what persuades rather than merely memorising prior experimental results.
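The headline statistic in this comparison is a Pearson correlation between model-predicted and experimentally observed treatment effects. A minimal sketch of that calculation follows. The five effect sizes are invented for illustration and are not taken from [4], and the adjusted figures quoted above are not reproduced because the adjustment procedure is not described here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical treatment effects (e.g. change in attitude on a 0-1 scale).
predicted_effects = [0.02, 0.10, 0.05, 0.20, -0.03]  # model predictions
observed_effects = [0.01, 0.08, 0.07, 0.15, -0.01]   # experimental estimates

r = pearson_r(predicted_effects, observed_effects)
print(f"r = {r:.2f}")
```

A high correlation indicates that the model ranks and scales effects similarly to the experiments. It does not by itself show that predicted magnitudes are unbiased, which requires comparing the predicted and observed values directly.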

[17] demonstrated that populations of LLM agents, given minimal backstory descriptions, can simulate emergent social dynamics, including opinion polarisation, information cascading, and social influence, that closely mirror patterns observed in real human populations.

[18] showed that LLMs can capture human personality traits as measured by standardised psychometric instruments (Big Five, MBTI), reproducing known correlations between personality and persuasibility. In their study, simulated subjects high in Openness showed significantly stronger preference for personality-matched advertisements (β = 0.36, p < 0.001), as did those high in Extraversion (β = 0.40, p < 0.001) and Conscientiousness (β = 0.29, p = 0.020), effect sizes in line with the prior empirical literature. Crucially, the LLM-generated synthetic subjects reproduced these correlations without any task-specific training, implying the model had absorbed the personality–persuasibility regularity from its pre-training corpus alone.

The Limits of LLM Simulation

These findings should be interpreted carefully. LLMs are trained predominantly on text produced by Western, English-speaking, educated, and digitally connected populations, what Henrich and colleagues [9] term the WEIRD bias (Western, Educated, Industrialised, Rich, Democratic). Simulation accuracy degrades for populations underrepresented in training data: people over 65, the highly religious, non-English speakers, and communities without significant digital footprints [13]. The claim that LLMs are “human simulators” is therefore a claim that they are simulators of particular humans, a limitation with significant implications for the use of LLM simulation in persuasion research.

4.9 What This Means for the Study of Persuasion

The developments traced in this chapter (digital instrumentation, large-scale machine learning, and LLMs) have changed the object of study itself.

Persuasion now operates at the intersection of algorithmic content selection, platform incentive structures, user behavioural data, and industrially generated content. The mechanisms by which a message reaches a receiver, and the mechanisms by which the receiver’s response feeds back to determine what message they see next, are now computational and largely invisible to both sender and receiver.

Understanding persuasion in this environment requires the integration of the classical persuasion literature (elaboration likelihood, attitude change, source credibility) with the emerging literature on algorithmic mediation, behavioural data at scale, and LLM capabilities. That integration is the animating ambition of this review.

Table 4.1: Evolution of the empirical infrastructure for persuasion research.

Era	Primary data source	Primary method	Scale of observation
Pre-digital (–2000)	Surveys, experiments	Statistical inference	Hundreds to thousands
Early digital (2000–2012)	Web traces, social media	Descriptive analytics, NLP	Millions
Big data / deep learning (2012–2020)	Platform logs, images, video	Deep neural networks	Billions
LLM era (2020–)	Multimodal corpora + behavioural signals	Foundation models, LCBMs	Trillions of tokens

[1]

Aher, G.V. et al. 2023. Using large language models to simulate multiple humans and replicate human subject studies. Proceedings of the International Conference on Machine Learning. (2023).

[2]

Argyle, L.P. et al. 2023. Out of one, many: Using language models to simulate human samples. Political Analysis. 31, 3 (2023), 337–351.

[3]

Bai, H. et al. 2023. The potential of generative AI for personalized persuasion at scale. Scientific Reports. (2023).

[4]

Broockman, D. et al. 2024. Can LLMs replace human evaluators? Predicting the results of social science experiments. arXiv preprint arXiv:2406.14508. (2024).

[5]

Brown, T.B. et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems. 33, (2020).

[6]

Deng, J. et al. 2009. ImageNet: A large-scale hierarchical image database. Proceedings of the IEEE conference on computer vision and pattern recognition (2009), 248–255.

[7]

Devlin, J. et al. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT. (2019), 4171–4186.

[8]

Hackenburg, K. and Margetts, H. 2024. Evaluating the persuasive influence of political microtargeting with large language models. Proceedings of the National Academy of Sciences. 121, 24 (2024).

[9]

Henrich, J. et al. 2010. The weirdest people in the world? Behavioral and Brain Sciences. 33, 2–3 (2010), 61–83. https://doi.org/10.1017/S0140525X0999152X.

[10]

Kapoor, S. et al. 2024. Measuring and improving the well-being of users in recommendation systems. Proceedings of the ACM on Human-Computer Interaction. (2024).

[11]

Krizhevsky, A. et al. 2012. ImageNet classification with deep convolutional neural networks. Advances in neural information processing systems (2012).

[12]

Kumar, Y.K. et al. 2024. Large content and behavior models to understand, simulate, and optimize content and behavior. arXiv preprint arXiv:2410.02653. (2024).

[13]

Lazer, D. et al. 2021. Meaningful measures of human society in the twenty-first century. Nature. 595, (2021), 189–196.

[14]

Lazer, D. et al. 2014. The parable of Google Flu: Traps in big data analysis. Science. 343, 6176 (2014), 1203–1205.

[15]

Lemmerich, F. et al. 2019. World versus Wikipedia: Measuring the coverage of the world in language editions of Wikipedia. Proceedings of the ACM Web Science Conference. (2019).

[16]

Olteanu, A. et al. 2019. Social data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data. 2, (2019), 13.

[17]

Park, J.S. et al. 2023. Generative agents: Interactive simulacra of human behavior. Proceedings of UIST. (2023).

[18]

Pellert, M. et al. 2023. AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories. Perspectives on Psychological Science. (2023).

[19]

Raffel, C. et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research. 21, 140 (2020), 1–67.

[20]

Touvron, H. et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. (2023).

[21]

Voelkel, J.G. et al. 2024. Artificial intelligence can persuade humans on political topics. PNAS Nexus. 3, 2 (2024).