5 Resources for Persuasion Research

Chapter Overview

This chapter catalogues the datasets, benchmarks, simulation environments, and computational tools available to persuasion researchers. Resources are grouped by type: text corpora and argument quality benchmarks; multimodal advertising and emotion datasets; eye-tracking and attention corpora; social media and behavioural datasets; pre-trained models relevant to persuasion tasks; and simulation environments. Each entry includes the resource’s scope, size where known, the primary task it was designed for, and known limitations for persuasion research.

Living chapter

Entries are added as the field evolves. To suggest a resource, open an issue on GitHub.

Computational persuasion research sits at the intersection of natural language processing, computer vision, social psychology, and behavioural economics. Each of those fields has produced its own data infrastructure, and those datasets were rarely designed with persuasion as the primary target. The result is a fragmented resource landscape: strong coverage of some phenomena (argument structure, sentiment, eye movements during reading) and almost none for others (longitudinal attitude change, cross-cultural message effectiveness, behavioural outcomes of AI-generated content). This chapter maps what exists and where the gaps are.

5.1 Text Corpora and Argument Quality

5.1.1 Argument Mining Datasets

What makes a convincing argument [28] is the foundational dataset for computational argument quality research. Habernal and Gurevych collected 16,000 argument pairs from online debate platforms (ConvinceMe.net, iDebate.org, CreatedDebate.com), crowd-annotated for relative convincingness. Each pair presents two arguments on the same topic; annotators judged which was more convincing and why. The corpus also includes 9,111 argumentative sentences annotated for 15 attributes of convincingness, including specificity, emotional appeal, and use of evidence. Limitation: convincingness ratings are from crowdworkers reading in an artificial comparison context, not from people who were actually persuaded to change a position.

Persuasion of the Undecided [45] focuses on a practically important population: individuals who report no strong position on a debate topic. The dataset contains crowd-annotated argument pairs from the same debate platforms, filtered for pairs involving undecided evaluators. Because persuasion of the already-committed is structurally different from persuasion of the undecided, this resource targets a distinct phenomenon.

What makes a convincing argument: Cross-Lingual [6] extends the Habernal and Gurevych framework to German, providing one of the few argument quality datasets outside English. The corpus enables, at least in principle, cross-lingual comparisons of argument strength, though the translational mapping of persuasive norms across languages is itself an open research question.

CMV (Change My View) corpus from Reddit [67] contains 3,461 OP-reply pairs from the r/ChangeMyView subreddit, where the original poster (OP) explicitly asked to have their view challenged. The key annotation is whether the OP subsequently awarded a delta (acknowledging their view was changed). This makes it one of the few large-scale datasets where actual attitude change, rather than rated convincingness, is the outcome variable. Tan and colleagues showed that stylistic features (use of hedges, argument length, lexical diversity, and specific discourse markers) predict delta awards. Limitation: the CMV population is self-selected for willingness to be persuaded, and delta awards measure acknowledgement of a good argument rather than durable belief change.
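
The delta outcome is usually summarised as a delta rate: the share of replies that earned a delta. The sketch below shows that calculation and a simple split by reply length. The `Reply` record and its field names are hypothetical stand-ins for illustration, not the schema of the released corpus.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    # Hypothetical record layout; the released corpus uses its own schema.
    reply_id: str
    word_count: int
    delta_awarded: bool


def delta_rate(replies):
    """Fraction of replies that received a delta from the OP."""
    if not replies:
        return 0.0
    return sum(r.delta_awarded for r in replies) / len(replies)


def delta_rate_by_length(replies, threshold):
    """Delta rates for replies shorter than `threshold` words vs. the rest."""
    short = [r for r in replies if r.word_count < threshold]
    long_ = [r for r in replies if r.word_count >= threshold]
    return delta_rate(short), delta_rate(long_)


# Toy data, invented for illustration only.
replies = [
    Reply("a", 40, False),
    Reply("b", 220, True),
    Reply("c", 180, False),
    Reply("d", 35, False),
]
overall = delta_rate(replies)  # 1 delta out of 4 replies
short_rate, long_rate = delta_rate_by_length(replies, threshold=100)
```

On the toy data the overall delta rate is 0.25, with 0.0 for short replies and 0.5 for long ones. A real analysis would need to control for topic and OP, since the same OP may receive many replies.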

ChangeMyView Threads [25] extend the basic CMV resource by including full thread structure, enabling analysis of how the social dynamics of a discussion thread (who responds, in what order, with what framing) affect persuasive outcomes beyond argument content alone.

Argument quality annotations [65, 66] from Stab and Gurevych provide a 402-essay corpus with fine-grained annotation of argument components (major claim, claim, premise) and argumentative relations (support, attack). The corpus enables training and evaluation of systems that identify argument structure rather than just argument quality. Limitation: essays were written by students and crowdworkers under controlled conditions; the argument structure of spontaneous persuasive speech or social media is substantially messier.

Debate speech persuasion — the IBM Project Debater datasets [29, 69] provide argumentative essays and structured debate transcripts with quality and persuasiveness annotations. Several of these datasets include listening audience persuasion scores, making them closer to a ground-truth persuasion signal than crowdworker quality ratings.

5.1.3 Language Generation and Quality

BooksCorpus and Common Crawl (used in pre-training BERT [23] and GPT-3 [16]) yield pre-trained models that form the backbone of most contemporary computational persuasion systems. Understanding what these corpora contain, and what they over-represent, is relevant for understanding the biases in downstream persuasion models.

Social Chemistry 101 [26] contains 292,000 rules of thumb describing social norms (“it’s rude to interrupt someone”) with 4.5 million human judgements on situational variation. This resource enables modelling of norm-based persuasion: arguments that appeal to social expectations rather than factual evidence.

5.2 Multimodal Advertising and Emotion

5.2.1 Advertising Datasets

Hussain et al. (2017) Advertisement Dataset [34] contains 64,832 image advertisements from Ads of the World, annotated for sentiment, topic, and persuasion strategy. It is the largest publicly available image advertising corpus, and the annotations include persuasion strategy labels derived from Aristotle’s three appeals (ethos, pathos, logos), making it directly relevant to classical persuasion theory. Limitation: images only, no video; annotations were produced by a small expert panel.

ADVISE (Symbolism and External Knowledge) [79] provides 3,587 advertisements annotated for symbolic content: the non-literal visual associations that carry persuasive meaning (a car advertisement showing a mountain road evokes freedom, not geography). Symbolic annotation requires cultural knowledge; the dataset reflects primarily Western advertising norms.

Visual Rhetoric in Advertisements [78] annotates a subset of advertisements for rhetorical figures (metaphor, hyperbole, and irony) in visual form. A visual metaphor argues analogically through juxtaposition of two unlike images; the dataset enables training systems to recognise this fundamentally persuasive structure.

M2P2: Multimodal Persuasion Prediction [3] pairs image, text, and audio features from advertisements with crowd-rated persuasiveness scores. The dataset covers 535 distinct advertisements across a range of product categories, with ratings from Amazon Mechanical Turk workers. M2P2 is currently the most complete multimodal persuasion benchmark available, enabling systematic comparison of unimodal versus multimodal persuasion models. Limitation: ratings measure predicted persuasiveness rather than actual behaviour change; the rater pool is WEIRD-skewed.

Image and Text Persuasion [80] provides experimental data on how the relationship between image and text in an advertisement affects overall persuasive impact. The key finding, that redundant image-text pairs are less persuasive than complementary ones, is supported by the corpus and has direct implications for automated advertisement generation.

Audio Persuasion [64] extends persuasion analysis to the acoustic channel, examining how prosodic features, speaking rate, and vocal affect modulate the persuasive impact of spoken messages.

A Video Is Worth 4096 Tokens [10] addresses the bottleneck of video analysis at scale by verbalising advertisement videos into coherent text narratives using a pipeline of keyframe captioning (BLIP-2), OCR, automatic speech recognition, and brand metadata retrieval, followed by LLM-based story synthesis. The resulting text stories are evaluated on five benchmark datasets across fifteen video understanding tasks, including emotion classification, topic classification, and persuasion strategy identification. The paper also releases the first annotated dataset of persuasion strategies in video advertisements. The zero-shot approach outperforms supervised video understanding baselines on four of the five datasets.

5.2.2 Emotion Datasets

International Affective Picture System (IAPS) [49] contains 1,182 standardised emotional images rated on valence, arousal, and dominance by US undergraduate samples. IAPS is the oldest and most widely cited emotional image database; it has been used as a reference standard for affective computing for over three decades. Limitation: images are dated (many from the 1990s), the sample is exclusively American, and the valence/arousal/dominance model captures only a fraction of emotional complexity.

EmoSet [75] is a large-scale visual emotion dataset of 3.3 million images, of which roughly 118,000 carry human annotations. Images are labelled with eight discrete emotion categories (amusement, awe, contentment, excitement, anger, disgust, fear, sadness) and with attributes including brightness, colorfulness, scene type, and object presence. The dataset was constructed to enable models that connect low-level visual features to high-level emotional responses, and its total image count is roughly three orders of magnitude larger than that of IAPS.

eMotions [73] focuses on short video clips, providing frame-level and clip-level emotion annotations across a large corpus of social media video. For persuasion research, short-form video is increasingly the dominant format for political advertising, public health messaging, and commercial persuasion.

BAM! (Behance Artistic Media) [72] provides semantic and emotional annotations of artistic images, enabling the study of persuasive communication in non-photographic, stylised visual content. It is particularly relevant for branded content and design-driven advertising.

Affect in images and text [47] provides colour and texture feature annotations linked to emotional response, grounding affective computing models in low-level visual properties.

5.2.3 Behaviour-Signal Advertising Datasets

The datasets above primarily annotate content features of persuasive material. A complementary line of work uses real behavioural outcomes (clicks, likes, and other engagement signals) as labels, connecting message features to audience response.

Persuasion Strategies in Advertisements [41] provides a taxonomy of 21 persuasion strategies used by brands (appeals to authority, social proof, scarcity, reciprocity, and analogical reasoning, among others), together with a large image advertisement dataset annotated with those strategy labels. The annotation scheme is derived from marketing and rhetoric literature and applied to advertisements collected from public platforms. The dataset is used as a ground-truth resource in several downstream works on automatic persuasion strategy identification, including in video [10].

BoigBench [38] is a benchmark for behaviour-optimised image generation, containing advertisement images paired with real engagement signals (likes, shares, click-through rates) collected from social media. It supports evaluation of generative models on the task of producing images predicted to drive higher user engagement. Two baseline models accompany the benchmark: BoigLLM, which conditions a language model on engagement history to select among candidate images, and BoigSD, a Stable Diffusion variant fine-tuned with engagement as the reward signal.

ALPHA / ALPHA50M [11] aligns LLMs to advertisement engagement data. ALPHA is trained on engagement signals (likes, comments, and shares) collected from real social media ad campaigns, producing a model that predicts and generates content optimised for behavioural response rather than human preference ratings. ALPHA50M extends this to 50 million ad-engagement pairs. The approach introduces behavioural alignment as distinct from standard RLHF: the reward signal comes from observed audience behaviour rather than human annotator preferences.

SPRO (Self-Play Reward Optimisation) [36] applies self-play reinforcement learning to diffusion-based image generation, using ad engagement data as the reward. The method alternates between generating candidate images and selecting higher-performing outputs via a reward model trained on behavioural data, progressively improving the engagement profile of generated advertising images over standard supervised fine-tuning baselines.

MEMENTO [50] uses web-scale data (web pages, their associated advertisements, and implicit engagement signals from web interactions) as a learning signal for low-data advertising domains. The approach does not require manually annotated training sets for each new domain; instead, the model learns domain-appropriate content features from the distribution of web content encountered in natural browsing. Particularly relevant for advertisers in specialised verticals where labelled ad-engagement data is scarce.

5.3 Eye-Tracking and Attention

Eye movements during reading and scene viewing provide a window into cognitive processing that self-reports cannot. For persuasion research, eye-tracking data reveals what receivers actually attend to, distinct from what they report attending to.

5.3.1 Reading Corpora

Dundee Corpus contains eye movements from 10 participants reading 20 newspaper articles (~51,000 words), recorded with full-paragraph context. It is a standard benchmark for models predicting reading time as a function of linguistic features (frequency, surprisal, syntactic complexity). Reading time is a proxy for processing effort, which in turn predicts encoding depth.
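
Surprisal, one of the predictors named above, is conventionally defined as the negative log probability of a word given its context. The sketch below computes it in bits; the word probabilities are invented for illustration and would in practice come from a language model.

```python
import math


def surprisal_bits(probability):
    """Surprisal in bits: -log2 P(word | context)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical conditional probabilities, chosen as powers of two
# so the arithmetic is easy to check by hand.
word_probs = [("the", 0.5), ("minister", 0.125), ("resigned", 0.03125)]
surprisals = {word: surprisal_bits(p) for word, p in word_probs}
# "the" -> 1 bit, "minister" -> 3 bits, "resigned" -> 5 bits
```

Reading-time models of the kind benchmarked on this corpus typically regress per-word reading time on such surprisal values alongside frequency and length.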

Provo Corpus [46] provides eye-tracking and self-paced reading data for 55 short texts (~2,700 words in total), with cloze probability norms for each word position. Its combination of oculomotor data and predictability norms makes it the richest available resource for studying how expectation violation (a core mechanism of persuasive surprise) affects processing.

CELER [9] contains eye-tracking recordings from 365 participants, both native (L1) English speakers and L2 English learners, reading 5,000 sentences. Its scale enables statistical modelling of individual differences in reading, relevant for personalised persuasion research that wants to account for varying processing depths across recipients.

CMCL Shared Task Corpora [32, 33] provide eye-tracking data integrated with NLP benchmarks, enabling direct comparison of human reading patterns with model attention patterns.

5.3.2 Scene Viewing and Advertisement Attention

Human Attention in Image Captioning [30] links eye-tracking data during image captioning to the saliency maps produced by neural captioning models. For persuasion, the dataset enables analysis of whether model attention aligns with human attention during the processing of complex visual scenes.

Scanpath datasets — ScanGAN360 [48], ScanpathNet [8], and ScanPathApp1 provide scanpath (sequence of fixations over time) data during scene viewing, enabling models that predict where people look, in what order, and for how long. Scanpath prediction is relevant to understanding the temporal unfolding of attention during advertisement processing.

Eye-tracking in NLP benchmarks — the ZuCo [31] and related datasets provide simultaneous EEG and eye-tracking during reading of sentences with various complexity levels. These enable study of neural processing correlates of linguistic difficulty, with implications for message design.

Gaze embeddings for zero-shot classification — Karessli et al. [37] demonstrate that eye-tracking patterns during image viewing carry semantic information sufficient for zero-shot object recognition, evidence that gaze data encodes meaningful representations of visual attention that go beyond simple saliency.

5.5 Pre-Trained Models and APIs

The following computational models are the primary infrastructure for persuasion research.

5.5.1 Language Models

GPT-3 / GPT-4 [16, 51] — the GPT family from OpenAI is currently the most widely used infrastructure for LLM-based persuasion experiments, simulation studies, and automated message generation. GPT-4 specifically has been used by [15] to predict persuasion experiment outcomes and by multiple groups to generate personalised persuasive messages for controlled experiments. Accessible via the OpenAI API.

LLaMA / Vicuna [18, 68] — open-weight alternatives to the GPT family, enabling local deployment and fine-tuning on domain-specific persuasion corpora without API dependency. The LLaMA-family models have been used in persuasion research where full access to model internals (activations, logit distributions) is required.

BERT [23] — the standard encoder model for classification tasks in NLP, including argument quality classification, stance detection, and persuasion strategy identification. Fine-tuned BERT variants remain competitive on most persuasion classification benchmarks.

T5 / Flan-T5 [19, 58] — the text-to-text framework enables unified treatment of persuasion classification and generation tasks. Instruction-tuned Flan-T5 models perform well on argument quality benchmarks without task-specific fine-tuning.

LCBM (Large Content and Behavior Models) [42] — a framework proposed specifically for modelling the relationship between content and downstream behavioural response. Unlike the general-purpose models above, LCBM explicitly targets the link between message features and audience behaviour, making it the closest currently available model to a persuasion-specific foundation model.

LaMP (Language Model Personalization benchmark) [59] provides seven personalised NLP tasks spanning classification (citation identification, news categorisation, product rating) and generation (headline writing, scholarly title generation, email subject generation, tweet paraphrasing). Each task pairs task inputs with a user profile (that user’s prior outputs) and evaluates how well LLMs adapt to individual style and preference. LaMP is the current standard benchmark for personalised generation, using retrieval augmentation as the personalisation mechanism and reporting user-based and time-based data splits.

FSPO (Few-Shot Preference Optimization) [61] frames LLM personalisation as a meta-learning problem: given a small set of labelled preference pairs from a new user, the model rapidly constructs a personalised reward function and generates responses aligned with that user’s preferences. FSPO is trained on over one million synthetic personalised preferences spanning three domains and achieves strong performance on held-out user evaluation. It addresses a structural weakness of standard RLHF: aggregating preferences across all users loses individual variation. That weakness is directly relevant to personalised persuasive message generation.

Transsuasion / PersuasionBench [63] introduces the task of transsuasion: given a low-performing tweet, rewrite it to achieve the engagement level of a high-performing semantically equivalent tweet by the same author, transferring persuasive impact while preserving content. PersuasionBench pairs tweets written by the same user on the same topic where one version received significantly more engagement. The companion PersuasionArena evaluates LLMs on this and related persuasiveness tasks, demonstrating that persuasive ability scales with model size and that smaller LLMs can be brought to parity with larger ones through targeted fine-tuning.

Behavior-LLaVA [62] fine-tunes a vision-language model on human behavioural signals (video replay graphs, likes, and comments) rather than on human-annotated captions or preference labels. Training on behavioural outcomes teaches the model which visual and semantic features drive viewer engagement, producing representations that transfer to downstream content understanding tasks (emotion recognition, persuasion strategy detection) with improved accuracy over caption-supervised baselines.

5.5.2 Vision-Language Models

BLIP-2 [43] supports natural-language querying of image content, which allows structured annotation of visual persuasive elements (what is shown, what emotion is evoked, what claim is implied) at scale. Relevant for large-scale analysis of advertising corpora.

VideoChat [44] extends BLIP-style vision-language interaction to video, enabling natural-language querying of video advertisement content. This is relevant for the emerging literature on video persuasion, where manual annotation at scale has been the bottleneck.

Segment Anything (SAM) [39] and Track Anything [76] provide universal segmentation for images and video, enabling identification of objects, faces, and text regions within advertising content without task-specific training.

5.5.3 Multimodal Processing Tools

PySceneDetect [14] — a Python library for automated detection of scene cuts and transitions in video. For persuasion research, scene-cut detection is a prerequisite for analysing video advertisement structure (number of cuts per second is a standard measure of production intensity linked to emotional engagement).
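
Once a scene detector has produced a list of scenes, the cut-rate measure mentioned above is simple arithmetic. The sketch below assumes the detector output has already been converted to `(start_seconds, end_seconds)` tuples; that tuple format and the function names are assumptions for illustration, not PySceneDetect's own data structures.

```python
def cuts_per_second(scene_boundaries, duration_seconds):
    """Cut rate: number of scene transitions divided by video duration.

    `scene_boundaries` is a list of (start_s, end_s) tuples, one per scene.
    N scenes imply N - 1 cuts.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    n_cuts = max(len(scene_boundaries) - 1, 0)
    return n_cuts / duration_seconds


def mean_shot_length(scene_boundaries):
    """Average scene duration in seconds (0.0 for an empty list)."""
    if not scene_boundaries:
        return 0.0
    total = sum(end - start for start, end in scene_boundaries)
    return total / len(scene_boundaries)


# Hypothetical 20-second advertisement with four scenes.
scenes = [(0.0, 4.0), (4.0, 10.0), (10.0, 12.5), (12.5, 20.0)]
rate = cuts_per_second(scenes, duration_seconds=20.0)  # 3 cuts / 20 s
shot_length = mean_shot_length(scenes)                  # 20 s / 4 scenes
```

For this toy clip the cut rate is 0.15 cuts per second and the mean shot length is 5 seconds.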

GMFlow [74] — optical flow estimation for video, enabling quantification of motion dynamics within scenes. High optical flow correlates with arousal and attention capture in advertising research.

PP-OCR [24] — lightweight text detection and recognition for images and video. Advertising content frequently embeds text overlays; OCR enables this text to be extracted and analysed alongside the visual channel.

U2-Net [57] — salient object detection, identifying the primary visual subject of an image. Saliency and persuasive intent are related: advertisements are designed to direct attention to specific elements.

5.6 Simulation Environments

5.6.1 LLM-Based Population Simulation

Silicon sampling [1] — rather than a packaged tool, this is a methodology: conditioning LLMs on demographic profiles and using the resulting outputs as synthetic survey responses. Argyle and colleagues demonstrate the approach using GPT-3 conditioned on ANES respondent profiles, validating against real survey distributions. The methodology is available for replication; the key parameter choices (conditioning prompt structure, model temperature, validation procedure) are documented in the paper.
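
The two halves of the methodology, conditioning on a profile and validating against real survey distributions, can be sketched as follows. The prompt template, the profile fields, and the use of total variation distance as the comparison metric are all assumptions made for illustration; the paper documents its own prompt structure and validation procedure. The LLM call itself is omitted and replaced by a hard-coded list of answers.

```python
from collections import Counter


def build_conditioning_prompt(profile, question):
    """Hypothetical first-person backstory followed by the survey item."""
    backstory = (
        f"I am {profile['age']} years old. I live in {profile['state']}. "
        f"Politically, I consider myself {profile['ideology']}."
    )
    return f"{backstory}\nQuestion: {question}\nAnswer:"


def response_distribution(responses):
    """Relative frequency of each answer option."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {option: n / total for option, n in counts.items()}


def total_variation_distance(p, q):
    """0.5 * sum |p(o) - q(o)|; 0 means identical, 1 means disjoint."""
    options = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)


profile = {"age": 45, "state": "Ohio", "ideology": "moderate"}
prompt = build_conditioning_prompt(profile, "Do you support policy X?")

# Invented answers: `synthetic` stands in for sampled LLM completions.
human = response_distribution(["yes"] * 6 + ["no"] * 4)
synthetic = response_distribution(["yes"] * 8 + ["no"] * 2)
gap = total_variation_distance(human, synthetic)  # approximately 0.2
```

A small gap on held-out survey items is necessary but not sufficient evidence that the synthetic sample is usable; subgroup-level comparisons matter as much as the marginal distribution.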

Generative Agents [53] — a simulation environment in which 25 LLM-powered agents live in a small-town setting, with persistent memory, daily schedules, and social relationships. Agents produce emergent collective behaviours (organising a party, running an election) without explicit programming. For persuasion research, the environment enables study of how information spreads through a social network, how agents update beliefs based on peer communication, and how persuasive interventions propagate through a simulated community. Code is open-source on GitHub.

OASIS [77] is a scalable open-source social media simulator supporting up to one million LLM-based agents, designed to replicate the dynamics of platforms such as X (formerly Twitter) and Reddit. Agents have persistent profiles, a dynamically updated information environment, and diverse action spaces (post, like, share, follow). The simulator has been used to study information spreading, group polarisation, and herd effects at a scale impossible with human participants. For persuasion research, OASIS provides the closest available approximation to a full social media information environment: a controlled setting where the researcher can inject a message, observe cascade dynamics, and measure attitude shifts across a large synthetic population.

Social Agents: Collective Intelligence [12] — a multi-agent framework in which a diverse population of LLM personas, each instantiated with systematically varied demographic and psychographic profiles, independently responds to a stimulus (advertisement, message, policy proposal), with aggregate predictions outperforming any single model on behavioural outcome tasks. As a simulation environment for persuasion, it functions as a heterogeneous audience simulator: a sender can test a message against a synthetic population before deployment, observing how predicted engagement and attitude-shift vary across demographic subgroups.

AI Psychometrics simulation [54] — Pellert and colleagues demonstrate that standard psychological inventories (Big Five, moral foundations, social value orientation) can be administered to LLMs, producing stable psychological profiles that vary by model and by prompt conditioning. The framework enables systematic exploration of how LLM “personality” interacts with persuasive message design.

Pressman et al. simulacra [56] — a framework for using LLMs as proxies for specific demographic groups in persuasion experiments, with explicit attention to the validity conditions under which LLM responses correspond to human responses.

5.6.2 Agent-Based Models

Axelrod’s tournament [2] — the original computer tournament for iterated Prisoner’s Dilemma strategies is reproducible with standard software (Python axelrod library). While not a persuasion environment per se, it is the foundational simulation environment for studying the evolution of cooperative signalling, the theoretical basis of honest communication.
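
A single iterated Prisoner's Dilemma match, the building block of the tournament, fits in a few lines of plain Python. The sketch below uses the conventional payoff values (temptation 5, reward 3, punishment 1, sucker 0) and two textbook strategies; it is a self-contained illustration and does not use the `axelrod` library's API.

```python
# (my_move, their_move) -> (my_payoff, their_payoff); C = cooperate, D = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(own_history, opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"


def always_defect(own_history, opponent_history):
    """Defect unconditionally."""
    return "D"


def play_match(strategy_a, strategy_b, rounds):
    """Play `rounds` rounds and return the two cumulative scores."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        payoff_a, payoff_b = PAYOFFS[(move_a, move_b)]
        score_a += payoff_a
        score_b += payoff_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


mutual = play_match(tit_for_tat, tit_for_tat, rounds=10)       # (30, 30)
exploited = play_match(tit_for_tat, always_defect, rounds=10)  # (9, 14)
```

Tit-for-tat loses its head-to-head match against unconditional defection but earns far more when paired with a cooperative partner, which is the pattern that makes it perform well across a whole tournament.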

Agent-based social simulation [5] — the broader class of agent-based models (ABMs) used in computational social science, implemented in platforms such as NetLogo and Mesa (Python). For persuasion, ABMs enable study of opinion dynamics, information cascade, and norm propagation in networks with controlled structure. Standard models include the bounded confidence model and the DeGroot model of opinion averaging.
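
Both standard models reduce to short update rules. In the DeGroot model each agent replaces its opinion with a weighted average of all opinions, x_i(t+1) = sum_j w_ij x_j(t), with each row of weights summing to 1. In the bounded confidence model, shown here in its Hegselmann-Krause form, each agent averages only over agents whose opinions lie within a tolerance epsilon of its own. The sketch below is a minimal pure-Python version with invented starting opinions; platforms such as NetLogo or Mesa would add network structure and scheduling on top.

```python
def degroot_step(opinions, weights):
    """One DeGroot update: each opinion becomes a weighted average.

    `weights` is a row-stochastic matrix given as a list of rows.
    """
    return [sum(w * x for w, x in zip(row, opinions)) for row in weights]


def bounded_confidence_step(opinions, epsilon):
    """One Hegselmann-Krause update: average over opinions within epsilon."""
    updated = []
    for x_i in opinions:
        # An agent is always within epsilon of itself, so this is never empty.
        neighbours = [x_j for x_j in opinions if abs(x_j - x_i) <= epsilon]
        updated.append(sum(neighbours) / len(neighbours))
    return updated


def iterate(step, opinions, n_steps, *args):
    """Apply `step` repeatedly, passing any extra model parameters."""
    for _ in range(n_steps):
        opinions = step(opinions, *args)
    return opinions


# DeGroot on a three-agent line network: opinions converge to consensus.
weights = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]
consensus = iterate(degroot_step, [0.0, 0.5, 1.0], 50, weights)

# Bounded confidence: two distant camps never merge.
clusters = iterate(bounded_confidence_step, [0.0, 0.1, 0.9, 1.0], 10, 0.2)
```

In this example the DeGroot population settles on 0.5, while the bounded confidence population splits into two stable clusters near 0.05 and 0.95. That contrast is why bounded confidence is the usual starting point for modelling polarisation.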

5.7 Evaluation Frameworks and Metrics

5.7.1 Argument Quality Metrics

Standard NLP evaluation metrics (BLEU [52], METEOR [4], and BERTScore) are used for argument generation evaluation but are poorly calibrated to persuasive quality. A generated argument can be fluent and similar to a reference argument while being entirely unpersuasive, and vice versa. Habernal and Gurevych’s convincingness rankings [28] and the delta rate on CMV [67] are the most widely used task-specific metrics for argument persuasiveness.

5.7.2 Dialogue and Persuasion Metrics

LLM-as-judge [81] — using a strong LLM (GPT-4) to rate the quality of outputs from weaker models has become a standard evaluation approach. For persuasion, LLM judges can assess message fluency, coherence, and apparent persuasive intent, but there is no validated evidence that LLM-rated persuasiveness correlates with actual human attitude change.

Human preference ratings [17] — pairwise preference rating by human annotators is the most valid available proxy for persuasive effectiveness, short of measuring actual attitude or behaviour change. The Chatbot Arena methodology is the current standard for large-scale human preference evaluation of LLM outputs.

5.7.3 Long-Term Memorability

Video memorability [20, 35] datasets annotate videos for long-term memory retention: how likely is a viewer to remember this clip at a 1-week delay? Memorability is related to, but distinct from, persuasiveness: a memorable advertisement is more likely to influence future purchase decisions, but a highly memorable message may be memorable precisely because it is surprising or disturbing rather than because it changed a belief.

LAMBDA (Long-term Ad Memorability Dataset) [60] provides long-term memorability scores for 2,205 multimodal advertisements from 276 brands, collected from 1,749 participants in two-stage sessions with at least a one-day gap between exposure and recall. The dataset distinguishes brand recall from ad recognition, separating short-term recognisability from the long-term memory traces relevant to purchase-funnel effects. The accompanying model, Henry, integrates visual, cognitive, and world-knowledge representations to predict memorability. A companion dataset, UltraLAMBDA, scales to 5 million ads with automatically assigned memorability scores.

ToT2Mem [13] collects memorability signal at scale from Tip-of-the-Tongue retrieval queries on Reddit: posts where users describe content they half-remember but cannot identify. Over 470,000 content-recall pairs spanning multiple modalities are extracted from this unsupervised source, removing the scalability bottleneck of laboratory memorability studies. ToT2Mem-Video is an 82,500-pair video-recall subset. VLMs fine-tuned on this dataset outperform GPT-4o on descriptive recall generation.

5.7.4 Behaviour-Based Engagement Metrics

Click-through rate (CTR) and engagement rate are the primary outcome variables in digital advertising research. CTR measures the fraction of content impressions resulting in a user click; engagement rate aggregates downstream interactions (likes, shares, comments, saves). CREATER [71] demonstrates a contrastive learning approach that trains content generation models directly on A/B test data, using CTR differences between content variants as the training signal rather than human preference labels. The approach treats behavioural A/B data as implicit pairwise preference data and constructs a loss that pushes the model toward generating content similar to high-CTR variants and away from low-CTR variants. For persuasion research, CREATER introduces a blueprint for converting the vast stores of platform A/B test data into training signal for persuasive content generation.
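
The idea of treating A/B results as implicit pairwise preferences can be illustrated with the data-preparation step alone. The sketch below computes CTR per variant and emits (preferred, rejected) pairs whenever the CTR difference exceeds a margin. The `Variant` record, the `min_gap` threshold, and the function names are hypothetical; this is not CREATER's actual pipeline or loss, only a plausible way to build the pairs such a loss would consume.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Variant:
    # Hypothetical record for one arm of an A/B test.
    text: str
    impressions: int
    clicks: int


def click_through_rate(variant):
    """Clicks divided by impressions (0.0 when there are no impressions)."""
    if variant.impressions == 0:
        return 0.0
    return variant.clicks / variant.impressions


def preference_pairs(variants, min_gap=0.0):
    """Turn one A/B test into (preferred_text, rejected_text) pairs.

    Pairs whose CTR difference is at most `min_gap` are skipped, since a
    tiny difference is weak evidence of a real preference.
    """
    pairs = []
    for a, b in combinations(variants, 2):
        ctr_a, ctr_b = click_through_rate(a), click_through_rate(b)
        if abs(ctr_a - ctr_b) <= min_gap:
            continue
        winner, loser = (a, b) if ctr_a > ctr_b else (b, a)
        pairs.append((winner.text, loser.text))
    return pairs


# Invented A/B test with three copy variants.
test_arms = [
    Variant("A", impressions=1000, clicks=50),  # CTR 0.050
    Variant("B", impressions=1000, clicks=20),  # CTR 0.020
    Variant("C", impressions=1000, clicks=21),  # CTR 0.021
]
pairs = preference_pairs(test_arms, min_gap=0.005)
```

Here variant A is preferred over both B and C, while the B versus C comparison is dropped because the gap is below the margin. In practice the margin would be replaced by a significance test that accounts for impression counts.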

5.8 Summary: Coverage and Gaps

The table below maps available resources against the major research questions in computational persuasion. A tick indicates at least one well-validated resource; a dash indicates a significant gap.

Table 5.1: Coverage of major persuasion research questions by available resources.

Research question	Available resources	Gap
Argument quality (text)	CMV, Habernal, Stab	✓ adequate for English
Attitude change (real-world)	CMV deltas, GOTV field exps	Gap: no large-scale randomised dataset
Multimodal persuasion	M2P2, Hussain ads, IAPS	Gap: no behavioural outcomes
Longitudinal effects	ANES (coarse), deep canvassing	Gap: no content-linked panel
Cross-cultural	Habernal cross-lingual, CELER L2	Gap: no matched cross-cultural benchmark
LLM-mediated persuasion	Chatbot Arena, LaMP, FSPO, PersuasionBench	PersuasionBench nascent; no cross-platform benchmark
Simulation validity	Argyle, Pellert, Park, Social Agents	Gap: validation against real behaviour
Detection of AI content	—	Gap: no large labelled corpus
Video persuasion	M2P2, eMotions, LAMBDA, BoigBench, A Video Is Worth 4096 Tokens	Partial: engagement signals available; no attitude-change measurement
Eye-tracking + persuasion	ZuCo, Provo, CELER	Gap: ad-specific attention corpora

The two most consequential gaps are the absence of a large-scale dataset linking persuasive content to real behavioural outcomes, and the absence of a provenance-labelled corpus of AI-generated persuasive content. Both are prerequisites for the research frontiers described in Section 6.3 and Section 6.5.

[1]

Argyle, L.P. et al. 2023. Out of one, many: Using language models to simulate human samples. Political Analysis. 31, 3 (2023), 337–351.

[2]

Axelrod, R. 1984. The evolution of cooperation. Basic Books.

[3]

Bai, C. et al. 2021. M2P2: Multimodal persuasion prediction using adaptive fusion. IEEE Transactions on Multimedia. (2021).

[4]

Banerjee, S. and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization (2005), 65–72.

[5]

Bankes, S.C. 2002. Agent-based modeling: A revolution? Proceedings of the National Academy of Sciences. 99, suppl_3 (2002), 7199–7200.

[6]

Barrett, M. et al. 2016. Cross-lingual transfer of correlations between parts of speech and gaze features. Proceedings of COLING 2016, the 26th international conference on computational linguistics: Technical papers (Osaka, Japan, Dec. 2016), 1330–1339.

[7]

Baumgartner, J. et al. 2020. The Pushshift Reddit dataset. Proceedings of the international AAAI conference on web and social media (2020), 830–839.

[8]

Belen, R.A.J. de et al. 2022. ScanpathNet: A recurrent mixture density network for scanpath prediction. IEEE CVPR (2022).

[9]

Berzak, Y. et al. 2022. CELER: A 365-participant corpus of eye movements in L1 and L2 English reading. Open Mind. (2022).

[10]

Bhattacharyya, A. et al. 2023. A video is worth 4096 tokens: Verbalize videos to understand them in zero shot. Proceedings of the 2023 conference on empirical methods in natural language processing (2023), 9822–9839.

[11]

Bhattacharyya, A. et al. 2026. ALPHA: Aligning LLMs with ad engagement data. Proceedings of the AAAI conference on artificial intelligence (2026).

[12]

Bhattacharyya, A. et al. 2026. Social agents: Collective intelligence improves LLM-based predictions. International conference on learning representations (2026).

[13]

Bhattacharyya, S. et al. 2025. Unsupervised large-scale memorability modeling from tip-of-the-tongue retrieval queries. arXiv preprint arXiv:2511.20854. (2025).

[14]

Breakthrough 2023. PySceneDetect: Video scene cut detection and analysis tool. GitHub.

[15]

Broockman, D. et al. 2024. Can LLMs replace human evaluators? Predicting the results of social science experiments. arXiv preprint arXiv:2406.14508. (2024).

[16]

Brown, T.B. et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems. 33, (2020).

[17]

Chiang, W.-L. et al. 2024. Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132. (2024).

[18]

Chiang, W.-L. et al. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.

[19]

Chung, H.W. et al. 2022. Scaling instruction-finetuned language models.

[20]

Cohendet, R. et al. 2019. VideoMem: Constructing, analyzing, predicting short-term and long-term video memorability. Proceedings of the IEEE/CVF international conference on computer vision (2019), 2531–2540.

[21]

Danescu-Niculescu-Mizil, C. et al. 2012. Echoes of power: Language effects and power differences in social interaction. Proceedings of the 21st international conference on World Wide Web (2012), 699–708.

[22]

Danescu-Niculescu-Mizil, C. et al. 2012. You had me at hello: How phrasing affects memorability. arXiv preprint arXiv:1203.6360. (2012).

[23]

Devlin, J. et al. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT. (2019), 4171–4186.

[24]

Du, Y. et al. 2020. PP-OCR: A practical ultra lightweight OCR system.

[25]

Durmus, E. and Cardie, C. 2018. Exploring the role of prior beliefs for argument persuasion. NAACL: Human language technologies, volume 1 (long papers) (New Orleans, Louisiana, June 2018), 1035–1045.

[26]

Forbes, M. et al. 2020. Social chemistry 101: Learning to reason about social and moral norms. Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (Online, Nov. 2020), 653–670.

[27]

Gerber, A.S. et al. 2016. A field experiment shows that subtle linguistic cues might not affect voter behavior. Proceedings of the National Academy of Sciences. 113, 26 (2016), 7112–7117.

[28]

Habernal, I. and Gurevych, I. 2016. What makes a convincing argument? Empirical analysis and detecting attributes of convincingness in web argumentation. Proceedings of the 2016 conference on empirical methods in natural language processing (2016).

[29]

Hadoux, E. et al. 2021. Strategic argumentation dialogues for persuasion: Framework and experiments based on modelling the beliefs and concerns of the persuadee. arXiv preprint arXiv:2101.11870. (2021).

[30]

He, S. et al. 2019. Human attention in image captioning: Dataset and analysis. ICCV (2019).

[31]

Hollenstein, N. et al. 2019. Advancing NLP with cognitive language processing signals. arXiv preprint arXiv:1904.02682. (2019).

[32]

Hollenstein, N. et al. 2021. CMCL 2021 shared task on eye-tracking prediction. Proceedings of the workshop on cognitive modeling and computational linguistics (Online, June 2021), 72–78.

[33]

Hollenstein, N. et al. 2022. CMCL 2022 shared task on multilingual and crosslingual prediction of human reading behavior. CMCL shared task on multilingual and crosslingual prediction of human reading behavior (2022).

[34]

Hussain, Z. et al. 2017. Automatic understanding of image and video advertisements.

[35]

Si, H. et al. 2024. Long-term ad memorability: Understanding and generating memorable ads. arXiv preprint arXiv:2309.00378. (2024).

[36]

Jha, P. et al. 2025. Improving image generation for advertising via self-play reward optimization. Advances in neural information processing systems (2025).

[37]

Karessli, N. et al. 2017. Gaze embeddings for zero-shot image classification. IEEE CVPR (2017).

[38]

Khurana, T. et al. 2023. Behavior optimized image generation via online exploration. arXiv preprint arXiv:2311.10995. (2023).

[39]

Kirillov, A. et al. 2023. Segment anything.

[40]

Klimt, B. and Yang, Y. 2004. The Enron corpus: A new dataset for email classification research. European conference on machine learning (2004), 217–226.

[41]

Kumar, M. et al. 2023. Persuasion strategies in advertisements. arXiv preprint arXiv:2208.09626. (2023).

[42]

Kumar, Y.K. et al. 2024. Large content and behavior models to understand, simulate, and optimize content and behavior. arXiv preprint arXiv:2410.02653. (2024).

[43]

Li, J. et al. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.

[44]

Li, K. et al. 2023. VideoChat: Chat-centric video understanding.

[45]

Longpre, L. et al. 2019. Persuasion of the undecided: Language vs. the listener. Proceedings of the 6th workshop on argument mining (2019).

[46]

Luke, S.G. and Christianson, K. 2018. The Provo corpus: A large eye-tracking corpus with predictability norms. Behavior Research Methods. (2018).

[47]

Machajdik, J. and Hanbury, A. 2010. Affective image classification using features inspired by psychology and art theory. Proceedings of the 18th ACM international conference on multimedia (2010), 83–92.

[48]

Martin, D. et al. 2022. ScanGAN360: A generative model of realistic scanpaths for 360° images. IEEE Transactions on Visualization and Computer Graphics. (2022). https://doi.org/10.1109/TVCG.2022.3150502.

[49]

Mikels, J.A. et al. 2005. Emotional category data on images from the International Affective Picture System. Behavior Research Methods. 37, (2005), 626–630.

[50]

Ojha, U. et al. 2026. MEMENTO: Web as a learning signal for low-data advertising domains. arXiv preprint arXiv:2605.29795. (2026).

[51]

OpenAI 2023. GPT-4 technical report.

[52]

Papineni, K. et al. 2002. Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting of the ACL (2002), 311–318.

[53]

Park, J.S. et al. 2023. Generative agents: Interactive simulacra of human behavior. Proceedings of UIST. (2023).

[54]

Pellert, M. et al. 2023. AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories. Perspectives on Psychological Science. (2023).

[55]

Petrovic, S. et al. 2011. RT to win! Predicting message propagation in Twitter. Proceedings of the international AAAI conference on web and social media (2011), 586–589.

[56]

Pressman, J.D. et al. 2023. Simulacra aesthetic captions. Technical Report Version 1.0, Stability AI, 2022. URL https://github.com/JD…

[57]

Qin, X. et al. 2020. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognition. 106, (2020), 107404.

[58]

Raffel, C. et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research. 21, 140 (2020), 1–67.

[59]

Salemi, A. et al. 2023. LaMP: When large language models meet personalization. arXiv preprint arXiv:2304.11406. (2023).

[60]

Si, H. et al. 2024. Long-term ad memorability: Understanding and generating memorable ads. arXiv preprint arXiv:2309.00378. (2024).

[61]

Singh, A. et al. 2025. FSPO: Few-shot preference optimization of synthetic preference data in LLMs elicits effective personalization to real users. arXiv preprint arXiv:2502.19312. (2025).

[62]

Singh, S. et al. 2025. Teaching human behavior improves content understanding abilities of LLMs. International conference on learning representations (2025).

[63]

Singh, S.K. et al. 2025. Transsuasion: Introducing and measuring behavior transfer capabilities. International conference on learning representations (2025).

[64]

Singla, Y.K. et al. 2022. What do audio transformers hear? Probing their representations for language delivery & structure. 2022 IEEE international conference on data mining workshops (ICDMW) (2022), 910–925.

[65]

Stab, C. and Gurevych, I. 2014. Annotating argument components and relations in persuasive essays. Proceedings of COLING 2014, the 25th international conference on computational linguistics: Technical papers (2014), 1501–1510.

[66]

Stab, C. and Gurevych, I. 2017. Parsing argumentation structures in persuasive essays. Computational Linguistics. 43, 3 (2017), 619–659.

[67]

Tan, C. et al. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. Proceedings of the 25th international conference on World Wide Web (2016), 613–624.

[68]

Touvron, H. et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. (2023).

[69]

Wachsmuth, H. et al. 2017. Computational argumentation quality assessment in natural language. Proceedings of the 15th conference of the European chapter of the Association for Computational Linguistics: Volume 1, long papers (2017), 176–187.

[70]

Wang, K. et al. 2018. Retweet wars: Tweet popularity prediction via dynamic multimodal regression. 2018 IEEE winter conference on applications of computer vision (WACV) (2018), 1842–1851.

[71]

Wei, P. et al. 2022. CREATER: CTR-driven advertising text generation with controlled pre-training and contrastive fine-tuning. arXiv preprint arXiv:2205.08943. (2022).

[72]

Wilber, M.J. et al. 2017. BAM! The Behance artistic media dataset for recognition beyond photography. Proceedings of the IEEE international conference on computer vision (2017), 1202–1211.

[73]

Wu, X. et al. 2023. eMotions: A large-scale dataset for emotion recognition in short videos.

[74]

Xu, H. et al. 2022. GMFlow: Learning optical flow via global matching.

[75]

Yang, J. et al. 2023. EmoSet: A large-scale visual emotion dataset with rich attributes. Proceedings of the IEEE/CVF international conference on computer vision (2023), 20383–20394.

[76]

Yang, J. et al. 2023. Track anything: Segment anything meets videos.

[77]

Yang, Z. et al. 2025. OASIS: Open agent social interaction simulations with one million agents. arXiv preprint arXiv:2411.11581. (2025).

[78]

Ye, K. et al. 2019. Interpreting the rhetoric of visual advertisements. IEEE Transactions on Pattern Analysis and Machine Intelligence. 43, 4 (2019), 1308–1323.

[79]

Ye, K. and Kovashka, A. 2018. ADVISE: Symbolism and external knowledge for decoding advertisements. Proceedings of the European conference on computer vision (ECCV) (2018), 837–855.

[80]

Zhang, M. et al. 2018. Equal but not the same: Understanding the implicit relationship between persuasive images and text. arXiv preprint arXiv:1807.08205. (2018).

[81]

Zheng, L. et al. 2024. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems. 36, (2024).