Cognitive Archetype Extraction: Beyond Demographics for Audience Modeling
Abstract
Demographic audience models — age, gender, location — systematically under-describe the variance that actually drives online comment sections: two viewers of identical demographics routinely react in opposite ways to the same content. We present Cognitive Archetype Extraction (CAE), a schema-first pipeline that constructs cognitive rather than demographic audience models directly from a channel's real comment history.
Each viewer is encoded as a 22-dimensional feature vector spanning five categories (topic distribution, sentiment pattern, linguistic style, social behavior, and stance stability), and the vectors are clustered with HDBSCAN. An LLM then synthesizes each cluster into a labeled archetype with a 3–4 sentence cognitive profile, few-shot example comments, and derived attention/reasoning/risk/social-role attributes. The resulting archetypes power a three-phase simulation pipeline (attention filter → chain-of-thought (CoT) comment generation → `[segment × archetype]` sensitivity analysis) that predicts, for any draft script, which segments will trigger consensus, controversy, or indifference. We evaluate CAE on a Web3/AI content corpus using a composite fidelity score built from four independent metrics (Distribution Alignment via Jensen-Shannon divergence (JSD), Topic Coverage via Jaccard similarity, Controversy Prediction via F1, and Engagement Calibration via Pearson correlation), and show that simulated audiences track real audiences across all four axes.
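As an illustration of two of the four metrics, the sketch below computes a Distribution Alignment score from the Jensen-Shannon divergence and a Topic Coverage score from Jaccard similarity. The abstract does not specify how JSD is turned into a score, so mapping it as `1 - JSD` with a base-2 logarithm is an assumption here, as are the function names, the choice to compare archetype-label counts, and the handling of empty topic sets.

```python
import math


def normalize(counts):
    """Turn raw counts (e.g. comments per archetype) into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {key: value / total for key, value in counts.items()}


def jensen_shannon_divergence(p, q):
    """JSD with base-2 logarithm, which bounds the result in [0, 1]."""
    keys = set(p) | set(q)
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Terms with zero probability contribute nothing to the KL sum.
        return sum(a[k] * math.log2(a[k] / mixture[k])
                   for k in keys if a.get(k, 0.0) > 0.0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def distribution_alignment(real_counts, simulated_counts):
    """Assumed mapping: 1 - JSD, so identical distributions score 1.0."""
    p = normalize(real_counts)
    q = normalize(simulated_counts)
    return 1.0 - jensen_shannon_divergence(p, q)


def topic_coverage(real_topics, simulated_topics):
    """Jaccard similarity between the two topic sets."""
    real, simulated = set(real_topics), set(simulated_topics)
    union = real | simulated
    if not union:
        return 0.0  # assumption: no topics on either side counts as no coverage
    return len(real & simulated) / len(union)
```

Both functions return values in `[0, 1]`, which keeps them directly usable as inputs to a weighted composite.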
Key Contributions
- 22-dimensional cognitive feature vector — Topic distribution (8), sentiment pattern (4), linguistic style (5), social behavior (4), and stance stability (1), each derived from per-user aggregated comment histories.
- Density-based archetype discovery — HDBSCAN automatically determines cluster count, with per-cluster representative selection (closest-to-centroid) for few-shot grounding.
- LLM-synthesized cognitive profiles — Each cluster is transformed into a labeled archetype with reasoning style, risk cognition, social role, verbosity, and stylistic descriptors.
- Three-phase simulation pipeline — Attention filtering by Jaccard overlap, CoT + few-shot comment generation with parallel archetype execution, and `[segment × archetype]` sensitivity scoring for heatmap visualization.
- Four-metric fidelity framework — Composite score `0.30·DA + 0.30·TC + 0.20·CP + 0.20·EC` in `[0, 1]`, with proportional weight redistribution when any metric is unavailable.
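The composite score and its weight redistribution can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function name and the dictionary-based interface are hypothetical, and each metric is assumed to arrive already scaled to `[0, 1]` (how a Pearson correlation in `[-1, 1]` is mapped into that range is not stated here). An unavailable metric is represented as `None`.

```python
# Weights taken from the composite formula 0.30*DA + 0.30*TC + 0.20*CP + 0.20*EC.
WEIGHTS = {"DA": 0.30, "TC": 0.30, "CP": 0.20, "EC": 0.20}


def composite_fidelity(metrics):
    """Weighted mean of the available metrics.

    Metrics that are missing or None are dropped, and the remaining weights
    are rescaled proportionally so that they again sum to 1.
    """
    available = {
        name: value
        for name, value in metrics.items()
        if name in WEIGHTS and value is not None
    }
    if not available:
        raise ValueError("at least one metric must be available")
    for name, value in available.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    weight_sum = sum(WEIGHTS[name] for name in available)
    return sum(WEIGHTS[name] / weight_sum * value
               for name, value in available.items())


# Hypothetical inputs, for illustration only.
full = composite_fidelity({"DA": 0.8, "TC": 0.6, "CP": 0.5, "EC": 0.7})
partial = composite_fidelity({"DA": 0.8, "TC": 0.6, "CP": 0.5, "EC": None})
```

With `EC` unavailable, the remaining weights `0.30, 0.30, 0.20` are divided by their sum `0.80`, giving `0.375, 0.375, 0.25`, so the score stays in `[0, 1]` and the relative importance of the surviving metrics is preserved.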
Why It Matters
CAE is a concrete instance of a broader research program on digital clones of audiences: instead of cloning a single individual, we clone the cognitive diversity of a community. The same methodology generalizes beyond content preview to any setting that needs a faithful, manipulable simulacrum of a collective, such as governance forecasting, market research, message testing, and user-study synthesis.
Productization
The methodology ships as AudienceLens, an open-access platform where creators can register a YouTube channel and, within minutes, preview how their draft script will land — surfacing controversy, consensus, and cold segments before publishing. A live deployment is available at audiencelens.ustlabproject.live.