early trends showing a different ai race for your data
1. introduction to the personalisation loop
2. the trajectory - the hyperscaler mission continues
3. the importance of collecting and curating contextual user data
4. value will accrue to the contextualizors
5. restructuring efforts
6. everything becomes a data extraction ‘wrapper’
7. the single contextual data substrate, and centralisation concerns
8. the consciousness simulation
introduction
in the most recent Dwarkesh pod with Zuck, it was the ‘personalisation / context loop’ references I found most interesting. meta seems less interested in benchmarks and the race to agi, and more interested in this direct user interaction loop. i.e. what you read, who you connect with and how you interact with ai itself, being looped back into the experience.
“especially once you get the personalization loop going, which we’re just starting to build in now really, from both the context that all the algorithms have about what you’re interested in — feed, your profile information, your social graph information — but also what you're interacting with the AI about. That’s going to be the next thing that's super exciting.”
reading between the lines - this does sound somewhat like meta conceding defeat in the agi race, and pivoting to leverage their enormous data extraction empire by creating more unique personal ai experiences.
either way, these are the early stages of a broader trend in consumer ai - collecting and curating rich contextual user data will soon matter more than cranking up model intelligence.
nevertheless, the big labs will continue to crank.
the race to increasingly general intelligence will play out centre stage - but the battle for consumer ai will happen at the wrapper-layer. as part of this decoupling, virtually the entire internet will become an ai wrapper - less concerned with model progress, increasingly concerned with the collection and curation of contextual user data.
the trajectory - the hyperscaler mission continues
in just a few years, we’ve gone from:
“explain quantum entanglement”,
to
“write me a python script that simulates superposition vs. entanglement.”
we will soon(ish) bump up against the limit of improvements from raw hardware and flops, but progress will likely continue through better data and better algorithms. the big labs will continue their mission toward increasingly general intelligence - scaling compute, refining software and training methods, and “unhobbling” models with breakthroughs in architecture and inference.
it’s unclear when (or if) this progress will slow. it’s also unclear when, or how exactly, we get to agi.
so why are these labs building towards agi if contextual user data is more important than cranking up model intelligence?
the hyperscalers are not playing to build a better chatbot experience for plebs like you and me - they are chasing ‘keys to the kingdom’ level intelligence, intelligence that will fundamentally reshape the world - geopolitically, technologically, economically, philosophically. the payoff for which is unimaginable.
crudely, we already know this to be the case because the numbers don’t add up.
openai is projecting a billion users by the end of the year (10%+ of the world's population!) and tens of billions in revenue. even combining subscriptions, enterprise plans and api usage fees, they’re still nowhere near break-even.
see the lex conversation with Dylan Patel and Nathan Lambert.
the importance of collecting and curating contextual user data
think about it - how will ‘most’ people use ai ‘most’ of the time?
most individuals don't need the level of intelligence required to unify quantum mechanics and spacetime, or develop a novel cure for cancer.
most people are looking for a smart friend, that’s it. someone they can turn to for advice, for help, for companionship. to help them learn new things, to be more productive, to find meaning and purpose.
if ai hits a ‘good enough’ threshold for most tasks, bigger brains will become less important than the collection of rich, contextual user data.
another way to think about this - in most cases, you’d be better served by a ‘smart friend’ who knew you really well, versus having access to a giga-brain genius acquaintance who knew nothing about you.
value will accrue to the contextualizors
up until now, most ai products and services have focused primarily on improvements to the underlying models - bigger brains, better service. but past a certain threshold of intelligence, the companies and services that capture the most value won’t be those with the smartest, newest models - it will be those that have solved for the seamless collection and curation of rich, contextual user data.
this contextual layer will be an ever-growing substrate of your personal, experiential data - used to enhance your interactions, collected and fed back into the models and services you interact with. most people will ultimately hand over their data to these services, because the experience/payoff for doing so will be significantly better with this contextual layer.
we’re already seeing major labs and engineers take an increasing interest in ‘memory’, ‘caching’, and RAG pipelines that pull contextually relevant information into the prompt. I assume we’ll also see continuous fine-tuning, where contextual user data is fed back into the weights of unique model instances - probably through some form of distillation.
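the retrieval half of such a ‘memory’ pipeline can be sketched in a few lines. this is a toy illustration, not any lab’s actual implementation - it scores stored user facts against a query with simple word overlap, where a real pipeline would use an embedding model and a vector store:

```python
# toy 'memory' retrieval: rank stored user facts against a query and
# prepend the best matches to the prompt. a real pipeline would use
# embeddings + a vector store instead of jaccard word overlap.

def score(query: str, memory: str) -> float:
    q, m = set(query.lower().split()), set(memory.lower().split())
    return len(q & m) / (len(q | m) or 1)  # jaccard similarity

def build_prompt(query: str, memories: list[str], top_k: int = 2) -> str:
    ranked = sorted(memories, key=lambda m: score(query, m), reverse=True)
    context = "\n".join(f"- {m}" for m in ranked[:top_k])
    return f"known user context:\n{context}\n\nuser: {query}"

memories = [
    "user is vegetarian",
    "user lives in mexico city",
    "user prefers a window seat when booking a flight",
]
print(build_prompt("book me a flight to new york", memories))
```

the point of the sketch: the model never changes - the prompt does, and whoever holds the memory store holds the personalisation.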
we’ll also see growing interest in conversational ai, and in hardware, applications and devices that more seamlessly collect experiential streams of data - wearables, screen capture, journaling applications etc.
this is the ‘personalisation loop’.
the obvious hardware example here is meta’s Ray-Ban partnership.
there are other interesting examples too - well-funded startups like sesame, which launched with conversational ai and are now evolving toward ‘ai companions’ and wearables.
restructuring efforts
if it turns out that drastically new ideas are required in the pursuit of ‘agi’ (and there is evidence to suggest this is the case), we may see a further decoupling.
it would make sense for google, anthropic, openai etc. to build more user-experience, data collection focused applications, rather than serving up the newest naked models.
it would also make sense for them to double down on locking in large B2B contracts and partnering with, investing in, and acquiring the companies, services and wrappers with the most user data extraction potential.
this is already happening.
openai is a primary cursor shareholder, anthropic’s enterprise revenue just outpaced its consumer revenue, and google has incubated some of the most interesting consumer ai products, like notebooklm.
https://www.cursor.com/blog/series-a
notebooklm.google.com
everything becomes a data extraction ‘wrapper’
at this point, we might as well just call everything a wrapper.
claude, chatgpt and gemini are wrappers of anthropic, openai and google’s proprietary models.
whatsapp is a wrapper of meta’s open(ish)-source llama
uber and hubspot can even be considered enterprise wrappers of openai
virtually every internet company will become a wrapper of a particular (or multiple) ai companies, with a focus on contextual data collection and curation.
thin wrappers will be eaten by the increasing intelligence of the underlying models; thick wrappers will be the sticky services that manage to collect and curate contextual user data - and they will be increasingly incentivised to form relationships with the labs who serve them.
which leads to serious centralisation concerns.
the single contextual data substrate
a single, portable data layer would best serve both users and service providers.
for example, it makes no sense for me to have to re-state contextual information between the different, siloed ai services. right now, it’s annoying to maintain context between Claude, Perplexity and ChatGPT. but when every company becomes an ai wrapper, there will be more pressure to maintain wider context.
if i’m asking claude how to navigate a difficult relationship, it should have access to my whatsapp conversation history with that person. if i’m trying to book a flight between mexico city and new york, it should have access to my calendar and know my travel preferences.
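one purely hypothetical shape such a portable, user-owned context record could take - a single serialisable structure any service could consume. the field names here are illustrative, not any existing standard:

```python
# hypothetical 'portable context substrate': one json-serialisable
# record the user owns, which any ai service could read. field names
# are illustrative only - no such standard exists today.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ContextRecord:
    user_id: str
    preferences: dict[str, str] = field(default_factory=dict)
    calendar: list[dict[str, str]] = field(default_factory=list)
    conversations: list[dict[str, str]] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

record = ContextRecord(
    user_id="u-123",
    preferences={"seat": "window", "home_city": "mexico city"},
    calendar=[{"date": "2025-07-01", "event": "meeting in new york"}],
)
print(record.to_json())
```

the design point is portability: because the record is plain data rather than state trapped inside one provider, claude, perplexity or chatgpt could each read the same substrate - which is exactly what the incentives below work against.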
while a single, evolving and interoperable contextual data layer would serve everyone best, it probably won’t happen. i’d love to be proven wrong on this, but i just can’t see a world where any of the big companies would encourage/allow a user to port data between products and risk handing contextual data market share to the competition - a prisoner’s dilemma-style situation.
so it’s likely the big companies will attempt to first build their own user personalisation loops - while also creating, investing in, acquiring and/or attempting to lock in the biggest wrapper services, enabling (direct or indirect) capture of as much user data as possible.
there will be immense pressure on the biggest wrapper/ai companies to ‘lock-in’ with a specific vendor.
this is why it’s so important that products like cursor and perplexity continue to allow users to byo model, for companies like DeepSeek to provide (fully) open source models with open licences, and for projects like MCP to encourage integration, compatibility and diversity between models.
we’ll need fundamentally new innovations, regulations and pressure to allow users to own their own contextual data substrate, and ensure it’s portable across and between ai services.
projects/frameworks like MCP are a step in this direction.
https://www.anthropic.com/news/model-context-protocol
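the byo-model idea above reduces to a thin seam in the code: the wrapper talks to an interface, and the user plugs in whichever backend they like. this is a conceptual sketch, not the MCP api or any real sdk - the class names and the stand-in backend are invented for illustration:

```python
# illustrative 'bring your own model' seam. the wrapper codes against
# one Protocol; any provider's model (or a local one) can sit behind it.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class EchoModel:
    """stand-in backend; a real one would call a provider's api."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

class WrapperService:
    def __init__(self, model: ChatModel):
        self.model = model  # user-chosen, swappable without code changes

    def ask(self, prompt: str) -> str:
        return self.model.complete(prompt)

service = WrapperService(EchoModel())
print(service.ask("hello"))  # → echo: hello
```

the narrower that seam stays, the cheaper it is for a user to walk away from a vendor - which is why lock-in pressure pushes in the opposite direction.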
companies like meta and google have such a goliath advantage here, it’s hard to see how they won’t capture and lock-in a large segment of the market, and ultimately own this contextual data layer for a disturbingly large segment of the world's population.
these companies also have a headstart when it comes to siphoning data, dopamine and attention to sell ads. if any company manages to successfully merge the contextual flywheel of personal ai use with an effective advertising model, there could be a concerning runaway effect.
the consciousness simulation
an equally concerning and related consideration - what happens when, in our attempts to generate this contextual layer, we move more and more toward uploading and simulating our entire conscious experience?
as we move along this ‘latency-lowering’ trajectory toward full brain-computer interfaces (bci), concerns around who owns our digital blueprint will grow.
to unpack in the next essay.