in the previous essay, i shared some of the work of physicist nima arkani-hamed - bizarre but well-worth-considering research suggesting that ‘spacetime’ may not be at the base layer of reality, but rather emergent from some deeper (currently inconceivable) building blocks.
https://www.bradwmorris.com/posts/2414
if this is even partially true, we are likely in for some strange times ahead - a cultural, philosophical and technological rug pull. ‘training and evaluating ai systems’ is one of the many things we’ll need to revisit.
why?
in a nutshell (explained in the previous essay) - if spacetime isn’t fundamental, then training AI on human language - language steeped in the assumption that it is - almost certainly limits its ability to explore and develop its own models and understandings of the deeper, more mysterious layers we humans have yet to find or understand.
most current AI training processes and pipelines, especially those of language models, require a large amount of human input. human feedback is used to rank model outputs, train reward models and benchmark performance. on top of that, the training data itself is human language. this heavy reliance on human data, language, preferences, labelling and feedback means we’re limiting the model’s ability to think for itself, outside this potentially emergent human constraint.
..
there are some recent examples suggesting that models (and the labs training them) might already be exploring outside the human / spacetime constraints.
the four research examples span Google, OpenAI, Cohere and Anthropic:
1. deepmind’s david silver on the ‘era of experience’
2. pretraining gpt-4.5 with Sam Altman and members of the 4.5 team
3. research work by max bartolo and laura ruis from cohere
4. recent papers by the anthropic interpretability team
we’ll briefly discuss each of these.
DeepMind’s David Silver on ‘the era of experience’
David Silver was a big part of the ‘AlphaGo’ and ‘AlphaZero’ projects.
way back in 2014, DeepMind embarked on the seemingly impossible - attempting to conquer the game of go. go is different to other games like chess: the rules are relatively simple, but there are so many possible moves and strategies that it’s virtually impossible to predict how or where a game is heading.
a new approach to training was required.
rather than relying solely on traditional approaches, where a computer system analyses millions of past human games, the team leaned on ‘deep reinforcement learning’ - letting AlphaGo improve by playing against itself - so it could learn for itself.
in 2016, AlphaGo played the world go champion Lee Sedol and, with the now famous ‘move 37’ (a play so bewildering to humans that everyone initially thought it was a mistake), won the match.
the clincher - deep reinforcement learning, without reliance on human language for training and prediction, and without human feedback, was the key to success, allowing the model to learn in its own way.
since AlphaGo, the team has worked on AlphaZero (a newer, more general game-playing engine with superhuman capabilities across multiple games). the ‘zero’ literally refers to zero human data or feedback in the training process - the system starts from scratch and learns entirely through self-play.
“And what we showed was that actually the resulting program, not only was it able to recover this level of performance, it actually worked better and was able to learn even faster than the original AlphaGo… you throw out the human moves altogether. It was actively limiting performance in a way.”
in the conversation, Silver explains that we (humans) need to let go of the idea that the human element in the AI training process is the important part. difficult to swallow in some regards. but it will be essential for us to allow ai to embark on its own experiential learning if we want to move into the next paradigm of intelligence.
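to make the ‘learning from its own experience’ idea concrete, here’s a toy sketch of self-play reinforcement learning. to be clear, this is not alphazero’s actual algorithm (which pairs monte carlo tree search with a policy/value network) - it’s just tabular learning on tic-tac-toe - but it shows the core shift: no human games, no human labels, only experience the agent generates for itself.

```python
# a highly simplified sketch of the self-play idea: the agent generates its
# own experience by playing against itself and learns from the outcomes.
# no human games or labels are involved anywhere.
# (this is NOT alphazero - just monte carlo value updates on tic-tac-toe.)

import random
from collections import defaultdict

EMPTY, X, O = 0, 1, 2
LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != EMPTY and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, v in enumerate(board) if v == EMPTY]

Q = defaultdict(float)          # Q[(state, move)] -> value estimate
ALPHA, EPSILON = 0.1, 0.2       # learning rate, exploration rate

def choose(board):
    moves = legal_moves(board)
    if random.random() < EPSILON:
        return random.choice(moves)                          # explore
    return max(moves, key=lambda m: Q[(tuple(board), m)])    # exploit

def self_play_game():
    """play one game against itself; return the move history and the winner."""
    board, player, history = [EMPTY] * 9, X, []
    while True:
        move = choose(board)
        history.append((tuple(board), move, player))
        board[move] = player
        w = winner(board)
        if w is not None or not legal_moves(board):
            return history, w            # w is None on a draw
        player = O if player == X else X

# train purely on self-generated experience
for _ in range(30_000):
    history, w = self_play_game()
    for state, move, player in history:
        # +1 if the mover eventually won, -1 if it lost, 0 for a draw
        reward = 0.0 if w is None else (1.0 if w == player else -1.0)
        Q[(state, move)] += ALPHA * (reward - Q[(state, move)])

start = tuple([EMPTY] * 9)
print("learned value of opening in the centre:", round(Q[(start, 4)], 3))
print("learned value of opening in a corner: ", round(Q[(start, 0)], 3))
```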
pretraining gpt-4.5
in this conversation, openai ceo Sam Altman discusses the multi-year training process for gpt-4.5 with some of the team. tbh, not much exciting or interesting was said about the training process in terms of novel algorithmic innovations.
however, there were some interesting insights shared towards the end.
“why does unsupervised learning work?”
the response was “compression”.
they went on to explain that training a model on next-token prediction through unsupervised learning works because it compresses a vast amount of human data into a simpler, more compact representation. rather than storing every detail, the model learns more efficient ways to generate or explain the data it has seen - ideas related to ‘solomonoff induction’ and ‘occam’s razor’.
“So the ideal intelligence is called Solomonoff induction. basically it’s uncertain about what universe it’s in, and it considers all possible universes, with simple ones more likely than less simple ones, and it’s fully Bayesian in its head and updates its views as it progresses”
what i think is being implied here is that the model isn’t starting with a fixed picture of the world (or universe) - it’s finding its own way to compress and represent what it learns as efficiently as it can. with more data and more compute (albeit human data), this compression ability keeps improving.
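a concrete way to see the ‘prediction is compression’ link: a model that assigns probabilities to the next token can (via arithmetic coding) encode text in roughly -log2(p) bits per token, so a lower next-token prediction loss literally means the data compresses smaller. the toy below stands in a simple token-frequency ‘model’ - nothing to do with how gpt-4.5 was actually trained - just the arithmetic of the idea.

```python
# 'prediction is compression': encoding each token with -log2(p) bits, where
# p comes from a predictive model, beats a fixed-length code - and the better
# the model predicts, the fewer bits per token are needed.

import math
from collections import Counter

text = "the cat sat on the mat the cat ate the rat".split()

counts = Counter(text)
total = sum(counts.values())

def prob(token):
    return counts[token] / total            # toy frequency-based 'model'

# code length if each token costs -log2(p(token)) bits
model_bits = sum(-math.log2(prob(t)) for t in text)
# baseline: a fixed-length code over the vocabulary
naive_bits = len(text) * math.log2(len(counts))

print(f"model-based code : {model_bits:.1f} bits ({model_bits/len(text):.2f} bits/token)")
print(f"fixed-length code: {naive_bits:.1f} bits ({naive_bits/len(text):.2f} bits/token)")
```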
work by laura ruis and max bartolo from cohere
new research by laura ruis and max bartolo (working with cohere) suggests models are building their own procedural problem-solving frameworks.
when ai models reason over procedural tasks, or work to solve multi-step problems, they don’t just retrieve information directly from memory - they pull in ideas from many different sources, and apply this information as a general procedural framework to solve problems.
‘influence functions’ were used to estimate how much a particular piece of training data affects the final answer - and what they found (to their surprise) was that the same set of documents (especially code) had an outsized influence on the model’s response. from an MLST conversation with Laura: “the only documents that seem to be influential, both positively and negatively for all types of reasoning, is code. And I tried to look into what about code makes it so influential and I couldn’t find any patterns.”
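for a rough feel of what an ‘influence function’ does, here’s a toy version on a model small enough that the maths can be done exactly. the real research uses heavy approximations to make this tractable for a full language model; the sketch below is just the textbook formulation on a tiny logistic regression with made-up data - the influence of a training example on a test prediction is, roughly, minus (test gradient) times (inverse hessian) times (training gradient).

```python
# toy influence functions: estimate how much each training example
# contributed to one particular test prediction, by combining per-example
# gradients with the inverse hessian of the training loss.
# (illustrative only - not the approximation scheme used for real LLMs.)

import numpy as np

rng = np.random.default_rng(0)

# made-up training data: 2 features, binary labels
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# fit a logistic regression by plain gradient descent
w = np.zeros(2)
for _ in range(2000):
    p = sigmoid(X @ w)
    w -= 0.5 * (X.T @ (p - y) / len(y))

def grad_loss(x, label):
    """gradient of the logistic loss for one example, at the fitted weights."""
    return (sigmoid(x @ w) - label) * x

# hessian of the average training loss (small damping for numerical stability)
p = sigmoid(X @ w)
H = (X.T * (p * (1 - p))) @ X / len(y) + 1e-3 * np.eye(2)
H_inv = np.linalg.inv(H)

# influence of every training point on one test prediction
x_test, y_test = np.array([1.0, -0.2]), 1.0
g_test = grad_loss(x_test, y_test)
influence = np.array([-g_test @ H_inv @ grad_loss(X[i], y[i]) for i in range(len(y))])

top = np.argsort(-np.abs(influence))[:5]
print("most influential training examples for this test point:", top)
print("their influence scores:", np.round(influence[top], 4))
```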
“Whereas language models are trained on language, and they probably have some sense of all these kinds of things. But they’re not constrained in the same way that we are. Language is inherently able to describe impossibilities and things that are physically not possible, and imaginings and stuff like that. So it’s also not so surprising that they show some different behavior and hallucinate and produce impossibilities.”
“But humans are learning in a very, very different environment. And we have also learned to talk about impossible situations through language and to imagine a future that is possible or not possible and reason about these things. But we're still constrained by, you know, the physical reality.”
these next quotes are from the MLST conversation with Max:
“The ‘human feedback is not a gold standard’ paper was motivated by, you know, all the interest and excitement around using human feedback .. we give humans a prompt and they see two completions, right? .. And there was massive, massive progress. And we quickly started to see diminishing returns from this.”
“So you either had, you know, a human evaluation task where humans would look at output from the model and say which they prefer, in order to kind of rank models against each other, or you train a reward model and use that kind of as a proxy for what the human would prefer. And this idea of a single preference score that captured everything about the generation just seemed quite, quite limited.”
in a subsequent paper, they went on to improve the human feedback loop by making the feedback more granular than a single preference score.
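to ground what ‘train a reward model as a proxy for human preference’ actually means, here’s a toy sketch: pairs of completions, a preference between each pair, and a model trained so the preferred one scores higher (the standard pairwise ‘bradley-terry’ style loss). the ‘completions’ here are just random feature vectors rather than real text, and this isn’t cohere’s setup - just the basic mechanic max is describing.

```python
# toy reward model: given pairs (chosen, rejected), learn a scoring function
# so that reward(chosen) > reward(rejected). the learned scalar score is the
# 'single preference score' criticised above.

import numpy as np

rng = np.random.default_rng(1)
DIM = 8
true_pref = rng.normal(size=DIM)       # hidden 'human preference' direction

# simulated comparisons: the completion better aligned with true_pref is chosen
pairs = []
for _ in range(500):
    a, b = rng.normal(size=DIM), rng.normal(size=DIM)
    chosen, rejected = (a, b) if a @ true_pref > b @ true_pref else (b, a)
    pairs.append((chosen, rejected))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# linear reward model r(x) = w.x, trained with the pairwise logistic loss
# -log(sigmoid(r(chosen) - r(rejected)))
w = np.zeros(DIM)
lr = 0.05
for _ in range(100):
    for chosen, rejected in pairs:
        margin = w @ chosen - w @ rejected
        w += lr * (1 - sigmoid(margin)) * (chosen - rejected)

# the reward model now ranks unseen completions the way the 'human' would
test_a, test_b = rng.normal(size=DIM), rng.normal(size=DIM)
print("reward(a):", round(w @ test_a, 3), " reward(b):", round(w @ test_b, 3))
print("matches hidden preference:",
      (w @ test_a > w @ test_b) == (test_a @ true_pref > test_b @ true_pref))
```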
recent papers by anthropic
here’s one of the papers.
https://www.anthropic.com/research/tracing-thoughts-language-model
and the shorter explanation video.
it’s hard to determine ‘how’ language models are able to do what they’re doing. this is largely because they’re ‘trained’, not built. the vast amount of data and connections under the hood make it virtually impossible (and prohibitively expensive) to reverse engineer.
for example, when I ask a language model (non-reasoning) a question:
‘what should I wear today?’
the response: ‘Quick check: what’s the weather like where you are today — hot, mild, rainy? And are you working, relaxing, or going out?’

there is really no way of unpacking exactly how it knew to ask me these clarifying questions. we can assume the model has seen millions of examples of humans asking each other questions with follow-up clarifications, but it’s (almost) impossible to know how much of this comes from the model genuinely ‘thinking’ about which questions to ask, versus simply knee-jerk-regurgitating a statistically likely response.
anthropic’s interpretability experiments are the closest we have to sneaking up on these unknowns.
tldr: they basically isolate/freeze specific circuits from a bigger model - you can think of ‘circuits’ like specific neural pathways in your brain that connect and fire together reliably to do a specific job. unlike the gigantic mess of connections normally firing, these specific circuits can be poked at and ‘traced’.
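to make ‘poking at a circuit’ slightly more concrete, here’s a toy numpy sketch of the bare mechanic: pick a direction inside a network that stands for some concept, dampen it mid-forward-pass, and watch the output change. this is emphatically not anthropic’s actual method (their work identifies interpretable features inside claude and traces how they interact) - it’s just the shape of the intervention.

```python
# toy 'concept dampening': suppress one internal direction during the forward
# pass of a tiny random network and compare the outputs. stand-in only -
# the network, the input and the 'concept' direction are all made up.

import numpy as np

rng = np.random.default_rng(2)

W1 = rng.normal(size=(16, 8))    # a tiny 2-layer network standing in for 'the model'
W2 = rng.normal(size=(4, 16))

def forward(x, concept_direction=None, dampen=0.0):
    h = np.tanh(W1 @ x)                          # hidden activations
    if concept_direction is not None:
        d = concept_direction / np.linalg.norm(concept_direction)
        # remove `dampen` fraction of h's component along the concept direction
        h = h - dampen * (h @ d) * d
    return W2 @ h                                # output logits

x = rng.normal(size=8)
concept = rng.normal(size=16)                    # stand-in for a 'rabbit' feature

print("normal output:   ", np.round(forward(x), 2))
print("concept dampened:", np.round(forward(x, concept, dampen=1.0), 2))
```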
one of these poking and tracing examples was a poetry request.
they fed in the beginning of a potential poem:
“He saw a carrot and had to grab it,”
and then asked claude to complete that poem - receiving the response:
“His hunger was like a starving rabbit”
but with the isolated circuit, they were able to remove (or dampen?) the ‘rabbit’ concept and get claude to re-finish the poem. it did so by completing the rhyme with “habit” instead of “rabbit”.
if i’m understanding this correctly - “rabbit” was muted *after* the start of the poem was written. if claude wasn’t thinking, and was simply regurgitating a statistically likely poem ending, muting the “rabbit” circuit midway through shouldn’t have changed anything. but it did.
“This poetry planning result, along with the many other examples in our paper, only makes sense in a world where the models are really thinking, in their own way, about what they say.”
as stated by Anthropic, these experiments suggest that models might be ‘thinking in their own way’. but what does this actually mean?
while they don’t spell it out, I think the examples suggest that the models have learned a ‘language-agnostic’ path to solving problems.
and to take this a step further, all the examples suggest that models might be somehow moving beyond their training data - and, in turn, beyond the confines of human language and its spacetime assumptions - tapping into some deeper reality.
wild times ahead