When future historians take on the Sisyphean task of writing a definitive account of the 2020s, they will write of an era when artificial intelligence escaped the lab and flooded the world.
Just four months after the release of ChatGPT, the passing of Gordon Moore (architect of modern computing and of the venture-backed startup) marked the beginning of the end for today’s “compute” and the dawn of a new era.
It will be an era fuelled by the exponential growth of AI: still in its infancy, but already straining classical compute capacity and the energy infrastructure that powers it.
AI Compute To Double Every Six Months?
The defining technology story of the early years of this decade will be that of “Generative AI”, with ChatGPT spreading like wildfire to half a billion new users. Such was its impact that the tech industry seemingly forgot artificial intelligence is nothing new: Alan Turing proposed “thinking machines” in his 1950 paper Computing Machinery and Intelligence.
ChatGPT arrived some 25 years after Deep Blue became the first computer to beat a reigning world chess champion, an early, high-profile case of a machine surpassing human capability at a narrow task. Deep Blue relied on brute-force search rather than machine learning, the branch of AI focused on systems that learn from data, which powers today’s models.
The field has endured two AI winters, the first from 1974 to 1980 and the second from 1987 to 1993, yet the explosive commercialization of AI in the 2020s can largely be attributed to advances built on the transformer architecture, pioneered by Google researchers in 2017.

In transformer models, text is converted into numerical representations called tokens (long used in natural language processing), which are processed within a defined “context window.” A context window is the amount of text (or data) an AI model can take in at one time to generate a response. It’s like the AI’s short-term memory span during a conversation or task.
Unlike older RNN models (recurrent neural networks, which work sequentially), the words are not read one by one. Instead the whole query is assessed at once, in parallel, using a mechanism called self-attention.
“This is the key part of transformers that makes them work so well: understanding which part of the context is important for making predictions,” says Zehan Wang (co-founder of early AI startup Magic Pony Technology, who today runs embodied AI company Paddington Robotics).
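For the technically curious, here is a toy sketch of self-attention (scaled dot-product attention) in Python. It is illustrative only: random numbers stand in for the token embeddings and projection matrices that a real model learns during training.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """x has shape (tokens, dim); wq, wk, wv are learned projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])       # how strongly each token attends to every other
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for the softmax
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ v                            # each token becomes a weighted mix of the values

rng = np.random.default_rng(0)
tokens, dim = 6, 16                               # e.g. a six-token query
x = rng.normal(size=(tokens, dim))                # token embeddings (random, for illustration)
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)        # (6, 16): every token has attended to all the others
```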
The output of self-attention is then “fed forward”, flowing in one direction through the layers of the network, without looping back, to produce an output. “Next-token” prediction is then used, where the model tries to predict which word should come next in a sentence, based on the words that came before. Like a very powerful auto-complete. These are the basic elements of how a model works behind the scenes to make AI seem smart.
Here’s an example:
“Once upon a time, in a dark forest, there was a…”
1. Context window - The model looks at your whole sentence (“Once upon a time…”). If your sentence was too long, it might forget or ignore earlier parts. But since it’s short, it’s fully in view, in the context window.
2. Next-token prediction - The AI now tries to guess what comes next—maybe “wolf”, “cabin”, or “princess”—based on patterns it has learned from tons of stories. It picks the word (token) most likely to make sense.
3. Feedforward - This is the step-by-step internal process the AI uses to score and choose that next word. It flows the input data through its layers (like a logic pipeline), computes probabilities, and picks a word.
In this case:
Input: Once upon a time, in a dark forest, there was a
Prediction: → wolf
Now the sentence becomes:
“Once upon a time, in a dark forest, there was a wolf”
Next prediction: → who
“Once upon a time, in a dark forest, there was a wolf who”
Next: → howled
Then it repeats that process over and over, one token at a time, until it finishes the fairy tale.
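For readers who want to see that loop in code, here is a minimal sketch using the open-source Hugging Face transformers library and the small GPT-2 model. GPT-2 is our choice purely for illustration; the GPT-4-class models work on the same principle at vastly greater scale.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Once upon a time, in a dark forest, there was a"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids   # the context window, as tokens

with torch.no_grad():
    for _ in range(10):                           # generate ten tokens, one at a time
        logits = model(input_ids).logits          # feedforward pass through the layers
        next_id = logits[0, -1].argmax()          # next-token prediction (most likely token)
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))             # the prompt plus the model's continuation
```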
The development of new and scalable training techniques has made it feasible to train transformer models with hundreds of billions of parameters.
These include: gradient accumulation, mixed-precision training, distributed computing, asynchronous SGD, federated averaging, communication-efficiency methods such as DiLoCo, the Pathways concept and DiPaCo, mixture of experts, branch-train-merge/mix, and RL Swarm, created by GensynAI, a 7percent portfolio company.
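To make the first of these concrete, here is a minimal sketch of gradient accumulation in PyTorch, with a toy model and random data purely to illustrate the idea: gradients from several small batches are summed before a single optimiser step, simulating a batch far larger than would fit in memory.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10)                       # stand-in for a much larger model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
accumulation_steps = 8                           # effective batch = 8 x micro-batch

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(4, 512)                      # a micro-batch of 4 random examples
    y = torch.randint(0, 10, (4,))
    loss = loss_fn(model(x), y) / accumulation_steps  # scale so gradients average out
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                         # one update per 8 micro-batches
        optimizer.zero_grad()
```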
Meanwhile, powerful chips like NVIDIA’s H100 GPUs and Google’s TPUs have dramatically sped up the matrix operations needed for training, by between 2x and 9x. This has given rise to a new scaling trend, in which the amount of AI-relevant compute capacity, measured in FLOPS (floating point operations per second), is expected to double every six months.
For perspective, the Apollo Guidance Computer that flew the Apollo 11 mission to the Moon in 1969 had no hardware floating point support, but by one estimate it was equivalent to around 85 floating point operations per second, running at roughly 1 MHz.
Today’s iPhone 16 weighs in at around 1.9 teraFLOPS (1.9 trillion floating point operations per second).

That makes the computer that took humans to the Moon tens of billions of times less powerful than the iPhone in your pocket!
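A quick back-of-the-envelope check, using the rough figures above:

```python
# Both figures are the estimates quoted in the text, not precise measurements.
apollo_flops = 85          # Apollo Guidance Computer: ~85 floating point operations per second
iphone_flops = 1.9e12      # iPhone 16: ~1.9 teraFLOPS
print(f"{iphone_flops / apollo_flops:,.0f}x")   # ~22,352,941,176x: tens of billions of times
```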
“Electricity is really just organized lightning” - George Carlin
Meanwhile, tokens-per-dollar-per-watt could become the new benchmark that defines leadership in the foundational model race, and may be a “game-changing formula for driving GDP growth”, at least until a new, exponentially less power-hungry processing architecture comes online (more on this in Part 2).
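There is no agreed standard for this metric yet, but here is one plausible way to compute it. All the figures below are hypothetical, chosen only to show the arithmetic.

```python
def tokens_per_dollar_per_watt(tokens_per_second: float,
                               power_watts: float,
                               dollars_per_hour: float) -> float:
    """Tokens served per dollar spent, per watt drawn, over one hour."""
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour / dollars_per_hour / power_watts

# e.g. an accelerator serving 10,000 tokens/s, drawing 700 W, rented at $2/hour (assumed figures)
print(f"{tokens_per_dollar_per_watt(10_000, 700, 2.0):,.0f}")   # ~25,714 tokens per dollar per watt
```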


And as these models become larger, trained on billions of sources (books, websites, code and forums), their general contextual understanding should, in theory, keep improving. Each iteration from GPT-2 to GPT-4 has shown measurable improvements across language tasks, but not without flaws.
Technical benchmarks have been rapidly exceeded in image classification (91%), visual reasoning (67%), clinical knowledge and content generation. OpenAI President Greg Brockman hailed the release of their o3 and o4-mini models as "the closest thing to AGI that humanity has ever seen," scoring over 99% on AIME math benchmarks and achieving more than a 25% success rate on "Humanity's Last Exam."
At 7percent we’re more cautious. Anyone who has used these models regularly will recognise that they still frequently hallucinate. “Ultimately this remains glorified pattern recognition, the question we keep asking is what comes next?” says our own Andrew J Scott.
Public access to state-of-the-art models, such as ChatGPT and Stable Diffusion, and API access for developers marks a new era of democratized expertise in a vast array of commercial applications. The latest models are estimated to have trillions of parameters, over ten times more than GPT-3.
The floodgates have opened, and there’s no going back.
“Models will get larger and larger until it doesn't make sense to call them individual models any more. They're just infrastructure that runs intelligence.” says Gensyn CEO Ben Fielding.
“We're heading for a complete shift in the way we see technology, away from deterministic execution and towards an always-on, probabilistic digital twin that can answer our questions directly and provide information by fetching it itself from parameter space. It's like the entire world's knowledge is captured in a library and we just created robots to run the library instead of people.”

Enterprise adoption is among the fastest we’ve ever seen for a new technology, with 78% of enterprises using AI in at least one business function. But we must remember we’re right at the beginning of this revolution. Most corporations are only just beginning to experiment with AI. In web terms, we’re at the equivalent of the version 1.0 browser.
AI is also being pushed further out to the edge (meaning devices such as mobiles or sensors). From ocean freight and farms in far-flung corners of the world to satellites in low Earth orbit, companies like Plumerai are bringing intelligence closer to where data is generated, reducing latency and preserving privacy. In Plumerai’s case, that means binarised (simpler, faster) neural networks, which use 1-bit weights instead of up to 32 bits, to process video in a doorbell at a fraction of the energy cost.
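As a rough sketch of what binarisation means in practice (a toy example of the general idea, not Plumerai’s actual implementation): real-valued 32-bit weights are replaced by their sign, so each weight can be stored in a single bit.

```python
import numpy as np

weights_fp32 = np.random.randn(256, 256).astype(np.float32)   # a 256x256 weight matrix
weights_bin = np.sign(weights_fp32)                           # values in {-1, +1}
packed = np.packbits(weights_bin > 0)                         # 1 bit per weight when stored

print(weights_fp32.nbytes, "bytes at 32-bit")                 # 262,144 bytes
print(packed.nbytes, "bytes packed at 1-bit")                 # 8,192 bytes: a 32x reduction
```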
Our lives have already become noticeably harder - if not impossible - without internet access.
Our reliance on computing will grow ever stronger. The amount of stored data is currently doubling every 3 years. As smartphone and internet adoption rates move towards 100% (up from 44% and 59% respectively), edge and IoT devices will drive us to a world with over 50,000 Zettabytes of stored data. That’s more than 1,000 times the amount stored today.
To put it another way, 50,000 Zettabytes is a stack of books that would go to the Moon and back more than 30 MILLION times!
It's likely that a vast amount of that data will be recorded. Ilya Sutskever declared the "death of pre-training." What he meant is that structured, self-supervised training of foundation models on existing data works well enough today, but the next step is to incorporate live data continually into the models, to personalise them and create true experts.
This is also closer to how we operate as humans: the human brain processes far more data from immediate, first-hand experience than we are ever able to learn from the past, or from books.
As AI agents (computer programs that can make decisions and take actions on their own to achieve a goal) evolve, the number of agent-to-agent interactions will increase exponentially. Agents will decide to spin up other agents to do their sub-tasks. This will drive demand for compute even further, possibly beyond the 19% to 22% annual growth already predicted.
The demand for compute is a business opportunity but also a challenge. With AI set to be the defining technology of the next 20 years, just as the silicon chip and then the internet were for the second half of the 20th century, there is all to play for, whether you are a nation state, a corporate or a startup.
At 7percent, over 25% of our investments have been made as part of our Future Compute thesis, focused on next-generation architectures and foundational intelligence, with portfolio companies including Gensyn, Nu Quantum, Magic Pony Technology (exited to Twitter), Plumerai, Universal Quantum, Vaire, and (most recently) Paddington Robotics.
Next month in Part 2 of this post, we’ll look at the challenges and opportunities for AI and the compute that powers it…