2025 Six Major AI Paradigm Shifts: From RLVR Training, Vibe Coding to Nano Banana

Author: Andrej Karpathy
Compiled by: Tim, PANews

2025 was a year of rapid and uncertain development for large language models, and it delivered real results. Below are what I personally consider the noteworthy and somewhat surprising “paradigm shifts” of the year: changes that reshaped the landscape and left a deep impression on me, at least at the conceptual level.
1. Reinforcement Learning Based on Verifiable Rewards (RLVR)
At the beginning of 2025, the LLM production stack at essentially every AI lab took roughly the following form:
Pre-training (GPT-2/3, circa 2019–2020);
Supervised Fine-Tuning (InstructGPT from 2022);
And reinforcement learning from human feedback (RLHF, 2022).
For a long time this was the stable, mature stack for training production-grade large language models. In 2025, reinforcement learning based on verifiable rewards became a core technique adopted across the industry. By training LLMs in environments with automatically verifiable rewards (such as math and programming problems), the models spontaneously develop strategies that look like “reasoning” from a human perspective: they learn to break a solution into intermediate computation steps and acquire a repertoire of problem-solving strategies through iterative reasoning (see the DeepSeek-R1 paper). Under the earlier stack these strategies were hard to elicit, because it is not obvious in advance what an optimal reasoning trace or a good backtracking step should look like; such solutions have to be discovered through optimization against the reward.
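To make “automatically verifiable reward” concrete, here is a minimal sketch of what such reward functions might look like for math and code environments. Everything here (function names, the boxed-answer format, the test harness) is an illustrative assumption, not any lab's actual implementation:

```python
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    """1.0 if the model's final boxed answer matches the reference, else 0.0.

    This is the kind of objective, hard-to-game signal RLVR optimizes:
    no human judgment, just an automatic check against ground truth.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)  # hypothetical answer format
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def code_reward(candidate_source: str, unit_tests: list) -> float:
    """Fraction of unit tests passed by a candidate program.

    Each test is a (function_name, args, expected) triple. exec is used
    here for brevity; a real harness would sandbox the candidate code.
    """
    if not unit_tests:
        return 0.0
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # never run untrusted code unsandboxed
    except Exception:
        return 0.0
    passed = 0
    for fn_name, args, expected in unit_tests:
        try:
            if namespace[fn_name](*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(unit_tests)
```

A policy-gradient loop then samples many reasoning trajectories per problem and reinforces the ones these functions score highly; the “reasoning” behavior described above is what emerges under that pressure.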
Unlike the supervised fine-tuning and RLHF stages (which are comparatively short and computationally light), reinforcement learning based on verifiable rewards means long-running optimization against objective, hard-to-game reward functions. It has proven to deliver large capability gains per unit of compute, and it now absorbs a substantial share of the computational resources originally earmarked for pre-training. Much of the capability progress in 2025 therefore reflects how the major AI labs adapted to the enormous compute appetite of this new technique. Broadly, model scale stayed roughly flat while reinforcement learning training time grew dramatically. Another distinctive feature is a new control dimension (with corresponding scaling laws): model capability as a function of test-time compute, dialed up by generating longer reasoning trajectories, i.e. more “thinking time.” OpenAI's o1 model (released at the end of 2024) was the first demonstration of an RLVR-trained model, and the release of o3 (early 2025) marked the turning point where the qualitative leap became tangible.
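The “thinking time” dial is easy to picture as an inference-time parameter. A minimal sketch, assuming a hypothetical `model.generate` sampling API (the stop token and call signature are illustrative, not any vendor's real interface):

```python
def answer_with_budget(model, prompt: str, thinking_tokens: int) -> str:
    """Spend more test-time compute by allowing a longer reasoning trace.

    Capability now scales with `thinking_tokens`, a knob set at inference
    time rather than at training time.
    """
    # Phase 1: let the model reason at length, up to the token budget.
    trace = model.generate(prompt, max_new_tokens=thinking_tokens, stop="</think>")
    # Phase 2: condition on the trace and produce a short final answer.
    return model.generate(prompt + trace + "\nFinal answer:", max_new_tokens=256)
```

Raising `thinking_tokens` buys a longer trajectory and, per the scaling laws mentioned above, tends to buy more capability on reasoning-heavy tasks.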
2. Ghosts vs. Animals: Jagged Intelligence
In 2025, I (and I think the industry as a whole) for the first time got an intuitive grip on the “shape” of LLM intelligence: we are not evolving or raising animals, we are “summoning ghosts.” The entire LLM technology stack (neural architecture, training data, training algorithms, and above all the optimization objectives) is fundamentally different from biology, so it should not surprise us that the resulting entities occupy a markedly different region of intelligence space; judging them through an animal lens is a category error. In terms of the supervision signal, human neural networks were optimized for tribal survival in the jungle, while LLM neural networks are optimized to imitate human text, to collect rewards on math problems, and to win human approval in head-to-head arenas. Wherever a domain is verifiable enough to support RLVR, capability “spikes” around it, yielding an interestingly jagged performance profile overall: the same model can be an encyclopedic genius and a confused grade-schooler at once, and it may leak your data if coaxed.
(Figure: human intelligence in blue, AI intelligence in red.) I like this version of the meme (sorry, I couldn't find the original tweet to credit) because it points out that human intelligence is itself jagged, just in its own distinctive way.
Relatedly, in 2025 I developed a general apathy and distrust toward benchmarks. The core problem is that benchmarks are, by construction, verifiable environments, which makes them susceptible to RLVR and to its weaker cousin, synthetic data generation. In the routine chase for higher scores, LLM teams inevitably build training environments that sit close to the benchmarks in data space and then cover those regions with capability spikes. “Training on the test set” has quietly become the norm.
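One practical consequence: before trusting a score, you increasingly want to measure how close the training data sits to the benchmark. Below is a minimal n-gram overlap check of the kind used in decontamination pipelines; it is an illustrative sketch, not any team's actual tooling, and exact-match overlap is precisely the baseline that “near-benchmark” paraphrased data slips past:

```python
def ngram_set(text: str, n: int = 8) -> set:
    """All word-level n-grams of a document, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, test_items: list, n: int = 8) -> bool:
    """Flag a training document that shares any n-gram with a test item.

    Exact n-gram matching is the crude industry baseline; paraphrases and
    purpose-built "nearby" training environments pass it untouched, which
    is one reason leaderboard scores and real capability keep diverging.
    """
    train_grams = ngram_set(train_doc, n)
    return any(train_grams & ngram_set(item, n) for item in test_items)
```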
So what happens if we sweep every benchmark and still fall short of general artificial intelligence?
3. Cursor: A New Layer of LLM Applications
One of the most impressive things about Cursor (beyond its rapid rise this year) is that it convincingly revealed a new layer of “LLM applications,” to the point that people now speak of “the Cursor for X.” As I emphasized in my talk at Y Combinator this year, LLM apps like Cursor are fundamentally about integrating and orchestrating LLM calls for a specific vertical (a minimal sketch of the pattern follows this list):
They are responsible for “context engineering”;
They orchestrate multiple LLM calls into an increasingly complex directed acyclic graph under the hood, carefully balancing performance and cost;
They provide application-specific graphical interfaces for the humans in the loop;
And they provide an “autonomy slider.”
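Here is a hedged sketch of that orchestration pattern: a tiny DAG of LLM calls with an autonomy slider gating human review. All of the names (`Stage`, `call_llm`, `ask_human`, the example stages) are illustrative assumptions, not Cursor's actual architecture:

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    """One node in the LLM-call DAG: a prompt plus its upstream dependencies."""
    name: str
    prompt: str
    deps: list = field(default_factory=list)

def run_pipeline(stages: dict, call_llm, ask_human, autonomy: float) -> dict:
    """Execute stages in dependency order (assumes a valid, cycle-free DAG).

    Below the autonomy threshold, a human reviews each intermediate output
    before it propagates downstream (the "human in the loop" hook).
    """
    results: dict = {}
    while len(results) < len(stages):
        for stage in stages.values():
            if stage.name in results or any(d not in results for d in stage.deps):
                continue
            context = "\n".join(results[d] for d in stage.deps)
            output = call_llm(f"{stage.prompt}\n{context}")
            if autonomy < 0.5:  # low autonomy: surface every step for review
                output = ask_human(stage.name, output)
            results[stage.name] = output
    return results

# Example wiring: plan -> edit -> review, as a coding app might chain them.
stages = {
    "plan":   Stage("plan",   "Outline the change the user asked for."),
    "edit":   Stage("edit",   "Write the code change.",     deps=["plan"]),
    "review": Stage("review", "Check the change for bugs.", deps=["edit"]),
}
```

Sliding `autonomy` toward 1.0 removes the human checkpoints; sliding it toward 0.0 turns every LLM step into a suggestion awaiting approval.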
In 2025 there was plenty of debate about how much room this emerging application layer has. Will the LLM platforms eventually dominate all applications, or does a vast world remain for LLM apps built on top of them? My own guess is that the platforms will trend toward producing “well-rounded university graduates,” while the apps will organize those graduates, fine-tune them, and turn them into “professional teams” deployable in real-world scenarios within specific verticals, by supplying private data, sensors, actuators, and feedback loops.
4. Claude Code: AI That Runs Locally
Claude Code was the first convincing demonstration of the LLM agent form factor: tool use interleaved with reasoning in a loop, sustained long enough to work through genuinely complex problems. What impresses me even more is that it runs on the user's own computer, deeply integrated with the user's private environment, data, and context. I believe OpenAI misjudged this direction by focusing its coding assistants and agents on cloud deployment, specifically containerized environments orchestrated through ChatGPT, rather than on localhost. Fleets of agents in the cloud may look like “the end state of AGI,” but we are in a transitional phase where capability is uneven and progress is slower than it appears; under those conditions, an agent that sits directly on the developer's machine, collaborating closely with them in their actual working environment, is the more sensible path. Claude Code got this priority exactly right and packaged it as a concise, elegant, highly attractive command-line tool, reshaping how AI presents itself: no longer a website you visit, like Google, but a little sprite or ghost that “lives” on your computer. That is a genuinely new paradigm for interacting with AI.
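The cyclical reason-act-observe loop that makes this form factor work can be sketched in a few lines. A minimal sketch, assuming a `call_model` callable and a single shell tool; the JSON action protocol here is a stand-in, not Anthropic's implementation:

```python
import json
import subprocess

def run_shell(command: str) -> str:
    """The single local tool: run a command in the user's own environment."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=60)
    return result.stdout + result.stderr

def agent_loop(call_model, task: str, max_steps: int = 20) -> str:
    """Alternate model reasoning with local tool execution until done.

    At each step the model emits either {"command": "..."} to act, or
    {"answer": "..."} to finish; every tool observation is appended to
    the transcript, so reasoning and acting form one continuous cycle
    grounded in the user's real files and processes.
    """
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        action = json.loads(call_model("\n".join(transcript)))
        if "answer" in action:
            return action["answer"]
        observation = run_shell(action["command"])
        transcript.append(f"Ran: {action['command']}\nOutput: {observation}")
    return "Step budget exhausted."
```

Because the loop executes against localhost rather than a remote container, the agent sees the same repository, environment variables, and running processes the developer does.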
5. Vibe Coding
In 2025, AI crossed a capability threshold that makes it possible to build astonishing programs through English descriptions alone, without worrying about the underlying code. Amusingly, I coined the term “vibe coding” in a stream-of-consciousness tweet (a shower thought, essentially), never anticipating it would grow into this. Under the vibe coding paradigm, programming is no longer the preserve of highly trained professionals; it is an activity anyone can take part in. In that sense it is another illustration of the phenomenon I described in “Empowering People: How Large Language Models Are Changing Patterns of Technology Diffusion”: in stark contrast with every earlier technology, ordinary people benefit more from LLMs than professionals, corporations, and governments do. But vibe coding does not only open programming to ordinary people; it also lets professional developers write software that would otherwise never have existed. While building nanochat, I vibe-coded a custom, efficient BPE tokenizer in Rust without leaning on existing libraries and without deeply learning Rust. This year I also spun up many project prototypes purely to test whether an idea was feasible, and I even wrote an entire one-off application just to pin down a single bug, because code has suddenly become free, ephemeral, malleable, and disposable. Vibe coding will reshape the software ecosystem and redraw the boundaries of the profession.
6. Nano Banana: A Graphical Interface for LLMs
Google's Gemini “Nano Banana” represents one of the most disruptive paradigm shifts of 2025. In my view, LLMs are the next major computing paradigm after the computers of the 1970s and 80s, so we should expect analogous innovations for analogous underlying reasons, echoing the evolution of personal computing, microcontrollers, and even the internet. This is especially true at the level of human-computer interaction: today's “chat” with an LLM feels a lot like typing commands into a 1980s terminal. Text is the most primitive data representation for computers (and for LLMs), but it is not humans' preferred modality, especially for taking information in. People actually dislike reading text; it is slow and laborious. Humans prefer to absorb information visually and spatially, which is exactly why graphical user interfaces emerged in classical computing. By the same logic, LLMs should communicate with us in the forms we prefer: images, infographics, slides, whiteboards, animations, videos, web apps, and other media. The earliest hints of this already exist as visual embellishments of text, such as emoji and Markdown formatting (headings, bold, lists, tables, and so on). But who will actually build the graphical interface for LLMs? Seen this way, Nano Banana is an early prototype of that future. Notably, its breakthrough is not just image generation capability but the integrated ability that comes from text generation, image generation, and world knowledge being intertwined in one set of model weights.