AI Model Arena: An In-Depth Perspective on the NoF1 Live Portfolio Trading Competition

PANews

2025-11-03 03:42:22

On October 18, the AI research laboratory focused on financial markets, nof1, launched an unprecedented experiment: six top global AI models—GPT-5, Gemini 2.5 Pro, Grok-4, Claude Sonnet 4.5, DeepSeek V3.1, Qwen3 Max—each managing $10,000 in real funds on Hyperliquid to conduct live cryptocurrency trading.

Current rankings and account values, as of the evening of October 30:

DeepSeek Chat V3.1: $15,671.39 (+56.71%)
Qwen3 Max: $12,520.34 (+25.20%)
BTC Buy & Hold: $10,146.69 (+1.47%)
Claude Sonnet 4.5: $9,290.97 (-7.09%)
Grok 4: $7,030.02 (-29.70%)
Gemini 2.5 Pro: $3,446.03 (-65.54%)
GPT 5: $2,749.32 (-72.51%)
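
The percentages in these standings follow directly from the account values and the $10,000 starting balance. A minimal sketch of that arithmetic, using the figures listed above (the function and variable names are illustrative, not part of the competition's tooling):

```python
# Starting balance per model, as stated in the article.
INITIAL_BALANCE = 10_000.00

# Account values as of the evening of October 30, copied from the standings.
account_values = {
    "DeepSeek Chat V3.1": 15_671.39,
    "Qwen3 Max": 12_520.34,
    "BTC Buy & Hold": 10_146.69,
    "Claude Sonnet 4.5": 9_290.97,
    "Grok 4": 7_030.02,
    "Gemini 2.5 Pro": 3_446.03,
    "GPT 5": 2_749.32,
}


def return_pct(value, initial=INITIAL_BALANCE):
    """Simple return in percent relative to the starting balance."""
    return (value / initial - 1.0) * 100.0


returns = {name: round(return_pct(v), 2) for name, v in account_values.items()}

for name, pct in returns.items():
    print(f"{name:<20} {pct:+.2f}%")
```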

Compared with data from a few days earlier, the standings have changed dramatically. DeepSeek remains in the lead, but its return has retreated sharply from 95.71% to 56.71%, with its account value dropping from $19,570 to $15,671, a loss of nearly $4,000. Qwen3’s return also fell, from 53.68% to 25.20%. Notably, Claude Sonnet 4.5 moved from a slight profit to a loss of about 7%, while GPT 5’s loss widened to 72%, leaving the account close to liquidation.

Reading the Market Curve: The Evolution of Three Phases

Phase One (October 18-25): Uptrend, Strategy Divergence Emerges

The market was in an upward channel, with strategy differences among models beginning to show:

DeepSeek: Rapidly increased from $10,000 to $17,000, demonstrating strong trend capture
Qwen3: Steadily rose to the $12,000–15,000 range
Claude/Grok: Hovered around $10,000–12,000
Gemini/GPT: Fell below $5,000, lagging due to trading fees and poor decision-making

Phase Two (October 26-28): Accelerated Rise, Peak Formation

DeepSeek Peak: Broke through $23,000 on October 27, achieving 130% return in 9 days. Held long positions in ETH and SOL with 10-15x leverage.
Qwen3 Moderate: Peaked at $17,000 with a gentle rise. Maintained an 82.4% cash position, carefully choosing entry points to avoid chasing the market.
Claude/Grok Fluctuated: Oscillated between $11,000 and $13,000 with inconsistent strategies, showing interest in the rally but never committing decisively.
Gemini/GPT Out of Contention: Accounts fell to $3,000–4,000, with little chance of recovery.

Phase Three (October 29-30): Market Pullback, Risk Control Revealed

DeepSeek: Fell sharply from $23,000 to $15,671, losing about $7,300 in two days (roughly -32%). It had no take-profit mechanism and failed to realize gains at the peak. It spent 95.6% of the time in long positions, with no hedging and no timely stop-loss. Despite the pullback, it still leads second place by about $3,000 thanks to its earlier advantage.
Qwen3: Showed resilience, retreating from $17,000 to $12,520 (-26%) and remaining behind DeepSeek. It maintained an 82.4% cash position and kept trades short (average holding time 9.7 hours), closing positions quickly, which limited its time exposed to risk and cut losses before they grew.
BTC Buy & Hold: A victory for the simple strategy, with the account at $10,146 (+1.47%), ahead of Claude and Grok and ranked third. Ironically, four “intelligent” AI models, after hundreds of trades, performed worse than simply buying and holding. More trades do not necessarily mean better results; the simple strategy avoided overtrading and high costs.
Claude: Conservative approach failed, shifting from +0.93% to -7.09% ($10,093 to $9,290). Heavy trading fees eroded profits, with a low profit-to-loss ratio (1.34:1). Frequent position adjustments during pullbacks accelerated losses, missing large upward moves and failing to defend during declines.
Grok: Accelerated collapse, losses expanded from -8% to -29.7% ($7,030). Held 90.6% long positions with only a 22.7% win rate, realizing a loss of $2,449. Principal nearly exhausted, supported only by $1,611 unrealized gains, at risk of zeroing out at any moment.
Gemini/GPT: Struggling for survival, GPT dropped to $2,749 (-72.51%), Gemini to $3,446 (-65.54%). Failures across the board: overtrading, low win rate, poor profit-to-loss ratio, high leverage risk.
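
The pullback figures for the two leaders can be checked as peak-to-current drawdowns. The sketch below assumes the approximate peaks quoted in this article ($23,000 for DeepSeek, $17,000 for Qwen3); the exact peak values are not given, so the results are approximate.

```python
def drawdown_pct(peak, current):
    """Percentage change from a prior peak to the current account value."""
    return (current / peak - 1.0) * 100.0


# Peaks are the rounded figures quoted in the article, not exact account highs.
deepseek_dd = drawdown_pct(23_000, 15_671.39)
qwen_dd = drawdown_pct(17_000, 12_520.34)

print(f"DeepSeek drawdown: {deepseek_dd:.1f}%")
print(f"Qwen3 drawdown:    {qwen_dd:.1f}%")
```

On these rounded peaks, DeepSeek's drawdown comes out near 32% and Qwen3's near 26%.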

Deepening Issues Revealed by the Decline

1. The Double-Edged Sword of “Trend Following”

DeepSeek’s success was based on “trend following”: about 95% of the time it was long, trusting that the trend would continue. During the uptrend, this strategy pushed its return to a peak of roughly 130%. But when the trend reversed, the same approach produced a drawdown of about 32% from that peak.

This exposes a key problem: Trend-following strategies require effective take-profit and stop-loss mechanisms. If you only “let profits run” without “cutting losses,” a major reversal can wipe out most gains.
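
One common way to protect trend-following gains is a trailing stop, which exits once equity falls a fixed percentage below its running peak. The following is a minimal sketch of that idea, not a description of how any model in the competition works. The equity checkpoints and the 15% trail are hypothetical, chosen only to resemble the shape of DeepSeek's curve.

```python
def trailing_stop_exit(equity_curve, trail_pct):
    """Return (index, value) of the first point at least trail_pct below
    the running peak, or None if the stop is never triggered."""
    peak = equity_curve[0]
    for i, value in enumerate(equity_curve):
        peak = max(peak, value)
        if value <= peak * (1.0 - trail_pct):
            return i, value
    return None


# Hypothetical equity checkpoints, loosely shaped like DeepSeek's account curve.
equity = [10_000, 13_000, 17_000, 23_000, 21_000, 18_000, 15_671]

exit_point = trailing_stop_exit(equity, trail_pct=0.15)
print(exit_point)
```

With these coarse checkpoints the stop triggers at $18,000 rather than riding the account down to $15,671. A live implementation would evaluate prices continuously and would also face slippage and fees, which this sketch ignores.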

DeepSeek may have placed too much faith in “long-term holding” and neglected market uncertainty. Its largest single profit, $7,378, came from a 60-hour ETH trade, which reinforced its “long-termism.” But markets are not one-way streets; trends can reverse at any time.

2. Holding Cash Is a Form of Wisdom and Protection

Qwen3 demonstrated the value of holding cash. Its 82.4% cash position during the rising phases might look like “missing opportunities,” but during the decline it limited losses.

A 26% drawdown versus 32% shows only a 6 percentage point difference, but compounded over time, this gap widens. More importantly, Qwen3 preserved principal and psychological advantage, enabling quick re-entry when the market stabilizes. DeepSeek, if it continues to decline, risks falling into a “floating loss–hesitation–missed rebound” vicious cycle.
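
The reason a few percentage points of drawdown matter is that the gain needed to get back to the prior peak grows faster than the loss itself. A short sketch of that relationship, using the 26% and 32% figures above:

```python
def gain_to_recover(drawdown):
    """Fractional gain needed to return to the prior peak after a
    fractional drawdown (e.g. 0.26 for a 26% drawdown)."""
    return 1.0 / (1.0 - drawdown) - 1.0


for name, dd in (("Qwen3", 0.26), ("DeepSeek", 0.32)):
    print(f"{name}: a {dd:.0%} drawdown needs a {gain_to_recover(dd):.1%} gain to recover")
```

A 26% drawdown needs a gain of about 35% to recover, while a 32% drawdown needs about 47%, so the 6-point gap in losses becomes a gap of roughly 12 points in the required rebound.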

3. The Resilience of Simple Strategies

The performance of BTC Buy & Hold is a slap in the face for all the “smart” AI models. This strategy involves no technical analysis, no complex algorithms, and no frequent rebalancing, yet it ranks third, outperforming four of the six AI models.

This result reminds us: in trading, avoiding mistakes can be more valuable than making many correct trades. Gemini lost 66% over 193 trades, while BTC Buy & Hold made zero trades and preserved principal. Who is more successful? The answer is obvious.

4. Lack of Risk Management

Except for Qwen3, nearly all AI models exposed serious risk management flaws:

DeepSeek: No take-profit mechanism, causing peak gains of 130% to retrace to 57%
Claude: Over-reliance on “no shorting” mindset, lacking hedging
Grok: Despite only a 22.7% win rate, persisted with 90.6% long positions
GPT: 40x leverage on BTC position, liquidation threshold only 1.2%
Gemini: No risk control at all, 193 trades akin to gambling

This shows that while these AI models can “understand” market data and “execute” trades, their core risk management capabilities are still immature.
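
Win rate and profit-to-loss ratio only make sense together. As a rough check on the figures quoted for Claude and Grok, the sketch below computes the break-even point for each, ignoring fees and assuming all wins and all losses are of average size (a simplification, not how the competition reports results).

```python
def breakeven_win_rate(payoff_ratio):
    """Win rate at which expected profit is zero for a given
    average-win / average-loss ratio."""
    return 1.0 / (1.0 + payoff_ratio)


def breakeven_payoff_ratio(win_rate):
    """Average-win / average-loss ratio needed to break even at a given win rate."""
    return (1.0 - win_rate) / win_rate


def expectancy(win_rate, payoff_ratio):
    """Expected profit per trade, in units of the average loss."""
    return win_rate * payoff_ratio - (1.0 - win_rate)


# Claude's reported profit-to-loss ratio of 1.34:1.
print(f"Win rate needed at 1.34:1: {breakeven_win_rate(1.34):.1%}")

# Grok's reported 22.7% win rate.
print(f"Payoff ratio needed at 22.7% wins: {breakeven_payoff_ratio(0.227):.2f}")

# Hypothetical: Grok's win rate combined with an assumed 2:1 payoff ratio.
print(f"Expectancy at 22.7% wins and 2:1: {expectancy(0.227, 2.0):+.3f}")
```

Under these assumptions, a 1.34:1 ratio requires winning about 43% of trades just to break even before fees, and a 22.7% win rate requires winners about 3.4 times the size of losers.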

Limitations of the Experiment: Cold Reflection Beyond Data

After reviewing the data and analysis, it’s tempting to focus on DeepSeek’s 56% yield or Gemini’s 66% loss. But before drawing conclusions, we must acknowledge the systemic limitations of this experiment—these may be more important than the results themselves.

1. The Time Frame Is Too Short: 12 Days Cannot Reveal the Truth

This experiment lasted only 12 days, from October 18 to 30. What does 12 days mean in the crypto market? Likely just a fragment of a full bull-bear cycle.

The observed “rise–peak–pullback” pattern forms a complete mini-cycle, but that could be luck. Had the experiment started at a market top or run into a sudden crash of around 30%, like the May 19, 2021 sell-off (the “519 event”), the rankings could have been completely reversed.

DeepSeek’s 56% return may heavily depend on this short-term market behavior. Its 95% long position strategy excels in a bullish trend but would be eaten away by fees and repeated stop-losses during sideways or bear markets.

Similarly, Qwen3’s roughly 82% cash position is an advantage in sideways or falling markets, but in a bull run like that of 2021 it would underperform and miss large gains. If BTC rose from $10,000 to $100,000, a portfolio holding 80% cash would capture only about 20% of the rise.
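
The cash-drag arithmetic in that example can be made explicit. This sketch assumes a static allocation with no rebalancing and cash that earns nothing, which is a simplification of how any real portfolio (or Qwen3 itself) would behave.

```python
def portfolio_return(asset_return, invested_fraction):
    """Return of a portfolio that holds invested_fraction in the asset and
    the rest in non-yielding cash, with no rebalancing."""
    return invested_fraction * asset_return


# BTC moving from $10,000 to $100,000 is a +900% return on the asset.
asset_return = 100_000 / 10_000 - 1.0

fully_invested = portfolio_return(asset_return, 1.0)
mostly_cash = portfolio_return(asset_return, 0.20)

print(f"100% invested: {fully_invested:+.0%}")
print(f"20% invested:  {mostly_cash:+.0%}")
```

On a $10,000 account, the fully invested portfolio would end at $100,000, while the 80%-cash portfolio would end at about $28,000.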

12 days of data are insufficient to validate any long-term strategy.

2. Same Prompt, Different AI: Bound by the Same Data

All six AI models received identical market data and trading instructions. It’s like six fund managers analyzing the same research report—what’s being tested isn’t their research ability but their execution discipline.

In real trading, alpha comes from asymmetric information. Top quant funds have exclusive on-chain tracking, whale transfer insights, and off-chain large order flow data to anticipate institutional moves.

But in this experiment, all AI models saw the same information. It’s more an “execution competition” than a “strategy innovation” contest.

We cannot determine, from this setup, who would win if DeepSeek had exclusive on-chain data or Gemini had proprietary Twitter sentiment analysis.

3. Capital Scale Distortion: The $10,000 Fairy Tale

Each AI only managed $10,000. This is a tiny amount on Hyperliquid—you can enter and exit freely, slippage is negligible, liquidity impact is nonexistent, and large orders can be split without concern.

But in real quantitative trading, managing $10 million versus $10,000 is a different universe.

GPT’s 40x leverage at $10,000 is barely feasible, but at $100 million × 40x = $4 billion exposure, a 3% adverse move would trigger liquidation, and your orders could crash the market.
Qwen3’s short-term strategy, effective with small capital, becomes unviable with large funds due to transaction costs (slippage + fees). Large orders will push prices up when opening and down when closing, ultimately costing you money.
DeepSeek’s trend-following with high leverage can work at $10,000, but managing $1 million would leave obvious footprints in Hyperliquid’s order book, attracting counter-trades.
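
To put the leverage example in numbers: notional exposure is capital times leverage, and the posted margin is fully consumed by an adverse move of roughly one over the leverage. The sketch below uses that simplified rule only. It does not model Hyperliquid's actual margin or liquidation formula, and exchanges typically liquidate earlier than this because of maintenance margin requirements.

```python
def notional_exposure(capital, leverage):
    """Total position size controlled by the capital at a given leverage."""
    return capital * leverage


def margin_wipeout_move(leverage):
    """Adverse price move (as a fraction) that consumes the entire initial
    margin, ignoring fees, funding, and maintenance margin."""
    return 1.0 / leverage


exposure = notional_exposure(100_000_000, 40)
wipeout = margin_wipeout_move(40)

print(f"Exposure: ${exposure:,.0f}")
print(f"Move that wipes out the margin: {wipeout:.1%}")
```

At 40x, the margin is gone after a 2.5% adverse move, so the 3% move in the example above is more than enough to force liquidation.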

This experiment tests “small capital flexibility,” not “scalable strategy robustness.”

4. Market Environment Luck: No True Hell Encountered

During the experiment, market volatility was moderate. We did not see:

Systemic crashes like FTX’s collapse, where all tokens plunge and liquidity dries up
Single-token crashes like LUNA’s collapse, which fell from about $80 to near zero within days
Exchange failures like Binance’s outage, leaving traders unable to close positions
Extreme liquidity droughts, where deep orders vanish overnight, causing slippage over 20%

None of the AI models’ risk controls have been tested under extreme stress, which is exactly what real crypto traders eventually face. How would DeepSeek’s stop-loss work during a “limit down” scenario? We don’t know. Would Qwen3’s quick close work if the exchange crashes? Uncertain.

Luck plays a significant role in this 12-day experiment.

5. The Randomness of a Single Experiment: No Second Season for Validation

This is a one-off test; there’s no “second season” to verify strategy stability. We cannot answer:

Is DeepSeek’s lead due to genuine ability or just luck?
If we rerun the six models with different parameters, will DeepSeek still be first?
If we start from November 1 for the next 12 days, will the rankings flip?

The current results are more like six people rolling dice, with DeepSeek rolling the highest. But that doesn’t mean its dice are better—just luckier.
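
The dice analogy can be illustrated with a small simulation: six traders with identical, zero-edge random daily returns will still produce a ranking with a clear "winner" after 12 days. All parameters here (5% daily volatility, normally distributed returns, the seed) are hypothetical and are not calibrated to the competition.

```python
import random


def simulate_final_values(n_traders=6, n_days=12, daily_sd=0.05,
                          start=10_000.0, seed=0):
    """Final account values for traders whose daily returns are drawn from
    the same zero-mean normal distribution (no trader has any edge)."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_traders):
        value = start
        for _ in range(n_days):
            value *= 1.0 + rng.gauss(0.0, daily_sd)
        finals.append(value)
    return finals


finals = simulate_final_values()
ranking = sorted([round(v) for v in finals], reverse=True)
print(ranking)
print(f"Gap between best and worst: ${max(finals) - min(finals):,.0f}")
```

Running this with different seeds produces a different leader each time, even though every simulated trader uses the same "strategy." A single 12-day run cannot distinguish that situation from genuine skill.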

So, How Should We View These Rankings?

After considering these limitations, you might ask: does this experiment have any meaning?

Yes, but not in terms of “who is the champion.” Its true value is showing us:

AI can conduct real trading: this is a milestone. A year ago, we debated whether AI would replace traders; now, AI is delivering real results.
Risk management is more important than prediction: all the AI models can “understand” candlestick charts, but only a few can control risk. This confirms the old Wall Street wisdom.
Simplicity endures: BTC Buy & Hold’s third place reminds us that in uncertain markets, avoiding mistakes can be more valuable than making many correct trades.
Strategies are not eternal: DeepSeek’s advantage today could be tomorrow’s trap. Market conditions change, and so do the optimal strategies.

But if you see DeepSeek in first place and decide to entrust it with your funds or copy its approach, you’re making a mistake.

A 12-day champion does not guarantee a 12-month champion; managing $10,000 doesn’t mean managing $1 million; current market winners don’t guarantee future success.

Investing has no simple answers. This experiment provides valuable data, but the limitations behind the data may be more worth pondering than the data itself.

This report’s data was edited and compiled by WolfDAO. For questions, contact us for updates.

Written by: Riffi / WolfDAO (X: @10xWolfdao)
