Episodes

  • AI Improves at Self-improving
    2025/05/19

    AlphaEvolve is not the first system to exhibit self-improvement, but it may be the most impressive yet. AI is literally improving the hardware, architectures, data and training methods of AI itself. A deep dive into the paper, drawing on two previous interviews and 5 other papers. Plus a snippet on OpenAI’s new Codex system.

    Gray Swan: http://app.grayswan.ai/ai-explained

    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    00:27 - AlphaEvolve
    05:23 - Limitation
    06:10 - Achievements
    08:21 - Future Improvements
    13:30 - Quirks
    16:34 - Final Thoughts

    AlphaEvolve release: https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/

    Paper: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf

    Terence Tao Quote: https://mathstodon.xyz/@tao/114508029896631083

    Nature Article: https://www.nature.com/articles/s41586-022-05172-4
    MIT Article: https://www.technologyreview.com/2025/05/14/1116438/google-deepminds-new-ai-uses-large-language-models-to-crack-real-world-problems/
    AI Co-Scientist: https://arxiv.org/pdf/2502.18864

    OpenAI Codex: https://openai.com/index/introducing-codex/


    70% of Pull Requests: https://x.com/slow_developer/status/1920920456393028027

    Amodei Essay: https://www.darioamodei.com/essay/machines-of-loving-grace

    OpenAI Jason Wei Tweet: https://x.com/_jasonwei/status/1923091260354531612

    PromptBreeder: https://arxiv.org/pdf/2309.16797
    DrEureka: https://arxiv.org/pdf/2406.01967

    FT DeepMind: https://www.ft.com/content/4e497a91-670a-4f69-be4a-18e247daba3e



    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    18 min
  • o3 breaks (some) records, but AI becomes pay-to-win
    2025/04/25

    A green card, o3 vs Gemini 2.5, 6 Benchmarks and a whole bunch of my thoughts on what on earth is happening in AI, from here to 2030. Plus, how AI is becoming pay-to-win, and why. Crazy times, 14 mins probably wasn’t enough.

    https://app.grayswan.ai/ai-explained

    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    00:33 - FictionLiveBench
    01:37 - PHYBench
    02:14 - SimpleBench
    02:54 - Virology Capabilities Test
    03:13 - Mathematics Performance
    04:29 - Vision Benchmarks
    05:43 - V* and how o3 works
    06:44 - Revenue and costs for you
    08:54 - Expensive RL and trade-offs
    09:40 - How to spend the OOMs
    13:27 - Gray Swan Arena

    Green Card: https://techcrunch.com/2025/04/25/an-openai-researcher-who-worked-on-gpt-4-5-had-their-green-card-denied/
    PHYBench: https://arxiv.org/pdf/2504.16074
    Virologytest: https://www.virologytest.ai/
    How o3 Vision Works: https://arxiv.org/pdf/2312.14135 https://x.com/sainingxie/status/1912570624523829573
    Visual puzzles: https://neulab.github.io/VisualPuzzles/
    Fiction Bench: https://x.com/ficlive/status/1912863028141244850
    https://geobench.org/
    https://simple-bench.com/
    AIME 2025: https://openai.com/index/introducing-o3-and-o4-mini/
    USAMO: https://x.com/mbalunovic/status/1914398518896193747
    NaturalBench: https://linzhiqiu.github.io/papers/naturalbench/
    Where’s Waldo: https://uk.pinterest.com/pin/492792384225896298/
    IMO and AlphaProof: https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/
    Crazy Revenue: https://www.theinformation.com/articles/openai-forecasts-revenue-topping-125-billion-2029-agents-new-products-gain?rc=sy0ihq
    Number of Users: https://www.theinformation.com/briefings/googles-gemini-user-numbers-revealed-court?rc=sy0ihq
    Subscriptions pay to win: https://www.forbes.com/sites/paulmonckton/2025/04/23/google-leak-reveals-new-gemini-ai-subscription-levels/
    GPU Trade-offs: https://x.com/sama/status/1915098951067554030
    RL Scale-up Amodei: https://www.darioamodei.com/post/on-deepseek-and-export-controls
    Log-linear Returns: https://x.com/bobmcgrewai/status/1895228291981943265
    2030 Scaling: https://epoch.ai/blog/can-ai-scaling-continue-through-2030
    Model Size: https://x.com/slow_developer/status/1874554473256997201
    Adam on AGI: https://x.com/TheRealAdamG/status/1913998366632968381
    Papers on Patreon: https://arxiv.org/pdf/2502.01839
    https://arxiv.org/pdf/2504.13837
    Chollet Quote: https://x.com/fchollet/status/1912934762580447447
    OpenSim: https://opensim.stanford.edu/


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    15 min
  • o3 and o4-mini - they’re great, but easy to over-hype
    2025/04/16

    Critical analysis of the two most powerful new models behind ChatGPT, o3 and o4-mini. Not just the system cards, benchmarks, and my own tests, but some you may not have seen before. Yes, they can whip up amazing front-end in a few seconds, but you always have to ask what is in their data. Either way, they prove the gains from RL are just beginning…

    https://weave-docs.wandb.ai/?utm_source=sponsorship&utm_medium=simple_bench&utm_campaign=ai_explained

    AI Insiders ($9!): https://www.patreon.com/AIExplained


    Chapters:
    00:00 - o3 and o4-mini


    https://simple-bench.com/

    Plus, Teams and Pro, plus token count: https://x.com/btibor91/status/1912568994512662679

    System Card: https://openai.com/index/o3-o4-mini-system-card/

    Release Notes: https://openai.com/index/introducing-o3-and-o4-mini/

    https://deepmind.google/technologies/gemini/pro/

    https://x.com/DeryaTR_/status/1912558350794961168

    https://x.com/polynoamial/status/1912564068168450396

    API Pricing: https://openai.com/api/pricing/

    https://aider.chat/docs/leaderboards/


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    14 min
  • ‘Speaking Dolphin’ to AI Data Dominance, 4.1 + Kling 2: 7 Developments Critically Analysed
    2025/04/16

    This pod won’t just be about the release of GPT 4.1 in the last 48 hours, o3 build-up, Kling 2.0, a sneak peek at the next OpenAI model, or even the new Dolphin language tool. It will be about 7 such stories that contextualise where we are in AI and what is happening.

    https://www.emergentmind.com/


    Chapters:

    00:00 - Introduction

    00:30 - Kling 2.0

    01:35 - GPT 4.1

    05:25 - o3 Build-up

    07:37 - ‘Product Company’

    09:31 - Safe Superintelligence

    10:54 - DolphinGemma

    13:16 - Data Dominance?


    Kling 2.0: https://app.klingai.com/global/release-notes


    Dolphin Gemma: https://blog.google/technology/ai/dolphingemma/?s=09


    https://openai.com/index/gpt-4-1/


    OpenAI o3 Build-up The Information: https://www.theinformation.com/articles/openais-latest-breakthrough-ai-comes-new-ideas?rc=sy0ihq


    Physical reasoning: https://x.com/a_karvonen/status/1911839968990814503


    Fiction Live.bench: https://x.com/ficlive/status/1911853409847906626


    Altman Ted: https://www.youtube.com/watch?v=5MWT_doo68k


    https://simple-bench.com/try-yourself


    https://aider.chat/docs/leaderboards/


    4.5: https://www.youtube.com/watch?v=6nJZopACRuQ


    Geospatial reasoning: https://research.google/blog/geospatial-reasoning-unlocking-insights-with-generative-ai-and-multiple-foundation-models/


    Pioneers: https://x.com/OpenAIDevs/status/1910017976256119151

    Evals: https://www.youtube.com/watch?v=scsW6_2SPC4

    Anthropic Updates: https://www.bloomberg.com/news/articles/2025-04-15/anthropic-is-readying-a-voice-assistant-feature-to-rival-openai?srnd=phx-ai

    https://x.com/sethsaler/status/1912188383457059301


    https://techcrunch.com/2025/04/12/openai-co-founder-ilya-sutskevers-safe-superintelligence-reportedly-valued-at-32b/

    https://ai.meta.com/blog/llama-4-multimodal-intelligence/

    https://deepmind.google/technologies/gemini/pro/

    https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/

    https://blog.google/products/google-cloud/ironwood-tpu-age-of-inference/

    OpenAI Documentary: https://www.patreon.com/posts/one-machine-to-121940490

    20 min
  • AI CEO: ‘Stock Crash Could Stop AI Progress’, Llama 4 Anti-climax +‘Superintelligence in 2027’...
    2025/04/07

    The latest on Llama 4, and whether it signals a slowdown in AI, or solid progress. Plus, a deep dive on that viral prediction of superintelligence by 2027, and Amodei’s cautionary words on what could stop AI progress in its tracks. o3 news, and more, as well.

    Weights & Biases: https://weave-docs.wandb.ai/?utm_source=sponsorship&utm_medium=simple_bench&utm_campaign=ai_explained


    DeepSeek Doc: https://www.patreon.com/posts/openai-is-not-r1-125869969

    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    00:47 - Stock Crash
    02:28 - Llama 4
    10:55 - o3 News
    11:59 - OpenAI non-profit?
    13:13 - AI 2027

    Llama 4 Release: https://ai.meta.com/blog/llama-4-multimodal-intelligence/

    Dario Amodei Comments: https://www.youtube.com/watch?v=esCSpbDPJik

    Knowledge Cut-off: https://www.llama.com/docs/model-cards-and-prompt-formats/llama4_omni/

    Aider Polyglot: https://aider.chat/docs/leaderboards/

    Gemini 1.5: https://arxiv.org/pdf/2403.05530

    Fiction-LiveBench: https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87

    OpenAI Valuation: https://www.nytimes.com/2025/03/31/technology/openai-valuation-300-billion.html?login=smartlock&auth=login-smartlock

    OpenAI Cybersecurity: https://www.bloomberg.com/news/articles/2024-01-16/openai-working-with-us-military-on-cybersecurity-tools-for-veterans

    Deep research System Card: https://cdn.openai.com/deep-research-system-card.pdf

    https://openai.com/index/paperbench/

    AI 2027: https://ai-2027.com/

    METR Paper: https://arxiv.org/pdf/2503.14499

    OpenAI non-profit: https://openai.com/index/nonprofit-commission-guidance/

    NYT Piece: https://www.nytimes.com/2025/04/03/technology/ai-futures-project-ai-2027.html?unlocked_article_code=1.804._yKi.QhwOp15Q3tcU&smid=url-share&s=09

    Kokotajlo predictions 2021: https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-looks-like

    https://simple-bench.com/


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    Podcast: https://aiexplainedopodcast.buzzsprout.com/

    24 min
  • Gemini 2.5 Pro - It’s a Smart Chatbot … (New Simple High Score)
    2025/03/28

    Gemini gets a new record on Simple Bench, and several other benchmarks. I’ll go deep to explore its nuances, including how it deceptively reverse engineers answers, does better on certain coding benchmarks than others, may have a universal ‘conceptual language’ …

    https://weave-docs.wandb.ai/?utm_source=sponsorship&utm_medium=simple_bench&utm_campaign=ai_explained

    … and more. Plus practical tips, a note on security and Kling vs Veo 2 guest appearance.


    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    00:36 - Fiction Bench
    02:41 - Practicality - YouTube urls + Security - cut-off date
    03:42 - Coding
    06:22 - WeirdML Bench
    07:01 - Simple Bench Record High
    11:23 - Reverse Engineering!
    13:22 - Anthropic Paper
    17:49 - 3 Caveats

    Gemini 2.5 Updated: https://deepmind.google/technologies/gemini/

    Fiction Live Bench: https://fiction.live/stories/Fiction-liveBench-Feb-19-2025/oQdzQvKHw8JyXbN87

    https://simple-bench.com/

    WeirdML: https://htihle.github.io/weirdml.html
    https://x.com/htihle/status/1905014058228625542

    Anthropic Thoughts: https://www.anthropic.com/research/tracing-thoughts-language-model
    https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-cot

    https://aistudio.google.com/prompts/new_chat

    Search Study: https://www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php

    Live bench: https://livebench.ai/#/
    Paper: https://arxiv.org/pdf/2406.19314

    LiveCode Bench: https://livecodebench.github.io/

    SWE-Verified: https://arxiv.org/pdf/2310.06770


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    21 min
  • Did AI Just Get Commoditized? Gemini 2.5, New DeepSeek V3, & Microsoft vs OpenAI
    2025/03/25

    Gemini 2.5 is out, on the same day as the new DeepSeek V3 (which should power DeepSeek R2). Do both models prove AI is being commoditized? Let’s find out, on this blockbuster day of AI releases. Plus exclusives from the Information, Simple indications, Vista Bench, LM Arena and more…

    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    01:15 - Gemini 2.5 Benchmarks
    05:46 - Long Context, Simple indication
    07:08 - New DeepSeek V3-0324
    09:11 - Microsoft MAI
    11:48 - 90% of code but new Claude jobs

    ‘World’s most powerful model’: https://x.com/OfficialLoganK/status/1904580368432586975

    Gemini 2.5 Release Notes: https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking

    ‘Commoditized’: https://the-decoder.com/microsoft-ceo-satya-nadella-says-ai-models-are-getting-commoditized/

    Microsoft Information report: https://www.theinformation.com/articles/microsofts-ai-guru-wants-independence-from-openai-thats-easier-said-than-done?rc=sy0ihq

    LMarena: https://x.com/lmarena_ai/status/1904581128746656099/photo/1

    Free for now: https://x.com/btibor91/status/1904578053537476628

    Vista Bench: https://scale.com/leaderboard/visual_language_understanding

    DeepSeek V3: https://huggingface.co/deepseek-ai/DeepSeek-V3-0324

    Claude Plays Pokemon: https://www.twitch.tv/claudeplayspokemon
    Amodei: 100% Coding: https://www.youtube.com/watch?v=esCSpbDPJik&t=3017s

    Anthropic Jobs: https://job-boards.greenhouse.io/anthropic/jobs/4020717008

    Microsoft Money from Onslaught: https://www.972mag.com/microsoft-azure-openai-israeli-army-cloud/

    https://simple-bench.com/

    Release Date Comments: https://x.com/zacharynado/status/1904647277861318979


    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    14 min
  • Manus AI - The Calm Before the Hypestorm … (vs Deep Research + Grok 3)
    2025/03/13

    Is Manus AI the memecoin of the AI world, or legit? I’ll compare it to OpenAI’s Deep Research, Operator, Grok 3 DeepSearch and more to find out. I’ll also let you in on some of the secrets of what makes a good hype campaign, the estimated costs of Manus AI, and where it is strong. Other news (yes, Gemini image editing and research hacking, I mean you), will have to wait for a few more hours, as millions enquire about Manus AI.

    https://app.grayswan.ai/arena

    AI Insiders ($9!): https://www.patreon.com/AIExplained
    Patreon Vid: https://www.patreon.com/posts/4-ai-trends-in-123857767

    Chapters:
    00:00 - Introduction
    00:46 - Hype Campaign
    02:40 - Single, Public Benchmark
    03:12 - What is Manus AI?
    04:22 - Test 1
    05:12 - Cost and Rate Limits
    06:15 - Test 2 vs Deep Research + Grok 3 DeepSearch
    08:24 - Test 3 (not AGI)
    11:10 - 4 Trends in AI in 2025
    11:37 - Hype Works

    Manus AI: https://manus.im/app

    Xiao Hong Interview: https://www.chinatalk.media/p/manus-chinas-latest-ai-sensation

    Gaia Benchmark: https://openreview.net/pdf?id=fibxvahvs3
    MIT Report: https://www.technologyreview.com/2025/03/11/1113133/manus-ai-review/

    Information Report: https://www.theinformation.com/articles/anthropics-claude-drives-strong-revenue-growth-while-powering-manus-sensation?rc=sy0ihq

    Hype Examples: https://x.com/Saboo_Shubham_/status/1898425707401031940
    https://x.com/EHuanglu/status/1899110687902978373
    https://x.com/AJs_AI/status/1898756132384178291

    Mistakes: https://x.com/TheXeophon/status/1898737178273829220

    Tools and Code: https://x.com/peakji/status/1898994802194346408

    https://operator.chatgpt.com/




    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    Podcast: https://aiexplainedopodcast.buzzsprout.com/

    13 min