Episodes

  • (LLM Benchmark-OpenAI) BrowseComp: Web Browsing Benchmark for AI Agents
    2025/06/12

    Welcome to our podcast! Today, we're diving into BrowseComp, a groundbreaking new benchmark for browsing agents, recently submitted to arXiv by Jason Wei and his team at OpenAI. BrowseComp's novelty lies in its design: it's a simple yet challenging benchmark comprising 1,266 questions that demand agents persistently navigate the internet to unearth hard-to-find, entangled information. Despite the inherent difficulty, predicted answers are short and easily verifiable, keeping the benchmark practical to grade. It measures a core capability, persistence and creativity in information discovery, serving browsing agents much as programming competitions serve coding agents.
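
    That short-verifiable-answer design means grading can be largely mechanical. Here is a minimal sketch, assuming simple string normalization (not necessarily the benchmark's exact grading procedure):

      def normalize(ans: str) -> str:
          """Lowercase and collapse whitespace so formatting differences don't fail a match."""
          return " ".join(ans.lower().split())

      def grade(predicted: str, reference: str) -> bool:
          """Exact match on normalized short answers."""
          return normalize(predicted) == normalize(reference)

      print(grade(" Jason  Wei ", "jason wei"))  # True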

    Crucially, we also consider its limitations. BrowseComp is a useful but ultimately incomplete benchmark. It intentionally sidesteps common complexities found in real user queries, such as the generation of long answers or the resolution of ambiguity. Thus, while an excellent measure of web navigation proficiency, it doesn't cover the full spectrum of user interaction. Nevertheless, its primary application is to rigorously evaluate an agent's foundational browsing skills. This benchmark is poised to significantly advance the field of AI agents capable of intelligent web exploration.

    Find the paper: arXiv:2504.12516 or https://doi.org/10.48550/arXiv.2504.12516.

    10 min
  • Text2Tracks: Music Recommendation via Generative Retrieval
    2025/06/11

    Natural language prompts are changing how we ask for music recommendations. Users want to say things like, "Recommend some old classics for slow dancing". But traditional LLMs typically generate song titles as plain text, which has drawbacks: each title must still be resolved to an actual track in the catalogue, and that extra step is inefficient.

    Introducing Text2Tracks from Spotify! This novel research tackles prompt-based music recommendation using generative retrieval. Instead of generating titles, Text2Tracks is trained to directly output relevant track IDs based on your text prompt.

    A critical finding is that how you represent the track IDs makes a huge difference. Using semantic IDs derived from collaborative filtering embeddings proved most effective, significantly outperforming older methods like using artist and track names. This approach boosts effectiveness (48% increase in Hits@10) and efficiency (7.5x fewer decoding steps).
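
    As a rough illustration of the semantic-ID idea (not Spotify's actual pipeline; the hierarchical residual-quantization flavour below is just one way to turn collaborative-filtering embeddings into coarse-to-fine ID tokens a model can decode):

      import numpy as np
      from sklearn.cluster import KMeans

      def semantic_ids(track_embeddings, levels=2, k=8):
          """Assign each track a coarse-to-fine ID: one cluster token per level."""
          ids = np.zeros((len(track_embeddings), levels), dtype=int)
          residual = track_embeddings.astype(float).copy()
          for level in range(levels):
              km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(residual)
              ids[:, level] = km.labels_
              # Residual quantization: remove each centroid before the next level.
              residual = residual - km.cluster_centers_[km.labels_]
          return ids

      emb = np.random.default_rng(0).normal(size=(1000, 32))  # stand-in for CF embeddings
      print(semantic_ids(emb)[:3])

    Because similar tracks share prefix tokens, a decoder needs only a few steps per track, which is where the fewer-decoding-steps efficiency comes from.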

    While finding an effective ID representation was the key challenge explored, Text2Tracks ultimately outperforms traditional retrieval methods, making it a powerful new model particularly suited to conversational recommendation scenarios.

    Paper link: https://arxiv.org/pdf/2503.24193

    13 min
  • (LLM Explain-Apple) The Illusion of Thinking
    2025/06/09

    Today we examine "The Illusion of Thinking" from Apple, a critical study of Large Reasoning Models (LRMs), the models behind recent "thinking" modes. Instead of relying on familiar maths benchmarks, where contamination muddies results, the researchers evaluate models on controllable puzzle environments such as the Tower of Hanoi, where problem complexity can be varied precisely. The novelty is a clean characterisation of three regimes: at low complexity standard LLMs can actually do better, at medium complexity reasoning traces help, and beyond a threshold accuracy collapses entirely, with models counterintuitively spending less reasoning effort as problems get harder. A limitation is the focus on puzzle domains, which may not reflect messier real-world reasoning. The findings matter for anyone deploying reasoning models in practical applications.

    Paper link: https://machinelearning.apple.com/research/illusion-of-thinking

    13 min
  • (LLM Scaling-Meta) MEGABYTE: Modelling Million-byte Sequences with Transformers
    2025/06/08

    Explore MEGABYTE from Meta AI, a novel multi-scale transformer architecture designed to tackle the challenge of modelling sequences of over one million bytes. Traditional large transformer decoders scale poorly to such lengths due to the quadratic cost of self-attention and the expense of large feedforward layers per position, limiting their application to long sequences like high-resolution images or books.

    MEGABYTE addresses this by segmenting sequences into patches, employing a large global model to process relationships between patches and a smaller local model for prediction within patches. This design leads to significant advantages, including sub-quadratic self-attention cost, the ability to use much larger feedforward layers for the same computational budget, and improved parallelism during generation. Crucially, MEGABYTE enables tokenization-free autoregressive sequence modelling at scale, simplifying processing and offering an alternative to methods that can lose information or require language-specific heuristics.
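
    As a loose sketch of that two-scale decomposition (toy dimensions; causal masking and the paper's exact embedding scheme are omitted, and the class below is illustrative, not Meta's code):

      import torch
      import torch.nn as nn

      class ToyMegabyte(nn.Module):
          """Global model over patch embeddings, local model over bytes within each patch."""
          def __init__(self, patch_size=8, d_global=512, d_local=128, vocab=256):
              super().__init__()
              self.patch_size = patch_size
              self.byte_embed = nn.Embedding(vocab, d_local)
              self.patch_proj = nn.Linear(patch_size * d_local, d_global)
              self.global_model = nn.TransformerEncoder(
                  nn.TransformerEncoderLayer(d_global, nhead=8, batch_first=True), 2)
              self.local_in = nn.Linear(d_global, d_local)
              self.local_model = nn.TransformerEncoder(
                  nn.TransformerEncoderLayer(d_local, nhead=4, batch_first=True), 2)
              self.head = nn.Linear(d_local, vocab)

          def forward(self, byte_ids):                   # (B, T), T divisible by patch_size
              B, T = byte_ids.shape
              P = T // self.patch_size
              x = self.byte_embed(byte_ids)              # (B, T, d_local)
              patches = x.reshape(B, P, -1)              # concatenate the bytes of each patch
              g = self.global_model(self.patch_proj(patches))  # attention is patch-to-patch
              g = self.local_in(g).repeat_interleave(self.patch_size, dim=1)
              local = (x + g).reshape(B * P, self.patch_size, -1)
              local = self.local_model(local)            # cheap within-patch prediction
              return self.head(local.reshape(B, T, -1))  # (B, T, vocab) next-byte logits

      print(ToyMegabyte()(torch.randint(0, 256, (2, 64))).shape)  # torch.Size([2, 64, 256])

    Because self-attention in the global model runs over T/patch_size positions rather than T, its quadratic cost shrinks accordingly; that is the sub-quadratic claim in miniature.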

    The architecture demonstrates strong performance across various domains, competing with subword models on long context language modelling, achieving state-of-the-art density estimation on ImageNet, and effectively modelling audio from raw files. While promising, the current experiments are conducted at a scale below the largest state-of-the-art language models, indicating that future work is needed to fully explore scaling MEGABYTE to even larger models and datasets.

    Learn how MEGABYTE is advancing the frontier of efficient, large-scale sequence modelling.

    Paper link: https://proceedings.neurips.cc/paper_files/paper/2023/file/f8f78f8043f35890181a824e53a57134-Paper-Conference.pdf

    16 min
  • (LLM Scaling-Meta) Byte Latent Transformer: Patches Scale Better Than Tokens
    2025/06/07

    Tune in to explore the Byte Latent Transformer (BLT), a groundbreaking new architecture from FAIR at Meta. Unlike traditional large language models that rely on fixed vocabularies and tokenizers, BLT is tokenizer-free, learning directly from raw bytes. Its novelty lies in dynamically grouping bytes into patches based on data complexity, allowing it to allocate compute efficiently.
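
    A toy illustration of entropy-driven patching (the paper derives entropies from a small byte-level LM; the next_byte_probs argument below is a hypothetical stand-in for it):

      import math

      def entropy(probs):
          """Shannon entropy (bits) of a next-byte distribution."""
          return -sum(p * math.log2(p) for p in probs if p > 0)

      def patch_boundaries(byte_seq, next_byte_probs, threshold=2.0):
          """Open a new patch wherever the model is uncertain about the next byte."""
          boundaries = [0]
          for i in range(1, len(byte_seq)):
              if entropy(next_byte_probs(byte_seq[:i])) > threshold:
                  boundaries.append(i)  # high entropy: hard region, smaller patches
          return boundaries

      # Stub LM that is maximally uncertain right after a space, confident elsewhere.
      stub = lambda prefix: [1 / 256] * 256 if prefix[-1:] == b" " else [0.9] + [0.1 / 255] * 255
      print(patch_boundaries(b"hello world and more", stub))  # [0, 6, 12, 16]

    Easy stretches get long patches (fewer decoding steps); unpredictable stretches get short ones, which is how compute follows data complexity.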

    BLT matches or surpasses the performance of state-of-the-art tokenization-based models like Llama 3 at scale, while offering significant improvements in inference efficiency, potentially using up to 50% fewer FLOPs. It also provides enhanced robustness to noisy inputs and superior character-level understanding, excelling in tasks like orthography, phonology, and low-resource machine translation. Furthermore, BLT introduces a new scaling dimension, enabling simultaneous increases in model and patch size while maintaining a fixed inference budget.

    Current limitations include the need for further research on BLT-specific scaling laws and potentially improving wall-clock efficiency. Join us to learn how this dynamic, byte-level approach could shape the future of language models!

    Find the paper here: https://arxiv.org/pdf/2412.09871

    18 min
  • (LLM Explain-Stanford) From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
    2025/06/03

    Welcome to a deep dive into the fascinating world of AI and human cognition. Our focus today is on the arXiv paper, "From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning," authored by Chen Shani, Dan Jurafsky, Yann LeCun, and Ravid Shwartz-Ziv. This research introduces a novel information-theoretic framework, applying principles from Rate-Distortion Theory and the Information Bottleneck to analyse how Large Language Models (LLMs) represent knowledge compared to humans. By quantitatively comparing token embeddings from diverse LLMs against established human categorization benchmarks, the study offers unique insights into their respective strategies.
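
    To make the flavour of that framework concrete, here is a rough sketch (not the authors' exact formulation) of scoring a clustering of embeddings by a rate term plus a distortion term:

      import numpy as np

      def rd_score(embeddings, labels, beta=1.0):
          """Rate = entropy of the cluster marginal; distortion = mean squared
          distance to cluster centroids. Lower is better for a fixed beta."""
          labels = np.asarray(labels)
          values, counts = np.unique(labels, return_counts=True)
          p = counts / counts.sum()
          rate = -(p * np.log2(p)).sum()
          centroids = {v: embeddings[labels == v].mean(axis=0) for v in values}
          distortion = np.mean([np.sum((e - centroids[l]) ** 2)
                                for e, l in zip(embeddings, labels)])
          return rate + beta * distortion

      rng = np.random.default_rng(0)
      emb = rng.normal(size=(100, 16))
      coarse = np.zeros(100, dtype=int)   # one big cluster: minimal rate, high distortion
      fine = np.arange(100) % 10          # ten clusters: higher rate, lower distortion
      print(rd_score(emb, coarse), rd_score(emb, fine))

    In these terms, the paper's claim is that LLM embeddings sit at aggressively low-rate points of this trade-off, while human categories accept a higher rate to preserve nuance.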

    The findings reveal key differences. While LLMs are effective at statistical compression and forming broad conceptual categories that align with human judgement, they show a significant limitation: struggling to capture the fine-grained semantic distinctions crucial for human understanding. Fundamentally, LLMs display a strong bias towards aggressive compression, whereas human conceptual systems prioritise adaptive nuance and contextual richness, even if this results in lower compression efficiency by the measures used.

    These insights illuminate critical distinctions between current AI and human cognitive architectures. The research has important implications, guiding pathways towards developing LLMs with conceptual representations more closely aligned with human cognition, potentially enhancing future AI capabilities. Tune in to explore this vital trade-off between compression and meaning.

    Paper Link: https://doi.org/10.48550/arXiv.2505.17117

    16 min
  • (LLM Explain-Anthropic) On the Biology of a Large Language Model
    2025/06/01

    "On the Biology of a Large Language Model" from Anthropic presents a novel investigation into the internal mechanisms of Claude 3.5 Haiku using circuit tracing methodology. Analogous to biological research, this approach employs tools like attribution graphs to reverse engineer the model's computational steps. The research offers insights into diverse model capabilities, such as multi-step reasoning, planning in poems, multilingual circuits, addition, and medical diagnoses. It also examines mechanisms underlying hallucinations, refusals, jailbreaks, and hidden goals. This work aims to reveal interpretable intermediate computations, highlighting its potential in areas like safety auditing.

    However, the methods have significant limitations. They provide detailed insights for only a fraction of prompts, capture just a small part of the model's immense complexity, and rely on imperfect replacement models. They struggle with complex reasoning chains, long prompts, and explaining inactive features. A key challenge is understanding the causal role of attention patterns.

    Despite these limitations, this research represents a valuable stepping stone towards a deeper understanding of how large language models function internally and presents a challenging scientific frontier.

    Paper link: https://transformer-circuits.pub/2025/attribution-graphs/biology.html

    16 min
  • (LLM Security-Meta) LlamaFirewall: AI Agent Security Guardrail System
    2025/05/31

    Listen to this podcast to learn about LlamaFirewall, an innovative open-source security framework from Meta. As large language models evolve into autonomous agents capable of performing complex tasks like editing production code and orchestrating workflows, they introduce significant new security risks that existing measures don't fully address. LlamaFirewall is designed to serve as a real-time guardrail monitor, providing a final layer of defence against these risks for AI Agents.

    Its novelty stems from its system-level architecture and modular, layered design. It incorporates three powerful guardrails: PromptGuard 2, a universal jailbreak detector showing state-of-the-art performance; AlignmentCheck, an experimental chain-of-thought auditor inspecting reasoning for prompt injection and goal misalignment; and CodeShield, a fast and extensible online static analysis engine preventing insecure code generation. These guardrails are tailored to address emerging LLM agent security risks in applications like travel planning and coding, offering robust mitigation.
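
    A minimal sketch of that layered idea, with hypothetical check functions standing in for PromptGuard 2, AlignmentCheck, and CodeShield (this is not the framework's real API):

      from dataclasses import dataclass
      from typing import Callable, List

      @dataclass
      class Verdict:
          allowed: bool
          reason: str = ""

      def run_guardrails(payload: str, guardrails: List[Callable[[str], Verdict]]) -> Verdict:
          """Apply each layer in order; the first layer to block wins (defence in depth)."""
          for check in guardrails:
              verdict = check(payload)
              if not verdict.allowed:
                  return verdict
          return Verdict(True)

      # Hypothetical stand-ins for the three layers described above:
      def prompt_guard(text):     # jailbreak / injection classifier
          return Verdict("ignore previous instructions" not in text.lower(), "possible injection")

      def alignment_check(text):  # chain-of-thought audit (the real one uses a capable LLM)
          return Verdict("exfiltrate" not in text.lower(), "possible goal misalignment")

      def code_shield(text):      # static analysis for insecure code patterns
          return Verdict("os.system(" not in text, "insecure code pattern")

      print(run_guardrails("book a flight to Tokyo", [prompt_guard, alignment_check, code_shield]))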

    However, CodeShield is not fully comprehensive and may miss nuanced vulnerabilities. AlignmentCheck requires large, capable models, which can be computationally costly, and faces the potential risk of guardrail injection. Meta is actively developing the framework, exploring future work like expanding to multimodal agents and improving latency. LlamaFirewall aims to provide a collaborative security foundation for the community.

    Learn more here

    17 min