Episodes

  • How Denormalized is Building ‘DuckDB for Streaming’ with Apache DataFusion
    2024/09/13

    Summary
    In this episode, Kostas and Nitay are joined by Amey Chaugule and Matt Green, co-founders of Denormalized. They delve into how Denormalized is building an embedded stream processing engine (think “DuckDB for streaming”) to simplify real-time data workloads. Drawing on their extensive backgrounds at companies like Uber, Lyft, Stripe, and Coinbase, Amey and Matt discuss the challenges of existing stream processing systems like Spark, Flink, and Kafka. They explain how their approach leverages Apache DataFusion to create a single-node solution that reduces the complexities inherent in distributed systems.


    The conversation explores topics such as developer experience, fault tolerance, state management, and the future of stream processing interfaces. Whether you’re a data engineer, application developer, or simply interested in the evolution of real-time data infrastructure, this episode offers valuable insights into making stream processing more accessible and efficient.


    Contacts & Links
    Amey Chaugule
    Matt Green
    Denormalized
    Denormalized GitHub Repo

    Chapters
    00:00 Introduction and Background
    12:03 Building an Embedded Stream Processing Engine
    18:39 The Need for Stream Processing in the Current Landscape
    22:45 Interfaces for Interacting with Stream Processing Systems
    26:58 The Target Persona for Stream Processing Systems
    31:23 Simplifying Stream Processing Workloads and State Management
    34:50 State and Buffer Management
    37:03 Distributed Computing vs. Single-Node Systems
    42:28 Cost Savings with Single-Node Systems
    47:04 The Power and Extensibility of DataFusion
    55:26 Integrating Data Store with DataFusion
    57:02 The Future of Streaming Systems

    Click here to view the episode transcript.


    1 hr 2 min
  • Unifying structured and unstructured data for AI: Rethinking ML infrastructure with Nikhil Simha and Varant Zanoyan
    2024/08/30


    Summary

    In this episode, we dive deep into the future of data infrastructure for AI and ML with Nikhil Simha and Varant Zanoyan, two seasoned engineers from Airbnb and Facebook. Nikhil and Varant share their journey from building real-time data systems and ML infrastructure at tech giants to launching their own venture.

    The conversation explores the intricacies of designing developer-friendly APIs, the complexities of handling both batch and streaming data, and the delicate balance between customer needs and product vision in a startup environment.

    Contacts & Links

    Nikhil Simha
    Varant Zanoyan
    Chronon project

    Chapters

    00:00 Introduction and Past Experiences
    04:38 The Challenges of Building Data Infrastructure for Machine Learning
    08:01 Merging Real-Time Data Processing with Machine Learning
    14:08 Backfilling New Features in Data Infrastructure
    20:57 Defining Failure in Data Infrastructure
    26:45 The Choice Between SQL and Data Frame APIs
    34:31 The Vision for Future Improvements
    38:17 Introduction to Chronon and Open Source
    43:29 The Future of Chronon: New Computation Paradigms
    48:38 Balancing Customer Needs and Vision
    57:21 Engaging with Customers and the Open Source Community
    01:01:26 Potential Use Cases and Future Directions

    Click here to view the episode transcript.

    1 hr 2 min
  • Stream processing, LSMs and leaky abstractions with Chris Riccomini
    2024/08/23

    Overview

    In this episode, we chat with Chris Riccomini about the evolution of stream processing and the challenges of building applications on streaming systems. We also cover leaky abstractions, good and bad API designs, what Chris loves and hates about Rust, and his new project, which involves object storage and LSMs.

    Connect with Chris at:
    LinkedIn
    X
    Blog
    Materialized View Newsletter - His newsletter
    The Missing README - His book
    SlateDB - His latest OSS Project

    Chapters
    00:00 Introduction and Background
    04:05 The State of Stream Processing Today
    08:53 The Limitations of SQL in Streaming Systems
    14:00 Prioritizing the Developer Experience in Stream Processing
    18:15 Improving the Usability of Streaming Systems
    27:54 The Potential of State Machine Programming in Complex Systems
    32:41 The Power of Rust: Compiling and Language Bindings
    34:06 The Shift from Sidecar to Embedded Libraries Driven by Rust
    35:49 Building an LSM on Object Storage: Cost-Effective State Management
    39:47 The Unbundling and Composable Nature of Databases
    47:30 The Future of Data Systems: More Companies and Focus on Metadata


    Click here to view the episode transcript.

    53 min