Fast, Universal Data Access with Apache Arrow ADBC

#pyconjp_4 / Beginner / English

13:30 - 14:00 (30 min)

DAY 1, 09/26 (FRI)

Data scientists and AI engineers today have more tools than ever at their disposal. But more choices have also led to fragmentation, even when it comes to the basic task of loading data. With Apache Arrow ADBC (Arrow Database Connectivity), anyone working with databases in Python can access systems from BigQuery to PostgreSQL to Snowflake through familiar DB-API interfaces, with maximum performance. We’ll show how ADBC makes it easy and fast to work with different data sources and different engines (like pandas and Polars). We’ll also look under the hood, with an introduction to the Apache Arrow ecosystem, which sits at the heart of modern data science and AI.

Resources

https://github.com/lidavidm/pyconjp2025

Slide deck

https://pretalx.com/media/pycon-jp-2025/submissions/Q7YYNZ/resources/Fast_Univer_VW8iPm0.pdf

Talk details / Description

Apache Arrow is an open source project developing a high-performance, language-independent standard for in-memory representation of columnar data. Since its start in 2016, it has grown to include related projects, from efficient file formats for storing data tables to RPC frameworks for developing data services. Many vendors and open source projects depend on or interoperate via Arrow or one of its subprojects, including pandas, Polars, PyIceberg, Snowflake, Apache Spark, and more.

One of the Arrow subprojects is ADBC (Arrow Database Connectivity), which provides language-independent APIs for interacting with data systems (including databases and data warehouses) using Arrow data. The project aims to replace JDBC and ODBC for data science, OLAP, and AI/ML applications, which demand more performance than existing JDBC and ODBC implementations can fundamentally provide.

ADBC embraces polyglot development environments and zero-copy interfaces, which allows Pythonistas to more easily benefit from cutting-edge developments in data science and AI/ML. Thanks to Apache Arrow, ADBC users in Python can leverage drivers written in C++, Go, Rust, and other languages, giving them high-performance access to Snowflake, Databricks, BigQuery, and PostgreSQL. Moreover, users can continue using familiar APIs (DB-API 2.0), while developers don’t have to provide Python-specific bindings. Arrow and ADBC also let Python users interoperate easily with libraries like Polars that have become popular for data science.

Why I chose this topic

I am a long-time contributor to Apache Arrow and one of the creators of the ADBC subproject. I would like to spread awareness of Arrow and ADBC as I think they will become more and more useful to modern data science/data engineering workflows, especially as more vendors join the ecosystem. Microsoft, Snowflake, Databricks, and other vendors have already adopted Arrow and ADBC in their products.

Concrete knowledge and know-how the audience can take away

(1) An introduction to Apache Arrow and its broader ecosystem
(2) How to use ADBC to work with different databases (bulk ingest and bulk export of datasets)
(3) A demonstration of the performance improvements ADBC provides

Prerequisite knowledge expected of the audience

Some basic familiarity with any data science or data engineering tasks, e.g. using dataframes in pandas or Polars, basic SQL, or working with DuckDB.

David Li

Profile

I am an engineer at Columnar Technologies. Previously I worked at Voltron Data (formerly known as Ursa Computing) and Two Sigma Investments. I'm a longtime open source contributor; I have worked on the Apache Arrow project since 2019 and was one of the creators of the ADBC subproject. I am currently a PMC member of the Arrow project.
