David Li
Data scientists and AI engineers today have more tools than ever at their disposal. But more choices have also led to fragmentation, even when it comes to the basic task of loading data. With Apache Arrow ADBC (Arrow Database Connectivity), anyone working with databases in Python can access systems from BigQuery to PostgreSQL to Snowflake through familiar DB-API interfaces and with high performance. We’ll show how ADBC makes it easy and fast to work with different data sources and different engines (like pandas and Polars). We’ll also look under the hood, with an introduction to the Apache Arrow project at the heart of the modern data science and AI ecosystems.
Apache Arrow is an open source project developing a high-performance, language-independent standard for in-memory representation of columnar data. Since its start in 2016, it has expanded to include related subprojects, from efficient file formats for storing data tables to RPC frameworks for developing data services. Many vendors and open source projects depend on or interoperate via Arrow or one of its subprojects, including pandas, Polars, PyIceberg, Snowflake, Apache Spark, and more.
One of the Arrow subprojects is ADBC (Arrow Database Connectivity), which provides language-independent APIs for interacting with data systems (including databases and data warehouses) using Arrow data. The project aims to replace JDBC and ODBC for data science, OLAP, and AI/ML applications, which demand more performance than existing JDBC and ODBC implementations can fundamentally provide.
ADBC embraces polyglot development environments and zero-copy interfaces, which allows Pythonistas to more easily benefit from cutting-edge developments in data science and AI/ML. Thanks to Apache Arrow, ADBC users in Python can leverage drivers written in C++, Go, Rust, and other languages, giving them high-performance access to Snowflake, Databricks, BigQuery, and PostgreSQL. Moreover, users can continue using familiar APIs (DB-API 2.0), while developers don’t have to provide Python-specific bindings. Arrow and ADBC also let Python users interoperate easily with libraries like Polars that have become popular for data science.
Profile
I am an engineer at Columnar Technologies. Previously I worked at Voltron Data (formerly known as Ursa Computing) and Two Sigma Investments. I'm a longtime open source contributor; I have worked on the Apache Arrow project since 2019 and was one of the creators of the ADBC subproject. I am currently a PMC member of the Arrow project.