Wes McKinney (@wesmckinn) Twitter Tweets • TwiCopy

DuckDB

10 months ago

New blog post: "Query Engines: Gatekeepers of the Parquet File Format". In this post, Laurens Kuiper argues that we are wasting a lot of bits by not using the Parquet format to its full extent – a limitation caused by the lack of support for Parquet features in some systems.

196 likes, 5 replies, 39 reposts

Wes McKinney

@wesmckinn

10 months ago

So many of these things were just a twinkle in our eye in 2015, but happy to see them coming together so nicely!

11 likes, 1 reply, 0 reposts

Wes McKinney

@wesmckinn

8 months ago

Insightful post on why Apache Iceberg may not be a one-size-fits-all solution when it comes to a table format to manage large multimodal ML/AI datasets

28 likes, 0 replies, 3 reposts

Anthony Goldbloom

@antgoldbloom

8 months ago

I've been using a data science agent called Vincent for the past few months and really like it! It works natively with Jupyter notebooks in VSCode: marketplace.visualstudio.com/items?itemName… Write a prompt and it creates a first draft of the notebook. Data science use cases are narrow enough that it

41 likes, 2 replies, 5 reposts

Neon - Serverless Postgres

@neondatabase

8 months ago

We’ve partnered with ParadeDB to bring pg_search to all Neon databases. 💥 This extension delivers Elasticsearch-grade full text search without leaving Postgres. Benchmark results here 👇, summary in 🧵 neon.tech/blog/pgsearch-…

89 likes, 4 replies, 8 reposts

Akshay Agrawal

@akshaykagrawal

8 months ago

I've spent the past 3 years working with myles and Dylan Madisetti to fix Python notebooks — version with Git, run as scripts, reuse as modules. Why marimo stores notebooks as Python, not JSON: marimo.io/blog/python-no…

93 likes, 4 replies, 22 reposts

Rerun

@rerundotio

8 months ago

1/ We just raised $17M to build the multimodal data stack for Physical AI! 🚀 Lead: Point Nine 🇺🇦 With: @CostanoaVC, Sunflower Capital, seedcamp. Angels including: Guillermo Rauch, Eric Jang, Oliver Cameron, Wes McKinney, Nicolas Dessaigne, Arnav Bimbhet. Thesis: rerun.io/blog/physical-…

145 likes, 11 replies, 27 reposts

Steve Yegge

@steve_yegge

8 months ago

Hi all, I just dropped a new blog post: sourcegraph.com/blog/revenge-o… This one's a beehive-kicker for sure. Hope you like it and find it enlightening, even if you don't agree with all of it.

164 likes, 11 replies, 42 reposts

Bessemer

@bessemervp

8 months ago

The lakehouse paradigm represents a radical transformation in data architectures, welcoming in an era of unprecedented interoperability. The next wave of multi-billion-dollar infrastructure giants is here ⤵️ Read on from Janelle Teng & Lauri Moore: bvp.com/atlas/roadmap-…

22 likes, 3 replies, 6 reposts

Pete Soderling

@petesoder

8 months ago

Take the ferry to Data Council, but beware the DATA KRAKEN. Open water. No traffic. Just Wi-Fi, a full bar and a smooth ride. p.s. Your Clipper Card works on the ferry. Add to your Apple Wallet. p.p.s. Blue Bottle Coffee at the Ferry Building opens at 6:30am. 📅 April 22-24 |

20 likes, 1 reply, 9 reposts

Wes McKinney

@wesmckinn

8 months ago

I’m excited about xorq! Ibis and DataFusion brought together to orchestrate multi-engine data pipelines, all powered by ApacheArrow github.com/xorq-labs/xorq

99 likes, 1 reply, 13 reposts

ABC

@ubunta

7 months ago

xorq - An exciting tool in Modern Data Engineering, built on top of Ibis, DataFusion and technically ApacheArrow. xorq was developed to give Python developers a more ergonomic way to build, cache, and serve pipelines, without getting locked into a single engine. 1. Simplifying

30 likes, 0 replies, 5 reposts

Andrew Lamb

@andrewlamb1111

7 months ago

World's Fastest TPCH Data Generator, courtesy of ApacheDataFusion's community. Scale Factor 100 in under 2 minutes on a MacBook Air. Open source, no-dependency Rust. Thanks to CMU Database Group and Wan Shen Lim (@wslim.bsky.social) for the inspiration datafusion.apache.org/blog/2025/04/1… youtube.com/watch?v=UYIC57…

94 likes, 3 replies, 12 reposts

Bauplan

@bauplan_labs

7 months ago

🚀 Introducing Bauplan A serverless, code-native platform for building data and AI pipelines — directly on your object store. No clusters. No notebooks. No GUI based workflows. Just Python + SQL + S3. 👉 bauplanlabs.com/blog/hello-bau…

79 likes, 6 replies, 24 reposts

Andrew Lamb

@andrewlamb1111

7 months ago

20x faster TPCH data generator available via pip install: pip install tpchgen-cli Blog from Kevin Liu: kevinjqliu.github.io/blog/posts/tpc…

26 likes, 0 replies, 6 reposts

CedarDB

@cedar_db

6 months ago

CedarDB Community Edition is here! Download CedarDB Community Edition today - no paywall, no signup, just pure performance. Read more about CedarDB on our blog: cedardb.com/blog/launch/

47 likes, 0 replies, 10 reposts

Andrew Lamb

@andrewlamb1111

6 months ago

😍 > To the ApacheDataFusion Community: The intermediate representation of the SQL compiler is the DataFusion logical plan which has proven to be pragmatic, extensible, and easy to work with in all the right ways. github.com/dbt-labs/dbt-f…

75 likes, 0 replies, 10 reposts

Wes McKinney

@wesmckinn

6 months ago

With last week's DuckLake announcement and prior explorations of a DuckDB-powered data lake such as "DuckHouse" (Flight + DuckDB using xorq), we are heading in some interesting directions: juhache.substack.com/p/from-duckdb-…

60 likes, 1 reply, 7 reposts

Andrew Lamb

@andrewlamb1111

5 months ago

Project from someone at Apple about building a distributed in-memory cache using ApacheDataFusion LinkedIn: linkedin.com/posts/andrey-v… Design: docs.google.com/document/d/1xj…

128 likes, 2 replies, 17 reposts

Anthony Goldbloom

@antgoldbloom

5 months ago

Nice bumping into two data science legends (Hadley Wickham and Wes McKinney) at the Databricks conference on the 18th birthday of ggplot2

59 likes, 1 reply, 4 reposts