Apache Parquet (@apacheparquet)'s Twitter Profile
Apache Parquet

@apacheparquet

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
It provides high performance compression

ID: 1342646282

Link: https://parquet.apache.org
Joined: 10-04-2013 19:07:49

364 Tweets

8.8K Followers

26 Following

Julien Le Dem (@j_)

I tweeted this ten years ago today. At the time I didn’t quite realize how much impact this little side project would have. To ten years of Parquet! Thanks to all the people who came along for the ride.

Julien Le Dem (@j_)

It’s happened! The Apache Parquet Java implementation repo is now called parquet-java. Thank you Andrew Lamb for the nudge! This further clarifies that Parquet is used far beyond the Hadoop ecosystem. Maybe whoever created this repo could have thought of this to start with.

Andrew Lamb (@andrewlamb1111)

Turns out Apache Parquet Bloom filters are better than I think many people understand. Trevor Hilton found that for a cost of 2K-8K per row group on high cardinality predicate columns, you can filter all but the exact row group of interest.
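The idea in the tweet above, one small Bloom filter per row group used to skip row groups that cannot contain a value, can be sketched roughly as follows. This is a simplified classic Bloom filter written for illustration, not Parquet's actual split-block layout or hash function. The class and function names, the 2 KiB size (assuming the tweet's "2K-8K" means bytes, which it does not state), and the choice of 4 hash probes are all assumptions.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over a bit array (illustrative, not Parquet's format)."""

    def __init__(self, num_bits=16384, num_hashes=4):
        # 16384 bits = 2 KiB; num_bits is assumed to be a multiple of 8.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Double hashing: derive all probe positions from two 64-bit halves
        # of a single SHA-256 digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def candidate_row_groups(filters, value):
    """Return indexes of row groups whose filter cannot rule out `value`."""
    return [i for i, f in enumerate(filters) if f.might_contain(value)]


# Hypothetical data: 4 row groups of 1000 high-cardinality IDs each.
filters = []
for group in range(4):
    bloom = BloomFilter()
    for n in range(group * 1000, (group + 1) * 1000):
        bloom.add(f"user-{n}")
    filters.append(bloom)

# The row group that really holds the value is always among the candidates;
# other row groups appear only on a (rare) false positive.
print(candidate_row_groups(filters, "user-2500"))
```

A Bloom filter never produces false negatives, so a reader can safely skip every row group that is not returned; false positives only cost an unnecessary read.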

Shubham Chaudhary (@ylogx)

Working with a 10 GB CSV file. Pandas read_csv took 16 min to load the CSV into memory. Converted to Apache Parquet with Apache Arrow. It took 30 sec to read into a pyarrow table and 16 sec to convert to a pandas DataFrame. 16 min => 46 sec! tech.blue-yonder.com/efficient-data…

Mustafa Akın (@mustafaakin)

You do not need Spark to create Apache Parquet files; you can use plain Java, and it can even fit in AWS Lambda for a serverless solution: engineering.opsgenie.com/analyzing-aws-…

Julien Le Dem (@j_)

If you’re a company using open source projects and not sure how to contribute, a release engineer would be a tremendous help. It’s hard to do this properly part time. I have a specific project in mind, if you need a hint.

lucien fregosi (@lulufrego)

Great benchmark comparing Apache Parquet on #hdfs and Apache Kudu: blog.clairvoyantsoft.com/guide-to-using… In short, Kudu is faster than Parquet for random-access queries such as CRUD operations, but slower for analytics queries.