Hi John,
Thanks for following up on this non-VFP thread. Yes, I am transitioning to a new data-mangling environment after so many years, having started with Fox in 1985, or 1984 if I remember correctly. That's long ago. Worth taking the time to discuss the matter a bit!
>Right... since with 600 million rows, if there's (say) 200 bytes per row then presumably (unless my math is way off!) you'll need around 110GB Memory
I did check the benchmarking figures provided by the DuckDB team. I tend to trust them, based on what they promised and delivered over their first two years of existence. I started following them before they became a tech star on GitHub and elsewhere. You can trust them both as people and as an organization (not really funded, by the way!).
Back to the tests: I just realized that their baseline machine is a formidable Apple machine with brilliant memory access. You certainly understand why that matters! Both the amount of memory at hand and the speed of access count here.
You may be right about this 0.6 sec time. There may be some error there, or some sort of "lazy loading" at play. But on the whole, run a test and I expect you will be blown away by the time it takes to load CSV (or better, Parquet) files from scratch and SQL-mangle them!
One thing is for sure: "big data" tech now accommodates "zero-copy integration". Here it happens between DuckDB and Apache Arrow, and it is an impressive technology that is starting to power AI, genomic research, data science and more. Interesting times, as they say :-)
An already ancient introduction to "zero copy":
https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a
The "zero-copy" integration between this modern-day fox (DuckDB) and Arrow (I'll be using it in the coming years):
https://duckdb.org/2021/12/03/duck-arrow.html
NumPy, Numba, DuckDB, Arrow... I have learnt quite a few things dipping my toe into the recent big-data water à la Python :-)
Daniel