Replacing a Fox with a Duck... Beware, that's a killer bee
Message
From: 30/03/2023 06:45:29
To: 29/03/2023 22:58:21
John Ryan
Captain-Cooker Appreciation Society
Taumata Whakatangi ..., New Zealand
General information
Forum: Python
Category: Other, Miscellaneous
Thread ID: 01686381
Message ID: 01686423
Views: 63
Hi John,

>It is definitely loading all 600m lines, or could there be a lazy load like VFP's that pulls chunks of data as required?

No, we are indeed talking a full, immediate load into some sort of "Rushmore 2023 tech": a local database that you will be able to explore SQL-wise. The data store can be handled as a set of "in-memory" cursors (that's how I currently use it) or as a potentially long-term, persistent disk-based DB file.

The compelling advantage: we are talking a different order of magnitude in speed compared to most alternatives, including server-based ones. That speed advantage does, of course, depend entirely on how much memory your database is allowed to use...

Every bit of your machine configuration (CPU + caching + RAM + SSD + drive) will certainly have an impact. But I ran my own set of tests on the legacy hardware I still use here, and found the results mind-blowing. Not the kind of speed you get on recent best-of-breed machines (Apple hardware currently looks to be on top), but still more than impressive, astounding even!

As I said, an "order of magnitude" change. You'd believe you're back in the late 80s/early 90s, when hardware was changing gear that fast. John, you certainly remember it as I do :-)

Why it is this fast, technically:
1- the database format is "columnar", a paradigm shift in terms of OLAP,
https://en.wikipedia.org/wiki/Column-oriented_DBMS
2- the engine parallelizes massively, and very effectively, on any recent CPU (only on large amounts of data, of course, not on a couple of records...); that applies to a lot of operations, including imports (CSVs, Parquet files) and arbitrary data-mangling stuff,
https://duckdb.org/2022/03/07/aggregate-hashtable.html
3- the code base is recent and open source, and the dev team is brilliant, modest and open to external input, both in terms of specs (I can testify!) and the C++ code base,
https://github.com/duckdb/duckdb/pull/5194

Why they launched the project, marketing-wise:
1- the duckdb team discovered that the current "big data" / "data scientist" community has shunned SQL engines, for various reasons; most SQL engines are server-based, page-and-record-based beasts that are certainly not OLAP-centric,
2- they felt that an equivalent of SQLite could bring SQL flexibility into this loop (big data, data science, genomics, scientific GIS...) when it comes to handling gigabytes in a comfortable way.

In short, they thought of some sort of 2023 rushmore engine :-)

DuckDB cannot provide every bit one may need in terms of data mangling. I discovered that record-based operations are certainly not as fast on a columnar DB engine as what you can reach on more classic page-and-record or plain-btree engines à la dBase...

You can also feed the beast fast from "big data" sources using impressive "zero-copy" technology. "Zero-copy"? A tech I discovered during my short introductory voyage into this new IT ecosystem. Here is how DuckDB uses it to tackle Apache Arrow datasets (Parquet files and others):

https://duckdb.org/2021/12/03/duck-arrow.html

You can also push the output straight into today's fast array technologies such as Python's NumPy, much in the way I currently push VFP cursors into VFP cursors.

NumPy is Python array technology that can certainly beat our beloved VFP arrays' performance by a large, large margin.

Making my workstation CPU as hot as gaming stuff... That's something I was never able to do in VFP, where all my stuff runs single-threaded. As a Pythonista, I can use the "numba" LLVM engine to run fully parallelized operations on those arrays, using all workstation cores, with zero C code and zero understanding of OpenMP. I'm no coding genius, just a VFP-er. But I try to treat myself :-)

Daniel