Replacing a Fox with a Duck... Beware that's a killer be
General information
Posted: 30/03/2023 20:06:51 (replying to a message of 30/03/2023 06:45:29)
Forum: Python
Category: Other, Miscellaneous
Thread ID: 01686381
Message ID: 01686427
Views: 53
>Why is it this fast, technically:
>1- the database format is "columnar", a paradigm shift in terms of "olap",
>https://en.wikipedia.org/wiki/Column-oriented_DBMS
>2- the engine parallelizes massively on any recent CPU in a very effective way (only on large amounts of data of course, not a couple of records...); that applies to a lot of operations, including imports (CSVs, Parquet files) and arbitrary data-mangling stuff,

Hmmm, SSD bandwidth exploding past HDDs, first on SATA, then via PCIe 2, 3, 4 and 5,
is probably needed to feed multiple threads / cores. An old HDD, needing to reposition before each buffered read,
might show less stellar performance on CSV reads.
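For what it's worth, the parallel import is easy to poke at from Python. A minimal sketch, assuming the duckdb package and a local file big_table.csv (file name and thread count are illustrative):

import duckdb

# One in-memory connection; DuckDB's CSV reader splits a large file
# across worker threads on its own.
con = duckdb.connect()
con.execute("SET threads TO 8;")  # cap or raise the worker count

# read_csv_auto sniffs the dialect and scans the file in parallel
n = con.execute(
    "SELECT count(*) FROM read_csv_auto('big_table.csv')"
).fetchone()[0]
print(n)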

>https://duckdb.org/2022/03/07/aggregate-hashtable.html

Blast from the past! Dunno if you remember, but about a dozen years ago there was a DB competition run by a magazine. One user decided to enter vfp code and reached 2nd place.
The biggest hurdle was to implement a fast solution for a distinct count over a given table.
I implemented a few versions with CRC-sys and a few hash functions in C.
Etecnologica back then tried to build a vfp clone and asked me if they could "borrow" the idea ;-))
DuckDB going into detail on hash collisions is, for me, clear proof of their smarts.
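For readers who weren't there: the core of that distinct-count trick is just a hash table. A minimal Python sketch of the idea, with the built-in set standing in for the hand-rolled CRC/hash C versions (the set handles collisions internally):

def distinct_count(values):
    """Count distinct values via a hash table -- the same idea as the
    hand-rolled CRC/hash C versions, one pass over the data."""
    seen = set()
    for v in values:
        seen.add(v)   # hash, probe, insert if new
    return len(seen)

# e.g. distinct_count(row[0] for row in some_cursor) -> number of unique keys
print(distinct_count(["a", "b", "a", "c"]))  # 3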

>3- the code base is recent and open source, and the dev team is brilliant, modest and open to external input, both in terms of specs (I can testify!) and of the C++ code base,
>https://github.com/duckdb/duckdb/pull/5194

Too much on the page - what should one look for?
>
>why is this fast, marketing-wise (why they launched the project):
>1- the duckdb team has discovered that the current "big data" / "data scientist" community has shunned sql engines, for some reasons; most sql engines are server-based, page-and-record-based beasts that are certainly not olap-centric,

UUUUhm... HANA for me was the first real production move over to column-based storage. Runs with SQL.
It's not cheap, it's not open source, and I don't know how hard it is to get a university/student license.

Postgres has extension options for column stores.

SQL Server has "columnstore" vs. "rowstore" and a "columnstore index" with compression (and probably hashes).
Perhaps Kevin can rush in, as he worked in / taught BI on SQL Server.

Oracle I only wish a pox on, but it probably has column stores as well...

I am certain you can get more than halfway up to speed with these tools
- my guess is it's the Python / R integration that's luring the data scientists..
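To illustrate that lure: duckdb's Python module can run SQL straight over a pandas DataFrame sitting in a local variable, no load step. A small sketch, assuming the duckdb and pandas packages (the data is made up):

import duckdb
import pandas as pd

df = pd.DataFrame({"grp": ["a", "b", "a"], "val": [1.0, 2.0, 3.0]})

# DuckDB's replacement scan lets the SQL reference the DataFrame
# by its Python variable name -- no server, no import step.
out = duckdb.sql("SELECT grp, sum(val) AS total FROM df GROUP BY grp").df()
print(out)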

>2- they feel that an equivalent of "sqlite" could bring sql flexibility into this loop (big data, data science, genomics, scientific gis...) when it comes to handling gigabytes in a comfortable way.

Yupp, the "local engine" in process feel probably does a lot

>In short, they thought of some sort of 2023 Rushmore engine :-)
>
>Duckdb cannot provide every bit one may need in terms of data mangling. I discovered that record-based operations are certainly not as fast on a columnar db engine as what you can reach on more classic page-and-record or plain btree engines à la dbase....

See my response on OLTP vs. OLAP and "Ludicrous Speed" for GUI-based "human" changes ;-)

>You can also feed the beast fast from "big data" sources using impressive "zero-copy" technologies. "Zero-copy"? A tech I discovered during my short introductory voyage into this new IT eco-system. This is the way duckdb uses it to tackle apache arrow datasets (parquet files and others):
>
>https://duckdb.org/2021/12/03/duck-arrow.html

Yeah, read-only access via streams can give speedups, as Java streams taught me.
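The arrow article above boils down to this kind of round trip: duckdb scans the Arrow buffers in place rather than copying them in, and can hand the result back as Arrow too. A small sketch, assuming the duckdb and pyarrow packages (the data is made up):

import duckdb
import pyarrow as pa

# An Arrow table, e.g. as freshly read from a parquet file
tbl = pa.table({"id": [1, 2, 3], "v": [0.5, 1.5, 2.5]})

# DuckDB finds 'tbl' by variable name and reads the Arrow buffers
# directly ("zero-copy"); .arrow() returns the result as Arrow again.
res = duckdb.sql("SELECT sum(v) AS total FROM tbl").arrow()
print(res)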
>
>You can also push the output straight into today's fast array technologies such as python numpy, much in the way I currently push vfp cursors into vfp cursors.
>Numpy, python's array technology, can certainly beat our beloved vfp array performance as well, by a large, large margin.

Now that is because vfp arrays are just a block of memory holding traditional vfp memvars: every access needs an offset calculated from the dimensionality, and a string element is a pointer somewhere random into memory.
Compare that to a Pascal/Modula ARRAY OF CHAR[20] or a C pointer into fixed-size strings...

Even fixed-size values like numbers have to check the dynamic type flag on every access.
NumPy is just a block of memory read as doubles and accessed by pointer arithmetic. No contest...
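You can actually see that flat block of memory from Python. A small sketch showing the single dtype, the byte strides and the contiguity of a float64 array:

import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)

print(a.dtype)                   # float64 -- one fixed type for the whole block
print(a.strides)                 # (24, 8): 8 bytes per column step, 24 per row
print(a.flags["C_CONTIGUOUS"])   # True: one flat block of memory

# Element (i, j) lives at base_address + i*24 + j*8 -- pure pointer
# arithmetic, no per-element type flag to check.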

>Making my workstation cpu as hot as gaming stuff... That's something I was never able to do in VFP, where all my stuff runs single-threaded. As a pythonista, I can use the "numba" llvm engine to run fully parallelized operations on those arrays using all workstation cores, with zero C code and zero understanding of openMP. I'm no coding genius, just a vfp-er. But I try to treat myself :-)

Yeah, vfp was born to single-CPU, single-thread parents ;-)
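For the curious, the numba part looks roughly like this. A minimal sketch, assuming the numba package; prange spreads the loop iterations across all cores and numba compiles the function via LLVM:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def scaled_sum(a):
    # Compiled to machine code; the prange loop runs on all cores
    # and numba handles the reduction on 'total' -- no C, no OpenMP.
    total = 0.0
    for i in prange(a.size):
        total += a[i] * 2.0
    return total

x = np.random.rand(10_000_000)
print(scaled_sum(x))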

regards
thomas