Hi John,
Thanks for following up on this non-VFP thread. Yes, I am transitioning to a new data-mangling environment after so many years, having started with Fox in 1985, or 1984 if I remember correctly. That's long ago. Worth taking the time to discuss the matter a bit!
>Right... since with 600 million rows, if there's (say) 200 bytes per row then presumably (unless my math is way off!) you'll need around 110GB Memory
I did check the benchmarking figures provided by the DuckDB team. I tend to trust them, based on what they promised and delivered over their first two years of existence. I started following them before they became a tech star on GitHub and elsewhere. You can trust them both as people and as an organization (not really funded, by the way!).
Back to the tests: I just realized that their baseline machine is a formidable Apple machine with brilliant memory access. You certainly understand why that matters! Both the amount of memory at hand and the speed of access count here.
You may be right about this 0.6 sec time. There may be some error there, or some sort of "lazy loading" at play. But on the whole, run a test and I expect you will be blown away by the time it takes to load CSV (or better, Parquet) files from scratch and SQL-mangle them!
One thing is for sure: "big data" tech now accommodates "zero-copy integration". Here it happens between DuckDB and Apache Arrow, and it is an impressive technology that is starting to power AI, genomic research, data science and more. Interesting times, as they say :-)
An already ancient introduction to "zero copy":
https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a
The "zero-copy" integration between this modern-day fox (DuckDB) and Arrow (I'll be using it in the coming years):
https://duckdb.org/2021/12/03/duck-arrow.html
NumPy, Numba, DuckDB, Arrow... I have learnt quite a few things dipping my toe into the recent big-data water à la Python :-)
Daniel