Replacing rushmore in 2020? The duck option:-)

Plateforme Level Extreme

Abonnement

Profil corporatif

Produits & Services

Support

Légal

English

Replacing rushmore in 2020? The duck option:-)

Message

20/02/2020 17:45:23

John Fabiani
J. F. Computer Solutions
Californie, États-Unis

19/02/2020 10:46:08

Daniel Gadenne
France

Information générale

Forum:

Visual FoxPro

Catégorie:

Autre

Titre:

Re: Replacing rushmore in 2020? The duck option:-)

Divers

Thread ID:

01673095

Message ID:

01673140

Vues:

Hi John,

You will find my first quick test below. Bear with my python which is rusty since I stopped coding python in 2002 and jump back on board very recently with this view to test duckdb. I just loaded a couple of csv-s into an in-memory db, much faster than disk-based on my current tests machine (no fast ssd for sure).

I bumped on a couple of bugs in duckdb at this stage. The call for an in-memory database is supposed to be aligned with sqlite with the following instruction for building in-memory: "import duckdbé and then "con = duckdb.connect(':memory:')". But the in-memory wording failed me. But duckdb.connect('') worked with the same result ie in memory work with some file caching when required. I am currently looking for an equivalent to CREATE CURSOR i-e in memory VFP and and then SQL based data-mangling. All this with warp speed:-)

This is a very green alpha project of course. Still a few other quirks with the sql parser (that they took from postgresql with its specific grammar) as well. But I felt comforted by my first tests and, more important, the structure and no-nonsense target of the project itself, sheer speed on large-large datasets with sql.

Daniel

The rough testing code on python is below. You will need an install of duckb and I had to install a C++. I suppose that relates the fact the project in alpha since they are looking for a no-frill, zero-dependancy install. It looks like numpy may be required as well. Not sure though:

import time,duckdb
t0 = time.time()
con = duckdb.connect('')
c = con.cursor()
c.execute("CREATE TABLE big_data(person_id VARCHAR, base DECIMAL(10,2),\
          grade varchar,gender varchar,age numeric(6,2),\
          job_family varchar,M1 numeric,M2 numeric,M3 numeric,\
          grade1 varchar,gender1 varchar,age1 smallint,\
          function varchar,M4 numeric,M5 numeric)")
print('build in memory structure {0:.4f}'.format(time.time() - t0))

started = time.time()
data_loads = 1
for x in range(data_loads):
        c.execute("COPY big_data from 'c:\\test\data100k.txt'")
print('load 100k lines 10 MB csv - {}x: {} sec.'.format(data_loads,time.time() - started))

started = time.time()
c.execute("CREATE INDEX person_id ON big_data (person_id)")
print('build btree {0:.4f}'.format(time.time() - started))

started = time.time()
for x in range(10):
        output = c.execute("select count(*) from big_data").fetchall()
print('select count(*) 10x avg:{0:.4f} sec.'.format((time.time() - started)/10))

started = time.time()
for x in range(10):
        output = c.execute("select count(*) from big_data where job_family='Manufacturing'").fetchall()
print('select count(*) arbitrary textfield selection 10x avg:{0:.4f} sec.'.format((time.time() - started)/10))

started = time.time()
for x in range(10):
	output = c.execute("select * from big_data").fetchall()
print('select * from big_data in basic python 10x avg:{0:.4f} sec.'.format((time.time() - started)/10))

started= time.time()
for x in range(10):
	output = c.execute("select * from big_data").fetchnumpy()
print('select * from big_data in numpy array 10x avg:{0:.4f} sec.'.format((time.time() - started)/10))

started = time.time()
for x in range(10):
        output = c.execute("select * from big_data").fetchdf()
print('select * from big_data into pandas frame 10x avg:{0:.4f} sec.'.format((time.time() - started)/10))

started = time.time()
for x in range(10):
	output = c.execute("select base from big_data").fetchnumpy()
print('select single numeric from big_data in numpy array 10x avg:{0:.4f} sec.'.format((time.time() - started)/10))

started = time.time()
for x in range(10):
	output = c.execute("select base from big_data order by 1").fetchnumpy()
print('select single numeric from big_data ordered in numpy array 10x avg:{0:.4f} sec.'.format((time.time() - started)/10))

started = time.time()
for x in range(10):
	output = c.execute("select function from big_data").fetchnumpy()
print('select function from big_data in numpy array 10x avg:{0:.4f} sec.'.format((time.time() - started)/10))

started = time.time()
for x in range(10):
	output = c.execute("select function,base from big_data").fetchnumpy()
print('select text and numeric from big_data in numpy array 10x avg:{0:.4f} sec.'.format((time.time() - started)/10))

started = time.time()
for x in range(10):
	output = c.execute("select M1,M2 from big_data").fetchnumpy()
print('select two numerical fields from big_data in numpy array 10x avg:{0:.4f} sec.'.format((time.time() - started)/10))

started = time.time()
for x in range(10):
        output = c.execute("select * from big_data order by 2 limit 10000").fetchnumpy()
print('select top 10,000 * from big_data order by numeric no-index in numpy array avg:{0:.4f} sec.'.format((time.time() - started)/10))

started = time.time()
for x in range(10):
        output = c.execute("select * from big_data order by 2 limit 10000").fetchdf()
print('select top 10,000 * from big_data order by numeric no-index in pandas frame avg:{0:.4f} sec.'.format((time.time() - started)/10))

started = time.time()
output = c.execute("select job_family,count(*) as cnt from big_data group by job_family order by 2 desc").fetchnumpy()
print('select text_field,count(*) as cnt from big_data group by text_field order by 2 desc in numpy array {0:.4f} sec.'.format(time.time() - started))

started = time.time()
output = c.execute("select job_family,count(*) as cnt from big_data group by job_family order by 2 desc").fetchdf()
print('select text_field,count(*) as cnt from big_data group by text_field order by 2 desc in pandas frame {0:.4f} sec.'.format(time.time() - started))

started = time.time()
output = c.execute("select base,person_id from big_data where job_family='Manufacturing' order by 1").fetchnumpy()
print('select numeric_field from big_data arbitrary textfield selection {0:.4f} sec.'.format(time.time() - started))

started = time.time()
idnump = c.execute("select person_id from big_data limit 100000").fetchnumpy()
for iRecordId in range(100000):
	iCurrentId = idnump["person_id"][iRecordId]
	output = c.execute("select person_id,function from big_data where person_id='{}'".format(iCurrentId)).fetchall()
print('100-K short SQL index-filtered calls avg: time:{0:.4f} sec.'.format((time.time() - started)))


What platform are you using, what version of python 3????

Johnf

Répondre

Fil

Voir

Click here to load this message in the networking platform