How to parse 4 GB json or XML file
Date: 11/11/2022 04:54:54
Forum: Visual FoxPro
Category: Coding, syntax & commands
Thread ID: 01685297
Message ID: 01685302
In addition to Martina's hints:

Unless your task needs to run as fast as possible,
I'd leverage the existing structure of the JSON or XML file,
provided I can depend on it being well formed,
as your code seems to require that anyway.


For the VFPA solution at least:
      rida = Alltrim(FGETS(fp))
        do case
           case at('"ariregistri_kood":', Rida)=1
.....
so you can stay within your accustomed VFP way of thinking.
Then add a check that the value on the line is not a complex string, which in some cases would be added only partially.

As this probably does NOT eliminate all possible edge cases,
I'd probably add an intermediate sanity step to make the coded solution more robust:
filter out all properties you do not want to persist, but keep the well-formed structure.

Create a new text file, writing out any line that carries one of your 3 tags or
defines a structure element such as object start/end, array start/end...
In other words, use an exclusive filter: drop every single-property line that is NOT one of your targets
and has no structure elements.
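
A minimal sketch of such a filter, written in Python only because it reads files above 2 GB natively. The three property names come from the question; the function names, the file names in the usage note and the UTF-8 encoding are assumptions for illustration.

```python
import re

# The three target properties from the question, including the colon.
TARGETS = ('"ariregistri_kood":', '"emtak_kood":', '"emtak_tekstina":')

# Assumption: a trimmed line is "structural" when it ends with an opening
# bracket, or consists only of a closing bracket plus an optional comma.
STRUCTURAL = re.compile(r'^(.*[\[{]|[\]}],?)$')


def keep_line(line):
    """True for target property lines and for structural lines."""
    s = line.strip()
    # startswith anchors the key at the line start, so a key name that only
    # occurs inside some string value does not match.
    return s.startswith(TARGETS) or bool(STRUCTURAL.match(s))


def filter_file(src_path, dst_path):
    """Stream src line by line, write kept lines to dst, return their count."""
    kept = 0
    with open(src_path, encoding='utf-8') as src, \
            open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            if keep_line(line):
                dst.write(line)
                kept += 1
    return kept
```

Called as `filter_file('4gbfile.json', 'filtered.json')`, this keeps the bracket nesting intact. It does not repair commas, so a kept line may end with a comma directly before a closing bracket, and a strict JSON parser may reject the result unless that is handled as well.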

From looking at the example data, this should compress the import file to well under 2 GB,
with enough room for future growth of the source file.

You can either use VFP with external large-file functions (WinAPI) or methods (FSO),
or any other tool that supports such file access natively.
Benefit: in case of future trouble, the problematic location is easier to eyeball/identify
if you then use a parser that checks the well-formed structure of the imported file.
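
As a rough illustration of such a structure check, the Python sketch below only verifies that brackets nest correctly and reports the line number of the first problem. It is not a full JSON parser, and the function name is made up for this example.

```python
def check_structure(lines):
    """Return None if brackets nest correctly, else (line_number, message)."""
    pairs = {'}': '{', ']': '['}
    stack = []  # open brackets as (bracket, line_number)
    for n, line in enumerate(lines, start=1):
        # JSON strings cannot span lines, so string state restarts per line.
        in_string = False
        escaped = False
        for ch in line:
            if in_string:
                if escaped:
                    escaped = False
                elif ch == '\\':
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch in '{[':
                stack.append((ch, n))
            elif ch in '}]':
                if not stack or stack[-1][0] != pairs[ch]:
                    return (n, f'unexpected {ch!r}')
                stack.pop()
    if stack:
        ch, n = stack[-1]
        return (n, f'unclosed {ch!r}')
    return None
```

Because it accepts any iterable of lines, an open file object can be passed directly, so the filtered file is checked without loading it into memory.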

Or at least add a rudimentary check that each line has 0..1 properties.
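
One way to sketch that rudimentary check in Python is to count the property names on a line, treating every quoted string that is directly followed by a colon as a name. The function name is hypothetical, and the scan assumes JSON-style double-quoted strings with backslash escapes.

```python
def count_properties(line):
    """Count quoted strings on the line that are directly followed by ':'."""
    count = 0
    i, n = 0, len(line)
    while i < n:
        if line[i] == '"':
            # Skip to the closing quote, stepping over backslash escapes.
            i += 1
            while i < n and line[i] != '"':
                i += 2 if line[i] == '\\' else 1
            i += 1  # step past the closing quote
            # Allow optional whitespace between the name and its colon.
            j = i
            while j < n and line[j] in ' \t':
                j += 1
            if j < n and line[j] == ':':
                count += 1
        else:
            i += 1
    return count
```

Any line where `count_properties(line) > 1` breaks the one-property-per-line assumption and is worth inspecting before the simple line-based import is trusted.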

my 0.22 € from parsing files created by others
thomas

> There are two ways:
>- VFPA 10.1 can work with files greater than 2 GiB
>- use API: CreateFile(), ReadFile(), CloseHandle()
>
>MartinaJ
>
>>A 3 GB JSON file contains an array of objects in the form
>>
>>
>>    [
>>        {
>>            "ariregistri_kood":16372442,
>>            "nimi":"000 Holdings OÜ",
>>            "yldandmed":{
>>                "ettevotteregistri_nr":null,
>>                "esmaregistreerimise_kpv":"23.11.2021",
>>                "kustutamise_kpv":null,
>>                "staatus":"R",
>>                "staatus_tekstina":"Registrisse kantud",
>>                "piirkond":5,
>>                "piirkond_tekstina":"Tartu",
>>                "piirkond_tekstina_pikk":"Tartu Maakohtu registriosakond",
>>                "evks_registreeritud":null,
>>                "evks_registreeritud_kande_kpv":null,
>>                "oiguslik_vorm":"OÜ",
>>                "oiguslik_vorm_nr":5,
>>                "oiguslik_vorm_tekstina":"Osaühing",
>>                "oigusliku_vormi_alaliik":null,
>>                "oigusliku_vormi_alaliik_tekstina":"",
>>                "asutatud_sissemakset_tegemata":true,
>>                "loobunud_vorminouetest":null,
>>                "on_raamatupidamiskohustuslane":false,
>>                "tegutseb":null,
>>                "tegutseb_tekstina":"Jah",
>>                "staatused":[
>>                    {
>>                        "kaardi_piirkond":5,
>>                        "kaardi_nr":1,
>>                        "kaardi_tyyp":"R",
>>                        "kande_nr":1,
>>                        "staatus":"R",
>>                        "staatus_tekstina":"Registrisse kantud",
>>                        "algus_kpv":"23.11.2021"
>>                    }
>>                ],
>>                "arinimed":[
>>                    {
>>                        "kirje_id":9864760,
>>                        "kaardi_piirkond":5,
>>                        "kaardi_nr":1,
>>                        "kaardi_tyyp":"R",
>>                        "kande_nr":1,
>>                        "sisu":"000 Holdings OÜ",
>>                        "algus_kpv":"23.11.2021",
>>                        "lopp_kpv":null
>>                    }
>>                ],
>>                "juhatuse_asukoha_aadressid":[
>>                    {
>>                        "kirje_id":9864751,
>>                        "kaardi_piirkond":5,
>>                        "kaardi_nr":1,
>>                        "kaardi_tyyp":"R",
>>                        "kande_nr":1,
>>                        "riik":"AUS",
>>                        "riik_tekstina":"Austraalia",
>>                        "tanav_maja_korter":"313A/133 GOULBURN STREET, Surry Hills, NSW",
>>                        "aadress_ads__tyyp":"2",
>>                        "postiindeks":"2010",
>>                        "algus_kpv":"23.11.2021",
>>                        "lopp_kpv":null
>>                    }
>>                ],
>>                "kontaktisiku_aadressid":[
>>                    {
>>                        "kirje_id":9864757,
>>                        "kaardi_piirkond":5,
>>                        "kaardi_nr":1,
>>                        "kaardi_tyyp":"R",
>>                        "kande_nr":1,
>>                        "riik":"EST",
>>                        "riik_tekstina":"Eesti",
>>                        "ehak":"0298",
>>                        "ehak_nimetus":"Kesklinna linnaosa, Tallinn, Harju maakond",
>>                        "tanav_maja_korter":"Harju maakond, Tallinn, Kesklinna linnaosa, Ahtri tn 12",
>>                        "aadress_ads__ads_oid":"ME00656588",
>>                        "aadress_ads__adr_id":2113048,
>>                        "aadress_ads__ads_normaliseeritud_taisaadress":"Harju maakond, Tallinn, Kesklinna linnaosa, Ahtri tn 12",
>>                        "aadress_ads__ads_normaliseeritud_taisaadress_tapsustus":null,
>>                        "aadress_ads__koodaadress":"377840298000005X600001EOT00000000",
>>                        "aadress_ads__adob_id":null,
>>                        "aadress_ads__tyyp":null,
>>                        "postiindeks":"10151",
>>                        "algus_kpv":"23.11.2021",
>>                        "lopp_kpv":null
>>                    }
>>                ],
>>                "oiguslikud_vormid":[
>>                    {
>>                        "kirje_id":9864754,
>>                        "kaardi_piirkond":5,
>>                        "kaardi_nr":1,
>>                        "kaardi_tyyp":"R",
>>                        "kande_nr":1,
>>                        "sisu":"OÜ",
>>                        "sisu_nr":5,
>>                        "sisu_tekstina":"Osaühing",
>>                        "algus_kpv":"23.11.2021",
>>                        "lopp_kpv":null
>>                    }
>>                ],
>>                "kapitalid":[
>>                    {
>>                        "kirje_id":9864758,
>>                        "kaardi_piirkond":5,
>>                        "kaardi_nr":1,
>>                        "kaardi_tyyp":"R",
>>                        "kande_nr":1,
>>                        "kapitali_suurus":"2500.00",
>>                        "kapitali_valuuta":"EUR",
>>                        "kapitali_valuuta_tekstina":"euro",
>>                        "algus_kpv":"23.11.2021",
>>                        "lopp_kpv":null
>>                    }
>>                ],
>>                "majandusaastad":[
>>                    {
>>                        "kirje_id":9864753,
>>                        "kaardi_piirkond":5,
>>                        "kaardi_nr":1,
>>                        "kaardi_tyyp":"R",
>>                        "kande_nr":1,
>>                        "maj_aasta_algus":"01.01",
>>                        "maj_aasta_lopp":"31.12",
>>                        "algus_kpv":"23.11.2021",
>>                        "lopp_kpv":null
>>                    }
>>                ],
>>                "pohikirjad":[
>>                    {
>>                        "kirje_id":9864752,
>>                        "kaardi_piirkond":5,
>>                        "kaardi_nr":1,
>>                        "kaardi_tyyp":"R",
>>                        "kande_nr":1,
>>                        "kinnitamise_kpv":"19.11.2021",
>>                        "muutmise_kpv":null,
>>                        "selgitus":null,
>>                        "algus_kpv":"23.11.2021",
>>                        "lopp_kpv":null,
>>                        "sisaldab_erioigusi":false
>>                    }
>>                ],
>>                "sidevahendid":[
>>                    {
>>                        "kirje_id":9864759,
>>                        "liik":"EMAIL",
>>                        "liik_tekstina":"Elektronposti aadress",
>>                        "sisu":"me@karlssonandreas.com",
>>                        "lopp_kpv":null,
>>                        "kaardi_piirkond":5,
>>                        "kaardi_nr":1,
>>                        "kaardi_tyyp":"R",
>>                        "kande_nr":1
>>                    }
>>                ],
>>                "teatatud_tegevusalad":[
>>                    {
>>                        "kirje_id":9000591556,
>>                        "emtak_kood":"73111",
>>                        "emtak_tekstina":"Reklaamiagentuurid",
>>                        "emtak_versioon":2,
>>                        "emtak_versioon_tekstina":null,
>>                        "nace_kood":"73.11",
>>                        "on_pohitegevusala":true,
>>                        "algus_kpv":"22.11.2021",
>>                        "lopp_kpv":null
>>                    }
>>                ],
>>                "esitab_kasusaajad":true
>>            }
>>        },
>>        {
>>            "ariregistri_kood":12754230,
>>            "nimi":"001 group OÜ",
>>            "yldandmed":{
>>                "ettevotteregistri_nr":null,
>>                "esmaregistreerimise_kpv":"17.11.2014",
>>                "kustutamise_kpv":null,
>>                "staatus":"R",
>>                "staatus_tekstina":"Registrisse kantud",
>>    ...
>>
>>Every property is on a separate line. A cursor containing 3 columns:
>>
>>    ariregistri_kood I,
>>    emtak_kood C(5),
>>    emtak_tekstina M
>>
>>should be created from this file.
>>
>>I tried to parse it using
>>
>>
>>    create cursor tala ( ariregistri_kood I, emtak_kood C(5), emtak_tekstina M )
>>    fp  = fopen( '4gbfile.json' )
>>    if fp<0
>>      messagebox('error')
>>      return
>>    endif
>>
>>    DO WHILE !FEOF(fp)
>>      rida = FGETS(fp)
>>      do case
>>        case  '"ariregistri_kood":' $ Rida
>>          insert into tala (ariregistri_kood) values ( val( strextract( rida, '"ariregistri_kood":' ) ) )
>>
>>        * "emtak_kood":"73111",
>>        case  '"emtak_kood":' $ Rida
>>          repl emtak_kood with strextract( rida, '"emtak_kood":"', '"' )
>>
>>        * "emtak_tekstina":"Reklaamiagentuurid",
>>        case  '"emtak_tekstina":' $ Rida
>>          repl emtak_tekstina with strextract( rida, '"emtak_tekstina":"', '"' )
>>      endcase
>>    enddo
>>
>>
>>But I got an empty cursor, since feof() immediately returns true for the big file.
>>I also tried the nfJson parser from https://github.com/VFPX/nfJson
>>but got an out-of-memory error.
>>
>>How can I parse this file to get these 3 properties?
>>
>>The same data is also available as an XML file. If parsing XML is more reasonable, the XML can be parsed instead.
>>
>>Also posted at https://stackoverflow.com/questions/74396070/how-to-parse-3-gb-json-file