Performance regression when reading scoring files during matching #20

nebfield · 2024-05-29T13:16:39Z

pygscatalog/pgscatalog.match/src/pgscatalog/match/lib/_arrow.py

Lines 6 to 21 in e88b41f

    def loose(record_batches, schema, tmpdir=None):
        """Loose an arrow :) Stream compressed text files into temporary arrow files

        polars + arrow = very fast reading and processing
        """
        if tmpdir is None:
            tmpdir = tempfile.mkdtemp()

        arrowpath = tempfile.NamedTemporaryFile(dir=tmpdir, delete=False)

        with pa.OSFile(arrowpath.name, "wb") as sink:
            with pa.RecordBatchFileWriter(sink=sink, schema=schema) as writer:
                for batch in record_batches:
                    writer.write(batch)

        return arrowpath

We used to parse CSV files with polars and save IPC files, which was super fast 🚀

Streaming pyarrow batches is terribly slow in comparison (when working on UK Biobank). I think I was worried about memory usage when I wrote this.

Going back to polars might also let us drop the pyarrow dependency in pgscatalog.core.


nebfield added the bug Something isn't working label May 29, 2024

nebfield linked a pull request Jun 2, 2024 that will close this issue

Fix pgscatalog.match performance regression #22

Merged

smlmbrt closed this as completed Jun 13, 2024
