A data loader that produces a parquet file

Hello,
First, Observable Framework is really nice: so many things are included from the start, and it is aesthetically lovely. Thanks!

I have a question about data loaders. I want to use DuckDBClient on a parquet file e.g.

DuckDBClient.of({afm: FileAttachment("./data/poem_data.parquet")})

but the parquet file is created by poem_data.parquet.py.

The only way I can get Observable Framework to read this is to run poem_data.parquet.py beforehand, producing the parquet file on disk, which is then read successfully by the FileAttachment above.

How would I get Observable Framework to trigger the file creation?

thanks again!
Saptarshi

Hi, poem_data.parquet.py must not create a file on disk, but instead must send the parquet payload to stdout. (It can create a temporary file on disk then “cat” it, but what Framework reads is the stdout.)

I’m afraid I don’t have an example with Python, but see Source code | FPDN, which does this in bash with duckdb.

Also, is poem_data.parquet.py in the data directory (in the same place that you expect the poem_data.parquet file to exist, but with the additional .py extension)?

Also, if you generated a poem_data.parquet file manually, note that this will take precedence over the adjacent data loader poem_data.parquet.py. You should delete or move the static file you created if you want the data loader to run on-demand.

Yes, all correct. I was thinking that I would have to write the parquet file to stdout (in binary), and you confirmed it.
thanks again


I came across this thread trying to solve the same problem. Here is a snippet that worked for me. It assumes the data is already in a pandas DataFrame named df:

import sys
import tempfile

# Some query or other code that produces df as a pandas DataFrame.
with tempfile.TemporaryFile() as f:
    df.to_parquet(f)
    f.seek(0)
    sys.stdout.buffer.write(f.read())