DuckDB: load multiple remote parquet files dinamically

Hi! Pretty new to Observable and JavaScript!
I’m trying to load multiple parquet files hosted on Github dinamically in a single table in a DuckDB instance. I’m following DuckDB’s documentation, using list parameter:

-- use list parameter to read 3 parquet files and treat them as a single table
SELECT * FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);
dataFiles = array of urls
dataList = JSON.stringify(dataFiles)
db = DuckDBClient.of()
db.query(`CREATE TABLE data AS SELECT * FROM read_parquet(${dataList});`)

I’m getting the following error: “Error: Binder Error: TABLE FUNCTION parameter cannot contain column names”

If I remove the read_parquet command and keep only a single url it works. using read_parquet even with a single url gets me the same error.

How should I proceed to load multiple files in a single table?

Have a look at this example for a wrapper that you can import into your notebook:

However, this example might be closer to what you’re looking for:

1 Like

Thank you! The second link solved it. And I had to change the double quotes to single quotes inside the SQL statement.

Hi @jazzyjones, I am having a similar issue with loading multiple (local) files. Would you be willing to share your solution? The responses here deal with accessing remote files, but not multiple files. Thanks, Ben

@bletcher Are you talking about accessing multiple local files in a notebook? You can create a file input that lets you select multiple local files by calling Inputs.file({multiple: true}).

Hi @mootari, thanks for the reply. In Observable Framework, I am trying to read several local parquet files into duckDB. I have parquet files (created in R) for each site and users will select sites from a dropdown. I would like to include the selected sites in the database, but I am not sure the best way to proceed. (I realize that reading all the data into duckDB could work, but the database will have many sites and be quite large - 100’s GB).

The following works, but seem less flexible than the approach suggested by @jazzyjones above and listed in the duckDB guide.

const db2 = DuckDBClient.of({ 
  file1: FileAttachment('./data/parquet/bySiteID/siteID=PA_01FL/part-0.parquet'),
  file2: FileAttachment('./data/parquet/bySiteID/siteID=PA_02FL/part-0.parquet')
});
const data2Combined = db2.sql`
  SELECT * FROM file1
  UNION ALL
  SELECT * FROM file2
`;

So, I imagine something more like this:

const dataFiles = ['./data/parquet/bySiteID/siteID=PA_01FL/part-0.parquet', './data/parquet/bySiteID/siteID=PA_02FL/part-0.parquet'];
const dataList = dataFiles.map(d => `'${d}'`).join(', ');

const dbDuck = await DuckDBClient.of();

await dbDuck.sql`CREATE TABLE dataIn
AS SELECT *
FROM read_parquet(${dataFiles})`;
const dbDuckQuery = dbDuck.query(`SELECT * FROM dataIn`)

But read_parquet() is not finding the parquet files (even when I hard code them) and in Framework it seems to make sense to try to use FileAttachment().

I am curious how others would proceed - I am fairly new to duckDB. Thanks! Ben

@bletcher The first step would be to declare an input that lets users select files. There’s an example for that here:

https://observablehq.observablehq.cloud/framework-example-input-select-file/

If you want to allow multiple files to be selected, you can set the multiple option to true. (And like @mootari said, if you want local files rather than file attachments, you can use Inputs.file.)

The next part would be wiring this up to a DuckDBClient instance. I don’t think you can use db.sql for this because DuckDB-Wasm doesn’t support placeholders for arguments to the read_parquet function. But it’s easy enough to workaround that limitation by constructing the appropriate SQL yourself and calling db.query instead:

const db = await DuckDBClient.of();

await db.query(
  `CREATE TABLE rows
AS SELECT *
FROM read_csv(${JSON.stringify(files.map((f) => f.href))}, AUTO_DETECT=TRUE)`
);

const sql = db.sql.bind(db);

Here I’m reading CSV files, but I imagine it would work the same way with read_parquet. Note that I also redeclared sql so that I can use Framework’s built-in SQL code blocks.

Now I can write a SQL code block:

SELECT * FROM rows

And get a nice table:

2 Likes

Thanks so much @mbostock and @mootari! And thanks for everything you do for the community.

1 Like