DuckDB: load multiple remote parquet files dinamically

jazzyjones · March 8, 2023, 4:32pm

Hi! Pretty new to Observable and JavaScript!
I’m trying to load multiple parquet files hosted on Github dinamically in a single table in a DuckDB instance. I’m following DuckDB’s documentation, using list parameter:

-- use list parameter to read 3 parquet files and treat them as a single table
SELECT * FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);

dataFiles = array of urls
dataList = JSON.stringify(dataFiles)
db = DuckDBClient.of()
db.query(`CREATE TABLE data AS SELECT * FROM read_parquet(${dataList});`)

I’m getting the following error: “Error: Binder Error: TABLE FUNCTION parameter cannot contain column names”

If I remove the read_parquet command and keep only a single url it works. using read_parquet even with a single url gets me the same error.

How should I proceed to load multiple files in a single table?

mootari · March 8, 2023, 6:21pm

Have a look at this example for a wrapper that you can import into your notebook:

However, this example might be closer to what you’re looking for:

jazzyjones · March 8, 2023, 6:49pm

Thank you! The second link solved it. And I had to change the double quotes to single quotes inside the SQL statement.

bletcher · May 30, 2024, 12:52am

Hi @jazzyjones, I am having a similar issue with loading multiple (local) files. Would you be willing to share your solution? The responses here deal with accessing remote files, but not multiple files. Thanks, Ben

mootari · May 31, 2024, 8:38am

@bletcher Are you talking about accessing multiple local files in a notebook? You can create a file input that lets you select multiple local files by calling Inputs.file({multiple: true}).

bletcher · May 31, 2024, 3:52pm

Hi @mootari, thanks for the reply. In Observable Framework, I am trying to read several local parquet files into duckDB. I have parquet files (created in R) for each site and users will select sites from a dropdown. I would like to include the selected sites in the database, but I am not sure the best way to proceed. (I realize that reading all the data into duckDB could work, but the database will have many sites and be quite large - 100’s GB).

The following works, but seem less flexible than the approach suggested by @jazzyjones above and listed in the duckDB guide.

const db2 = DuckDBClient.of({ 
  file1: FileAttachment('./data/parquet/bySiteID/siteID=PA_01FL/part-0.parquet'),
  file2: FileAttachment('./data/parquet/bySiteID/siteID=PA_02FL/part-0.parquet')
});

const data2Combined = db2.sql`
  SELECT * FROM file1
  UNION ALL
  SELECT * FROM file2
`;

So, I imagine something more like this:

const dataFiles = ['./data/parquet/bySiteID/siteID=PA_01FL/part-0.parquet', './data/parquet/bySiteID/siteID=PA_02FL/part-0.parquet'];
const dataList = dataFiles.map(d => `'${d}'`).join(', ');

const dbDuck = await DuckDBClient.of();

await dbDuck.sql`CREATE TABLE dataIn
AS SELECT *
FROM read_parquet(${dataFiles})`;

const dbDuckQuery = dbDuck.query(`SELECT * FROM dataIn`)

But read_parquet() is not finding the parquet files (even when I hard code them) and in Framework it seems to make sense to try to use FileAttachment().

I am curious how others would proceed - I am fairly new to duckDB. Thanks! Ben

mbostock · June 1, 2024, 12:46am

@bletcher The first step would be to declare an input that lets users select files. There’s an example for that here:

https://observablehq.observablehq.cloud/framework-example-input-select-file/

If you want to allow multiple files to be selected, you can set the multiple option to true. (And like @mootari said, if you want local files rather than file attachments, you can use Inputs.file.)

The next part would be wiring this up to a DuckDBClient instance. I don’t think you can use db.sql for this because DuckDB-Wasm doesn’t support placeholders for arguments to the read_parquet function. But it’s easy enough to workaround that limitation by constructing the appropriate SQL yourself and calling db.query instead:

const db = await DuckDBClient.of();

await db.query(
  `CREATE TABLE rows
AS SELECT *
FROM read_csv(${JSON.stringify(files.map((f) => f.href))}, AUTO_DETECT=TRUE)`
);

const sql = db.sql.bind(db);

Here I’m reading CSV files, but I imagine it would work the same way with read_parquet. Note that I also redeclared sql so that I can use Framework’s built-in SQL code blocks.

Now I can write a SQL code block:

SELECT * FROM rows

And get a nice table:

bletcher · June 5, 2024, 8:38pm

Thanks so much @mbostock and @mootari! And thanks for everything you do for the community.

Topic		Replies	Views
Can I create a table from parquet in duckdb with "DuckDBClient.of"? Community Help	8	661	November 29, 2022
Populating a DuckDb database cell with remote binary file (parquet) Community Help	2	282	January 10, 2023
DuckDB is not returning rows for Parquet files Community Help	15	995	July 31, 2024
how to use duckdb and DuckDBClient Community Help	9	346	July 26, 2024
Loading a duckdb database Community Help	7	535	March 23, 2024

DuckDB: load multiple remote parquet files dinamically

Related topics