Hi! Pretty new to Observable and JavaScript!
I’m trying to load multiple parquet files hosted on Github dinamically in a single table in a DuckDB instance. I’m following DuckDB’s documentation, using list parameter:
-- use list parameter to read 3 parquet files and treat them as a single table
SELECT * FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);
dataFiles = array of urls
dataList = JSON.stringify(dataFiles)
db = DuckDBClient.of()
db.query(`CREATE TABLE data AS SELECT * FROM read_parquet(${dataList});`)
I’m getting the following error: “Error: Binder Error: TABLE FUNCTION parameter cannot contain column names”
If I remove the read_parquet command and keep only a single url it works. using read_parquet even with a single url gets me the same error.
How should I proceed to load multiple files in a single table?
Have a look at this example for a wrapper that you can import into your notebook:
However, this example might be closer to what you’re looking for:
1 Like
Thank you! The second link solved it. And I had to change the double quotes to single quotes inside the SQL statement.
Hi @jazzyjones, I am having a similar issue with loading multiple (local) files. Would you be willing to share your solution? The responses here deal with accessing remote files, but not multiple files. Thanks, Ben
@bletcher Are you talking about accessing multiple local files in a notebook? You can create a file input that lets you select multiple local files by calling Inputs.file({multiple: true})
.
Hi @mootari, thanks for the reply. In Observable Framework, I am trying to read several local parquet files into duckDB. I have parquet files (created in R) for each site
and users will select sites from a dropdown. I would like to include the selected sites in the database, but I am not sure the best way to proceed. (I realize that reading all the data into duckDB could work, but the database will have many sites and be quite large - 100’s GB).
The following works, but seem less flexible than the approach suggested by @jazzyjones above and listed in the duckDB guide.
const db2 = DuckDBClient.of({
file1: FileAttachment('./data/parquet/bySiteID/siteID=PA_01FL/part-0.parquet'),
file2: FileAttachment('./data/parquet/bySiteID/siteID=PA_02FL/part-0.parquet')
});
const data2Combined = db2.sql`
SELECT * FROM file1
UNION ALL
SELECT * FROM file2
`;
So, I imagine something more like this:
const dataFiles = ['./data/parquet/bySiteID/siteID=PA_01FL/part-0.parquet', './data/parquet/bySiteID/siteID=PA_02FL/part-0.parquet'];
const dataList = dataFiles.map(d => `'${d}'`).join(', ');
const dbDuck = await DuckDBClient.of();
await dbDuck.sql`CREATE TABLE dataIn
AS SELECT *
FROM read_parquet(${dataFiles})`;
const dbDuckQuery = dbDuck.query(`SELECT * FROM dataIn`)
But read_parquet()
is not finding the parquet files (even when I hard code them) and in Framework it seems to make sense to try to use FileAttachment()
.
I am curious how others would proceed - I am fairly new to duckDB. Thanks! Ben
@bletcher The first step would be to declare an input that lets users select files. There’s an example for that here:
https://observablehq.observablehq.cloud/framework-example-input-select-file/
If you want to allow multiple files to be selected, you can set the multiple option to true. (And like @mootari said, if you want local files rather than file attachments, you can use Inputs.file.)
The next part would be wiring this up to a DuckDBClient instance. I don’t think you can use db.sql for this because DuckDB-Wasm doesn’t support placeholders for arguments to the read_parquet function. But it’s easy enough to workaround that limitation by constructing the appropriate SQL yourself and calling db.query instead:
const db = await DuckDBClient.of();
await db.query(
`CREATE TABLE rows
AS SELECT *
FROM read_csv(${JSON.stringify(files.map((f) => f.href))}, AUTO_DETECT=TRUE)`
);
const sql = db.sql.bind(db);
Here I’m reading CSV files, but I imagine it would work the same way with read_parquet. Note that I also redeclared sql
so that I can use Framework’s built-in SQL code blocks.
Now I can write a SQL code block:
SELECT * FROM rows
And get a nice table:
2 Likes
Thanks so much @mbostock and @mootari! And thanks for everything you do for the community.
1 Like