I wanted to use a data loader to do some preprocessing on a parquet file, so I wrote a bash script like:
```sh
cd -- "$(dirname "$0")"
duckdb <<'END_OF_SQL'
-- some sql here, producing a table tbl
copy tbl to '/dev/stdout' (format parquet, codec zstd);
END_OF_SQL
```
Unfortunately this does not work. That is, it works just fine when launched from the command line, but not as a data loader. The error as seen in Observable is
```
IO Error: Cannot open file "/dev/stdout": No such device or address
```
I can work around this by modifying the `copy tbl` line to

```sql
copy tbl to 'data/tbl.parquet' (format parquet, codec zstd);
```

and adding

```sh
cat data/tbl.parquet
```

as the last line of the file.
A little puzzling, isn’t it?
I don’t see that error, but I do see a different one, `FATAL Error: fsync failed!`, which matches this DuckDB issue (filed by another Observable Framework user):
Opened 10:38AM - 26 Feb 24 UTC · under review
### What happens?
When I run the following duckdb script with a shell script, it'll succeed and output binary parquet data to stdout:
```sh
#!/bin/bash
duckdb :memory: <<EOF
copy (select * from range(100000) tbl(i)) to '/dev/stdout' (format parquet, use_tmp_file false)
EOF
```
*But* when I run the same script via a nodejs script, it'll somehow get wired up so that duckdb calls `fsync` on the target fd and it fails -- see "To Reproduce"
### Why?
For context -- why would someone want to do that kind of thing? -- this problem occurred to me when using a similar duckdb call in a bash script, to use it with https://github.com/observablehq/framework.
Turns out there are two workarounds:
1. `COPY ... TO 'tmp.parquet'` followed by (in the bash script) `cat tmp.parquet && rm tmp.parquet`
2. leaving the `COPY TO` intact, but adding a `exit 0` afterwards: the duckdb call would still fail with the `fsync failed`, but the output is apparently written to the pipe before, and the `exit 0` will make the node script believe everything was OK. But this isn't desirable to keep around, since it could hide other problems.
### To Reproduce
```js
// t.js
import {spawn} from "node:child_process";
import {mkdir, open, readFile, rename, unlink} from "node:fs/promises";
import {WriteStream, createReadStream, existsSync, statSync} from "node:fs";

async function exec(cmd, args, output) {
  const subprocess = spawn(cmd, args, {windowsHide: true, stdio: ["ignore", "pipe", "inherit"]});
  subprocess.stdout.pipe(output);
  const code = await new Promise((resolve, reject) => {
    subprocess.on("error", reject);
    subprocess.on("close", resolve);
  });
  if (code !== 0) {
    throw new Error(`loader exited with code ${code}`);
  }
}

const tempPath = "test.tmp";
const tempFd = await open(tempPath, "w");
await exec("sh", ["test.sh"], tempFd.createWriteStream({highWaterMark: 1024 * 1024}));
// same result for the direct call:
// const d = "duckdb";
// await exec(d, [":memory:", "-c", "copy (select * from range(10000) tbl(i)) to '/dev/stdout' (format parquet, use_tmp_file false)"], tempFd.createWriteStream({highWaterMark: 1024 * 1024}));
```
With the script above and nodejs 21, we'll get this:
```
$ node t.js
Error: FATAL Error: fsync failed!
file:///Users/stephan/Sources/xyz/t.js:14
throw new Error(`loader exited with code ${code}`);
^
Error: loader exited with code 1
at exec (file:///Users/stephan/Sources/xyz/t.js:14:11)
Node.js v21.6.2
```
ℹ️ I've started looking into how to fix the problem, and this change seemed promising:
```diff
diff --git a/src/common/local_file_system.cpp b/src/common/local_file_system.cpp
index 3f2fc92114..c8843508d8 100644
--- a/src/common/local_file_system.cpp
+++ b/src/common/local_file_system.cpp
@@ -591,7 +591,7 @@ bool LocalFileSystem::ListFiles(const string &directory, const std::function<voi
 void LocalFileSystem::FileSync(FileHandle &handle) {
 	int fd = handle.Cast<UnixFileHandle>().fd;
-	if (fsync(fd) != 0) {
+	if ((fsync(fd) != 0) && (errno != ENOTSUP)) {
 		throw FatalException("fsync failed!");
 	}
 }
```
**however** I was unable (after trying for some time) to reproduce the nodejs setup with `tools/shell/tests/conftest.py`.
### OS:
macos
### DuckDB Version:
0.10.0
### DuckDB Client:
CLI
### Full Name:
Stephan Renatus
### Affiliation:
Styra Inc
### Have you tried this on the latest [nightly build](https://duckdb.org/docs/installation/?version=main)?
~I have tested with a nightly build~ I've done a release build from main and used that.
### Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [X] Yes, I have
I believe the bug is that DuckDB is trying to fsync on stdout, which isn’t supported.
If you don’t mind playing a little fast and loose by ignoring all DuckDB errors, you could try this instead to avoid the temp file:
```sh
duckdb :memory: <<EOF || true
copy (select * from range(100) tbl(i)) to '/dev/stdout' (format parquet, codec zstd)
EOF
```
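To see why ignoring the exit status still yields usable output, here's a small standalone Node sketch. The `sh -c` command is a stand-in for the failing duckdb call: the child writes its whole payload to the pipe before exiting nonzero, so the parent receives the data either way.

```javascript
import {spawn} from "node:child_process";
import {once} from "node:events";

// Stand-in for a loader that emits its output, then fails (like duckdb
// dying on fsync after the parquet bytes are already in the pipe).
const subprocess = spawn("sh", ["-c", "echo payload; exit 1"], {
  stdio: ["ignore", "pipe", "inherit"]
});

let out = "";
subprocess.stdout.on("data", (chunk) => (out += chunk));
const [code] = await once(subprocess, "close");

console.log(`got ${JSON.stringify(out.trim())}, exit code ${code}`);
```

The output arrives intact even though the exit code is 1, which is exactly what `|| true` (or `exit 0`) papers over.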
We’re spawning the process here:

```ts
  private readonly args: string[];

  constructor({command, args, ...options}: CommandLoaderOptions) {
    super(options);
    this.command = command;
    this.args = args;
  }

  async exec(output: WriteStream): Promise<void> {
    const subprocess = spawn(this.command, this.args, {windowsHide: true, stdio: ["ignore", "pipe", "inherit"]});
    subprocess.stdout.pipe(output);
    const code = await new Promise((resolve, reject) => {
      subprocess.on("error", reject);
      subprocess.on("close", resolve);
    });
    if (code !== 0) {
      throw new Error(`loader exited with code ${code}`);
    }
  }
}
```
Perhaps there’s a different way to spawn a process that avoids this error?
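One possibility (a sketch only, not necessarily what Framework ends up doing) would be to hand the child a real file descriptor as its stdout instead of a JS-level pipe, so that when duckdb calls fsync it targets a regular file. The `runLoader` helper and the `sh -c 'echo hello'` command are illustrative stand-ins:

```javascript
import {spawn} from "node:child_process";
import {open, readFile} from "node:fs/promises";
import {once} from "node:events";

// Sketch: pass the output file's descriptor directly as the child's stdout.
// The child then writes straight to a regular file, on which fsync is
// supported, instead of a pipe.
async function runLoader(command, args, path) {
  const file = await open(path, "w");
  try {
    const subprocess = spawn(command, args, {
      windowsHide: true,
      stdio: ["ignore", file.fd, "inherit"] // fd instead of "pipe"
    });
    const [code] = await once(subprocess, "close");
    if (code !== 0) throw new Error(`loader exited with code ${code}`);
  } finally {
    await file.close();
  }
}

// Stand-in for an arbitrary data loader command.
await runLoader("sh", ["-c", "echo hello"], "out.tmp");
console.log(await readFile("out.tmp", "utf8"));
```

With this shape there is no JS-level `subprocess.stdout.pipe(output)` step at all; the kernel writes the child's output directly into the file.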
I am running into this problem as well.
I have a very minimal data loader, `data/test.csv.sh`, with the contents

```sh
#!/usr/bin/env sh
duckdb -c "COPY (SELECT 1) TO STDOUT WITH (FORMAT CSV);"
```

Running it directly with `./src/data/test.csv.sh` works as expected, but when Framework runs it I get the error `Cannot open file "/dev/stdout"`.
I forgot to mention that I’m using Linux while you’re on macOS (my guess), which would explain the different errors.
I should also stress that the script where duckdb writes directly to /dev/stdout runs correctly from the command line and reports no errors. I checked, by redirecting stdout to a file, that it produces a valid parquet file with the expected content.
I’ve figured out how to fix this in Framework with a very minimal change to how we direct the output of the child process to a file:
observablehq:main ← observablehq:mbostock/fix-spawn-stdout · opened 09:18PM - 22 May 24 UTC
Fixes https://github.com/duckdb/duckdb/issues/10849.
Test data loader:
```sh
duckdb :memory: <<EOF
copy (select * from range(10) tbl(i)) to '/dev/stdout' (format parquet, codec zstd)
EOF
```
Assuming this lands, we should get a fix released in the next day or two.