I wanted to use a data loader to do some preprocessing on a parquet file, so I wrote a bash script like:
```sh
cd -- "$(dirname "$0")"
duckdb <<'END_OF_SQL'
-- some sql here, producing a table tbl
copy tbl to '/dev/stdout' (format parquet, codec zstd);
END_OF_SQL
```
Unfortunately this does not work. That is, it works just fine when launched from the command line, but not as a data loader. The error as seen in Observable is
```
IO Error: Cannot open file "/dev/stdout": No such device or address
```
I can work around this by modifying the `copy tbl` line to

```sql
copy tbl to 'data/tbl.parquet' (format parquet, codec zstd);
```

and adding

```sh
cat data/tbl.parquet
```

as the last line of the file.
A little puzzling, isn’t it?
I don’t see that error, but I do see a different one, `FATAL Error: fsync failed!`, which matches this DuckDB issue (filed by another Observable Framework user):
Opened 10:38AM - 26 Feb 24 UTC · under review
### What happens?
When I run the following duckdb script with a shell script, it'll succeed and output binary parquet data to stdout:
```sh
#!/bin/bash
duckdb :memory: <<EOF
copy (select * from range(100000) tbl(i)) to '/dev/stdout' (format parquet, use_tmp_file false)
EOF
```
*But* when I run the same script via a nodejs script, it'll somehow get wired up so that duckdb calls `fsync` on the target fd and it fails -- see "To Reproduce"
### Why?
For context -- why would someone want to do that kind of thing? -- this problem occurred to me when using a similar duckdb call in a bash script, to use it with https://github.com/observablehq/framework.
Turns out there are two workarounds:
1. `COPY ... TO 'tmp.parquet'` followed by (in the bash script) `cat tmp.parquet && rm tmp.parquet`
2. leaving the `COPY TO` intact, but adding a `exit 0` afterwards: the duckdb call would still fail with the `fsync failed`, but the output is apparently written to the pipe before, and the `exit 0` will make the node script believe everything was OK. But this isn't desirable to keep around, since it could hide other problems.
### To Reproduce
```js
// t.js
import {spawn} from "node:child_process";
import {mkdir, open, readFile, rename, unlink} from "node:fs/promises";
import {WriteStream, createReadStream, existsSync, statSync} from "node:fs";

async function exec(cmd, args, output) {
  const subprocess = spawn(cmd, args, {windowsHide: true, stdio: ["ignore", "pipe", "inherit"]});
  subprocess.stdout.pipe(output);
  const code = await new Promise((resolve, reject) => {
    subprocess.on("error", reject);
    subprocess.on("close", resolve);
  });
  if (code !== 0) {
    throw new Error(`loader exited with code ${code}`);
  }
}

const tempPath = "test.tmp";
const tempFd = await open(tempPath, "w");
await exec("sh", ["test.sh"], tempFd.createWriteStream({highWaterMark: 1024 * 1024}));
// same result for the direct call:
// const d = "duckdb";
// await exec(d, [":memory:", "-c", "copy (select * from range(10000) tbl(i)) to '/dev/stdout' (format parquet, use_tmp_file false)"], tempFd.createWriteStream({highWaterMark: 1024 * 1024}));
```
With the script above and nodejs 21, we'll get this:
```
$ node t.js
Error: FATAL Error: fsync failed!
file:///Users/stephan/Sources/xyz/t.js:14
throw new Error(`loader exited with code ${code}`);
^
Error: loader exited with code 1
at exec (file:///Users/stephan/Sources/xyz/t.js:14:11)
Node.js v21.6.2
```
ℹ️ I've started looking into how to fix the problem, and this change seemed promising:
```diff
diff --git a/src/common/local_file_system.cpp b/src/common/local_file_system.cpp
index 3f2fc92114..c8843508d8 100644
--- a/src/common/local_file_system.cpp
+++ b/src/common/local_file_system.cpp
@@ -591,7 +591,7 @@ bool LocalFileSystem::ListFiles(const string &directory, const std::function<voi
 void LocalFileSystem::FileSync(FileHandle &handle) {
 	int fd = handle.Cast<UnixFileHandle>().fd;
-	if (fsync(fd) != 0) {
+	if ((fsync(fd) != 0) && (errno != ENOTSUP)) {
 		throw FatalException("fsync failed!");
 	}
 }
```
**however** I was unable (after trying for some time) to reproduce the nodejs setup with `tools/shell/tests/conftest.py`.
### OS:
macos
### DuckDB Version:
0.10.0
### DuckDB Client:
CLI
### Full Name:
Stephan Renatus
### Affiliation:
Styra Inc
### Have you tried this on the latest [nightly build](https://duckdb.org/docs/installation/?version=main)?
~I have tested with a nightly build~ I've done a release build from main and used that.
### Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [X] Yes, I have
I believe the bug is that DuckDB is trying to fsync on stdout, which isn’t supported.
If you don’t mind playing a little fast and loose by ignoring all DuckDB errors, you could try this instead to avoid the temp file:
```sh
duckdb :memory: <<EOF || true
copy (select * from range(100) tbl(i)) to '/dev/stdout' (format parquet, codec zstd)
EOF
```
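To see why ignoring the exit status still yields usable output, here's a small standalone Node sketch. The `sh -c` command is a stand-in for the failing duckdb call: the child writes its whole payload to the pipe before exiting nonzero, so the parent receives the data either way.

```javascript
import {spawn} from "node:child_process";
import {once} from "node:events";

// Stand-in for a loader that emits its output, then fails (like duckdb
// dying on fsync after the parquet bytes are already in the pipe).
const subprocess = spawn("sh", ["-c", "echo payload; exit 1"], {
  stdio: ["ignore", "pipe", "inherit"]
});

let out = "";
subprocess.stdout.on("data", (chunk) => (out += chunk));
const [code] = await once(subprocess, "close");

console.log(`got ${JSON.stringify(out.trim())}, exit code ${code}`);
```

The output arrives intact even though the exit code is 1, which is exactly what `|| true` (or `exit 0`) papers over.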
We’re spawning the process here:

```ts
  private readonly args: string[];

  constructor({command, args, ...options}: CommandLoaderOptions) {
    super(options);
    this.command = command;
    this.args = args;
  }

  async exec(output: WriteStream): Promise<void> {
    const subprocess = spawn(this.command, this.args, {windowsHide: true, stdio: ["ignore", "pipe", "inherit"]});
    subprocess.stdout.pipe(output);
    const code = await new Promise((resolve, reject) => {
      subprocess.on("error", reject);
      subprocess.on("close", resolve);
    });
    if (code !== 0) {
      throw new Error(`loader exited with code ${code}`);
    }
  }
}
```
Perhaps there’s a different way to spawn a process that avoids this error?
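One possibility (a sketch only, not necessarily what Framework ends up doing) would be to hand the child a real file descriptor as its stdout instead of a JS-level pipe, so that when duckdb calls fsync it targets a regular file. The `runLoader` helper and the `sh -c 'echo hello'` command are illustrative stand-ins:

```javascript
import {spawn} from "node:child_process";
import {open, readFile} from "node:fs/promises";
import {once} from "node:events";

// Sketch: pass the output file's descriptor directly as the child's stdout.
// The child then writes straight to a regular file, on which fsync is
// supported, instead of a pipe.
async function runLoader(command, args, path) {
  const file = await open(path, "w");
  try {
    const subprocess = spawn(command, args, {
      windowsHide: true,
      stdio: ["ignore", file.fd, "inherit"] // fd instead of "pipe"
    });
    const [code] = await once(subprocess, "close");
    if (code !== 0) throw new Error(`loader exited with code ${code}`);
  } finally {
    await file.close();
  }
}

// Stand-in for an arbitrary data loader command.
await runLoader("sh", ["-c", "echo hello"], "out.tmp");
console.log(await readFile("out.tmp", "utf8"));
```

With this shape there is no JS-level `subprocess.stdout.pipe(output)` step at all; the kernel writes the child's output directly into the file.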
I am running into this problem as well.
I have a very minimal data loader, `data/test.csv.sh`, with the contents

```sh
#!/usr/bin/env sh
duckdb -c "COPY (SELECT 1) TO STDOUT WITH (FORMAT CSV);"
```

Running it directly with `./src/data/test.csv.sh` works as expected, but when Framework runs it I get the error `Cannot open file "/dev/stdout"`.
I forgot to mention that I’m using Linux while you’re on macOS (my guess), which would explain the different errors.
I should also stress that the script where duckdb writes directly to /dev/stdout runs correctly from the command line and reports no errors. I checked, by redirecting stdout to a file, that it produces a valid parquet file with the expected content.
I’ve figured out how to fix this in Framework with a very minimal change to how we direct the output of the child process to a file:
observablehq:main ← observablehq:mbostock/fix-spawn-stdout · opened 09:18PM - 22 May 24 UTC
Fixes https://github.com/duckdb/duckdb/issues/10849.
Test data loader:
```sh
duckdb :memory: <<EOF
copy (select * from range(10) tbl(i)) to '/dev/stdout' (format parquet, codec zstd)
EOF
```
Assuming this lands, we should get a fix released in the next day or two.