DuckDB is not returning rows for Parquet files

massyn · June 24, 2024, 9:04pm

Hi all, I’ve been using Observable Framework for a few weeks, and I ran into a strange behaviour with SQL.

I have a Parquet file created by a Python data loader that downloads the data from Google BigQuery. From what I can tell, the Parquet file is fine: it has all the fields, the right schema, the right data, the works.

With my code, I pop this SQL in… This works, and confirms that the whole thing is ok.

select
    resource
from
    metric_detail
where
    owner = ${owner}

I’d like to expand the columns. So I run the actual query I want…

select
    resource,
    title
from
    metric_detail
where
    owner = ${owner}

This is where the problem comes in – I get 0 rows. Upon further testing, I tried this…

select
    *
from
    metric_detail
where
    owner = ${owner}

The headings update, and they show all the columns, yet again, no data returned.

I did observe something interesting. The query

select distinct title
from metric_detail

did give me a weird error in the browser console.

Error: Invalid Error: TProtocolException: Invalid data
    at U.onMessage (_esm.js:7:11151)

Some additional info – changing the data format from parquet to json has worked. Json is not desirable as it is way too big.

Any suggestions on what might be the issue with the parquet file?

mootari · June 24, 2024, 10:12pm

Does the problem go away if you reload the page? And does changing the value for owner dynamically (e.g. via a text input) cause similar problems?

massyn · June 24, 2024, 10:27pm

I tried reloading, restarting, the works. No change. It is not a problem with the SQL code. It is something to do with the ingestion of a parquet file format. Changing to json solved the issue, so I know it is not an SQL issue.

Something with the way Observable is ingesting the Parquet file is causing it to do something weird with the file. I viewed the file with a Parquet explorer. All the fields are there, all the data is there.
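For anyone who wants a quick structural check alongside a Parquet explorer, here is a minimal sketch in Python. It relies only on the documented Parquet framing: an unencrypted file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. The function name `check_parquet_framing` and the file name in the usage note are made up for illustration. Passing this check does not prove the file is healthy; it only rules out truncation or a mangled header or trailer.

```python
import struct

MAGIC = b"PAR1"  # magic bytes of an unencrypted Parquet file


def check_parquet_framing(data: bytes) -> dict:
    """Inspect the fixed framing of a Parquet file held in memory.

    Hypothetical helper: it looks only at the leading magic, the trailing
    magic, and the footer length field. It does not parse the footer.
    """
    result = {
        "size": len(data),
        "head_magic": False,
        "tail_magic": False,
        "footer_length": None,
        "footer_fits": False,
    }
    # Smallest possible layout: 4 (magic) + 4 (footer length) + 4 (magic).
    if len(data) < 12:
        return result
    result["head_magic"] = data[:4] == MAGIC
    result["tail_magic"] = data[-4:] == MAGIC
    # Footer length is a 4-byte little-endian unsigned integer.
    (footer_length,) = struct.unpack("<I", data[-8:-4])
    result["footer_length"] = footer_length
    # The footer plus the 12 framing bytes must fit inside the file.
    result["footer_fits"] = footer_length + 12 <= len(data)
    return result
```

To run it against a real file, something like `check_parquet_framing(open("metric_detail.parquet", "rb").read())` would do, where the file name is only a placeholder.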

mootari · June 24, 2024, 10:47pm

Can you narrow down what’s causing the issue? E.g., is it the way you select columns? Or is it the specific column types?

massyn · June 25, 2024, 1:42am

Not yet. Column types are VARCHAR. When SELECT’ing those columns, it returns 0 rows. The WHERE is not even in play at all. I will be focussing my efforts on how the parquet is being generated, forcing column types, and seeing if it has any impact.

massyn · June 26, 2024, 8:48pm

Hi all, started with a new round of testing.

Querying the Parquet file directly in DuckDB worked exactly as expected. All queries worked, and it seems that the Parquet file is healthy and that DuckDB is fully capable of reading and parsing it.

Next up, I created a new test.md file, and queried both the json and parquet file in a sql block, simply just a select * from table for each of them.

On first run, everything worked as expected. Both json and parquet data files were able to display the result of the SQL query.
I navigated to one of the other pages in the project, and it still worked.
Navigated back to my test page, and the Parquet query failed. It did not return any rows.

Conclusion thus far

This is not a data or a SQL issue. This has something to do with concurrency, or caching of the parquet data in memory. I don’t believe it is a DuckDB issue, because even while this issue is occurring, I am able to query the Parquet file directly.

massyn · June 28, 2024, 7:55am

Still no luck… Upon further investigation, I did come across this issue on GitHub.

github.com/duckdb/duckdb

Reading CSV and parquet file loaded over https initially fails but succeeds after several attempts

opened 11:24AM - 23 Mar 23 UTC

closed 08:47AM - 18 Oct 23 UTC

mskyttner

### What happens?

Intermittent issues with reading CSV file and parquet files over http(f)s, using duckdb CLI v0.7.1 b00b93f0b1 and duckdb-CLI v0.7.2-dev986 b6eb596089.

Maybe this is related to the server providing the files (?), in this case https://csvbase.com/calpaterson/iris.csv and https://csvbase.com/calpaterson/iris.parquet, using SQL like:

```sql
from 'https://csvbase.com/calpaterson/iris.csv' limit 3;
from 'https://csvbase.com/calpaterson/iris.parquet' limit 3;
```

Error messages can be:

```
// v0.7.1
Error: IO Error: Server did not send Content-Length header, can not read from this file.
Error: Invalid Error: stoll

// v0.7.2-dev986
Error: Invalid Input Error: No magic bytes found at end of file 'https://csvbase.com/calpaterson/iris.parquet'
Error: Invalid Error: TProtocolException: Invalid data
Segmentation fault (core dumped)
```

Eventually the SQL statement succeeds, often on the third attempt (v0.7.1). Or the statement succeeds intially, but repeating it causes errors (v0.7.2-dev986).

Possibly related to #5924

### To Reproduce

```bash
root@e5b47b35d70a:/data# duckdb
-- Loading resources from /root/.duckdbrc
v0.7.1 b00b93f0b1
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D from 'https://csvbase.com/calpaterson/iris.csv' limit 3;
Error: IO Error: Server did not send Content-Length header, can not read from this file.
D from 'https://csvbase.com/calpaterson/iris.csv' limit 3;
Error: Invalid Error: stoll
D from 'https://csvbase.com/calpaterson/iris.csv' limit 3;
┌────────────────┬──────────────┬─────────────┬──────────────┬─────────────┬─────────────┐
│ csvbase_row_id │ sepal length │ sepal width │ petal length │ petal width │    class    │
│     int64      │    double    │   double    │    double    │   double    │   varchar   │
├────────────────┼──────────────┼─────────────┼──────────────┼─────────────┼─────────────┤
│              1 │          5.1 │         3.5 │          1.4 │         0.2 │ Iris-setosa │
│              2 │          4.9 │         3.0 │          1.4 │         0.2 │ Iris-setosa │
│              3 │          4.7 │         3.2 │          1.3 │         0.2 │ Iris-setosa │
└────────────────┴──────────────┴─────────────┴──────────────┴─────────────┴─────────────┘
D from 'https://csvbase.com/calpaterson/iris.parquet' limit 3;
Error: IO Error: Server did not send Content-Length header, can not read from this file.
D from 'https://csvbase.com/calpaterson/iris.parquet' limit 3;
Error: Invalid Error: stoll
D from 'https://csvbase.com/calpaterson/iris.parquet' limit 3;
┌────────────────┬──────────────┬─────────────┬──────────────┬─────────────┬─────────────┐
│ csvbase_row_id │ sepal length │ sepal width │ petal length │ petal width │    class    │
│     int64      │    double    │   double    │    double    │   double    │   varchar   │
├────────────────┼──────────────┼─────────────┼──────────────┼─────────────┼─────────────┤
│              1 │          5.1 │         3.5 │          1.4 │         0.2 │ Iris-setosa │
│              2 │          4.9 │         3.0 │          1.4 │         0.2 │ Iris-setosa │
│              3 │          4.7 │         3.2 │          1.3 │         0.2 │ Iris-setosa │
└────────────────┴──────────────┴─────────────┴──────────────┴─────────────┴─────────────┘
```

When using duckdb-CLI v0.7.2-dev986 b6eb596089, downloaded from https://github.com/duckdb/duckdb/actions/runs/4497501281:

It works much better initially, but there are still some errors reported when doing multiple fetches:

```bash
./duckdb -unsigned
v0.7.2-dev986 b6eb596089
Enter ".help" for usage hints.
D install 'httpfs.duckdb_extension';
D from 'https://csvbase.com/calpaterson/iris.csv' limit 3;
┌────────────────┬──────────────┬─────────────┬──────────────┬─────────────┬─────────────┐
│ csvbase_row_id │ sepal length │ sepal width │ petal length │ petal width │    class    │
│     int64      │    double    │   double    │    double    │   double    │   varchar   │
├────────────────┼──────────────┼─────────────┼──────────────┼─────────────┼─────────────┤
│              1 │          5.1 │         3.5 │          1.4 │         0.2 │ Iris-setosa │
│              2 │          4.9 │         3.0 │          1.4 │         0.2 │ Iris-setosa │
│              3 │          4.7 │         3.2 │          1.3 │         0.2 │ Iris-setosa │
└────────────────┴──────────────┴─────────────┴──────────────┴─────────────┴─────────────┘
D from 'https://csvbase.com/calpaterson/iris.parquet' limit 3;
┌────────────────┬──────────────┬─────────────┬──────────────┬─────────────┬─────────────┐
│ csvbase_row_id │ sepal length │ sepal width │ petal length │ petal width │    class    │
│     int64      │    double    │   double    │    double    │   double    │   varchar   │
├────────────────┼──────────────┼─────────────┼──────────────┼─────────────┼─────────────┤
│              1 │          5.1 │         3.5 │          1.4 │         0.2 │ Iris-setosa │
│              2 │          4.9 │         3.0 │          1.4 │         0.2 │ Iris-setosa │
│              3 │          4.7 │         3.2 │          1.3 │         0.2 │ Iris-setosa │
└────────────────┴──────────────┴─────────────┴──────────────┴─────────────┴─────────────┘
D from 'https://csvbase.com/calpaterson/iris.parquet' limit 3;
Error: Invalid Input Error: No magic bytes found at end of file 'https://csvbase.com/calpaterson/iris.parquet'
D from 'https://csvbase.com/calpaterson/iris.parquet' limit 3;
Error: Invalid Error: TProtocolException: Invalid data
D from 'https://csvbase.com/calpaterson/iris.parquet' limit 3;
Error: Invalid Error: TProtocolException: Invalid data
D from 'https://csvbase.com/calpaterson/iris.csv' limit 3;
Segmentation fault (core dumped)
```

### OS:

duckdb CLI (amd-64) running on Debian (docker container)

### DuckDB Version:

CLI v0.7.1 b00b93f and CLI v0.7.2-dev986 b6eb596089

### DuckDB Client:

CLI

### Full Name:

Markus Skyttner

### Affiliation:

KTH Royal Institute of Technology

### Have you tried this on the latest `master` branch?

- [X] I agree

### Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

- [X] I agree

Not sure if it is related, but will keep digging.

mootari · June 28, 2024, 10:13am

Could you share the Parquet file that you’re testing with, or maybe even publish a small test repo?

massyn · June 28, 2024, 10:26am

Seems to be related to this issue…

github.com/duckdb/duckdb-wasm

Error while reading again a parquet file after browser reload

opened 03:19PM - 04 Mar 24 UTC

ericemc3

### What happens?

Executing twice the same request, after reloading the shell page, yields an error.

### To Reproduce

in https://shell.duckdb.org/, execute :

```
FROM 'https://static.data.gouv.fr/resources/communes-2023-format-parquet/20240122-085355/communes2023.parquet'
SELECT codgeo
WHERE epci = '200039865' ;
```

then reload the browser and execute that same query again.

On windows and with Chrome or Edge, i get:

`Invalid Error: TProtocolException: Invalid data`

`codgeo` column, which is also the first of the dataset, seems to be responsible.

With Firefox, no issue.

### OS:

Win11

### DuckDB Version:

10.0.0

### DuckDB Client:

shell wasm1.28.1-dev159.0

### Full Name:

eric mauviere

### Affiliation:

icem7

### Have you tried this on the latest [nightly build](https://duckdb.org/docs/installation/?version=main)?

I have tested with a nightly build

### Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

- [X] Yes, I have

Busy with more testing. Seems to be Windows related.

massyn · June 28, 2024, 10:45am

How to reproduce the issue

Start up an Observable Framework project that has at least 1 parquet file being consumed
Have at least 2 pages that will query the Parquet file
Run the Framework dev instance
- npm run dev
Navigate to the first page using Google Chrome on Windows 11.
Navigate to the 2nd page - The error message will appear

Error: Invalid Error: TProtocolException: Invalid data will be shown.

Workaround

Use Firefox
Use json or csv files instead of parquet
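
For the CSV route, a minimal sketch of what a Python data loader could look like, assuming the usual Framework convention that a loader such as `metric_detail.csv.py` (hypothetical file name) writes the file contents to standard output. The column names `resource`, `title` and `owner` come from the queries earlier in this thread; the row values are placeholders, and the real loader would fetch its rows from BigQuery instead. CSV is usually more compact than row-oriented JSON because the column names appear only once.

```python
import csv
import sys


def write_csv(rows, fieldnames, stream):
    """Write a list of dict rows as CSV to the given text stream."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Placeholder rows: a real loader would query the actual source here.
    rows = [
        {"resource": "example-resource-1", "title": "Example title", "owner": "alice"},
        {"resource": "example-resource-2", "title": "Another title", "owner": "bob"},
    ]
    # Framework captures stdout and serves it as the data file.
    write_csv(rows, ["resource", "title", "owner"], sys.stdout)
```

The same SQL can then run against the CSV-backed table, at the cost of losing Parquet's typed columns and compression.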

Related topics

github.com/duckdb/duckdb-wasm

Error while reading again a parquet file after browser reload

opened 03:19PM - 04 Mar 24 UTC

ericemc3

massyn · July 1, 2024, 10:59am

I noticed I am not the only one who found this issue… “No results” after refreshing site when using some parquet files · Issue #1470 · observablehq/framework (github.com)

mootari · July 1, 2024, 11:08am

As mentioned earlier, if you want someone else to look into this as well then it would be helpful to have a test repository with the exact setup required to reproduce the bug.

massyn · July 1, 2024, 11:14am

I agree… The problem is that the data file that is not working contains sensitive data, and I am not able to reproduce this with a separate Parquet file. What adds to the issue is that once I run it in incognito mode, or clear the browser cache completely, it works fine. So at face value it is not an issue with the Parquet file itself, but rather with how the data gets cached.

We do have another user with the same issue. I will continue to follow the issue on GitHub.

mootari · July 1, 2024, 11:38am

Does the size of the file matter? E.g., can you still reproduce the problem with only a subset of the data?

And if part of the data suffices for a repro, could you perhaps scramble/obfuscate it?
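
One way to scramble the data, as a rough sketch in Python: replace each sensitive string with a salted hash, so that equal values stay equal (a filter like `owner = ...` still selects the same rows) while the original text is not shared. The helper names, the salt and the column choice are all hypothetical. Note that this changes string lengths and contents, so there is no guarantee the scrambled file still triggers the bug.

```python
import hashlib


def scramble_value(value, salt: str) -> str:
    """Map a value to a short, stable hex token using a salted SHA-256."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:16]


def scramble_rows(rows, sensitive_columns, salt):
    """Return copies of the rows with the sensitive columns scrambled.

    None values are left alone so that NULLs stay NULL.
    """
    return [
        {
            key: scramble_value(val, salt)
            if key in sensitive_columns and val is not None
            else val
            for key, val in row.items()
        }
        for row in rows
    ]
```

Keeping the salt private makes it harder to recover short or guessable values by hashing candidates.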

Edit: Nvm, I just saw that the issue author shared repro steps for DuckDB’s “weather.parquet” example file: "No results" after refreshing site when using some parquet files · Issue #1470 · observablehq/framework · GitHub

massyn · July 2, 2024, 7:55am

Here is my test site where the same issue is reproduced, using the weather.parquet file from your site DuckDB Client for Observable / CMU Data Interaction Group | Observable

http://static.massyn.net.s3-website-ap-southeast-2.amazonaws.com/observablehq-test/

Full source code is available here

To reproduce the issue

Use Google Chrome on a Windows system
Open Testing parquet bug | Test
Click on Page 2
You may need to go back to the Index page.
You will notice the table changes to “No results”, when there should be a result.

GuillaumeChretien · July 31, 2024, 5:34pm

I have the same issue in a different context… so +1 !
