DuckDB is not returning rows for Parquet files

Hi all, I’ve been using Observable Framework for a few weeks, and I ran into a strange behaviour with SQL.

I have a Parquet file, created by a Python data loader that downloads the data from Google BigQuery. From what I can tell, the Parquet file is fine, and has all the fields, the right schema, the right data, the works.

With my code, I pop this SQL in… This works, and confirms that the whole thing is ok.

select
    resource
from
    metric_detail
where
    owner = ${owner}

I’d like to expand the columns. So I run the actual query I want…

select
    resource,
    title
from
    metric_detail
where
    owner = ${owner}

This is where the problem comes in – I get 0 rows. Upon further testing, I tried this…

select
    *
from
    metric_detail
where
    owner = ${owner}

The headings update, and they show all the columns, yet again, no data returned.

I did observe something interesting. The query

select distinct title
from metric_detail

did give me a weird error in the browser console.

Error: Invalid Error: TProtocolException: Invalid data
    at U.onMessage (_esm.js:7:11151)

Some additional info – changing the data format from Parquet to JSON has worked. JSON is not desirable as it is way too big.

Any suggestions on what might be the issue with the parquet file?

Does the problem go away if you reload the page? And does changing the value for owner dynamically (e.g. via a text input) cause similar problems?

I tried reloading, restarting, the works. No change. It is not a problem with the SQL code. It has something to do with how the Parquet file is ingested. Changing to JSON solved the issue, so I know it is not an SQL issue.

Something with the way Observable is ingesting the Parquet file is causing it to do something weird with the file. I viewed the file with a Parquet explorer. All the fields are there, all the data is there.
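
For anyone wanting to rule out a damaged or truncated file without a dedicated viewer, here is a minimal sketch in Python (standard library only). It checks only the outer layout that the Parquet format prescribes: the 4-byte magic PAR1 at both ends of the file, and a little-endian footer length stored just before the trailing magic. The function name check_parquet_layout is made up for illustration, and the sketch does not decode the metadata itself, so a clean result only means the file is not obviously cut short.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"


def check_parquet_layout(path):
    """Return a list of problems found in the file's outer layout (empty if none)."""
    size = os.path.getsize(path)
    # Smallest conceivable file: leading magic + footer length + trailing magic.
    if size < 12:
        return [f"file is only {size} bytes, too small to be Parquet"]

    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)

    problems = []
    if head != PARQUET_MAGIC:
        problems.append(f"unexpected leading bytes: {head!r}")
    if tail[4:] != PARQUET_MAGIC:
        problems.append(f"unexpected trailing bytes: {tail[4:]!r}")

    # The 4 bytes before the trailing magic hold the footer (metadata) length.
    footer_len = struct.unpack("<I", tail[:4])[0]
    if footer_len + 12 > size:
        problems.append(f"footer length {footer_len} does not fit in a {size}-byte file")
    return problems
```

Calling it on the loader's output (for example a hypothetical path such as src/data/metric_detail.parquet) should return an empty list when the layout is intact. If the file on disk passes but the browser still reports TProtocolException, that would point away from the file and towards what the browser actually received.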

Can you narrow down what’s causing the issue? E.g., is it the way you select columns? Or is it the specific column types?

Not yet. Column types are VARCHAR. When SELECT’ing those columns, it returns 0 rows. The WHERE is not even in play at all. I will be focussing my efforts on how the parquet is being generated, forcing column types, and seeing if it has any impact.

Hi all, started with a new round of testing.

Querying the Parquet file directly in DuckDB worked exactly as expected. All queries worked, and it seems that the Parquet file is healthy and that DuckDB is fully capable of reading and parsing it.

Next up, I created a new test.md file, and queried both the JSON and the Parquet file in a sql block, with a simple select * from table for each of them.

  • On first run, everything worked as expected. Both json and parquet data files were able to display the result of the SQL query.
  • I navigated to one of the other pages in the project, and it still worked.
  • Navigated back to my test page, and the Parquet query failed. It did not return any rows.

Conclusion thus far

This is not a data or a SQL issue. This has something to do with concurrency, or caching of the parquet data in memory. I don’t believe it is a DuckDB issue, because even while this issue is occurring, I am able to query the Parquet file directly.

Still no luck… Upon further investigation, I did come across this issue on GitHub.

Not sure if it is related, but will keep digging.

Could you share the Parquet file that you’re testing with, or maybe even publish a small test repo?

Seems to be related to this issue…

Busy with more testing. Seems to be Windows related.

How to reproduce the issue

  • Start up an Observable Framework project that has at least 1 parquet file being consumed
  • Have at least 2 pages that will query the Parquet file
  • Run the Framework dev instance
    • npm run dev
  • Navigate to the first page using Google Chrome on Windows 11.
  • Navigate to the 2nd page - The error message will appear

The message shown is: Error: Invalid Error: TProtocolException: Invalid data

Workaround

  • Use Firefox
  • Use JSON or CSV files instead of Parquet

Related topics

I noticed I am not the only one who found this issue… “No results” after refreshing site when using some parquet files · Issue #1470 · observablehq/framework (github.com)

As mentioned earlier, if you want someone else to look into this as well then it would be helpful to have a test repository with the exact setup required to reproduce the bug.

I agree… The problem is that the data file that is not working contains sensitive data, and I am not able to reproduce this with a separate Parquet file. What adds to the confusion is that once I run it in incognito mode, or clear the browser cache completely, it works fine. So at face value it is not an issue with the Parquet file itself, but rather with how the data gets cached.

We do have another user with the same issue. I will continue to follow the issue on GitHub.

Does the size of the file matter? E.g., can you still reproduce the problem with only a subset of the data?

And if part of the data suffices for a repro, could you perhaps scramble/obfuscate it?
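
As a rough illustration of the scrambling idea, here is a sketch in Python (standard library only) that replaces the string values in chosen columns with same-length pseudonyms, so that equal values stay equal and the row shape is preserved. The names scramble and scramble_rows, the salt, and the sample rows are all hypothetical. Writing the result back to Parquet is left to whatever the data loader already uses, and there is no guarantee the bug still reproduces once the values change.

```python
import hashlib


def scramble(value, salt="repro"):
    """Replace a string with a same-length hex pseudonym.

    Equal inputs (with the same salt) always map to equal outputs.
    """
    if not value:
        return value
    out = ""
    counter = 0
    # One SHA-256 digest yields 64 hex characters; chain more for longer values.
    while len(out) < len(value):
        digest = hashlib.sha256(f"{salt}:{counter}:{value}".encode("utf-8"))
        out += digest.hexdigest()
        counter += 1
    return out[: len(value)]


def scramble_rows(rows, columns, salt="repro"):
    """Return copies of the rows with string values in `columns` scrambled."""
    return [
        {
            key: scramble(val, salt) if key in columns and isinstance(val, str) else val
            for key, val in row.items()
        }
        for row in rows
    ]


# Made-up sample rows using the column names from the queries above.
rows = [
    {"resource": "vm-1", "title": "Secret", "owner": "alice"},
    {"resource": "vm-2", "title": "Secret", "owner": "bob"},
]
scrambled = scramble_rows(rows, {"title", "owner"})
```

Keeping the value lengths and the duplicates intact is a deliberate choice here: it changes the data as little as possible, on the assumption that the file's size and layout may matter for reproducing the problem.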

Edit: Nvm, I just saw that the issue author shared repro steps for DuckDB’s “weather.parquet” example file: "No results" after refreshing site when using some parquet files · Issue #1470 · observablehq/framework · GitHub

Here is my test site where the same issue is reproduced, using the weather.parquet file from your site DuckDB Client for Observable / CMU Data Interaction Group | Observable

http://static.massyn.net.s3-website-ap-southeast-2.amazonaws.com/observablehq-test/

Full source code is available here

To reproduce the issue

  • Use Google Chrome on a Windows system
  • Open Testing parquet bug | Test
  • Click on Page 2
  • You may need to go back to the Index page.
  • You will notice the table changes to “No results”, when there should be a result.

I have the same issue in a different context… so +1 !