Do Parquet files in DuckDB connections use HTTP ranged requests?

jimjam-slam · February 9, 2023, 5:05am

I’m really excited by the introduction of DuckDB in the Observable Standard Library! It makes working with a wide range of data really consistent (and tbh I need to sharpen up my SQL skills).

One reason I particularly like DuckDB is all of the recent talk about Parquet files. I already publish a bit of data in this format, and the idea of pulling data straight down from a remote Parquet file is really appealing.

That said, it would be amazing if I wasn’t downloading the whole Parquet file when I did it.

My understanding is that DuckDB-WASM may be able to do ranged requests on remote Parquet files, although I’m not sure whether this capability requires the HTTPFS extension (or whether that extension comes with Observable’s DuckDB client).

Going by my own tests, it doesn’t look like ranged requests are happening: the entire 17 MB Parquet file here is downloaded. I find the same thing using the Observable Standard Library in Quarto, even if I point it to a URL that responds with accept-ranges: bytes.

Is this something that does currently work with some adjustment, or is it something that the team is thinking of adding?

jimjam-slam · February 9, 2023, 5:48am

Looks like I should’ve been checking the console the whole time!

falling back to full HTTP read for: https://static.observableusercontent.com/files/2c2c…

Looks like the file attachment isn’t providing the right headers for range requests. Perhaps access-control-allow-headers: range? Is that something the team is able to change in time?

(For my own use in Quarto, I can control where I download data from, and I’m guessing Observable has attachment size limits that make range requests less useful. It’d still be nice, though!)

mootari · February 9, 2023, 1:45pm

You can already do range requests for file attachments, but DuckDB’s detection method seems flawed. DuckDB issues a HEAD request with a Range header and then checks the response status for 206 Partial Content:

github.com/duckdb/duckdb-wasm/blob/cceb0ed/packages/duckdb-wasm/src/bindings/runtime_browser.ts#L156-L168

    const xhr = new XMLHttpRequest();
    if (file.dataProtocol == DuckDBDataProtocol.S3) {
        xhr.open('HEAD', getHTTPUrl(file.s3Config, file.dataUrl!), false);
        addS3Headers(xhr, file.s3Config, file.dataUrl!, 'HEAD');
    } else {
        xhr.open('HEAD', file.dataUrl!, false);
    }
    xhr.setRequestHeader('Range', `bytes=0-`);
    xhr.send(null);

    // Supports range requests
    const contentLength = xhr.getResponseHeader('Content-Length');
    if (xhr.status == 206 && contentLength !== null) {

This in itself seems strange, as the HTTP spec apparently says to ignore the Range header for anything but GET.

A response for an actual range request looks like this:

curl -H "range: bytes=0-0" -I -X GET  https://static.observableusercontent.com/files/f4fd5c681574c1ff7050d59fd5fa64f7bc30d28763a6246afa5786d07181a34c07ddc47f614ce00669b69c09aabdbe77fbada74d9c06a590e8712cf19a95c1c3
HTTP/2 206 
content-type: text/csv
content-length: 1
last-modified: Tue, 10 May 2022 02:27:13 GMT
x-amz-server-side-encryption: AES256
content-encoding: gzip
accept-ranges: bytes
server: AmazonS3
access-control-allow-origin: *
access-control-allow-headers: range
access-control-allow-methods: GET
access-control-max-age: 86400
date: Thu, 09 Feb 2023 13:29:24 GMT
cache-control: max-age=43200, s-maxage=30
etag: "31b4db03fe21d4bbd036c22205a4fe55"
x-cache: RefreshHit from cloudfront
via: 1.1 def5acc189db6e2856a956225d5cd100.cloudfront.net (CloudFront)
x-amz-cf-pop: FRA56-P6
x-amz-cf-id: Vg7v4eVKWY9X5LEMCRJgig1PxJsUwuBFD6ri1iqF8jivO7u_3OVT0Q==
content-range: bytes 0-0/3058
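
Note that in a ranged response like this one, content-length only describes the returned slice (1 byte), while the full object size (3058) is carried in content-range. As a rough illustration, here is a minimal TypeScript sketch of reading that header. `parseContentRange` and the `ContentRange` type are hypothetical names made up for this example; they are not part of DuckDB-WASM or Observable's client.

```typescript
// Hypothetical helper (not part of DuckDB-WASM): parse a Content-Range
// header such as "bytes 0-0/3058" into its numeric parts.
interface ContentRange {
    start: number;
    end: number;
    // null when the server reports the total size as "*" (unknown)
    total: number | null;
}

function parseContentRange(header: string | null): ContentRange | null {
    if (header === null) return null;
    // Only the satisfied-range form "bytes <start>-<end>/<total|*>" is handled.
    const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(header.trim());
    if (match === null) return null;
    const start = Number(match[1]);
    const end = Number(match[2]);
    if (end < start) return null;
    const total = match[3] === '*' ? null : Number(match[3]);
    return { start, end, total };
}

// Header value copied from the 206 response above.
const parsedRange = parseContentRange('bytes 0-0/3058');
console.log(parsedRange);
```

The sketch deliberately ignores the unsatisfied-range form (`bytes */3058`), which a server would send with a 416 status rather than a 206.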

However, if we issue a HEAD request (which fails in fetch, likely because the response is compressed), we get:

HTTP/2 200 
content-type: text/csv
content-length: 3058
date: Thu, 09 Feb 2023 13:32:00 GMT
last-modified: Tue, 10 May 2022 02:27:13 GMT
etag: "31b4db03fe21d4bbd036c22205a4fe55"
x-amz-server-side-encryption: AES256
cache-control: max-age=43200, s-maxage=30
content-encoding: gzip
accept-ranges: bytes
server: AmazonS3
access-control-allow-origin: *
access-control-allow-headers: range
access-control-allow-methods: GET
access-control-max-age: 86400
x-cache: Miss from cloudfront
via: 1.1 dd09b3b5f5b8dc626e1ba6804a73af40.cloudfront.net (CloudFront)
x-amz-cf-pop: FRA56-P6
x-amz-cf-id: IjtpZTK58lw69L6RZx4hfIxo0ILf5SgpEVZ6DVmHHgCQGwI3xaSPxw==
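
To spell out why DuckDB falls back to a full read here, the sketch below applies the same status and Content-Length condition from the snippet quoted earlier to the two responses shown in this post. The `ProbeResponse` type and `supportsRangeRequests` name are made up for illustration and are not DuckDB-WASM APIs; only the condition itself comes from the quoted source.

```typescript
// Hypothetical model of a probe response; not a DuckDB-WASM type.
interface ProbeResponse {
    status: number;
    contentLength: string | null;
}

// Mirrors the condition quoted from runtime_browser.ts:
//   xhr.status == 206 && contentLength !== null
function supportsRangeRequests(res: ProbeResponse): boolean {
    return res.status === 206 && res.contentLength !== null;
}

// Values copied from the two responses shown in this post.
const getResponse: ProbeResponse = { status: 206, contentLength: '1' };
const headResponse: ProbeResponse = { status: 200, contentLength: '3058' };

console.log(supportsRangeRequests(getResponse));  // true: 206 with a length
console.log(supportsRangeRequests(headResponse)); // false: status is 200
```

So the server does honor ranges on GET, but a check that only looks for a 206 on HEAD never sees that.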

To be honest, I don’t know what the solution here would be.
