
How to unzip and extract text from a downloaded corpus in a zip file


#1

I am currently planning to use the JSZip library.

I think JSZip can unzip the archive if I can read the zip as an ArrayBuffer, as seen here:
https://stuk.github.io/jszip/documentation/examples/read-local-file-api.html

I would then chain this with the result of a GET request and await the results somehow.

For example, I would like to download this corpus and read the text files inside it:
https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/abc.zip

Any other recommendations?


#2

I think this approach works well so far.

The bulk of the code is:

d3
    // Fetch the zip as an ArrayBuffer (the cors-anywhere proxy adds the missing CORS header).
    .buffer('https://cors-anywhere.herokuapp.com/' + 'https://github.com/nltk/nltk_data/raw/gh-pages/packages/corpora/abc.zip')
    // Parse the archive with JSZip.
    .then(arrayBuffer => JSZip.loadAsync(arrayBuffer))
    // Read the chosen text file from the archive as a string.
    .then(zip => zip.file(`abc/${abc_rural_science_choice}.txt`).async('string'))
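
For readers who prefer async/await, the promise chain above could also be sketched as a single function. This is only a sketch under assumptions: `readZipEntry` is a hypothetical helper name, JSZip is passed in as an argument rather than loaded here, and the `fetchFn` parameter is added purely so the function can be tried without a network. It also checks for a failed request and a missing entry, which the chain above does not.

```javascript
// Sketch: fetch a zip archive and read one text entry from it.
// `readZipEntry` is a hypothetical helper; `JSZip` and `fetchFn` are passed in
// so the function does not depend on how they were loaded.
async function readZipEntry(JSZip, url, entryPath, fetchFn = fetch) {
  const response = await fetchFn(url);
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }
  // JSZip accepts an ArrayBuffer as input for loadAsync.
  const arrayBuffer = await response.arrayBuffer();
  const zip = await JSZip.loadAsync(arrayBuffer);
  // zip.file(path) returns null when the path is not in the archive.
  const entry = zip.file(entryPath);
  if (entry === null) {
    throw new Error(`No entry named ${entryPath} in the archive`);
  }
  // Decode the entry's contents as a string.
  return entry.async("string");
}
```

It might be called as `readZipEntry(JSZip, url, "abc/rural.txt")`, assuming `rural` is one of the values that `abc_rural_science_choice` can take.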

#3

Yep! I was just putting together an example and you beat me to it 🙂 The only little protip I can contribute beyond that is that you can skip the cors-anywhere step by using raw.githubusercontent.com.

This URL:

https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/abc.zip

will work just as well as the cors-anywhere route, and more reliably. It’s a little tricky to dig up this URL for binary files because GitHub doesn’t show the Raw link that it does for text files, but using direct GitHub URLs removes a little bit of complexity and means you don’t have to rely on cors-anywhere.


#4

Another tiny thing is that you can use unpkg instead of bundle.run to load JSZip—you just need to target the UMD bundle provided in dist. Here’s an example:
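
A minimal sketch of what targeting a file in dist on unpkg might look like. The helper name `unpkgUrl`, the version range `3`, and the file name `dist/jszip.min.js` are all assumptions for illustration; check the package's dist folder for the UMD file you actually want.

```javascript
// Hypothetical helper: build the unpkg URL for a file inside a published package.
function unpkgUrl(name, version, path) {
  return `https://unpkg.com/${name}@${version}/${path}`;
}

// Assumed version range and path, not taken from the thread.
const jszipUrl = unpkgUrl("jszip", "3", "dist/jszip.min.js");

// In an Observable cell, a URL like this could then be passed to require,
// for example:  JSZip = require(jszipUrl)
```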


#5

This is consistent with Tom’s tutorial on require/modules, right?

“Now that you’ve found the repository, look through its code: does it have a UMD or AMD build somewhere in its package that you just need to require?”

Great, I incorporated this change, thanks.


#6

Great, thanks; so simple…
I thought I was using “raw” because it appears in the URL:
https://github.com/nltk/nltk_data/raw/gh-pages/packages/corpora/abc.zip

However, I see that when I request that URL I am 302-redirected to this one, which is the URL you suggested:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/abc.zip

The response to this true “raw.githubusercontent” request is a 200 with Access-Control-Allow-Origin: *, which prevents the CORS security exception.
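
The URL mapping behind that redirect could be sketched as a small helper, so the raw.githubusercontent.com form can be derived without visiting the page. This is a sketch, not GitHub's documented API: `toRawGitHubUrl` is a hypothetical name, and the path shape `/owner/repo/(blob|raw)/ref/path` is an assumption based on the two URLs above.

```javascript
// Hypothetical helper: rewrite a github.com "blob" or "raw" file URL to the
// raw.githubusercontent.com form that the redirect described above lands on.
function toRawGitHubUrl(url) {
  const parsed = new URL(url);
  if (parsed.hostname !== "github.com") return url; // leave other hosts alone
  // Assumed path shape: /owner/repo/(blob|raw)/ref/path/to/file
  const match = parsed.pathname.match(/^\/([^/]+)\/([^/]+)\/(?:blob|raw)\/(.+)$/);
  if (match === null) return url; // not a file URL we recognise
  const [, owner, repo, rest] = match;
  // Any query string or fragment on the input is dropped.
  return `https://raw.githubusercontent.com/${owner}/${repo}/${rest}`;
}
```

Applied to the github.com/nltk/nltk_data/raw/... URL above, this returns the raw.githubusercontent.com URL that the redirect points to.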

I incorporated this change, thanks!


#7

For others who visit this question, note the relevant help in the Introduction to Data notebook:

Now you have a link to your file like this:

https://gist.github.com/mbostock/4063570/raw/11847750012dfe5351ee1eb290d2a254a67051d0/flare.csv

Unfortunately, this link doesn’t support CORS, so you’ll need to replace the gist.github.com domain with gist.githubusercontent.com
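
The domain swap described in that passage could be sketched as follows. `toCorsFriendlyGistUrl` is a hypothetical helper name; the code only replaces the host and leaves the rest of the URL untouched.

```javascript
// Hypothetical helper: swap gist.github.com for gist.githubusercontent.com,
// as the quoted notebook text suggests, leaving other URLs unchanged.
function toCorsFriendlyGistUrl(url) {
  const parsed = new URL(url);
  if (parsed.hostname === "gist.github.com") {
    parsed.hostname = "gist.githubusercontent.com";
  }
  return parsed.toString();
}
```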