
How to unzip and extract text from a downloaded zip corpus


I am planning to use the JSZip library, as seen here:

I think JSZip can unzip the archive if I can read the zip as an ArrayBuffer, chain that with the result of a GET request, and then await the results somehow.

For example, I want to download this corpus and read the text files inside it.

Any other recommendations?


I think this approach works well so far:

The bulk of the code is:

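    // (receiver elided in the post; d3-fetch's d3.buffer(url) has this shape)
    .buffer('https://cors-anywhere.herokuapp.com/' + 'https://github.com/nltk/nltk_data/raw/gh-pages/packages/corpora/abc.zip')
    .then(arrayBuffer => {
        // Parse the downloaded bytes as a zip archive.
        const zip = new JSZip();
        return zip.loadAsync(arrayBuffer);
    })
    .then(zip => {
        // Read the chosen text file from the archive as a string.
        return zip.file(`abc/${abc_rural_science_choice}.txt`).async('string');
    })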
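
The same flow can also be written with async/await. This is only a minimal sketch, not the notebook's actual code: `readTextFromZip`, `entryPath`, and `fetchFn` are hypothetical names, and the JSZip object is passed in as an argument so the sketch makes no assumption about how the library was loaded. It relies on JSZip's `loadAsync`, `file(name)`, and `async('string')` calls.

```javascript
// Hypothetical helper: download a zip, load it with JSZip, return one entry as text.
// `JSZip` is passed in so the sketch does not assume how the library was loaded.
// `fetchFn` defaults to the global fetch; the URL must allow cross-origin requests.
async function readTextFromZip(JSZip, url, entryPath, fetchFn = fetch) {
  const response = await fetchFn(url);
  if (!response.ok) {
    throw new Error(`Download failed with HTTP status ${response.status}`);
  }
  // Read the response body as an ArrayBuffer, which JSZip can load directly.
  const arrayBuffer = await response.arrayBuffer();
  const zip = await JSZip.loadAsync(arrayBuffer);
  const entry = zip.file(entryPath); // null when the archive has no such file
  if (entry === null) {
    throw new Error(`No file named ${entryPath} in the archive`);
  }
  return entry.async('string');
}
```

Checking for a missing entry before calling `async('string')` gives a clearer error than the `TypeError` that would otherwise come from calling a method on `null`.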


Yep! I was just putting together an example and you beat me to it. The only little pro tip I can contribute beyond that is that you can skip the cors-anywhere step by using raw.githubusercontent:



That URL will work just as well as the cors-anywhere route, and more reliably. It’s a little tricky to dig up this URL for binary files because GitHub doesn’t show the Raw link that it does for text files, but using direct GitHub URLs removes a little bit of complexity and means you don’t have to rely on cors-anywhere.
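
For anyone who wants to derive the direct URL rather than hunt for it, the rewrite can be sketched as a small helper. This is an illustration only: `toRawGitHubUrl` is a hypothetical name, and it assumes the link has the form `https://github.com/<owner>/<repo>/raw/<branch>/<path>`, like the corpus link used earlier in this thread.

```javascript
// Hypothetical helper: turn a github.com ".../raw/<branch>/<path>" link into
// the matching raw.githubusercontent.com URL.
function toRawGitHubUrl(githubUrl) {
  const url = new URL(githubUrl);
  // Path segments: owner, repo, the literal "raw", then branch and file path.
  const [owner, repo, kind, ...rest] = url.pathname.split('/').filter(Boolean);
  if (url.hostname !== 'github.com' || kind !== 'raw' || rest.length < 2) {
    throw new Error(`Not a github.com /raw/ link: ${githubUrl}`);
  }
  return `https://raw.githubusercontent.com/${owner}/${repo}/${rest.join('/')}`;
}
```

Passing the `github.com/nltk/nltk_data/raw/gh-pages/...` link from the earlier post should produce a URL on the `raw.githubusercontent.com` host with the `raw` path segment dropped.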


Another tiny thing is that you can use unpkg instead of bundle.run to load JSZip: you just need to target the UMD bundle provided in dist. Here’s an example:


This is consistent with Tom’s tutorial on require/modules, right?

Now that you’ve found the repository, look through its code: does it have a UMD or AMD build somewhere in its package that you just need to require?

Great, I incorporated this change, thanks.


Great thanks; so simple…
I thought I was using “raw”, because it is mentioned in the URL:

However, I see that when I visit that URL, I am 302-redirected to this request, which uses the URL you describe!

The response to this true “raw.githubusercontent” request is a 200 with Access-Control-Allow-Origin: *, which prevents the CORS security exception.

I incorporated this change, thanks!


For others who visit this question, note the relevant help in the Introduction to Data notebook:

Now you have a link to your file like this:


Unfortunately, this link doesn’t support CORS, so you’ll need to replace the gist.github.com domain with gist.githubusercontent.com
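
The domain swap described in that quote can be sketched as follows. `toCorsFriendlyGistUrl` is a hypothetical name, and the sketch assumes the only change needed is the hostname, as the quoted text says.

```javascript
// Hypothetical helper: swap the gist.github.com host for gist.githubusercontent.com,
// leaving the rest of the URL (path, query) unchanged.
function toCorsFriendlyGistUrl(gistUrl) {
  const url = new URL(gistUrl);
  if (url.hostname !== 'gist.github.com') {
    throw new Error(`Not a gist.github.com link: ${gistUrl}`);
  }
  url.hostname = 'gist.githubusercontent.com';
  return url.toString();
}
```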