Are there non-portable npm modules? Or can something be done about them?

I am trying to learn web scraping using the Apify SDK. The basic example for this utility loads two modules, then executes a short scraping script:
module 1) apify
module 2) request-promise

The module require debugger succeeds in loading these modules via the Global Leaks Pattern:

Apify = require('apify@0.17.0/build/index.js').catch(() => window["_events"])

requestPromise = require('request-promise@4.2.5/lib/rp.js').catch(() => window["Bluebird"])

However, when I plug in the basic example scraping script, it fails with the error TypeError: Cannot read property 'main' of undefined.

The console shows that the apify module attempts further ‘require’ calls of its own, but that these fail, presumably because they are written for node + npm and need alternative require patterns in Observable?

To be specific, one of the two places where the console reports a failure is while loading this index.js (apparently from within the apify module loaded via npm):

"use strict";

var _events = _interopRequireDefault(require("events"));
var _log = _interopRequireDefault(require("apify-shared/log"));
var _consts = require("apify-shared/consts");
var _actor = require("./actor");
var _autoscaled_pool = _interopRequireDefault(require("./autoscaling/autoscaled_pool"));
var _basic_crawler = _interopRequireDefault(require("./crawlers/basic_crawler"));
var _cheerio_crawler = _interopRequireDefault(require("./crawlers/cheerio_crawler"));
var _dataset = require("./dataset");
var _events2 = _interopRequireWildcard(require("./events"));
var _key_value_store = require("./key_value_store");
var _puppeteer = require("./puppeteer");
var _puppeteer_crawler = _interopRequireDefault(require("./crawlers/puppeteer_crawler"));
var _puppeteer_pool = _interopRequireDefault(require("./puppeteer_pool"));
var _request = _interopRequireDefault(require("./request"));
var _request_list = require("./request_list");
var _request_queue = require("./request_queue");
var _settings_rotator = _interopRequireDefault(require("./settings_rotator"));
var _utils = require("./utils");
var _puppeteer_utils = require("./puppeteer_utils");
var _utils_social = require("./utils_social");
var _enqueue_links = require("./enqueue_links/enqueue_links");
var _pseudo_url = _interopRequireDefault(require("./pseudo_url"));
var _live_view_server = _interopRequireDefault(require("./live_view/live_view_server"));
var _utils_request = require("./utils_request");
var _session_pool = require("./session_pool/session_pool");
var _session = require("./session_pool/session");
... [I cut the rest for space considerations]

So what to do? Is it possible to get around this by manually requiring each of the nested modules that failed to load? Or is this not advisable?

I’d greatly appreciate any insights.

Here’s my attempt at reproducing the example:

Thank you!

Other References Consulted:

require.js : WHY AMD?

Not that this has anything to do with my specific question, but if ever in the future anyone is looking into web scraping on Observable, here’s an easy-to-use point of departure:

In 2018, there was a case brought against IFC in the US Supreme Court that resulted in its loss of certain (specific) immunities. ADB co-financed this project with IFC. For fun, I did a quick scrape of the associated project data on ADB’s website (roughly following Basil’s approach):

One thing I learned from this is that ADB’s project data pages aren’t all the same, and these scraping tools break pretty easily. :frowning: Oh well. In case it’s helpful to anyone, I’ll share it here (it’s unlikely that I’ll ever ‘publish’ it in my feed, as it’s pretty case-specific).

I am saddened at how difficult it is to cull data from the web. For all this work, sometimes I feel I was working faster & more efficiently by just copying and pasting with my mouse. Grr.

Anyway - if anyone can help with my module woes - that’d be awesome!! :pray:

Hi Aaron, I’ve been hoping someone with more module / Apify experience would answer, but since no one has, I guess I will make an attempt:

Yes, that’s correct. Apify can only be used with node on your computer or a server; there’s no way to run it in the browser. While some node packages can be bundled to run in the browser using tools like browserify, webpack and rollup, this is not feasible for Apify, which has to be able to launch and control instances of headless Chrome.
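To make that a bit more concrete, here is a rough sketch (not something Apify or Observable provides) that sorts the require targets from a file like the index.js above into Node built-ins, relative files, and other packages. The classifyRequires helper and the short built-in list are my own assumptions for illustration, and the regular expression only catches plain string-literal require calls.

```javascript
// Hypothetical helper: list the require() targets in a CommonJS source string
// and group them by kind. NODE_BUILTINS is a small illustrative subset,
// not the full list of Node core modules.
const NODE_BUILTINS = new Set([
  "events", "fs", "path", "http", "https",
  "child_process", "net", "os", "stream", "util",
]);

function classifyRequires(source) {
  const groups = { builtin: [], relative: [], packages: [] };
  // Matches require("x") or require('x') with a literal string argument.
  const pattern = /require\(\s*["']([^"']+)["']\s*\)/g;
  for (const match of source.matchAll(pattern)) {
    const spec = match[1];
    if (spec.startsWith("./") || spec.startsWith("../")) {
      groups.relative.push(spec);
    } else if (NODE_BUILTINS.has(spec.split("/")[0])) {
      groups.builtin.push(spec);
    } else {
      groups.packages.push(spec);
    }
  }
  return groups;
}

// Three lines taken from the index.js quoted in the question.
const sample = `
var _events = _interopRequireDefault(require("events"));
var _log = _interopRequireDefault(require("apify-shared/log"));
var _actor = require("./actor");
`;

const result = classifyRequires(sample);
console.log(result);
```

Anything that lands in the builtin group, such as events here, only exists in node unless a bundler substitutes a browser shim, and every relative or package entry is one more file that has to load under the same constraints. That is why requiring the pieces one by one does not get you very far.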


Thank you, Bryan. I had a feeling this was the case, but I haven’t yet gotten far enough into understanding modules to confirm it myself. During this experience, however, I have enjoyed reading about CommonJS and AMD and the related require documentation. I haven’t quite figured out how to tell which is which when reading module information on npm, so I was sort of wondering if that was part of the issue. I also tried requiring the many modules listed in that index file independently, but that turned out to be a somewhat absurd exercise…
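For anyone else trying to tell the formats apart, one rough heuristic is to look at the module's entry file rather than its npm page. The sketch below guesses the format from a few telltale patterns in the source text. The guessModuleFormat name and the patterns are assumptions for illustration only, and a heuristic like this can misjudge minified or unusual code (for example, an AMD module that calls require inside its factory will be reported as UMD).

```javascript
// Hypothetical heuristic: guess a module's format from its source text.
// It looks only for common markers and is not a real parser.
function guessModuleFormat(source) {
  // AMD: a define(...) call or the define.amd feature check.
  const hasAmd =
    /\bdefine\.amd\b/.test(source) ||
    /\bdefine\(\s*(\[|function|["'])/.test(source);
  // CommonJS: module.exports, exports.name = ..., or require("...").
  const hasCjs =
    /\bmodule\.exports\b|\bexports\.\w+\s*=|\brequire\(\s*["']/.test(source);
  // ES modules: an import or export statement at the start of a line.
  const hasEsm = /^\s*(import|export)\s/m.test(source);

  if (hasAmd && hasCjs) return "UMD";
  if (hasAmd) return "AMD";
  if (hasEsm) return "ES module";
  if (hasCjs) return "CommonJS";
  return "unknown";
}

// Small made-up snippets, one per format.
const examples = {
  cjs: 'var _events = require("events"); exports.main = function () {};',
  amd: 'define(["jquery"], function ($) { return {}; });',
  esm: "export default function main() {}",
  umd: 'if (typeof define === "function" && define.amd) define([], f); else module.exports = f();',
};

for (const [name, source] of Object.entries(examples)) {
  console.log(name, "->", guessModuleFormat(source));
}
```

Knowing the format is only part of the story: a CommonJS file that requires node built-ins still won't run in the browser, whichever loader is used.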

I also owe @mhebrard a follow-up in the thread save a map (Openlayers), where he identifies the module html-to-image as not loading correctly… and I have been suspecting it’s something of the same issue (?)…

Thank you for your time and for helping me to stop messing around too much on this one! :slight_smile: I still hope to improve my scraping capacity… and I really need to figure out how to DRY up this process, since it’s a total pain to re-create each of these cells, as I’ve done in the ADB project scraping notebook.

Cheers!!!