what's a good approach for identifying broken links?

I was wondering if anyone had tips for an easy way to identify and flag broken HTML links?

I’ve assembled several literature lists over the years and am slowly learning how to present them using Tom’s Tables utility. Since I collected these documents, however, many of their URLs have stopped working. It’d be terrific if these broken links could self-identify as such, sort of like what is done here with highlighting using TinyMCE (although ideas for other approaches are also welcome).

I’ve googled around in an attempt to answer this question, finding this Stack Overflow thread and also the broken-link-checker node.js package on npm … but I’m not finding solutions that I can get working in Observable.

If anyone can point me toward an easy-to-understand solution, I’d appreciate it!

… and you won’t, because CORS limitations will prevent you from checking these links inside a notebook iframe.
Off the top of my head I can think of these alternative approaches:

  1. Use one of the many online link checker tools provided by online SEO services. There’s a huge number of these, so you may have to dig a bit until you find one that:
    • checks external links
    • handles dynamic pages (or allows you to paste text/html)
    • is free
  2. Use a browser extension that scans links for you. Prefer extensions that are hosted on GitHub, and be wary of those offered by SEO companies.
  3. Use a JS bookmarklet that you can run in the current tab context. Requires programming and might not be worth the effort. No real benefit over extensions.
  4. Set up a glitch.me server that you can pass a list of links (or a blob of HTML) to check (e.g. via a POST request), and use the broken-link-checker package there. A benefit would be that you can integrate a “check links” button into your notebooks (and even make it accessible only to yourself).
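
A rough sketch of what the server-side check in option 4 could look like. To avoid guessing at the broken-link-checker API, this uses plain HEAD requests via `fetch` instead, which is a swapped-in technique and not that package. The function names (`extractLinks`, `checkLink`, `checkLinks`) are made up for illustration, and Node 18+ is assumed for the global `fetch`:

```javascript
// Hypothetical helper names; Node 18+ is assumed for the global fetch().

// Pull absolute http(s) URLs out of href attributes in a blob of HTML.
// A regex is good enough for a sketch; a real HTML parser would be more robust.
function extractLinks(html) {
  const links = new Set();
  for (const match of html.matchAll(/href\s*=\s*["'](https?:\/\/[^"']+)["']/gi)) {
    links.add(match[1]);
  }
  return [...links];
}

// Check one URL with a HEAD request. fetchFn is injectable so the logic
// can be exercised without network access.
async function checkLink(url, fetchFn = fetch) {
  try {
    const response = await fetchFn(url, { method: "HEAD", redirect: "follow" });
    return { url, ok: response.ok, status: response.status };
  } catch (error) {
    // DNS failures, refused connections, and similar errors end up here.
    return { url, ok: false, status: null };
  }
}

// Check all URLs concurrently and return one result object per URL.
async function checkLinks(urls, fetchFn = fetch) {
  return Promise.all(urls.map((url) => checkLink(url, fetchFn)));
}
```

A POST handler on the server would then call something like `checkLinks(extractLinks(body))` and return the results as JSON. Some servers reject HEAD requests, so a fallback to GET for those cases may be needed.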

If you have several non-public notebooks that can contain stale links, you might also need a crawler. I’d recommend an external one that can handle dynamic pages, because scanning the actual sources would likely be a painful process (HTML, Markdown, etc.).

Thank you, Fabian. I’ll see if I can get the glitch.me solution working, though it would be ideal if the links could be identified in the table without having to build in a ‘check me’ button.

Regarding CORS – I noticed these errors in the console log with the SO approach. I didn’t get too far in breaking apart and attempting the TinyMCE approach, but I am curious as to why the codepen notebook would highlight bad links, but potentially not in Observable? Is codepen structured differently in terms of CORS permissions?

Because the plugin in the codepen is a “premium” plugin that uses a cloud service to perform the checks, similar to point 4 in the list above.

I would strongly advise against checking the links automatically without a button. Not only do you have to query your external API (which is rate-limited on glitch.me), but each check would have to perform at least a HEAD request against every single resource (i.e. link).

Edit: An intermediate approach would be to create a checksum (hash) of the links that you pass to your link checker API, and then aggressively cache the result on the server side using the checksum as cache ID.
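
To make the caching idea a bit more concrete, here is a minimal sketch for a Node server. Everything in it is an assumption for illustration: the names `cacheKey` and `checkWithCache`, the in-memory `Map`, the one-day TTL, and SHA-256 as the checksum. `checkFn` stands for whatever function actually checks the links:

```javascript
const crypto = require("crypto");

// Build a cache ID that is identical for the same set of links,
// regardless of order or duplicates.
function cacheKey(urls) {
  const canonical = [...new Set(urls)].sort().join("\n");
  return crypto.createHash("sha256").update(canonical).digest("hex");
}

// In-memory cache with a hypothetical time-to-live of one day.
const cache = new Map();
const TTL_MS = 24 * 60 * 60 * 1000;

// Return cached results while they are fresh; otherwise run checkFn
// (the actual link checker) and store its results under the checksum.
async function checkWithCache(urls, checkFn, now = Date.now()) {
  const key = cacheKey(urls);
  const hit = cache.get(key);
  if (hit && now - hit.storedAt < TTL_MS) {
    return hit.results;
  }
  const results = await checkFn(urls);
  cache.set(key, { storedAt: now, results });
  return results;
}
```

With this, repeated notebook loads that send the same list of links only trigger real HEAD requests once per TTL window. An in-memory cache is lost when the server restarts, so a file or small database might be a better fit if that happens often.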

Interesting! Thanks for the insights!