Tips for distilling arrays from Markdown?

Say that I have painstakingly captured the names and URLs of web articles as markdown files, and now I wish to evaluate this work as ‘data’. How does one go about asking JavaScript to create an array of link names and target URLs? That is, if I have

[test link](https://example.com)

How do I identify the internal text test link as one object (say with the name “title”), and the URL https://example.com as another object (say with the name “url”)?

Using markdown-it, I get an array containing a token of type inline that holds the information I’m after, but it’s sandwiched between a lot of other information, and the token type inline is the same for both parts of the link.

Any quick and dirty ways to generate data from markdown text by searching a file and pulling out URL information?

Something using markdown-it or another parser is going to be the most robust solution, but as you noticed you’ll have to write some code to “walk” the result and pull out the parts you need.
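To give a rough idea of what “walking” the result could look like, here is a minimal sketch. It assumes the tokens follow markdown-it’s usual shape (an inline token whose children include link_open, text and link_close). The function name linksFromTokens and the hand-written sampleTokens are made up for illustration; with the real library the tokens would come from the parser rather than being typed out.

```javascript
// Collect {title, url} pairs from a markdown-it style token array.
// Assumption: links appear as children of an "inline" token, in the order
// link_open (with attrs like [["href", "..."]]), content tokens, link_close.
function linksFromTokens(tokens) {
  const result = [];
  for (const token of tokens) {
    if (token.type !== "inline" || !token.children) continue;
    let current = null; // the link being assembled, if we are inside one
    for (const child of token.children) {
      if (child.type === "link_open") {
        const href = (child.attrs || []).find(([name]) => name === "href");
        current = {title: "", url: href ? href[1] : ""};
      } else if (child.type === "link_close" && current) {
        result.push(current);
        current = null;
      } else if (current) {
        // Text (or other inline content) inside the link adds to its title.
        current.title += child.content || "";
      }
    }
  }
  return result;
}

// Hypothetical stand-in for the tokens produced for
// "[test link](https://example.com)".
const sampleTokens = [
  {type: "paragraph_open"},
  {type: "inline", children: [
    {type: "link_open", attrs: [["href", "https://example.com"]]},
    {type: "text", content: "test link"},
    {type: "link_close"}
  ]},
  {type: "paragraph_close"}
];

console.log(linksFromTokens(sampleTokens));
```

Tracking the “current” link between link_open and link_close is what lets the title and the URL end up in the same object, even though both live under the same inline token.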

A really quick and dirty way is to use regexes. Here’s a function which takes a markdown string and spits out an array of objects with title and url fields:

function mdLinks(str) {
  const matches = str.matchAll(/\[([^\]]*)\]\(([^)]*)\)/g);
  const result = [];
  for (const match of matches) {
    result.push({title: match[1], url: match[2]});
  }
  return result;
}

If your input isn’t too crazy, this might be good enough, though be aware that it is much more fragile than a good parser solution. For instance, this will choke if it’s given markdown where the link text contains square brackets.
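To make the fragility concrete, here is a small sketch that runs the same regex over three hypothetical inputs: a plain link, a link whose text contains square brackets, and a URL that contains parentheses. The sample strings are invented for illustration.

```javascript
// Same regex as above, repeated so this snippet runs on its own.
function mdLinks(str) {
  const result = [];
  for (const match of str.matchAll(/\[([^\]]*)\]\(([^)]*)\)/g)) {
    result.push({title: match[1], url: match[2]});
  }
  return result;
}

// A plain link is picked up as expected.
const simple = mdLinks("See [test link](https://example.com) for details.");

// Square brackets inside the link text: the link is missed entirely.
const bracketed = mdLinks("[notes [draft]](https://example.com)");

// Parentheses inside the URL: the URL is cut off at the first ")".
const parens = mdLinks("[wiki](https://example.com/Foo_(bar))");

console.log(simple, bracketed, parens);
```

The first call returns one object, the second returns an empty array, and the third returns a URL missing its closing part, which is the kind of silent failure a real parser avoids.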


I’d probably just convert the markdown to a DOM and then use .querySelectorAll('a') on it. That way you can also handle inlined HTML.

Note that you can simplify the regex by using lazy matching:

const matches = str.matchAll(/\[(.+?)\]\((.+?)\)/g);

I’d recommend always using lazy matching, as it can drastically reduce backtracking.
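As a quick illustration of why the quantifier matters, this sketch compares a greedy pattern with the lazy one on a line holding two links. The sample text and the helper name extract are assumptions made for the example. It also shows one difference from the earlier negated-character-class regex: .+? needs at least one character, so a link with empty text is skipped.

```javascript
// Hypothetical helper: run a pattern and return {title, url} objects.
function extract(text, pattern) {
  return [...text.matchAll(pattern)].map(m => ({title: m[1], url: m[2]}));
}

const text = "[one](https://example.com/1) and [two](https://example.com/2)";

// Greedy: the first group swallows everything up to the last "](".
const greedy = extract(text, /\[(.+)\]\((.+)\)/g);

// Lazy: each group stops at the earliest possible point.
const lazy = extract(text, /\[(.+?)\]\((.+?)\)/g);

// Empty link text: the lazy pattern finds nothing, the negated class does.
const emptyLazy = extract("[](https://example.com)", /\[(.+?)\]\((.+?)\)/g);
const emptyNegated = extract("[](https://example.com)", /\[([^\]]*)\]\(([^)]*)\)/g);

console.log(greedy, lazy, emptyLazy, emptyNegated);
```

Here greedy contains a single merged match, while lazy contains the two links separately.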


Thanks for these tips!

Unfortunately, I am still too new to all of this to get either approach working well. In both cases, my hang-up seems to have something to do with getting cells to ‘communicate’ with one another.

Bryan’s regex example uses a function of str. The output seems to be the same regardless of whether or not I create a markdown cell named str with the link values (that is, it returns a function, and I don’t know how to draw out an array from it).

Trying out Fabian’s solution, I can work through a series of steps that work for an entire notebook using document, but I can’t seem to focus this on a single input cell.

Here’s a test notebook with both approaches:

Any further guidance?

And by the way - you folks are amazing! Thanks for your kind and generous help!


Here’s what I got: https://observablehq.com/d/b9b7c24d28c5ff9e

I made two changes:

  • I made both solutions use the Markdown-rendered HTML, since Markdown wraps inline elements in a <p>, whereas the html template tag doesn’t. This is why the second solution was failing: querySelector/querySelectorAll only look at the element’s descendants, not the element itself.
  • I added a new cell to call the function from the first solution.

Thank you Jed!

There are a few things I’m not quite getting. On your inclusion of mdLinks(str) to call the regex function - the cell I am seeing returns ‘TypeError: str.matchAll is not a function’. I also tried Array.from(mdLinks) but this came up empty.

For the DOM conversion and parsing - I still have a lot of work ahead of me to become comfortable with the API (a tip from Bryan months back - I keep reading!). In the HTML / DOM approach how does one identify the link text? I’m hoping to come away with an array of both object values for each link. [Thanks to Bryan, for showing how to map these in Markdown with regex. It’s another tool on my list to learn].

Thank you for kindly clarifying the error caused by the missing <p></p> elements. This is very helpful.

Ah, I should’ve been more clear. The function mdLinks takes as input a string str. In your notebook you passed it the value of a cell that uses md, but the output of md is an HTML element, not a string. I see how “markdown string” could have been confusing - by that I just meant a string which contains markdown code.
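One way to make this kind of mix-up easier to spot is to have the function check its input. This is only a sketch: the type check and its error message are my additions, and the object passed in the second call is a made-up stand-in for an HTML element.

```javascript
// mdLinks with an explicit input check (the check is an addition for
// illustration, not part of the original function).
function mdLinks(str) {
  if (typeof str !== "string") {
    throw new TypeError(
      `mdLinks expects a string of Markdown source, got ${typeof str}`
    );
  }
  return Array.from(
    str.matchAll(/\[([^\]]*)\]\(([^)]*)\)/g),
    match => ({title: match[1], url: match[2]})
  );
}

// A plain string containing Markdown works.
const fromString = mdLinks("[test link](https://example.com)");

// Anything else (here a stand-in object) fails with a clearer message.
let message = null;
try {
  mdLinks({nodeName: "P"});
} catch (error) {
  message = error.message;
}

console.log(fromString, message);
```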

To fix, don’t use md in the str cell, like this:


Thanks Bryan - works like a charm!

As for the HTML / DOM approach, I still haven’t determined if the link text has an attribute name that can be used with getAttribute (although it doesn’t seem so).

Looks like even going with Fabian’s approach, I’d need to use something like regex to grab text from between the <a></a> for title?

You can get the link text (or rather link content, as links can contain other elements) via .textContent. I’ve set up a complete example here:

String.prototype.matchAll() only became part of the standard with ES2020, and support in older environments is spotty. A compatible implementation might look like this:

function getLinks(text) {
  const pattern = /\[(?<text>.+?)\]\((?<href>.+?)\)/g, matches = [];
  let m;
  while(m = pattern.exec(text)) matches.push(m.groups);
  return matches;
}

This version uses named capture groups, so that the returned array will look something like this:

[{text: "My link text", href: "http://example.com"}, ...]

Ok… nearly there:

I’ve managed to use regex to isolate the text between <a></a> using the HTML approach:

isolateLinkText = function(exampleLink) {
    return exampleLink.match(/<a [^>]+>([^<]+)<\/a>/)[1];
}

And this is great for a single link, but I haven’t yet managed to bring this all together so that both the title and the url are joined together as attributes of a single array object.

Ah! Looks like @mootari just posted an example! Thank you! … I’ll read through it now.

I appreciate all the help!

Don’t use regular expressions to parse HTML. Regexes can’t properly handle nesting and similar structures. There’s a famous Stack Overflow answer that drives the point home:

If you absolutely must parse an HTML string you might want to look into DOMParser. Edit, for completeness: There are of course other methods, like the good old-fashioned .innerHTML and templates. In general you should let the browser handle the parsing and then use the standard Element APIs to query elements and fetch attributes.
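A minimal sketch of that idea, pairing each link’s text with its href in one object. The function name linksFromDom and the sample HTML are assumptions for illustration. The DOMParser part only runs where a browser-style DOMParser exists, so it is guarded.

```javascript
// Collect {title, url} for every <a> below `root`.
// `root` can be anything that offers querySelectorAll, such as an Element
// or a Document.
function linksFromDom(root) {
  return Array.from(root.querySelectorAll("a"), a => ({
    title: a.textContent,        // link content, including nested elements' text
    url: a.getAttribute("href")  // the attribute exactly as written in the source
  }));
}

// Browser-only usage; skipped where DOMParser is unavailable (e.g. Node).
if (typeof DOMParser !== "undefined") {
  const html = '<p><a href="https://example.com">test <em>link</em></a></p>';
  const doc = new DOMParser().parseFromString(html, "text/html");
  console.log(linksFromDom(doc));
}
```

Using getAttribute("href") returns the value as authored, whereas the .href property would return a resolved absolute URL. Which one is preferable depends on the notes being processed.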


Brilliant! Thank you for the clear example. I’ve also managed to tie this all back together to re-create the links from the generated array following both your and @j-f1’s examples for linking strings (though I need to refresh my understanding of loops so that I can do this for more than one object at a time).

This is terrific. Now I can happily and quickly transform my cumbersome notes into tables without copying and pasting into a spreadsheet! I’m so excited.
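For anyone following along, here is a rough sketch of handling more than one object at a time without writing a loop by hand: map turns each {title, url} object back into a Markdown link, or into a row of a Markdown table. The links array is made-up sample data.

```javascript
// Hypothetical sample data in the shape produced by the functions above.
const links = [
  {title: "test link", url: "https://example.com"},
  {title: "another link", url: "https://example.org"}
];

// One Markdown link per object.
const asMarkdownLinks = links.map(({title, url}) => `[${title}](${url})`);

// A Markdown table: header, separator, then one row per object.
const table = [
  "| title | url |",
  "| --- | --- |",
  ...links.map(({title, url}) => `| ${title} | ${url} |`)
].join("\n");

console.log(asMarkdownLinks.join("\n"));
console.log(table);
```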

Big thanks to everyone!


For the sake of the conversation, here’s how I would do this with remark and a small utility project. The Remark ecosystem is awesome: it’s a Markdown parser and set of libraries that are great for transforming and querying Markdown – it can also convert to HTML like other libraries, but that’s only one of the tasks. I use it for docbox, documentation.js, and we used it a ton at Mapbox.
