Fixate values at time of publishing.

Hey all,
Is there a way to somehow fix values when the notebook is published?

I have a use case where I generate UUIDs inside the notebook. They should stay constant within a publication, but change whenever a new publish occurs.

The cleanest solution would probably be if cells could be marked as “static”, so that they don’t get recomputed on reload in a published notebook and just inherit the value they last had when the publishing occurred.

Does anybody have any other ideas on how to do this? Ideas and feedback most welcome :smiley:

Best wishes,
Jan-Paul

You could comment out the code in that cell and paste the static value you want instead:

Yeah, that’s how I currently do it, but that manual process seems error-prone :confused:
One could do something similar for multiple UUIDs by generating a huge blob of UUIDs at once, so at least only one manual interaction is required, but still.
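A minimal sketch of that "one blob" workaround, assuming a Node.js environment where `crypto.randomUUID` is available. The helper name `generateUuidBatch` is made up for illustration. It prints a JSON array that could be pasted into a cell as a static literal before publishing:

```javascript
const { randomUUID } = require("crypto");

// Hypothetical helper: generate `count` random (version 4) UUIDs in one go.
function generateUuidBatch(count) {
  return Array.from({ length: count }, () => randomUUID());
}

// Print the batch as JSON so it can be pasted into a notebook cell by hand.
const batch = generateUuidBatch(5);
console.log(JSON.stringify(batch, null, 2));
```

This still needs a manual paste per publication, so it only reduces the number of error-prone steps rather than removing them.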

I’ve also considered seeding the v4 random source from some property of the published notebook, but afaik there is no way to get the publication’s ID from within a cell, and this would also heavily degrade the quality of the randomness.


May I ask what the purpose of these UUIDs is? How/where/for what are they processed?

Sure :smiley:
They are used as entity identifiers in an ontology.

We’re doing NLP and robotics research, and I’m writing a middleware to communicate between the different components. The project this is happening in is heavily focused on symbolic and explainable AI, so there is a lot of knowledge graph and ontology engineering involved. Basically, all the components in the system communicate by extending a globally shared, append-only triple space, to which they can subscribe with queries.

We’ve decided to take a different path than the classical RDF and SemWeb route though, because frankly they are unsexy, over-engineered, and you encounter huge social barriers when trying to get people to use that stuff (understandably so).
We’ve therefore built our own knowledge graph system, tailored towards simplicity and minimalism, in the hope that this will make it easier to spread adoption within the project.

One of the fundamental components of our new system is using Observable for basically everything visual, because it allows researchers to not only quickly explore our reference implementation in a literate programming way, but it also makes it easy for them to contribute their own visualisations and enables pretty fast paced and reproducible science (at least that’s the long term goal :wink: ).

One of these uses for Observable is ontology engineering. Instead of having to boot up a tool like Protege (which ironically is also the only tool), researchers should be able to simply clone a Base Ontology Notebook and continue experimenting from there. Using a literate programming style for their ontology not only gives good documentation but also makes it easy to scientifically publish the resulting work.

One problem we’ve encountered with RDF and its URI-based naming scheme is that people interpret ontological concepts differently, and that sometimes the semantics of the concepts change a lot over time.
Both result in two parts of the system having a different understanding of the data, which is not ideal. It also results in incompatibilities once you have systems of different generations collaborating.

So what we’ve decided to do (because hey, UUIDs are cheap) is that whenever you publish a new version of your ontology, you don’t change the existing one, you simply append a new one with entirely new UUIDs.
The system will happily run with both ontologies at the same time, with the old components consuming and producing the old style data and the new stuff using the new concepts.

And this is why it would be great to be able to fixate values on publish. It would allow everyone visiting the notebook to get the up to date ontology, with compatible UUIDs, while enforcing a change of UUIDs whenever a change is made.

I believe that manually updating your UUIDs is the safest choice. However, you could also use seeded UUIDs that you derive from the notebook’s pinned slug.

To obtain a pinned slug for your notebook programmatically, you can use:

import {PINNED} from '691ae3b95f02db79@1157'

Check out the source notebook for some examples:


Ah thanks!

Yeah I considered deriving the IDs from the slug, but I didn’t see a way of obtaining it.
So thanks for doing all the hard work :smiley:

It’ll probably not provide enough entropy, but for a proof of concept it should be ok.

Maybe we’ll get “cached”/“locked”-cells at some point in the future.

It should, if you use the full notebook URI? (I mean, ultimately it probably depends on the library/algorithm.)

Edit: Apparently they’re called Namespace UUIDs, aka version 3 or version 5 UUIDs:

Well, that stays mostly the same between invocations right?
Also I’m mostly worrying about the pseudo-random source,
those are generally not as random as one would like, especially with “not super high entropy” seeds.

The PRNG behind crypto.getRandomValues gets seeded from /dev/random for that reason, so it’s potentially backed by a hardware random source.

Edit:
Yeah, I’d probably generate a v4 with the seed for its random source. v3 is MD5 ^^’’’, v5 is doable, but since we don’t care about recreating the same UUID from the same source (we use content-addressed stuff elsewhere, but with a 256-bit key, not 128 :D), we might as well squeeze all the randomness we can get out of the few bits we have ^^’
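One way to read "a v4 with the seed for its random source" is a hash-based counter stream. This is only a sketch under assumptions: the function name `seededUuidV4`, the `seed#index` input layout, and the choice of SHA-256 are all made up for illustration, and the result is merely v4-shaped, since anyone who knows the seed can reproduce it:

```javascript
const { createHash } = require("crypto");

// Hypothetical helper: derive the index-th v4-shaped UUID from a seed string.
// The "random" bits come from SHA-256(seed + "#" + index), not from a CSPRNG.
function seededUuidV4(seed, index) {
  const bytes = createHash("sha256")
    .update(`${seed}#${index}`, "utf8")
    .digest()
    .subarray(0, 16); // keep the first 128 bits of the digest
  bytes[6] = (bytes[6] & 0x0f) | 0x40; // set the version nibble to 4
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // set the RFC 4122 variant bits
  const hex = bytes.toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}

// Illustrative seed shaped like a pinned slug (id@version).
console.log(seededUuidV4("691ae3b95f02db79@1157", 0));
console.log(seededUuidV4("691ae3b95f02db79@1157", 1));
```

The output can never carry more entropy than the seed itself, which is exactly the concern raised above.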

Not if you include the version in the URL, which changes for every published revision.

There is no randomness in v3/v5 UUIDs, as they are MD5/SHA-1 based.

The point is that you can create UUIDs that are forever linked to a specific revision of your notebook and will update automatically for every new version.
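For readers who want to see the mechanics, here is a rough sketch of a version 5 (namespace) UUID in Node.js. It is not the linked notebook's code. The URL format, the `revisionNamespace` idea, and the entity name `"Person"` are assumptions for illustration; only the URL namespace constant and the SHA-1 construction come from the UUID specification:

```javascript
const { createHash } = require("crypto");

// Predefined namespace UUID for names that are URLs (from the UUID spec).
const NAMESPACE_URL = "6ba7b811-9dad-11d1-80b4-00c04fd430c8";

// Convert a canonical UUID string into its 16 raw bytes.
function uuidToBytes(uuid) {
  return Buffer.from(uuid.replace(/-/g, ""), "hex");
}

// Version 5 UUID: SHA-1 over the namespace bytes followed by the UTF-8 name.
function uuidV5(namespace, name) {
  const bytes = createHash("sha1")
    .update(uuidToBytes(namespace))
    .update(name, "utf8")
    .digest()
    .subarray(0, 16); // SHA-1 yields 20 bytes, a UUID only holds 16
  bytes[6] = (bytes[6] & 0x0f) | 0x50; // set the version nibble to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // set the RFC 4122 variant bits
  const hex = bytes.toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}

// Assumed shape of a pinned notebook URL; "@1157" is the revision part.
const pinnedUrl = "https://observablehq.com/d/691ae3b95f02db79@1157";

// One namespace per published revision, then one UUID per named entity.
const revisionNamespace = uuidV5(NAMESPACE_URL, pinnedUrl);
console.log(uuidV5(revisionNamespace, "Person"));
```

Because the revision is part of the hashed name, every entity ID changes when the revision changes, and stays the same for everyone loading that revision.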

I’ve created an example here:


Yeah yeah, I know what you mean, but v3 and v5 are essentially the same as v4 except that they use a different random source, namely the entropy in the provided key, which is then hashed.

So essentially you can use the same theoretical framework for both of them, which is to count the number of bits of input entropy.

I don’t know how Observable computes the slug, but by the looks of it it’s a 64-bit integer, so I’d guess it’s a primary key of a table somewhere.

When deciding to use that as a large part of your random source you’re essentially relying on trust towards a central authority as your “guarantee” that the provided keys won’t collide.

This is essentially the same scheme as a UUIDv1, which used timestamps and the MAC address as the “random source”. V1 is not commonly used anymore, one of the reasons being that it leaks the MAC address, but more importantly because MAC addresses, even though they’re meant to be globally unique by the authority of the IEEE, are commonly spoofed or implemented badly by cheap hardware.
Using a purely random ID therefore has less of a chance of colliding, because you use vastly more bits.
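To put rough numbers on "vastly more bits", here is a small sketch using the standard birthday approximation. It assumes IDs are drawn uniformly at random, which does not hold for a sequential primary key, so the 64-bit figure is only a comparison point. The function name `collisionProbability` is made up for illustration:

```javascript
// Approximate probability that at least two of `count` IDs collide when each
// is drawn uniformly at random from 2^bits possible values (birthday bound):
//   p ≈ 1 - exp(-count * (count - 1) / 2 / 2^bits)
function collisionProbability(count, bits) {
  const pairs = (count * (count - 1)) / 2;
  // -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
  return -Math.expm1(-pairs / Math.pow(2, bits));
}

// One billion random IDs with 64 random bits versus 122 random bits (v4).
console.log(collisionProbability(1e9, 64));
console.log(collisionProbability(1e9, 122));
```

Under these assumptions the 64-bit case lands in the low percent range, while the 122-bit case is many orders of magnitude smaller.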

So while I completely agree that this is probably a viable scheme, with more than enough entropy in the ids, it’s not as reliable as a v4, with more places where things can go wrong :smiley:

Edit:
This article has an interesting take on it.

Essentially, the probability of a RAM error turning the slug’s @3 into an @1 during computation is much higher than the probability of two 122-bit numbers colliding.
That doesn’t mean the hashing approach isn’t viable, of course, because that probability is still astronomically small, especially considering that one rarely adds new ontological data compared to the instance data itself :smiley:

So thanks again for providing a way to get at the slug! :smiley: