This is one of the major problems with two-digit years. If it’s possible to get the data with a better date format, I’d recommend it. But given this data, there is something we can do.
To make this problem reasonable, we can assume that all dates are in the past 100 years. So if we find a date in the future, we simply subtract 100 years from it, and it should get back into range.
The heart of that idea is this modification to your data loading cell:
top5000 = {
  let data = await FileAttachment("Top5000.csv").csv();
  let now = new Date();
  return data.map((d) => {
    let rel_date = parser(d.rel_date);
    if (rel_date > now) rel_date.setFullYear(rel_date.getFullYear() - 100);
    return { ...d, rel_date };
  });
}
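To see why the subtraction works, here’s a minimal, self-contained sketch (no d3 or FileAttachment; parse2 is a hypothetical stand-in for your parser that always maps a two-digit year into 2000–2099):

```javascript
// Hypothetical parser: "DD-MM-YY" with the year forced into 2000–2099.
const parse2 = (s) => {
  const [day, month, yy] = s.split("-").map(Number);
  return new Date(2000 + yy, month - 1, day);
};

const now = new Date();
let relDate = parse2("15-6-54"); // parsed as 2054-06-15, a future date
// A release date can't be in the future, so pull it back a century.
if (relDate > now) relDate.setFullYear(relDate.getFullYear() - 100);
// relDate is now 1954-06-15
```

Dates the parser already places in the past (say, "15-6-04" → 2004) are left untouched, since they’re not greater than now.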
You can see it integrated into a full notebook here:
Hi again, I have another problem which I thought I’d sorted, but I haven’t. I would love to get your help again.
In the data, each object has multiple genres (gen). I want to limit “gen” to one per artist in order to make the data easier to work with.
My solution: turn the “gen” property string into an array using forEach and the split() method. I have been successful for a small section of the data (gens: “Conscious Hip Hop”), but it hasn’t gone to plan when using this method for the whole data set.
I’d recommend editing the top5000 cell to do that work. You can modify the body of the map function there like this to take the first genre:
let rel_date = parser(d.rel_date);
if (rel_date > now) rel_date.setFullYear(rel_date.getFullYear() - 100);
let genre = d.gens.split(', ')[0];
return { ...d, rel_date, genre };
In general, I find that forEach is almost never what I want. It’s usually more convenient to use map if you want to produce a modified array, or for-of loops if you just want to iterate.
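To illustrate the difference, here’s a standalone sketch with made-up data (not from the notebook):

```javascript
const gens = ["Conscious Hip Hop, Boom Bap", "Rock, Hard Rock"];

// map: produces a new array of transformed values
const first = gens.map((g) => g.split(", ")[0]);
// first is ["Conscious Hip Hop", "Rock"]

// forEach: always returns undefined; it's only useful for side effects,
// so "collecting" results with it means mutating some outer variable
const result = gens.forEach((g) => g.split(", ")[0]);
// result is undefined — the split values are silently thrown away
```

If you just need to loop without building an array, `for (const g of gens) { … }` reads more clearly than forEach as well.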
It’s the same dataset. My issue is that there are too many values under the “gens” column. For example, the genre “Rock” has so many sub-genres.
How do I consolidate the data so there are clear genres without the subs? I have already started doing it manually, but is there a way to fast-track this process?
If not, I will continue, or find an easier data set, as I will have to go through 2744 columns.