Reading in multi-row header CSV

Hello, I’m wondering how to read in a multi-row header with d3.csvParse. The context is I’m analyzing some survey data, and checkbox responses come back in a second row in the .csv file. The csv format (simplified) looks something like so:

race, , , age, occupation
black, white, asian, response, response
“”,“white”,"", 20, “engineer”
“black”, “white”, “”, 25, “doctor”
… more rows

It seems like this would be a relatively common use case yet I can’t find any examples that incorporate a 2 row header with nested data.

Thanks in advance!

This seems like one of an endless possible set of niche formats. Maybe you can just parse it out explicitly, to match your own specification?

data = `race,,,age,occupation
black,white,asian,response,response
"","white","",20,"engineer"
"black","white","",25,"doctor"`
parsed_data = d3.csvParseRows(data).slice(2).map((row) => {
  var race = row.splice(0,3), [age, occupation] = row;
  return ({age: +age, occupation: occupation, race: race.filter(x=>x)})
})

Results:

[
  {
    "age": 20,
    "occupation": "engineer",
    "race": [
      "white"
    ]
  },
  {
    "age": 25,
    "occupation": "doctor",
    "race": [
      "black",
      "white"
    ]
  }
]

Of course, if you have 50 similar formats, you can write some slightly more generic and automatic parser.

6 Likes

Wow, I am always amazed by what @jrus and others can do with just a few lines of code!

I am constantly working with table data that has nested headers, and it’s always a pain. Earlier I floated an idea of trying to develop some sort of tool that could be trained to interpret nested headers, and to return unique ‘nested keys’, as your observations accords with my own experience, @davidjayfrancis - there really aren’t many published ways for doing this ‘automatically’.

What I was after was something like this (using your example):

race, , , age, occupation
black, white, asian, response, response
“”,“white”,"", 20, “engineer”
“black”, “white”, “”, 25, “doctor”

becomes:

race_black, race_white, race_asian, age, occupation
"", "white", "", "20", "engineer"

After a lot of research, I’ve concluded that this is a very, very difficult problem to solve. I have therefore taken to ripping out all my data headers and manually re-writing them, as it’s the fastest way for me to solve the problem. :frowning: Sorry that this may not be helpful! I just wanted to let you know you’re not alone in this pursuit!

… and i hope this is taken as friendly, and i don’t know your data source or personal views, but as I was typing out your example, I couldn’t help reacting to the ‘race’ categorization… as ‘race’ is a very contested construct. In case you’re interested:

https://www.americananthro.org/ConnectWithAAA/Content.aspx?ItemNumber=2583

2 Likes

We can break this down:

d3.csvParseRows(data) – just split the data into lines, and split each line by commas, and interpret quotation marks, without further processing

.slice(2) – throw away the first two header rows

.map((row) => { – make a new array where we pull out each row and do something with it

var race = row.splice(0,3), [age, occupation] = row; – yank the first three entries from the row array and call their new array ‘race’. assign the variables ‘age’ and ‘occupation’ to the last two entries.

return ({ – make a new object to return for each row

age: +age, occupation: occupation, – put age (as a number) and occupation (left as a string) into the object

race: race.filter(x=>x) – put the ‘race’ array in the object, throwing away empty entries

And re-assembled:

parsed_data = d3.csvParseRows(data).slice(2).map((row) => {
  var race = row.splice(0,3), [age, occupation] = row;
  return ({age: +age, occupation: occupation, race: race.filter(x=>x)})
})
4 Likes

Wow, thank you @jrus for this helpful breakdown !

1 Like

@jrus’ explanation formatted as inline comments, to give more context:

parsed_data =
  // just split the data into lines, and split each line by commas,
  // and interpret quotation marks, without further processing
  d3.csvParseRows(data)
    // throw away the first two header rows
    .slice(2)
    // make a new array where we pull out each row and do something with it
    .map((row) => {
      var
          // yank the first three entries from the row array and call their new array ‘race’.
          race = row.splice(0,3),
          // assign the variables ‘age’ and ‘occupation’ to the last two entries.
          [age, occupation] = row;

      // make a new object to return for each row
      return ({
        // put age (as a number) and occupation (left as a string) into the object
        age: +age,
        occupation: occupation,
        // put the ‘race’ array in the object, throwing away empty entries
        race: race.filter(x=>x)
      })
    })
5 Likes

Thanks! All the feedback has been super helpful. Re: ‘race categorization’, I actually agree. Personal (and apparently scientific concensus) is that race is a made up construct for folks that want to make sense of people that look different. That said, we went with Census classification, and added a bunch of regional identifiers for “ethnicity”.

2 Likes

This is excellent @jrus! Thank you, appreciate the explanation.