Reading in multi-row header CSV

davidjayfrancis · February 13, 2021, 8:54am

Hello, I’m wondering how to read in a multi-row header with d3.csvParse. The context is I’m analyzing some survey data, and checkbox responses come back in a second row in the .csv file. The csv format (simplified) looks something like so:

race, , , age, occupation
black, white, asian, response, response
“”,“white”,"", 20, “engineer”
“black”, “white”, “”, 25, “doctor”
… more rows

It seems like this would be a relatively common use case yet I can’t find any examples that incorporate a 2 row header with nested data.

Thanks in advance!

jrus · February 13, 2021, 11:10am

This seems like one of an endless possible set of niche formats. Maybe you can just parse it out explicitly, to match your own specification?

data = `race,,,age,occupation
black,white,asian,response,response
"","white","",20,"engineer"
"black","white","",25,"doctor"`

parsed_data = d3.csvParseRows(data).slice(2).map((row) => {
  var race = row.splice(0,3), [age, occupation] = row;
  return ({age: +age, occupation: occupation, race: race.filter(x=>x)})
})

Results:

[
  {
    "age": 20,
    "occupation": "engineer",
    "race": [
      "white"
    ]
  },
  {
    "age": 25,
    "occupation": "doctor",
    "race": [
      "black",
      "white"
    ]
  }
]

Of course, if you have 50 similar formats, you can write some slightly more generic and automatic parser.

aaronkyle · February 13, 2021, 1:43pm

Wow, I am always amazed by what @jrus and others can do with just a few lines of code!

I am constantly working with table data that has nested headers, and it’s always a pain. Earlier I floated an idea of trying to develop some sort of tool that could be trained to interpret nested headers, and to return unique ‘nested keys’, as your observations accords with my own experience, @davidjayfrancis - there really aren’t many published ways for doing this ‘automatically’.

What I was after was something like this (using your example):

race, , , age, occupation
black, white, asian, response, response
“”,“white”,"", 20, “engineer”
“black”, “white”, “”, 25, “doctor”

becomes:

race_black, race_white, race_asian, age, occupation
"", "white", "", "20", "engineer"

After a lot of research, I’ve concluded that this is a very, very difficult problem to solve. I have therefore taken to ripping out all my data headers and manually re-writing them, as it’s the fastest way for me to solve the problem. Sorry that this may not be helpful! I just wanted to let you know you’re not alone in this pursuit!

… and i hope this is taken as friendly, and i don’t know your data source or personal views, but as I was typing out your example, I couldn’t help reacting to the ‘race’ categorization… as ‘race’ is a very contested construct. In case you’re interested:

https://www.americananthro.org/ConnectWithAAA/Content.aspx?ItemNumber=2583

jrus · February 13, 2021, 3:46pm

We can break this down:

d3.csvParseRows(data) – just split the data into lines, and split each line by commas, and interpret quotation marks, without further processing

.slice(2) – throw away the first two header rows

.map((row) => { – make a new array where we pull out each row and do something with it

var race = row.splice(0,3), [age, occupation] = row; – yank the first three entries from the row array and call their new array ‘race’. assign the variables ‘age’ and ‘occupation’ to the last two entries.

return ({ – make a new object to return for each row

age: +age, occupation: occupation, – put age (as a number) and occupation (left as a string) into the object

race: race.filter(x=>x) – put the ‘race’ array in the object, throwing away empty entries

And re-assembled:

parsed_data = d3.csvParseRows(data).slice(2).map((row) => {
  var race = row.splice(0,3), [age, occupation] = row;
  return ({age: +age, occupation: occupation, race: race.filter(x=>x)})
})

aaronkyle · February 13, 2021, 6:03pm

Wow, thank you @jrus for this helpful breakdown !

mootari · February 13, 2021, 7:22pm

@jrus’ explanation formatted as inline comments, to give more context:

parsed_data =
  // just split the data into lines, and split each line by commas,
  // and interpret quotation marks, without further processing
  d3.csvParseRows(data)
    // throw away the first two header rows
    .slice(2)
    // make a new array where we pull out each row and do something with it
    .map((row) => {
      var
          // yank the first three entries from the row array and call their new array ‘race’.
          race = row.splice(0,3),
          // assign the variables ‘age’ and ‘occupation’ to the last two entries.
          [age, occupation] = row;

      // make a new object to return for each row
      return ({
        // put age (as a number) and occupation (left as a string) into the object
        age: +age,
        occupation: occupation,
        // put the ‘race’ array in the object, throwing away empty entries
        race: race.filter(x=>x)
      })
    })

davidjayfrancis · February 13, 2021, 11:56pm

Thanks! All the feedback has been super helpful. Re: ‘race categorization’, I actually agree. Personal (and apparently scientific concensus) is that race is a made up construct for folks that want to make sense of people that look different. That said, we went with Census classification, and added a bunch of regional identifiers for “ethnicity”.

davidjayfrancis · February 14, 2021, 12:00am

This is excellent @jrus! Thank you, appreciate the explanation.

Topic		Replies	Views
re-keying sliced data? Community Help	4	606	December 7, 2020
Create a new CSV file with row and column numbers from each cell Community Help	5	360	May 3, 2023
Read in csv file attachment with integers being wrapped in quotes Community Help	2	513	June 9, 2020
CSV supplying data to input \| extreme newbie: stumbling over basics Community Help	3	727	November 22, 2018
Read a CSV file composed of 2 tables / nrows / skiprows Community Help	3	167	November 15, 2023

Reading in multi-row header CSV

Related topics