Each of the five IGERTs works on its own research topic, so we need a data science method that can figure out what the topics of the five groups are and assign new cases to each group with some degree of certainty. That sounds a lot like a good case for (multinomial) logistic regression to me. To set this up, I've expanded the data with the citation categories for each paper (the same categories I used to calculate the Stirling Diversity Index), extended out through the 224 Web of Science subject categories.
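As a rough illustration of that setup, here's a minimal sketch in scikit-learn. It assumes the expanded data lives in a DataFrame with one row per paper, a 'group' label for the papers whose IGERT we already know, and one count column per Web of Science subject category; the file name and column names are hypothetical.

```python
# Minimal sketch: one row per paper, a 'group' column (A-E) for labeled papers,
# and one count column per Web of Science subject category.
# File name and the "wos_" column prefix are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

papers = pd.read_csv("igert_papers.csv")                      # hypothetical file
category_cols = [c for c in papers.columns if c.startswith("wos_")]

labeled = papers.dropna(subset=["group"])                     # papers with a known group
X = labeled[category_cols].values
y = labeled["group"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Multinomial logistic regression over the five groups
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```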
Our accuracy is 0.72, acceptably above the chance level of 0.2 for five groups. I tried changing the hyperparameters and using a support vector machine, but couldn't get higher accuracy.
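The kind of comparison I mean looks roughly like this, reusing the train/test split from the sketch above; the parameter grids are illustrative, not the exact values I tried.

```python
# Sketch of the comparison: grid-search the regularization strength for
# logistic regression, then try an SVM on the same features.
# Reuses X_train, X_test, y_train, y_test from the previous sketch.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

logit_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5)
logit_search.fit(X_train, y_train)
print("logistic regression test accuracy:", logit_search.score(X_test, y_test))

svm_search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5)
svm_search.fit(X_train, y_train)
print("SVM test accuracy:", svm_search.score(X_test, y_test))
```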
Let's inspect the confusion matrix to see where we're guessing wrong. In terms of raw numbers, papers from Group C dominate, and we have trouble distinguishing papers in Groups B and E from those in Group C. This makes sense, since I know from my prior investigation that all three of those groups work in nanotechnology, among other things. It is plausible that there is research on something fundamental, like the properties of carbon nanotubes, that all three groups contribute to.
             predicted A   predicted B   predicted C   predicted D   predicted E
actual A          55             7            16             8             3
actual B           7            99            58             1            12
actual C           4            16           359             4            21
actual D           6             4            20            92             6
actual E           3             8            54             0            79
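A labeled confusion matrix like the one above can be produced along these lines, again assuming the fitted classifier and test split from the earlier sketches.

```python
# Compute the confusion matrix (rows = actual group, columns = predicted group)
# and print it with readable labels. Assumes clf, X_test, y_test from above.
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["A", "B", "C", "D", "E"]
cm = confusion_matrix(y_test, clf.predict(X_test), labels=labels)
print(pd.DataFrame(cm,
                   index=[f"actual {g}" for g in labels],
                   columns=[f"predicted {g}" for g in labels]))
```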
Group   Count
A       8
B       0
C       280
D       13
E       6
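If these counts are the per-group tallies from applying the fitted model to the papers that don't already have a group label, that step would look something like the sketch below (using the hypothetical column names from before).

```python
# Assumption: tally the predicted group for the papers without a known label.
# Reuses papers, category_cols, and the fitted clf from the earlier sketches.
unlabeled = papers[papers["group"].isna()]
predicted = pd.Series(clf.predict(unlabeled[category_cols].values))
print(predicted.value_counts().sort_index())
```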
This is as good a place as any to end the hard data science work of the data science in public series. I'll have one more post wrapping it all up.