I think it’s important to know if this program is working. The 2018 NRT funding solicitation states: “The program is dedicated to effective training of STEM graduate students in high priority interdisciplinary research areas, through the use of a comprehensive traineeship model that is innovative, evidence-based, and aligned with changing workforce and research needs.” In an ideal world, we’d embed trained ethnographers in every one of these grants to track the development of students into scientists. That isn’t feasible, and we can’t go back to the earliest grants without a time machine. But one of the nice things about scientists is that they publish what they discover; those publications are indexed by databases like the Thomson Reuters ISI Web of Knowledge; and we can use the richly structured information in publications to say something about the grants that produced them.
In one sentence, can we use data science to show that the IGERT/NRT solicitation produced more interdisciplinary scholarship?
The flowchart below shows the source of the data and how it was transformed at each step: from all the IGERTs, to a sample of five, to a list of 202 scientists, to a corpus of over 3,400 publications, and finally to a dataframe of feature vectors amenable to data science techniques. An anonymized version of the final dataset (5-scientometrics101-cites.csv), along with some code, is available on my GitHub.
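If you want to follow along, that anonymized file loads straight into pandas. A minimal sketch (the file name comes from the repo; everything printed here is just a sanity check, since I’m not assuming anything about the column names):

```python
import pandas as pd

# Load the anonymized feature vectors from the GitHub repo.
df = pd.read_csv("5-scientometrics101-cites.csv")

# Sanity checks: one row per publication, one column per feature.
print(df.shape)
print(df.columns.tolist())
print(df.head())
```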
The tedious part is going from a list of grants, to a roster of scientists, to a corpus of publications. Then comes the fun part. Three resources do the heavy lifting: the metaknowledge package by Reid McIlroy-Young, John McLevey, and Jillian Anderson, which is an absolutely fantastic tool for bibliographic work; a directory mapping journal abbreviations to Web of Science subject categories, provided by Thomson Reuters customer service; and a similarity matrix between all Web of Science subject categories by Ismael Rafols, Alan Porter, and Loet Leydesdorff. With those, we can process Web of Science records into a table like this one:
Group   t-test p-value   increase in st. dev.
A       0.00833          +0.33257
B       0.00000          +0.72993
C       0.00879          +0.15155
D       0.00250          +0.31565
E       0.00030          +0.38804
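To give a flavor of that processing step, here’s a rough sketch of how Web of Science records get mapped to subject categories with metaknowledge. This is not my exact pipeline: the export file name is hypothetical, the lookup dictionary is a tiny stand-in for the Thomson Reuters directory, and I’m relying on metaknowledge’s usual Record tags (‘SO’ for source journal, ‘CR’ for cited references) and the Citation object’s .journal attribute.

```python
import metaknowledge as mk
import pandas as pd

# Stand-in for the Thomson Reuters directory: journal abbreviation ->
# Web of Science subject category. The real file covers every journal.
abbrev_to_category = {
    "J AM CHEM SOC": "Chemistry, Multidisciplinary",
    "PHYS REV LETT": "Physics, Multidisciplinary",
}

# Load a Web of Science export file (hypothetical name).
RC = mk.RecordCollection("wos_export.txt")

rows = []
for R in RC:
    # 'CR' holds the cited references as Citation objects.
    cited = R.get('CR') or []
    # Map each cited journal abbreviation to its subject category.
    cited_categories = {abbrev_to_category[c.journal]
                        for c in cited
                        if c.journal in abbrev_to_category}
    rows.append({
        "wos_id": R.id,
        "journal": R.get('SO'),
        "n_cited_categories": len(cited_categories),
    })

records_table = pd.DataFrame(rows)
print(records_table.head())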
All five groups show a statistically significant increase, so the IGERT/NRT program has met its primary stated policy objective, at least within this sample. Going forward, I think it'd be interesting to look at the program as a whole rather than just a sample of five grants. And we need to look in further detail at the groups that met these goals exceptionally well, in order to understand what they did and how those practices might translate to other research groups.
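For anyone reproducing the numbers in the table above, each row comes from a per-group comparison along these lines. Treat this as a sketch: the scores below are simulated, compare_group is my own stand-in name, and I’m reading “increase in st. dev.” as the mean shift expressed in baseline standard deviations, which is one plausible definition rather than a confirmed one.

```python
import numpy as np
from scipy import stats

def compare_group(before, after):
    """Two-sample t-test on a group's interdisciplinarity scores, plus
    the mean shift expressed in baseline standard deviations."""
    t_stat, p_value = stats.ttest_ind(after, before)
    increase_sd = (after.mean() - before.mean()) / before.std()
    return p_value, increase_sd

# Simulated scores standing in for one group's early- and late-period data.
rng = np.random.default_rng(0)
before = rng.normal(0.40, 0.10, size=40)
after = rng.normal(0.45, 0.10, size=40)

p, delta = compare_group(before, after)
print(f"p-value: {p:.5f}, increase in st.dev: {delta:+.5f}")
```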
Classifying Misfiled Papers by Machine Learning
I had some trouble coming up with a good machine learning task for this dataset. The major types of machine learning operations are regression, which predicts a continuous value from a set of inputs; classification, which assigns an input to one or more categories; and clustering, which finds patterns in data that has no prior structure. I toyed with building a regression model to predict which scientific papers would have the greatest impact, but discarded the idea: important ideas are important because of their content, not because of their citations.
A flaw in my dataset presented the solution. For some reason, 309 of the papers in my sample aren’t associated with any research group. All five of my IGERTs work in different areas, so maybe I can use patterns of citations to match my misfiled papers with the right research group. I ran three common classification algorithms, Logistic Regression, a Support Vector Machine, and a Decision Tree, and found that I could associate 70% of papers in the test split with the correct group. The Support Vector Machine had slightly higher accuracy, but performance was similar across the board. Looking at the confusion matrices for each algorithm, we can see that most of the errors involve sorting papers from Groups B and E into Group C. This makes sense: Group C is the most prolific group, and all three groups work on nanotechnology, so there is overlap in papers about basic topics like the properties of carbon nanotubes.
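Here’s a condensed sketch of that comparison with scikit-learn. The label column name is a placeholder (the anonymized CSV may name it differently), and the hyperparameters are defaults rather than whatever tuning went into my real runs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("5-scientometrics101-cites.csv")

# Hypothetical column name for the research-group label (A-E).
y = df["group"]
X = df.drop(columns=["group"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, round(accuracy_score(y_test, y_pred), 3))
    print(confusion_matrix(y_test, y_pred))
```

The exact accuracy will drift with the split and the hyperparameters; the useful part is the confusion matrices, whose rows and columns follow the sorted group labels, which makes errors like B and E bleeding into C easy to spot.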
Thank you to my instructor, Nathan Grossman, who asked questions that led to a significantly more interesting project; to my cohort in the Metis Live Online class, for listening to me explain versions of this work four times; and to Sebastian Raschka, for his great book Python Machine Learning.