Since I really want to avoid scraping and cleaning more data, I decided to go back to the original data and evaluate it paper by paper. Each paper is described by the following statistics.
- first_author (bool): whether the first author is a member of an IGERT grant
- group (categorical): one of five IGERTs: A, B, C, D, E
- group_start (int): the year the group was awarded the grant
- number_authors (int): how many authors are on the paper
- sdi (float): the Stirling Diversity Index for the paper
- times_cited (int): how many times the article has been cited
- journal (string): the journal that published the paper
- pub_year (int): year of publication
- velocity (float): number of citations / years since publication, a measure of impact
- grant_product (categorical): whether the paper was published before or after the grant was awarded
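The two derived columns, velocity and grant_product, can be sketched roughly as below. This is a minimal sketch, not the original notebook code: the function name `add_derived_columns`, the `current_year` parameter, the one-year floor on paper age (to avoid dividing by zero), and the rule that a paper published in the award year counts as "after" are all assumptions. The toy rows are made up.

```python
import pandas as pd


def add_derived_columns(papers: pd.DataFrame, current_year: int) -> pd.DataFrame:
    """Return a copy of `papers` with velocity and grant_product added."""
    out = papers.copy()
    # Assumption: papers published in `current_year` are treated as one year old.
    years_since_pub = (current_year - out["pub_year"]).clip(lower=1)
    out["velocity"] = out["times_cited"] / years_since_pub
    # Assumption: publication in the award year itself counts as "after".
    is_after = out["pub_year"] >= out["group_start"]
    out["grant_product"] = is_after.map({True: "after", False: "before"})
    return out


# Made-up rows, only to show the shape of the calculation.
toy = pd.DataFrame(
    {"pub_year": [2005, 2010], "group_start": [2008, 2008], "times_cited": [20, 15]}
)
result = add_derived_columns(toy, current_year=2015)
```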
We import the data and run some initial descriptive statistics.
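Something along these lines would do the import and the first pass of descriptive stats. It is only a sketch: the real file name is not given here, so an in-memory CSV with made-up rows stands in for it, and the per-group SDI summary is one plausible choice of "initial" statistic rather than the exact one used.

```python
import io

import pandas as pd

# Made-up rows standing in for the real data file. With the real data this
# would be something like pd.read_csv("papers.csv") (hypothetical file name).
SAMPLE_CSV = io.StringIO(
    "first_author,group,group_start,number_authors,sdi,times_cited,journal,pub_year\n"
    "True,A,2008,3,0.42,12,Journal X,2010\n"
    "False,A,2008,9,0.55,40,Journal Y,2006\n"
    "True,C,2005,4,0.38,7,Journal X,2012\n"
    "False,C,2005,11,0.61,95,Journal Z,2009\n"
)
papers = pd.read_csv(SAMPLE_CSV)

# Count, mean, std, min, quartiles and max for every numeric column.
summary = papers.describe()

# The same idea per IGERT group, restricted to the diversity index.
sdi_by_group = papers.groupby("group")["sdi"].agg(["count", "mean", "std"])
```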
A correlation matrix is another standard technique for looking at patterns in data. We see no particularly strong correlations, with the exception of times_cited and velocity, which makes sense since velocity is derived from times_cited. The number of authors is also negatively correlated with the first author being a member of the research group, which makes sense: only one scientist can be first author, and on large papers that scientist is less likely to be a group member.
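A correlation matrix like the one described can be computed with pandas. The sketch below uses made-up rows, so the numbers it produces are not the real results; it only shows the mechanics, including casting the boolean first_author column to a number so it takes part in the calculation.

```python
import pandas as pd

# Made-up rows, only to demonstrate the call.
papers = pd.DataFrame(
    {
        "first_author": [True, False, True, False, False],
        "number_authors": [2, 8, 3, 12, 6],
        "sdi": [0.31, 0.55, 0.40, 0.62, 0.48],
        "times_cited": [10, 40, 5, 90, 22],
        "velocity": [2.0, 8.0, 1.0, 15.0, 5.5],
    }
)

# Cast everything to float so the boolean column is included;
# DataFrame.corr computes pairwise Pearson correlations by default.
corr = papers.astype(float).corr()
```

Reading off `corr.loc["times_cited", "velocity"]` or `corr.loc["number_authors", "first_author"]` gives the individual coefficients discussed above.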
T-test : p-value
Group A: 0.008334
Group B: 0.000000
Group C: 0.008278
Group D: 0.002501
Group E: 0.000432
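For reference, per-group p-values of this kind could be produced roughly as follows. This is a sketch under stated assumptions, not the code that generated the numbers above: it assumes the test compares the SDI of papers published before the grant with those published after it, within each group, and it uses Welch's unequal-variance t-test from SciPy. The helper name `sdi_ttest_by_group` and the toy rows are made up.

```python
import pandas as pd
from scipy import stats


def sdi_ttest_by_group(papers: pd.DataFrame) -> pd.Series:
    """Welch's t-test on SDI, before vs. after the grant, for each group."""
    p_values = {}
    for group, rows in papers.groupby("group"):
        before = rows.loc[rows["grant_product"] == "before", "sdi"]
        after = rows.loc[rows["grant_product"] == "after", "sdi"]
        # Assumption: unequal variances (Welch), two-sided alternative.
        _, p_value = stats.ttest_ind(after, before, equal_var=False)
        p_values[group] = p_value
    return pd.Series(p_values, name="p_value")


# Made-up rows: group A shifts clearly after the grant, group B does not.
toy = pd.DataFrame(
    {
        "group": ["A"] * 8 + ["B"] * 8,
        "grant_product": (["before"] * 4 + ["after"] * 4) * 2,
        "sdi": [0.30, 0.32, 0.31, 0.29, 0.50, 0.52, 0.49, 0.51,
                0.40, 0.45, 0.35, 0.42, 0.41, 0.44, 0.36, 0.43],
    }
)
p_by_group = sdi_ttest_by_group(toy)
```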
In short, the IGERT grants accomplished their stated science policy goal of increasing interdisciplinarity in science, and the statistics back it up: every group's p-value is below 0.01!
Finally, let's make a little bokeh graph to explore the relationship between SDI and velocity. One thing becomes immediately clear: Group C (in red) dominates, both in terms of total publications and in terms of high-velocity publications. No other group has a publication with a velocity higher than 50. (edit: can't get the interactive version to upload, so here's a static image instead)
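A plot like this can be put together with a few lines of bokeh. The sketch below is a guess at the general shape rather than the original plotting code: only Group C being red comes from the text, the other colors, the helper name `sdi_velocity_plot`, and the toy rows are assumptions.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Only "C is red" comes from the post; the other colors are arbitrary.
GROUP_COLORS = {"A": "blue", "B": "green", "C": "red", "D": "orange", "E": "purple"}


def sdi_velocity_plot(papers: pd.DataFrame):
    """Scatter SDI against velocity, one colored series per IGERT group."""
    plot = figure(
        title="SDI vs. velocity by IGERT group",
        x_axis_label="SDI",
        y_axis_label="velocity (citations per year)",
    )
    for group, rows in papers.groupby("group"):
        plot.scatter(
            "sdi",
            "velocity",
            source=ColumnDataSource(rows),
            color=GROUP_COLORS[group],
            alpha=0.6,
            size=8,
            legend_label=f"Group {group}",
        )
    # Clicking a legend entry toggles that group on and off.
    plot.legend.click_policy = "hide"
    return plot


# Made-up rows, only to show the call.
toy = pd.DataFrame(
    {
        "group": ["A", "A", "C", "C"],
        "sdi": [0.42, 0.55, 0.38, 0.61],
        "velocity": [2.4, 4.4, 2.3, 15.8],
    }
)
plot = sdi_velocity_plot(toy)
```

Passing `plot` to `bokeh.io.show` opens the interactive version in a browser, and `bokeh.io.save` writes it to an HTML file.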