I have a strong background from my previous work evaluating interdisciplinary groups through scientometrics and network analytics, and it seems reasonable to extend that work. I've already rewritten the code from the previous post to make it more maintainable, but what is a reasonable next step?

I considered a few options. Redoing the visualization step with Bokeh instead of Kumu.io would give me finer control over the visualizations, and might get me past some technical limits I'm hitting on Kumu. I could also model how these coauthorship networks grow: they resemble scale-free networks created by a preferential attachment model, but they are constructed differently. Specifically, each new paper in a corpus is a clique of size K, where K is drawn from a distribution that is the sum of a Poisson distribution and an exponential distribution, and each author has some probability P of already being a member of the network. And finally, I'd like to develop better analytics for describing these coauthorship networks, because basic network measures like density and clustering coefficient don't seem to capture any useful quality of these groups.
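To make that growth model concrete, here is a minimal sketch of the paper-by-paper construction described above. The parameter values (`lam`, `scale`, `p_existing`, `n_papers`) are all assumptions for illustration, not values from my data:

```python
import numpy as np
import networkx as nx


def grow_coauthorship_network(n_papers=200, lam=2.0, scale=1.0,
                              p_existing=0.5, seed=42):
    """Grow a coauthorship network one paper at a time.

    Each paper is a clique of K authors, with K drawn as a Poisson
    draw plus an exponential draw (rounded down, floored at 1).
    Each author slot is filled by an existing member of the network
    with probability p_existing, otherwise by a brand-new author.
    """
    rng = np.random.default_rng(seed)
    G = nx.Graph()
    next_author = 0
    for _ in range(n_papers):
        k = max(1, int(rng.poisson(lam) + rng.exponential(scale)))
        authors = []
        for _ in range(k):
            if G.number_of_nodes() > 0 and rng.random() < p_existing:
                # reuse an existing author from anywhere in the network
                authors.append(rng.choice(list(G.nodes)))
            else:
                # introduce a new author
                authors.append(next_author)
                next_author += 1
        G.add_nodes_from(authors)
        # every pair of coauthors on this paper gets an edge: a clique
        for i, a in enumerate(authors):
            for b in authors[i + 1:]:
                if a != b:
                    G.add_edge(a, b)
    return G


G = grow_coauthorship_network()
print(G.number_of_nodes(), G.number_of_edges())
```

Note the contrast with preferential attachment: here an existing author is chosen uniformly at random rather than in proportion to their degree, which is one of the ways this construction differs from the classic scale-free model.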

These are all worthy projects, but I'm not going to do any of them, because of a mismatch between what they demand and the skills I have as a beginning data scientist. Data science, at least at the 101 level, has a specific ontology. The world is made out of

*observations*, which are described by a

*feature vector*. The feature vector can be manipulated by your choice of algorithms to produce, in supervised learning, classifications and regressions that allow new observations to be predicted, or, in unsupervised learning, to find patterns in the data as a whole. And as I look at the social network data, I can't see a good way to generate feature vectors. There are techniques for evaluating social networks in data science, but I'm moving in a week, and I don't have time to pack, unpack, and learn a bunch of new math in two weeks.
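The 101 ontology above fits in a few lines of code. This is a toy sketch of supervised classification, where each observation is a feature vector and a simple nearest-centroid rule predicts the label of a new observation; the scholar "features" here (papers per year, mean coauthor count) are invented for illustration, not features from my data:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def nearest_centroid_classify(train, labels, x):
    """Predict the label of observation x as that of the nearest class centroid."""
    cents = {c: centroid([v for v, lab in zip(train, labels) if lab == c])
             for c in set(labels)}

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(cents, key=lambda c: dist2(cents[c], x))


# Observations: each row is a feature vector (papers_per_year, mean_coauthors)
train = [(1.0, 2.0), (1.2, 2.2), (4.0, 6.0), (4.2, 5.8)]
labels = ["mono", "mono", "inter", "inter"]

print(nearest_centroid_classify(train, labels, (4.1, 6.1)))  # → inter
```

The point of the sketch is the shape of the problem, not the algorithm: once observations have feature vectors, plugging in a real classifier is routine. My trouble with the network data is that this first step, turning a network into feature vectors, is exactly the part I don't know how to do.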

So we go with Plan B. I have complete bibliographic records for a few hundred interdisciplinary scholars, identified by their participation in one of my interdisciplinary research groups. I'm going to see if I can distinguish interdisciplinary scholars from (assumed) non-interdisciplinary scholars. As a first step, I selected a random subsample of 20 interdisciplinary scholars from my previous data, and paired each with a scholar of the same rank (Assistant, Associate, or Full Professor) in the same department at the same university, for a total of 40 professors. Then I scraped all their publications from Web of Science.
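The matching step above can be sketched as a small case-control pairing routine. The records, field layout, and names here are hypothetical placeholders, not my actual data:

```python
import random

# Hypothetical records: (name, rank, department, university)
interdisciplinary = [
    ("A. Smith", "Associate", "Biology", "State U"),
    ("B. Jones", "Full", "History", "Tech U"),
]
candidate_pool = [
    ("C. Lee", "Associate", "Biology", "State U"),
    ("D. Kim", "Full", "History", "Tech U"),
    ("E. Park", "Assistant", "Physics", "State U"),
]


def match_controls(cases, pool, seed=0):
    """Pair each case with a control of the same rank, department, and university."""
    rng = random.Random(seed)
    pairs = []
    available = list(pool)
    for name, rank, dept, univ in cases:
        matches = [c for c in available if c[1:] == (rank, dept, univ)]
        if matches:
            control = rng.choice(matches)
            available.remove(control)  # sample without replacement
            pairs.append(((name, rank, dept, univ), control))
    return pairs


pairs = match_controls(interdisciplinary, candidate_pool)
```

Matching on rank and department is a rough control for career stage and field; it obviously can't rule out every confounder, but it keeps the comparison between the two groups honest at this scale.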

Next up, cleaning the data!

And before I go, this final project has several goals:

1) Use machine learning to develop an understanding of the differences in scientific publication patterns between interdisciplinary and monodisciplinary scholars.

2) Document my thought process via blog posts using the DSiP tag.

3) Make the code publicly available at https://github.com/mburnamfink/scientometrics_101_project as the first step in building a data science portfolio.

Comments welcome, especially on goal 3, since this is a new form of publishing for me.