Skip to content

Latest commit

 

History

History
125 lines (89 loc) · 13.4 KB

File metadata and controls

125 lines (89 loc) · 13.4 KB

Supplementary Materials

Context

word2vec is a fairly ‘simple’ embedding approach and has now been largely supplanted at large commercial firms (such as Google and Facebook) by algorithms with names like BERT, SBERT and RoBERTa. These further enhance our ability to work with corpora by disentangling—that is, providing distinct embeddings for—‘bank’ (a place that provides financial services) and ‘bank’ (the land at the side of a river). The approaches employed in this tutorial do not provide that level of disambiguation.

If you are interested primarily in document similarity then it is possible to improve on this approach using a dedicated Document Embedding algorithm instead. That said, we’ve often found these results to be less intuitive that the ones derived from word embeddings.

A Note on Replicability

If you are concerned with full replicability of your results, then please also note that you need to change the number of workers from 4 to 1 when running the word2vec algorithm. More than one worker means the process is running in parallel and you cannot guarantee that documents/words will be passed to the modelling process in the same order every time. When that happens the derived embeddings may differ. So if you use multiple workers then the results should be consistent, but not exactly the same, from run to run. Of course, running with only one worker will also increase the model run time substantially!

Completeness

Table 1 shows the percentage of records for each field in the sample that contain data. While we have not verified that these all contain useable data it is nonetheless obvious that some, like Abstract and Institution (and, of course, Author and Title) are effectively complete, while others, such as Supervisor, Funder or DOI, are poorly populated at best. Across the full EThOS data set the DDC is nowhere near as well-populated as in our selected sample (the same holds for some of the other fields), but that’s because we purposively chose records that would enable us to validate our approach against the expert-assigned label.

Table 1. Completeness of Selected EThOS Metadata Attributes by Decade

Attribute 1980s 1990s 2000s 2010s Overall
Author 100 100 100 100 100
Title 100 100 100 100 100
Abstract 100 100 100 100 100
Keywords 87 69 37 49 52
DDC 100 100 100 100 100
Institution 100 100 100 100 100
Department 46 33 31 53 45
Supervisor 12 9 16 49 33
Subject Discipline 100 100 100 100 100
Language 100 100 100 100 100
Funder 9 7 7 23 16
DOI 4 4 4 11 8
Count by Decade 3,583 6,931 11,249 26,980 48,743

Additional Examples of Embeddings

In Table 5 are the top-10 most similar words for an array of other terms, demonstrating the extent to which they allow us to identify ‘relatedness’ across a range of disciplines based on the context in which terms are used. The first three terms in Table 5 stress that this is not about the computer developing some underlying understanding of ‘salmon are like rainbow trout’ and ‘Einstein developed the theory of relativity’ but a context-based substitutability based on the window size and weighting that we specified when developing the word embeddings.

Table 5. Selected Terms and their Top-10 Most Similar

Term Top 10 Similar
einstein field_equation, gravity, scalar_field, equation, relativity, gauge_theory, string_theory, quantum_field_theory, non_abelian, minkowski
colorectal_cancer cancer, breast_cancer, prostate_cancer, ovarian_cancer, type_diabetes, leukaemia, leukaemic, human_cancer, malignant, brca1
atlantic_salmon salmo, fish, rainbow_trout, salar, salmonid, salmon, oncorhynchus, brown_trout, mykiss, freshwater
new_keynesian open_economy, dsge_model, optimal_monetary, dsge, indirect_inference, partial_equilibrium, return_scale, small_open_economy, financial_friction, cge
land_use_change land_use, change_climate, environmental_change, vegetation_change, biodiversity, habitat_fragmentation, rainfall, agricultural_intensification, cl...
semi_structured semi_structured_interview, interview, participant_observation, in_depth_interview, interview_conduct, interview_focus, focus_group, focus_group_di...
influenza_virus virus, viruses, influenza, viral, viral_rna, adenovirus, rna, herpesvirus, norovirus, rna_virus
north_east_england group_young, mixed_race, east_midlands, town, experience, britain, old, england, north_of_england, birmingham
built_environment build_environment, quality_life, city, urban_form, informal_settlement, urban, social_sustainability, physical_activity, energy_supply, sustainabl...
information_communication_technology ict, icts, communication_technology, information_technology, new_medium, internet, telecommunication, digital_technology, knowledge_economy, techn...
urban_regeneration regeneration, urban_development, city, initiative, planning_policy, urban, cultural_policy, urban_design, urban_policy, public_policy
gravitational_wave_detector interferometer, gravitational_wave, astronomical, detector, device, laser, semiconductor_laser, collimation, lasers, ultra_low
cultural_heritage heritage, cultural, landscape, contemporary, national_identity, buddhist, tourism, community, modernisation, intangible
cultural_capital bourdieuas, bourdieu, cultural, literary, symbolic, social_capital, solidarity, elite, assert, habitus

Manifold Learning

t-SNE (t-distributed Stochastic Neighbour Embedding) is another commonly-used manifold learning technique; however, it is designed to emphasise visibility (local structure) and its parameters are less conducive to preserving global structure in an intuitive way. We’d suggest using UMAP in preference to t-SNE for most applications where both levels of structure are needed to support clustering.

Measuring Success

One of the standard approaches in Machine Learning to quantifying the performance of a classifier model is the Confusion Matrix in which:

Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class

So if the embdding+UMAP+cluster approach works well, the the predicted class should be largely the same as the actual class. If we were to lay this out in a table with actual DDC in the row labels and predicted cluster in the column lables then we should have entries mainly on the diagonal where the row and column labels are the same.

Confusion matrix (2 Clusters)

But we can also investigate this result in a more nuanced way using something called the Confusion Matrix and Classifiction Report. Recall that the DDC plot in Figure 3 shows some Social Science theses clearly mapped on to the Science-like space. Here we make use of the derived cluster label to compare the DDC label to the cluster-derived one!

# Classification report gives a (statistical) sense of power (TP/TN/FP/FN)
print(classification_report(clustered_df[f'ddc{ddc_level}'], clustered_df[f'Cluster_Name_{num_clusters}']))

# A confusion matrix is basically a cross-tab without totals,
# which I think are nice to add
pd.crosstab(columns=clustered_df[f'Cluster_Name_{num_clusters}'],
            index=clustered_df[f'ddc{ddc_level}'],
            margins=True, margins_name='Total')

At the top level, the expert-assigned DDC and automated cluster values line up extraordinarily well:

Table 7. Confusion matrix for top-level DDC classes and clusters

Science Cluster Social Sciences Cluster Total
Science DDC 26,591 479 27,070
Social Sciences DDC 676 20,948 21,624
Total 27,267 21,427 48,694

In other words, just 1.8% of the records classified as ‘Science’ were misclassified as Social Sciences in our automated analysis (479/27,070), and 3.1% of theses classified by librarians as being from the Social Sciences were assigned to the Science cluster (676/21,624).

Classification report (2 Clusters)

The confusion matrix can then be used as the basis for calculating precision and recall values. Precision is T_{P} / (T_{P}+F_{P}), where T_{P} is the number of correctly-predicted observations (true positives), and F_{P} is the number of incorrectly-predicted observations (false positives) in that class. Recall measures something slightly different: T_{P}/(T_{P}+F_{N}) where f_{N} is the number of observations falsely assigned to other classs (false negatives). For the 2-cluster formulation above this yields a precision and recall (averaged over the two classes) of 0.98. Accuracy is calculated as (T_{P} + T_{N})/(T_{P} + T_{N} + F_{P} + F_{N}) and is also 0.98.

precision recall f1-score support
Science 0.98 0.98 0.98 27070
Social sciences 0.98 0.97 0.97 21624
accuracy 0.98 48694
macro avg 0.98 0.98 0.98 48694
weighted avg 0.98 0.98 0.98 48694

In short: using nothing more than a short abstract and title for a PhD thesis we’ve been able to correctly classify them into Social and Physical sciences with 98% accuracy!

Confusion Matrix (4 Clusters)

In the confusion matrix we are again looking for values that are ‘off’ the diagonal as an indicator of poor or declining clustering performance:

Table 8. Confusion matrix for 2nd level DDC classes and clusters

Expert Class Biology Cluster Economics Cluster Physics Cluster Social sciences Cluster Total
Biology DDC 17,498 214 514 178 18,404
Economics DDC 417 11,063 79 1,050 12,609
Social sciences DDC 230 45 8,349 42 8,666
Physics DDC 165 1,880 15 6,955 9,015
Total 18,310 13,202 8,957 8,225 48,694

Clearly, clustering is still performing well: although accuracy has fallen to 0.90 with average precision and recall of 0.89, Biology and Phyics are more readily distinguished with precision of 0.96 and 0.95, and recall of 0.93 and 0.96, respectively. This squares nicely with the intuition developed from looking at the UMAP embedding in Figure 3 above where we saw much greater overlap between the selected social science DDCs than the selected science DDCs. This effect neatly encapsulates one of the advantages to this approach: the visualisation, clustering, and validation results all reinforce one another, giving us confidence that what we’re seeing isn’t simply an artefact of the data or sheer good luck.