Comparing automated document classification using word embeddings to expert-assigned Dewey Decimal classifications

About EThOS

Content loaded from EThOS URL: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.594308

Completeness

Percentage Complete by Decade

Field    Overall    1970s     1980s     1990s     2000s      2010s
Abs.          50       19        21        29        42         88
Key.          50       12        89        84        32         47
DDCs          91       93        98        98        97         83
Dept.         29       17        16        13        21         53
Count    526,276   40,318    59,825    88,598   136,172    176,171

DDC as ‘Expert Label’

DDC1 Group (‘Class’)         DDC2 Group (‘Division’)        Count
Science (500–599)                                           27,095
                             Biology (570–579)              18,418
                             Physics (530–539)               8,677
Social Sciences (300–399)                                   21,648
                             Economics (330–339)            12,625
                             Social Sciences (300–309)       9,023
Total                                                       48,743

From Text to Data

Cleaning

  • Removal of Punctuation / Symbols / HTML
  • Named Entity Recognition (including acronyms)
  • Lemmatisation
  • Removal of Stop Words
  • Phrase Detection
  • Reduction of Vocabulary
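
A few of the cleaning steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not the pipeline used for the EThOS corpus: the stop-word list, the function names `clean` and `join_phrases`, and the `min_count` threshold are all assumptions, and lemmatisation and named entity recognition are left out because they need a dedicated NLP library.

```python
import html
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "this", "between", "to", "is"}

def clean(text):
    """Strip HTML tags, punctuation and stop words; return lower-case tokens."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def join_phrases(docs, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times across docs with '_'."""
    pair_counts = Counter(p for d in docs for p in zip(d, d[1:]))
    joined_docs = []
    for d in docs:
        out, i = [], 0
        while i < len(d):
            if i + 1 < len(d) and pair_counts[(d[i], d[i + 1])] >= min_count:
                out.append(d[i] + "_" + d[i + 1])  # frequent pair becomes one phrase token
                i += 2
            else:
                out.append(d[i])
                i += 1
        joined_docs.append(out)
    return joined_docs

docs = [clean(t) for t in [
    "<p>The economic effects of natural resources.</p>",
    "Natural resources and economic development in Malaysia.",
]]
phrased = join_phrases(docs)
print(phrased)
```

Counting raw pair frequency is the simplest possible phrase detector; a statistical score that discounts pairs made of individually common words would be a more robust choice on a large corpus.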

Learning

  • Learning of Context
  • Weighting of Relationships
  • ‘Training’ of Word Embeddings
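
The slides do not name the embedding algorithm, so the following is only a sketch of the first two bullets: collecting each term's context within a window, and weighting nearer neighbours more heavily. The inverse-distance weighting, the window size, and the name `weighted_cooccurrence` are assumptions made for illustration; an embedding model would then be trained on statistics of this kind.

```python
from collections import defaultdict

def weighted_cooccurrence(docs, window=2):
    """Sum 1/distance for every (target, context) pair within `window` tokens."""
    weights = defaultdict(float)
    for tokens in docs:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Adjacent terms contribute 1.0, terms two apart contribute 0.5.
                    weights[(target, tokens[j])] += 1.0 / abs(i - j)
    return weights

docs = [
    ["natural_resource", "economic_development", "develop_country"],
    ["economic_development", "natural_resource", "policy"],
]
w = weighted_cooccurrence(docs)
for pair, weight in sorted(w.items()):
    print(pair, weight)
```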

Analysis

  • Dimensionality Reduction
  • Hierarchical Clustering
  • Validation of Results
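
To make the clustering step concrete, here is a toy agglomerative (bottom-up hierarchical) clustering in plain Python. It is a sketch under stated assumptions: the slides do not say which linkage was used, so average linkage is an arbitrary choice here, the five 2-D points are made up and stand in for document vectors after dimensionality reduction, and the cubic-time loop is only suitable for tiny inputs.

```python
import math

def agglomerative(points, k):
    """Merge the two closest clusters (average linkage) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def avg_dist(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        pairs = [
            (avg_dist(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        _, x, y = min(pairs)  # closest pair of clusters
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Made-up 2-D coordinates: three points near the origin, two near (5, 5).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9)]
result = agglomerative(points, 2)
print(result)
```

Stopping at a chosen `k` corresponds to cutting the hierarchy at two or four clusters, which is how the results can then be compared with the DDC class and division labels.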

Cleaning

Source text: The economic effects of resource extraction in developing countries. This thesis presents three core chapters examining different aspects of the relationship between natural resources and economic development…
Cleaned text: economic effect resource extraction develop_country present_three core examine different_aspect relationship natural_resource economic_developmen…

Source text: Making sense of environmental governance : a study of e-waste in Malaysia. The nature of e-waste, which is environmentally disastrous but economically precious, calls for close policy attention at all levels of society, and between state and non-state actors…
Cleaned text: making_sense environmental_governance study waste malaysia nature waste environmentally disastrous economically precious call close policy attention level society state_non_state actor…

Source text: An exploratory study of the constructions of ‘mental health’ in the Afro Caribbean community in the United Kingdom. Afro Caribbean people living in the United Kingdom have historically been overrepresented in the ‘mental health’ system…
Cleaned text: exploratory_study construction mental_health caribbean community united_kingdom caribbean people_live united_kingdom historically mental_health system…

Learning

Term Dim 1 Dim 2 Dim 3 Top 7 Most Similar
accelerator -2.597380 0.562458 3.047121 beam, cern, facility, spectrometer, beam_energy
london_stock_exchange 0.516811 -0.935569 1.090004 ipo, ftse, stock_market, announcement, lse
national_health_service 1.782367 -2.309419 -2.430357 nhs, public_sector, emergency, public_health, developed_world
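
‘Most similar’ lists like those above are typically produced by ranking terms on the cosine similarity of their vectors. The sketch below assumes that measure and uses only the three dimensions printed in the table, so its scores are illustrative and will not match similarities computed on the full embeddings; `cosine` and `most_similar` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# First three dimensions only, copied from the table above.
vectors = {
    "accelerator": (-2.597380, 0.562458, 3.047121),
    "london_stock_exchange": (0.516811, -0.935569, 1.090004),
    "national_health_service": (1.782367, -2.309419, -2.430357),
}

def most_similar(term, top_n=2):
    """Rank every other term by cosine similarity to `term`, highest first."""
    scores = [(other, cosine(vectors[term], vec))
              for other, vec in vectors.items() if other != term]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_n]

for name, score in most_similar("accelerator"):
    print(f"{name}: {score:.3f}")
```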

How Did We Do? Visualising the Corpus

Visualisation of corpus after UMAP dimensionality reduction of Word Embeddings

How Did We Do? Clustering Accuracy

Table 1: Clustering Results

(a) DDC Class (DDC1) 2 Cluster Results

Cluster            Science    Social sciences
Science             26,591                479
Social sciences        676             20,948

(b) DDC Division (DDC2) 4 Cluster Results

Cluster            Biology    Economics    Physics    Social sciences
Biology             17,498          214        514                178
Economics              417       11,063         79              1,050
Physics                230           45      8,349                 42
Social sciences        165        1,880         15              6,955
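
One way to summarise Table 1 in a single figure is the share of dissertations that fall on the diagonal, where the cluster and the DDC label agree. The sketch below copies the counts from the table and assumes that each cluster has been matched to the DDC group of the same name, as the layout suggests; `diagonal_share` is a hypothetical helper name.

```python
# Counts copied from Table 1; rows and columns follow the order printed there.
ddc1 = [
    [26591, 479],
    [676, 20948],
]
ddc2 = [
    [17498, 214, 514, 178],
    [417, 11063, 79, 1050],
    [230, 45, 8349, 42],
    [165, 1880, 15, 6955],
]

def diagonal_share(table):
    """Fraction of all counts lying on the diagonal (cluster agrees with DDC label)."""
    total = sum(sum(row) for row in table)
    agree = sum(table[i][i] for i in range(len(table)))
    return agree / total

print(f"DDC1 agreement: {diagonal_share(ddc1):.1%}")
print(f"DDC2 agreement: {diagonal_share(ddc2):.1%}")
```

On these counts the agreement is roughly 97.6% at the class level and 90.1% at the division level, with most of the division-level disagreement lying between Economics and Social sciences.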

‘Mis-clustered’ Word Clouds

Word clouds for dissertations assigned to Physics DDC but clustered with another topic.

Word clouds for dissertations assigned to Economics DDC but clustered with another topic.

Applications

Dealing with ‘born digital’ archives:

  • Growing problems of scale and context.
  • Understanding overall structure of corpus.
  • Adding metadata / filling in missing fields.

Improving document retrieval:

  • Reduced reliance on ‘direct hits’.
  • Searching the ‘semantic space’ means fewer ‘near misses’.

Identifying trends:

  • Embeddings may anticipate discoveries (e.g. Tshitoyan et al. 2019).

Logos of CASA and British Library

Jennie / @JenniexWilliams / jenniewilliams

Jon / @jreades / jreades

Acknowledgements

We’d like to thank the British Library (particularly the EThOS team) and Programming Historian/Jisc for their support, guidance, and encouragement. We gratefully acknowledge the contribution of the ESRC via the LISS and UBEL DTPs to making Jennie’s research possible.

References

Andrews, D., L. Broad, P. Edwards, D. Fox, T. Gallagher, S. Garland, R. Kidd, and J. Sweeney. 2016. “The Creation and Characterisation of a National Compound Collection: The Royal Society of Chemistry Pilot.” Chem. Sci. 7. The Royal Society of Chemistry:3869–78. https://doi.org/10.1039/C6SC00264A.
Firth, J. R. 1957. “A Synopsis of Linguistic Theory, 1930-1955.” Studies in Linguistic Analysis. Basil Blackwell.
Howe, K. 2015. “A Novel Use of PhD Data: Investigating the State of the Dementia Workforce.” 2015. https://blogs.bl.uk/science/2015/09/a-novel-use-of-phd-data.html.
Montgomery, C. 2019. “Surfacing ‘Southern’ Perspectives on Student Engagement with Internationalization: Doctoral Theses as Alternative Forms of Knowledge.” Journal of Studies in International Education 23 (1):123–38. https://doi.org/10.1177/1028315318803743.
Tshitoyan, V., J. Dagdelen, L. Weston, A. Dunn, Z. Rong, O. Kononova, K. A. Persson, G. Ceder, and A. Jain. 2019. “Unsupervised Word Embeddings Capture Latent Knowledge from Materials Science Literature.” Nature 571 (July):95–98. https://doi.org/10.1038/s41586-019-1335-8.