19th AIAI 2023, 14 - 17 June 2023, León, Spain

Comparing vectorization techniques, supervised and unsupervised classification methods for scientific publication categorization in the UNESCO taxonomy.

Neil Villamizar, Jesús Wahrman, Minaya Villasana

Abstract:

  A comparison of classification strategies for scientific articles, using the UNESCO taxonomy for categorization, is presented. An annotated set of articles was vectorized using TF-IDF, Doc2Vec, BERT and SPECTER. Among these options, SPECTER provided the best separability properties, according to both quantitative metrics and qualitative inspection of 2D t-SNE projections. When the best-performing vectorization strategy is paired with classical machine learning classifiers, such as multilayer perceptrons and support vector machines, comparable results are obtained, leading to the conclusion that the choice of text representation has a greater impact than the choice of classifier. The most problematic areas for classification were identified, and a cascading classification strategy was implemented and evaluated. Unsupervised methods were also tested to assess their suitability when annotated data is not readily available. Two unsupervised methods were compared, and k-means yielded the best results, with the optimal number of clusters being 3 times the number of categories.
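  As an illustration of the unsupervised setting described in the abstract, the following is a minimal, dependency-free sketch that vectorizes a toy corpus with TF-IDF and runs k-means with 3 times the number of categories as the number of clusters. It is not the authors' implementation: the corpus, the labels, the helper names (`tfidf`, `kmeans`, `purity`) and the use of cluster purity as a score are all assumptions made for the example, and TF-IDF stands in for SPECTER only to keep the sketch self-contained.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Return L2-normalized dense TF-IDF vectors for tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency, so no term gets a zero weight.
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def purity(assign, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    votes = {}
    for a, y in zip(assign, labels):
        votes.setdefault(a, Counter())[y] += 1
    return sum(c.most_common(1)[0][1] for c in votes.values()) / len(labels)


# Hypothetical toy corpus with two made-up categories (not UNESCO codes).
docs = [
    ["quantum", "particle", "energy"],
    ["particle", "collider", "energy"],
    ["quantum", "field", "theory"],
    ["laser", "optics", "energy"],
    ["cell", "protein", "gene"],
    ["gene", "genome", "sequence"],
    ["protein", "enzyme", "cell"],
    ["species", "ecology", "gene"],
]
labels = ["physics"] * 4 + ["biology"] * 4

n_categories = len(set(labels))
k = 3 * n_categories  # cluster count set to 3 times the number of categories
vectors = tfidf(docs)
assign = kmeans(vectors, k)
score = purity(assign, labels)
print(f"k={k}, purity={score:.2f}")
```

  Purity is used here only because it lets several clusters map to the same category, which is what choosing more clusters than categories implies; the abstract does not say which evaluation metric the authors used.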

*** Title, author list and abstract as seen in the Camera-Ready version of the paper that was provided to Conference Committee. Small changes that may have occurred during processing by Springer may not appear in this window.