18th AIAI 2022, 17 - 20 June 2022, Greece

Exploring the pertinence of distance functions for nominal multi-label data

Payel Sadhukhan


  Data with nominal features constitute a good fraction of multi-label datasets. Handling high-dimensional nominal data differs from handling data with numeric features. The key reason is that distance functions which work well on numeric datasets may not perform optimally in a nominal feature space, because they may fail to return the true separations of the points. We have further observed that, in a multi-label dataset, an imbalance exists in the distribution of nominal features, which further hampers learning. In this work, we examine the suitability of four distance functions (Euclidean, Hamming, Jaccard and Kulsinski) in a binary-nominal context. We also propose and explore an ensemble of two classifiers, one modelled using the Jaccard distance and the other using the Kulsinski distance. An empirical study involving five binary-nominal datasets, four evaluation metrics and three multi-label classifiers is used to evaluate the pertinence of each distance function and of the ensemble. We find that the proposed ensemble gives the best outcome in all but one case.
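  To make the contrast between the four distance functions concrete, the following minimal sketch computes each of them for a pair of binary feature vectors. It is an illustration only, not the implementation used in the study: the helper names (binary_counts and the four distance functions) are hypothetical, and the Kulsinski formula is assumed to follow the common definition (c_TF + c_FT - c_TT + n) / (c_TF + c_FT + n), where c_TT counts positions in which both vectors are 1, c_TF and c_FT count the mismatches, and n is the vector length.

```python
import math


def binary_counts(u, v):
    """Return (c_tt, c_tf, c_ft, n) for two equal-length 0/1 vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    c_tt = sum(1 for a, b in zip(u, v) if a and b)      # both 1
    c_tf = sum(1 for a, b in zip(u, v) if a and not b)  # 1 in u, 0 in v
    c_ft = sum(1 for a, b in zip(u, v) if not a and b)  # 0 in u, 1 in v
    return c_tt, c_tf, c_ft, len(u)


def euclidean(u, v):
    # On 0/1 vectors every mismatch contributes exactly 1 to the squared sum.
    _, c_tf, c_ft, _ = binary_counts(u, v)
    return math.sqrt(c_tf + c_ft)


def hamming(u, v):
    # Fraction of positions that differ; shared 0s dilute the distance.
    _, c_tf, c_ft, n = binary_counts(u, v)
    return (c_tf + c_ft) / n


def jaccard(u, v):
    # Mismatches relative to positions where at least one vector is 1;
    # shared 0s are ignored. Two all-zero vectors are treated as identical.
    c_tt, c_tf, c_ft, _ = binary_counts(u, v)
    denom = c_tt + c_tf + c_ft
    return (c_tf + c_ft) / denom if denom else 0.0


def kulsinski(u, v):
    # Assumed definition (see lead-in): shared 1s reduce the distance,
    # and the vector length n enters both numerator and denominator.
    c_tt, c_tf, c_ft, n = binary_counts(u, v)
    return (c_tf + c_ft - c_tt + n) / (c_tf + c_ft + n)


# Two sparse binary-nominal points: one shared 1, two mismatches, n = 8.
u = [1, 1, 0, 0, 0, 0, 0, 0]
v = [1, 0, 1, 0, 0, 0, 0, 0]

for name, fn in [("euclidean", euclidean), ("hamming", hamming),
                 ("jaccard", jaccard), ("kulsinski", kulsinski)]:
    print(f"{name:10s} {fn(u, v):.4f}")
```

  For these two sparse vectors, the Hamming distance is 0.25 because the six shared zeros count as agreement, whereas the Jaccard distance is 2/3 because it ignores them. This is one way to see why the choice of distance function can matter in a sparse binary-nominal feature space.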

