DEVELOPMENT OF OBJECT CLUSTERING QUALITY METRIC FOR MULTIDIMENSIONAL SPACES

D.O. STUPAK

Abstract

The volume of newly generated data is growing exponentially. Clustering is a machine learning method that does not require labeled data, so it makes it possible to determine the structure of a dataset quickly and to draw conclusions from it. The article considers the problem of clustering objects in a multidimensional space. The problem is not new, but the “curse of dimensionality” is critical here: the algorithm must first divide objects in a multidimensional space into clusters and then apply clustering quality metrics to find the optimal partition. Existing clustering quality metrics often depend on the dimensionality of the space, so using them under these conditions can be difficult or can lead to incorrect results. The aim of the article is to develop a clustering quality metric whose value does not depend on the dimensionality of the space in which the objects are described. Two sets of datasets were generated for the study. In the first set, the objects are grouped into 5 well-separated clusters; in the second, the clusters almost “touch” each other. Each set contains 6 datasets, with space dimensions of 10, 100, 300, 1024, 2048 and 4096. The developed metric compares an intercluster characteristic of the partition with an intracluster characteristic. It takes into account the dimension of the space and gives priority to partitions with a smaller number of clusters. The effectiveness of the metric was demonstrated by numerical experiments on synthetic datasets whose object distributions are close to those of datasets found in practical problems.
The developed clustering quality metric makes it possible to determine the correct number of clusters using the “elbow method”, does not depend on the dimensionality of the space, and can be applied even in complex cases where the clusters are located close to each other.
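
The dimension dependence mentioned in the abstract can be illustrated with a minimal sketch. It is not the metric developed in the article, whose formula is not given here. The helpers `make_clusters` and `mean_intracluster_distance`, the unit-variance Gaussian clusters, and the division by the square root of the dimension are all assumptions made for illustration. The sketch shows only that a raw intracluster distance grows with the dimension of the space, while a rescaled value stays roughly constant.

```python
import math
import random


def make_clusters(n_clusters, n_per_cluster, dim, center_scale=10.0, seed=0):
    """Generate isotropic unit-variance Gaussian clusters (hypothetical helper)."""
    rng = random.Random(seed)
    points, labels = [], []
    for k in range(n_clusters):
        # Assumption: cluster centers are drawn uniformly from a cube.
        center = [rng.uniform(-center_scale, center_scale) for _ in range(dim)]
        for _ in range(n_per_cluster):
            points.append([c + rng.gauss(0.0, 1.0) for c in center])
            labels.append(k)
    return points, labels


def mean_intracluster_distance(points, labels):
    """Mean Euclidean distance from each point to the centroid of its cluster."""
    dim = len(points[0])
    sums, counts = {}, {}
    for p, k in zip(points, labels):
        acc = sums.setdefault(k, [0.0] * dim)
        for i, x in enumerate(p):
            acc[i] += x
        counts[k] = counts.get(k, 0) + 1
    centroids = {k: [s / counts[k] for s in acc] for k, acc in sums.items()}
    total = sum(math.dist(p, centroids[k]) for p, k in zip(points, labels))
    return total / len(points)


for dim in (10, 100, 1024):
    pts, lbl = make_clusters(n_clusters=5, n_per_cluster=40, dim=dim)
    raw = mean_intracluster_distance(pts, lbl)
    # Assumption: dividing by sqrt(dim) is one simple way to remove the growth.
    print(dim, round(raw, 2), round(raw / math.sqrt(dim), 2))
```

In this setup the second printed column increases with the dimension and the third stays close to 1. A quality metric built directly on raw distances would therefore change with the dimension even though the cluster structure is the same.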

Article Details

Section
Computer Modelling in Physics
Author Biography

D.O. STUPAK, PhD, senior data analyst, EPAM Digital, Cherkasy, Ukraine

References

LLM Basics: Embedding Spaces - Transformer Token Vectors Are Not Points in Space [Electronic resource]. Available at: https://www.lesswrong.com/posts/pHPmMGEMYefk9jLeh/llm-basics-embedding-spaces-transformer-token-vectors-are

Madhulatha TS. An overview on clustering methods. IOSR J Eng. 2012; 2(4): pp. 719–725.

Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J. and Wu, S. Understanding and enhancement of internal clustering validation measures. IEEE Transactions on Cybernetics, 2013; 43(3), pp. 982–994.

Elbow method (clustering) [Electronic resource]. Available at: https://en.wikipedia.org/wiki/Elbow_method_(clustering)

Silhouette (clustering) [Electronic resource]. Available at: https://en.wikipedia.org/wiki/Silhouette_(clustering)

Davies–Bouldin index [Electronic resource]. Available at: https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index

davies_bouldin_score [Electronic resource]. Available at: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html