Benchmark-on-Clustering-Mixed-Type-Data

This repository contains the code and the results of the paper:

Jarrett Jimeno, Madhumita Roy, and Cristina Tortora (2020) Clustering Mixed-Type Data: a benchmark study on KAMILA and K-prototypes. In Studies in Classification, Data Analysis, and Knowledge Organization, accepted.

Abstract: Benchmarking in cluster analysis is the process of analyzing which clustering techniques give the best results for different types of data structures, as well as setting a standard for the evaluation of newer clustering methods. There are many instances of benchmarking in cluster analysis for continuous data, but only a few for mixed-type data, i.e. data sets with nominal and continuous variables. Therefore, we explore the process for benchmarking various clustering methods on simulated mixed-type data sets with varying proportions of continuous and nominal variables. For this purpose, we test a newer clustering algorithm, KAMILA, against K-prototypes and tandem analysis, where data are pre-processed using multiple correspondence analysis and then clustered using K-means, fuzzy K-means, probabilistic distance clustering (PD), and Student-t mixture models.
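
For readers new to mixed-type clustering, the following is a minimal sketch of the dissimilarity usually associated with K-prototypes: squared Euclidean distance on the continuous variables plus a weighted count of mismatches on the nominal variables. It is an illustration only and is not taken from the code in this repository; the function names, the example values, and the default weight `gamma=1.0` are all assumptions made for this sketch.

```python
def mixed_dissimilarity(x_cont, x_nom, proto_cont, proto_nom, gamma=1.0):
    """K-prototypes-style dissimilarity between an observation and a prototype.

    x_cont, proto_cont: sequences of continuous values (same length).
    x_nom, proto_nom: sequences of nominal labels (same length).
    gamma: hypothetical weight balancing the nominal part against the continuous part.
    """
    # Continuous part: squared Euclidean distance.
    squared_euclidean = sum((a - b) ** 2 for a, b in zip(x_cont, proto_cont))
    # Nominal part: number of variables whose categories differ.
    mismatches = sum(1 for a, b in zip(x_nom, proto_nom) if a != b)
    return squared_euclidean + gamma * mismatches


def nearest_prototype(x_cont, x_nom, prototypes, gamma=1.0):
    """Return the index of the prototype with the smallest mixed dissimilarity.

    prototypes: list of (continuous_values, nominal_labels) pairs.
    """
    distances = [
        mixed_dissimilarity(x_cont, x_nom, p_cont, p_nom, gamma)
        for p_cont, p_nom in prototypes
    ]
    return min(range(len(prototypes)), key=distances.__getitem__)


# Two illustrative prototypes, each with two continuous and two nominal variables.
prototypes = [((0.0, 0.0), ("a", "x")), ((5.0, 5.0), ("b", "y"))]
assigned = nearest_prototype((0.5, -0.2), ("a", "y"), prototypes)
```

A full K-prototypes run would alternate this assignment step with updating each prototype (means for continuous variables, modes for nominal ones). How the weight on the nominal part is chosen strongly affects the result, which is one reason benchmark studies on mixed-type data are useful.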

A benchmark study on KAMILA and K-prototypes. Jarrett Jimeno, Madhumita Roy, Cristina Tortora