Abstract
Despite the popularity of k-means clustering, feature scaling before applying it is an essential yet often neglected step. In this study, feature scaling via five methods (Z-score, Min-Max normalization, Percentile transformation, Maximum absolute scaling, or RobustScaler) prior to k-means clustering was compared with using the raw (i.e., non-scaled) data for datasets whose features have different units or the same unit. The results of an experimental study show that, for features with different units, scaling them before k-means clustering provided better accuracy, precision, recall, and F-score values than using the raw data. Meanwhile, when the features in a dataset had the same unit, scaling them beforehand provided results similar to using the raw data. Thus, scaling the features beforehand is a very important step for datasets with different units, as it improves the clustering results and accuracy. Of the five feature-scaling methods applied to the datasets with different units, Z-score standardization and Percentile transformation provided similar performances that were superior to the other methods or using the raw data. Although Maximum absolute scaling performed slightly better than the other scaling methods and the raw data when the dataset contained features with the same unit, the improvement was not significant.
Citation: Wongoutong C (2024) The impact of neglecting feature scaling in k-means clustering. PLoS ONE 19(12): e0310839. https://doi.org/10.1371/journal.pone.0310839
Editor: Nattapol Aunsri, Mae Fah Luang University, THAILAND
Received: April 9, 2024; Accepted: September 6, 2024; Published: December 6, 2024
Copyright: © 2024 Chantha Wongoutong. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data underlying the results presented in the study are available from https://doi.org/10.6084/m9.figshare.26333128.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
K-means clustering is a distance-based algorithm used to group or cluster items according to measured or perceived intrinsic characteristics or similarities [1]. This technique, which is used to discover natural grouping(s) of patterns, points, or objects, is invaluable for analyzing large databases of multivariate data involving many features and is applied in the fields of data mining [2], statistical data analysis [3], pattern recognition [4], and image processing [5].
The k-means clustering algorithm is typically influenced by its initialization and must be provided with the number of clusters beforehand [6]. In general, the initial cluster centers are chosen randomly, which affects the final cluster formation; this implies that the clustering outcome can differ each time the algorithm is executed, even on the same dataset. Many researchers have proposed methods for selecting the initial cluster centers for the k-means algorithm [7–10]. Although it is essential to determine the number of clusters required for k-means clustering, this can be challenging for end-users who do not have an in-depth understanding of the dataset. Several methods have been proposed in the literature to address this issue, including the rule of thumb [11], the elbow method [12], the information criterion approach [13], the information theoretical approach [14], choosing the number of clusters (k) using the silhouette method [15], and cross-validation [16].
However, the k-means technique requires feature scaling in the pre-processing stage, a step that is often overlooked. The k-means algorithm relies on distance-based metrics (e.g., the Euclidean distance), which are sensitive to scale variation. This means the results can vary depending on the range of values of the features, and the smaller the distance, the closer the points are to each other, thereby indicating their similarity [17]. Especially when analyzing real-world datasets, the features can have different scales because they are measured in different units. Scaling the features into a uniform range, so that no feature predominates in the distance calculation, is crucial to improve the clustering results and enhance the k-means algorithm's performance [18, 19]. Patel and Mehta [20] claimed that a normalized dataset produces better clustering outcomes than the raw data. According to recent research [21], a high magnitude affects the distance between two given points, which impacts the performance of k-means clustering since variables with higher magnitudes are given more weight. Hence, it is always advisable to bring all the features to the same scale before applying distance-based algorithms such as k-means clustering [22].
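To illustrate this point, the short Python sketch below (with illustrative values, not taken from the study's datasets) shows how a gram-scale feature dominates the Euclidean distance between two points until the features are standardized.

```python
import numpy as np

# Two hypothetical penguins described by culmen length (mm) and body mass (g);
# the values are illustrative only.
a = np.array([39.1, 3750.0])
b = np.array([40.2, 4050.0])

# Raw Euclidean distance: almost entirely determined by the gram-scale body mass.
raw_dist = np.linalg.norm(a - b)            # ~300.0

# After Z-score-style scaling (illustrative feature means and SDs),
# both features contribute on comparable scales.
mean = np.array([43.9, 4200.0])
std = np.array([5.5, 800.0])
scaled_dist = np.linalg.norm((a - mean) / std - (b - mean) / std)   # ~0.43

print(raw_dist, scaled_dist)
```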
Although feature scaling is commonly recommended as a pre-processing step before clustering, there is limited research on how the scaling of features affects the clustering performance, especially when all of the features have the same or different units. Therefore, the aim of the present study is to investigate and clarify whether feature scaling is essential as a pre-processing step by examining two scenarios: datasets in which the units of the features are the same or different.
The remainder of this paper is organized as follows. In Section 2, the datasets used in the study are presented, while Section 3 provides the methodology for k-means clustering and feature scaling. The performance metrics and experimental study are covered in Section 4, the results of which, together with a discussion, are provided in Section 5. Finally, conclusions on the study are given in Section 6.
2. The datasets used in the study
To examine the impact of prior feature scaling on k-means clustering, 10 real-world datasets were obtained from https://www.kaggle.com/ [23], the details of which are provided in Table 1.
3. Methodology for k-means clustering and feature scaling
3.1 K-means clustering
Data clustering is a crucial technique for many applications such as data mining and is a valuable tool for researchers working with large datasets of multivariate data [24]. Several clustering methods are available, each with its inherent advantages and disadvantages. K-means, a centroid-based clustering algorithm that is the most well-known data mining method, is a simple yet powerful clustering technique [25, 26]. In this method, an iterative algorithm is used to partition the dataset into k distinct non-overlapping subgroups (clusters), where each data point belongs to only one group [27]. In addition, the data points within each cluster are made as similar as possible while keeping the different clusters as far apart as possible. The working of k-means clustering is explained in the following steps:
- Select a suitable value for k.
- Choose the initial centroids by randomly selecting k data points from the dataset.
- Assign each data point to its closest centroid, thereby forming the predefined number (k) of clusters.
- Recompute the centroid of each cluster as the mean of the data points assigned to it.
- Repeat the assignment step, i.e., reassign each data point to its new closest centroid.
- If any reassignment occurred, return to step 4; otherwise, finish.
A flowchart for the k-means clustering algorithm is illustrated in Fig 1.
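A minimal Python sketch of these steps (using NumPy; not the implementation used in the study, which presumably relied on a standard k-means routine) is shown below for illustration.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means with Euclidean distance, following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: choose k initial centroids at random from the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each data point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: place a new centroid at the mean of each cluster
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Steps 5-6: stop once the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: cluster 300 random 4-dimensional points into k = 3 groups.
labels, centers = kmeans(np.random.default_rng(1).random((300, 4)), k=3)
```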
3.2 The feature scaling methods
Feature scaling is a crucial step in the pre-processing stage of machine learning when real-world datasets have many features with a wide range of values. Hence, transforming a dataset's variables (features) from different dynamic ranges to a specific range, ensuring that no single feature predominates in the distance calculations of the algorithm, can help improve its performance [28]. The most common techniques of feature scaling are normalization and standardization [29]. Normalization is used when there is a need to bind the values between two numbers, typically [0,1], while standardization transforms the data to zero mean and unit variance, thereby making it unitless. Some machine-learning algorithms, such as k-means clustering, are strongly affected by the range of the features. Behind the scenes, k-means clustering uses distances (usually the Euclidean distance) between data points to determine their similarity. Hence, feature scaling before k-means clustering is crucial [30, 31]. In the present study, five scaling methods were used: Z-score, Min-Max normalization, Percentile transformation, Maximum absolute scaling, and RobustScaler.
3.2.1. Z-score.
This is a standard scoring method since the distribution of the scores on the various variables is standardized. This technique scales the values of a feature with zero mean and unit variance [32]. It is calculated by subtracting the mean of the feature from each value and then dividing it by the standard deviation. The Z-score values can be positive or negative: negative and positive signs represent an observation below or above the mean, respectively. The equation for calculating the Z-score is
$z = \frac{x - \bar{x}}{s}$ (1)

where $x$ is a feature value and $\bar{x}$ and $s$ are the mean and standard deviation of that feature, respectively.
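A minimal NumPy sketch of Eq (1) might look as follows; it is equivalent to scikit-learn's StandardScaler, which also uses the population standard deviation.

```python
import numpy as np

def z_score(X):
    """Eq (1): center each feature on its mean and divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```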
3.2.2. Min-max normalization.
In this method, the raw data are scaled within a specific range [0,1] while preserving the relationships among them [33]. It is calculated by subtracting the minimum value of the feature from each value and then dividing it by the range of that feature. As a result, the standard deviation of the scaled data is smaller, and the effect of outliers is suppressed to a certain extent. The equation to achieve this is
$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ (2)

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the feature, respectively.
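A corresponding sketch of Eq (2), equivalent to scikit-learn's MinMaxScaler with its default [0, 1] range, is shown below.

```python
import numpy as np

def min_max(X):
    """Eq (2): rescale each feature to [0, 1] using its minimum and its range."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)
```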
3.2.3. Percentile transformation.
The percentile is used to compare the values in the given data or to find where a given value stands relative to the other observations [34]. The benefit of the percentile is that each value has a relatively straightforward interpretation: it is the percentage of observations at or below that value. It is simple to calculate: the percentile of x is the ratio of the number of values below x (n) to the total number of values (N), multiplied by 100; i.e.,
$P(x) = \frac{n}{N} \times 100$ (3)
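Eq (3) can be sketched as follows; it counts the values strictly below each x, as in the equation (conventions for handling ties differ slightly between implementations).

```python
import numpy as np

def percentile_transform(X):
    """Eq (3): replace each value by the percentage of observations below it."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for j in range(X.shape[1]):
        col = X[:, j]
        below = (col[:, None] > col[None, :]).sum(axis=1)  # n: values below each x
        out[:, j] = 100.0 * below / len(col)               # N: total number of values
    return out
```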
3.2.4. Maximum absolute scaling.
Feature scaling aims to bring different features or variables onto a common scale, thereby enabling fair comparisons and improving the performance of machine-learning algorithms. Maximum absolute scaling is one of the most commonly used methods for normalization. It is computed by dividing each observation by the maximum absolute value of the variable; hence, the transformed values vary approximately within the range of -1 to 1. The equation for Maximum absolute scaling is
$x' = \frac{x}{\max(|x|)}$ (4)
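Eq (4) corresponds to scikit-learn's MaxAbsScaler; a NumPy sketch:

```python
import numpy as np

def max_abs(X):
    """Eq (4): divide each feature by its maximum absolute value (range within [-1, 1])."""
    X = np.asarray(X, dtype=float)
    return X / np.abs(X).max(axis=0)
```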
3.2.5. RobustScaler.
RobustScaler is a scaling method that uses the median. It removes the median and scales the data based on the interquartile range (IQR), which ranges between the 25th and 75th percentile. Because it uses the interquartile range, it can handle outliers well while scaling the data. The equation to achieve this is
$x' = \frac{x - \operatorname{median}(x)}{\mathrm{IQR}(x)}$ (5)
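Eq (5) matches the default behavior of scikit-learn's RobustScaler; a NumPy sketch:

```python
import numpy as np

def robust_scale(X):
    """Eq (5): subtract the median and divide by the IQR (75th minus 25th percentile)."""
    X = np.asarray(X, dtype=float)
    q1, med, q3 = np.percentile(X, [25, 50, 75], axis=0)
    return (X - med) / (q3 - q1)
```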
4. The performance metrics and experimental study
The effects of the different feature-scaling methods and of using the raw data on the performance of k-means clustering were investigated. Two scenarios were considered: datasets in which the features have the same or different units. Five scaling methods were used: Z-score, Min-Max normalization, Percentile transformation, Maximum absolute scaling, and RobustScaler.
4.1 The performance metrics
Evaluating the performance of k-means clustering is impossible when the true data grouping is unknown. Therefore, datasets in which the true group membership is known were obtained, and different metrics were used to evaluate the performances of the feature scaling methods. A confusion matrix was created from the true grouped data and k-means clustering results and used to compare the accuracy, precision, recall, and F-scores of the methods. Thereby, the number of misclassified patterns and the total number of patterns for each dataset were evaluated.
4.1.1 Confusion matrix.
A confusion matrix is a popular measure used to solve classification problems; it summarizes the predictions made by a classification model organized into a table by class [35]. Also, it helps gain insight into how correct the predictions were and how they hold up against the actual values. Table 2 provides an example of binary classification using a confusion matrix of predicted and actual values.
4.1.2 Accuracy.
Accuracy is one of the most frequently used metrics for evaluating classifier models. To calculate accuracy, the number of correctly classified observations is divided by the total number of observations, which is easy to understand and implement [36]. While accuracy is one of the most popular classification metrics because of its simplicity, it has a few major shortcomings, such as being misleading for imbalanced datasets. The equation for accuracy is
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (6)
where TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative values, respectively.
4.1.3 Precision and recall.
Precision and recall are two metrics that can help differentiate between error types and remain helpful in scenarios with class imbalance [37]. The respective equations are
$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$ (7)
4.1.4 F-score.
The F-score is another valuable metric, especially for imbalanced datasets. It balances precision and recall, thereby more comprehensively evaluating a model’s performance. It is beneficial when there is a need to assess both false positives and false negatives by offering a single numerical representation that considers both aspects of classification accuracy [38]. The equation to achieve this is
$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (8)
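A sketch of how Eqs (6)–(8) can be obtained with scikit-learn is given below. Two assumptions are made that the paper does not spell out: the per-class values for the multi-class datasets are combined by macro averaging, and the k-means cluster IDs have already been matched to the true class labels (one way to do this is shown in the Section 4.2 sketch below).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def clustering_metrics(y_true, y_pred):
    """Confusion matrix plus Eqs (6)-(8); macro averaging over classes is assumed."""
    return {
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),                     # Eq (6)
        "precision": precision_score(y_true, y_pred, average="macro"),  # Eq (7)
        "recall": recall_score(y_true, y_pred, average="macro"),        # Eq (7)
        "f_score": f1_score(y_true, y_pred, average="macro"),           # Eq (8)
    }
```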
4.2 Experimental study
Ten datasets were obtained for the study: five with the same unit for all of the features and five with different units. The five scaling methods (Z-score, Min-Max normalization, Percentile transformation, Maximum absolute scaling, and RobustScaler) were applied to each dataset. K-means clustering was then computed for each dataset using the raw data and after applying each of the five scaling methods. Performance in terms of accuracy, precision, recall, and F-score was calculated by comparing the true grouped data with the results of the k-means clustering. A flowchart of the overall process is presented in Fig 2.
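The sketch below outlines one way to implement this workflow with scikit-learn and SciPy. It is not the study's code: the Percentile transformation is hand-written to follow Eq (3), and Hungarian assignment is used to map cluster IDs to the true class labels before computing the metrics, since the paper does not state how this mapping was done.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import rankdata
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

def percentile_transform(X):
    # Eq (3): percentage of observations strictly below each value, per feature.
    X = np.asarray(X, dtype=float)
    return np.column_stack(
        [100.0 * (rankdata(col, method="min") - 1) / len(col) for col in X.T]
    )

SCALERS = {
    "raw": lambda X: np.asarray(X, dtype=float),
    "z-score": lambda X: StandardScaler().fit_transform(X),
    "min-max": lambda X: MinMaxScaler().fit_transform(X),
    "percentile": percentile_transform,
    "max-abs": lambda X: MaxAbsScaler().fit_transform(X),
    "robust": lambda X: RobustScaler().fit_transform(X),
}

def match_clusters(y_true, clusters, k):
    """Relabel cluster IDs so they best match the true groups (Hungarian assignment)."""
    classes, y_idx = np.unique(y_true, return_inverse=True)
    contingency = np.zeros((len(classes), k), dtype=int)
    for t, c in zip(y_idx, clusters):
        contingency[t, c] += 1
    rows, cols = linear_sum_assignment(-contingency)   # maximize matched counts
    mapping = {c: classes[r] for r, c in zip(rows, cols)}
    return np.array([mapping[c] for c in clusters])

def evaluate(X, y_true, k, seed=0):
    """Run k-means on the raw and scaled data; k is assumed to equal the true number of groups."""
    results = {}
    for name, scale in SCALERS.items():
        clusters = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scale(X))
        results[name] = accuracy_score(y_true, match_clusters(y_true, clusters, k))
    return results
```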
5. Results and discussion
5.1. Heatmaps for data visualization
Visualization is a simple and valuable way to explore the overall structure of results. One of the most popular graphical methods for visualizing high-dimensional data is the heatmap, in which a table of numbers is encoded as a grid of colored cells; it can be used to explore the results in depth. The arrangement of rows and columns in a heatmap matrix is ordered to accentuate patterns and is often complemented by dendrograms. Heatmaps are used in various forms of analytics for visualizing observations, correlations, patterns of missing values, and more. Their versatility makes them a valuable tool for distilling complex information into visually accessible representations.
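Such heatmaps with row and column dendrograms can be produced, for example, with seaborn's clustermap; the snippet below is a sketch of that approach (the paper does not state which software was used for its figures).

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

def plot_clustered_heatmap(X, title):
    """Heatmap with hierarchical-clustering dendrograms on rows and columns."""
    g = sns.clustermap(X, cmap="viridis", metric="euclidean", method="average")
    g.ax_heatmap.set_title(title)
    plt.show()

# e.g., for dataset D1: raw data versus Z-score-scaled data
# plot_clustered_heatmap(X_d1, "D1 (R)")
# plot_clustered_heatmap(StandardScaler().fit_transform(X_d1), "D1 (Z)")
```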
Fig 3 shows a heatmap for the results using the five datasets in which the features in each dataset have different units (D1–D5). For example, D1(R) is a raw dataset of three penguin species with four features: culmen length (mm), culmen depth (mm), flipper length (mm), and body mass (g). Obviously, a pattern could not be detected in the plotted heatmap when using the raw data because the unit for body mass (grams) is different from the other three features (mm), which impacts the Euclidean distance used to measure the similarity.
Heatmaps for datasets (a) D1, (b) D2, (c) D3, (d) D4, and (e) D5. R: raw data, Z: Z-score, M: Min-Max normalization, P: Percentile transformation, A: Maximum absolute scaling and B: RobustScaler.
The result of feature scaling using Z-score is shown in Fig 3 as heatmap D1 (Z); some object patterns can be seen in the structure of the data, with a dendrogram to the right of the heatmap. Heatmap D1 (M) was obtained after normalizing the features to [0,1] using Min-Max normalization; the heatmap of this dataset shows a pattern that is easier to explore than the heatmap of the raw data. Furthermore, scaling via Percentile transformation and normalizing all features to [0,1] resulted in heatmap D1 (P), in which the pattern is likewise easier to explore than in the heatmap of the raw data. Likewise, scaling the data using Maximum absolute scaling resulted in heatmap D1 (A), in which all features are normalized to [0,1]. Last, scaling the data using RobustScaler resulted in heatmap D1 (B); this method is similar to Z-score but uses the interquartile range, so it can handle outliers well while scaling the data.
Moreover, compared with the heatmap using the raw data, we can see that the dendrogram uncovers some object patterns. When there were more than 10 features, as in D2 and D3, it was difficult to gain insights from the heatmaps of the raw datasets because of the different units of the features; the shade of color and the arrangement of the rows in the dendrogram are not conducive to detecting the classifier pattern.
On the other hand, Fig 4 illustrates heatmaps for the five datasets in which the features in each dataset have the same unit (S1–S5). Heatmap visualization using the raw data and the five scaling methods provides easily explorable patterns in all cases. Thus, feature scaling is essential before k-means clustering when a dataset has features with different units.
Heatmaps for datasets (a) S1, (b) S2, (c) S3, (d) S4, and (e) S5. R: raw data, Z: Z-score, M: Min-Max normalization, P: Percentile transformation, A: Maximum absolute scaling and B: RobustScaler.
5.2 Performance evaluation of feature scaling before k-means clustering
The results for the performance metrics (accuracy, precision, recall, and F-score), together with the tests of the hypothesis of homogeneity between the true grouped data and the k-means clustering results, are reported in Tables 3 and 4 for the datasets with different units and with the same unit, respectively.
As an example from Table 3, D1 with raw data for k-means clustering compared with the true grouped data attained accuracy, precision, recall, and F-score values of 0.5614, 0.4949, 0.6058, and 0.5448 respectively. In comparison, feature scaling beforehand using Z-score provided 0.9620, 0.9578, 0.9512, and 0.9545, respectively, while Min-Max normalization provided 0.8538, 0.8360, 0.8395, and 0.8378, respectively, Percentile transformation provided 0.8889, 0.8973, 0.8693, and 0.8831, respectively, Maximum absolute scaling provided 0.9152, 0.8951, 0.8978, and 0.8965, respectively and RobustScaler provided 0.9327, 0.9333, 0.9297, and 0.9315, respectively.
These results clearly indicate that when the features in the dataset have different units, feature scaling beforehand improves the k-means clustering performance. When testing the hypothesis of homogeneity between the true grouped data and the results of k-means clustering using the raw data, the Chi-square value was 19.829 (p-value < 0.001), which signifies a significant difference. In comparison, the Chi-square values when using the five feature-scaling methods all showed non-significant differences. The results showed the same trend for datasets D2–D5.
As an example from Table 4, S1 contains five variables: MCG, GVH, AAC, ALM1, and ALM2, all with the same unit (percentage). The accuracy, precision, recall, and F-score results for S1 using the raw data for k-means clustering compared with the true grouped data were 0.9191, 0.9164, 0.9073, and 0.9119, respectively. In comparison, feature scaling using Z-score beforehand provided 0.9338, 0.9257, 0.927, and 0.9267, respectively, while Min-Max normalization provided 0.9228, 0.9187, 0.9122, and 0.9154, respectively, Percentile transformation provided 0.9375, 0.9240, 0.9338, and 0.9289, respectively, Maximum absolute scaling provided 0.9191, 0.9164, 0.9073, and 0.9119, respectively, and RobustScaler provided 0.9228, 0.9218, 0.9223, and 0.9220, respectively. These results indicate that when the features in the dataset have the same unit, feature scaling may not be required before k-means clustering. The results for the other datasets having features with the same unit followed the same trend.
In summary, for a dataset having features with different units, neglecting feature scaling before k-means clustering led to noticeably poor performance, whereas for a dataset having features with the same unit, feature scaling before k-means clustering did not affect the performance.
To visualize the results for D1–D5 and S1–S5, stacked bar charts (left-hand side) and heatmaps (right-hand side) comparing the true grouped data with the results of k-means clustering are presented in Figs 5 and 6, respectively. For example, the stacked bar chart for D1 in Fig 5 shows frequencies for the three penguin species (Gentoo, Chinstrap, and Adélie) of 123, 68, and 151, respectively. The results of k-means clustering using the raw data provide clusters of 61, 15, and 116, respectively, which, when tested for homogeneity, showed a significant difference (p < 0.001), leading to the conclusion that the true grouped data and the k-means clustering results are different. However, the results of k-means clustering after applying a feature-scaling method provided Gentoo, Chinstrap, and Adélie clusters of 123, 63, and 143, respectively, using Z-score; 86, 55, and 151, respectively, using Min-Max normalization; 123, 61, and 120, respectively, using Percentile transformation; 105, 56, and 151, respectively, using Maximum absolute scaling; and 102, 66, and 151, respectively, using RobustScaler. These are all close to the true grouped data, and homogeneity testing for each provides the same conclusion of a non-significant difference between the true grouped data and the k-means clustered data.
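The homogeneity test can be reproduced, for example, with SciPy using the D1 counts quoted above, under the assumption that it compares the true group sizes with the k-means cluster sizes in a 2 × 3 contingency table (the paper does not detail the exact construction):

```python
import numpy as np
from scipy.stats import chi2_contingency

# True D1 group sizes (Gentoo, Chinstrap, Adélie) versus the k-means cluster
# sizes obtained from the raw data, as quoted above.
true_counts = [123, 68, 151]
raw_kmeans_counts = [61, 15, 116]

chi2, p, dof, expected = chi2_contingency(np.array([true_counts, raw_kmeans_counts]))
print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p:.4g}")
# A p-value below 0.05 indicates that the cluster-size distribution differs
# significantly from the true group sizes.
```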
The dendrograms denote significant differences. R: raw data, Z: Z-score, M: Min-Max normalization, P: Percentile transformation, A: Maximum absolute scaling and B: RobustScaler.
R: raw data, Z: Z-score, M: Min-Max normalization, P: Percentile transformation, A: Maximum absolute scaling and B: RobustScaler.
The results of the k-means clustering of D1–D5 are visualized using stacked bar charts and heatmaps in Fig 5, in which the Euclidean distance is used to test for similarity between the true grouped data and k-means clustered data. The dendrogram shows that the k-means clustered groups using the raw data are very different from those obtained using the feature scaling methods.
Similarly, Fig 6 shows stacked bar charts and heatmaps for S1–S5. For example, the true grouped data for S1 shows three Escherichia coli strains: pp, im, and cp with true frequencies of 52, 77, and 143, respectively. The k-means clustering results using raw data provided clusters of 48, 69, and 133, respectively. Meanwhile, Z-score provided clusters of 48, 69, and 137, respectively; Min-Max normalization provided clusters of 48, 69, and 134, respectively; Percentile transformation provided clusters of 47, 69, and 139, respectively; Maximum absolute scaling provided clusters of 48, 69, and 133, respectively and RobustScaler provided clusters of 47, 68, and 163, respectively. These are all close to the true grouped data, and hypothesis for homogeneity testing revealed only non-significant differences between the true groups and k-means clusters in all cases. In addition, the dendrogram shows that the k-means clusters using the raw data are the same as when using the five scaling methods. The same trend in results was found for S2–S5.
The data visualization using stacked bar charts and heatmaps in Figs 5 and 6 indicates once again that feature scaling before k-means clustering analysis is essential for datasets in which the features have different units but not when they have the same unit.
To compare the performances of the various methods, plots of their accuracy, precision, recall, and F-scores for datasets having different or the same units are shown in Fig 7. For the datasets with different units, the performance metric results (accuracy, precision, recall, and F-score) using the five scaling methods are superior to when using the raw data whereas for the datasets with the same units, they provided similar results.
Table 5 provides the results for the overall average accuracy, precision, recall, and F-scores of k-means clustering for all of the datasets. For the datasets having features with different units, the Z-score method obtained accuracy, precision, recall, and F-score values of 0.9142, 0.9149, 0.9115, and 0.9128 respectively, which indicate its mostly superior performance to the other scaling methods. The only exception was the precision of the Percentile transformation method (0.9202). However, even the poor performance metrics for the Min-Max normalization were better than when using the raw data (accuracy, precision, recall, and F-score values of 0.8855, 0.8859, 0.8850, and 0.8851 versus 0.6621, 0.6112, 0.7104, and 0.6559, respectively). On the other hand, for datasets having features with the same unit, the Maximum absolute scaling method obtained accuracy, precision, recall, and F-score values of 0.8076, 0.8333, 0.7967, and 0.8131, respectively, which were slightly better than the other scaling methods and using raw data.
As shown in Fig 8(A) for the datasets having features with different scales, the results clearly demonstrate that when using the scaling methods before the k-means clustering analysis, the overall average performance metrics (accuracy, precision, recall, and F-score) were superior to those obtained using the raw data. In contrast, as shown in Fig 8(B) for the datasets having features with the same unit, the performance metric results are similar, with feature scaling using Maximum absolute scaling being slightly better than using the other scaling methods or the raw data.
R: raw data, Z: Z-score, M: Min-Max normalization, P: Percentile transformation, A: Maximum absolute scaling and B: RobustScaler.
6. Conclusions
Feature scaling, a crucial aspect of pre-processing, is particularly significant for distance-based algorithms like k-means clustering, which relies on feature scaling to ensure that all features are considered equally in the distance calculations. The direct impact of feature scales on cluster assignment is a key concern. In particular, k-means clustering uses the Euclidean distance between data points, so features with larger values or ranges will have a greater impact on the clustering results than features with smaller ones.
This study clarifies this problem and focuses on whether feature scaling should be used before k-means clustering analysis. To this end, the k-means clustering performance using raw data was compared with preprocessing the data using five feature-scaling techniques (Z-score, Min-Max normalization, Percentile transformation, Maximum absolute scaling, and RobustScaler) when the true group membership is known. Ten real datasets, five with features having the same unit and five with features having different units, were used in the evaluation. Four performance measures (accuracy, precision, recall, and F-score) were used to compare the performances of the various techniques. The results indicate that the feature-scaling step is crucial and should not be neglected when performing k-means clustering on datasets containing features with different units, as it leads to more accurate and reliable cluster assignments. These findings are consistent with those from [39, 40]. Although feature scaling using Maximum absolute scaling was slightly better than the other scaling methods and the raw data when the dataset contained features with the same unit, the improvement was not significant.
The main contribution of this study is to confirm that feature scaling, particularly when the features in the dataset have different units, as is commonly the case in real-world datasets, helps prevent features with larger magnitudes from dominating the distance calculations, leading to more accurate and consistent clustering results and better identification of the natural groupings in the data. Ignoring scale in k-means clustering can distort the results because features with larger values dominate, since the Euclidean distance, the basis of k-means, is sensitive to such variations. At the same time, when the dataset contained features with the same unit, feature scaling in k-means clustering performed only slightly better than using the raw data, and the improvement was insignificant. This study's findings indicate that feature scaling is crucial and should not be neglected when performing k-means clustering, especially on datasets with features having different units. According to recent research [21, 28], a high magnitude affects the distance between two given points, which impacts the performance of k-means clustering since variables with higher magnitudes are given more weight. Therefore, it is always advisable for researchers to bring all the features to the same scale before applying distance-based algorithms such as k-means clustering [22].
However, this study only employed the k-means clustering algorithm with the Euclidean distance to determine the similarity between data points, focusing on the impact of feature scaling for datasets with the same or different units. In the future, other algorithms and distance measures could be compared to achieve better clustering performance.
Acknowledgments
The authors would like to thank the Department of Statistics, Faculty of Science, Kasetsart University, for supporting and providing the facilities to conduct this research.
References
- 1. Ali PJ, Faraj RH, Koya E, Ali PJ, Faraj RH. Data normalization and standardization: a technical report. Mach Learn Tech Rep. 2014 Jan;1(1):1–6.
- 2. Bhardwaj CA, Mishra M, Desikan K. Dynamic feature scaling for k-nearest neighbor algorithm. arXiv preprint arXiv:1811.05062. 2018 Nov 13. https://doi.org/10.48550/arXiv.1811.05062.
- 3. Bholowalia P, Kumar A. EBK-means: A clustering technique based on elbow method and k-means in WSN. International Journal of Computer Applications. 2014 Jan 1;105(9).
- 4. Cheadle C, Vawter MP, Freed WJ, Becker KG. Analysis of microarray data using Z score transformation. The Journal of molecular diagnostics. 2003 May 1;5(2):73–81. pmid:12707371
- 5. Congalton RG. A review of assessing the accuracy of classifications of remotely sensed data. Remote sensing of environment. 1991 Jul 1;37(1):35–46. https://doi.org/10.1016/0034-4257(91)90048-B.
- 6. Dong-hai ZH, Jiang Y, Fei G, Lei YU, Feng DI. K-means text clustering algorithm based on initial cluster centers selection according to maximum distance. Application Research of Computers/Jisuanji Yingyong Yanjiu. 2014 Mar 1;31(3).
- 7. Erisoglu M, Calis N, Sakallioglu S. A new algorithm for initial cluster centers in k-means algorithm. Pattern Recognition Letters. 2011 Oct 15;32(14):1701–1705.
- 8. Genolini C, Falissard B. KmL: k-means for longitudinal data. Computational Statistics. 2010 Jun;25(2):317–328. https://doi.org/10.1007/s00180-009-0178-4.
- 9. Hamerly G, Elkan C. Learning the k in k-means. Advances in neural information processing systems. 2003;16.
- 10. Hartigan JA. Clustering algorithms. John Wiley & Sons, Inc.; 1975 Feb 1.
- 11. Henderi H, Wahyuningsih T, Rahwanto E. Comparison of Min-Max normalization and Z-Score Normalization in the K-nearest neighbor (kNN) Algorithm to Test the Accuracy of Types of Breast Cancer. International Journal of Informatics and Information Systems. 2021 Mar 1;4(1):13–20. https://doi.org/10.47738/ijiis.v4i1.73.
- 12. Hossain MZ, Akhtar MN, Ahmad RB, Rahman M. A dynamic K-means clustering for data mining. Indonesian Journal of Electrical engineering and computer science. 2019 Feb 1;13(2):521–526. https://doi.org/10.11591/ijeecs.v13.i2.pp521-526.
- 13. Jain AK. Data clustering: 50 years beyond K-means. Pattern recognition letters. 2010 Jun 1;31(8):651–666. https://doi.org/10.1016/j.patrec.2009.09.011.
- 14. Juba B, Le HS. Precision-recall versus accuracy and the role of large data sets. In Proceedings of the AAAI Conference on Artificial Intelligence 2019 Jul 17 (Vol. 33, No. 01, pp. 4039–4048). https://doi.org/10.1609/aaai.v33i01.33014039.
- 15. Competition Kaggle. Data Clustering. https://www.kaggle.com/competitions/in-class-competition-data-clustering-2023.
- 16. Kapil S, Chawla M. Performance evaluation of K-means clustering algorithm with various distance metrics. In 2016 IEEE 1st International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES) 2016 Jul 4 (pp. 1–4). IEEE. https://doi.org/10.1109/ICPEICES.2016.7853264.
- 17. Khan SS, Ahmad A. Cluster center initialization algorithm for K-means clustering. Pattern recognition letters. 2004 Aug 1;25(11):1293–1302. https://doi.org/10.1016/j.patrec.2004.04.007.
- 18. Kumar KM, Reddy AR. An efficient k-means clustering filtering algorithm using density based initial cluster centers. Information Sciences. 2017 Dec 1;418:286–301.
- 19. Kumar S. Efficient k-mean clustering algorithm for large datasets using data mining standard score normalization. Int. J. Recent Innov. Trends Comput. Commun. 2014;2(10):3161–3166.
- 20. Kwedlo W, Iwanowicz P. Using genetic algorithm for selection of initial cluster centers for the k-means method. In International Conference on Artificial Intelligence and Soft Computing 2010 Jun 13 (pp. 165–172). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-13232-2_20.
- 21. MacQueen J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1967 Jun 21 (Vol. 1, No. 14, pp. 281–297).
- 22. Mansoury M, Burke R, Mobasher B. Flatter is better: percentile transformations for recommender systems. ACM Transactions on Intelligent Systems and Technology (TIST). 2021 Mar 9;12(2):1–6. https://doi.org/10.1145/3437910.
- 23. Marom ND, Rokach L, Shmilovici A. Using the confusion matrix for improving ensemble classifiers. In 2010 IEEE 26th Convention of Electrical and Electronics Engineers in Israel 2010 Nov 17 (pp. 000555–000559). IEEE. https://doi.org/10.1109/EEEI.2010.5662159.
- 24. Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika. 1985 Jun;50:159–79. https://doi.org/10.1007/BF02294245.
- 25. Muhammed LA. Role of data normalization in k-means algorithm results. In AIP Conference Proceedings 2023 Mar 29 (Vol. 2591, No. 1). AIP Publishing. https://doi.org/10.1063/5.0119267.
- 26. Ozsahin DU, Mustapha MT, Mubarak AS, Ameen ZS, Uzun B. Impact of feature scaling on machine learning models for the diagnosis of diabetes. In 2022 International Conference on Artificial Intelligence in Everything (AIE) 2022 Aug 2 (pp. 87–94). IEEE. https://doi.org/10.1109/AIE57029.2022.00024.
- 27. Patel VR, Mehta RG. Performance analysis of MK-means clustering algorithm with normalization approach. In 2011 World Congress on Information and Communication Technologies 2011 Dec 11 (pp. 974–979). IEEE. https://doi.org/10.1109/WICT.2011.6141380.
- 28. Patro SG, Sahu KK. Normalization: A preprocessing stage. arXiv preprint arXiv:1503.06462. 2015 Mar 19. https://doi.org/10.48550/arXiv.1503.06462.
- 29. Peng X, Zhou C, Hepburn DM, Judd MD, Siew WH. Application of K-Means method to pattern recognition in on-line cable partial discharge monitoring. IEEE Transactions on Dielectrics and Electrical Insulation. 2013 May 24;20(3):754–61. https://doi.org/10.1109/TDEI.2013.6518945.
- 30. Rahmani MK, Pal N, Arora K. Clustering of image data using K-means and fuzzy K-means. International Journal of Advanced Computer Science and Applications. 2014;5(7).
- 31. Schaffer C. Selecting a classification method by cross-validation. Machine learning. 1993 Oct;13:135–43. https://doi.org/10.1007/BF00993106.
- 32. Song Q, Jiang H, Liu J. Feature selection based on FDA and F-score for multi-class classification. Expert Systems with Applications. 2017 Sep 15;81:22–7. https://doi.org/10.1016/j.eswa.2017.02.049.
- 33. Sugar CA, James GM. Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association. 2003 Sep 1;98(463):750–63. https://doi.org/10.1198/016214503000000666.
- 34. Tanir D, Nuriyeva F. On selecting the initial cluster centers in the K-means algorithm. In 2017 IEEE 11th International Conference on Application of Information and Communication Technologies (AICT) 2017 Sep 20 (pp. 1–5). IEEE. https://doi.org/10.1109/ICAICT.2017.8687081.
- 35. Uppada SK. Centroid based clustering algorithms—A clarion study. International Journal of Computer Science and Information Technologies. 2014;5(6):7309–13.
- 36. Vera JF, Macías R. On the behaviour of K-means clustering of a dissimilarity matrix by means of full multidimensional scaling. Psychometrika. 2021 Jun;86(2):489–513. pmid:34008128
- 37. Wang J. Consistent selection of the number of clusters via crossvalidation. Biometrika. 2010 Dec 1;97(4):893–904. https://doi.org/10.1093/biomet/asq061.
- 38. Zubair M, Iqbal MA, Shil A, Chowdhury MJ, Moni MA, Sarker IH. An improved K-means clustering algorithm towards an efficient data-driven modeling. Annals of Data Science. 2022 Jun 25:1–20. https://doi.org/10.1007/s40745-022-00428-2.
- 39. Al Radhwani AM, Algamal ZY. Improving K-means clustering based on firefly algorithm. In Journal of Physics: Conference Series 2021 May 1 (Vol. 1897, No. 1, p. 012004). IOP Publishing. https://doi.org/10.1088/1742-6596/1897/1/012004.
- 40. Thara DK, PremaSudha BG, Xiong F. Auto-detection of epileptic seizure events using deep neural network with different feature scaling techniques. Pattern Recognition Letters. 2019 Dec 1;128:544–50. https://doi.org/10.1016/j.patrec.2019.10.029.