
Generative adversarial local density-based unsupervised anomaly detection

  • Xinliang Li,

    Roles Conceptualization, Funding acquisition, Methodology, Validation, Writing – original draft

Affiliation Chongqing College of International Business and Economics, Chongqing, China

  • Jianmin Peng,

    Roles Data curation, Formal analysis, Funding acquisition, Investigation, Resources, Writing – review & editing

    Affiliation People’s Hospital of Xinjiang Uygur Autonomous Region, Urumqi, China

  • Wenjing Li,

    Roles Methodology, Project administration, Software, Visualization, Writing – review & editing

    Affiliation School of Information Science and Engineering, Xinjiang University, Urumqi, China

  • Zhiping Song,

    Roles Data curation, Methodology, Project administration, Resources, Supervision, Writing – review & editing

Affiliation Chongqing College of International Business and Economics, Chongqing, China

  • Xusheng Du

    Roles Conceptualization, Methodology, Writing – original draft

    duxusheng@stu.xju.edu.cn

    Affiliation School of Information Science and Engineering, Xinjiang University, Urumqi, China

Abstract

Anomaly detection is crucial in areas such as financial fraud identification, cybersecurity defense, and health monitoring, as it directly affects the accuracy and security of decision-making. Existing generative adversarial nets (GANs)-based anomaly detection methods overlook the importance of local density, limiting their effectiveness in detecting anomaly objects in complex data distributions. To address this challenge, we introduce a generative adversarial local density-based anomaly detection (GALD) method, which combines the data distribution modeling capabilities of GANs with local synthetic density analysis. This approach not only considers different data distributions but also incorporates neighborhood relationships, enhancing anomaly detection accuracy. First, by exploiting the adversarial process of GANs, including the loss function and the rarity of anomaly objects, we constrain the generator to fit primarily the probability distribution of normal objects during unsupervised training. Subsequently, a synthetic dataset is sampled from the generator, and the local synthetic density, defined as the inverse of the sum of distances between a data point and all objects in its synthetic neighborhood, is calculated. Finally, objects that show substantial density deviations from the synthetic data are classified as anomaly objects. Extensive experiments on seven real-world datasets from various domains, including medical diagnostics, industrial monitoring, and material analysis, were conducted using seven state-of-the-art anomaly detection methods as benchmarks. The GALD method achieved an average AUC of 0.874 and an accuracy of 94.34%, outperforming the second-best method by 7.2% and 6%, respectively.

1. Introduction

Anomaly detection is a critically important task in the field of data mining, widely applied in various domains such as financial fraud detection [1], cybersecurity [2], healthcare [3], and industrial monitoring [4]. The essence of anomaly detection tasks lies in accurately identifying data objects that deviate from normal behavior patterns [5]. These anomaly objects often represent crucial indicators of significant events or potential issues within systems. Therefore, the accuracy of anomaly detection is not only crucial for enhancing operational efficiency across various domains but also serves as a cornerstone for ensuring effective decision-making processes and system security [6].

As data collection capabilities continue to strengthen, methods for anomaly detection are also advancing. In the early stages, anomaly detection primarily relied on simple statistical analysis techniques to identify anomalies, such as setting thresholds or conducting basic statistical tests [7]. However, these methods often assume that data follow a specific distribution, such as a normal distribution. With the increasing complexity and diversity of data, traditional methods are gradually showing limitations [8]. The introduction of machine learning algorithms has significantly expanded the application scope of anomaly detection while enhancing detection efficiency and accuracy. For instance, support vector machines (SVM) [9], decision trees [10], and other models can effectively handle and analyze complex datasets, freeing anomaly detection from strict statistical assumptions and manual feature extraction limitations. Despite the significant advantages of machine learning methods over traditional approaches, their performance heavily relies on the quality and quantity of training data. Moreover, they may not be responsive to newly emerging anomaly types, limiting their generalization ability [11].

In recent years, the application of deep learning methods in anomaly detection has become increasingly prominent. Their primary advantage lies in their ability to handle highly complex and nonlinear data patterns, thereby enhancing the detection and generalization capabilities for anomalies [12]. However, existing deep learning methods often struggle to detect anomaly objects with complex distributions that closely resemble normal object distributions [13]. The introduction of generative adversarial networks has brought a new research paradigm to anomaly detection. GANs leverage an adversarial process between a generator and a discriminator to learn the latent distribution of data, offering significant advantages in handling complex data distributions [14]. The generator aims to create synthetic data that mimics the real data distribution, while the discriminator attempts to distinguish between real and synthetic data. This adversarial setup forces the generator to learn intricate, nonlinear features of the data, enabling GANs to approximate highly complex and non-Gaussian distributions. The continuous feedback loop between the generator and discriminator allows GANs to progressively improve their modeling of complex patterns that traditional models often struggle to capture.

However, while GANs are adept at learning and generating synthetic data that closely follows the latent normal data distribution, distinguishing between low-deviation anomaly objects and normal ones can still be challenging. This is where the calculation of local synthetic density plays a crucial role. By calculating the local synthetic density between objects in the original data and their synthetic neighbors, we can provide a more refined measure of how well each data point aligns with the synthetic data distribution generated by the GANs. Unlike simple neighborhood relationships, which often fail to differentiate anomaly objects in complex high-dimensional spaces due to their reliance on fixed distance metrics, local synthetic density offers a more nuanced analysis of how isolated or well-integrated a data point is within its local region. This approach effectively enhances the ability of GANs-based methods to identify anomalies, as it captures subtle deviations that simple neighborhood-based methods often overlook. Hence, the integration of local synthetic density with GANs allows for a more robust identification of anomalies that are less apparent under traditional neighborhood approaches. Therefore, we propose an anomaly detection method based on GANs and local synthetic density. The main contributions are summarized as follows:

(1): The proposed GALD method effectively leverages the distribution fitting capabilities of GANs, allowing the generator to learn the underlying normal data distribution, while also incorporating local density measures to better identify anomaly objects. This dual approach ensures that even subtle deviations are effectively detected, particularly when anomaly objects are hidden within complex distribution patterns.

(2): The GALD method calculates the local synthetic density by comparing the density differences between the generated synthetic data and the original data. This local density calculation provides a more precise view of the data distribution, allowing the model to capture subtle changes that indicate the presence of anomalies, thus significantly enhancing detection accuracy in high-dimensional and imbalanced datasets.

(3): In class-imbalanced anomaly detection tasks, GANs tend to prioritize fitting the majority class (normal objects) to minimize the loss under adversarial training. The GALD method exploits this by using the synthetic data generated by GANs as fake-normal objects and incorporating local synthetic density to better detect minority anomaly objects. This approach ensures a more robust and precise anomaly detection process, effectively overcoming the challenges of class imbalance.

2. Related work

This section provides an overview of existing anomaly detection methods, including classical statistical and machine learning approaches, as well as the application of deep learning and generative adversarial networks in anomaly detection. By comparing the strengths and weaknesses of these methods, it establishes a theoretical foundation and background support for subsequent research.

2.1 Classical anomaly detection methods

Classical anomaly detection methods can be broadly categorized into statistical-based, clustering-based, density-based [15], and distance-based [16] detection methods. Statistical-based methods identify anomaly objects by calculating statistical metrics such as the mean and standard deviation. However, these methods assume that data follow a specific distribution and are sensitive to extreme values, making them less effective for handling multivariate data. Clustering-based methods detect anomaly objects by partitioning data points into different clusters. For instance, the K-Means clustering method divides data into clusters, where data points not belonging to any major cluster, or belonging to small and sparse clusters, are more likely to be considered anomaly objects [17]. However, these methods are sensitive to parameters, perform poorly on high-dimensional data, and struggle with clusters of complex shapes [18]. Density-based methods identify anomaly objects by estimating the density of data points in their vicinity. The typical Local Outlier Factor (LOF) method calculates the local density for each data point, where points in low-density regions are considered potential anomaly objects or abnormal activities [19]. However, density-based methods are computationally intensive, challenging in parameter selection, and sensitive to points at the boundaries of the data distribution. Distance-based methods identify anomaly objects by calculating distances between data points. The typical K-Nearest Neighbors (KNN) method, for example, identifies anomaly objects by computing the distance between a target data point and its nearest neighbors, marking data points with unusually large distances as anomaly objects [20]. While straightforward and intuitive, these methods often perform poorly with high-dimensional data. Table 1 summarizes these classical methods.

Table 1. Classical anomaly detection methods.

The advantages and disadvantages of classical anomaly detection methods such as statistics, clustering, density and distance are summarized to help understand their applicability.

https://doi.org/10.1371/journal.pone.0315721.t001

2.2 Deep learning-based anomaly detection method

As data volumes and computational power expand, deep learning algorithms have become increasingly effective for handling high-dimensional data and complex pattern recognition. These methods build complex neural network models that autonomously extract high-dimensional features from data, facilitating anomaly detection. Key techniques in deep learning-based anomaly detection include Autoencoders [21], Convolutional Neural Networks (CNN) [22], Long Short-Term Memory networks (LSTM) [23], and Graph Neural Networks (GNN) [24].

Methods based on autoencoders leverage the ability of autoencoders to reconstruct data, demonstrating significant potential in industrial anomaly detection. However, their performance depends on the quality and quantity of the data. When the proportion of anomaly data increases or the data quality is low, the effectiveness of anomaly detection based on autoencoders tends to decrease. Additionally, anomaly detection methods based on autoencoders are particularly sensitive to the choice of threshold for reconstruction errors [25]. Convolutional Neural Networks (CNN) have been utilized in recent years for anomaly detection in time series and multidimensional data, owing to their ability to extract local features from data and identify complex patterns and anomaly objects through their multi-layered structure. However, CNN models require complex preprocessing steps to ensure data quality, which can lead to increased computational costs and training times when handling large datasets [26]. Recurrent Neural Networks (RNN) and Long Short-Term Memory networks (LSTM) are suitable for learning from long sequential data and can identify anomaly objects when data behaviors deviate from normal patterns [27]. However, the performance of these models largely depends on the setting of hyperparameters, and the network training can be unstable when dealing with long sequences of data [28]. Graph Neural Networks (GNNs) achieve effective results in anomaly detection tasks by learning the relationships between nodes and edges in graph-structured data, coupled with message-passing techniques in hidden layers. However, the training process for GNN is complex, and constructing graphs for large-scale data incurs high computational and training costs [29]. Additionally, the model’s sensitivity to changes in graph structure can impact the stability of detection results. Table 2 summarizes these deep learning-based methods.

Table 2. Deep learning-based anomaly detection method.

The capabilities and limitations of deep learning methods in dealing with complex data distributions are compared.

https://doi.org/10.1371/journal.pone.0315721.t002

2.3 GANs-based anomaly detection method

The anomaly detection method based on generative adversarial networks has received widespread attention in recent years. This approach identifies anomalous data through the adversarial training of a generator and a discriminator. The generator produces objects similar to the training data, while the discriminator evaluates the authenticity of these objects. When an input object cannot be reconstructed by the generator or is recognized as fake by the discriminator, it is considered an anomaly [30]. In this way, GANs not only generate new data objects to augment the training set but also help the model better understand the normal and anomalous patterns within the data.

TGAN (Transformer with GANs) generates synthetic data similar to real tabular data using GANs and then uses reconstruction error to detect anomaly objects, achieving high detection accuracy [31]. However, existing GANs-based anomaly detection methods also have several drawbacks. For instance, the anomaly generative adversarial network (AnoGAN) trains GANs with unlabeled data and detects outliers by comparing the differences between real and generated data [32]. However, its training process is often unstable and has limitations, especially when dealing with high-dimensional data. f-AnoGAN (fast AnoGAN) improves the efficiency and accuracy of AnoGAN by introducing a fast-training encoder network that maps data to the latent space of GANs. Although it enhances the efficiency of anomaly detection, f-AnoGAN still does not address the inherent training instability of GANs [33]. GANomaly employs an encoder-decoder-encoder structure to generate latent space representations of data and identifies outliers by comparing the differences between the original input and the reconstruction. However, its effectiveness can be limited for complex data structures, especially when there is an insufficient number of normal objects [34]. Multiple(Single)-Objective Generative Adversarial Active Learning (MO-GAAL, SO-GAAL) is an innovative anomaly detection method that introduces a multi-generator structure and an active learning strategy based on GANs. It leverages multiple generators working collaboratively to generate richly informative potential anomaly objects, significantly enhancing anomaly detection performance and robustness on complex datasets. Although MO-GAAL uses a multi-generator structure to avoid the mode collapse of a single generator, it still faces issues such as insufficient coordination among generators or imbalanced generation distributions. Additionally, its performance and effectiveness may be highly influenced by the choice of hyperparameters [35]. STEP-GAN (a step-by-step training method for multi-generator GANs) is an improved generative adversarial network approach that leverages multiple generators interacting step-by-step with a discriminator to learn different modes of the distribution of task-specific normal data, thereby simulating potential anomaly distributions and mitigating the mode collapse problem. Its limitation lies in the fact that, since the model is entirely trained on normal data, its ability to detect unseen or significantly different anomaly patterns may be limited [36]. Table 3 summarizes the GANs-based methods.

Table 3. Generative adversarial nets-based anomaly detection.

The features and challenges of different GANs-based anomaly detection methods are listed, highlighting their respective detection effectiveness and shortcomings.

https://doi.org/10.1371/journal.pone.0315721.t003

3. Methodology

Confronted with complex data distributions, existing anomaly detection methods often struggle due to their reliance on statistical assumptions and neighborhood relations. This limitation makes it difficult to effectively fit the patterns of complex distributions. To address this challenge, we propose the GALD method. As illustrated in Fig 1, GALD constrains its generator with a loss function to better fit the probability distribution of the more frequent objects in the original dataset, then compares the neighborhood density similarity between the original data and the synthetic data produced by the generator, with greater deviations indicating a higher likelihood of anomaly objects.

Fig 1. The entire structure of GALD.

The overall flow of the GALD method is shown, including the interaction of the generator and the discriminator, and how anomaly detection is achieved by local synthesis of the density.

https://doi.org/10.1371/journal.pone.0315721.g001

3.1 Unsupervised learning data distribution

In anomaly detection tasks, datasets typically exhibit two distinct distribution patterns: the majority being the distribution of normal objects and the minority being the distribution of anomaly objects. The GAN is a special type of neural network that comprises a generator module and a discriminator module. The primary function of the generator is to receive random vectors (typically drawn from a Gaussian distribution) and generate synthetic data objects consistent with the original data distribution.

When the GAN processes input data containing both normal and anomaly objects, its generator, constrained by the loss function, prioritizes learning the latent distribution of the overwhelmingly predominant normal objects in the original data, thus minimizing errors to the greatest extent. The discriminator, on the other hand, is responsible for distinguishing which of the input data come from the original data and which come from the generator's synthetic data. The model continuously updates its parameters during training, with the generator and discriminator being alternately optimized. The training objective of the GAN can be expressed by Eq (1):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$

In Eq (1), D(x) denotes the output of the discriminator D for a given object x of the original data. If x comes from the original dataset, the discriminator should output a value close to 1, indicating that it considers the object to be "real". G(z) denotes the data object synthesized by the generator. D(G(z)) is the probability assigned by the discriminator to classify the generated sample as real, with an expected value of 0. Fig 2 represents the training framework of the GANs.

Fig 2. The training framework of GANs.

Describes how the GANs model learns the distribution of original data through adversarial training between a generator and a discriminator.

https://doi.org/10.1371/journal.pone.0315721.g002

The respective training objectives of the discriminator and generator are shown in Eqs (2) and (3), where $x_i$ denotes the i-th object within the original data containing m objects, k represents the number of generated objects, and $z_j$ represents the noise input of the j-th generated object.

$$\max_D \ \frac{1}{m}\sum_{i=1}^{m} \log D(x_i) + \frac{1}{k}\sum_{j=1}^{k} \log\left(1 - D(G(z_j))\right) \tag{2}$$

$$\min_G \ \frac{1}{k}\sum_{j=1}^{k} \log\left(1 - D(G(z_j))\right) \tag{3}$$

The generator optimizes its output by minimizing the objective function in Eq (3). During this optimization, due to the dominance of normal objects, the generator predominantly learns how to generate samples resembling normal objects. Since anomaly objects are relatively rare, they have limited impact on the loss function, which leads the generator to focus primarily on the distribution characteristics of normal objects. Let the original dataset be $X = \{x_1, x_2, x_3, \ldots, x_n\} \in \mathbb{R}^{d \times n}$, where $x_i$ represents any object in X. Fig 3 illustrates the generator fitting the original data distribution during adversarial training of the GANs.

Fig 3. Generator fitting the distribution of the original dataset X.

Demonstrates the generator’s ability to fit the original data distribution by progressively generating samples that more closely match it during adversarial training.

https://doi.org/10.1371/journal.pone.0315721.g003

During the training process of GANs, the generator optimizes the generated data by minimizing the discriminator’s judgment error. Since normal objects are the majority in the dataset, the generator is primarily influenced by these normal samples during optimization, naturally tending to learn and fit the distribution of normal data. At the same time, the rarity of anomaly objects means they have a relatively small impact on the loss function, causing the generator to focus more on generating data like normal samples. Additionally, in the early stages of training, the discriminator can easily distinguish between normal and anomaly data, which forces the generator to gradually produce samples that are closer to normal data in order to fool the discriminator, thus prioritizing the learning of normal objects distribution.

In Fig 3, the blue curve represents the distribution of anomaly objects, the yellow curve represents the distribution of the synthetic data produced by the generator, and the green curve represents the distribution of normal objects. The lower part of each subplot represents the noise area Z, and the arrows indicate the mapping relationship between the data input to the generator and the original data X distribution. Fig 3A shows the noise distribution mapped by the generator at random initialization, while Fig 3B–3D show the noise distribution mapped by the generator G after continuous parameter updates.

Since normal and anomaly objects in the original dataset are samples from two different latent distributions, the generator has two distinct mapping methods for the input noise to minimize the error in Eq 3, as shown in Fig 4.

Fig 4. Generator maps noise to two potential latent distributions.

It is illustrated how the generator converts input noise into normal or anomaly objects in the latent distribution for effective anomaly detection.

https://doi.org/10.1371/journal.pone.0315721.g004

As shown in Fig 4(A), when the objects generated by the generator closely resemble normal objects, it becomes increasingly difficult for the discriminator to distinguish between the synthetic and the original data; the discriminator struggles to separate them with a simple boundary. On the contrary, Fig 4(B) depicts objects generated by the generator that align with the distribution of anomaly objects. In such a case, the discriminator finds it easier to differentiate between fake and original data. Obviously, this outcome does not align with the training objectives of the generator. Thus, in unsupervised learning of the original data distribution, the GAN primarily captures the distribution of normal objects.

Method 1 Unsupervised fitting of the data distribution

Input: Original dataset X, learning rate γ, number of iterations n, sample size m.

Output: Fake data FD

1. for t = 1:n do

2.  for each of the k discriminator steps do

3.   Sample m objects {x1, x2, …, xm} from X.

4.   Sample m noise objects {z1, z2, …, zm} from Pz(z).

5.   # Update the discriminator by ascending its stochastic gradient:

6.   $\nabla_{\theta_d} \frac{1}{m}\sum_{i=1}^{m}\left[\log D(x_i) + \log\left(1 - D(G(z_i))\right)\right]$

7.  end for

8.  Sample m noise objects {z1, z2, …, zm} from Pz(z).

9.  # Update the generator by descending its stochastic gradient:

10.  $\nabla_{\theta_g} \frac{1}{m}\sum_{i=1}^{m}\log\left(1 - D(G(z_i))\right)$

11. end for

12. return Fake data FD
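To make Method 1 concrete, the minimal PyTorch sketch below implements the alternating updates on tabular data. Only the three hidden layers and the 0.0001 learning rate follow the settings reported in Section 4.1.1; the network width, noise dimension, Adam optimizer, and the non-saturating form of the generator loss are illustrative assumptions rather than the exact GALD configuration.

```python
import torch
import torch.nn as nn

d, noise_dim, width = 9, 32, 64  # d matches the dataset's feature dimension

# Three hidden layers, as in Section 4.1.1; widths are illustrative.
G = nn.Sequential(
    nn.Linear(noise_dim, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, d))
D = nn.Sequential(
    nn.Linear(d, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

def fit_distribution(X, n_iters=2000, m=64, n_fake=1000):
    """X: (n, d) float tensor of original objects. Returns fake data FD."""
    ones, zeros = torch.ones(m, 1), torch.zeros(m, 1)
    for _ in range(n_iters):
        # Discriminator step: ascend the stochastic gradient of Eq (2).
        x = X[torch.randint(0, X.size(0), (m,))]   # sample m objects from X
        z = torch.randn(m, noise_dim)              # sample m noise objects from Pz(z)
        loss_D = bce(D(x), ones) + bce(D(G(z).detach()), zeros)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # Generator step: the common non-saturating variant of minimizing Eq (3).
        z = torch.randn(m, noise_dim)
        loss_G = bce(D(G(z)), ones)
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    with torch.no_grad():
        return G(torch.randn(n_fake, noise_dim))   # fake data FD
```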

3.2 Local synthetic density

Local synthetic density refers to the similarity in regional density between the original data and the data synthesized by the generator. In unsupervised training, the generator synthesizes representations of the normal objects found in the original data. The introduction of local synthetic density aims to address the limitations of GANs in fitting highly imbalanced data distributions. By calculating the local synthetic density, we aim to provide a more refined measure of how well each data point aligns with the synthetic data distribution generated by the GANs. This approach helps to enhance anomaly detection by capturing subtle deviations that traditional GANs-based methods may overlook, particularly in complex and imbalanced data settings.

If an original data object and its synthetic neighbor objects exhibit high similarity in terms of local density, it indicates that the object conforms well to the distribution of normal objects. Conversely, if the local density is significantly different, the object is more likely to be an anomaly object.

Definition 1: Neighborhood

Let FD represent the set of synthetic data produced by the GANs. The neighborhood of $X_i$, denoted $N_k(X_i)$, is defined as the collection of the k closest objects in FD to the data point $X_i$:

$$N_k(X_i) = \left\{FD_i \in FD \mid dist(FD_i, X_i) \le k\text{-}distance(X_i)\right\} \tag{4}$$

$FD_i$ represents any object in the synthetic data, $dist(FD_i, X_i)$ denotes the Euclidean distance between $FD_i$ and $X_i$, and $k\text{-}distance(X_i)$ represents the distance between $X_i$ and the k-th nearest object in the synthetic data. The calculation of $dist(FD_i, X_i)$ is given by Eq (5):

$$dist(FD_i, X_i) = \sqrt{\sum_{l=1}^{d}\left(FD_{i,l} - X_{i,l}\right)^2} \tag{5}$$

Definition 2: Local synthetic density (LSD)

Local synthetic density is particularly useful in anomaly detection, as anomaly objects often lie in sparse regions where the local synthetic density is low. The local synthetic density of an object $X_i$ in the original data X is calculated as the reciprocal of the sum of distances between $X_i$ and the objects in its synthetic neighborhood $N_k(X_i)$. A higher local synthetic density suggests that the object is situated near the central region of the synthetic data, while a lower density indicates that the object is positioned further from the center. The local synthetic density is calculated as in Eq (6):

$$LSD(X_i) = \frac{1}{\sum_{o \in N_k(X_i)} dist(X_i, o)} \tag{6}$$

Fig 5 illustrates the principle of calculating local density. The generator of the GANs first fits the distribution of the original data, generating fake data FD with objects distributed similarly to normal objects. Nk(Xi) represents the neighborhood set composed of the k-nearest neighbors of Xi.

Fig 5. Local synthetic density.

The process of how to calculate the local synthetic density between the original and generated data is shown.

https://doi.org/10.1371/journal.pone.0315721.g005

Method 2 Local synthetic density

Input: Original dataset X, fake data FD, the number of nearest neighbors k

Output: Local synthetic density LSD

1. for i = 1:n do

2.  for j = 1:|FD| do

3.   Compute $dist(FD_j, X_i)$ according to Eq (5).

4.  end for

5.  # Construct the neighborhood $N_k(X_i)$ from the k nearest synthetic objects (Eq 4)

6.  $LSD(X_i) \leftarrow 1 \big/ \sum_{o \in N_k(X_i)} dist(X_i, o)$

7. end for

8. return Local synthetic density LSD
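A compact NumPy/scikit-learn sketch of Method 2 follows; the use of NearestNeighbors and the small epsilon guarding against a zero distance sum are assumptions of this sketch.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_synthetic_density(X, FD, k=10):
    """Sketch of Method 2 / Eq (6): LSD(X_i) is the reciprocal of the sum of
    Euclidean distances from X_i to its k nearest neighbors in the fake data FD.
    Returns the densities and the neighbor indices (the neighborhoods N_k(X_i))."""
    nn_fd = NearestNeighbors(n_neighbors=k).fit(FD)  # neighborhoods are drawn from FD
    dists, idx = nn_fd.kneighbors(X)                 # (n, k) distances and indices
    lsd = 1.0 / (dists.sum(axis=1) + 1e-12)          # epsilon guards zero distances
    return lsd, idx
```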

3.3 Anomaly factor

The anomaly factor (AF) represents the degree of deviation of an object from others; the higher the value, the more likely the object is to be an anomaly object. The anomaly factor is calculated as shown in Eq (7):

$$AF_i = \frac{\frac{1}{|N_k(x_i)|}\sum_{o \in N_k(x_i)} LSD(o)}{LSD(x_i)} \tag{7}$$

In Eq (7), $AF_i$ is the anomaly factor of the i-th object in the original data X, and $|N_k(x_i)|$ is the number of synthetic neighbors. If the average local density of an object's synthetic neighbors is similar to the object's own local density, the object is less likely to be an anomaly, and its AF value will tend to be closer to 1. If the AF value approaches 0, the object is located in a high-density area, reducing the likelihood of it being an anomaly. If the AF value is greater than 1, the object lies in a sparser area than the generated "normal objects," making it more likely to be an anomaly object.
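A short sketch of Eq (7) as written above follows. Computing lsd_FD (the LSDs of the synthetic objects themselves, taken over FD in the same way) is an assumption of this sketch, as is the helper's name.

```python
def anomaly_factor(lsd_X, lsd_FD, idx):
    """Eq (7): the mean LSD of X_i's synthetic neighbors divided by X_i's own
    LSD. AF near 1: object matches its neighborhood; AF near 0: high-density
    region; AF above 1: sparse region, a likely anomaly.
    lsd_X:  (n,) LSD of each original object
    lsd_FD: (m,) LSD of each synthetic object, computed the same way over FD
    idx:    (n, k) indices of each object's synthetic neighborhood N_k(X_i)"""
    return lsd_FD[idx].mean(axis=1) / lsd_X
```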

4. Experiments

This section provides a detailed description of the experimental setup, results, and in-depth analysis. It outlines the methodologies used for evaluating the proposed methods, the datasets employed, and the experimental procedures followed to ensure robustness and reliability.

4.1 Experimental design

The experimental design section provides an overview of the experimental design, detailing the baseline methods used for comparison, the datasets employed, and the evaluation metrics applied. These elements are essential for systematically assessing the performance and effectiveness of the proposed method.

4.1.1 Comparison methods.

To verify the effectiveness of the proposed GALD method in anomaly detection tasks, several widely validated and efficient methods from the anomaly detection field were selected for comparison. These methods and their hyperparameter settings are as follows:

  1. Autoencoder (AE), Multi-Objective Generative Adversarial Active Learning (MO-GAAL), Single-Objective Generative Adversarial Active Learning (SO-GAAL), STEP-GAN, and f-AnoGAN. The learning rates for these methods are set to 0.0001, and the hidden layers are all set to 3, ensuring that they have equivalent learning capabilities.
  2. Local Outlier Factor (LOF), a method based on local outlier factors. To ensure a fair comparison with the proposed method, we varied the k-value for LOF in the range from 1 to 100 and selected the best result.
  3. K-means, a clustering-based method. The number of clusters (k) in the k-means method requires manual setting; in this study, we varied k within the range of 1 to 20 based on the dataset used.
  4. K-Nearest Neighbors (KNN), a distance-based method. Similar to LOF, KNN considers the number of neighbors during the detection process. We searched for the optimal k-value for KNN in the range of 1 to 100.
  5. Isolation Forest (IForest), an isolation-based method. In the detection process of Isolation Forest, sampling is required for the dataset under investigation [37]. In this study, we set the sampling range to 10 to 200.

For the proposed GALD method, we applied strict parameter settings. To ensure that GALD has neural network learning capabilities comparable to the benchmark methods, the number of hidden layers is set to 3 and the learning rate to 0.0001. For the local density and anomaly factor modules, the hyperparameter k was searched in the range of 1 to 100.

4.1.2 Real-world datasets and evaluation method.

Real-world benchmark datasets are an effective way to compare the detection capabilities of different anomaly detection methods. ODDS (a widely used public database in the field of anomaly detection) provides a wealth of benchmark datasets for this purpose. In this study, we use several types of benchmark datasets from ODDS to evaluate the performance of the proposed method.

Before applying the proposed method, several preprocessing steps were carried out to ensure data quality and comparability. Specifically: (1) Deduplication. The datasets were deduplicated to remove redundant entries, ensuring that only unique samples were retained. This step helps to avoid bias introduced by repeated data and improves the reliability of the evaluation. (2) Normalization. We applied Min-Max normalization to bring all features into the range [0, 1]. This technique rescales each feature according to its minimum and maximum values using Eq (8):

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}} \tag{8}$$

where x is the original feature value, and $x_{min}$ and $x_{max}$ are the minimum and maximum values of that feature. Min-Max normalization was used because it ensures all features are on the same scale, as features with larger magnitudes could otherwise dominate the results. This step also helps improve the stability of the GAN training process by preventing any one feature from having a disproportionate influence. The datasets themselves are described in the list below.
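Both preprocessing steps can be expressed in a few lines of NumPy, as in the minimal sketch below; the epsilon guarding constant-valued features is an addition of this sketch.

```python
import numpy as np

def preprocess(X):
    """Deduplication followed by Min-Max normalization (Eq 8)."""
    X = np.unique(X, axis=0)                       # drop duplicate rows
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min + 1e-12)   # each feature scaled to [0, 1]
```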

  1. Breastw Dataset: This dataset is derived from fine needle aspiration images of breast lumps and mainly describes the morphological characteristics of cell nuclei, with 9 dimensions and a total of 683 objects, including 239 anomaly objects. The information it contains is used to analyze the nature of breast lumps.
  2. Heart Dataset: This dataset extracts features from regions of interest in heart images taken under different conditions, such as rest and stress. These features reflect the activity levels in different heart states. The dataset has 44 dimensions and a total of 267 objects, including 55 anomaly objects.
  3. Glass dataset: This dataset consists of glass fragment samples collected from crime scenes, categorized by analyzing the chemical composition of the glass. It contains 9 dimensions and a total of 213 objects, including 9 anomaly objects. The data reflects the compositional differences between various types of glass.
  4. Pima Dataset: This dataset records multiple biological characteristics related to diabetes in a Native American population from Arizona. These features can be used to study the risk of individuals developing diabetes. It contains 8-dimensional features, totaling 768 objects, with 268 anomaly objects.
  5. Inner Race, Outer Race, Ball Fault Datasets: These datasets come from the bearing fault diagnosis laboratory at Case Western Reserve University and describe three main types of bearing failure. They contain 23 dimensions, totaling 860 objects, with 60 anomaly objects. These datasets extract time-frequency domain features from bearing vibration signals, reflecting different wear conditions of the bearings, and provide specific signal characteristics related to bearing faults.

The evaluation metrics are crucial for determining the final performance of a method. We chose five commonly used metrics in the field of anomaly detection to evaluate the proposed method and the comparison methods: (i) Area Under the Curve (AUC), commonly used to measure the performance of binary classification models; a higher AUC value indicates better classification performance, especially on imbalanced datasets. (ii) Execution time, the time taken for the algorithm to complete its processing and return results; it is crucial for assessing the efficiency and real-time applicability of the method. (iii) Accuracy (ACC), the proportion of correctly classified objects (both true positives and true negatives) among all objects; it gives an overall measure of how well the method performs in terms of correct predictions. (iv) Detection rate (DR), also known as the true positive rate or recall, which measures the proportion of actual anomalies that are correctly identified by the method; a higher detection rate indicates better sensitivity in anomaly detection. (v) False alarm rate (FAR), the proportion of normal objects that are incorrectly classified as anomaly objects (false positives) out of all normal objects.
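For reproducibility, a sketch of how four of these metrics could be computed with scikit-learn is given below; the score threshold is an assumption of this sketch, and execution time is measured separately (e.g., with time.perf_counter around the detection call).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def evaluate(y_true, scores, threshold):
    """y_true: 0 = normal, 1 = anomaly; scores: per-object anomaly scores."""
    y_pred = (np.asarray(scores) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "AUC": roc_auc_score(y_true, scores),
        "ACC": (tp + tn) / (tp + tn + fp + fn),   # accuracy
        "DR":  tp / (tp + fn),                    # detection rate (recall)
        "FAR": fp / (fp + tn),                    # false alarm rate
    }
```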

4.2 Experimental results and analysis

In this section, we first present the visual detection results of the GALD method. Then, we analyze the final experimental results of the proposed method and the comparison methods based on five evaluation metrics. Finally, we provide an analysis and summary of the underlying patterns revealed by the experimental results.

Observing Fig 6, several notable findings about GALD's detection of anomaly objects are summarized as follows:

(i) Across the seven datasets mentioned above, there is no clear boundary between normal and anomaly objects. A few anomaly objects are located around or within the region of normal objects. The GALD method combines the ability to fit the data distribution with generative adversarial networks and the density of objects with their neighbors. This comprehensive approach allows it to accurately detect even mildly deviated anomaly objects.

(ii) In datasets like breastw, pima, inner race fault, outer race fault, ball fault, etc., GALD demonstrates highly competitive detection performance. This is because the distribution pattern of normal objects in these datasets is relatively easy to fit. In contrast, the distribution pattern of normal objects in other datasets appears more scattered, making it challenging for the GALD method to fit them accurately. For instance, in the heart dataset and glass dataset, the difficulty primarily arises from the inherent properties of these datasets. The heart dataset has a high dimensionality and complex feature relationships, making it hard to differentiate between normal and anomaly objects. Furthermore, in both the heart and glass datasets, the distributions of normal and anomalous objects have significant overlap, which reduces the distinguishable features available for effective anomaly detection. Similarly, the glass dataset has a small number of samples and subtle differences between classes, resulting in fewer clear patterns for the model to learn. These characteristics make anomaly detection in these datasets more challenging, contributing to the lower performance of the GALD method on these datasets.

(iii) Nevertheless, GALD exhibits considerable robustness and accuracy when dealing with datasets characterized by complex distribution patterns. By harnessing the powerful fitting capabilities of generative adversarial networks, GALD can identify a broader range of anomaly objects, even when these anomalies are embedded within intricate normal objects distribution patterns.

Fig 6. Visualization of GALD method experimental results.

Each subfigure corresponds to the detection results on a different dataset. The left panel shows the visualization of the original dataset, and the right panel shows the detection results of the GALD method, demonstrating its anomaly detection ability by comparison.

https://doi.org/10.1371/journal.pone.0315721.g006

Through the analysis of Table 4, the following key insights can be distilled:

i): The GALD method achieved the best AUC values in six out of the seven datasets. Regarding the generalization performance of each method, GALD shows a better generalization ability. The stronger the generalization capability, the more stable and reliable the method’s performance across different datasets and tasks. This experimental result also indicates that GALD has a higher adaptability. Even in the face of different data distributions, GALD maintains a high level of detection accuracy.

ii): The GALD method does not have an advantage in terms of execution time. This is mainly because the GALD detection process involves many matrix computations, including network training and nearest-neighbor search, which reduces its overall efficiency. However, this problem is not intolerable: although the GALD method has a relatively long execution time, the improvement in detection performance justifies the trade-off. Specifically, GALD achieved an average AUC of 0.874 across the 7 datasets, which is 7.2% higher than the next best-performing method. Additionally, GALD demonstrated an accuracy of 94.34%, an average detection rate (DR) of 75.16%, and a false alarm rate (FAR) of 3.52%. These metrics indicate that GALD outperforms existing methods on key performance indicators. In practical anomaly detection tasks where accuracy is crucial, such as healthcare or safety monitoring, the trade-off between execution time and enhanced detection capability can be deemed acceptable, as the benefit of identifying anomalies more accurately often outweighs the need for faster processing.

iii): When the GAN fits the original data distribution, due to the overwhelming number of normal objects, the generator, under the constraint of the loss function, tends to fit the distribution of normal data. This means that the generated data primarily reflects the distribution characteristics of normal objects, while anomaly objects are often overlooked. By incorporating local synthetic density, the GALD can detect distribution differences in the local regions between the original and generated data, especially since anomaly objects often exhibit low density in these local regions, making them easier to identify. This approach effectively compensates for the limitations of GANs in high-dimensional complex data distributions, making them more sensitive to anomaly objects.

Table 4. Experimental results.

Comparison results between the GALD method and other state-of-the-art anomaly detection methods on five metrics are shown to illustrate the advantages of GALD.

https://doi.org/10.1371/journal.pone.0315721.t004

5. Conclusion

In response to the difficulty existing methods have in learning the distribution patterns of the data to be detected in an unsupervised scenario, which leads to low accuracy in detecting anomaly objects, we propose an anomaly detection method based on GANs and local synthetic density. The method first utilizes GANs to learn the original data distribution; it then calculates the local density between the original data and the synthetic data; finally, the greater the deviation of an object's density from that of its synthetic neighbors, the more likely it is to be an anomaly object. This enables GALD to exploit the distribution fitting capabilities of GANs while enhancing the correlation analysis among data objects, offering a valuable new perspective for anomaly detection. Extensive experiments show that the GALD method is significantly superior to the comparison methods in terms of detection accuracy and generalization, proving its reliability. In future work, we will further improve the diversity of the synthetic data and the detection efficiency of the method. Specifically, we plan to replace exact nearest neighbor search, which is computationally expensive, with approximate strategies (e.g., KD-trees or locality-sensitive hashing) that improve efficiency while maintaining acceptable accuracy.
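As a minimal illustration of this planned direction, the sketch below swaps the exhaustive neighbor search of Method 2 for a SciPy cKDTree query; a nonzero eps permits approximate neighbors, trading a bounded loss of exactness for speed on large synthetic datasets.

```python
import numpy as np
from scipy.spatial import cKDTree

def fast_lsd(X, FD, k=10, eps=0.1):
    """KD-tree variant of the LSD computation (Eq 6). With eps > 0 the query
    may return approximate nearest neighbors, speeding up large-scale search."""
    tree = cKDTree(FD)
    dists, idx = tree.query(X, k=k, eps=eps)       # (n, k) distances and indices
    return 1.0 / (dists.sum(axis=1) + 1e-12), idx
```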

References

  1. Hilal W, Gadsden SA, Yawney J. Financial Fraud: A Review of Anomaly Detection Techniques and Recent Advances[J]. Expert Syst Appl. 2022;193: 116429.
  2. Yaseen A. The role of machine learning in network anomaly detection for cybersecurity[J]. Sage Science Review of Applied Machine Learning. 2023;6(8): 16–34.
  3. Kascenas A, Sanchez P, Schrempf P, et al. The role of noise in denoising models for anomaly detection in medical images[J]. Medical Image Analysis. 2023;90: 102963.
  4. Zipfel J, Verworner F, Fischer M, Wieland U, Kraus M, Zschech P. Anomaly detection for industrial quality assurance: A comparative evaluation of unsupervised deep learning models[J]. Comput Ind Eng. 2023;177: 109045.
  5. Lu T, Wang L, Zhao X. Review of Anomaly Detection Algorithms for Data Streams[J]. Appl Sci. 2023;13: 6353.
  6. Liu J, Xie G, Wang J, et al. Deep industrial image anomaly detection: A survey[J]. Machine Intelligence Research. 2024;21(1): 104–135.
  7. Han D, Wang Z, Chen W, Wang K, Yu R, Wang S, et al. Anomaly Detection in the Open World: Normality Shift Detection, Explanation, and Adaptation[C]. Proceedings 2023 Network and Distributed System Security Symposium. 2023, 2–4.
  8. Yang Z, Soltani I, Darve E. Anomaly Detection with Domain Adaptation[C]. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2023, 2958–2967.
  9. Bajaj NS, Patange AD, Jegadeeshwaran R, Pardeshi SS, Kulkarni KA, Ghatpande RS. Application of metaheuristic optimization based support vector machine for milling cutter health monitoring[J]. Intell Syst Appl. 2023;18: 200196.
  10. Xu H, Pang G, Wang Y, Wang Y. Deep Isolation Forest for Anomaly Detection[J]. IEEE Trans Knowl Data Eng. 2023;35: 12591–12604.
  11. Koren O, Koren M, Peretz O. A procedure for anomaly detection and analysis[J]. Eng Appl Artif Intell. 2023;117: 105503.
  12. Li G, Jung JJ. Deep learning for anomaly detection in multivariate time series: Approaches, applications, and challenges[J]. Inf Fusion. 2023;91: 93–102.
  13. Carvalho DV, Pereira EM, Cardoso JS. Machine Learning Interpretability: A Survey on Methods and Metrics[J]. Electronics. 2019;8: 832.
  14. De Souza VLT, Marques BAD, Batagelo HC, et al. A review on generative adversarial networks for image generation[J]. Computers & Graphics. 2023;114: 13–25.
  15. Fuhnwi GS, Agbaje JO, Oshinubi K, Peter OJ. An Empirical Study on Anomaly Detection Using Density-based and Representative-based Clustering Algorithms[J]. J Niger Soc Phys Sci. 2023;1364: 1–13.
  16. Souto Arias LA, Oosterlee CW, Cirillo P. AIDA: Analytic isolation and distance-based anomaly detection algorithm[J]. Pattern Recognit. 2023;141: 109607.
  17. Koko RRZ, Yassine IA, Wahed MA, Madete JK, Rushdi MA. Dynamic Construction of Outlier Detector Ensembles With Bisecting K-Means Clustering[J]. IEEE Access. 2023;11: 24431–24447.
  18. Song H, Li P, Liu H. Deep Clustering based Fair Outlier Detection[C]. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021, 1481–1489.
  19. Breunig MM, Kriegel H-P, Ng RT, Sander J. LOF: Identifying Density-Based Local Outliers[C]. Association for Computing Machinery. 2000, 93–104.
  20. Zou D, Xiang Y, Zhou T, Peng Q, Dai W, Hong Z, et al. Outlier detection and data filling based on KNN and LOF for power transformer operation data classification[J]. Energy Rep. 2023;9: 698–711.
  21. Torabi H, Mirtaheri SL, Greco S. Practical autoencoder based anomaly detection by using vector reconstruction error[J]. Cybersecurity. 2023;6(1): 1–13.
  22. Krichen M. Convolutional Neural Networks: A Survey[J]. Computers. 2023;12: 151.
  23. Guha D, Chatterjee R, Sikdar B. Anomaly Detection Using LSTM-Based Variational Autoencoder in Unsupervised Data in Power Grid[J]. IEEE Syst J. 2023;17: 4313–4323.
  24. Zheng X, Wu B, Zhang AX, Li W. Improving Robustness of GNN-based Anomaly Detection by Graph Adversarial Training[C]. 2024 ELRA Language Resource Association. 2024, 8902–8912.
  25. Tabassum T, Toker O, Khalghani MR. Cyber–physical anomaly detection for inverter-based microgrid using autoencoder neural network[J]. Appl Energy. 2024;355: 122283.
  26. Qasim Gandapur M, Verdú E. ConvGRU-CNN: Spatiotemporal Deep Learning for Real-World Anomaly Detection in Video Surveillance System[J]. Int J Interact Multimed Artif Intell. 2023;8: 88.
  27. Chander N, Upendra Kumar M. Metaheuristic feature selection with deep learning enabled cascaded recurrent neural network for anomaly detection in Industrial Internet of Things environment[J]. Clust Comput. 2023;26: 1801–1819.
  28. Acharya T, Annamalai A, Chouikha MF. Efficacy of CNN-Bidirectional LSTM Hybrid Model for Network-Based Anomaly Detection[C]. 2023 IEEE 13th Symposium on Computer Applications & Industrial Electronics. 2023, 348–353.
  29. Lyu S, Wang K, Wei Y, Liu H, Fan Q, Wang B. GNN-based Advanced Feature Integration for ICS Anomaly Detection[J]. ACM Trans Intell Syst Technol. 2023;14: 1–32.
  30. Li H, Li Y. Anomaly detection methods based on GAN: a survey[J]. Appl Intell. 2023;53: 8209–8231.
  31. Dubey SR, Singh SK. Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey. arXiv preprint arXiv:2302.08641.
  32. Singh A, Reddy P. AnoGAN for Tabular Data: A Novel Approach to Anomaly Detection. arXiv preprint arXiv:2405.03075.
  33. Sliti O, Devanne M, Kohler S, Samet N, Weber J, Cudel C. f-AnoGAN for non-destructive testing in industrial anomaly detection[C]. Sixteenth International Conference on Quality Control by Artificial Vision. 2023, 301–308.
  34. Deng X, Xiao L, Liu X, Zhang X. One-Dimensional Residual GANomaly Network-Based Deep Feature Extraction Model for Complex Industrial System Fault Detection[J]. IEEE Trans Instrum Meas. 2023;72: 1–13.
  35. Liu Y, Li Z, Zhou C, Jiang Y, Sun J, Wang M, et al. Generative Adversarial Active Learning for Unsupervised Outlier Detection. arXiv preprint arXiv:1809.10816.
  36. Adiban M, Siniscalchi SM, Salvi G. A step-by-step training method for multi generator GANs with application to anomaly detection and cybersecurity[J]. Neurocomputing. 2023;537: 296–308.
  37. Liu FT, Ting KM, Zhou ZH. Isolation-based anomaly detection[J]. ACM Trans Knowl Discov Data. 2012;6(1).