Abstract
The paper introduces a novel approach for constructing a global model utilizing multilayer perceptron (MLP) neural networks and dispersed data sources. These dispersed data are independently gathered in various local tables, each potentially containing different objects and attributes, albeit with some shared elements (objects and attributes). Our approach involves the development of local models based on these local tables imputed with some artificial objects. Subsequently, local models are aggregated using weighted techniques. Finally, the global model is retrained using some global objects. In this study, the proposed method is compared with two existing approaches from the literature—homogeneous and heterogeneous multi-model classifiers. The analysis reveals that the proposed approach consistently outperforms these existing methods across multiple evaluation criteria, including classification accuracy, balanced accuracy, F1-score, and precision. The results demonstrate that the proposed method significantly outperforms traditional ensemble classifiers and homogeneous ensembles of MLPs. Specifically, the proposed approach achieves an average classification accuracy improvement of 15% and a balanced accuracy enhancement of 12% over the baseline methods mentioned above. Moreover, in practical applications such as healthcare and smart agriculture, the model showcases superior properties by providing a single model that is easier to use and interpret. These improvements underscore the model’s robustness and adaptability, making it a valuable tool for diverse real-world applications.
Citation: Przybyła-Kasperek M, Marfo KF (2024) A multi-layer perceptron neural network for varied conditional attributes in tabular dispersed data. PLoS ONE 19(12): e0311041. https://doi.org/10.1371/journal.pone.0311041
Editor: Kalapraveen Bagadi, Vellore Institute of Technology - Amaravati Campus: VIT-AP Campus, INDIA
Received: December 10, 2023; Accepted: September 9, 2024; Published: December 2, 2024
Copyright: © 2024 Przybyła-Kasperek, Marfo. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data used in the article is publicly available from the UCI repository. The specific links are as follows: Vehicle Silhouettes: https://archive.ics.uci.edu/dataset/149/statlog+vehicle+silhouettes Dry Bean: https://archive.ics.uci.edu/dataset/602/dry+bean+dataset Sensorless Drive Diagnosis: https://archive.ics.uci.edu/dataset/325/dataset+for+sensorless+drive+diagnosis Crowd Sourced: https://archive.ics.uci.edu/dataset/400/crowdsourced+mapping.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Machine learning (ML) for dispersed data addresses the challenge of analyzing and utilizing data that is scattered across different sources, formats, and locations. This is increasingly important in the era of big data, where data is often inconsistent, heterogeneous, and subject to privacy regulations. Data collected from various sources often lack a uniform structure, in that attributes and objects might differ significantly from one data set to another. In the healthcare sector, data on patients are stored across multiple hospitals and medical facilities, each with its own data management system. These data sets might differ in structure, terminology, and format. Additionally, data protection regulations (e.g., HIPAA in the U.S., GDPR in Europe) prevent the sharing of sensitive patient information between institutions without proper safeguards. Suppose multiple hospitals are working together to develop a predictive model for identifying patients who are at high risk of sepsis. Each hospital has its own data set which includes varying attributes such as patient vitals, lab results, and medication histories. The problem is how to collaboratively train a sepsis prediction model leveraging these diverse data sets. A very important aspect is the ability to make use of inconsistent data that is available in dispersed form. However, due to the arbitrariness of attributes and objects present in local data sets, as well as data protection laws that restrict the free flow of data, one has to be meticulous when dealing with dispersed data. ML for dispersed data is crucial in leveraging the full potential of big data across domains where data is fragmented and regulated. It enables organizations to collaboratively develop sophisticated models.
To clarify, certain assumptions have been made that define the area of the problem under consideration. To begin with, there is the assumption that data are available in tabular and dispersed form. Also, data are provided by independent entities that do not want to share data—storing data on a central server with other data from other sources. For this reason, there is a set of local decision tables whose objects and conditional attributes need not satisfy any constraints—they do not have to be equal or the same, but they may have some shared attributes as well as objects. In the situation considered in the paper, we do not guarantee full confidentiality. We assume that the local entities agree to disclose information about what attributes are stored in the table—the names of these attributes and certain characteristics of the values stored in the table, such as the mean, median, minimum and maximum of the attributes in the local tables.
Researchers have been seeking solutions associated with dispersed data in domains such as federated learning and distributed learning. Federated learning puts a greater emphasis on data protection. The general approach here is to distribute an initial model from a central server to all local spaces for the model to be trained locally. Trained parameter values from all local models are then sent to the central server where some aggregations are performed to produce a global model. The global model is then sent back to the local spaces for verification—local units can accept, modify or reject the global model. Such a process is performed iteratively until some acceptable convergence metric is achieved. In [1], a detailed description can be found. For such a global model to be constructed in federated learning, the assumption of an equal set of conditional attributes present in all local tables must be satisfied. Distributed learning, on the other hand, assumes that all data are available in a centralized form, for example in a single decision table (see [2]). The division into local sets is intentional and aims to improve the quality of the model’s classification or its ability to deal with huge data. Often, the process of creating local tables in distributed learning is focused on strengthening the local classifiers by sensitizing them to difficult cases. This approach assumes full access to all data and does not necessarily guarantee any data protection.
The model proposed in this paper is different from the two domains mentioned above. Namely, the proposed model does not impose any assumptions of homogeneity on the form of data present in local spaces, while guaranteeing a certain level of data protection. By sharing only data on attribute names and general characteristics of the values stored in the tables, individual data tuples and individual raw data are protected. Also, the proposed method does not employ an iterative process to reach a consensus but rather a non-iterative algorithm that leads to the construction of a global model.
The main contribution of the paper is to propose a method that generates a global neural network model based on dispersed data. To begin, local neural networks with the same structure are trained based on local tables, where local tables have a varied and unrestricted form—no constraints on the set of objects and the set of attributes. In order to generate local networks with the same structure, it is necessary to somehow modify the local tables. The goal is to generate local networks whose input layer considers the full set of conditional attributes present in all local tables. In each of the local tables, values for missing attributes are imputed by using certain characteristics determined based on other local tables containing the missing attributes. In this way, a set of extended local tables is prepared and used to train local MLP networks. This study uses MLP networks as this is the initial take on the proposed approach; thus, it is appropriate to start with the standard neural network suitable for classification. Other types of networks, such as radial basis function neural networks as well as autoencoders, are planned to be used in future work. In the next step, the local neural networks are aggregated. In this study, two approaches of aggregating networks are considered—average and sum of the weights from the local models. Finally, the aggregated model is re-trained with a sample of data that is shared and defined for the full set of conditional attributes. In this way the final global model is constructed.
In this paper, the proposed model is investigated in terms of many variants. The following features are tested:
- the method to substitute values of missing attributes in local tables (different number of artificial objects generated based on one original object are tested),
- the number of hidden layers in MLP networks (k-hidden layer networks are tested, k ∈ {1, 2}),
- the number of neurons in the hidden layer/s (different values are tested),
- the method of aggregating local neural networks (sum and average are tested).
The proposed method is compared with other methods from the literature to establish its quality. Two approaches are adopted as baseline methods. The first is the ensemble of classifiers proposed in [3], which comprises creating three base classifiers: k-nearest neighbors, decision tree and naive Bayes (KNN, DT, NB), based on each local table. The final decision is made by voting. The second approach is a homogeneous ensemble of classifiers and consists of generating MLP networks based on each local table separately and then generating the final decision by voting. It is shown in this paper that the proposed model produces much better results than both baseline models. These differences are also confirmed with statistical tests.
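As an illustration, the final decision of both baseline ensembles is a simple majority vote over the local classifiers’ predicted labels. The following minimal sketch is hypothetical (the function name and class labels are not from the paper):

```python
from collections import Counter

def majority_vote(predictions):
    # Each local classifier contributes one predicted class label;
    # the most frequent label wins (ties broken by first occurrence).
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical local classifiers voting on one test object
print(majority_vote(["high_risk", "low_risk", "high_risk"]))  # high_risk
```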
The paper is organized as follows. In the second section, an overview of the literature is included. The third section presents the newly proposed adaptive approach. Here, a formal definition of dispersed data and description of steps of the process of building a global model based on dispersed data is included. The fourth section gives the experimental protocol and description of the data used. The fifth describes obtained results. Comparative analysis are also carried out in this section. Finally, a summary is presented in the conclusion section.
2 Related work
Artificial intelligence (AI) has transformed various sectors by integrating human-like abilities such as learning, reasoning, and perception into software systems. This advancement has enabled computers to execute tasks traditionally performed by humans. Fueled by enhancements in computational capacity, the availability of extensive data sets and the creation of state-of-the-art AI algorithms, AI applications have become widespread. Noteworthy examples include finger vein recognition [4], diabetic retinopathy detection [5], RNA Engineering [6], cancer detection [7], biomathematical challenges [8], and smart agriculture [9].
Ensemble learning deals with distributed data, which is similar to the issues in this paper. It is a very popular technique in machine learning that is employed to boost the predictive performance of learning algorithms. The underlying rationale for using this approach is to tackle problems involving data sets that are too large to handle at once [10], or situations where only a very small data set is available, in which case data sampling is necessary to obtain reasonable results [11]. Another rationale for using this approach may be to cope with the issue of identifying the right model for the considered problem [12]. To expound, rather than risking selecting the wrong model, one can use a heterogeneous ensemble learning approach. This approach also works well for problems whose solution space is quite large and which thus face the risk of getting stuck in local minima/maxima [13]. Many different approaches involving the use of neural networks to address the above mentioned problems have been proposed. Such solutions are proposed in areas such as the business field [14], malware detection [15] and audio classification [16]. However, all these approaches assume free access to data and, as a necessary condition, that all data are stored in a centralized form rather than a dispersed one.
Federated learning is another approach within distributed machine learning [1]. Different from classifier ensembles, it puts the greatest emphasis on data segregation and protection [17]. Here, the assumption is that data are available in separate sets that must not be centralized. The idea is to build local models separately and generate a global model in a central space by iteratively aggregating the local models. Neural networks are well applicable here as it is relatively simple to aggregate these models while maintaining high quality [18, 19]. There are three types of federated learning: horizontal, vertical and hybrid federated learning. The latter approach is the closest to the approach proposed in this paper; however, unlike the proposed approach, hybrid federated learning requires that different parties share the data identity information, which is a threat to the privacy of local clients [20]. Unfortunately, for the considered data sets, it is impossible to apply this approach due to the hybrid nature of the partitioning—regarding both objects and attributes—and the inability to obtain identity information about objects between dispersed data sets. Many different models are proposed in federated learning, with various aggregation methods, network types and applications being considered in the literature [21–23].
Another approach to the problem of classification based on dispersed data is to build a separate model that aggregates prediction vectors generated by independent local models. Data privacy is also preserved here as only prediction vectors are consolidated. The form of the data can be completely arbitrary in this approach but here, a global model is not generated and the algorithm is non-iterative. Instead, it generates a separate model that only aggregates the prediction results obtained by local models. The local models can be of a completely different type than the aggregation model. In the literature, one can find papers that use neural networks, decision trees or other models as the aggregation model [24–28]. Statistical as well as dynamic approaches to this issue are also proposed which also consider conflicts or compatibility of local classifiers [29–31]. However, in the present study, the approach considered is different as the goal is to determine a global model based on dispersed data.
MLPs have been key in developing neural networks and machine learning. Although more complex models like Convolutional Neural Networks (CNNs) and Transformers have emerged, recent improvements have renewed the importance and usefulness of MLPs, particularly where simplicity and efficiency are needed. Techniques such as Adam and RMSProp [32] have enhanced MLP training by dynamically adjusting learning rates, leading to faster convergence and improved generalization. Incorporating residual connections within MLPs, akin to ResNet architectures [33], has mitigated the problem of gradient vanishing, enabling the training of deeper MLP models. MLPs traditionally require large amounts of labeled data to perform effectively. Techniques such as data augmentation and transfer learning are being adapted to address this limitation [34]. Some of the techniques mentioned above (e.g., the Adam optimizer [35]) are used in this paper for MLP. But, to the best of our knowledge, MLP networks have never been used in the way that is proposed in this paper—for dispersed data with different sets of attributes using augmentation of missing attribute values.
3 Basic concepts and proposed global model
In this section, we present preliminary designations as well as a detailed discussion on the proposed method for generating a global MLP network model based on dispersed data.
3.1 Dispersed data
A necessary assumption made is that data are available in a dispersed form—separate independent predefined data sets which are free of any constraints. In real applications, independent units collect data in tabular form. In tables, both sets of conditional attributes and sets of objects do not necessarily have to be disjoint as they may share common elements.
Also, there is an assumption that a set of decision tables is given. The tables are collected independently by separate units. A set of decision tables—local tables Di = (Ui, Ai, d), i ∈ {1, …, n}, from one discipline is available, where Ui is the universe, a set of objects; Ai is a set of conditional attributes; and d is a decision attribute. Decision tables are collected independently, so both sets of objects and sets of attributes can have any form. They can have common elements between tables, but not necessarily. The only condition that must be satisfied by all local tables is the collection of data from one discipline. Formally, this is satisfied by the assumption that the same decision attribute is present in all tables.
Since different sets of attributes appear in local tables, the construction of a MLP local model based on each of the tables separately would create a set of networks with completely different structures. This is because the input layer in each neural network would be different since the feature vectors are not the same across the local tables, thus, making it impossible to aggregate local MLP networks into a single global model.
The approach proposed in this paper is completely different from previous studies as it has not been proposed in the literature until now. The steps of the approach are listed below.
- Determine a uniform MLP network structure for a set of local tables—dispersed data;
- Train a MLP network based on each local table separately;
- Aggregate the MLP networks into a single model—a global MLP network;
- Post-train the global MLP network with a sample of global data.
Fig 1 shows the general steps of building the global MLP network model from dispersed data. In the first step, there is dispersed data—local tables with different sets of conditional attributes and different sets of objects. In order to build local neural networks with the same structure (the input layer requires the most attention here), the training data in each local space is imputed so as to have the same set of attributes. This step is carried out with the help of certain characteristics calculated from local tables. It is important to emphasize that the raw data is not shared at any model construction stage. In the next step, local MLP networks are trained, after which they are aggregated to construct a global network. The final step is to re-train the global network. In the study, this is done using a validation set.
All the steps are discussed in detail in the subsequent subsections.
3.2 Determine a uniform MLP network structure for a set of local tables—local models
Since the dispersed data need not satisfy any constraints, the key in determining the structure of the MLP network is the number of neurons in the input layer. The output layer poses no problem since all local tables share the same decision attribute. The number of hidden layers as well as the number of neurons in the hidden layers are optimized experimentally. Thus, the most important challenge is to determine a common input layer. In this first study on the approach, it is proposed to unify the input layer by using all conditional attributes from the local tables. So the input vector will have the dimension determined by the number of elements in the sum (union) of the conditional attribute sets present in the local tables

card{A}, where A = A1 ∪ A2 ∪ … ∪ An,

where card{X} is the number of elements in the set X. Such a sum is not a simple concatenation of attributes. We operate on sets, and we recognize attributes by their names. So the sum of the sets skips duplicates—when one attribute appears in several tables, it appears only once in the sum. It should also be noted that such a sum does not mean summing tuples from a table, but only determining the set of names of all attributes appearing in the local tables.
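The union above can be computed from the attribute names alone, without exchanging any raw data. Below is a minimal sketch with hypothetical attribute sets:

```python
# Conditional attribute sets A_1, A_2, A_3 of three hypothetical local tables;
# attributes are identified by name only.
local_attribute_sets = [
    {"a1", "a2", "a3"},   # A_1
    {"a2", "a4"},         # A_2
    {"a3", "a4", "a5"},   # A_3
]

# The "sum" of the sets: duplicates (a2, a3, a4) are counted once.
A = set().union(*local_attribute_sets)
input_dim = len(A)  # card{A} = number of neurons in the MLP input layer

print(sorted(A), input_dim)  # ['a1', 'a2', 'a3', 'a4', 'a5'] 5
```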
Here a problem arises because local tables contain objects for which values are known only on a certain subset of the set A. The question arises of how to train the local MLP networks, with the input layer defined as above, based on a local table with such objects. Fig 2 shows the overall configuration of the MLP network—the local model used for each local decision table. Each of the local tables includes a certain number of attributes (features), but not all of them. In order to make the network structure common for all local tables, the missing values for a given local table are completed. Of course, in each local table other missing values may occur. Completion of missing values is carried out by calculating values from the local tables in which the attribute occurs. Local models are neural networks trained specifically on artificially created objects, i.e. those that have completed values on attributes that are not present in the actual given local table. So only these artificial objects are used to train the neural network; the original objects are not used. The training process for these models involves a standard neural network built using the Keras library in Python, employing backward propagation over multiple epochs and steps within each epoch. In the next section, an explanation of how this problem is solved is given.
3.3 Training a MLP network based on each local table separately
In this section, the explanation of how to train a local MLP network based on a local table is given. Let us assume that a local table Dj = (Uj, Aj, d) is given, based on which a local MLP network is to be trained with an input layer containing card{A} neurons. For an object x ∈ Uj from the local table Dj, values for attributes from the set Aj are specified, which means that for each a ∈ Aj the value a(x) is given. Thus, in order to provide an input vector to the MLP network, the values on the other attributes from the set A\Aj must be determined. Let us assume that attribute b belongs to the set A\Aj and for this attribute one has to determine the value to be completed for the object x. In the proposed approach, this value is determined based on certain statistical measures: the minimum, maximum, median and average calculated for the values of attribute b occurring in other local tables in the dispersed data. In addition, the decision class of the object x is also taken into account. These measures were chosen as the most popular, frequently used in numerous calculations, and because they characterize both the central tendency and the entire range of variation in the value of a given attribute. This paper is the first study of the approach using artificial objects. In future work, other statistical measures will be analyzed; it is planned to use quartiles and the average value offset by the standard deviation.
More strictly, let us assume that the object x has a decision value v, v ∈ Vd, where Vd is the set of values of the decision attribute d. For each of the decision tables in which the attribute b is present, the minimum, maximum, median and average are calculated for the values of the attribute b based on the objects in the decision class v. That is, for each decision table Di for which b ∈ Ai the following values are computed: min_i^v(b), max_i^v(b), avg_i^v(b) and med_i^v(b). In this way, values designated separately for each local table containing the attribute b are obtained. To determine the final value, which is completed in the object and given to the input of the neural network, one of the statistical measures (minimum, maximum, mean or median) is applied to the local values determined in the previous step. Thus, one of the four measures is used for determining local values based on local tables and one of the four measures for determining the aggregate value. In all, there are 16 possible combinations, from which one is chosen at random as the value of b. Suppose that for calculating the local values the median is drawn, and for the aggregate value the minimum is drawn; then the value on attribute b is determined as follows: b(x) = min{med_i^v(b) : i ∈ {1, …, n}, b ∈ Ai}.
This method is repeated for each of the missing attributes of object x.
In the generalized version of the above method, instead of one object, k (k ≤ 16) objects are generated by selecting k distinct values from the 16 possible values as the value of b in each of the k objects generated from the original object x. Thus, based on object x, k new objects are generated with values defined on all conditional attributes in A. This approach is also tested and the results are presented in the experimental part of the paper. Algorithm 1 presents the pseudo-code of the generalized version (in the basic version, it is enough to put k = 1), which implements this part of the model.
Algorithm 1 Pseudo-code of the algorithm generating objects from one local table used for training the local MLP network
Input: One local decision table Dj = (Uj, Aj, d) for which we determine the training set for the MLP network; the measures min_i^v(b), max_i^v(b), avg_i^v(b) and med_i^v(b) computed for each decision value v ∈ Vd and attribute b ∈ Ai based on the values stored in the table Di, for each i ∈ {1, …, n}; the set A of conditional attributes from all local tables; the parameter value k that determines how many objects are generated based on one object from table Dj.
Output: A data set used to train the MLP neural network.
foreach x ∈ Uj do
 for m = 1 to k do
  create an object x′ by assigning on the set Aj the same values as the object x has
  foreach attribute b ∈ A\Aj do
   choose a pair (choice1, choice2) from the set {MIN, MAX, AVG, MED} × {MIN, MAX, AVG, MED}
   b(x′) := choice2{choice1_i^{d(x)}(b) : i ∈ {1, …, n}, b ∈ Ai}
  end foreach
 end for
end foreach
The computational complexity of the above method depends linearly on the number of objects in the local table Dj, the value of parameter k, the number of conditional attributes card{A} and the number of local tables n in the dispersed data. More precisely, the complexity resulting from the loop is O(card{Uj} ⋅ k ⋅ card{A\Aj} ⋅ n). In the worst case, one can assume that there is only one conditional attribute in the table Dj, that values have to be computed for all other attributes, and that the missing attributes are present in all other local tables except Dj. Then the complexity is O(card{Uj} ⋅ k ⋅ (card{A} − 1) ⋅ (n − 1)). The linear complexity of the algorithm means that it can be used even for large dispersed data.
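Algorithm 1 can be sketched in a few lines of Python. The sketch below is hypothetical in its data layout (tables as dicts holding object lists, attribute sets and the decision attribute name) and in the function name; for simplicity, the pair of measures is drawn independently for every missing attribute of every copy, rather than enforcing k distinct pairs as in the paper’s generalized version:

```python
import random
import statistics

MEASURES = {
    "MIN": min,
    "MAX": max,
    "AVG": lambda vs: sum(vs) / len(vs),
    "MED": statistics.median,
}

def generate_training_objects(tables, j, all_attrs, k, rng=random):
    """For each object x in table j, create k artificial objects; every
    missing attribute b in A\\A_j receives choice2 applied to the per-table
    choice1 values of b, restricted to x's decision class."""
    Dj = tables[j]
    d = Dj["decision"]
    out = []
    for x in Dj["objects"]:
        for _ in range(k):
            artificial = {a: x[a] for a in Dj["attrs"]}
            artificial[d] = x[d]
            for b in all_attrs - Dj["attrs"]:
                c1 = rng.choice(list(MEASURES))  # local measure
                c2 = rng.choice(list(MEASURES))  # aggregate measure
                local_vals = [
                    MEASURES[c1]([o[b] for o in t["objects"] if o[d] == x[d]])
                    for t in tables
                    if b in t["attrs"] and any(o[d] == x[d] for o in t["objects"])
                ]
                artificial[b] = MEASURES[c2](local_vals)
            out.append(artificial)
    return out

# Hypothetical dispersed data: two local tables sharing the decision "class"
tables = [
    {"objects": [{"a1": 1.0, "class": 0}, {"a1": 3.0, "class": 0}],
     "attrs": {"a1"}, "decision": "class"},
    {"objects": [{"a2": 5.0, "class": 0}, {"a2": 7.0, "class": 0}],
     "attrs": {"a2"}, "decision": "class"},
]
res = generate_training_objects(tables, 0, {"a1", "a2"}, k=2,
                                rng=random.Random(0))
print(len(res))  # 4 artificial objects (2 originals x k=2)
```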
The data prepared in the above way is used in the next step for training MLP neural networks. As mentioned earlier, the input layer is defined by the set of conditional attributes from all local tables. The number of neurons in the output layer is equal to the number of decision classes. Each of the neurons determines the probability with which the test object belongs to a given decision class. In the experimental part, one or two hidden layers are considered. The number of neurons in the hidden layer is determined in proportion to the number of neurons in the input layer. Different proportions are checked, from 0.25 to 5 times the number of neurons in the input layer. In the case of two hidden layers, all combinations of the number of neurons in the hidden layers are checked such that the first layer has the number of neurons from the set {0.25 × I, 0.5 × I, 0.75 × I, 1 × I, 1.5 × I, 1.75 × I, 2 × I, 2.5 × I, 2.75 × I, 3 × I, 3.5 × I, 3.75 × I, 4 × I, 4.5 × I, 4.75 × I, 5 × I}, and the second layer has the number of neurons from the set {1 × I, 2 × I, 3 × I, 4 × I, 5 × I}, where I is the number of neurons in the input layer. For the hidden layer, the ReLU (Rectified Linear Unit) activation function is used, as it is the most popular activation function and gives very good results [36]. For the output layer, the softmax activation function is used, which is recommended when one deals with a multi-class problem [37]. In this paper, data sets containing from four to nineteen decision classes are analyzed. The neural network is trained by using the back-propagation method with a gradient descent method with an adaptive step size. It is known that the softmax layer gives good results with the Adam optimizer [35]. The Adam optimizer, proposed in [38], is one of the most popular adaptive step size methods. According to [39], the categorical cross-entropy loss gives the best results with a softmax layer. That is why the Adam optimizer and the categorical cross-entropy loss function are used in the study.
The implementation of the MLP neural network from the Keras library in Python is used. The algorithm defines a neural network with one or two hidden layers with the rectified linear unit (ReLU) activation function, where the number of neurons in the first hidden layer depends on a parameter. The softmax activation function is used in the output layer. In the compilation, the categorical cross-entropy loss function, the Adam optimizer and accuracy as the evaluation metric are used. For the two-hidden-layer approach, a second hidden layer with the ReLU activation function and a parameter-dependent number of neurons is used. In the way described above, a set of local MLP networks is obtained. The number of networks is equal to the number of local decision tables. All networks have the same structure, and this is a very important property necessary for the next step.
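The computation performed by such a network can be illustrated independently of Keras. Below is a minimal NumPy sketch of the forward pass of the architecture described above (ReLU hidden layer, softmax output); the layer sizes and random weights are hypothetical, standing in for trained parameters:

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass: ReLU in each hidden layer, softmax in the output
    layer. `weights`/`biases` are lists with one entry per layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)           # ReLU activation
    z = h @ weights[-1] + biases[-1]
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)     # softmax class probabilities

# Hypothetical sizes: card{A} = 4 inputs, one hidden layer of 2*I = 8
# neurons, 3 decision classes
rng = np.random.default_rng(0)
I, H, C = 4, 8, 3
Ws = [rng.normal(size=(I, H)), rng.normal(size=(H, C))]
bs = [np.zeros(H), np.zeros(C)]

p = mlp_forward(rng.normal(size=(5, I)), Ws, bs)
print(p.shape)  # (5, 3): one probability vector per object, rows sum to 1
```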
3.4 Aggregation of MLP networks into a single model—a global MLP network
The result of the previous stage is a set of local MLP networks which are trained and all have the same structure. Aggregation of such networks into a single global MLP model is relatively simple. The global network has exactly the same structure as each of the local networks i.e. the same number of layers and the same number of neurons in each layer. However, during aggregation, each local model may have a different impact on the construction of the global model. This influence is proportional to the quality of each local model’s classification on the training set. The method used is inspired by the second weighting system used in the AdaBoost algorithm [40].
For each local model, a classification error is estimated based on its training set (artificial objects generated using Algorithm 1). Let us denote by ei the classification error determined for the i-th local model, i ∈ {1, …, n}. Since local models are built based on a piece of data, their accuracy can be very different. It may sometimes happen that their classification error is above 0.5. In order not to eliminate such local models from the aggregation stage, as they may contain important information on specific attributes that may have a positive impact on the global model, min-max normalization to the interval [0, 0.5] is applied to all errors ei, i ∈ {1, …, n}. Afterwards, the weight ωi for each local neural network i ∈ {1, …, n} is determined according to the formula:

ωi = (1/2) ln((1 − ei)/ei)     (1)
The weights of the global model are determined by one of two approaches: in the first approach, the weights for the global network are determined by the weighted average of the corresponding weights (assigned to edges connecting exactly the same neurons) present in local MLP networks with weights ωi, i ∈ {1, …, n}. The second approach is to determine the weight for the global network as the sum of the corresponding weights from the local networks with weights ωi, i ∈ {1, …, n}. The two approaches are studied separately in the experimental part of the paper.
Fig 3 illustrates the process of aggregating local models into a global model. Since all local models share the same structure, this aggregation is relatively straightforward. Each connection between neurons in the global model corresponds to the connections in the local models. The critical aspect of this process is the determination of weights, which are based on the classification performance of the local models on their respective sets of artificial objects (training sets). The weights assigned to each local model are crucial, as they influence the global model’s configuration. Local models that perform poorly in classification (possibly due to a higher number of missing attributes and thus a weaker connection with reality, as more of their values are artificial) are given a smaller weight in shaping the global model. However, they are not entirely excluded from the aggregation and still contribute to the overall classification performance of the global model. This ensures that the global model benefits from the specialized capabilities of each local model, enhancing its overall classification quality.
The global MLP network is implemented in Python. First, the network’s structure is defined—the number of layers and the number of neurons in each layer are the same as in the local networks. The weights are then not trained but assigned as the weighted average or weighted sum of the corresponding connection weights in the local networks, taking into account the local-model weights ωi.
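A minimal sketch of this aggregation step, assuming the per-layer weight arrays are obtained from the local networks (e.g. what Keras `model.get_weights()` returns) and that all local models share one architecture:

```python
import numpy as np

def aggregate_layers(local_params, omegas, method="avg"):
    """Combine per-layer weight arrays of local MLPs into the global MLP.

    local_params: list over models; each entry is a list of numpy arrays,
    one per layer. Corresponding arrays have identical shapes because
    all local networks share the same structure.
    omegas: the local-model weights computed from classification errors.
    """
    omegas = np.asarray(omegas, dtype=float)
    global_params = []
    for layer_arrays in zip(*local_params):      # same layer across models
        stacked = np.stack(layer_arrays)         # shape: (n_models, ...)
        w = omegas.reshape((-1,) + (1,) * (stacked.ndim - 1))
        if method == "avg":
            combined = (stacked * w).sum(axis=0) / omegas.sum()  # weighted average
        else:
            combined = (stacked * w).sum(axis=0)                 # weighted sum
        global_params.append(combined)
    return global_params
```

The resulting list of arrays can then be loaded into the global network (e.g. via `model.set_weights()` in Keras) before the re-training stage.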
3.5 Re-training the global MLP network with a sample of data
The retraining process with global objects enhances the proposed model’s accuracy, generalization, and robustness. It integrates the local models into a cohesive global model: the global MLP network is re-trained using the validation set, adjusting the weights and biases of the aggregated model to fine-tune it for better performance. Retraining ensures that the global model leverages the strengths of the local models while mitigating their individual weaknesses. The validation set, a subset of the training data, is smaller than the local models’ training sets but is crucial for capturing the model’s generality; it helps fine-tune the global model, prevent overfitting, and ensure good generalization to unseen data. Importantly, the objects in the validation set must contain a global description, i.e. include the attributes/characteristics present in all local tables. The integration of local models through retraining results in a significant boost in classification accuracy, as the model benefits from the collective knowledge of all local data sets.
The last stage is to re-train the global network. The training objects needed in this step must have values on the set of all conditional attributes. In the paper, this is implemented by using a validation set. Such a validation set is much smaller than the training sets for the local models and has less influence on the final form of the global neural network. However, without this last step, the obtained quality of classification is unsatisfactory and the model fails to capture sufficient generality. For the approach generating one artificial object, the size of the validation set is about 21% of the size of a local model’s training set. In the case of generating three artificial objects, it is about 7%. In future works, it is planned to test the active learning approach [41, 42] instead. In active learning, the assumption is that the model builds its own training data or modifies the original training data.
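The re-training step can be sketched as follows. This stand-in uses sklearn's `MLPClassifier` with `partial_fit` in place of the paper's Keras implementation; the aggregated weight and bias arrays are assumed to come from the previous aggregation stage, and the number of re-training epochs is an illustrative choice.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def retrain_global(aggregated_coefs, aggregated_biases, X_val, y_val, classes):
    """Load aggregated weights into a global MLP and fine-tune it
    on the validation set (objects described on all attributes)."""
    # Hidden layer sizes are inferred from the aggregated arrays.
    hidden = tuple(w.shape[1] for w in aggregated_coefs[:-1])
    clf = MLPClassifier(hidden_layer_sizes=hidden, random_state=0)
    clf.partial_fit(X_val, y_val, classes=classes)   # initializes the arrays
    # Overwrite the freshly initialized weights with the aggregated ones.
    clf.coefs_ = [w.copy() for w in aggregated_coefs]
    clf.intercepts_ = [b.copy() for b in aggregated_biases]
    for _ in range(50):                              # a few re-training epochs
        clf.partial_fit(X_val, y_val)
    return clf
```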
After the completion of this step, the final form of the global model is obtained and is evaluated by using an independent test data set.
It should be noted that the model avoids overfitting through a series of carefully planned steps. The final stage of training involves re-training the global MLP network with a validation set. This validation set is smaller than the training sets for local models, but crucial in capturing the generality of the model. For generating one artificial object, the validation set is about 21% of the local model’s training set size; for three artificial objects, it is about 7%. This step is essential to prevent overfitting and ensure the model can generalize well to new, unseen data. Also, during the selection of optimal parameters, we aimed for the best classification accuracy with the lowest possible model complexity, involving the fewest layers and neurons. This focus on simplicity helps reduce the risk of overfitting.
4 Experimental setup
In order to assess the efficiency of the suggested model, the methodology of the experiment is described in this section. The simulation platform, parameter assignments, and criteria for measuring performance are all elaborated upon. The scheme for describing the experimental methodology, widely used in the literature [43, 44], is followed below.
4.1 Simulation platform
All simulations are conducted using the open-source software Jupyter Notebook 6.5.4 and the Anaconda 2023.09-0 (Sep 29, 2023) installer with Python 3.11.5. The proposed model is implemented using the Keras library in Python. The simulations were run on a computer with an Intel(R) Xeon(R) W-2235 CPU @ 3.80GHz processor and 32.0 GB RAM. To avoid any bias in the analysis of the results, all experiments are conducted using the same compiler, on the same computing hardware, and with the same processing capabilities.
4.2 Data set
The experimental study uses data sets available in the UC Irvine Machine Learning Repository: Vehicle Silhouettes [45], Dry Bean [46], Sensorless Drive Diagnosis [47] and Crowd Sourced [48]. The characteristics of the data sets are given in Table 1.
Each data set is originally available in non-dispersed form—all data in a single decision table. The training sets are dispersed, with different degrees of dispersion considered. Each data set is converted into five dispersed versions containing 3, 5, 7, 9 and 11 local tables, respectively. During the construction of the local tables, only a subset of attributes is assigned to each local table. The number of attributes in a local table is significantly smaller than in the original table, with some attributes repeated among tables to allow for the possibility that local tables share common attributes. The full set of objects is stored in each local table, but without identifiers. More precisely, the number of local tables is first determined (e.g. a dispersed version with 5 local tables). Then the original set of conditional attributes is divided evenly among the local tables (so that each local table has roughly the same number of attributes). In addition, it is assumed that there are common attributes between selected local tables, e.g. two common attributes between tables one and two, one common attribute between tables two and three, and so on. With these initial assumptions made, attributes are randomly distributed among the local tables. Once the attribute sets of the local tables are established, the corresponding columns of the original table are copied into the local tables. In this way, all local tables contain the same set of objects.
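The attribute-dispersion procedure described above can be sketched as follows; the attribute names, the number of shared attributes per pair of consecutive tables, and the random seed are illustrative assumptions, not values from the paper.

```python
import random

def disperse_attributes(attributes, n_tables, n_shared=1, seed=0):
    """Split a list of attribute names into n_tables overlapping subsets.

    Illustrative sketch: attributes are shuffled and dealt out roughly
    evenly, then each pair of consecutive tables additionally shares
    n_shared attributes, mirroring the construction described above.
    """
    rng = random.Random(seed)
    attrs = list(attributes)
    rng.shuffle(attrs)
    # Even split (some tables receive one extra attribute).
    base, extra = divmod(len(attrs), n_tables)
    tables, start = [], 0
    for i in range(n_tables):
        size = base + (1 if i < extra else 0)
        tables.append(attrs[start:start + size])
        start += size
    # Make consecutive tables share n_shared common attributes.
    for i in range(n_tables - 1):
        shared = rng.sample(tables[i], min(n_shared, len(tables[i])))
        for a in shared:
            if a not in tables[i + 1]:
                tables[i + 1].append(a)
    return tables
```

Copying the corresponding columns of the original table for each attribute subset then yields local tables with identical object sets but different attribute sets.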
The Sensorless data set is balanced, with each decision class containing 5319 objects. The Vehicle, Dry Bean and Crowd Sourced data sets are imbalanced (Fig 4). These data are balanced, but it is worth emphasizing that this process is carried out after the dispersion (to keep the approach as consistent with the real situation as possible). The Synthetic Minority Over-sampling Technique (SMOTE) [49] is applied to each local decision table separately, using the implementation available in the WEKA software [50]. The data considered are multiclass, so in each decision table the SMOTE method is applied to every decision class except the most dominant one. As a result, all decision classes have the same number of objects after balancing. Finally, for each of the three imbalanced original data sets, 5 dispersed versions of imbalanced data and 5 dispersed versions of balanced data are obtained. Thus, a total of 35 dispersed data sets are considered in the experimental part.
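The per-table balancing can be illustrated with a simplified SMOTE-style interpolation. The paper uses the WEKA implementation of SMOTE; the stand-in below only sketches the core idea (synthetic objects placed on the segment between a minority-class object and its nearest same-class neighbour) and is not the exact algorithm.

```python
import numpy as np

def smote_balance(X, y, seed=0):
    """Balance a multiclass local table by SMOTE-style interpolation.

    Simplified sketch: every class is oversampled up to the size of the
    majority class with synthetic points interpolated between a random
    class member and its nearest neighbour within the same class.
    """
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, float), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_out, y_out = [X], [y]
    for cls, cnt in zip(classes, counts):
        pts = X[y == cls]
        for _ in range(target - cnt):
            i = int(rng.integers(len(pts)))
            d = np.linalg.norm(pts - pts[i], axis=1)
            d[i] = np.inf                     # exclude the point itself
            j = int(d.argmin())               # nearest same-class neighbour
            lam = rng.random()                # random position on the segment
            X_out.append((pts[i] + lam * (pts[j] - pts[i]))[None])
            y_out.append(np.array([cls]))
    return np.concatenate(X_out), np.concatenate(y_out)
```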
4.3 Parameter assignments
The proposed model comprises three phases: the structure phase, the training phase, and the testing phase. The structure phase involves determining the structure of the local and global MLP models. The input layer is strictly dependent on the data set—the number of neurons equals the number of attributes present in all local tables. The same is true of the output layer—the number of neurons equals the number of decision classes. The remaining network parameters are variable and determined experimentally, as are the method of determining the values of missing attributes, the number of artificial objects used, and the method of aggregating the local networks. We conducted a comprehensive grid search to explore various configurations of the hyperparameters: different combinations of the number of hidden layers, the number of neurons in each layer, methods for aggregating local neural networks, and strategies for handling missing attributes. By systematically varying these parameters, we identified the configuration that achieved the best performance. The optimal number of hidden layers and neurons in each layer is determined through the grid search; various configurations are tested, ranging from shallow networks with fewer layers to deeper networks with more layers and neurons, and the chosen configuration provides the best trade-off between model complexity and classification accuracy. Different methods for aggregating the local neural networks are also evaluated, namely averaging and summing the weights of the local models. This systematic and thorough approach to hyperparameter optimization ensures that the model is both robust and efficient. The experiments are carried out according to the following scheme:
- Different approaches to substituting values of missing attributes in local tables are studied—one or three artificial objects are generated based on one original object in a local table.
- Different approaches to aggregating local neural networks are studied—using the average or the sum of weights.
- Different numbers of hidden layers in the local and global networks are studied—one or two hidden layers.
- Different numbers of neurons in hidden layers are studied. The number is determined in proportion to the number of neurons in the input layer. The following values are tested: for the first hidden layer {0.25, 0.5, 0.75, 1, 1.5, 1.75, 2, 2.5, 2.75, 3, 3.5, 3.75, 4, 4.5, 4.75, 5} × the number of neurons in the input layer; for the second hidden layer {1, 2, 3, 4, 5} × the number of neurons in the input layer.
So in total, 384 different experiments are conducted for the proposed method—384 different settings of the analyzed approaches (2 ⋅ 2 ⋅ 16 + 2 ⋅ 2 ⋅ 16 ⋅ 5). In the tables in Section 5, we show both the results for each parameter setting and the optimal parameter values. The optimal parameters are chosen as those that provide the best classification accuracy with the lowest possible model complexity—the lowest number of layers and the lowest number of neurons used in the model.
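The count of 384 settings can be checked directly by enumerating the grid described above:

```python
from itertools import product

# The experimental grid, exactly as listed in the scheme above.
ao_options = [1, 3]                 # artificial objects generated
agg_options = ["avg", "sum"]        # aggregation of local networks
h1_factors = [0.25, 0.5, 0.75, 1, 1.5, 1.75, 2, 2.5,
              2.75, 3, 3.5, 3.75, 4, 4.5, 4.75, 5]
h2_factors = [1, 2, 3, 4, 5]

# One-hidden-layer settings: 2 * 2 * 16 = 64.
one_hidden = list(product(ao_options, agg_options, h1_factors))
# Two-hidden-layer settings: 2 * 2 * 16 * 5 = 320.
two_hidden = list(product(ao_options, agg_options, h1_factors, h2_factors))

total = len(one_hidden) + len(two_hidden)   # 384 settings in total
```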
4.4 Formulations of the performance metrics
The quality of classification is evaluated on the test set using the classification accuracy measure (acc), i.e. the fraction of objects in the test set that are classified correctly. As mentioned in the previous section, in the final step the aggregated model is re-trained. For this, a validation set is used, containing objects that have values on all conditional attributes present in the dispersed data—attributes occurring in all local tables. The validation set is obtained by randomly dividing the original test set, in a stratified manner, into two equal parts. First, one part is used as the validation set (for the re-training process) and the second part is used to assess the quality of classification. Then the roles are reversed, with the second part acting as the validation set. Finally, both results are averaged. Each experiment is repeated three times; in the following section, all reported results are the average of these three runs.
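This evaluation protocol can be sketched as follows; `model_factory` is a hypothetical stand-in for constructing the aggregated global model before its re-training step, and re-training is represented here by an ordinary `fit` call.

```python
from sklearn.model_selection import StratifiedKFold

def evaluate_with_swap(model_factory, X_test, y_test, seed=0):
    """Split the test set in half with stratification; each half serves
    once as the validation (re-training) set and once as the part on
    which accuracy is measured. The two accuracies are averaged."""
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    accs = []
    for val_idx, eval_idx in skf.split(X_test, y_test):
        model = model_factory()
        model.fit(X_test[val_idx], y_test[val_idx])   # re-train on one half
        accs.append(model.score(X_test[eval_idx], y_test[eval_idx]))
    return sum(accs) / len(accs)
```

In the paper this whole procedure is additionally repeated three times and the results averaged.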
4.5 Reproducibility of the proposed model
The structure phase aims to prepare the local MLP neural networks with a consistent architecture. This involves identifying common attributes among the local tables and addressing any missing attributes. The next step involves supplementing the local tables with additional objects, assigning values to the missing attributes. Following this, the training phase optimizes the network weights. During the testing phase, the model’s accuracy is validated on test objects with values defined for all attributes. To assess the effectiveness of the proposed model, a comparative analysis is performed against baseline training methods. Given the inherent randomness in the structure phase—where missing values are filled and new objects are created—and in the training phase—where initial network weights are set randomly—the experiments are repeated three times and the results are averaged. The simulations are configured as follows to facilitate reproducibility. Publicly available benchmark data are used, and the process of splitting into local tables (performed only once) is done as described in Section 4.2. The local tables are then augmented with additional values on missing attributes. One setting of all parameters is selected, the experiments are performed three times, and the results are averaged. This procedure is repeated for all other parameter settings using the same predefined local tables.
4.6 Baseline methods
In the literature, there are no models dedicated to dispersed data that generate a single model based on local tables with different (though partially common) attributes. Therefore, for comparison, an intermediate approach is used which, although it does not generate a global model and does not resolve differences between attributes, generates local models and performs global classification by voting. In the paper, two approaches for building the local models are used.
- The first approach is an ensemble of homogeneous classifiers, where the base classifiers are MLP networks. It should be noted, however, that each network generated for a local table has a different structure, as the input layer differs—no unification of the input layer is done by filling in the values of missing attributes. For a single local table, an MLP network is created whose input layer has neurons corresponding exactly to the attributes occurring in that local table. To maintain transparency and integrity, the same numbers of neurons in the hidden layer are tested as for the proposed method. The number of neurons in the output layer is the same in all local models, as it equals the number of decision classes. The final decision of the ensemble is made by soft voting.
- The second approach is the method proposed in [3]. This ensemble method creates three base classifiers—k−nearest neighbors, a decision tree and a Naive Bayes classifier (KNN, DT, NB)—based on each local table. The parameter k = 3 is used for KNN, and the Gini index is used as the splitting criterion when building the decision trees. Thus, three classifiers are defined for each local table. The final decision of the ensemble is also made by soft voting.
Both approaches are implemented in the Python programming language using implementations available in the sklearn library.
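The second baseline can be sketched with sklearn's `VotingClassifier`. `GaussianNB` is assumed here, since the paper does not specify which Naive Bayes variant is used.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def local_ensemble():
    """Base classifiers for a single local table, as in the second
    baseline: 3-NN, a Gini-criterion decision tree, and Naive Bayes,
    combined by soft voting (averaged class probabilities)."""
    return VotingClassifier(
        estimators=[
            ("knn", KNeighborsClassifier(n_neighbors=3)),
            ("dt", DecisionTreeClassifier(criterion="gini")),
            ("nb", GaussianNB()),
        ],
        voting="soft",
    )
```

One such ensemble is fitted per local table; their probability outputs are then combined by soft voting across tables to obtain the global decision.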
5 Results and comparisons
The results of the experiments are shown in the tables below. Comparisons of the experimental results are made in terms of:
- The quality of classification for different numbers of artificial objects created based on one original object in a local table.
- The quality of classification for the two approaches to aggregating local neural networks: average of weights and sum of weights.
- The quality of classification for different numbers of hidden layers.
- The quality of classification for different numbers of neurons in the hidden layers.
- The quality of classification of the proposed method versus two other approaches from the literature—homogeneous and heterogeneous ensembles of classifiers.
The average classification accuracy obtained from three runs of the algorithm is presented. Tables 2–5 show the results obtained for one hidden layer, different numbers of artificial objects (one or three generated), different aggregation methods (average or sum) and different numbers of neurons in the hidden layer. For simplicity, the following designations are adopted:
- 1HL, 2HL—for one or two hidden layers,
- 1AO, 3AO—for one or three generated artificial objects,
- AVG, SUM—for the aggregation method—average and sum.
Designation I is used for the number of neurons in the input layer.
Tables 6–21 show the results obtained for two hidden layers, also with different numbers of artificial objects (one or three generated), different aggregation methods (average or sum) and different numbers of neurons in the hidden layers. The columns show the number of neurons in the first hidden layer, while the rows indicate the number of neurons in the second hidden layer. The results obtained for different data sets are divided into separate tables due to their size. In all of these tables, the best result obtained for a given number of hidden layers, number of artificial objects and aggregation method is marked in bold. Comparisons of the experimental results with respect to different factors are made in separate sections.
Designation I is used for the number of neurons in the input layer.
5.1 Comparison of classification quality for different number of artificial objects created based on one original object in local table
Table 22 shows a comparison of the classification accuracy obtained for one and three generated artificial objects at various other settings (these are the best results, presented in bold in the previous tables). For each setting and data set, the better result is marked in bold. In ninety-three cases, better results are obtained with one artificial object, and in fifty-four cases with three artificial objects. Thus, in most cases, generating just one artificial object based on an original object is enough to obtain a better result. For the statistical test [51], two dependent groups are created (1AO and 3AO), each with one hundred and forty objects. The null hypothesis H0 states that there is no significant difference in terms of the number of artificial objects used in the model. The Wilcoxon test for dependent samples confirmed the statistical significance of the differences with p = 0.0003 (so we are justified in rejecting the null hypothesis), and the medians equal 0.9 and 0.893 for groups 1AO and 3AO respectively. This confirms that using one artificial object generates, on average, better results than using three artificial objects.
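The paired test used throughout this section can be sketched with scipy; the two accuracy lists hold the matched results of the compared settings over the same experiments.

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_settings(acc_a, acc_b, alpha=0.05):
    """Paired Wilcoxon signed-rank test over matched experiment results,
    as used above to compare e.g. the 1AO and 3AO groups."""
    stat, p = wilcoxon(acc_a, acc_b)
    return {
        "p": float(p),
        "reject_H0": bool(p < alpha),
        "median_a": float(np.median(acc_a)),
        "median_b": float(np.median(acc_b)),
    }
```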
In addition, Fig 5 presents the results for all the analyzed settings and data sets, grouped by the number of artificial objects generated (individual cases are not labeled on the x-axis for clarity). In the graph, one can observe that using one artificial object indeed generates better results in many cases.
5.2 Comparison of classification quality for different approaches: Average of weights and sum of weights to aggregating local neural networks
Table 23 shows a comparison of the classification accuracy obtained for the two methods of aggregating local networks: average and sum. For each setting and data set, the better result is marked in bold. In seventy-four cases, better results are obtained for the average method, and in sixty-nine cases for the sum method. Thus, in slightly more cases, the global network obtained by averaging the weights provides better results. A statistical test is performed to confirm the significance of the differences. The null hypothesis H0 states that there is no significant difference in terms of the method for aggregating local neural networks in the model. Two dependent groups are created (AVG and SUM), each with one hundred and forty objects. The Wilcoxon test for dependent samples confirmed the statistical significance of the differences in accuracy with p = 0.04, so we are justified in rejecting the null hypothesis. The medians equal 0.902 and 0.87 for groups AVG and SUM respectively. This confirms that the average approach generates better results overall. However, it should be noted that the results depend strongly on the data set: for the Vehicle and the Dry Bean data sets the sum method clearly provides better results, whereas for the Sensorless and the Crowd Sourced data sets the average method does.
In addition, Fig 6 presents the results for all the analyzed settings and data sets (individual cases are not labeled on the x-axis for clarity). In the graph, one can observe that the average method indeed generates better results in many cases.
5.3 Comparison of classification quality for different numbers of hidden layers in the local and global networks
Table 23 also provides a comparison of the results obtained for one and two hidden layers for each data set and analyzed setting. The better results are underlined. As can be seen, in twenty-seven cases the better results are obtained when using one hidden layer, while in one hundred and seventeen cases the better results are obtained when using two hidden layers. Thus, the use of two hidden layers generates better results in most cases. For the statistical test, the null hypothesis H0 states that there is no significant difference in terms of the number of hidden layers used in the model. The Wilcoxon test for dependent samples confirmed the statistical significance of the differences in accuracy with p = 0.0001. Also, Fig 7 compares the results for all the analyzed settings and data sets for one and two hidden layers. In the graph, it is evident that networks with two hidden layers generate better results in many cases.
5.4 Comparison of classification quality of the proposed method versus other approaches
Table 24 shows all the results obtained for the proposed approach, for the different settings and all analyzed data sets. Based on the previous analyses, it is concluded that in most cases the best results are obtained using one artificial object, the sum aggregation method, and two hidden layers. Based on the summarized results in Table 24, the best approach and result is then selected for each data set (marked in bold).
Table 25 shows the best obtained accuracy for the proposed approach and two known approaches from the literature: the homogeneous ensemble of MLP network classifiers and the ensemble of classifiers (KNN, DT, NB) with soft voting, described in detail in the previous section. The best result is shown in bold. The proposed method virtually always generates the best results. Statistical tests are performed to confirm the significance of the differences in the obtained acc results. First, the values of the classification accuracy in three dependent groups (proposed method, homogeneous ensemble of MLP, and ensemble of classifiers KNN, DT, NB) are analyzed. For accuracy, the Friedman statistic is 38.98 with df = 2, p = 0.000001, and we can again reject the null hypothesis. The average ranks are as follows: proposed approach 2.86; homogeneous ensemble MLP 1.61; ensemble of classifiers (KNN, DT, NB) 1.53. The critical value of the difference between the average ranks of two methods in the Nemenyi test is 0.96. We can claim that the classification accuracy of the proposed approach is significantly better than that of all other classifiers (Fig 8). The Wilcoxon-each-pair test confirmed the significant differences between the average accuracy values for all pairs involving the proposed method, with p−values lower than 0.00002. The post-hoc Dunn-Bonferroni test also confirmed this with p = 0.000001.
Groups of methods that are not significantly different (with the level of significance at 0.05) are connected.
Additionally, comparative box-plot charts for the values of the classification accuracy and different approaches are created (Fig 9). As can be seen, the proposed approach generates by far the best quality of classification as this is confirmed by the highest positioned box plot and median.
In addition to classification accuracy, other measures that are better suited to unbalanced data and give more reliable comparisons are used. Table 26 shows the balanced accuracy for the proposed approach, the homogeneous ensemble of MLP network classifiers and the ensemble of classifiers (KNN, DT, NB) with soft voting. Balanced accuracy is calculated as the average of the sensitivity (true positive rate) over the classes in a multiclass classification problem. The best result is shown in bold. Also for balanced accuracy, the proposed method yields the best results. Statistical tests are performed to confirm the significance of the differences in the obtained bacc results. First, the values of the balanced accuracy in three dependent groups (proposed method, homogeneous ensemble of MLP, and ensemble of classifiers KNN, DT, NB) are analyzed. The Friedman statistic is 38 with df = 2, p = 0.000001, and we can again reject the null hypothesis. The average ranks are as follows: proposed approach 2.8; homogeneous ensemble MLP 1.84; ensemble of classifiers (KNN, DT, NB) 1.36. The critical value of the difference between the average ranks of two methods in the Nemenyi test is 0.96. We can claim that the balanced accuracy of the proposed approach is significantly better than that of all other classifiers (Fig 10). The Wilcoxon-each-pair test confirmed the significant differences between the average balanced accuracy values of the proposed approach and each of the two approaches from the literature, with p−values lower than 0.004. The difference in average balanced accuracy between the homogeneous ensemble of MLP networks and the ensemble of classifiers (KNN, DT, NB) is not significant. These conclusions are also confirmed graphically in Fig 11. The post-hoc Dunn-Bonferroni test also confirmed this with p = 0.0002.
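As a small illustration, balanced accuracy as defined above (the average per-class recall) can be computed directly and checked against sklearn's implementation:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def balanced_acc(y_true, y_pred):
    """Balanced accuracy: the average of the per-class recalls
    (sensitivity / true positive rate of each decision class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))
```

Unlike plain accuracy, this measure is not dominated by the majority class, which is why it is reported for the imbalanced data sets.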
Groups of methods that are not significantly different (with the level of significance at 0.05) are connected.
The F1−score values are compared next (Table 27). This measure provides a balance between precision and recall, evaluating the trade-off between making accurate positive predictions (precision) and capturing all positive instances (recall). The F1−score is a good choice when looking for a model that performs well in terms of both precision and recall. As can be seen, this time the advantage of the proposed model over those from the literature is even greater. The Friedman test confirmed a statistically significant difference in the results obtained for the considered approaches, χ2(35, 2) = 44.4, p = 0.000001. The average ranks are as follows: proposed approach 2.91; homogeneous ensemble MLP 1.63; ensemble of classifiers (KNN, DT, NB) 1.46. The critical value of the difference between the average ranks of two methods in the Nemenyi test is 0.96. We can claim that the F1−score of the proposed approach is significantly better than that of all other classifiers (Fig 12). The Wilcoxon-each-pair test confirmed the presence of significant differences in the average F1−score values between the two approaches from the literature and the proposed approach, with p-values lower than 0.00001. These findings are visually reinforced by the data presented in Fig 13. The post-hoc Dunn-Bonferroni test also confirmed this with p = 0.000001.
Groups of methods that are not significantly different (with the level of significance at 0.05) are connected.
In the last step, the precision values are compared. Precision quantifies the ability of a model to correctly identify positive instances while minimizing false positives. It is particularly important in scenarios where the cost of false positives is high, or when the positive predictions made by the model must be highly reliable. In Table 28 the results are compared, with the best score highlighted. Here, again, the proposed approach performs much better than the others. The Friedman test confirmed a statistically significant difference in the results obtained for the considered approaches, χ2(35, 2) = 32.8, p = 0.000001. As before, the Wilcoxon-each-pair test confirmed the presence of significant differences in the average precision values between the two approaches from the literature and the proposed approach, with p-values lower than 0.0001. The post-hoc Dunn-Bonferroni test confirmed that the differences between the proposed approach and the baseline methods are significant, with p = 0.000001. The average ranks are as follows: proposed approach 2.76; homogeneous ensemble MLP 1.8; ensemble of classifiers (KNN, DT, NB) 1.44. The critical value of the difference between the average ranks of two methods in the Nemenyi test is 0.96. We can claim that the precision of the proposed approach is significantly better than that of all other classifiers (Fig 14). The data presented in Fig 15 visually supports and reinforces these findings.
Groups of methods that are not significantly different (with the level of significance at 0.05) are connected.
Based on the preceding analysis, it is clearly evident that the proposed approach consistently delivers very good results. To provide a more in-depth comparative analysis and demonstrate why the proposed model is superior, let us delve into the specific examples illustrating the advantages of the proposed approach. Let us notice once again that the proposed model involves training local MLP networks based on extended local tables (where missing values are imputed), aggregating these local models (using average or sum of weights), and re-training the global model with a shared sample of data. So the result is a single model that is more interpretable and easier to use. On the other hand, ensemble of classifiers (homogeneous MLP or heterogeneous KNN, DT, NB) involves creating base classifiers for each local table. The final decision is made by voting, so we do not get one model—one interpretation. The comparative analysis revealed that the proposed model consistently outperforms the baseline methods across all evaluation criteria. But the advantage of the proposed model obtained due to increased complexity is also justified practically. In the healthcare sector, predicting high-risk patients for sepsis across multiple hospitals is crucial for timely intervention and treatment. Each hospital has its own data set with various patient attributes, some of which may be missing or incomplete. With a single model obtained from the proposed approach, it is enough to check all values on the attributes of the global model of the diagnosed patient—without having to refer to local hospitals and their databases. In smart agriculture, yield prediction is essential for effective farm management and planning. Multiple farms collect data on various attributes affecting crop yield, but these data sets can be incomplete or missing certain information. 
Also in this case, the proposed model outperforms the baseline methods in integrating diverse and incomplete data from multiple farms, improving predictive accuracy and aiding farmers in making more informed decisions. These case studies illustrate the practical benefits and reliability of the proposed model in real-world applications, such as healthcare and agriculture, where accurate predictions are crucial for improving outcomes.
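The accuracy-weighted aggregation of identically structured local networks mentioned above can be illustrated with a small numpy sketch. This is a simplified illustration in the spirit of the paper's sum and average variants, not the authors' exact procedure:

```python
import numpy as np

def aggregate_layers(local_weights, local_accuracies, mode="average"):
    """Merge same-shaped layer matrices of identically structured local
    MLPs, each network contributing proportionally to its local
    classification accuracy.  'average' normalizes the accuracy
    coefficients to sum to 1; 'sum' uses them unnormalized."""
    acc = np.array(local_accuracies, dtype=float)
    if mode == "average":
        acc = acc / acc.sum()
    n_layers = len(local_weights[0])
    return [sum(c * w[layer] for c, w in zip(acc, local_weights))
            for layer in range(n_layers)]

# Two toy local networks, each with a single 2x2 weight matrix
net_a = [np.array([[1.0, 0.0], [0.0, 1.0]])]
net_b = [np.array([[3.0, 2.0], [2.0, 3.0]])]

# Equal local accuracies -> the merged layer is the plain average
global_layer = aggregate_layers([net_a, net_b], [0.5, 0.5])[0]
```

Because all local networks share one structure, the merge is a simple per-layer linear combination; the re-training step then adjusts the merged weights on the shared global sample.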
Additionally, AUC-ROC charts were prepared to demonstrate that the proposed approach outperforms those known from the literature. Due to limited space, we do not present the graphs for all thirty-five dispersed data sets. Figs 16 and 17 show AUC-ROC graphs for the Crowd Sourced imbalanced and balanced data sets, for all versions of dispersion—with 3, 5, 7, 9, and 11 local tables (these are the data sets on which the proposed approach performed worst). Each row first displays the curve plot for the homogeneous ensemble of MLP network classifiers, followed by the ensemble of classifiers (KNN, DT, NB), and finally the proposed approach. Since the analyzed data sets are multi-class, the graphs show the ROC curve for each decision class versus the others, as well as the averaged ROC curve. It is clear that the proposed approach outperforms the other approaches. We can confidently say that for dispersed data, building a global neural network yields better results than using either heterogeneous or homogeneous ensembles of classifiers.
AUC-ROC graphs for the Crowd Sourced imbalanced data sets and all versions of dispersion: a) 3 local tables, b) 5 local tables, c) 7 local tables, d) 9 local tables, e) 11 local tables, and three different approaches: first row, homogeneous ensemble of MLP network classifiers; second row, ensemble of classifiers (KNN, DT, NB); third row, proposed approach.
AUC-ROC graphs for the Crowd Sourced balanced data sets and all versions of dispersion: a) 3 local tables, b) 5 local tables, c) 7 local tables, d) 9 local tables, e) 11 local tables, and three different approaches: first row, homogeneous ensemble of MLP network classifiers; second row, ensemble of classifiers (KNN, DT, NB); third row, proposed approach.
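One-vs-rest ROC curves of the kind plotted in Figs 16 and 17 can be produced with scikit-learn. The class scores below are invented, softmax-like outputs for a 3-class problem, used only to show the mechanics:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.preprocessing import label_binarize

# Illustrative true labels and per-class scores (rows sum to 1,
# as softmax outputs of an MLP would)
y_true = np.array([0, 1, 2, 2, 1, 0])
y_score = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
    [0.3, 0.5, 0.2],
    [0.6, 0.3, 0.1],
])

# One-vs-rest curves: each decision class against all the others
y_bin = label_binarize(y_true, classes=[0, 1, 2])
curves = {c: roc_curve(y_bin[:, c], y_score[:, c]) for c in range(3)}

# Macro-averaged AUC, as reported alongside the per-class curves
macro_auc = roc_auc_score(y_true, y_score, multi_class="ovr",
                          average="macro")
```

Each entry of `curves` holds the false-positive rates, true-positive rates, and thresholds needed to draw one per-class curve.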
6 Conclusions
This paper proposes a new method for generating a global MLP network based on dispersed data with different sets of attributes. The method generates local MLP neural networks with an identical structure from the local tables; to make this possible, artificial objects derived from the original objects are generated. In the next step, the networks are aggregated using weights proportional to the classification accuracy of the local models, with one of two proposed combination methods—sum or average. The paper shows that the proposed model produced better results than other methods known from the literature. In addition, it is verified that, on average, the best quality is achieved using only one artificial object and two hidden layers.
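A prerequisite for identically structured local networks is that every local table be extended to the full global attribute set. The sketch below, with the hypothetical helper `extend_local_table` and constant fill values, is only one plausible way to perform this extension, not necessarily the paper's generation scheme for artificial objects:

```python
import pandas as pd

def extend_local_table(local_df, global_attributes, fill_values):
    """Extend a local table to the global attribute set, imputing the
    attributes it never recorded.  fill_values maps attribute -> value
    (e.g. a global mean); constant imputation is an illustrative choice."""
    extended = local_df.reindex(columns=global_attributes)
    for col in global_attributes:
        if col not in local_df.columns:
            extended[col] = fill_values[col]
    return extended

# A local table that recorded only attributes a1 and a3
table = pd.DataFrame({"a1": [1.0, 2.0], "a3": [5.0, 7.0]})
full = extend_local_table(table, ["a1", "a2", "a3"], {"a2": 0.0})
```

After extension, every local table presents objects over the same attribute list, so the local MLPs can share one input layer and be aggregated layer by layer.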
While the proposed model demonstrates significant improvements over traditional ensemble classifiers and homogeneous ensembles of MLPs, it is important to acknowledge certain limitations that could affect its performance and applicability. The model's ability to handle missing attributes relies heavily on the quality of the imputation methods used; if the imputation process introduces biases or inaccuracies, the overall performance of the model can suffer. As the number of local tables increases, efficiently aggregating and re-training the global model may become a bottleneck. The model's performance is also sensitive to the choice of parameters, such as the number of hidden layers and neurons, the method of aggregating local networks, and the strategies for handling missing data. Identifying optimal settings requires extensive experimentation, which may not always be feasible; developing automated parameter tuning methods or adaptive algorithms could mitigate this limitation. A further limitation of the method is the need for a validation set, which must contain the combined characteristics of objects—descriptions from the perspective of all local tables.
In further work, it is planned to use conflict analysis and coalitions of local networks to generate the global model, as well as to develop a method for generating the artificial objects required in the global network's re-training stage. It is also planned to apply other neural network architectures in the proposed approach.
References
- 1. Li T., Sahu A., Talwalkar A., Smith V. Federated learning: Challenges, methods, and future directions. IEEE signal processing magazine. 2020;37(3):50–60.
- 2. Verbraeken J., Wolting M., Katzy J., Kloppenburg J., Verbelen T., Rellermeyer J. A survey on distributed machine learning. Acm computing surveys (csur). 2020;53(2):1–33.
- 3. Kurian R., Lakshmi K. An ensemble classifier for the prediction of heart disease. International Journal of Scientific Research in Computer Science. 2018;3(6):25–31.
- 4. Bilal A., Sun G., Mazhar S. Finger-vein recognition using a novel enhancement method with convolutional neural network. Journal of the Chinese Institute of Engineers. 2021;44(5):407–417.
- 5. Bilal A., Liu X., Shafiq M., Ahmed Z., Long H. NIMEQ-SACNet: A novel self-attention precision medicine model for vision-threatening diabetic retinopathy using image data. Computers in Biology and Medicine. 2024;171:108099. pmid:38364659
- 6. Feng X., Xiu Y., Long H., Wang Z., Bilal A., Yang L. Advancing single-cell RNA-seq data analysis through the fusion of multi-layer perceptron and graph neural network. Briefings in Bioinformatics. 2024;25(1):.
- 7. Mendoza J., Pedrini H. Detection and classification of lung nodules in chest X-ray images using deep convolutional neural networks. Computational Intelligence. 2020;36(2):370–401.
- 8. Bilal A., Sun G., Mazhar S., Junjie Z. Neuro-optimized numerical treatment of HIV infection model. International Journal of Biomathematics. 2021;14(05):2150033.
- 9. Bilal A., Liu X., Long H., Shafiq M., Waqar M. Increasing Crop Quality and Yield with a Machine Learning-Based Crop Monitoring System. Computers, Materials & Continua. 2023;76(2):.
- 10. Yu L., Li M. A case-based reasoning driven ensemble learning paradigm for financial distress prediction with missing data. Applied Soft Computing. 2023;137:110163.
- 11. Kang J., Ullah Z., Gwak J. MRI-based brain tumor classification using ensemble of deep features and machine learning classifiers. Sensors. 2021;21(6):2222. pmid:33810176
- 12. Sesmero M., Iglesias J., Magán E., Ledezma A., Sanchis A. Impact of the learners diversity and combination method on the generation of heterogeneous classifier ensembles. Applied Soft Computing. 2021;111:107689.
- 13.
Arora J., Agrawal U., Tiwari P., Gupta D., Khanna A. Ensemble feature selection method based on recently developed nature-inspired algorithms. In: International Conference on Innovative Computing and Communications: Proceedings of ICICC 2019, Volume 1. Springer; 2020. p. 457–470. 2020.
- 14. Yaiprasert C., Hidayanto A. AI-driven ensemble three machine learning to enhance digital marketing strategies in the food delivery business. Intelligent Systems with Applications. 2023;18:200235.
- 15. Bhat P., Behal S., Dutta K. A system call-based android malware detection approach with homogeneous & heterogeneous ensemble machine learning. Computers & Security. 2023;130:103277.
- 16.
Dinkel H., Wang Y., Yan Z., Zhang J., Wang Y. CED: Consistent ensemble distillation for audio tagging. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2024. p. 291–295. 2024.
- 17. Mothukuri V., Parizi R., Pouriyeh S., Huang Y., Dehghantanha A., Srivastava G. A survey on security and privacy of federated learning. Future Generation Computer Systems. 2021;115:619–640.
- 18.
Bazan J., Milan P., Bazan-Socha S., Wójcik K. Application of Federated Learning to Prediction of Patient Mortality in Vasculitis Disease. In: International Joint Conference on Rough Sets. Springer; 2023. p. 526–536. 2023.
- 19.
Li Z., Lin T., Shang X., Wu C. Revisiting weighted aggregation in federated learning with neural networks. In: International Conference on Machine Learning. PMLR; 2023. p. 19767–19788. 2023.
- 20. Zhu H., Zhang H., Jin Y. From federated learning to federated neural architecture search: a survey. Complex & Intelligent Systems. 2021;7:639–657.
- 21. Alazab M., Priya R. M. P., Maddikunta P., Gadekallu T., Pham Q. Federated Learning for Cybersecurity: Concepts, Challenges, and Future Directions. IEEE Trans. Ind. Informatics. 2022;18(5):3501–3509.
- 22.
Dyczkowski K., Pekala B., Szkoła J., Wilbik A. Federated learning with uncertainty on the example of a medical data. In: 2022 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). IEEE; 2022. p. 1–8. 2022.
- 23. Singhal K., Sidahmed H., Garrett Z., Wu S., Rush J., Prakash S. Federated reconstruction: Partially local federated learning. Advances in Neural Information Processing Systems. 2021;34:11220–11232.
- 24. Guo R., Shen W. A model fusion method for online state of charge and state of power co-estimation of lithium-ion batteries in electric vehicles. IEEE Transactions on Vehicular Technology. 2022;71(11):11515–11525.
- 25. Marfo K., Przybyła-Kasperek M. Radial basis function network for aggregating predictions of k-nearest neighbors local models generated based on independent data sets. Procedia Computer Science. 2022;207:3234–3243.
- 26. Moshkov M. Common Decision Trees, Rules, and Tests (Reducts) for Dispersed Decision Tables. Procedia Computer Science. 2022;207:2503–2507.
- 27.
Przybyła-Kasperek M., Aning S. Bagging and single decision tree approaches to dispersed data. In: Computational Science–ICCS 2021: 21st International Conference, Krakow, Poland, June 16–18, 2021, Proceedings, Part III. Springer; 2021. p. 420–427. 2021.
- 28. Przybyła-Kasperek M., Marfo K. Neural network used for the fusion of predictions obtained by the K-nearest neighbors algorithm based on independent data sources. Entropy. 2021;23(12):1568. pmid:34945874
- 29. Czarnowski I. Weighted Ensemble with one-class Classification and Over-sampling and Instance selection (WECOI): An approach for learning from imbalanced data streams. Journal of Computational Science. 2022;61:101614.
- 30. Przybyła-Kasperek M. The power of agents in a dispersed system–The Shapley-Shubik power index. Journal of Parallel and Distributed Computing. 2021;157:105–124.
- 31. Przybyła-Kasperek M., Kusztal K. New Classification Method for Independent Data Sources Using Pawlak Conflict Model and Decision Trees. Entropy. 2022;24(11):1604. pmid:36359694
- 32. Elshamy R., Abu-Elnasr O., Elhoseny M., Elmougy S. Improving the efficiency of RMSProp optimizer by utilizing Nestrove in deep learning. Scientific Reports. 2023;13(1):8814. pmid:37258633
- 33. Stephen A., Punitha A., Chandrasekar A. Designing self attention-based ResNet architecture for rice leaf disease classification. Neural Computing and Applications. 2023;35(9):6737–6751.
- 34. Qureshi A., Roos T. Transfer learning with ensembles of deep neural networks for skin cancer detection in imbalanced data sets. Neural Processing Letters. 2023;55(4):4461–4479.
- 35.
Bishop C. Pattern recognition and machine learning, 5th Edition. Information science and statistics. Springer 2007.
- 36.
Glorot X., Bordes A., Bengio Y. Deep Sparse Rectifier Neural Networks. In: Gordon GJ, Dunson DB, Dudik M, editors. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011. vol. 15 of JMLR Proceedings. JMLR.org; p. 315–323. 2011.
- 37.
Li X., Li X., Pan D., Zhu D. On the learning property of logistic and softmax losses for deep neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34; p. 4739–4746. 2020.
- 38.
Kingma D., Ba J. Adam: A Method for Stochastic Optimization. In: Bengio Y, LeCun Y, editors. 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings 2015.
- 39.
Mannor S., Peleg D., Rubinstein R. The cross entropy method for classification. In: Raedt LD, Wrobel S, editors. Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005. vol. 119 of ACM International Conference Proceeding Series. ACM; 2005. p. 561–568. 2005.
- 40. Schapire R. Explaining adaboost. Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik. 2013;:37–52.
- 41.
Hemachandra A., Dai Z., Singh J., Ng S., Low B. Training-free neural active learning with initialization-robustness guarantees. In: International Conference on Machine Learning. PMLR; 2023. p. 12931–12971 2023.
- 42.
Saran A., Yousefi S., Krishnamurthy A., Langford J., Ash J. Streaming active learning with deep neural networks. In: International Conference on Machine Learning. PMLR; 2023. p. 30005–30021 2023.
- 43. Zamri N., Azhar S., Mansor M., Alway A., Kasihmuddin M. Weighted Random k Satisfiability for k = 1,2 (r2SAT) in Discrete Hopfield Neural Network. Applied Soft Computing. 2022;126:109312.
- 44. Zamri N., Azhar S., Sidik S., Mansor M., Kasihmuddin M., Pakruddin S., et al. Multi-discrete genetic algorithm in hopfield neural network with weighted random k satisfiability. Neural Computing and Applications. 2022;34(21):19283–19311.
- 45.
Siebert J. Vehicle recognition using rule based methods. Turing Institute Research Memorandum. 1987;TIRM-87-0.18:.
- 46. Koklu M., Özkan I. Multiclass classification of dry beans using computer vision and machine learning techniques. Comput. Electron. Agric. 2020;174:105507.
- 47.
Bator M., Wissel C., Dicks A., Lohweg V. Feature Extraction for a Conditioning Monitoring System in a Bottling Process. In: 23rd IEEE International Conference on Emerging Technologies and Factory Automation, ETFA 2018, Torino, Italy, September 4-7, 2018. IEEE; 2018. p. 1201–1204. 2018.
- 48.
Johnson B. Crowdsourced Mapping. UCI Machine Learning Repository, 2016.
- 49. Chawla N., Bowyer K., Hall L., Kegelmeyer W. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002;16:321–357.
- 50.
Russell I., Markov Z. An introduction to the Weka data mining system. 2017.
- 51. Zamri N., Mansor M., Kasihmuddin M., Sidik S., Alway A., Romli N., Guo Y., et al. A modified reverse-based analysis logic mining model with Weighted Random 2 Satisfiability logic in Discrete Hopfield Neural Network and multi-objective training of Modified Niched Genetic Algorithm. Expert Systems with Applications. 2024;240:122307.