Principal Component Analysis(PCA) is bastion for distributed data analysis. Scalability of the data is limited when centralized data mining is taken into account. Unsupervised classification like clustering and supervised classification like other techniques needs dimensional reduction as a major part. PCA serves as a base for reducing the dimensionality of data and communication bandwidth. Considering all these issues data mining may be pricier in comparing the distributed aspects. This paper deals with the algorithmic aspects of both Homogeneous and Heterogeneous databases in distributed data mining. Tediousness arose when the heterogeneous data bases are need to be integrated where there is chances of error handling is high. This paper proposes a new algorithm to deal with heterogeneous data and error components is also taken as a part of the algorithm.
Categories and Subject Descriptors
H.2.5 [Heterogeneous Data bases]: Distributed data mining, homogeneous data bases
Distributed data mining, PCA, heterogeneous databases, homogeneous databases
Distributed Data Mining (DDM) techniques have become necessary for large and multi-scenario datasets requiring resources, which are heterogeneous and distributed. A distributed environment requires different types of data with different attributes and schemas.Furthermore, today there a deluge of data from not only from science fields but also in the field of Industry and commerce. Massive amounts of data that are being collected are often heterogeneous, geographically distributed and owned by different organizations.
There are many challenges concerning not only DDM techniques but also the infrastructure that allows efficient and fast processing, reliability, quality of service, integration, and extraction of knowledge from the mass of data.There is a persuasive demand for integration and exploitation heterogeneous biomedical information for enhanced clinical research, personalized healthcare. This can be done possible through distributed data mining. When any health care application is taken into account, the due consideration is given to analyse the data and find out the reasons or factors that leads to the prediction of particular syndrome. So to augment this there is need to develop knowledge about these.
Distributed data mining adheres to the analysis of all types of content, structure and information. Data skewness, integration of the data, Fragmentation of data, Replication of data. Above all others type of data to be considered is also necessary. Whether homogeneous or heterogeneous type of data is necessary. Enough literature is available for homogeneous data but for heterogeneous data bases very few and far are available. This paper deals with one such aspect of distributed data bases in respect to schema, integration and scalability of the data base.
DDM is data mining where the data and computation are spread over many independent sites. Typically, in a DDM environment, each site has its own data source and data mining algorithms operate on it producing local models. Each local model contains knowledge from the local data source, but could lack globally meaningful knowledge. Thus the sites need to communicate by message passing over a network, in order to keep track of the global information. DDM is thus not just about 'distributing' the centralized data mining task on several compute nodes.
1.2 Distributed Data Vs Centralised data
Repositories may contain several different data distribution schemes. These include: (1)Centralized Data Source: This is one of the simplest scenarios since the data can be thought of as residing in a single relational database, a flat file, or as unstructured data (XML). (2) Distributed Data Source: When the data is assumed to be distributed among different sites, two different scenarios can arise. (a)Horizontally Partitioned Data: The horizontal partitioning ensures that each site contains exactly the same set of attributes. (b)Vertically Partitioned Data: The vertical partitioning requires that different attributes are observed at different sites
Intrinsically distributed, complex environments have prompted a range of new data mining research challenges. In addition to data being distributed, the advent of the Internet has led to increasingly complex data, including natural language text, images, time series, sensor data, multi relational and object data types, and so on. To further complicate matters, systems need incremental or online mining tools that need not require complete remaining whenever a change is made to the underlying data . Providing these features in distributed data mining systems requires novel solutions. It's crucial to ensure scalability and interactivity as data mining infrastructure continues to grow substantially in size and complexity. Ultimately, systems must be able to hide this technological complexity from users.
II. PRINCIPAL COMPONENT ANALYSIS
Principal component analysis (PCA) is a popular technique for dimensionality reduction which, in turn, is a necessary step in classification. It constructs a representation of the data with a set of orthogonal basis vectors that are the eigenvectors of the covariance matrix generated from the data, which can also be derived from singular value decomposition. By projecting the data onto the dominant eigenvectors, the dimension of the original dataset can be reduced with little loss of information.
In PCA-relevant literature, PCA is often presented using the eigen value/eigenvector approach of the covariance matrices. But in efficient computation related to PCA, it is the singular value decomposition (SVD) of the data matrix that is used. The relationship between the Eigen-decomposition of the covariance matrix and the SVD of the data matrix itself, is presented below to make the connection. Eigen decomposition of the covariance matrix and SVD of the data matrix are used interchangeably.
Let X be the data repository with m records of dimension d (m x d). Assume the dataset is mean-centered by making E[X] = 0. A modern PCA method is based on finding the singular values and orthonormal singular vectors of the X matrix as shown in Eq. 1,
X =U ------------------- (1)
where U and V are the left and the right singular vectors of X, and S is a diagonal matrix with positive singular values, 1,??2,....??d(d = rank(X), assuming d < m), along the diagonal, arranged in descending order.
Using covariance matrix to calculate the eigenvectors, let C = E[XTX] represent the covariance matrix of X. Then the right singular vectors contained in V of Eq. 1 are the same as those normalized eigenvectors of the covariance matrix C (Eq. 2). In addition, if the nonzero eigenvalues of C are arranged in a descending order, then the kth singular value of X is equal to the square root of the kth eigenvalue of C. That is, ??k =??k2.
C = 2VT ------------ (2)
The paper presents an algorithm to calculate the global principal components from distributed databases by only transferring the eigenvectors and eigenvalues (or the singular vectors and the singular values) of the local covariance matrix instead of the original data set. We assume the dimension of the data (d) is much less than the number of data samples (m). We prove that in homogeneous databases, the global principal components derived are exactly the same as those derived from a centralized data warehouse. We also quantitatively measure the error introduced to the integrated global principal components when the databases are heterogeneous.
2.1 Global PCA for Distributed Homogeneous Databases
Global PCA algorithm assuming the distributed databases are homogeneous and the dimension of the dataset is much less than the number of data samples (d m).
Let Xmxd and Ypxd be two distributed databases of the same dimension (d). X is a matrix of m rows and d columns, and Y of p rows and d columns. The number of columns is in fact the dimension of the data samples in the database, and the number of rows is the number of data samples. Assume that m; p d. Then the matrices Xmxd and Ypxd are mean-centered, that is, E[X] = E[Y] =0, then
eig[(m-1)cov(X)+(p-1)cov(Y)] = eig[(m+ p-1)cov( )] (3)
where eig[A] denotes the eigenvectors and the eigenvalues of matrix A.
Left hand side of Eq. 3 (m-1)cov(X)+(p-1)cov(Y) = XTX +YTY
Therefore, the right hand side of Eq. 3 is also equals to
When multiple distributed databases. Let Xi represent the local database of d dimension and i the index of distributed databases where i = 1'r. The global principal components can be derived by Eq. 4,
where mi(i=1,'r) is the number of samples in the distributed database i. Eq. 4 says that the principal components of the combined matrix (the global principal components) can be derived from the distributed databases through the covariance matrices assuming the distributed databases are homogeneous. Eq. 1 and Eq. 2 that the covariance matrix can be calculated if the singular values and singular vectors are known.
Based on the above analysis, design the global principal components derivation algorithm for distributed homogeneous databases. Let Xi denote the local dataset. Xi has mi samples and is of d dimension. i = 1,.. r is the index of different dataset locations. The algorithm is carried out in two steps:
Step 1: At each local site, calculate the singular values and the singular vectors of the local database Xi using SVD method, as shown in Eq. 5.
where according to Eq. 2, the columns of Vi are the eigenvectors of the covariance matrix of Xi, and the singular values (??i = (??i1, ??i2,'. ??id)) along the diagonal of Si are the square root of the eigenvalues of the covariance matrix of Xi.
Step 2: Transfer the singular values (Si), the singular vectors (Vi), and the number of data samples mi from the local site (i) to another site ( j = i+1). That is, transfer the covariance matrix and the number of data samples from the local site to another site. Let the knowledge transfer process be carried out serially.
Reconstruct the covariance matrix based on data from site i and site j as shown in Eq. 6.
where Ci,Cj are the covariance matrices of Xi,,Xj respectively, and Ci+j is the covariance matrix of
Step 2 will continue until C1+2+'+r is generated. By using SVD method to derive the global principal components of the distributed datasets.
During this learning process, only the singular vectors, the singular values, and the number of samples in each dataset are transferred. Yet, this process generates exactly the same global principal components as if calculated from a central database. The data transferred in the proposed approach is at the order of O(rd2) compared to the transfer of the original local database which is at the order of O(rmd), where m d.
Algorithm 1: Global PCA in distributed homogeneous databases.
Data: local dataset Xi of size mixd, the number of distributed datasets,r, i = 1,' r.
Result: Global principal components VG.
following session can be done in parallel at local sites;
Mean-center Xi such that E[Xi] = O (a 1xd zero vector);
SVD: where the non-zero singular values along the diagonal of ??i form the vector ??i;
i = 1;
while i < r
do transfer Vi and ??i to where Xj=i+1 is located;
Reconstruct the covariance matrix Ci+j using Eq. 6;
increase i by 1
VG is the global principal components.
Step 2 can also be carried out hierarchically.
The proposed approach is described in detail in Algorithm 2.
3. PROPOSED ALGORITHM
Proposed algorithm for the PCA in distributed heterogeneous data bases are well discussed here.
3.1Global PCA in Distributed Heterogeneous Databases
The global PCA algorithm presented can also be used if the databases are heterogeneous, except that the global principal components are not accurate any more if we only transfer the singular values and singular vectors as shown in Eq. 7. Assume again that X and Y are two databases at two different locations. Assume the structure of X and Y are not the same. The dataset X is of dX dimension, and Y of dY dimension. X and Y are related through a public key. For example, different departments of a hospital might use different databases to record patient's information. Patient's personal data can be organized in one table and patient's diagnostic history might be stored in another table. However, patient's ID is unique which is used as a public key to link these two tables. Without loss of generality, assume X and Y have the same number of samples. That is, X is an m x dX matrix, and Y an m x dY matrix. The combined covariance matrix of Cov([ X Y]) can then be calculated as Eq. 7.
If it is assumed that data are transferred from where X locates to where Y locates, then the error introduced in the global principal components (Eq. 7) actually comes from the estimation of X since Y is the local dataset. Therefore, the key problem in deriving the global principal components from distributed heterogeneous databases is how to estimate local dataset when the computation is away from that local site.
According to SVD algorithm, X can be decomposed as Eq. 8,
where uj is the jth column vector of the left singular matrix of X, vj the jth column vector of the right singular matrix of X, and sj the jth singular value of X. The singular values are arranged in descending order, i.e., ??1 > ??2 >'.> ??j>''>??d.
Usually, the first component in Eq. 8 ( ) contains most of the information of X and thus would be a good estimation of X. In another word, besides transferring the singular vectors and singular values, transfer u1, the first column vector of the left singular matrix of X. The loss of information by estimating X using only the first component in Eq. 8 can be formulated using Eq. 9,
Therefore, the amount of data transferred among heterogeneous databases is at the order of O(rd2+m). The more uj's transferred, the more accurate the estimation to X, the more data need to be transferred as well.
Apply the above analysis to Eq. 7, we have
where approximates X. t is the number of uj's transferred. The loss of information is then calculated by
Algorithm 2: Global PCA in distributed heterogeneous databases and reduced error handlings
Data: local dataset Xi of size mixd, the number of distributed datasets,r, i = 1,' d.
Result: Global principal components Cov ([X Y]) with the following session can be done in parallel at local sites and also both globally and locally and reduction in errors
Mean-center Xi such that E[Xi] = O and =1 if error component arises
SVD: where the non-zero singular values along the diagonal of ??i form the vector ??i and also estimation component of local site.
i = 1;
while i < r
do transfer Vi and ??i to where Xj=i+1, of the local sites is located;
Reconstruct the covariance matrix Cov([X Y])using Eq. 7;
Construct and reconstruct SVD composition X using Eq.8;
increase i by 1 and vi,??i
Cov([X,Y]) is the global principal components with and error component .
the error component >X then
Recheck and reconstruct the SVD and reduce or filter the errors Else
Error component is eliminated
The proposed algorithm 2 reduces the errors to some extent comparing to the other previously proposed algorithms in specified in the literature.
2.3 Computational complexity
The computational complexity for the heterogeneous datasets when dealing with the error components are .Where n is the number of data to be transferred and m for error components. When this algorithm is applied the number of error components handling will be very high but on the time of processing the data sets it takes into account the reduced error components. To some extent the error components handling and reducing the numbers of errors and handling the same is very good in proposed algorithm2.
The proposed algorithm2 for heterogeneous data handling and also reduces the errors when dealing with massive large data sets. figure 2 specify the comparison of algorithm2 and other algorithms proposed in the run of the mill environment.
Figure 1 Comparison of proposed alg2 with others in terms of time complexity and reduced errors.
III. CONCLUSION AND FUTURE WORKS
Homogeneous databases for the dimensionality reduction were available previously. Limited Works are available for heterogeneous algorithm with the error components. Proposed algorithm2 for PCA in distributed heterogeneous databases with the error component produces dimensionality reduction for the data bases and also time reduction and reduced errors . Future work may extend for any type of distributed data mining techniques with heterogeneous databases.