Abstract. Association Rule mining can be applied in the field of bioinformatics for identification of co-occurrence between various biological elements such as genes and protein. In bioinformatics protein-protein interaction network provides useful information regarding the functions of protein. Association analysis has been used for identification of frequently occurring interactions among the proteins in the network for predicting functions of the proteins. As the amount of data are increasing exponentially, parallel implementation of the Association analysis for identification of co-occurrences between proteins in protein-protein interaction network will be more efficient, fast and scalable. In this paper we proposed an efficient framework for association analysis of frequently occurring pattern in the protein-protein interaction network. The algorithm has been parallelized using Hadoop software. The performance view of the sequential and the parallel version has been depicted in function flow graph and it shows that the parallel version is more effective than the sequential one.

Keywords: Association Rule mining. Hadoop, Map Reduce. Protein-protein interaction, Hyperclique pattern

Introduction

Protein is an important molecule in cell, which performs several important biological functions. Protein always interacts with other protein to accomplish some biological function. These interactions of proteins are categorized into two types, namely physical and genetic interactions. Protein-protein interaction (PPI) [1] network is constructed to depict the physical interactions among the proteins. This PPI network is useful for predicting functions of protein. Among the several approaches used for predicting protein functions from PPI network ,association analysis in data mining are used for detection of frequently occurring sets of interactions among the proteins in the PPI network.

Association Rule mining (ARM) [2] is an important task of data mining falls under the category of unsupervised learning method. The main aim of ARM is to find out frequent patterns among data stored in an already existing database and generating association rules from them. ARM is used for market basket analysis where there is a transactional database which consists of set of transactions and a set of items purchased by the customers. There are several ARM algorithms. Among them Eclat algorithm[3] uses the binary format of the database where 1 represent that the item is purchased by a customer and 0 represent non purchase of the item in case of a particular transactions. These algorithms are also used for frequent pattern mining. There are various applications of ARM algorithm. In the field of bioinformatics association analysis is used for the analyzing PPI network.

A PPI network is an undirected graph where each node is labeled as protein and the edges are depicted as the pair wise interaction among the proteins. Association analysis is used for the purpose of identification of frequently occurring set of protein-protein interactions in PPI network. These subgraphs of PPI network is considered as functional module from which functions of the protein can be predicted. Pandey et.al[4] has proposed an association analysis based approach for removing noise from PPI network. This approach use support and h-confidence measure for selecting hyperclique pattern from the PPI data.

The remaining part of the article is organized as follows: Section II discusses about the existing work. The proposed approach has been figured out in Section III. Its analysis using function flow graph has been discussed in Section IV and it shows that a parallel implementation is much more effective than the sequential one. The final section concludes the article.

Related Work

Protein-protein interaction (PPI) network is represented as an undirected graph where the nodes denote the protein and the edges denotes the interaction among them [5].There is various approaches for analyzing a PPI interaction network. Association analysis based approach is used for analysis of frequently occurring pattern in a PPI network which can be used for prediction of protein functions. Pandey et.al has proposed an association based approach. This approach uses various types of frequent patterns and parameters which are discussed in the next subsection.

2.1 Representation of Traditional frequent pattern

Frequent pattern mining is the process of mining data in a set of items or some patterns from a large database. The resultant set supports the minimum support threshold. A frequent pattern is a pattern that occurs frequently in a dataset. Association rule mining is defined as to find out association rules that satisfy the predefined minimum support and confidence from a given data base. If an item set is said to be frequent, that item set supports the minimum support and confidence. A frequent item set should appear in all the transaction of that data base. The dataset used for this purpose is a binary matrix where the row represent the items purchased and the column represent the transactions. The entries in the matix is either 1 or 0.If the item is purchased in a particular transaction the corresponding entry is 1 otherwise it is 0.The support threshold has an antimonotonic property i.e the support of an itemset cannot be more than any of its superset in a given dataset. The density of the matrix depends on the proper choice of support threshold.

2.2 Hyperclique pattern

This approach uses a frequent pattern called hyperclique pattern [6] which contains items that are strongly associated with each other over the supporting transactions, and sparsely over the rest of the transactions. In traditional frequent pattern mining there is a problem of cross-support pattern which arises if the support value is set very low. This may cause difficulty to mine all frequent patterns because the number of frequent patterns may increase at any moment. This relates a high frequency item to a low frequency item which is the noise in case of PPI network. To overcome this problem, hyperclique pattern is used where an antimonotonic association measure called h-confidence (hconf) [6] was defined.

Thus,

hconf (X) =supp(i1 , i2 , . . . , ik )/max[supp(i1 ), supp(i2 ), . . . , supp(ik ), (1)

where supp(ij) is the support of an item ij in an itemset.

The data items which satisfies hconf ‘ ??, (where ?? is specified by the user) is a hyperclique pattern.

2.3 Algorithm for association analysis of PPI network

Generally PPI network consists of unweighted graphs. Pandey et.al proposed an association based analysis for PPI network based on the h-confidence of hyperclique pattern, where the network is considered to be a weighted one. In the algorithm the PPI network is represented by binary adjacency matrix. The h-confidence index is used for estimating the common neighborhood similarity of two proteins. In case of PPI network for unweighted graph, the h-confidence is defined as

hconf(P1,P2)= hconf(P1,P2)=min'(|NP1’NP2|/NP1,|NP1’NP2|/NP2), (2)

where NP1 and NP2 denote the sets of neighboring protein of proteins P1 and P2 respectively.

To calculate the hconf value, the equation (2) can be modified by assigning the following values to NP1 and NP2.

|NP1|=sum of weights of edges incident on P1.

|NP2|=sum of weights of edges incident onP2.

|NP1 ‘ NP2 | = sum of minimum of weights of each pair of edges that are incident on a protein P from both P1 and P2.

According to the algorithm, the value of h-confidence will be within [0,1]. Therefore if the protein pairs have a high hconf score, then definitely there exists an interaction between them. Similarly, if the interactions between pairs of protein have a low hconf then the interactions are considered as noisy or spurious. In this algorithm a threshold value is set to exclude the spurious or noisy pair of proteins by a graph transformation technique. At first the h-confidence measure is calculated from the original PPI network. Next, a threshold limit is applied to eliminate the protein pairs with a low h ‘ confidence so that spurious interactions can be removed. The resultant graph is expected to be the less noisy and more complete version of the original graph.

For the purpose of evaluation of the algorithm, the original and the transformed graphs were provided as input to the Functional Flow algorithm [7], which is a popular graph theory-based algorithm for predicting protein function from interaction networks.

Proposed Approach

Motivation

The approach proposed by Pandey et.al has been implemented on a single machine. However with the increase in vast amount of data along with time i.e for the purpose of analyzing large scale PPI network of complex species, sequential implementation of the algorithm will consume a lot of processing time. The proposed approach focuses on the parallel implementation of Hyperclique miner algorithm, used for mining frequent hyperclique patterns. The hyperclique pattern is used for association analysis of the PPI network. The parallel implementation is done by Hadoop [8], an open-source software framework that allows to store and process data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. The Hadoop software framework uses Map-Reduce programming model.

MapReduce Model

Google’s Map Reduce paradigm [9],[10],[11] is a distributed programming paradigm and an associated implementation to support distributed computing over large datasets. With the help of this technology a programmer without any experience in parallel and distributed system can easily utilize the resources of a large distributed system, since it hides the details of parallelization, fault-tolerance, locality optimization, and load balancing.

The ideas of map reduce technology originated from the map and reduce functions of the functional programming. The Map Reduce framework consists of large number of computers called nodes which are collectively referred to as cluster. The Map and Reduce functions of MapReduce framework are defined with respect to data structured (key, value) pairs. Map () can be expressed as

Map (k1, v1) ‘ list (k2, v2) (3)

The Map function is applied in parallel to every pair in the input dataset (k1, v1). This produces a list of pairs for each call list (k2, v2). After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key. The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

Reduce (k2, list (v2)) ‘ list (v3). (4)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list.

Hadoop

Hadoop is a software framework of map reduce system which provides a distributed file system (HDFS)[13] that can store data across thousands of servers, which provides a means of running work (Map/Reduce jobs) across those machines, as well as running the work near the data. The Hadoop Map/Reduce framework has master/slave architecture. It has a single master server or job tracker and several slave servers or task trackers, one per node in the cluster. The job tracker is the point of interaction between users and the framework. Hadoop’s Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster.

Algorithm

This subsection details out the parallel version of Hyperclique miner algorithm

Algorithm Parallel Hypercliqueminer algorithm

Input:

1. A database T of N transactions T= {t1,’.,tN}, each ti ‘ T is a record with m attributes {i1, i2,’. im} where m is between 0, 1 where the il(1 ‘j ‘m) is the Boolean value for the feature type Sj

2. A user specified minimum h-confidence threshold (hc)

3. A user specified minimum support threshold (supp)

Output: Hyperclique patterns with h-confidence > hc and support > supp

Method:

Begin

Scan T to get support count of every item.

Use the map function to count the occurrences of potential candidate set in a distributed way;

Apply reduce function to sum up the occurrence count of the potential candidate set;

Prune those items, whose support count is < minimum support count. The h-confidence value for itemset of size 1 is assumed to be 1;
For the itemset of size >2 do the following:

The map function performs the self join operation of candidate set Ck-1 (where k=size of the itemset >1)to generate the potential candidate set;

The reduce function count the occurrences of the candidate set, and prune the itemset based on anti-monotone property of h-confidence support ;

Prune the cross support pattern based on cross support property of h-confidence;

Finally the reduce phase compute the support count and h-confidence of all the itemset and prune all the itemset which does not meet the user specified threshold of minimum support and h-confidence;

End;

Performance Analysis

The parallel implementation of Hyperclique miner algorithm can be applied for efficient association analysis of large complex dataset. We analyze the gain in performance of the algorithm when implemented using the high performance computing platforms such as Hadoop, where the speed of processing will increase with the increase in number of processors. The Hadoop has been installed in multicores machine. The experiment is carried out using 4 machines each consisting of multiple cores AMD processors. It has been found that as the number of cores increases, there is a remarkable improvement in performance of the algorithm. Therefore the parallel implementation of the proposed algorithm will provide a significant improvement regarding the processing time even if the size of the dataset increases. Moreover the algorithm is implemented on Hadoop platform which is an open source, reliable software framework used for the distributed processing of very large datasets. The dataset is distributed among the four machines each performing the map and reduce task. The performance of the algorithm is evaluated on the basis of three factors. These three factors are scaleup[14], sizeup[14], and speedup[14].

The scaleup factor measures the ability of a larger system to perform larger jobs at the same run time as that of a single system. Scaleup is defined as follows:

Scaleup(data,n)=T_1/T_nn , (5)

where T1 is the execution time for processing data on 1 core and Tnn is the execution time for processing the n times of the data on a m core.

The sizeup factor is defined as is defined as the time taken by the system to process larger size of data.The sizeup factor is denoted as

Sizeup (data, n) =T_n/T_1 , (6)

where Tn is the time required for processing n times larger data and T1 is the time required for processing the data.

Speed up factor measures the execution time of a parallel system. The speed of processing will increase with the increase in number of cores of processor, even if the size of the dataset increases. It is shown by the following formula,

Speedup= T_1/T_m , (7)

where T1 is the processing time of a single processor and Tm is the time required for processing data by m number of processors.

To show that the scalup properties we have increased the size of the dataset as well as the no.of cores of the processor. Table 1(a) lists the scalup factor and the number of cores of processors. It is shown with the help of graph in Figure 1 that as the number of cores of the processor increases there is a decrease in the scalup factor for the same dataset. To show the sizeup factor Table 1(b) is considered where the size of the dataset and the sizeup factors are tabulated. Fig 2 shows that the as the size of the dataset increases there is a increase in the sizeup factor which means that the algorithm will show better result with the increase in dataset size. Table1(c) show that the speed of processing will increase with the increase in number of cores of processor, even if the size of the dataset increases. The improvement in the performance of the algorithm with time is depicted in Fig.3.The graph shows that as the number of cores in the processor increases, the speed of processing is also increases, resulting faster analysis of the Hyperclique miner algorithm.

No.of cores ScaleUp

4 1

8 .83

16 .8

32 .7

48 .69

Datasetsize SizeUp

1GB 1

2GB 2.3

4GB 3.5

8GB 4.7

16GB 5.5

No.of cores SpeedUp

4 3

8 5

16 6.5

32 7.5

48 8

Table1. Different types of evaluation factors (a) ScaleUp factors in case of dataset size 8GB.Processing (b)SizeUp factors (c)Speed in case of large dataset of size 8GB.

Fig1: Scale up factor of the parallel algorithm using dataset of 8GB

Fig 2: SizeUp of the parallel algorithm Vs Size of dataset

Fig3: Runtime of the parallel algorithm using dataset of 8GB

This parallel implementation of the Hyperclique miner algorithm can be applied for association analysis of many large databases. It is applied for association analysis of wide varieties of PPI networks available for different species. The experimentally determined protein-protein interactions are documented in several databases such as the DIP [12] database and MIPS database. There are several dataset used for the analysis of the PPI. The dataset used for the proposed algorithm is the protein interaction dataset of S. cerevisiae (one kind of yeast) .It is available in the DIP database. The total number of proteins available in the database is 5171 and the number of interactions among the proteins is 24609 as on 1st March 2015.Hadoop provides faster processing of large datasets. Since the PPI interaction network stores information regarding protein, distributed analysis of this type of database will improve the speed of processing .The will lead to efficient association analysis of the PPI network.

The performance of this type of graph based PPI algorithms is evaluated with the help of function flow algorithm. It is a popular graph theory based algorithm for determining PPI network. In this graph, the annotated protein is the source protein which is connected to several unannotated protein. The PPI network is depicted in Fig 4.The flow of the graph is depicted from the source protein to the destination proteins. The number of edges entering into the destination protein defines the function of the protein. Usually graph of DIP dataset consists of unweighted edges. The weight of the edges of the graph is considered as a range of values between 0 and 1.

Fig4: Graph consisting of proteins and the interactions among them

Conclusion

Association analysis is one of the important datamining tools for analyzing the traditional market basket analysis. It can be used in a wide variety fields like web mining, social network mining. In the field of bioinformatics it can be used for wide variety of applications. It is used for analysis of the PPI network .Hyperclique miner algorithm is used for association analysis of PPI interaction network. In this paper we have parallel implemented the traditional hyperclique miner algorithm used for construction of hypergraph useful for the protein-protein interaction network. In conclusion, significant scope exists for future research on designing novel association analysis techniques for mining of complex biological data sets and their associated problems. Such techniques will significantly help in understanding the importance of association analysis for discovering important knowledge from these database and solve many important bioinformatics problems.

References:

Pandey, G., Kumar, V., Steinbach, M.: Computational approaches for protein function prediction: A survey. Technical Report, Department of Computer Science and Engineering, University of Minnesota. 06-028 (2006).

Agrawal, R., Imielinski, T., and Swami, A. N. 1993. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, P. Buneman and S. Jajodia, Eds. Washington, D.C., 207’216.

M. J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372’390, 2000.

Pandey, G., Steinbach, M., Gupta, R., Garg, T., Kumar, V.: Association analysis based transformations for protein interaction networks: a function prediction case study. In: Proceedings of the 13th ACM SIGKDD International Conference, pp.540’549 (2007).

Salwinski, L., Eisenberg, D.: Computational methods of analysis of protein-protein interactions. Curr. Opin. Struct. Biology 13(3), 377’382 (2003).

Xiong, H., Tan, P.-N., Kumar, V.: Hyperclique pattern discovery. Data Min. Knowl. Discov. 13(2), 219’242 (2006).

E.Nabieva et al.’Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps,’Bioinformatics,21:Suppl. 1.pp.i302-i310,2005.

White Tom, ‘Hadoop :The Definitive Guide’, O’reilly,3rd edition ISBN: 978-1-449-31152-0.

R. Agarwal and J. Shafer, ‘Parallel mining association rules’, IEEE Trans. On Knowledge and Data Engg., 8(6):962-969, December 1996, pp. 4-6, 14.

M. J. Zaki, S. Parthasarathy and W. Li., ‘Parallel data mining for association rules on shared memory multi-processors’. In Supercomputing 96, Pittsburg, PA, November 1996, pp. 17-22.

M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li, ‘New algorithms for fast discovery of association rules’, in Proc. of 3rd Int’l. Conference on Knowledge Discovery and Data Mining, August 1997, pp. 283-296.

Xenarios, I., Salwinski, L., Duan, X.J., Higney, P., Kim, S.-M., Eisenberg, D.: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research 30(1), 303’305 (2002)

‘HDFS High Availability Using the Quorum Journal Manager.’ Apache Software Foundation. Available at http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html. Accessed on June 5, 2013.

NingLi ,Li Zeng,Qing He and Zhongzhi Shi,’Parallel Implementation of Apriori algorithm based on MapReduce’,in Proc. of 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing,@IEE