Intrusion Detection Systems (IDSs)

Abstract- As the data on the network grows day by day, there is a requirement to inspect that data for intrusions with high speed and accuracy. This growing network data poses a serious problem of detecting intrusions in order to protect the useful information on the network. Numerous network security tools exist to protect the network from intrusions, but the fast growth of intrusive activities remains a serious issue. Intrusion detection systems (IDSs) are used to detect intrusive activities on the network. Machine learning and feature selection techniques help to design intrusion detection models which can classify network traffic as intrusive or normal. Generally, intrusions are detected by analyzing 41 attributes from the intrusion detection dataset. In this work we try to reduce the number of attributes by using various ranking based feature selection techniques, and the evaluation has been done using the ten classification algorithms found to perform best, so that intrusions can be detected accurately in a short period of time. Finally, the most efficient of the six feature selection techniques is proposed.
Index terms: Network Intrusion Detection, NSL-KDD Dataset, WEKA, Data Mining, Feature Selection Techniques, Classification

Due to the huge volume of data on the network, the data content is vulnerable to various kinds of attacks, i.e. the number of intrusions is growing day by day. Intrusion detection is very important to prevent intruders from breaking into or misusing your system. To defend against various network attacks and computer viruses, a lot of methods have been developed. Among these methods, NID (Network Intrusion Detection) has been considered the most promising method of protection against complex and dynamic intrusion behaviors [1]. Intrusion detection systems classify data into various categories, for example normal and intrusive. Various classification algorithms have been proposed to design an effective intrusion detection model. The performance of the classifier is an important factor affecting the performance of the intrusion detection model; hence the selection of an accurate classifier helps to improve the performance of the intrusion detection system. Feature selection is an important data pre-processing tool in the field of data mining, and research in this field has been growing for the past three decades (Zhao et al., 2010). As the data on the network grows day by day in terms of features and instances, it is necessary to reduce that data in order to reduce its processing time and to achieve higher accuracy in results. To reduce those larger datasets we require feature selection techniques, through which the huge amount of data can be reduced to the required useful data by removing irrelevant and redundant features or instances. In our research work we have to reduce the features of the network intrusion detection dataset, for which we used ranking based feature selection techniques along with a ranker to get the features ranked according to their importance. To analyze intrusions, 41 attributes are considered.
Our aim is to reduce the number of those attributes so that intrusions can be detected in a short period of time with higher accuracy. The purpose of feature selection techniques is to remove irrelevant or useless features that contribute no useful information but waste the time of the intrusion detection model. There are several performance metrics to evaluate the performance of feature selection techniques. In this work, the NSL-KDD compatible classification algorithms have been evaluated using the WEKA tool. The top ten classification algorithms are then used to evaluate the six ranking based feature selection techniques. The performance of the feature selection techniques has been measured by considering the accuracy, training time and FPR performance metrics.
In this paper, the WEKA tool and the ten classification algorithms used are discussed in sections II and III respectively. Various feature selection techniques are discussed in section IV. The chosen dataset is introduced in section V. In section VI, the parameters considered to evaluate the performance of classifiers are discussed. Results are reported in section VII and conclusions in section VIII.

WEKA is a data mining and machine learning tool which was first implemented at the University of Waikato, New Zealand in 1997 [1]. It is a collection of a number of machine learning and data mining algorithms. The software is written in Java and contains a GUI to interact with data files. It contains 49 data pre-processing tools, 76 classification algorithms, 15 attribute evaluators and ten search algorithms for feature selection [2]. It contains three algorithms for finding association rules, and has three graphical user interfaces: 'The Explorer', 'The Experimenter' and 'The Knowledge Flow'. WEKA supports data stored in the ARFF (Attribute Relation File Format) file format. It also includes tools for visualization, and has a set of panels that can be used to perform specific tasks. WEKA provides the capability to develop and include new machine learning algorithms, and its algorithms can be applied directly to a dataset.

Classifiers, also called machine learning or classification algorithms, are used to classify network traffic when applied to a dataset. They are responsible for classifying network traffic as normal or intrusive and, if intrusive, determining which category it belongs to. There are basically eight categories of classification algorithms in WEKA, and each category contains several classification algorithms. In our work, 72 classifiers are compatible with the chosen dataset; on the basis of their performance we selected the best ten classifiers, which were then used for the feature selection process. The ten classifiers used in our work are discussed below:
Lazy classifiers: These learners are called lazy because they defer most of the computation to classification time; most of their power resides in the matching scheme. Classification takes a large amount of time because each example to be classified must be compared to every example in the training dataset. This time can be reduced by eliminating irrelevant and redundant features, making the classifier faster. The lazy classifiers we used are IB1 and IBk.
Random Forest: This is an ensemble classification algorithm based upon the decision tree algorithm; it produces output in the form of individual trees. The algorithm combines the idea of bagging with random selection of features to construct a collection of decision trees with controlled variation. It is one of the most accurate classifiers for many datasets, and it can handle a large number of variables without ignoring any of them.
Random Tree: A random tree, as its name indicates, is a tree built by picking random branches from a possible set of trees. Each tree has an equal probability of being sampled in this algorithm, i.e. the trees are distributed uniformly. Random trees can be generated easily and efficiently, and combining large sets of random trees mostly yields accurate models.
JRip (Extended Repeated Incremental Pruning): A rule based classifier that implements a propositional rule learner which performs repeated pruning to reduce error, also called RIPPER (Repeated Incremental Pruning to Produce Error Reduction). JRip is a rule learner that works exactly like the commercial rule learner RIPPER.
NB Tree: According to Thaseen & Kumar (2013), NB Tree is applicable to scaling large databases and is used to improve the performance of decision trees and Naïve Bayesian classifiers (attributes need not be independent for this classification algorithm).
Rotation Forest: Rotation Forest is also an ensemble classifier. It transforms the data, subsets of instances, subsets of classes and subsets of features using Principal Component Analysis (PCA), because this method of transforming data requires less storage space and has low computation time. Rotation Forest uses the decision tree as its base classifier; the rotation forest is formed by PCA performing a simple rotation of the coordinate axes. It works on two key components: diversity and accuracy. Each decision tree uses a different set of axes, and trees are sensitive to the rotation of the axes. Accuracy is maintained because no principal component is discarded and the whole dataset, with a different extracted set of features, is used to train each classifier. Experimental results have shown that Rotation Forest achieves the highest accuracy and the lowest kappa error rate.
Feature selection is the process of choosing relevant features from a large number of features for a particular dataset by applying particular evaluation criteria as desired by the user. Generally, the feature selection process involves three phases. It starts with selecting a subset of the original features, then evaluates the worth of each feature for that particular dataset (here the NSL-KDD dataset for intrusion detection). After the worth of each feature has been evaluated by learning algorithms, the features having a lower rank or lower value can be eliminated from the dataset. The third step is to evaluate the reduced feature set with classification algorithms to check whether it gives a better result than the previous one. If so, that becomes the final efficient reduced dataset.
Feature selection is broadly classified into two categories: feature subset selection and feature ranking. The techniques used for feature subset selection are different from feature ranking techniques. Feature subset selection techniques give the best reduced feature set according to the algorithms used for searching for the best feature subset; they do not rank the features. Feature ranking techniques calculate the score of each feature and rank them accordingly, so we can pick the top k features (as required by the user) with the higher ranks and ignore those with lower ranks. Here we use feature ranking for feature selection. Further, the feature selection process can be supervised, unsupervised or semi-supervised depending upon the class labels (Zhao et al., 2010). In the supervised method the features are evaluated using their correlation with the class in the dataset, while the unsupervised method uses the data distribution to evaluate the worth of each feature, and in the semi-supervised method the class label information is used to improve the performance of unsupervised learning. Feature selection can be carried out using one of three models: the filter model, the wrapper model or the hybrid model. The filter model performs feature selection without the use of a learning algorithm, while the wrapper model involves the use of a learning algorithm first and then performs feature selection. The hybrid model uses both of these models together.
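As a concrete illustration of the feature ranking approach described above, a minimal Python sketch: score each feature, sort by score, and keep the top k. The feature names and scores here are hypothetical, standing in for the output of any of the ranking techniques discussed below.

```python
def top_k_features(scores, k):
    """Rank features by score (higher is better) and keep the top k,
    discarding the lower-ranked remainder."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Hypothetical per-feature scores from some ranking technique
scores = {"duration": 0.12, "src_bytes": 0.85, "dst_bytes": 0.61, "flag": 0.40}
print(top_k_features(scores, 2))  # ['src_bytes', 'dst_bytes']
```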
Our work is focused on and limited to the filter model, using seven ranking based feature selection techniques. The main advantage of the filter model is that it works without using a learning algorithm, so it is unbiased. Secondly, it is very simple and easy to use, and its simple structure makes it very easy to design algorithms using this model.

Information Gain (IG): The information gain feature selection technique is based on the concept of entropy. Entropy is a measure of how pure or impure a variable is (Moore, W.A.). The expected resultant value for a feature under the information gain method is the mutual information of the target variable, say X, and the independent variable, say Y: the reduction in the entropy of the target variable achieved by knowing the other. The information gain method has the drawback that it prefers attributes having many distinct values over attributes having fewer distinct values, even when the latter are more informative. The value of information gain for an attribute can be calculated as follows. Consider an attribute X and a class attribute Y. The information gain of attribute X with respect to class attribute Y is the reduction in uncertainty about the value of Y when the value of X is known. The uncertainty about Y is measured by its entropy, H(Y) (Altidor, Khoshgoftaar, Hulse & Napolitano, 2011); entropy comes from information theory, and the higher the entropy, the greater the uncertainty in the attribute. The uncertainty about Y given the value of X is the conditional entropy of Y given X, H(Y|X).

Where Y and X are discrete variables that take values in {y1, ..., yk} and {x1, ..., xl}, the entropy of Y is given by:

H(Y) = - Σ_i P(Y = y_i) log2 P(Y = y_i)

The conditional entropy of Y when X is given is:

H(Y|X) = - Σ_j P(X = x_j) Σ_i P(Y = y_i | X = x_j) log2 P(Y = y_i | X = x_j)

and the information gain is the difference:

IG(Y; X) = H(Y) - H(Y|X)

Alternatively, the information gain is given by:

IG(Y; X) = H(Y) + H(X) - H(X, Y)

where H(X, Y) is the joint entropy of X and Y. When the predictive variable X is not discrete but continuous, the information gain of X with the class attribute Y is computed by considering all possible binary attributes X_θ that arise from X when we choose a threshold θ on X (Altidor, Khoshgoftaar, Hulse & Napolitano, 2011); θ ranges over the values of X. Then the information gain is simply:

IG(Y; X) = max_θ IG(Y; X_θ)
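The entropy and information gain computations for a discrete attribute can be sketched in Python as follows. The toy protocol/label sample is hypothetical, not drawn from NSL-KDD.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum over distinct values of p(y) * log2 p(y)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(Y; X) = H(Y) - H(Y|X), where H(Y|X) weights the entropy of Y
    within each value of X by P(X = x)."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(ys) - h_cond

# Hypothetical toy sample: feature = protocol, class = normal/attack
proto = ["tcp", "tcp", "udp", "udp", "icmp", "icmp"]
label = ["normal", "normal", "normal", "attack", "attack", "attack"]
print(information_gain(proto, label))  # 2/3 here: H(Y) = 1, H(Y|X) = 1/3
```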

Gain Ratio (GR): Information gain is biased towards tests with many outcomes (Harsha, 2012). Gain ratio is a modification of information gain that reduces this bias. It takes into account the number of branches when choosing an attribute, and is based on the intrinsic information of attributes: in this method the value of an attribute decreases as its intrinsic information gets larger, so gain ratio favors an attribute only when its intrinsic information is low. Intrinsic information is the amount of information needed to decide which branch an instance belongs to (Frank & Witten, 2011). The gain ratio of an attribute A can be calculated using the formula:

Gain Ratio (Attribute A) = Gain (Attribute A) / Intrinsic Info (Attribute A)
The attribute having maximum value of gain ratio will be selected as the splitting attribute.
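A minimal Python sketch of the gain ratio formula above. Here the intrinsic (split) information is computed as the entropy of the attribute's own value distribution, following Frank & Witten; the toy data is hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """H(V) = -sum p(v) * log2 p(v) over the distinct values of V."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(xs, ys):
    """IG(Y; X) = H(Y) - H(Y|X)."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(ys) - h_cond

def gain_ratio(xs, ys):
    """GR(A) = Gain(A) / IntrinsicInfo(A), with the intrinsic information
    taken as the entropy of the attribute's value distribution."""
    split_info = entropy(xs)
    return information_gain(xs, ys) / split_info if split_info else 0.0

proto = ["tcp", "tcp", "udp", "udp", "icmp", "icmp"]
label = ["normal", "normal", "normal", "attack", "attack", "attack"]
print(gain_ratio(proto, label))
```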

ReliefF: The Relief feature selection technique was proposed by Kira and Rendell. It is very easy to use and is a fast and accurate technique (Gao et al., 2010). Relief works by measuring the ability of an attribute to separate similar instances. Selecting and ranking features with this technique basically involves three steps:
• Calculate the nearest miss and nearest hit.
• Calculate the weight of each feature.
• Return a ranked list of features, or the top k features according to a given threshold.
ReliefF (RFF) is an extension of the Relief algorithm, extended by Kononenko so that it can deal with multi-class problems and missing values. The basic idea of ReliefF is to draw instances at random, compute their nearest neighbors, and adjust a feature weighting vector to give more weight to features that discriminate the instance from neighbors of different classes (Yang & Makedon, 2004). It can also deal with noisy data and regression problems.
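The three steps above can be sketched for the basic two-class Relief on numeric features. This is a simplified sketch only: full ReliefF averages over k nearest neighbours, handles multiple classes and missing values, and samples instances rather than visiting all of them. The toy data is hypothetical.

```python
def relief_weights(X, y):
    """One pass of the basic two-class Relief algorithm: for each instance,
    find its nearest hit (same class) and nearest miss (other class), then
    reward features that differ at the miss and penalise ones that differ
    at the hit."""
    n, d = len(X), len(X[0])
    # Per-feature value ranges, used to normalise differences to [0, 1]
    rng = [max(r[j] for r in X) - min(r[j] for r in X) or 1.0 for j in range(d)]
    diff = lambda a, b, j: abs(X[a][j] - X[b][j]) / rng[j]
    dist = lambda a, b: sum(diff(a, b, j) for j in range(d))
    w = [0.0] * d
    for i in range(n):
        hit = min((k for k in range(n) if k != i and y[k] == y[i]),
                  key=lambda k: dist(i, k))
        miss = min((k for k in range(n) if y[k] != y[i]),
                   key=lambda k: dist(i, k))
        for j in range(d):
            w[j] += (diff(i, miss, j) - diff(i, hit, j)) / n
    return w

# Feature 0 separates the two classes; feature 1 is constant noise
X = [[0.0, 5.0], [0.1, 5.0], [0.9, 5.0], [1.0, 5.0]]
y = [0, 0, 1, 1]
print(relief_weights(X, y))  # feature 0 gets a clearly higher weight
```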

OneR: This is a rule based algorithm: it generates rules, and on the basis of those rules it selects features and ranks them accordingly. One way of generating classification rules is through decision trees, but that method of generating rules is complex and incomprehensible (Pulatova, 2006). Generally, a rule based algorithm tries to cover all instances belonging to a class. It works on a specific class at a time and follows three steps: generate a rule R on the training data S, remove the training data covered by the rule, and repeat the process. OneR is the simplest approach to finding a classification rule, as it generates a one-level decision tree. OneR constructs rules by testing a single attribute at a time and branching on every value of that attribute. For every branch, the chosen class is the one occurring most often in the training data.
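The OneR procedure just described can be sketched as follows: for each attribute, predict the majority class per attribute value, count the errors, and keep the single best attribute. The toy records are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """OneR: for each attribute, build one rule per attribute value
    (predict the majority class for that value), then keep the single
    attribute whose rule set makes the fewest errors."""
    best_attr, best_rule, best_errors = None, None, None
    for attr in rows[0]:
        if attr == target:
            continue
        by_value = defaultdict(Counter)
        for r in rows:
            by_value[r[attr]][r[target]] += 1
        # Errors = instances not matching the majority class of their value
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in by_value.values())
        if best_errors is None or errors < best_errors:
            rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
            best_attr, best_rule, best_errors = attr, rule, errors
    return best_attr, best_rule, best_errors

# Hypothetical toy records: 'proto' predicts the class perfectly, 'flag' does not
rows = [
    {"proto": "tcp",  "flag": "SF", "class": "normal"},
    {"proto": "tcp",  "flag": "S0", "class": "normal"},
    {"proto": "icmp", "flag": "SF", "class": "attack"},
    {"proto": "icmp", "flag": "S0", "class": "attack"},
]
print(one_r(rows, "class"))  # ('proto', {'tcp': 'normal', 'icmp': 'attack'}, 0)
```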

Symmetrical Uncertainty: This is a correlation based feature selection approach. Correlation based feature selection evaluates the merit of a feature in a subset using the hypothesis that 'good feature subsets contain features highly correlated with the class, yet uncorrelated to each other' (Meo et al., 2009). This method measures the degree of association between discrete attributes. It is also based on the concept of entropy and is a symmetric measure of the correlation between a pair of features. It can be calculated using the equation:

SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y))

Here H(X) and H(Y) represent the entropies of X and Y. The value of symmetrical uncertainty lies between 0 and 1. A value of 1 indicates that either variable completely predicts the other (Meo et al., 2009), while a value of 0 indicates that the two variables are independent.
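A minimal Python sketch of the symmetrical uncertainty formula, with hypothetical toy data; note how a perfectly predictive attribute scores 1 and an independent one scores 0.

```python
import math
from collections import Counter

def entropy(values):
    """H(V) = -sum p(v) * log2 p(v)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)), normalised to [0, 1]."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    ig = entropy(ys) - sum(len(g) / n * entropy(g) for g in groups.values())
    denom = entropy(xs) + entropy(ys)
    return 2 * ig / denom if denom else 0.0

# An attribute identical to the class gives SU = 1 (complete prediction)
print(symmetrical_uncertainty(["a", "a", "b", "b"], ["n", "n", "i", "i"]))  # 1.0
```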

Filtered Attribute Evaluator: This is based on the same principle as the information gain feature selection technique; in our work we found that it ranks the attributes in the same order as the information gain method. Such methods are based upon direct performance evaluation criteria on the dataset, without looking at the data to be predicted for the reduced feature set, and are computationally less expensive than other methods.
Chi-Square: This method is based on statistical theory. It measures the lack of independence between attributes (Nuipian et al., 2011) and can test the strength of the relationship between two variables. It is a non-parametric test (Ling, 2012). It scores an attribute using the observed and expected values of the attribute-class contingency table:

χ² = Σ_i Σ_j (O_ij - E_ij)² / E_ij

Here i and j index the two attributes, O denotes the observed value, E the expected value and χ² the chi-square value. The higher the chi-square value of an attribute, the higher its rank.
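The chi-square statistic above can be sketched directly from the contingency table of observed counts, with expected counts derived under independence. The toy sample is hypothetical.

```python
from collections import Counter

def chi_square(xs, ys):
    """chi2 = sum over table cells of (O_ij - E_ij)^2 / E_ij, where O is
    the observed count of (attribute value i, class j) and E the count
    expected if the attribute and the class were independent."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    row = Counter(xs)
    col = Counter(ys)
    total = 0.0
    for x in row:
        for y in col:
            expected = row[x] * col[y] / n
            total += (observed[(x, y)] - expected) ** 2 / expected
    return total

proto = ["tcp", "tcp", "udp", "udp", "icmp", "icmp"]
label = ["normal", "normal", "normal", "attack", "attack", "attack"]
print(chi_square(proto, label))  # 4.0
```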

The dataset used for experimental evaluation consists of randomly selected instances from the NSL-KDD dataset. It classifies network traffic into five categories. The number of instances of each class included in the training and testing datasets is given in Table I.


Table I

Class Type   Instances in Training Dataset   Instances in Testing Dataset
Normal                   48522                           1478
DoS                      41699                          37490
Probe                     3608                           8301
U2R                         21                             28
R2L                         50                           1075

The performance metrics used to evaluate the classification techniques are:
Confusion Matrix: It contains information about the actual and predicted classifications made by a classification system.



                 Predicted Normal   Predicted Attack
Actual Normal    TP                 FN
Actual Attack    FP                 TN

False positive (FP): attack instances that are wrongly predicted as normal.
False negative (FN): normal instances that are wrongly predicted as attacks.
True positive (TP): normal instances that are correctly predicted as normal.
True negative (TN): attack instances that are correctly predicted as attacks.
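From these four confusion-matrix counts, the accuracy and false positive rate used to compare techniques can be computed. A minimal sketch with hypothetical counts:

```python
def accuracy_and_fpr(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / all instances; FPR = FP / (FP + TN), the
    fraction of the negative class that is misclassified."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

# Hypothetical counts for one classifier run
acc, fpr = accuracy_and_fpr(tp=90, fn=10, fp=5, tn=95)
print(acc, fpr)  # 0.925 0.05
```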

Source: Essay UK
