The number of databases and the amount of data stored in a single database are growing rapidly. This is true for almost any type of database, such as traditional (relational), multimedia, or spatial databases. Because of the enormous amount of data in various application domains, the requirements on database systems have changed. Techniques that analyze the given information and uncover previously hidden knowledge are necessary to draw maximum benefit from the collected data. Knowledge discovery is becoming more and more important in databases of all kinds, as increasingly large amounts of data obtained from environmental studies, geographic (map) databases, or automatic equipment are stored in spatial databases.
Classical analysis methods are in general not well suited for finding and presenting implicit regularities, patterns, dependencies or clusters in today's databases. Important reasons for the limited ability of many statistical methods to support analysis and decision making are the following:
• They do not scale to large data volumes (a large number of rows/entries or a large number of columns/dimensions) in terms of computational efficiency.
• They assume stationary data, which is uncommon in real-life databases. Data may change, and derived patterns may become invalid.
For these reasons, new computational techniques have been developed in the last few years in the emerging research field of Knowledge Discovery in Databases (KDD).
1.1.1 Knowledge Discovery in Databases
Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The KDD process is interactive and iterative and involves numerous steps, including preprocessing the data, applying a data mining algorithm to enumerate patterns from it, and evaluating the results, e.g.:
1. Creating a target data set: selecting a subset of the data or focusing on a subset of attributes or data samples on which discovery is to be performed.
2. Data cleaning and preprocessing: basic operations such as removing noise or outliers where appropriate, collecting the information needed to model or account for noise, and deciding on strategies for handling missing data fields.
3. Data transformation: finding useful features to represent the data, e.g., using dimensionality reduction or transformation methods to reduce the number of variables under consideration or to find invariant representations for the data.
4. Data mining: searching for patterns of interest in the particular representation of the data such as classification rules or trees, association rules, regression, clustering, etc.
5. Interpretation of results: this step can involve visualization of the extracted patterns or visualization of the data given the extracted models. Possibly the user has to return to previous steps in the KDD process if the results are unsatisfactory.
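The five steps above can be sketched as a chain of small functions. This is only an illustrative sketch in Python: the function names, the toy records, the min-max rescaling, and the threshold-based "mining" step are all assumptions made for the example and are not part of the KDD definition itself.

```python
from statistics import mean

def select_target(records, attributes):
    """Step 1: keep only the attributes on which discovery is performed."""
    return [{a: r.get(a) for a in attributes} for r in records]

def clean(records):
    """Step 2: drop records with missing fields (one possible strategy)."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records, attributes):
    """Step 3: turn each record into a feature vector rescaled to [0, 1]."""
    lo = {a: min(r[a] for r in records) for a in attributes}
    hi = {a: max(r[a] for r in records) for a in attributes}
    return [
        tuple((r[a] - lo[a]) / (hi[a] - lo[a]) if hi[a] > lo[a] else 0.0
              for a in attributes)
        for r in records
    ]

def mine(vectors, threshold=0.5):
    """Step 4: a toy pattern search that splits vectors by their first feature."""
    groups = {"low": [], "high": []}
    for v in vectors:
        groups["low" if v[0] < threshold else "high"].append(v)
    return groups

def interpret(groups):
    """Step 5: summarize each group (size, mean of first feature) for the user."""
    return {name: (len(vs), mean(v[0] for v in vs) if vs else None)
            for name, vs in groups.items()}

def kdd(records, attributes):
    """Run the steps in order; a real process may loop back to earlier steps."""
    target = select_target(records, attributes)
    cleaned = clean(target)
    vectors = transform(cleaned, attributes)
    return interpret(mine(vectors))

# Hypothetical input: the second record has a missing field and is dropped.
records = [
    {"x": 1.0, "y": 2.0, "id": 1},
    {"x": 2.0, "y": None, "id": 2},
    {"x": 9.0, "y": 8.0, "id": 3},
    {"x": 10.0, "y": 9.0, "id": 4},
]
summary = kdd(records, ["x", "y"])
```

In practice the user would inspect `summary` and, if the result is unsatisfactory, return to an earlier step with different attributes, cleaning strategies, or mining parameters.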
1.1.2 Data Mining
Data mining is a step in the KDD process that consists of applying data analysis algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data. The data mining algorithms proposed in the literature can be classified according to the following primary data mining methods:
• Clustering: identifying a set of categories or clusters to describe the data.
• Classification: learning a function that maps (classifies) a data item into one of several predefined classes.
• Regression: learning a function that maps a data item to a real-valued prediction variable and the discovery of functional relationships between variables.
• Summarization: finding a compact description for a subset of data.
• Dependency Modeling: finding a model which describes significant dependencies between variables (e.g., learning of belief networks).
• Change and Deviation Detection: discovering the most significant changes in the data from previously measured or normative values.
Clustering is the process of grouping the data into classes or clusters, so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters.
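As a small illustration of this definition, the sketch below compares the average distance between points inside one cluster with the average distance between points of two different clusters. Using Euclidean distance as an inverse measure of similarity is an assumption of this example, as are the function names and the toy points.

```python
from itertools import combinations, product
from math import dist
from statistics import mean

def mean_intra_distance(cluster):
    """Average distance between pairs of points in the same cluster."""
    return mean(dist(p, q) for p, q in combinations(cluster, 2))

def mean_inter_distance(cluster_a, cluster_b):
    """Average distance between points taken from two different clusters."""
    return mean(dist(p, q) for p, q in product(cluster_a, cluster_b))

# Two hypothetical, well-separated groups of 2D points.
a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# For a good clustering, points are closer within a cluster than across clusters.
well_separated = mean_intra_distance(a) < mean_inter_distance(a, b)
```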
Clustering has been applied in a wide variety of fields, as illustrated below with a number of typical applications:
1. Engineering (computational intelligence, machine learning and pattern recognition). Typical applications of clustering in engineering range from biometric recognition and speech recognition, to radar signal analysis, information compression, and noise removal.
2. Computer science. Applications of clustering are increasingly common in web mining, spatial database analysis, information retrieval, textual document collection, and image segmentation.
3. Life and medical sciences (genetics, biology, microbiology, clinical medicine, pathology). These areas were the major applications of clustering in its early stages and will continue to be one of the main playing fields for clustering algorithms. Important applications include taxonomy, gene and protein function identification, disease diagnosis and treatment, and so on.
4. Astronomy and earth sciences (geography, geology, remote sensing). Clustering can be used to classify stars and planets, investigate land formations, partition regions and cities, and study river and mountain systems. It can also be used to detect seismic faults by grouping the entries of an earthquake catalog, and to detect clusters of objects in geographic information systems and explain them by other objects in their neighborhood.
5. Social sciences (sociology, psychology, archeology, education). Interesting applications can be found in behavior pattern analysis, relation identification among different cultures, construction of the evolutionary history of languages, analysis of social networks, archeological findings, and the study of criminal psychology.
6. Economics (marketing, business). Applications in customer characteristics and purchasing pattern recognition, grouping of firms and stock trend analysis all benefit from the use of cluster analysis.
1.2 Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that discovers clusters of arbitrary shape. The basic idea of density-based clustering is that, for each point of a cluster, its Eps-neighborhood (for some given Eps > 0) has to contain at least a minimum number of points (MinPts > 0). DBSCAN searches for clusters by checking the Eps-neighborhood of each point in the database. If the Eps-neighborhood of a point p contains at least MinPts points, a new cluster with p as a core object is created. DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve merging a few density-reachable clusters. The process terminates when no new point can be added to any cluster. Points that do not belong to any cluster are called noise.
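The procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes Euclidean distance, a brute-force neighborhood search, the "at least MinPts points" threshold (counting the point itself), and hypothetical names (`region_query`, `dbscan`, `NOISE`). Because each cluster is expanded until no density-reachable point remains, this variant needs no explicit merge step.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster

def region_query(points, i, eps):
    """Indices of all points in the Eps-neighborhood of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core object: start a new cluster and expand it.
        cluster_id += 1
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core object
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core object
    return labels

# Two hypothetical dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
# labels == [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The brute-force `region_query` makes this sketch quadratic in the number of points. For large databases, the neighborhood queries would normally be answered by a spatial index.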