Natural Language Processing


Introduction
Natural Language Processing: Natural language processing (NLP) is a branch of artificial intelligence that deals with analyzing, understanding, and generating the languages that humans use naturally, so that people can interface with computers in both written and spoken contexts using natural human languages instead of computer languages.
Teaching computers to understand the way humans learn and use language is one of the challenges inherent in natural language processing. Consider the sentence "Baby swallows fly." It can have multiple meanings, depending on whether "swallows" or "fly" is used as the verb, which in turn determines whether "baby" is used as a noun or an adjective. The meaning of the sentence depends both on the context in which it was communicated and on each person's understanding of the ambiguity in human languages. Such ambiguity poses problems for software, which must first be programmed to understand context and linguistic structures. NLP is closely related to computational linguistics, and the two terms are often used interchangeably.
Statistical NLP: Statistical NLP aims to perform statistical inference for the field of NLP. Statistical inference consists of taking some data generated in accordance with some unknown probability distribution and making inferences about that distribution.
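As a minimal sketch of statistical inference in this sense, the snippet below estimates word probabilities from a small sample of observed tokens. The toy data, the function name `estimate_unigram_probs`, and the choice of add-one smoothing are illustrative assumptions, not part of the original proposal.

```python
from collections import Counter

def estimate_unigram_probs(tokens, vocabulary, alpha=1.0):
    """Estimate P(word) from observed tokens using add-alpha smoothing."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / total for word in vocabulary}

# Hypothetical observations, assumed to come from an unknown distribution.
observed = "free offer free prize meeting free".split()
# "report" is in the vocabulary but was never observed.
vocab = sorted(set(observed) | {"report"})

probs = estimate_unigram_probs(observed, vocab)
for word in vocab:
    print(f"{word}: {probs[word]:.3f}")
```

Smoothing gives the unseen word "report" a small non-zero probability, so the estimate does not treat an unobserved word as impossible.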

Spam is unwanted communication intended to be delivered to an indiscriminate target, directly or indirectly, notwithstanding measures to prevent its delivery. Spam detection is an automated technique that identifies spam so as to prevent its delivery. Spam carries a payload such as advertising for a (possibly worthless, illegal, or non-existent) product, bait for a fraud scheme, promotion of a cause, or computer malware designed to hijack the recipient's computer. Because sending information is so cheap, only a very small fraction of targeted recipients needs to receive and respond to the payload for spam to be profitable to its sender.
Because e-mail is an effective, fast, and cheap means of communication, spammers prefer to send spam by e-mail. Nowadays almost every second user has an e-mail ID and consequently faces the spam problem. E-mail spam is unsolicited information sent to e-mail boxes, and it is a serious problem both for ISPs and for users. Its causes are the growing value of electronic communications on the one hand and the improvement of spam-sending technology on the other. According to Symantec's spam reports (2013), the average global spam rate for the year was 89.1%, an increase of 1.4% compared with 2012. The proportion of spam sent from botnets was much higher in 2013, accounting for approximately 88.2% of all spam. Despite many attempts to disrupt botnet activities throughout 2013, by the end of the year the total number of active bots had returned to roughly the same number as at the end of 2012, with approximately five million spam-sending botnets in use worldwide [1].
Spam messages have many disadvantages: they lower productivity; occupy extra space in mailboxes; spread viruses, Trojans, and material that is potentially harmful to certain kinds of users; and undermine the stability of mail servers. As a result, users spend a lot of time sorting incoming mail and deleting undesirable correspondence. According to a report from Ferris Research, the losses caused by spam in 2012 amounted to about 130 billion dollars worldwide and 42 billion in the USA. Besides expenses for the acquisition, installation, and servicing of protective measures, users are compelled to bear additional expenses connected with overloaded mail traffic, server failures, and lost productivity. We conclude that spam is not only an irritant but also a direct threat to business.
Considering the stunning quantity of spam messages arriving in e-mail boxes, it is reasonable to assume that spammers do not operate alone: spamming is a global, organized activity that forms virtual social networks. Spammers attack the mailboxes of individual users, whole corporations, and even states.

Literature Survey
Spam messages are now one of the weapons of information warfare. The notions of spam and war have appeared together in the scientific literature since 2003, but the problem of spammers' social networks has been considered in articles only since 2010.
• The clustering of spammers by considering them in groups is offered in [3].
• A spectral clustering method is applied to the set of spam messages collected by Project Honey Pot to define and trace the social networks of spammers. The social network of spammers is represented as a graph whose nodes correspond to spammers, with social relations between spammers represented by edges between pairs of nodes. In this paper, a document clustering method is applied to the clustering and analysis of spam messages, where the text documents are e-mails in text form. Although there are a number of approaches to representing text documents, the vector model is the most common of them [6].
• The vector model for representing texts was offered in Salton's works. In the elementary case, the vector model associates each document with a frequency spectrum of its words. In more advanced vector models, the dimension of the space is reduced by discarding the most common words, which increases the relative importance of the remaining basic words. The main advantage of the vector model is that documents can be ranked according to their similarity in vector space. Clustering is one of the most useful approaches in data mining for detecting natural groups in a data set. An up-to-date survey of evolutionary algorithms for clustering, such as partition algorithms, is given in detail in [12].
• Advanced topics such as multi-objective and ensemble-based evolutionary clustering, as well as overlapping clustering, are also covered in that paper. Each surveyed algorithm is described with respect to a fixed or variable number of clusters; cluster-oriented or non-oriented operators; context-sensitive or context-insensitive operators; guided or unguided operators; binary, integer, or real encodings; and graph-based representations. Clustering of spam messages means the automatic grouping of thematically close spam messages. The problem is complicated by the need to carry out this process in real time on information streams such as e-mail. Different methodologies use different similarity algorithms for electronic documents with a large number of features. Once classes have been defined by a clustering method, they need to be maintained, because spam messages change constantly and the collection of spam messages keeps growing. In this paper, a new algorithm for defining the criterion function of the spam message clustering problem is offered, and a genetic algorithm is used to solve the clustering problem [11].
• Genetic algorithms are the subject of many scientific works. For example, a survey of genetic algorithms designed for clustering ensembles presents the genotypes, fitness functions, and genetic operations, and concludes that using genetic algorithms in clustering ensembles improves clustering accuracy. In this work, the k-nearest neighbor method is applied to the classification of spam messages, and a multi-document summarization method is applied to the clusters to determine the subjects of the spam messages [12].
• Huang et al. proposed a complex-network-based SMS filtering algorithm that compares an SMS network with a phone-calling communication network. Although such a comparison can provide some new features, obtaining phone-calling networks and SMS networks that align well is difficult in practice. In [2], the authors present an effective SMS anti-spam algorithm that considers only the SMS communication network. They first analyze the characteristics of the SMS network and then examine the properties of different sets of meta-features, including static features, temporal features, and network features. These features are incorporated into an SVM classification algorithm, whose performance is evaluated on a real SMS dataset and a video social network benchmark dataset. The SVM algorithm is also compared with a KNN-based algorithm to reveal the advantages of the former. The experimental results show that an SVM based on network features achieves a 7%-8% improvement in AUC (area under the ROC curve) compared with some other commonly used features [2].
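The vector model described in the survey above can be sketched as follows: each text is mapped to a word-frequency vector, very common words are discarded, and documents are ranked by cosine similarity to a query. This is a minimal illustration only; the stop-word list, the function names, and the sample messages are hypothetical, and the surveyed papers may use different weighting and similarity measures.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "to", "of", "and", "you", "is", "for", "at", "now"}

def to_vector(text):
    """Map a text to its word-frequency spectrum, dropping stop words."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)

def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

documents = [
    "win a free prize now",
    "meeting agenda for the project review",
    "claim the free prize of cash",
]
ranking = rank_by_similarity("free cash prize", documents)
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```

The same similarity function could serve as the distance measure for a k-nearest neighbor classifier or a clustering step, which is where the surveyed approaches build on the vector model.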

Problem Definition
The literature survey shows that spam detection is an important safeguard for private information. Spam texts tend to share similar patterns; if a spam detection system can identify these patterns, spam can be classified easily. In our thesis work we formulate the problem over e-mail text.
• Let us consider the collection of spam messages in vector space. Assume $S = \{s_1, \ldots, s_n\}$ is a collection of spam messages, and $T = \{t_1, \ldots, t_m\}$ is the set of terms (spam keywords) in the collection.
• In the vector model, any message can be represented as a point in $m$-dimensional space, where $m$ is the number of terms.
• Each spam message is identified with the weighted vector $s_i = [w_{i1}, \ldots, w_{im}]$ (1), where $w_{ij}$ is the weight of term $t_j$ in spam message $s_i$, $i = 1, \ldots, n$, $j = 1, \ldots, m$, and $n$ is the number of spam messages in the collection.
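The weighted vectors of equation (1) can be sketched in a few lines. The text does not say how the weights w_ij are computed, so the TF-IDF weighting below is an assumption chosen for illustration, as are the function name `build_weighted_vectors` and the sample messages.

```python
import math
from collections import Counter

def build_weighted_vectors(messages):
    """Return (terms, vectors), where vectors[i][j] is w_ij for term t_j in s_i."""
    tokenized = [msg.lower().split() for msg in messages]
    # T = {t_1, ..., t_m}: every distinct term in the collection.
    terms = sorted({t for tokens in tokenized for t in tokens})
    n = len(messages)
    # Document frequency: the number of messages that contain each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting (not specified in the text): w_ij = tf_ij * log(n / df_j).
        vectors.append([tf[t] * math.log(n / df[t]) for t in terms])
    return terms, vectors

spam = ["cheap pills cheap offer", "free offer today", "free pills"]
terms, vectors = build_weighted_vectors(spam)
print(terms)
for vec in vectors:
    print([round(w, 3) for w in vec])
```

Each row is one message s_i as a point in m-dimensional space; terms that appear in few messages receive larger weights than terms spread across the whole collection.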
We will feed this vector-model representation to a discriminative learning algorithm: because the spam database is very large, the algorithm has enough data to learn from and can produce an optimized model.

Objectives
• To classify messages as spam or non-spam.
• To build a model that uses NLP approaches to reduce the high uncertainty of spam text, and that is better optimized and less complex than existing models.
• To compare this model with existing models.
• To validate our approach.

Methodology
Feed the e-mail into the system and perform the following steps.
• Tokenize the text into small parts (tokens).
• Build the frequency distribution of the parts of speech (nouns, prepositions, verbs, etc.).
• Feed this distribution to the learning algorithm.
• Classify the message with a discriminative learning algorithm.
If the spam detection is correct, the message is filtered; otherwise, the misclassified message is used to update the learning algorithm.
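The steps above can be sketched end to end as follows. The proposal does not name the discriminative algorithm, so a perceptron is assumed here because it updates only on mistakes, which matches the last step. The toy lexicon stands in for a real part-of-speech tagger such as the one in NLTK; the lexicon, tag set, class name, and training messages are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy lexicon standing in for a real POS tagger;
# words not listed here are treated as nouns.
TOY_LEXICON = {
    "buy": "VERB", "win": "VERB", "click": "VERB", "send": "VERB",
    "is": "VERB", "attached": "VERB",
    "in": "PREP", "on": "PREP", "at": "PREP", "for": "PREP", "to": "PREP",
}
TAGS = ["NOUN", "VERB", "PREP"]

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def pos_distribution(text):
    """Relative frequency of each tag in TAGS for the given text."""
    tags = [TOY_LEXICON.get(tok, "NOUN") for tok in tokenize(text)]
    counts = Counter(tags)
    total = len(tags) or 1
    return [counts[tag] / total for tag in TAGS]

class Perceptron:
    """A simple mistake-driven linear classifier (assumed algorithm)."""

    def __init__(self, n_features):
        self.weights = [0.0] * n_features
        self.bias = 0.0

    def predict(self, x):
        score = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if score > 0 else 0  # 1 = spam, 0 = non-spam

    def learn(self, x, label):
        """Update only when the prediction is wrong; return True if it was right."""
        if self.predict(x) == label:
            return True
        step = 1 if label == 1 else -1
        self.weights = [w + step * xi for w, xi in zip(self.weights, x)]
        self.bias += step
        return False

# Hypothetical labeled messages (1 = spam, 0 = non-spam).
training = [
    ("click to win buy send", 1),
    ("win win click buy", 1),
    ("report attached for meeting on monday", 0),
    ("agenda for project review at office", 0),
]
model = Perceptron(len(TAGS))
for _ in range(20):
    for text, label in training:
        model.learn(pos_distribution(text), label)

print(model.predict(pos_distribution("buy click win")))
```

In a fuller implementation the lexicon lookup would be replaced by a trained tagger and the three-tag distribution by a richer feature set, but the control flow (tokenize, build the distribution, classify, update on error) stays the same.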

Hardware & Software Requirements
1. Core 2 Duo processor
2. 4 GB RAM
3. WEKA
4. Python
5. Python NLTK
6. Eclipse IDE

References
[1] Fabrizio Sebastiani, 'Machine learning in automated text categorization', ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[2] Qian Xu, Evan Wei Xiang, and Qiang Yang, 'SMS spam detection using non-content features', IEEE Intelligent Systems, vol. 27, no. 6, pp. 44-51, Nov.-Dec. 2012.
[3] Symantec, 'State of spam and phishing: a monthly report', 2010, http://eval.symantec.com/mktginfo/enterprise/other_resources/b-state_of_spam_and_phishing_report_05-2010.en-us.pdf
[4] Ferris Research, 'Cost of spam is flattening: our 2009 predictions', http://www.ferris.com/2009/01/28/cost-of-spam-is-flattening-our-2009-predictions
[5] S. Minoru and Sh. Hiroyuki, 'Spam detection using text clustering', in Proceedings of the International Conference on Cyberworlds (CW '05), pp. 316-319, Singapore, November 2005.
[6] K. S. Xu, M. Kliger, Y. Chen, P. J. Woolf, and A. O. Hero, 'Revealing social networks of spammers through spectral clustering', in Proceedings of the IEEE International Conference on Communications (ICC '09), Dresden, Germany, June 2009.
[7] K. S. Xu, M. Kliger, Y. Chen et al., 'Tracking communities of spammers by evolutionary clustering', http://www.eecs .umich.eduukevin/xu spam icml 2010 sna.pdf
[8] G. Vishal and G. S. Lehal, 'A survey of text mining techniques and applications', Journal of Emerging Technologies in Web Intelligence, vol. 1, no. 1, pp. 60-76, 2009.
[9] R. Ghaemi, N. Sulaiman, H. Ibrahim et al., 'A review: accuracy optimization in clustering ensembles using genetic algorithms', Artificial Intelligence Review, vol. 35, no. 4, pp. 287-318, 2011.
[10] S. Nazirova, 'Mechanism of classification of text spam messages collected in spam pattern bases', in Proceedings of the 3rd International Conference on Problems of Cybernetics and Informatics (PCI '10), vol. 2, pp. 206-209, 2010.
[11] Eduardo R. Hruschka, Ricardo J. G. B. Campello, Alex A. Freitas, and André C. P. L. F. de Carvalho, 'A survey of evolutionary algorithms for clustering'.
[12] Rasim M. Alguliev, Ramiz M. Aliguliyev, and Saadat A. Nazirova, 'Classification of textual e-mail spam using data mining techniques', Applied Computational Intelligence and Soft Computing, vol. 2011, Article ID 416308, 8 pages, doi:10.1155/2011/416308.
[13] http://www.webopedia.com/TERM/N/NLP.html

Source: Essay UK - http://www.essay.uk.com/free-essays/science/natural-language-processing.php


