Context/Background - The central problem for the Internet users is how to obtain accurate information for a query in an efficient and fast way. And the most efficient way to solve the problem nowadays is using the search engine, which is a software system that could search information from the Internet automatically and provide query services for the users.
Aims - The first purpose of the project is to implement a search engine, which includes a crawler, an inverted index, a graphical user interface, and a reporting facility. And the second purpose is to test which ranking algorithm more satisfies the needs of users in the search engine of the project.
Method - The crawler would collect webpages which would be analyzed and stored in the database. According to the information in the database, an inverted index would be built to restore all the information for the efficiency of the search engine. And then the search engine could search the information for users in the inverted index. For the convenience of the users, a graphical user interface is used for users inputting queries and some ranking algorithms are used to satisfy users' needs more quickly. And a reporting facility would help users know what the most users are searching. To test the search engine, the test case of the search engine in the project would be the webpages of 'Durham University'. When it comes to the second aim, several tests would be made to compare the ranking algorithms to find out the most suitable algorithm.
Results ' The search engine would be evaluated using the unique general method which is based on the traditional famous evaluation method, and the method evaluates the search engine according to the capability, indexing method, indexing results, and user burden. Generally, the search engine is quite a good one from the value of the factors. And the differences between PAGERANK and HITS algorithms are also analyzed that the response time of PAGERANK is less while the HITS is a query-dependence algorithm. So the user would decide the algorithm based on their queries.
Conclusion ' In the project, a search engine has been made and as a platform, the ranking algorithms PAGERANK and HITS have been compared to let the search engine get the better results. Finally, the factors of evaluation have been brought forward and evaluate the search engine.
Keywords ' search engine, crawler, index, database, query, reporting facility, PAGERANK, HITS
With the rapidly development of the Internet technology, more and more people would like to derive information from the Internet. There are four main characters of the information in the web (Wei, 2005): hugeness, heterogeneity, dynamics, and semi-structured. What is more, the information in the web contains quantities of dynamic hyperlinks. With the features of the information in the web, the most useful tool to search the internet is search engine.
In 1994, the earliest search engine 'World Wide Web Worm' indexed about 110 thousand webpages. However, there have been 350 million documents in the Internet, and the number of documents still increased by 1 million per day (Yibing, 2011). According to the research by China Internet Network Information Center (CNNIC) in 2014, 'The 33rd Statistics Report for China Internet Developing Situation', by December in 2013, the number of Chinese webpages has been up to 150 billion, up 22.2 percent on last year's same period. However, the Chinese webpages are just the tip of the submerged iceberg. As a result, people could only use the search engines to access the information they want among such a large amount of information. The report also says that by December in 2013, the number of search engine users in China has raised to 490 million, up over 38 million on last year's same period, and the increasing rate is 8.5 percent while the use rate of the search engines has been to 79.3 percent. Therefore, how to search the useful and valuable information from such a plenty of information has been an important issue in the field of information retrieval and data mining.
B. Search Engine
In the background as shown above, the search engine has been developed and is very popular among the Internet users, and nowadays, the search engine has been one of the most effective tools when it comes to the information retrieval.
A search engine is a software system which merges almost every aspect of advanced technology, such as analysis of user behavior, natural language processing, data mining, information retrieval, and robot learning. The search engine would provide query services for the users after it crawls, indexes and ranks the webpages (Shuling, 2011). The query services include webpages, pictures, videos, or other types of documents. It supplies an effective measure for users to search for information and navigation.
Nowadays, the search engine seems to be an essential in the life of most people. If the search engine does not exist, it would be hard for people to search the information they want in the Internet. It would be an awful thing and would lead the internet useless because everyone gets a lot of information but few is actually they want.
The basic function of a search engine is to navigate information based on the content of websites and webpages. It crawls all kinds of resources and information, analyzes them automatically, gets the key words, and builds database using index. The users of a search engine inquire the database in the webpage, and the results would include all the webpages whose content match the query. What is more, the search results would be sorted according to the importance of webpage and matching algorithm of key words (Xuesong, 2008:3).
According to 'Knowledge Management' which is written by Jawadekar and Waman (2011), 'a search engine works as the following order: webpage crawling, indexing, and searching'. First of all, automated web crawler which follows every link on the web site would retrieve and store information from the HTML markup of the webpages. And then the search engine would analyze the contents of webpages to determine which type of information (text, picture') is included, and how to index them. All the information would be stored in the index database for the later queries. Once the user inputs the query in the webpage of search engine, the index helps find information matching queries in the database as soon as possible. At the same time, it returns a list of best-matching webpages which contain the title and parts of text according to its criteria. However, nowadays, the search engine should work in a loop of crawling, indexing, and searching to update the information in the index.
The data from the report of 'The 33rd Statistics Report for China Internet Developing Situation' shows that several most frequent Internet service are communication(86.2 percent), information retrieval(79.3 percent), entertainment(54.7 percent), and commercial trade(48.9 percent). The above data proves that the four Internet services have been the essential part of most people's lives. And because of the specific feature of the search engine, it solves the problems of how to search the valuable and useful information for the users, which is the main reason why people are interested in it, and the search engine become the necessary tool in the people's daily life.
C. Ranking Algorithm
In the early age of the search engine technology development, the users pay more attention to whether the searched results are relevant to the query. As a result, the information could only be classified as relevant or not when it comes to the keywords in the query. Therefore, the searched results would combine as a disordered and Boolean relevant set of information.
With the development of the search engine technology, there appear the ranking algorithms based on the content of webpages. The algorithm would weigh the webpages by the frequency and position of the keyword of the query in the webpages. Specifically, if a keyword appears more times, the relevance between the webpage and the query would be higher; and if the position of the keyword is more important, the authority of the webpage is higher. Therefore, the rank of the webpage in the result set would be higher.
Then, the ranking algorithms based on the hyperlinks analysis appear. The kind of algorithms takes the methodology of citation analysis in the traditional information index theory, which considers the hyperlinks as the quotation of the webpage, and if the times of a webpage is cited more, the authority value would be higher; and if a webpage is quoted by an authority webpage, the authority value of the webpage would be higher, and finally the rank of the webpage would be higher. Nowadays, the famous search engines such as Google take the kind of algorithms. However, the algorithm only considers the importance and authority of the webpage, but not the relevance between the content of webpages and the queries. Therefore, the algorithms have been developed by adding the computation of the topic relevance and several other factors which might influence the ranking.
Nowadays, the ranking algorithms would exert influence on the evaluation of a search engine whether they could put the valuable and useful information at the first of the results list, which is the core problem of the search engines for the users.
D. Purpose of Project
The first purpose of the project is to implement a search engine, which includes a web crawler which could get all the resources and information of the webpages, an index which could save data from the crawler, a graphic user interface which is convenient for users to input a query and return the result list to the users, and several ranking algorithms which rank the search results.
The second purpose of the project is to compare the ranking algorithms PAGERANK and HITS to find their advantages and disadvantages and which ranking algorithm would be more suitable for the search engine in the project.
Design a web crawler.
Build and maintain an index.
Design a GUI (such that a user can pose queries).
Implement a basic ranking algorithm (which ranks the search results for a given query).
Implement a well-known ranking algorithm such as PAGERANK.
Improve the GUI by allowing more complicated user queries (e.g. using Boolean opereators).
Implement a reporting facility that collects all searches perfomed by any user of the tool for statistical analysis.
Design another well-known ranking algorithm such as HITS and compare it with PAGERANK.
Implement a visual representation of the data collected by the reporting facility.
A. Development of Search Engine
According to the criterion of ranking the results, the search engine could be classified as three generations (Lina et al., 2011).
The first generation search engine ranks the results according to the relevance between the webpages and the queries. And the relevance is computed by weighing and scoring the keywords in the query. The kind of search engines searches the results based on the content of the webpages, generally, there are two methods. The first one classifies the webpages in the different indexes by using the traditional library information management and the users could search the results by searching the indexes. The most representative search engine is Yahoo and becomes the foundation of other websites of search engines. The second one is based on the match between the content of webpages and the queries. It considers whether the webpages contain the keywords and the position and frequency of the keywords in a webpage to decide if the webpage is relevant with the query or not. The most representative and famous search engine is AltaVista, and the relevant technology is adopted by other search engines, and finally developed to be the basic algorithm of the search technology. However, the first generation search engines has the obvious drawbacks, the first method promises the relevance, but it is not automatic, as a result, it is not convenient for the users to use the search engine. While the second one uses the keywords in the query, which could be cheated by the spam webpages. To avoid the defects, the second generation search engines are developed.
Actually, the second generation search engines scores the importance of the webpage by whether it is clicked by the users and how long the users would like to spend in the webpage or the number of the in-hyperlinks in the webpage. Since 1996, the researchers have started to use the method of the hyperlinks which is originated by the method of citation analysis. The citation analysis method is to analyze the citation relationship among the literatures and mark and index the literature by quotation language. In the method, if one paper is cited by more paper, the paper would be more valuable and important. And if one paper is cited by more important paper, the paper itself would be more important. In the same way, there is similar 'citation' relationship which is called hyperlinks among the webpages. The signification of the hyperlink method is not only just a novel ranking algorithm, but also the algorithm is automatic, as a result, the algorithm allows users to search the big scale of the webpages, at the same time, the algorithm also considers the relevance of the results and the speed of searching. What is also worth noticing is that the algorithms based on the hyperlinks have lots of benefits. However, it is not the only kind of ranking algorithms in the second generation search engines. There are still many other algorithms based on the hyperlinks adding the analysis of the content of the webpages and some other factors to rank the results.
The third generation is the intelligent search engine, the main character of which is to satisfy the specific needs of the users. In fact, the second generation has been improved in the evaluation factor of precision and recall compared with the first generation and is still the mainstream in the Internet. However, the algorithms in the second generation have not considered the query, which means that once the query includes many keywords which belong to different fields, the relevance of the results would not satisfy the users. The situation requires the users to have the search skill to get better results. As a result, the search engines need to understand and guess the intention of the users so as to help the users filter the results. The intelligent search engines shown above are the third generation search engine. Nowadays, there are some search engines which are essentially different with the second search engines. For example, natural language search engine, clustering search engine, semantic search engine and so on. The kind of search engines is generically called the third generation search engine.
B. Problem Domain
According to 'Search Engines Information Retrieval in Practice' (W.Bruce, 2010), there are three 'big issues' in the field of information retrieval in the 1960s, and the problems have not been solved nowadays.
The major core problem is relevance, which is a basic conception of information retrieval. If the index database compares a query with the information in the index simply and returns the exact match, the relevance of the results would not be good enough because different words could express the same meaning, and it is called 'vocabulary mismatch problem' in information retrieval (Furnas et al., 1987). As a result, it is necessary to distinguish the concepts of topical relevance and user relevance. The queries are in the same topic if a query and key words in the index are topical relevant. For example, weather and snow are topical relevant. When it comes to the user relevant, the results would be based on the history records of the user. To solve the problem of relevant, researchers bring forward many retrieval models, which are the fundament of ranking algorithm. A search engine uses ranking algorithms to yield the result list. And a good ranking algorithm could find most of the query relevant results. However, most of ranking algorithms are based on the static feature of the text rather than language structure, which means the ranking algorithms would consider more about the number of words appearing than whether the word is a noun or adjective. Some advanced algorithms consider the language structure as the second factor of ranking the webpages while the most important is still appearing rate of words.
The second core problem is evaluation problem. In 1960s, Cyril Cleverdon made the evaluation method (Cyril, 1970). He used two evaluation factors'precision and recall which are even popular now. Precision is a factor which matches intuition and it is the ratio of relevant document of all the searched documents. Recall is the proportion of searched documents of the whole documents. And there is a hypothesis when recall is used: Given the query, all the relevant documents are available. It is obviously impossible for a large database. However, it would be useful when it comes to the smaller test collection. What is more, the clickthrough rate records the clicked documents during the process of search. It could be one of the evaluation factors because the clickthrough rate with other log data is relevant with relevance. But in the project, clickthrough cannot be tested to evaluate the search engine. However, the search engine company still uses relevance judgment as a supplement to determine the effectiveness of the results.
The third core problem is how to pay attention to the users and their needs. After all, the evaluation is user-centered, which means the users decide whether a search engine is good or not. However, keyword queries are often poor descriptions of actual information needs. For example, 'blue' might be a color or a feeling or a type of song. Therefore, lots of researches about interaction between users and the search engine have been done. Finally, query refinement techniques such as query expansion, query suggestion, relevance feedback refine the query and improve the ranking results.
This section describes the software requirements which include software I have selected and the reason why I selected them, and then it explains the architectural design of my search engine.
A. Software Requirements
The search engine is developed using Eclipse running on a Windows 8 operating system. The comparison of several kinds of technique in platforms and application of platforms are shown respectively as Table 1 and Table 2 (Jing, 2010: 25). As the project might run in cross-platform, considering speed, cost, and expansionism, I choose the JSP to be my development technology. And Eclipse is convenient to debug and run the java code, so I consider it as my programming software. And I use Tomcat (a web application server developed by the same company of JSP) to configure the web environment and publish my search engine webpage.
B. System Architecture
According to the basic procedure and the function of a simple search engine written above, I take the architecture shown as Figure 1, which is based on a similar figure that can be found in 'Development of Lucene + Nutch Search Engine' (Xuesong, 2008). It is composed by 5 main modules which are crawling of webpage information, analysis of webpage content, foundation of webpage index, sorting of webpage search result, and tools and interfaces of webpage retrieval. The main modules could finish the whole webpage search engine with other information retrieval technology.
Crawling of webpage information
The software or system which collects the webpage information is called a web crawler. Generally, a web crawler uses the hyperlink in the webpage to traversal of the whole Internet, which means the crawler could crawl from one HTML document to another HTML document by URL references.
Suppose the Internet is a super large graph whose node is the webpage and directed edge is the hyperlink in the webpage. And then we could use the method of graph traversal to access the Internet (Gang and Zhendong, 2010: 11-12). Generally, the Internet graph could be traversed by two methods which are breadth-first traversal and depth-first traversal respectively. However, if the the depth-first traversal is used, the crawler might fall in the much 'depth' of the graph (most of them would be spam or useless information) or go into the 'black hole' (The black hole is the hyperlink of the page points to itself, or the different URLs point to the same webpage) of the Internet so that it could not get out to search other webpages. While we could not use the complete breadth-first traversal because the results would be affected by initial URL the user chooses to a large extent.
As a result, two methods described above are not suitable in the project. Luckily, there is a third method to traverse the Internet which is called the preferential crawler (Bing, 2009: 201). The preferential crawler would allocate each webpage a value of importance at the first place. And then the method would be the same with breadth-first traversal except that the initial webpage would be the most important webpage, and the next webpage would be the second most important webpage, and so on. Now, the problem is how much each webpage values. Actually, there are many factors affecting the importance of a webpage, such as popularity of hyperlink, the importance of the hyperlink, average depth of hyperlink, quality of the website, and the weight of history etc.
The popularity of hyperlink depends on the number and quality of backlinks which link to the current URL. And we define it as IB(P). The importance of hyperlink is a function of string of URL. For example, the URL with '.com' or 'home' would be more important than those with '.cc' or 'map'. And we call it as IL(P). Then, we could define the importance of webpage I(P) as following (Gang and Zhendong, 2010:22):
In the function, a and b are the parameters which adjust the proportion of IB(P) and IL(P). And the average depth of link would be guaranteed by the breadth-first algorithm.
Analysis of webpage content
Generally, the webpages or the documents which are crawled by the web crawler would be analyzed at the first place for building the index later. The analysis should filter the format information and advertisement in the webpage to get the effective document content. Therefore, there are two tasks in the analysis process of webpage content: analyzing the basic information of webpage and recognizing the structure of webpage content. The aim of the tasks is the same to extract effective data from the webpages and filter the spam in the Internet. Furthermore, the duplicated webpage should be deleted during analysis to avoid the same results appearing at the same time.
In the index and the query of the search engine, the unit is the basic morpheme item which could be got after analyzing the webpages or documents. And the analysis of documents usually includes extracting words, eliminating punctuation mark, convert between uppercase and lowercase letters, resuming word stem, eliminating conjunction and high frequency words.
Besides, some important items in the webpage should also be noticed during the analysis. For example, all the titles, link words, blackletter type, etc. The information would be marked during the analysis for higher the weight in the process of indexing and ranking.
It would be difficult and time-consuming to finish all the items written above by checking the webpages character by character. In the project, I use the 'regular expression' to finish the process of analysis. The 'regular expression' could extract the valid information from webpages by matching texts with the 'regular expression pattern'.
The module of the analysis part is shown as figure 2 (Xuesong, 2008:46).
Natural Language Process
Actually, the 'regular expression' just eliminates the spam in the webpages but not extract the important information. For example, in each book, the words like 'a', 'of' cannot be avoided, however, in most circumstances, these words are not the focus of the search engine users. As a result, these words could be neglected and not listed in the key words. The natural language process should also be used in the query of users. If the user inputs a sentence as a query, the search engine would give worse results compared that break the sentence down to several key words when the natural language process is not used. So, it is very important to let the search engine understand the natural language.
The first thing I dealt with is to set a list of stop words like 'of' and 'the'. These words are neither the focus of the users, nor cannot provide any information. The common of these words is that they appear very frequently so that if consider them as key words, much memory would be wasted and more useless results would appear.
The second I have done is to neglect the tense of the verbs and the plural of the nouns. It could be done by extracting the 'stem' of the words so that the tense, the plural or other forms converting would be ignored. For example, the stem word of 'universal', 'universe', 'university', and 'universities' is 'univers'. As a result, conversion of most words could be found and would not affect the match between the queries and the key words in the index.
Foundation of webpage index
The webpages have to be indexed for the later use. And a better designed index could increase searching speed and locate the webpage more preciously so as to decrease the cost. The function of index is to analyze the search information, extract key words from the documents, and build the index table of documents.
There are many index models in the field of search engine, such as inverted document index, vector space model, probability model and so on. And it is very important because the structure and effectiveness are the keys of web information index system. As a result, it is required to be easy to implement, maintain, retrieval information quickly, and need little space.
Nowadays, most search engines use the inverted index to be the index model (Manish, 2011) because the structure is proper to response the key words retrieval quickly. Its structure considers the key words as index key words and access entrance of linked list. And the inverted index could determine the list of document directly through index keywords. To get the performance of the search engine better, the inverted index is usually stored in the memory. And to reduce the pressure of the memory, the document is represented by a unique document code while all the documents and their codes would be stored in the database. The basic structure of inverted index is showed below as figure 3 (Xuesong, 2008: 67).
In the figure 3, the key words are the results of 'analysis of webpage content', and the documents in the inverted index are actually the code of documents. Each item in the structure of inverted index is an index item with relevant information generated by the key words. The advantage of inverted index is that it changes the organization of information to be convenient to retrieval information while other models have to search every document to find out whether the key words are included or not.
Sort of results
There are a lot of webpages (web spams) which include most keywords but only the keywords. Absolutely, they are not the users' needs. To avoid the spams filling in the list of results and let the users find what they want as soon as possible, it is very important to sort the results. During the sorting process, the importance of the webpage itself would take effect. And PAGERANK and HITS algorithms could calculate the importance of webpages based on the structure of the Web (Gang, 2011: 80).
A famous and popular way to sort the results is the PAGERANK algorithm. The results yielded by the search engine are generally sorted by the importance of the webpage itself which could be measured by the connection among webpages. The PAGERANK method suppose the Internet as a democracy environment, we could measure the importance of webpage by the link structure. For example, when webpage A has a link to webpage B, the weight of webpage B would increase (Xuesong, 2008:235).
There are two hypothesizes in the PAGERANK algorithm (Junlin, 2012: 138). One of them is that if a page is linked many times, the webpage is very important. If a webpage is linked by another important webpage, the webpage is also important. In a word, the importance could be passed among webpages. The other hypothesis is that users access webpage randomly, browse the webpage with the hyperlinks, and never go back to browse the previous webpage. As a result, the next webpage is more important than the previous one.
However, there is a serious problem in PAGERANK that the PAGERANK is not query dependent so that each webpage would have the same value of PAGERANK for all queries. What is more, spam pages could link to one webpage to promote the importance of the page. Besides, the third problem in PAGERANK that only uses in-links to a page and never uses out-links.
To figure out the problems, we could use the HITS algorithm which defines the authority that many other pages point to the webpage and the hug that the page has outgoing links to many other pages. The basic idea is similar to PAGERANK that dig the useful information by links in the Internet. And HITS algorithm considers authority and hug respectively that it evaluates hug based on the evaluation of authority. Finally, a weight would be assigned to the webpage. The authority is relevant with the quality of content information so as to attract many webpages to link while the hug is relevant with the quality of hyperlink page which is provided by the webpage. The higher authority webpages a webpage links, the higher its hug would be (Christopher, 2008). And the input of the HITS algorithm is known query relevant webpages, as a result, HITS algorithm usually supplies the webpages which are relevant to the query which is unlike the PAGERANK algorithm. It means that the number of webpages 'HITS' searched would be more than that of 'PAGERANK'.
However, there are some disadvantages in the HITS algorithm. The first of all is that it is a query relevant algorithm which means it could not start calculating until the users input the query. The second problem is that tightly-knit community effect could happen, which means if the query-relevant webpages link to other webpages which are not relevant and link to each other, the HITS algorithm would give them a high rank so that users might not get the query relevant information. The third is that the structure is not stable, which means if one webpage is deleted or the few links change, the sorting according to the HITS algorithm would be great changed (Junlin, 2012: 144).
Graphical User Interface
In the section of graphical user interface (GUI), based on the famous search engine company Google, the home page would be very simple and convenient for users to search the information that he/she wants to. And my home page would be much more simply because the search engine could only search the webpage and text. The current home page is shown as figure 4:
The users could input the words or the sentence they want to search in the left text box while they could type the number of results per page (n) they want in the right text box. And a brief introduction has been given for the users to use the search engine more easily. The next webpage would show the previous n results of all the results. And at the same time the page would tell the user how many results in total, the titles of searched webpages, the contents which include the words that the user searches, the websites of searched webpages, the time when the webpage is crawled which is helpful for the update of the database of the whole search engine, and which page current webpage is. Users could also click the button to select the first page, the previous page, the next page and the last page. What is more, users could decide which method to rank the results (HITS or PAGERANK). Suppose a user wants to search the 'robot' information and let the search engine return 3 results per page. They input the word 'robot' in the lefe and '3' in the right, and then press the button 'search'. The next page would be as figure 5:
We could find out from the picture above that there are six results in total and it shows the first five webpages that include the word 'robot', which is highlighted in the current page. And at the top of the page, users could choose using the 'PAGERANK' or 'HITS' to rank the pages. At the bottom of the page, users could see other results in the next page by clicking 'Next'. The pattern of next page is almost the same as the one above. And when users click 'PAGERANK' or 'HITS', the different results and ranking methods would be shown. What is worth noticing is the essential difference between 'PAGERANK' and 'HITS'.
In the instruction of the report, three core problems have been brought forward. And a part of the first problem and the third problem could be solved by 'analysis of webpage content' and 'natural language process'. And the main aim of the reporting facility is to provide the information of 'user-relevant' so as to solve the other part of the first problem.
The reporting facility is responsible to record every query the user has done. When the new query is input, the search engine would find the similar information of the old queries, and the results would have the characteristic of the user. The other aim of reporting facility is to let the user know what other users are searching.
When the user inputs the query, the query would be stored in the database. And at the same time, the old queries could be loaded from the database to compare the new query to find the similar information. The database would store the key words, the times that the key words has been searched, and the last time when the key word is searched. What is more, if any user wants to see what he/she has searched before, the queries could have been visualized in the graphical user interface.
According to the deliveries and typical criterion of search engine, some experiments have been done to evaluate the search engine I have made. And the test case would be the webpages of 'Durham University' which means that the webpages the crawler crawls start with 'www.dur.ac.uk'. The criterion and the results of the experiments would be shown. Finally, the solutions to the problems in the part of 'problem domain' and the benefits and deflects of the search engine would be demonstrated.
Criterion of Evaluating the Search Engine
In 1973, Lancaster and Fayen listed 6 factors to evaluate the search engines, coverage, recall ratio, precision ratio, response time, user effort, and format output (Lancaster and Fayen, 1973). Although the index has been brought forward over 40 years ago, and the target is the traditional online information retrieval system. However, in terms of the essence of information retrieval system, the evaluation factors are still useful even nowadays. Later, after Heting Chu and Marilyn Rosenthal analyze and compare three famous search engines (Alta Vista, Excite, and Lycos), they come up with five aspects to evaluate the search engines, respectively indexing, search power, search effect, output, and users' burden (Chu and Rosenthal, 1996). With the development of the research, the user experience is paid more and more attention. Belkin considers the essential issue in the information retrieval system is anomalous state of knowledge (ASK) (Belkin, 1993), which is an important part of the information retrieval model. Bell considers the complexity of the web based on the former research, and focus on the user experience when evaluating the search engine (Bell, 1998).
Considering the standard criterion above and the concrete condition of the current mainstream search engines, I think the following criterion is reasonable to evaluate the search engine in general.
Capability of the Search Engine
The factor includes the capacity and coverage of the database of the search engine, the content and the depth of indexing, and the novel rate of the database. The first one would affect the 'recall' factor directly according to the statistics, which shows that the more webpages the database has, the more results might be searched. Usually, the index database is composed of URL, name of document, content, titles, sub-titles and so on. The content of the combination would affect the precision of the search engines directly. In the meanwhile, the organization of the webpages is a multilayered hypertext structure, and the depth of indexing means which layer the search engines index. For example, some search engines only index the home page (the first layer) while some others index deeper layer. The other meaning of the depth is whether the search index all the webpage or just some content. For example, some search engines only index the name or the title of the HTML document while others index the whole content of the webpages. Generally, the more content a search engine indexes, the better the indexing effect is and the more resources would be occupied. The novel rate includes two aspects, uniqueness and update frequency. Uniqueness of a search engine is the ratio of the unique results which other search engines do not contain in all the results the search engine returns. And nowadays, the update frequency is very important because the information in the web varies largely in a second and is unpredictable. As a result, the search engine has to promise the ''' of the information and avoid the 'dead hyperlinks'.
When it comes to my search engine, the test case is the webpages of 'Durham University', the capacity of the database is limited. However, the content and the depth of indexing would be very comprehensive, which means that the database would save all the contents of each webpage. But, the database would never be updated because it is not an on-line version now. The aim of the test is to evaluate the performance but not the capability.
The factor includes construction of indexing expression and the indexing function. In 1998, Rousseau et al. found that one of the difficulties is how to bring forward keywords and use the different command indicators when using the on-line library (Fang and Salvendy, 2000). In fact, the search engine has the same problem with the on-line library. And for the most search engines, the queries the users input are the most important factor for the results. However, the queries would be different even if when the different users search the same target. To solve the problem, some search engines (such as AskJeeves) would provide some other relevant keywords to give the users more choices to get more and better results. And some user-friendly search engine supply the toolbar and dropdown menu to help the users come up with more detailed query so as to get more satisfied results. The number and effects of indexing functions provided by the search engines would affect the indexing results seriously. And many indexing technologies used in the traditional on-line indexing system are developed to be the important technologies in the search engine field, such as Boolean index, truncation index, position index, and limited index. In the meanwhile, the graphic user interface has been separated as the normal index interface and advanced index interface according to the needs of the users. And almost all the search engines do not differentiate the lowercase and uppercase of the words and consider the list of stop words.
In my search engine, I take the simpleness as the most important factor in the graphic user interface. As a result, there is no toolbar and dropdown menu existing. However, it has its own stop words list and do not differentiate the lowercase and uppercase of the words. Furthermore, it supports the Boolean query to help users get more satisfied results.
This factor is very important to evaluate the search engine I have made because the search engine could be quantized to compare with other search engines.
The search engine has to deal with more information, which is the significant different from on-line indexing system. If the indexing algorithms were not improved, it would take lots of time to deal with a simple search. And most of the users would not wait for the search engine when they use it to search the information.
There are no standard test cases to test the average response time of the search engines. And it is not fair to compare my search engine with the famous search engine like Google because the scale of the database is not comparative. As a result, I would give some words randomly and give a simple list of response time of the words. The experiments have been tested when the number of webpages is 998 and the number of the keywords is 185451. The detail information would be listed as table 1.
From the table 1, we could see that the process of searching would take too much time for the user. There are two main reasons slowing down the speed of the search engine. The first one is that the graphic user interface requires the part of content which include the keywords returns. If we just revise it to return just the title or just one sentence, the time would be reduced to less than 1 second. The second one affecting time is the query in the database. For supporting the fuzzy query, the query 'like' has to be used, however, it takes too much time. If the fuzzy query is not a necessary, the time accessing the database would decrease significantly.
Recall and Precision
These two factors are very important in the past when the information is limited. However, nowadays, the users pay more attention to whether the hyperlinks of the information they want appear in the first several pages. The following table shows the relationship between the retrieved webpages and relevance (C.J., 2010).
The calculation of recall and precision is as follows:
In the function, |.| means the number of '.'.
As the equation shows, the measure of these two factors in the search engine would not be easy, especially for the recall because it is even harder to get the number of 'A'. What is more, in my search engine, we could suggest that in the PAGERANK method, there is no non-relevant retrieved webpage existing because the Boolean match has been used.
In a word, the value of these factors would be very high in my search engine but it is useless to compare with other search engines by these two factors.
The format and content of output
The outputs of the results are different among search engines, however, there are almost three parts: title, summary, and URL in all the search engines, what is more, the keywords are highlighted in some search engines. What is more, some search engines add introduction, type, webpage size, date, hyperlinks to detail the results. Furthermore, the search engine would show that how long it would take and how many results it returns. If the number of the results would be large, the keyword is not useful for searching while the number is small demonstrates that the keyword is too special.
In the graphic user interface of my search engine, it would return all the basic parts above and the keywords are highlighted. What is more, it would return time and the number of results.
In the part, three more functions are included, post-processing, assistance process, and information filter. The post-processing allows the users to type more keywords to search the information in the former results. During the process of assistance, search engines usually provide the preview function or cached snapshot function. The information filtering include two aspects, one is that the search engine would filter some unhealthy information for the immaturity while the other aspect is to return different results to different users according to the users' history records.
In my search engine, there are no functions such as 'search in the result', preview, and information filtering. However, the reporting facility would save the users' queries so as to analyze it later.
Comparison between the PAGERANK and HITS
According to the principle of the PAGERANK and HITS algorithm, it seems that they are similar which are based on the hyperlinks and use the eigenvector. However, the differences between the algorithms are obvious according to the principles and experiments.
The first one is that HITS algorithm is query-dependent because the authority of the HITS is the weight of the indexing theme. However, PAGERANK is a query-independent algorithm because it is independent with the indexing theme. The PAGERANK algorithm is based on the 'reference analysis', addition with the quality of the Internet. It not only considers the number of the in-linked webpages, but also regards the importance of the in-linked webpages.
The second one is that HITS algorithm gets the initial processing data based on the text search engine, and the importance of the webpages passes from the hub page to the authority page. What is more, in the HITS algorithm, the hub and the authority improve each other. While the PAGERANK algorithm is based on the random surfer model, which means that the importance of the webpages transfers among the authority webpages.
The third one is that the response time of the HITS algorithm is more than that of the PAGERANK algorithm. Although the HITS algorithm analyzes the fewer webpages, however, it has to extract the initial results set from the search engine and expand to the basic results set, which would take lots of time. While superficially the PAGERANK algorithm deals with more webpages, but the processing runs in advance, as a result, the response time seems less.
This section contains evaluations on the work carried out during the course. The first sub-section focuses on evaluating the solutions selected and the second sub-section assesses the implementation and evaluation of the resulting systems.
Evaluation of Solutions
The final aim of the project is to build a search engine and compare the ranking algorithm PAGERANK and HITS. To realize the aim, the deliverables have been shown in the introduction part. The following part is to evaluate each part in the deliverables.
The first one in the minimum objectives is to design a web crawler. The detail has been introduced and it actually works. However, there are some problems in the crawler part. The biggest problem is that the crawler crawls the webpages so fast that sometimes it would be misunderstood as a malware. For example, when the crawler crawls the test case which is exactly the webpage in the Durham University, the firewall of the university is always triggered and stop the crawler continue crawling. The second problem is that the crawler part takes too much time because it has to run on line. Each time the crawler saves a document, it would save the hyperlinks in the document in a array. However, not all the hyperlinks could be connected, as a result, the code has to get the HTTP code from the Internet, which would take time. Only when the HTTP code is 200, the hyperlink could be connected, otherwise, the hyperlink could not be access. Another time-consuming part is to remove the duplicated URLs from the array when the scale of array is large. As a result, it is very easy to solve the problem actually. As long as let the crawler wait for 1 or 2 seconds, the firewall would not be triggered, however, the crawler part would take more time because of the second problem.
The second in the minimum objectives is to build and maintain an index. Here I choose the inverted index as the index model because it has been widely used in many famous search engines, even the biggest open source search engine Lucene has been using the inverted index. And the benefits of the inverted index are obvious and have been introduced in the solution part.
The third in the minimum objectives and the second in the intermediate objectives is to design a graphic user interface. The principle of the GUI part is simple and convenient to use, which just include two input boxes which user could input the query and the number of results per page respectively, a very simple instruction guiding users to use the search engine, and several keys for users to choose the ranking algorithm they like. If I have more time, there are lots of things to do. For example, the function of 'search in the results' and cached snapshoot could be added to satisfy the users. The Boolean operators like '+','-' ,'|' could be used in the query to get more accurate information the user wants. More Boolean operators could be added if the more time is given.
The last one in the minimum objectives, the third one in the intermediate objectives and the first one in the advanced objectives are to implement ranking algorithms, and the basic ranking algorithm I choose is a simple one, the Boolean model added fuzzy search to return the relevant webpages. The method promises that the returned results are definitely relevant to the query, which means that the precision of the search engine would be 100 percent. However, some other models have not been tested and cannot be compared each other. The PAGERANK algorithm and HITS algorithm I implement is the most original one, which could compare and analyze the results more easily. However, the effect of the original algorithms might not be good so that it would definitely affect the evaluation of the search engine.
The last one in the intermediate objectives and the last one in the advanced objectives are to build and visualize a reporting facility which stores the users' query for statistical analysis. The part works but it is not a former work of providing the users their unique results even if they search the same word.
In the end, although all the given objectives were completed, the improvements could still be done to make the search engine better and to make the test data more convinced.
Evaluation of Implementation
According to the 'Results' part, the search engine has been evaluated. However, there are still some problems need to be solved.
First of all, the test case is not good. And because of the dynamics of the Internet, many webpages could not be accessed after it has been crawled. Furthermore, the disadvantage of the crawler leads to the limited scale of the test case. As a result, it seems that the capability of the search engine is not good. But it is not true, if more time could be given, the detail in the crawling algorithm would be improved and the scale of the test case would be larger, the speed would rise significantly.
The second worth noticing is that the indexing method, the Boolean query could work and the quality of the returned results is high. However, the toolbar and dropdown menu would be added if I have more time.
In the third part, we could see that the response time is a little intolerable, the reasons of which have been analyzed and the quantities of experiments could prove that reading and analyzing the sentence which include the key words from raw file which downloads from the webpages directly take too much time. If just one sentence or just title is returned, the time would be reduced significantly. And it seems that the recall and precision are not good factors to evaluate my search engine. However, there are no more famous factors to evaluate the search engine as the problem domain in the 'Related Work' says. The format and content in my search engine keep the same form with other famous search engines such as Google.
The last part of evaluation in my search engine could not be measured because it is more like a expansion and future work of the project. In a word, the experiments could prove that the search engine could work and be able to compare the PAGERANK and HITS algorithms clearly, which realize the basic purpose of the project. But there are lots of details that could be improved.
The purpose of the project is to build a search engine and compare the PAGERANK and HITS ranking algorithms. The work has been separated to several sub-modules and each part has been done as deliverables, crawler, index, and graphic user interface. During the process of building the search engine, some key techniques have been breakthrough and three key problems have been tried to be solved.
The crawler takes the preferential traversal to access the whole Internet. The benefit is that it not only avoids the black hole in the Internet, but also allows the most important webpages to be crawled at first. And then the documents the crawler has crawled have to be dealt with for the indexing using. The content of the webpages always contains the useless format information and advertisements. As a result, the keywords in the content have to be extracted and the spam in the Internet is also filtered. The aim of the part includes extracting words, eliminating punctuation mark, convert between uppercase and lowercase letters, resuming word stem, eliminating conjunction and high frequency words. What is more, the dictionary of stop words is built to extract the useful key words from the content and the stems of the words are used to neglect the tense of the verbs and the plural of the nouns. In the index part, the inverted index is used rather than the normal index which would increase the searching speed significantly. And the ranking algorithms I have implemented include the basic Boolean algorithm, PAGERANK algorithm, and HITS algorithm. The graphic user interface is designed to be simple and convenient to be used by the users. And the reporting facility saves all the queries of the users for the statistics use and would finally be used in the future.
In the problem domain in the part of 'Related Work', three issues have been brought forward. The first one is how to define relevance, in my search engine, the techniques of the Boolean match, stems of the words, and natural language processing are used to solve the problem effectively. The second issue is how to evaluate the search engine. In the part of 'Results', the whole system of evaluation has been given, although it is not appropriate for evaluating my search engine, however, it is quite a good system for evaluating the common search engine because it covers almost every aspect of the search engine. The last issue is how to satisfy the users. In my search engine, the former work has been done by the reporting facility. However, the problem has not been solved in the project. But it is definitely the focus of the future work.
Luo Gang and Wang Zhendong , (2010). Writing a web crawler by yourself, 1st Ed., Tsinghua Univeristy Press
Wang Xuesong, (2008). Development of Lecene+Nutch Search Engine, 1st Ed, Post & Telecom Press
Yu Jing, (2010). Course of JavaWeb Application Development, 1st Ed, Beijing University of Posts and Telecommunications Press
W.Bruce Croft, Donald Metzler, and Trevor Strohman, (2010). Search Engines Information Retrieval in Practice, 1st Ed, China Machine Press
Jawadekar, Waman, (2011). Knowledge Management: Text & Cases, Tata McGraw-Hill Education
Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter and Sabrina Chandrasekaran, (2011). Inverted Indexes for Phrases and Strings. Available at:
http://www.cs.nthu.edu.tw/~wkhon/papers/PTSHVC11.pdf (Accessed: 25 January, 2014)
Furnas, G., et al.,(1987). 'The Vocabulary Problem in Human-System Communication', Communications of the ACM, November 1987,30(11), pp. 964-971
Cyril Cleverdon, (1970) 'Evaluation Tests of Information Retrieval Systems', Journal of Documentation, Vol. 26 Iss:1, pp. 55-67
Bing Liu, (2009). Web Data Mining, 1st Ed, Tsinghua Univeristy Press
Luo Gang, (2011). Encoding Technology in the Search Engine ' Lucene & Java, 1st Ed, Publishing House of Electronics Industry
Junlin Zhang, (2012). This is the Search Engine, 2nd Ed, Publishing House of Electronics Industry
Christopher D. Manning, Prabhakar Raghavan and Hinrich Sch??tze,(2008). Introduction to Information Retrieval. Cambridge University Press
Wei Zhang (2005), 'Search Engine Optimization Strategy Research based on PageRank Algorithm', unpublished
Yibing Li (2011), 'Webpage Ranking Algorithm Study based on Search Engine', umpublished
China Internet Network Information Center(CNNIC)(2013). The 33rd Statistics Report for China Internet Developing Situation. Available at: http://www.cnnic.net.cn/hlwfzyj/hlwxzbg/hlwtjbg/201403/t20140305_46240.htm (Accessed: 24 April 2014)
Shuling Dong (2011), 'Research and Improvement of Ranking Algorithms on Search Engine', umpublished
Lina Zhang, Zhiyin Yang, Bo Yang (2011), 'Research on the Current Development Situation on the Third-Generation Search Engine', Sci-Tech Information Development & Economy, 21(34):141-143
Lancaster, F. W. & Fayen, E. G. (1973). Information Retrieval On-Line. Los Angeles: Melville Pub. Co.
Heting Chu and Marilyn Rosenthal.,(1996). 'Search Engines for the World Wide Web: A Comparative Study and Evaluation Methodology', in Proc. Of the 59th Annual Meeting of the American Society for Information Science, pp.127-135.
Belkin, IV.j.,(1993). 'Interaction with texts: information retrieval as information-seeking behavior, Information Retrieval '93'55-66
Bell, S. 'Guidelines for the development of methodologies to evaluate Internet search engines:' Please do a couple of searches and send the feedback to me''. Glasgow: Department of Information Science, University of Strathelyde (MSeThesis), 1998
X. Fang and G. Salvendy(2000). 'Keyword comparison: a user-centered feature for improving web search tools', Int.J.Human-Computer Studies, Vol.52:915-931
C.J. van RIJSBERGEN(2010). Information Retrieval. Available at: http://www.dcs.gla.ac.uk/Keith/pdf/Chapter7.pdf (Accessed: 28 April 2014)
Source: Essay UK - http://www.essay.uk.com/free-essays/information-technology/search-engine.php