Information Retrieval From Web Data Using SQFL

Abstract: We present a query formulation language, SQFL, for retrieving structured data from the Web. We assume that Web data sources are represented in formats such as RDF, JSON, and ATOM. The objective of SQFL is to let users mash up data diagrammatically. In the background, SQFL queries are translated into SPARQL queries and executed. A user can query one or more data sources without knowing the structure of the queried data, and the data itself need not adhere to a schema. We design Web mashups in which the Internet is treated as a database, each data source as a table, and a mashup as a query. Interactive performance during query formulation is achieved with a new approach to indexing RDF data, the Graph Signature. SQFL is intended for non-technical users as well as developers. SQFL has two implementations: a server-side query and mashup editor, and a browser-side Firefox add-on available for download.

Keywords: ATOM, CSV, C-Store, Deep Web, Oracle Semantic Technology, RDF, RDF-3X, Semantic Web, SPARQL, Structured Data, Web feeds.


We are currently enjoying the glory of Web 4.0, having left Web 2.0 far behind. We use Web 4.0 every day and recognize that it is established in our lives. But how did we get here, and who extended Web 2.0 into today's Web 4.0? The answer is simple: we, the users, are the creators of Web 4.0. In this era, it is the users who add value to services. Web 4.0 can be regarded as a read/write platform on which everybody can share everything: photos, videos, music, bookmarks, articles, and even virus threats. Additionally, Web 4.0's pioneer companies are moving toward a "long tail" philosophy for their target market rather than focusing on the center of the market, and they take advantage of their Web services at the edges of the Internet rather than at its center.
The advent of mashups has made it possible to combine data into a new service. Users no longer have to navigate across the Internet to find the different data they want; they can follow recent news through a mashup of their own, for example getting sports news along with political news in one application using technologies such as RSS, Atom, and JSON. Data available on the Web can be accessed through XML files that are imported into an application and then filtered or sorted as desired.
The Semantic Web allows two different applications to exchange information meaningfully and thus to use information to its full potential. The Resource Description Framework (RDF) is an attempt to provide the means to represent the metadata of Web applications in a well-formed manner. Semantic Web Pipes is a powerful paradigm for building RDF-based mashups: it works by querying and operating on RDF resources and producing an output that is accessible via a stable URL. RDFa enables developers to make an HTML page do double duty.
That is, the page works both as a presentation page and as a machine-readable source structured as RDF data. To expose the massive amount of public content and let people build mashups easily, several mashup editors have been launched, such as Google Mashup Editor, Yahoo! Pipes, and others. Yahoo! Pipes is the most popular mashup editor today because of its simplicity and user-friendly environment [18]. However, Yahoo! Pipes focuses only on mashing up Web feeds: it can present news items but cannot represent data retrieved from the Deep Web and encoded in RDF or XML. We present an implementation of an interactive query formulation language, SQFL, that regards mashups as data queries and views the Internet as a database, where each source is considered a table and a mashup a query. Assuming that Internet data is represented in RDF, which plays the role of a semantically enabled metadata model, queries over this data can be expressed in SPARQL.
Several query formulation techniques exist, but none of them makes it easy to formulate queries.
• Query-by-Form:
In Query-by-Form [4], the user poses a query by filling in a form whose fields are treated as query variables. The method is simple but not flexible: a form must be developed for each query, and the form must change whenever the query changes.
• Query-by-Example:
Here users formulate queries by filling in tables. However, this requires the data to be schematized and the users to be aware of the schema.
• Conceptual Queries:
Here the user queries an existing database through its conceptual model, such as a UML or ER diagram. The user first selects part of a given diagram, and the selection is then converted into SQL (e.g., LISA [6], ConQuer [7], and MQuery [8]). Again, these approaches require users to know the schema and the structure of the data, and to have a good knowledge of the conceptual schema.
• Natural Language Queries:
Here users write queries as natural language sentences, which are then translated into a formal language (e.g., SQL [10], XQuery [9]). The user need not know the schema in advance. The problem with this approach is language ambiguity, i.e., terms with multiple meanings and the mapping between these terms and the elements of a data schema.
• Visual Queries:
Here a SPARQL query is formulated by visualizing its triple patterns as ellipses connected with arrows. Several Semantic Web tools (iSPARQL [11], RDFAuthor [14], GRQL [5], and NITELIGHT [13]) formulate SPARQL queries in this way, but they require some technical skill from the user. Some tools also assist developers in formulating XQuery queries graphically (Altova XMLSpy, Stylus Studio, BEA XQuery Builder, XML-GL, and QURSED), but they require technical knowledge about the queried sources and their schemas/DTDs.
• Interactive Queries:
With interactive query languages such as LOREL [17], the user can query schema-free data such as XML without being aware of a schema. However, LOREL handles schema-free queries only partially and does not support querying multiple sources.
• Interactive Search Box:
The interactive search box is today's most popular query formulation technique: the user types a keyword and the system suggests completions. Its limitation is that it cannot play the role of a query language.
• Mashup Editor:
A mashup is a Web application that combines data from two or more sources to form a completely new service; a mashup editor is a tool for building one. A mashup works like a remix that processes data from different resources. Based on the way they extract data, mashups can be divided into three types:
1) Consumer mashups,
2) Data mashups, and
3) Business mashups.
From a development point of view, mashup editors can be classified as:
1. User-friendly mashup editors, such as Yahoo! Pipes, where elements are dragged and dropped onto a form and connected with pipes. Serena can also be considered a user-friendly tool with a GUI.
2. Hard-coding mashup editors, where the user actually builds an HTML file with certain tags at the beginning and at the end of the file.
Both categories use RSS files to bring data into mashups. We discuss these editors in the following sections.

• Yahoo! Pipes:
Yahoo! Pipes is a Web application service from Yahoo [18]. It is called "Pipes" because you can fetch data from any source that supports RSS, Atom, or other XML feeds, extract the data you want, combine it, apply filters, and produce an output. The tool is easy to learn and understand: in our experience, no more than a few minutes are needed to create a mashup, even one whose data must be displayed on a map such as Yahoo or Google Maps. Its drawbacks are that it is not flexible, it works only within the Yahoo environment and cannot be used in your own website, and it does not support the RDF format.

• Google Mashup Editor:
Google mashups use the Google Mashup Editor (GME) as the tool to edit, compile, test, and manage the application. The basic difference between Google mashups and the others is that everything has to be implemented with code. The code is placed between the tags <gm:page> and </gm:page>.
HTML, CSS, or JavaScript code can be added within the <body> and </body> tags. The drawback of this editor is that a mashup has to be built by writing code, so it is not user-friendly.

• DERI Pipes:
DERI Pipes [15] has a simple construction model consisting of linked operators. Each operator accepts a set of unordered inputs in different formats and a list of ordered optional inputs. The operators help mash up information from the Semantic Web. The interface consists of the pipes menu, from which ready-made code can be added to the pipe code area; the pipe code area, where the pipe can be customized by writing code; the published pipes menu; and the output area, which shows the computed output of the pipe when the pipe URL is fetched. The drawback of this editor is the same as that of the Google Mashup Editor: the pipe has to be built by writing code, so it is not user-friendly.
We will implement our editor with an interface similar to Yahoo! Pipes and an implementation based on the RDF framework, as used by the Google and DERI editors. We are designing a query formulation technique, SQFL, in which the user need not be aware of a schema, can query schema-free data, and can query multiple data sources. The Web is treated as a database and each data source as a table.
A. System Analysis and Implementation Strategy

The system consists of a query language, a query formulation algorithm, the Graph Signature algorithm, a server-side editor, and a Firefox add-on extension. We present two implementations of SQFL: a server-side editor and a Firefox add-on extension. We evaluate the response time of SQFL on two large data sets, DBLP and DBPedia, and compare it with Oracle's Semantic Technology. We show how queries can be answered instantly, regardless of the data size. The flow of the system is as follows:

Figure 1. System Flow

• Definition of SQFL
In SQFL, the data set to be queried is a set of triples of the form <Subject, Predicate, Object>, where

$S \in I$, $P \in I$, and $O \in I \cup L$,

with $I$ the set of identifiers (URIs or keys) and $L$ the set of literals.
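The membership constraints on a triple can be made concrete with a small Python sketch. The class names `Identifier`, `Literal`, and `Triple` and the helper `is_well_formed` are hypothetical names introduced here for illustration; they are not part of SQFL itself.

```python
from typing import NamedTuple, Union


class Identifier(str):
    """A member of I: a URI or key that names a resource (hypothetical wrapper)."""


class Literal(str):
    """A member of L: a plain value such as a title or a year (hypothetical wrapper)."""


class Triple(NamedTuple):
    subject: Identifier                  # S must be an identifier
    predicate: Identifier                # P must be an identifier
    obj: Union[Identifier, Literal]      # O may be an identifier or a literal


def is_well_formed(t: Triple) -> bool:
    """Check the three membership constraints of the triple definition."""
    return (
        isinstance(t.subject, Identifier)
        and isinstance(t.predicate, Identifier)
        and isinstance(t.obj, (Identifier, Literal))
    )


# A tiny illustrative dataset (the URIs are made up for this example).
dataset = [
    Triple(Identifier(":a1"), Identifier(":title"), Literal("Skyline Query")),
    Triple(Identifier(":a1"), Identifier(":creator"), Identifier(":p1")),
]
```

A triple whose subject is a literal would be rejected by `is_well_formed`, which mirrors the requirement that only objects may be literals.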

• The Intuition of SQFL
An SQFL query is structured as a tree whose root is the query subject. The subject may be an instance, an instance type, or a user-defined variable. Each property of the subject is a branch of the tree. As branches grow, subtrees are formed, called query paths: the object of a property is treated as the subject of its subquery. In this way, one can navigate through the underlying dataset and build complex queries. Objects that carry the projection mark are returned in the query results. When querying different sources, two properties (or two instances) are considered the same if they have the same URI. SQFL queries are translated into SPARQL queries according to a set of mapping rules, implemented by the SQFL-to-SPARQL translator, and are then submitted for execution.
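The tree-to-SPARQL idea can be illustrated with a minimal Python sketch. The actual mapping rules are not listed in this paper, so the rule used here (one triple pattern per branch, with the object of a branch reused as the subject of its subtree) is an assumption, and the names `Branch`, `Query`, and `to_sparql` are hypothetical. Prefix declarations are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Branch:
    prop: str                 # property name, e.g. ":title"
    obj: str                  # variable ("?t") or constant
    project: bool = False     # True if the object is returned in the results
    sub: List["Branch"] = field(default_factory=list)  # query path below obj


@dataclass
class Query:
    subject: str                      # root of the tree, e.g. "?a"
    subject_type: Optional[str]       # optional instance type, e.g. ":Article"
    branches: List[Branch]


def to_sparql(q: Query) -> str:
    """Assumed mapping: each branch becomes one SPARQL triple pattern."""
    patterns: List[str] = []
    projected: List[str] = []
    if q.subject_type:
        patterns.append(f"{q.subject} a {q.subject_type} .")

    def walk(subject: str, branches: List[Branch]) -> None:
        for b in branches:
            patterns.append(f"{subject} {b.prop} {b.obj} .")
            if b.project:
                projected.append(b.obj)
            walk(b.obj, b.sub)  # the object becomes the subject of its subquery

    walk(q.subject, q.branches)
    select = " ".join(projected) or "*"
    body = "\n  ".join(patterns)
    return f"SELECT {select}\nWHERE {{\n  {body}\n}}"


example = Query(
    subject="?a",
    subject_type=":Article",
    branches=[
        Branch(":title", "?t", project=True),
        Branch(":creator", "?c", sub=[Branch(":name", "?n", project=True)]),
    ],
)
sparql_text = to_sparql(example)
```

For the example tree, `sparql_text` selects `?t` and `?n` and contains four triple patterns, the last of which (`?c :name ?n .`) comes from the query path under the creator branch.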

B. The SQFL Editor
We will implement SQFL in two scenarios:
1) Online server-side query and mashup editor
2) Browser-side Firefox add-on editor
Its architecture is as follows:

Figure 2. System Model

System Components:

• Loader: sends a request to the database to download the content of an RDF resource.
• Pipe Generator: saves or loads a pipe created in the SQFL editor.
• Parser: converts the SQFL pipe diagram into XML format and vice versa. The XML file is stored in the database.
• Results Renderer: sends the appropriate query to the database, collects the results of its execution, formats them appropriately, and sends them back.
C. Algorithmic Approach for Query Formulation

This algorithm is used by the SQFL editor. Its novelty is that it lets the user navigate through and query one or more data graphs without assuming that the end user knows the schema or that the data adhere to a schema. The background queries are generated by the query formulation algorithm.
The algorithm is as follows:
Step 0: Specify the dataset D in the input module. D can be a merge of multiple data sources.
Step 1: Select a subject S, where $S \in S_T \cup S_I \cup V$.
The user can select S from a drop-down list that contains:
$S_T$: the set of all subject types,
$S_I$: the union of all subject and object identifiers in the dataset (URIs and keys),
V: a user-defined variable; instead of selecting from the list, the user introduces their own label for S.
Repeat Steps 2-3 until the user stops.
Step 2: Select a predicate/property P.
There are four possibilities:
1) $S \in S_T$
2) $S \in S_I$
3) S is a variable
4) The user can make the property itself a variable by introducing their own variable.
Step 3: Select an object filter. This selection filters the values of the property P selected in Step 2. There are three types of filter:
1) Filtering function (e.g., equal to, not equal to)
2) Object identifier
3) Query path
Step 4: Indicate the return values (projections) of the query.
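The drop-down lists of Steps 1 and 2 can be sketched as background queries over an in-memory list of triples. This is only an illustration under stated assumptions: literals are written in double quotes, types are given by an `rdf:type` property, and the behaviour for each kind of subject (properties of a type's instances, of one instance, or all properties for a variable) is an assumed reading of the steps above. The function names are hypothetical.

```python
RDF_TYPE = "rdf:type"


def is_identifier(term: str) -> bool:
    # Assumption for this sketch: literals are written in double quotes.
    return not term.startswith('"')


def subject_candidates(triples):
    """Step 1: build the lists of subject types and of identifiers."""
    subject_types = {o for _, p, o in triples if p == RDF_TYPE}
    identifiers = {s for s, _, _ in triples}
    identifiers |= {
        o for _, p, o in triples if p != RDF_TYPE and is_identifier(o)
    }
    return sorted(subject_types), sorted(identifiers)


def property_candidates(triples, subject):
    """Step 2: list the properties offered for the chosen subject."""
    if subject.startswith("?"):
        # S is a variable: any property in the dataset may apply.
        return sorted({p for _, p, _ in triples})
    members = {s for s, p, o in triples if p == RDF_TYPE and o == subject}
    if members:
        # S is a subject type: properties of the instances of that type.
        return sorted({p for s, p, _ in triples if s in members})
    # S is an instance identifier: properties of that instance only.
    return sorted({p for s, p, _ in triples if s == subject})


triples = [
    (":a1", RDF_TYPE, ":Article"),
    (":a1", ":title", '"Skyline Query"'),
    (":a1", ":creator", ":p1"),
    (":p1", RDF_TYPE, ":Person"),
    (":p1", ":name", '"C. Shahabi"'),
]
types, instances = subject_candidates(triples)
article_props = property_candidates(triples, ":Article")
```

Each call scans the whole triple list, which is why such background queries become expensive on large graphs and motivate an index.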

Graph-Signature (GS) Index:

Because the data may be schema-free, the query formulation algorithm must query the whole dataset, and its background queries involve many self-joins, so performance may degrade. To improve performance, we propose a new technique for indexing RDF, the Graph Signature. Queries execute faster on the graph signature because it is smaller than the original graph.
In SQFL, the background queries must be answered on the dataset as fast as possible, i.e., within 100 ms. Such a short interaction time is difficult to achieve for graph-shaped data: the graph is stored in a relational table, and answering a background query requires joining the table with itself many times. A query with n levels involves n - 1 joins. Precomputing and materializing all possible SQFL background queries is not an option, since the space requirements are too high. RDF indexing is therefore needed. Several approaches are available, such as Oracle [22], C-Store [16], and RDF-3X [2]. These approaches perform well in general, but their performance is poor for large graphs.
In this section, we present the Graph Signature for indexing RDF graphs. A graph signature S is either an O-signature $S_O$ or an I-signature $S_I$. $S_O$ is a summary of the original graph in which nodes that have the same outgoing paths are grouped together. $S_I$ summarizes a graph by grouping nodes that have the same incoming paths, which is analogous to the 1-index [12]. The input of the graph signature algorithm is a data graph, and the output is the O- or I-signature. To compute the O-signature, we first group nodes that have the same properties; we then iteratively split groups that are not O-bisimilar until all groups are stable. A group A is stable iff, for every property P that leads from A into another group B, each node of A has a P-successor in B. Writing X for the set of nodes that have a P-successor in B, A is stable when $A \subseteq X$. Otherwise, A is split into two groups, $(A \cap X)$ and $(A \setminus X)$. The I-signature is computed in the same way, but in the opposite direction.

D. Experimental Setup

The SQFL editor takes inputs from multiple sites in formats such as ATOM, JSON, and RDF. Queries formulated over these inputs are translated by the SQFL-to-SPARQL translator and executed on the RDBMS. The output is the mashed-up data from the different sites. When the user specifies an input such as RDF or JSON, it is loaded into Oracle 10g, installed on a server with a 2 GHz dual CPU, 2 GB RAM, a 500 GB HDD, and Windows, and its graph signature is generated. Subsequently, the SQFL editor, built with the JDK (NetBeans), dispatches the background queries and the SPARQL translations of the formulated SQFL queries to Oracle 10g for execution.

E. Performance Evaluation

Our evaluation is based on two public data sets: 1) DBLP and 2) DBPedia. DBLP (700 MB) is a graph of eight million edges. DBPedia (6.7 GB), an RDF version of Wikipedia, is a graph of 32 million edges. We chose these data sets to illustrate the scalability of our graph index on both homogeneously and heterogeneously structured graphs. In the following, we present an SQFL query, identify its set of background queries, and evaluate them on Oracle Semantic Technology, the native RDF index in Oracle 10g, as well as on the Graph Signature index. Consider the query: retrieve everything related to an Article that has the title "Skyline Query" (Q1), has a creator of type Person (Q2), has the name "C. Shahabi" (Q3), and was published in the year 2006 (Q4). These queries are executed on each partition using Oracle Semantic Technology and Graph Signature indexing. The time costs (in seconds) are shown in Table 1 and presented graphically in Figure 3.

Query   Oracle Semantic Technology (sec)   SQFL, Graph Signature Indexing (sec)
Q1      0.005                              0.003
Q2      0.136                              0.031
Q3      0.871                              0.058
Q4      1.208                              0.091

Table 1. Time Cost (in Seconds) of Background Queries
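The relative gain implied by Table 1 can be checked with a few lines of Python. The timings are copied from the table; the variable names are illustrative only.

```python
# Time cost in seconds per background query: (Oracle, Graph Signature).
timings = {
    "Q1": (0.005, 0.003),
    "Q2": (0.136, 0.031),
    "Q3": (0.871, 0.058),
    "Q4": (1.208, 0.091),
}

# Speedup of Graph Signature indexing over Oracle Semantic Technology.
speedup = {q: oracle / gs for q, (oracle, gs) in timings.items()}

# Which queries stay inside the 100 ms interaction budget with the index?
within_budget = {q: gs <= 0.100 for q, (_, gs) in timings.items()}

for q in timings:
    print(f"{q}: {speedup[q]:.1f}x faster, within 100 ms: {within_budget[q]}")
```

With these figures the speedup grows from roughly 1.7x for Q1 to about 15x for Q3, and all four Graph Signature timings stay below 100 ms.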

Figure 3. Response Time Evaluation

We have proposed a query-by-diagram language, SQFL, that makes it easy to build data mashups. SQFL is user-friendly for non-IT people and allows querying and navigating RDF sources without knowing the schema or any technical details of the data sources. It offers a new query formulation approach that lets people mash up and query structured data without any prior knowledge of the schema, vocabulary, structure, or technical details of the datasets. SQFL is easy to learn because it is close to the logic and natural language that people use when asking questions. It also enables semantic pipes for mashing up RDF, JSON, ATOM, and CSV data easily through user-friendly module designs.

[1] Miller, "Response Time in Man-Computer Conversational Transactions," Proc. Fall Joint Computer Conf., 1968.
[2] T. Neumann and G. Weikum, "RDF-3X: A RISC-Style Engine for RDF," Proc. VLDB Endowment, 2008.
[3] R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes, "Exploiting Local Similarity for Indexing of Paths in Graph Structured Data," Proc. Int'l Conf. Data Eng. (ICDE), 2002.
[4] M. Jayapandian and H. Jagadish, "Automated Creation of a Form Based Database Query Interface," Proc. VLDB Endowment, 2008.
[5] N. Athanasis, V. Christophides, and D. Kotzinos, "Generating On the Fly Queries for the Semantic Web," Proc. Int'l Semantic Web Conf. (ISWC '04), 2004.
[6] A. Hofstede, H. Proper, and T. Weide, "Computer Supported Query Formulation in an Evolving Context," Proc. Australasian Database Conf., 1995.
[7] A. Bloesch and T. Halpin, "Conceptual Queries Using ConQuer-II," Proc. Int'l Conf. Conceptual Modeling (ER), 1997.
[8] J. Dionisio and A. Cardenas, "MQuery: A Visual Query Language for Multimedia, Timeline and Simulation Data," J. Visual Languages and Computing, vol. 7, no. 4, pp. 377-401, 1996.
[9] Y. Li, H. Yang, and H. Jagadish, "NaLIX: An Interactive Natural Language Interface for Querying XML," Proc. ACM SIGMOD Int'l Conf. Management of Data, 2005.
[10] A. Popescu, O. Etzioni, and H. Kautz, "Towards a Theory of Natural Language Interfaces to Databases," Proc. Eighth Int'l Conf. Intelligent User Interfaces, 2003.
[11] Stylus Studio, html, Feb. 2010.
[12] T. Milo and D. Suciu, "Index Structures for Path Expressions," Proc. Int'l Conf. Database Theory (ICDT), 1999.
[13] A. Russell, R. Smart, D. Braines, and R. Shadbolt, "NITELIGHT: A Graphical Tool for Semantic Query Construction," Proc. Semantic Web User Interaction Workshop (SWUI), 2008.
[14] D. Steer, L. Miller, and D. Brickley, "RDFAuthor: Enabling Everyone to Author RDF," Proc. Int'l World Wide Web Conf. (WWW '02 Developers Day), 2002.
[15] G. Tummarello, A. Polleres, and C. Morbidoni, "Who the FOAF Knows Alice?" Proc. Int'l Semantic Web Conf. (ISWC), 2007.
[16] D. Abadi, A. Marcus, S. Madden, and K. Hollenbach, "Scalable Semantic Web Data Management Using Vertical Partitioning," Proc. Int'l Conf. Very Large Data Bases (VLDB), 2007.
[17] N. Athanasis, V. Christophides, and D. Kotzinos, "Generating On the Fly Queries for the Semantic Web," Proc. Int'l Semantic Web Conf. (ISWC '04), 2004.
[18] Yahoo Pipes, Feb. 2010.
[19] R. Goldman and J. Widom, "DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases," Proc. Int'l Conf. Very Large Data Bases (VLDB), 1997.
[20] M. Jarrar and M. Dikaiakos, "Querying the Data Web," Univ. of Cyprus, 2009.
[21] M. Jarrar and M. Dikaiakos, "A Query-by-Diagram," Technical Article TAR200805, Univ. of Cyprus, 2008.
[22] E. Chong, S. Das, G. Eadon, and J. Srinivasan, "An Efficient SQL Based RDF Querying Scheme," Proc. Int'l Conf. Very Large Data Bases (VLDB '05), 2005.

Source: Essay UK
