Information extraction for data records with varying structures from the deep web
Thesis posted on 26.05.2017, 07:13, authored by Hong, Jer Lang
Arguably, the Web now represents the largest database of information in the world. However, unlike regular databases, most of the information on the Web is stored in a format intended for human consumption only, typically as HTML pages for presentation within a web browser. Wrappers offer a way of extracting information from the Web into a form suitable for processing by a computer, thus rendering the information useful for many data processing applications. Recent development has seen the design of automatic wrappers, which have largely replaced manual, supervised, and semi-supervised wrappers because these earlier wrappers need human labeling and intervention in their operation. Automatic wrappers are robust and are able to interpret human-readable formatting automatically in order to induce the underlying data structures. In this thesis, we focus on the development of robust automatic wrappers for data extraction at the record level and the data unit level. Data extraction at the record level is the extraction of data records generated from a database server following a predefined template. Data extraction at the data unit level is the partitioning of data records into smaller units termed data items. Automatic wrappers are important because they can be used to automate meta search and to compare and evaluate shopping lists. For data extraction at the record level, our objective is to develop a set of fast wrapper heuristics to extract data records of varying structures from the deep web. Our heuristics are based on our observations of how information within a typical HTML page is structured, and they extract a set of statistical measures from the Document Object Model (DOM) tree of an HTML page. This information is then used to robustly extract the data records from the web page. Our results show that our heuristics-based wrapper, called WISH, is as robust as current state-of-the-art wrappers such as ViNT, VSDR and ViPER.
Moreover, WISH is a non-visual wrapper, and the results bring into question the underlying assumptions on which the current state-of-the-art wrappers were founded. This simplified wrapper approach could have significant speed advantages when processing large volumes of web site data, which could prove helpful for meta search engine development. Our heuristic technique simplifies the complicated process of comparing all the nodes of the tree structures, as is done in tree matching algorithms. Tree matching algorithms compare the identity and position of the nodes of two trees to determine the similarity of these trees. These algorithms are accurate but are normally complicated and slow. As data records from the deep web usually contain complicated tree structures, comparing the tree structures is time consuming and computationally expensive, particularly when a tree structure contains a large number of nodes. Our study shows that the similarity of tree structures can be checked by counting the number of nodes of the respective trees. Our simple heuristic method thus simplifies the coding procedure and reduces the work of a designer. This is an added advantage, as fewer nodes are required for matching and comparing the tree structures. Data extracted from an HTML page can be rearranged and presented in a clear and easily readable way, typically in tabular form. This process is known as data extraction at the data unit level (also known as data alignment). It is of great help in shopping list comparisons, for example. Current data alignment algorithms incorporated in wrappers such as DEPTA and ViPER are unable to align disjunctive data items (optional data items) and iterative data items (data items having similar identity and structure). To overcome this limitation, we use a template detection algorithm to match the structure of data records and align them accordingly.
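The node-counting idea described above can be illustrated with a minimal sketch. It is not the WISH implementation: the function names, the tolerance value, and the use of the sibling median are assumptions made for illustration, and the markup is assumed to be well-formed so that the standard library XML parser can read it. The sketch treats sibling subtrees as candidate data records when their node counts are close to one another, instead of matching the trees node by node.

```python
import xml.etree.ElementTree as ET


def node_count(element):
    """Number of element nodes in the subtree rooted at `element`."""
    return sum(1 for _ in element.iter())


def similar_by_size(a, b, tolerance=0.4):
    """Hypothetical similarity test: two subtrees are considered similar
    when their node counts differ by at most `tolerance` (relative)."""
    na, nb = node_count(a), node_count(b)
    return abs(na - nb) <= tolerance * max(na, nb)


def candidate_records(parent, tolerance=0.4):
    """Return the children of `parent` whose node count is close to the
    median node count of all siblings (an illustrative assumption)."""
    children = list(parent)
    if not children:
        return []
    counts = sorted(node_count(c) for c in children)
    median = counts[len(counts) // 2]
    return [c for c in children
            if abs(node_count(c) - median) <= tolerance * median]


# Made-up, well-formed page fragment: three record-like blocks and a footer.
PAGE = (
    "<div>"
    "<div><a>A</a><span>1</span></div>"
    "<div><a>B</a><span>2</span><i>new</i></div>"
    "<div><a>C</a><span>3</span></div>"
    "<p>footer</p>"
    "</div>"
)

root = ET.fromstring(PAGE)
records = candidate_records(root)
print([r.tag for r in records])
```

Counting nodes visits each subtree once, whereas a tree matching algorithm compares pairs of nodes across two trees, which is where the speed advantage claimed in the abstract would come from. The price is that two structurally different subtrees with the same size are indistinguishable to this sketch.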
We enhance the algorithm of WISH further by incorporating visual cues into our wrapper design. The resulting wrapper is known as ViWEA. The use of visual cues leads to higher data extraction and data alignment accuracy. First, we use the visual boundaries of data records to extract them from search engine results pages. Then, we use visual cues in addition to the DOM tree to solve the problem of aligning disjunctive and iterative data items. We achieve this by measuring the relative position and size of a data item to differentiate between data items that are disjunctive and those that are iterative. Data records from different web pages can look similar to a human user while their underlying coding differs. These are irregularly structured data records, such as multiple-section data records and loosely structured data records. To distinguish and identify the different coding of such data records, we introduce an adaptive search technique to identify sections and data records, since data records are normally encapsulated by sections. Once the sections and data records are identified, the data records are partitioned according to their sections. The heuristic and filtering methods of WISH are then applied to extract multiple-section data records. This wrapper is called WEAMS. We also include in our study recent ontology technology, an approach dealing with the semantic characteristics of data records, since data records in a deep web page generally have similar meaning in their contents. An ontological approach can be applied to extract data records with varying structures and to align data items that are disjunctive and iterative. This wrapper is known as OW. Experimental tests show that our wrappers can perform better than the existing wrappers on a wide range of data records.
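As a rough illustration of aligning data items by relative position and size, the sketch below places items from several records into table columns. It is only a sketch under stated assumptions, not the ViWEA algorithm: the `Item` fields, the pixel tolerance, and the rule that items with matching horizontal offset and width share a column are all hypothetical. An optional (disjunctive) item that is missing from a record leaves an empty cell in its column.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    rel_x: int   # horizontal offset from the record's left edge (assumed, in px)
    width: int   # rendered width of the item (assumed, in px)


def align_by_position(records, tol=5):
    """Align items into columns keyed by (rel_x, width) within `tol` pixels.

    Returns one row per record; missing items appear as None.
    """
    columns = []  # representative (rel_x, width) for each column found so far
    rows = []
    for record in records:
        row = {}
        for item in record:
            for idx, (x, w) in enumerate(columns):
                fits = abs(item.rel_x - x) <= tol and abs(item.width - w) <= tol
                if fits and idx not in row:
                    row[idx] = item.text
                    break
            else:
                # No existing column fits: open a new one for this item.
                columns.append((item.rel_x, item.width))
                row[len(columns) - 1] = item.text
        rows.append(row)
    # Present columns left to right by their horizontal offset.
    order = sorted(range(len(columns)), key=lambda i: columns[i][0])
    return [[row.get(i) for i in order] for row in rows]


# Made-up records: the second one carries an optional "In stock" item.
records = [
    [Item("Title A", 0, 200), Item("$10", 210, 50)],
    [Item("Title B", 0, 200), Item("In stock", 100, 80), Item("$12", 210, 50)],
]
table = align_by_position(records)
for row in table:
    print(row)
```

In practice the rendered width of an item varies with its text, so a real wrapper would need a more forgiving comparison than the fixed tolerance assumed here.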