Web information extraction system

An information extraction technology, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the lack of versatility of existing extraction tools and their failure to analyze and make predictions from extracted data, and achieves simple interface operation that is easy to understand and saves the user time and effort.

Inactive Publication Date: 2009-11-18
DALIAN MARITIME UNIVERSITY

AI Technical Summary

Problems solved by technology

Existing information extraction tools differ considerably in degree of automation, types of web pages processed, and data storage methods, but their main problems are that they lack versatility and do not analyze or make predictions from the extracted data.

Examples

Embodiment approach

[0088] 1. Single-slot extraction rules page

[0089] Taking the web page "Sina-Weather-Dalian" as an example, this section introduces how single-slot extraction rules are defined. Assume that the information of interest on this web page is the day's weather conditions, comprising the following information points: city name, day of the week, temperature, and wind. A rule is defined as follows:

[0090] (1) First, the user enters the URL of the web page to be examined and chooses the method for defining the extraction rule (scripted web page or DOM tree). The user then selects the rule storage path in the prompt dialog and enters the rule file name (with the extension rul).

[0091] (2) If the scripted web page mode is selected, the system automatically downloads the web page corresponding to the URL and parses it to obtain the scripted web page. The system then automatically starts the browser to open the scripted web page, and the user can click on the weather inform...
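
The idea of a single-slot rule, one rule per information point, can be illustrated with a minimal Python sketch. It assumes a rule is a DOM path made of (tag, index) steps; this path format, the names RULES, SAMPLE_PAGE, PathTextCollector and extract_single_slots, and the sample markup are all hypothetical and are not the .rul format or the Sina page described in the patent.

```python
from html.parser import HTMLParser

# Hypothetical sample page standing in for the weather page.
SAMPLE_PAGE = (
    "<html><body><div>"
    "<h1>Dalian</h1>"
    "<p>Monday</p>"
    "<p>12 to 18 C</p>"
    "<p>North wind, force 3</p>"
    "</div></body></html>"
)

_DIV = (("html", 0), ("body", 0), ("div", 0))

# Hypothetical single-slot rules: slot name -> DOM path of (tag, index) steps,
# where index counts same-named siblings under the same parent.
RULES = {
    "city": _DIV + (("h1", 0),),
    "weekday": _DIV + (("p", 0),),
    "temperature": _DIV + (("p", 1),),
    "wind": _DIV + (("p", 2),),
}


class PathTextCollector(HTMLParser):
    """Collects text keyed by the DOM path of its enclosing element."""

    def __init__(self):
        super().__init__()
        self.path = []        # current path as a list of (tag, index)
        self.counters = [{}]  # per-depth count of child tags seen so far
        self.text = {}        # path tuple -> accumulated text

    def handle_starttag(self, tag, attrs):
        seen = self.counters[-1]
        index = seen.get(tag, 0)
        seen[tag] = index + 1
        self.path.append((tag, index))
        self.counters.append({})

    def handle_endtag(self, tag):
        # Assumes well-formed markup: every start tag has a matching end tag.
        if self.path and self.path[-1][0] == tag:
            self.path.pop()
            self.counters.pop()

    def handle_data(self, data):
        if data.strip() and self.path:
            key = tuple(self.path)
            self.text[key] = self.text.get(key, "") + data.strip()


def extract_single_slots(html, rules):
    """Apply each single-slot rule to the page; missing slots map to None."""
    collector = PathTextCollector()
    collector.feed(html)
    return {slot: collector.text.get(tuple(path)) for slot, path in rules.items()}


if __name__ == "__main__":
    print(extract_single_slots(SAMPLE_PAGE, RULES))
```

Each slot is located independently of the others, which is what distinguishes a single-slot rule from a multi-slot rule that extracts several related fields as one record.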

Abstract

The invention discloses a Web information extraction system comprising a retrieval and analysis module, a rule generation module, and a data extraction and storage module. The retrieval and analysis module comprises a web crawler unit and an HTML parser. The rule generation module comprises a single-slot extraction rule generation unit and a multi-slot extraction rule generation unit. The data extraction and storage module extracts data from the web pages downloaded by the retrieval and analysis module and stores the data in structured form according to the extraction rules generated by the rule generation module. The system has the following advantages: when single-slot extraction rules are generated, the interface operation is simple and easy to understand; for multi-slot extraction rules, the system provides a graphical interface that helps the user label the page, saving the user time and effort; for pre-generated extraction rules and task sequences, the system provides two ways to carry out the extraction and storage of batch tasks; and the system can complete extraction and storage tasks at a preset period and time according to parameters configured by the user.
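
The flow through the three modules (retrieve a page, apply a rule, store structured data) can be sketched in Python as follows. This is only an illustration under stated assumptions: the in-memory PAGES dictionary stands in for the web crawler unit, a regular expression with named groups stands in for a multi-slot rule, and SQLite stands in for the storage back end. None of these choices, nor the names retrieve, extract_records, store and run_task, come from the patent.

```python
import re
import sqlite3

# Hypothetical stand-in for the web crawler unit: URL -> HTML, no network access.
PAGES = {
    "http://example.com/weather": (
        "<table>"
        "<tr><td>Dalian</td><td>18</td></tr>"
        "<tr><td>Beijing</td><td>22</td></tr>"
        "</table>"
    ),
}

# Hypothetical multi-slot rule: one pattern whose named groups are the slots,
# so each match yields one complete record.
MULTI_SLOT_RULE = re.compile(
    r"<tr><td>(?P<city>[^<]+)</td><td>(?P<temperature>\d+)</td></tr>"
)


def retrieve(url):
    """Retrieval step: return the HTML for a URL (here from the in-memory dict)."""
    return PAGES[url]


def extract_records(html, rule):
    """Extraction step: apply the rule and return one dict per matched record."""
    return [match.groupdict() for match in rule.finditer(html)]


def store(records, conn):
    """Storage step: write the records to a table in structured form."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather (city TEXT, temperature INTEGER)"
    )
    conn.executemany(
        "INSERT INTO weather VALUES (:city, :temperature)", records
    )
    conn.commit()


def run_task(url, rule, conn):
    """Run one extraction task end to end and return the number of records stored."""
    records = extract_records(retrieve(url), rule)
    store(records, conn)
    return len(records)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    count = run_task("http://example.com/weather", MULTI_SLOT_RULE, connection)
    print(count, connection.execute("SELECT * FROM weather").fetchall())
```

A batch of tasks would amount to calling run_task once per (URL, rule) pair; scheduling at a preset period and time is left out of this sketch.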

Description

Technical field

[0001] The invention relates to a Web information extraction system, in particular to a semi-automatic Web information extraction system for analyzing web pages, defining and generating extraction rules, and storing and analyzing data.

Background technique

[0002] Currently, search engines have become one of the main tools for people to obtain information from the World Wide Web. However, the results of information retrieval using search engines often contain a large number of irrelevant Web pages, and users need to browse each result page to really obtain the information they need. The main way to solve this problem is to develop corresponding information extraction tools. Web Information Extraction (WIE) refers to automatically or semi-automatically extracting information of interest to users from structured or semi-structured Web pages, and storing it in a database in a structured form. Information extraction has a wide range of applications: online com...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 陈荣, 郭银蕊, 刘亚清, 陈涛, 陈娟, 孙向伟, 史玉翡
Owner: DALIAN MARITIME UNIVERSITY