Method and device for extracting information based on multistage rule base

A rule-based information extraction technology, applied in the fields of instruments, computing, and electrical digital data processing. It addresses problems such as low accuracy, a low degree of automation, and complex structure, and achieves low cost and a higher degree of automation.

Status: Inactive; Publication Date: 2014-08-06
CHONGQING UNIV

AI Technical Summary

Problems solved by technology

Disadvantage: When the page changes too much, the information can no longer be extracted.
Disadvantage: Information extraction is slow. When processing multi-topic web documents, extraction easily fails if the document is not first divided into blocks by topic.
Disadvantage: Although this method is flexible and adaptable, its degree of automation is low.
Disadvantage: A large number of web pages means a large number of structures to analyze, and many websites have relatively complex structures. Even for professionals, writing each wrapper takes a great deal of time, and much effort goes into analyzing website structure and debugging programs.
[0007] Summarizing the above four methods: methods that do not depend heavily on the structure of the HTML document are highly automated, but they cannot handle web pages with complex structures, their extraction accuracy is low, and their practicability is poor. Methods that depend heavily on the structure of the HTML document can handle web pages with complex structures, but their degree of automation is low. Information extraction methods that rely on manual participation achieve high extraction accuracy but a low degree of automation, while highly automated methods usually suffer from low accuracy and poor practicability.

Detailed Description of the Embodiments

[0046] The present invention will be further described below in conjunction with drawings and embodiments.

[0047] An information extraction method based on a multi-level rule base comprises the following steps:

[0048] 1) URL address acquisition. First, the query sequence is used to search for web pages related to the search keywords, and the URL addresses of those pages are obtained. The URL addresses obtained here cover all URL addresses related to the query sequence, so the result is a large set of addresses rather than a single address.
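As a minimal sketch of this step, the snippet below gathers every link from a batch of search result pages and returns a de-duplicated list of absolute URLs. It assumes the result pages have already been retrieved as HTML strings; the names `LinkCollector` and `collect_result_urls`, and the example addresses, are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_result_urls(result_pages, base_url):
    """Return de-duplicated absolute http(s) URLs found in the result pages."""
    seen = set()
    urls = []
    for html in result_pages:
        parser = LinkCollector()
        parser.feed(html)
        parser.close()
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(base_url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                urls.append(absolute)
    return urls


# Illustrative input only: two made-up result pages.
example_pages = [
    '<a href="/doc/1">One</a> <a href="http://example.org/a#top">A</a>',
    '<a href="/doc/1">One again</a> <a href="mailto:x@example.org">mail</a>',
]
example_urls = collect_result_urls(example_pages, "http://search.example/results")
print(example_urls)
```

Keeping the result as an ordered list without duplicates matches the description that the step yields many addresses, not one.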

[0049] 2) Web page download. Web crawler technology is used to download the page source for each of the URL addresses obtained.
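The download step could be sketched as follows. This is not the patent's crawler: the function names, the timeout, and the fallback to UTF-8 decoding are assumptions made for illustration. The fetch function is injectable so the loop can be exercised without network access.

```python
import urllib.request


def fetch_with_urllib(url, timeout=10.0):
    """Download one page and decode it (the decoding policy is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def download_pages(urls, fetch=fetch_with_urllib):
    """Return ({url: html}, {url: error text}); one bad page does not stop the run."""
    pages, failures = {}, {}
    for url in urls:
        if url in pages or url in failures:
            continue  # skip duplicate addresses
        try:
            pages[url] = fetch(url)
        except (OSError, ValueError, LookupError) as exc:
            # URLError and socket timeouts are OSError subclasses.
            failures[url] = str(exc)
    return pages, failures


def fake_fetch(url):
    """Offline stand-in for a real fetcher, used only in this example."""
    if "missing" in url:
        raise OSError("not found")
    return "<html><body>" + url + "</body></html>"


demo_pages, demo_failures = download_pages(
    ["http://a.example/", "http://a.example/missing", "http://a.example/"],
    fetch=fake_fetch,
)
print(sorted(demo_pages), demo_failures)
```

Recording failures separately, rather than raising, is a design choice suited to crawling a large set of addresses where some will be unreachable.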

[0050] 3) Web page preprocessing. The downloaded web pages are processed to obtain a standard DOM tree. This includes web page cleaning, DOM analysis, and graphical display of the web page structure.

[0051] Web page cleaning means repairing HTML pages and converting them into standard XML documents. Since HTML does not strictly ...
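The preprocessing described above can be illustrated with the standard library alone. The sketch below repairs loosely written HTML into a well-formed XML tree and prints an indented outline as a text stand-in for the graphical display. The wrapper element `document`, the class `TreeRepairer`, and the helper names are hypothetical; the repair policy (close any still-open elements when an ancestor closes or the input ends, ignore stray end tags) is a simplification and does not implement the HTML5 implied-end-tag rules.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never have content or an end tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeRepairer(HTMLParser):
    """Builds a well-formed element tree from possibly unbalanced HTML."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")  # hypothetical wrapper root
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        clean = {name: (value or "") for name, value in attrs}
        node = ET.SubElement(self.stack[-1], tag, clean)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element of this name;
        # an end tag with no matching open element is ignored.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tag == tag:
                del self.stack[index:]
                return

    def handle_data(self, data):
        if not data.strip():
            return
        parent = self.stack[-1]
        if len(parent):
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def repair_html(html):
    """Parse HTML and return the root of a well-formed element tree."""
    parser = TreeRepairer()
    parser.feed(html)
    parser.close()
    return parser.root


def outline(node, depth=0):
    """Return the tree structure as indented lines, one tag per line."""
    lines = ["  " * depth + node.tag]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


# <b> and <span> are never closed in this made-up input.
dom_root = repair_html("<div><p>Hello <b>world</p></div><br><span>tail")
xml_text = ET.tostring(dom_root, encoding="unicode")
print(xml_text)
print("\n".join(outline(dom_root)))
```

Because the output is serialized by `xml.etree.ElementTree`, it can be parsed again by any XML parser, which is the property the cleaning step is after.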

Abstract

A method for extracting information based on a multistage rule base comprises the following steps: (1) the URL addresses of web pages are obtained; (2) the web pages corresponding to the URL addresses are downloaded; (3) a tree structure diagram of each web page is obtained; (4) the web pages are clustered: some of the pages to be clustered are selected as a training set, and a clustering rule for the web pages is defined by a machine learning method; (5) the search results are extracted; (6) the information is collected and displayed. Obtaining the web page tree structure diagram in step (3) and clustering the web pages in step (4) effectively increases the recall of the retrieved information. The clustering rule is generated automatically by machine learning on a training set, so no manual clustering is needed; this effectively increases the degree of automation of the search and makes large-scale use feasible while recall is maintained. A device for extracting information based on the multistage rule base provides the hardware foundation for the information extraction process; it is low in cost and suitable for large-scale use.
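To make step (4) concrete, here is a minimal sketch of assigning pages to clusters from a small labeled training set. The abstract does not say which features or learning method are used, so everything here is an assumption: pages are described by their set of root-to-node tag paths, and a new page goes to the cluster whose training examples it most resembles on average under Jaccard similarity. This nearest-example scheme is a stand-in chosen for brevity, and all function and label names are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Records every root-to-node tag path, e.g. 'html/body/table'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the most recent matching open tag.
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def learn_rule(training_set):
    """training_set: iterable of (html, label). Returns {label: [path sets]}."""
    rule = {}
    for html, label in training_set:
        rule.setdefault(label, []).append(tag_paths(html))
    return rule


def assign_cluster(rule, html):
    """Return the label whose training pages are most similar on average."""
    features = tag_paths(html)

    def score(label):
        examples = rule[label]
        return sum(jaccard(features, e) for e in examples) / len(examples)

    return max(rule, key=score)


# Made-up training pages and labels, for illustration only.
training = [
    ("<html><body><table><tr><td>x</td></tr></table></body></html>", "listing"),
    ("<html><body><div><p>x</p></div></body></html>", "article"),
]
rule = learn_rule(training)
print(assign_cluster(
    rule, "<html><body><table><tr><td>a</td><td>b</td></tr></table></body></html>"))
```

Structural features are used here because the preceding step already produces a tree for each page; a real system might combine them with content features.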

Description

Technical field

[0001] The invention relates to the technical field of computer search engines, and in particular to an information extraction method and device.

Background

[0002] With the large-scale promotion and application of computers and networks, the world has entered the era of big information, in which information search engines have become an indispensable key technology. Current information search engines adopt the following four information search methods:

[0003] 1. Information extraction technology based on HTML structure. This technology completes information extraction according to the structural characteristics of HTML: through the tree structure of the DOM model, extracting information from a web page becomes equivalent to extracting node information from a tree. Disadvantage: When the page changes too much, the information can no longer be extracted.

[0004] 2. WEB information extraction technology based on natura...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/3335, G06F16/335, G06F16/355
Inventor: 张可柴毅马号刘建环田甜
Owner: CHONGQING UNIV