Subject area identifying method based on weight of text structure

A technology concerning subject areas and text structure, applied in the field of Web information extraction. It addresses problems such as degraded application effect, slow extraction speed, and usage restrictions, with the aim of saving time and effort, running fast, and remaining simple to implement.

Inactive Publication Date: 2012-01-04
WUHAN UNIV

AI Technical Summary

Problems solved by technology

Moreover, most existing fully automatic extraction technologies rely on artificial intelligence and machine learning methods. These methods are computationally expensive and extract slowly, which affects the actual applic...




Embodiment Construction

[0096] The technical solution of the present invention will be described in detail below in conjunction with the drawings and embodiments.

[0097] As shown in Figure 1, the embodiment first acquires web pages and then denoises them to obtain the web pages to be identified. Web page acquisition is the original data source and is responsible for supplying the Web pages to be identified. In a concrete implementation, a simple breadth-first crawler can be used for web page acquisition: it first fetches a web page from the Internet using a seed URL and parses the links it contains, storing new links in a queue; it then repeatedly takes links from the queue, stopping when the user's goal is reached or the queue is empty. Web page denoising standardizes the acquired web pages, which can improve recognition accuracy. In a specific implementation, the HTML document of the web page to be identified can be standardized acco...
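The breadth-first acquisition step described in [0097] can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `crawl_breadth_first`, the caller-supplied `fetch` callback, and the `max_pages` limit (standing in for "the user goal") are all assumptions introduced here. Only Python standard-library calls are used.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_breadth_first(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_url.

    fetch is a hypothetical caller-supplied function that maps a URL to
    its HTML text, or returns None when the page cannot be retrieved.
    max_pages is an assumed stand-in for the user's stopping goal.
    """
    queue = deque([seed_url])   # links waiting to be visited, in FIFO order
    seen = {seed_url}           # links already queued, to avoid revisits
    pages = {}                  # URL -> HTML of each acquired page

    # Stop when the goal is reached or the queue is empty.
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` as a parameter keeps the traversal logic separate from network access, so the queue behavior can be checked against an in-memory dictionary of pages before a real HTTP client is plugged in.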


Abstract

The invention provides a subject area identification method based on text structure weight. Working on a tree structure, the method comprises the following steps: parsing a web page into a tag tree and, on the basis of tree matching, improving the data region mining and semantic link block identification techniques, thereby realizing a preprocessing step that removes links; introducing the concept of text structure weight and identifying the subject area with the help of the computed text structure weights; and finally using a normalization method to calculate a relative length value for each text node in the subject area, then using the normalized relative length values to remove text nodes unrelated to the subject content, thereby denoising the subject area and obtaining accurate subject content. With this technical scheme, valuable information on a web page can be mined accurately and quickly, so the method has wide application prospects.
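The final denoising step in the abstract (normalizing the relative length of text nodes and discarding the short ones) can be illustrated with a small sketch. The excerpt does not give the patent's actual normalization formula or cutoff, so the following is an assumption-laden illustration only: each text node's length is divided by the longest text node in the area, and the `threshold` value of 0.2 is hypothetical. The names `TextNodeCollector`, `relative_lengths`, and `denoise_subject_area` are likewise introduced here.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects non-empty text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.texts.append(text)


def relative_lengths(texts):
    """Assumed normalization: each length divided by the longest length."""
    longest = max((len(t) for t in texts), default=0)
    if longest == 0:
        return []
    return [len(t) / longest for t in texts]


def denoise_subject_area(html_fragment, threshold=0.2):
    """Keep text nodes whose normalized relative length reaches threshold.

    threshold is a hypothetical cutoff, not a value taken from the patent.
    """
    collector = TextNodeCollector()
    collector.feed(html_fragment)
    collector.close()  # flush any text buffered at the end of the input
    scores = relative_lengths(collector.texts)
    return [t for t, s in zip(collector.texts, scores) if s >= threshold]
```

Under these assumptions, a short label such as "Share" sitting beside long body paragraphs scores close to 0 and is dropped, while the paragraphs themselves score near 1 and are kept.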

Description

Technical field

[0001] The invention relates to Web information extraction technology in the field of Web data mining, and in particular to a method for extracting, from text-based semi-structured Web pages, the text content that is consistent with the subject of the page.

Background art

[0002] Currently, Web information extraction can be divided into the following three approaches according to the degree of automation:

[0003] (1) Manual method: a person observes the characteristics of the Web page, marks it up by hand, and derives the pattern of the target information; a program is then written to generate a wrapper according to this pattern, and the target information is extracted through the wrapper. This method can only be used for specific sites and is not general. Such systems require users to have a solid foundation in computer programming. Because of this, the manual method is suitable for a small number of sites, but cannot be used for a large number of...


Application Information

IPC(8): G06F17/30
Inventor: 徐武平, 徐爱萍, 杨少博
Owner: WUHAN UNIV