Subject area identifying method based on weight of text structure

A subject area identification technique based on text structure weight, applied in the field of Web information extraction. It addresses the heavy computation, slow extraction speed, and usage restrictions of existing methods, which degrade their practical effectiveness. The stated benefits are savings in time and effort, fast execution, and simple implementation.

Inactive Publication Date: 2012-01-04

WUHAN UNIV

Cites: 5; Cited by: 28

Problems solved by technology

Moreover, most existing fully automatic extraction techniques rely on artificial intelligence and machine learning methods. These methods are computationally expensive and extract slowly, which limits their effectiveness in practice.

In addition, such methods often impose preconditions. For example, RoadRunner must be given two pages generated by the same template and requires the pages to contain repeated patterns, which limits its use.

Embodiment Construction

[0096] The technical solution of the present invention will be described in detail below in conjunction with the drawings and embodiments.

[0097] As shown in Figure 1, in this embodiment the web page is first acquired and then denoised to obtain the web page to be identified. Web page acquisition is the original data source and is responsible for supplying the Web pages to be identified. In a concrete implementation, a simple breadth-first crawler can be used for web page acquisition: it first fetches web pages from the Internet starting from seed URL addresses, parses the links they contain, and stores new links in a queue; it then repeatedly takes links out of the queue, stopping when the user's goal is reached or the queue is empty. Web page denoising standardizes the acquired web pages, which can improve recognition accuracy. In a specific implementation, the HTML document of the web page to be identified can be standardized acco...
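The breadth-first acquisition loop described in [0097] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the names `crawl`, `LinkParser`, `fetch`, and `max_pages` are hypothetical, the "user goal" is assumed to be a simple page limit, and the network is replaced by an in-memory dictionary so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    that returns the page HTML as a string, or None if the page is missing.
    The crawl stops when max_pages is reached or the queue is empty.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:       # only new links enter the queue
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical in-memory "site" standing in for the Internet.
site = {
    "http://example.test/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": "<p>leaf</p>",
}
pages = crawl(["http://example.test/"], site.get)
limited = crawl(["http://example.test/"], site.get, max_pages=2)
```

Using a FIFO queue is what makes the traversal breadth-first: pages linked from the seed are visited before pages two links away. A real crawler would also need politeness delays, error handling, and robots.txt support, none of which are shown here.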

Abstract

The invention provides a method for identifying the subject area of a web page based on text structure weight. Working on a tree structure, the method comprises the following steps: parsing the web page into a tag tree and, on the basis of tree matching, improving the data region mining and semantic link block identification techniques, thereby implementing a preprocessing step that removes links; introducing the concept of text structure weight and identifying the subject area with the help of the computed text structure weights; and finally using a normalization method to calculate a relative length value for each text node in the subject area, and using the normalized relative length values to remove text nodes unrelated to the subject content, thereby denoising the subject area and obtaining accurate subject content. With this technical scheme, valuable information on a web page can be mined accurately and quickly, so the method has broad application prospects.
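Two of the steps named in the abstract, parsing a page into a tag tree and filtering text nodes by normalized relative length, can be illustrated with a short sketch. The formulas are not given in this excerpt, so the sketch makes loud assumptions: relative length is taken to be a text node's character count divided by the longest text node in the region, and the threshold of 0.2 is arbitrary. The names `Node`, `TagTreeBuilder`, `relative_lengths`, and `denoise` are hypothetical. The text structure weight itself is not computed here, because its definition does not appear in this excerpt.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag          # element name, or "#text" for text leaves
        self.parent = parent
        self.children = []
        self.text = ""          # only filled in for "#text" nodes


class TagTreeBuilder(HTMLParser):
    """Builds a simple tag tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            leaf = Node("#text", self.current)
            leaf.text = text
            self.current.children.append(leaf)


def text_nodes(node):
    """Yields every text leaf under node in document order."""
    if node.tag == "#text":
        yield node
    for child in node.children:
        yield from text_nodes(child)


def relative_lengths(region):
    # ASSUMPTION: relative length = node length / longest node in the region.
    leaves = list(text_nodes(region))
    longest = max((len(n.text) for n in leaves), default=0)
    return [(n, len(n.text) / longest if longest else 0.0) for n in leaves]


def denoise(region, threshold=0.2):
    # ASSUMPTION: nodes below the (arbitrary) threshold are treated as noise.
    return [n.text for n, r in relative_lengths(region) if r >= threshold]


html = (
    "<div><p>This paragraph carries the main subject content of the page.</p>"
    "<span>Share</span>"
    "<p>A second body paragraph with enough text to be kept.</p></div>"
)
builder = TagTreeBuilder()
builder.feed(html)
kept = denoise(builder.root)
```

In this toy input the short "Share" label falls below the threshold and is dropped, while both body paragraphs survive. How the patent actually normalizes lengths and chooses a cutoff should be taken from its full description, not from this sketch.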

Description

Technical field

[0001] The invention relates to Web information extraction technology in the field of Web data mining, and in particular to a method for extracting text content consistent with the subject of the Web page from text-based semi-structured Web pages.

Background art

[0002] Currently, Web information extraction can be divided into the following three approaches according to the degree of automation:

[0003] (1) Manual method: a person inspects the characteristics of the Web page, marks it by hand, and derives the pattern of the target information; a program is then written to generate a wrapper (Wrapper) from this pattern, and the target information is extracted through the Wrapper. This method works only for specific sites and is not general. Such systems also require users to have a solid foundation in computer programming. As a result, the manual method is suitable for a small number of sites, but cannot be used for a large number of...
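To make the limitation of the manual method in [0003] concrete, here is a minimal sketch of a hand-written wrapper. The patterns, the function name `wrapper`, and both sample pages are hypothetical, invented only for illustration; real wrappers are typically more elaborate.

```python
import re

# Hypothetical patterns, written by hand after inspecting one site's template.
TITLE_PATTERN = re.compile(r'<h1 class="title">(.*?)</h1>', re.S)
BODY_PATTERN = re.compile(r'<div id="content">(.*?)</div>', re.S)


def wrapper(html):
    """Extracts title and body using the site-specific patterns above."""
    title = TITLE_PATTERN.search(html)
    body = BODY_PATTERN.search(html)
    return {
        "title": title.group(1).strip() if title else None,
        "body": body.group(1).strip() if body else None,
    }


# A page from the site the wrapper was written for, and one from another site.
same_site = '<h1 class="title">News</h1><div id="content">Story text.</div>'
other_site = '<h2 id="headline">News</h2><article>Story text.</article>'

hit = wrapper(same_site)
miss = wrapper(other_site)
```

The wrapper extracts both fields from the page it was designed for and nothing from the page with a different template, which is why each new site needs its own manually written pattern.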

Application Information

IPC(8): G06F17/30

Inventors: 徐武平, 徐爱萍, 杨少博

Owner: WUHAN UNIV
