Web text information extraction method

An information extraction and text technology, applied in the fields of website content management, instrumentation, and other database retrieval, that achieves fast extraction speed, good performance, and low memory usage.

Status: Inactive
Publication date: 2015-11-25
SHANDONG UNIV
Cites: 3; Cited by: 15

AI Technical Summary

Problems solved by technology

[0004] To address the deficiencies of existing information extraction algorithms, the present invention provides a multi-feature webpage text extraction method based on text density distribution. The method targets the text distribution density of the web page. It first analyzes and integrates the text in combination with the page label (tag) distribution. It then takes the unit window as the basic unit and performs a preliminary feature analysis of the data in each window. Finally, it uses noise variance and text similarity as secondary features to further eliminate noise and purify the text, thereby achieving accurate text extraction.
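The primary, window-based feature described above can be illustrated with a minimal sketch. It is not the patented algorithm: the excerpt does not define the window size, the density formula, or the secondary features, so the regular expression `TAG_RE`, the default `window` size, and the choice of "average visible characters per line" as the density are all assumptions made for illustration.

```python
import re

# Assumption: a crude tag stripper is enough for a sketch; a real
# implementation would work on parsed HTML rather than a regex.
TAG_RE = re.compile(r"<[^>]+>")


def line_text_lengths(html_lines):
    """Return the visible-text length of each line once tags are removed."""
    return [len(TAG_RE.sub("", line).strip()) for line in html_lines]


def window_densities(lengths, window=3):
    """Average text length inside each sliding window of `window` lines.

    The window size and the averaging rule are hypothetical choices;
    the source only says a unit window is moved over the data.
    """
    if not lengths:
        return []
    if len(lengths) < window:
        return [sum(lengths) / len(lengths)]
    return [
        sum(lengths[i:i + window]) / window
        for i in range(len(lengths) - window + 1)
    ]


sample = [
    "<a href='x'>Home</a>",
    "<p>aaaaaaaaaa</p>",
    "<p>bbbbbbbbbbbbbbbbbbbb</p>",
    "<div></div>",
]
lengths = line_text_lengths(sample)
densities = window_densities(lengths, window=2)
```

Windows that cover long runs of body text score higher than windows dominated by links or empty containers, which is the intuition behind using density as the first filter.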

Examples

Detailed Description of the Embodiments

[0032] To further illustrate the technical means and effects of the present invention, the multi-feature webpage text extraction algorithm based on text distribution density is described in detail below with reference to the accompanying drawings and specific embodiments.

[0033] Figure 1 shows the overall framework of the method. As shown in the dotted box in Figure 1, the web information extraction algorithm comprises three parts: an HTML source code pre-organization module 100, a text density distribution algorithm application module 101, and a follow-up text integration and output module 102. Module 100 includes an HTML source code capturing unit S11 and an HTML source code parsing unit S12; module 101 covers the specific algorithm embodiment of the present invention; module 102 includes a standard sele...
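The two units of module 100 could look roughly like the following sketch, which uses only the Python standard library. The names `capture_html`, `TextBlockParser`, and `parse_html` are hypothetical, as is the decision to skip `script` and `style` content; the excerpt does not say how units S11 and S12 are actually implemented.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def capture_html(url, timeout=10):
    """Capturing unit (cf. S11): download a page and decode it to text."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class TextBlockParser(HTMLParser):
    """Parsing unit (cf. S12): collect visible text chunks in page order."""

    SKIP = {"script", "style"}  # assumption: these never hold body text

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.blocks.append(text)


def parse_html(source):
    """Parse an HTML string and return its list of visible text chunks."""
    parser = TextBlockParser()
    parser.feed(source)
    parser.close()
    return parser.blocks


page = (
    "<html><head><style>p{}</style><script>var a=1;</script></head>"
    "<body><p>Hello</p><a href='#'>Link</a></body></html>"
)
blocks = parse_html(page)
```

Keeping capture and parsing as separate functions mirrors the split between the two units and lets the parser be exercised on stored HTML without network access.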

Abstract

The invention provides a web information extraction method that addresses the high complexity and low accuracy of existing information extraction methods. The method comprises the following steps: an HTML source code is obtained and loaded into a first memory zone; an HTML parser is built, the HTML source code is parsed in combination with the page label (tag) distribution, and the parsed data are stored in a second memory zone; unit windows are defined and moved across the data to collect primary feature statistics for the data in each window, then noise variance and text similarity are applied in turn as secondary features to further remove noise, yielding a quantitative density value for every window; the relation between the density threshold and the text density of the windows is derived from sample data, and an extraction scheme is formulated accordingly; finally, a text integration module outputs the text in a normalized format. In summary, the HTML source code goes through a loading, parsing, quantizing, selecting, and outputting pipeline. Because selection relies on relative text density values, the method adapts automatically to different web pages, can process a large number of pages, and extracts webpage information automatically.
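The selecting step can be sketched as a threshold relative to the densest window on the page, which is one plausible reading of "text density relative value". The `ratio` parameter, its default of 0.5, and the helper names `select_windows` and `merge_runs` are assumptions; the abstract states that the threshold relation is derived from sample data but does not give it.

```python
def select_windows(densities, ratio=0.5):
    """Return indices of windows whose density reaches ratio * max density.

    `ratio` is a hypothetical tuning parameter standing in for the
    threshold relation that the source derives from sample data.
    """
    if not densities:
        return []
    peak = max(densities)
    if peak <= 0:
        return []  # no window holds any text
    threshold = ratio * peak
    return [i for i, d in enumerate(densities) if d >= threshold]


def merge_runs(indices):
    """Merge consecutive window indices into inclusive (start, end) runs."""
    runs = []
    for i in indices:
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


selected = select_windows([1, 8, 10, 9, 2, 7], ratio=0.5)
runs = merge_runs(selected)
```

Scaling the threshold by the page's own peak density means a sparse page and a dense page are judged against their own content rather than a fixed absolute cutoff.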

Description

Technical field

[0001] The invention relates to information processing technology, in particular to a technique for extracting text information from HTML source code, and belongs to the field of Internet information processing.

Background

[0002] With the popularization of Internet technology, the data on the web has grown rapidly, and the information carried on web pages has become an important source of information for people. However, web pages embed a large amount of so-called "noise", such as links, advertisements, and plug-ins, which sharply increases the complexity of information processing tasks such as information retrieval, data mining, machine translation, and text summarization. In this context, there is an urgent need for a fast and effective web text extraction method that removes the noise surrounding the main text of a web page and correctly extracts its text information.

[0003] Among the currently commonly used information extraction algo...

Application Information

Patent type and authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/986, G06F16/951
Inventors: 刘琚, 彭寿钧, 郑丽娜
Owner: SHANDONG UNIV