Webpage content extraction method based on Markov random field

A webpage text extraction technology based on a Markov random field, applied in fields such as special data processing applications, instruments, and electrical digital data processing, which addresses problems such as limited extraction accuracy.

Active Publication Date: 2013-09-18
BEIJING ZHIHAI CHUANGXUN INFORMATION TECH

AI Technical Summary

Problems solved by technology

This method is simple to implement and does not require writing a wrapper, but the accuracy of the extraction is li...

Embodiment Construction

[0090] The embodiments of the present invention will be described below through specific examples and in conjunction with the accompanying drawings, and those skilled in the art can easily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention can also be implemented or applied through other different specific examples, and various details in this specification can also be modified and changed based on different viewpoints and applications without departing from the spirit of the present invention.

[0091] Before introducing the present invention, the concepts and basic ideas involved are first explained. "Markov" is shorthand for the Markov property: when a sequence of random variables is arranged in time order, the distribution at time N+1 is independent of the values of the random variables before t...
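
As a brief illustration of the Markov property mentioned above, the standard textbook statement can be written as follows. The symbols X_n, x_n, and the neighborhood set \mathcal{N}(i) are notation assumed here for illustration and are not taken from the patent text; the second line is the usual local form of the same idea for a Markov random field, where "time" is replaced by a neighborhood relation.

```latex
% Markov property for a sequence X_1, X_2, ...:
% the next value depends on the past only through the current value.
P(X_{N+1} = x_{N+1} \mid X_N = x_N, \dots, X_1 = x_1)
  = P(X_{N+1} = x_{N+1} \mid X_N = x_N)

% Local Markov property of a random field over sites i:
% a site depends on all other sites only through its neighbors.
P(x_i \mid x_j,\; j \neq i) = P(x_i \mid x_j,\; j \in \mathcal{N}(i))
```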


Abstract

The invention discloses a webpage content extraction method and device based on a Markov random field. The method comprises the following steps: parsing and preprocessing the HTML (Hypertext Markup Language) text; extracting tag text windows from the preprocessed HTML text to obtain a tag text window set, wherein each tag text window comprises a content text enclosed by a tag together with the related attributes of that content text; building a Markov random field model from the tag text windows according to their adjacency; taking text length and tag type as basic features, and initializing the Markov random field model by a minimum deviation threshold method; optimizing the Markov random field model with the ICM method according to the line numbers of the tag text windows and the character distances between adjacent tag text windows; and reconstructing the content from the optimized Markov random field model to obtain the extracted content. The method can be applied to automatic summarization and classification systems in the field of information retrieval, and has the advantages of high extraction precision, high extraction speed, low maintenance cost, good adaptability, and high flexibility.
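
The steps in the abstract can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the excerpt does not define the minimum deviation threshold method or the energy function, so a fixed length threshold (`LEN_THRESHOLD`), a coupling weight (`BETA`), and the helper names `Window`, `WindowParser`, `score`, `coupling`, `icm`, and `extract_content` are all hypothetical. Only the overall shape follows the abstract: windows (tag or "label" text windows) are extracted, chained by adjacency, initialized from text length and tag type, then smoothed by ICM (iterated conditional modes) using line numbers and character gaps.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}   # preprocessing: ignore non-visible text
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
LINK_TAGS = {"a"}                 # assumption: anchor text is rarely body content
LEN_THRESHOLD = 40                # assumption: stand-in for the patent's threshold method
BETA = 0.8                        # assumption: strength of the neighbor coupling

@dataclass
class Window:
    tag: str     # innermost enclosing tag
    text: str    # whitespace-normalized content text
    line: int    # line number where the text starts
    start: int   # character offset where the text starts
    end: int     # character offset where the text ends

class WindowParser(HTMLParser):
    """Collects one Window per non-empty text node, in document order."""
    def __init__(self, source):
        super().__init__()
        self.stack, self.windows = [], []
        self.line_starts = [0]    # offset of each line start, for getpos()
        for line in source.splitlines(keepends=True):
            self.line_starts.append(self.line_starts[-1] + len(line))

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or any(t in SKIP_TAGS for t in self.stack):
            return
        line, col = self.getpos()
        start = self.line_starts[line - 1] + col
        tag = self.stack[-1] if self.stack else ""
        self.windows.append(Window(tag, text, line, start, start + len(data)))

def score(w):
    """Evidence in [0, 1] that a window is content, from text length and tag type."""
    return 0.0 if w.tag in LINK_TAGS else min(1.0, len(w.text) / LEN_THRESHOLD)

def coupling(a, b):
    """Penalty for giving adjacent windows different labels; weaker when far apart."""
    return BETA / (1.0 + (b.line - a.line) + (b.start - a.end) / 100.0)

def icm(windows, labels, max_sweeps=10):
    """Iterated conditional modes: greedily minimize each node's local energy."""
    labels = list(labels)
    for _ in range(max_sweeps):
        changed = False
        for i, w in enumerate(windows):
            def energy(label):
                e = abs(label - score(w))
                if i > 0 and labels[i - 1] != label:
                    e += coupling(windows[i - 1], w)
                if i + 1 < len(windows) and labels[i + 1] != label:
                    e += coupling(w, windows[i + 1])
                return e
            best = min((0, 1), key=energy)
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:
            break
    return labels

def extract_content(source):
    parser = WindowParser(source)
    parser.feed(source)
    parser.close()
    windows = parser.windows
    initial = [1 if score(w) >= 1.0 else 0 for w in windows]
    final = icm(windows, initial)
    return "\n".join(w.text for w, keep in zip(windows, final) if keep == 1)
```

Calling `extract_content(html_source)` on a page string returns the text of the windows labeled as content, one per line. The point of the ICM pass in this sketch is that a short paragraph sitting between two long ones can be pulled into the content by its neighbors, while navigation links, which score zero on tag type, stay excluded.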

Description

Technical field

[0001] The present invention relates to a method for extracting webpage text, and in particular to a method for extracting webpage text based on a Markov random field.

Background technique

[0002] The rapid development of the network has brought massive amounts of network information, and how to extract the required information has attracted more and more attention. At present, the data provided on webpages basically consists of unstructured static Hypertext Markup Language (HTML) code, which cannot be used directly by an information analysis system, so information extraction is often required before subsequent processing. Network information extraction refers to the extraction of structured information from semi-structured documents such as webpages. These pages are often generated automatically by server-side applications. The structured information generated by network information extraction provides the most basic analysis data for imp...


Application Information

IPC(8): G06F17/30
Inventor 柳立宁
Owner BEIJING ZHIHAI CHUANGXUN INFORMATION TECH