Recognition method for Web page link blocks based on block tree

A link block identification method, applied in special data processing applications, instruments, electrical digital data processing, etc., addresses problems such as inaccurate judgment of link blocks, interference with link block judgment, and ignoring the number of links. It achieves easy and flexible control of the number and scale of link blocks, fast recognition speed, and guaranteed fine granularity.

Active Publication Date: 2014-07-16
湖北云服科技有限公司

AI Technical Summary

Problems solved by technology

[0011] The second is that some characteristics of non-link text are ignored, such as dates, numbers, unlinked source-attribution text, and some special symbols.
In many link blocks, a large amount of non-link content such as dates appears before or after the links, which greatly interferes with the accurate identification of link blocks.
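
To illustrate the interference described in [0011], the following is a minimal sketch, not the patent's own method. It computes the share of a block's characters that belong to link text, once counting everything and once after discarding dates, numbers, and punctuation from the non-link text. The function name `link_text_ratio`, the `_NOISE` pattern, and the sample data are all assumptions made for illustration.

```python
import re

# Hypothetical noise pattern: dates such as 2014-07-16 or 2014/7/16,
# bare numbers, and punctuation or other special symbols.
_NOISE = re.compile(r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}|\d+|[^\w\s]")


def link_text_ratio(link_texts, other_texts, ignore_noise=False):
    """Return the share of a block's characters that are link (anchor) text."""
    link_len = sum(len(t.strip()) for t in link_texts)
    other = " ".join(other_texts)
    if ignore_noise:
        # Drop dates, numbers and symbols before measuring non-link text.
        other = _NOISE.sub("", other)
    other_len = len("".join(other.split()))  # whitespace is not counted
    total = link_len + other_len
    return link_len / total if total else 0.0


# A news list: two short headlines, each followed by a date.
links = ["News A", "News B"]
others = ["2014-07-16", "2014-07-15"]

raw_ratio = link_text_ratio(links, others)
clean_ratio = link_text_ratio(links, others, ignore_noise=True)
```

With the dates counted, the link share of this sample block falls well below one half even though every real item in it is a link; with the dates discarded, the block consists entirely of link text.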
[0012] The third is that the discrimination method is extremely sensitive to the length of the link text when the overall text is not long enough. A given threshold may work well on pages that use short link text but is very likely to cause misjudgment on pages with long link text.
Link text lengths commonly vary greatly between websites and between web pages, which brings great uncertainty to the accurate judgment of link blocks. If the link text becomes shorter, a link block is likely to be mistaken for a non-link block.
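
The sensitivity described in [0012] can be shown with a small sketch. It assumes a naive rule that calls a block a link block when link text exceeds a fixed share of all characters; the rule, the name `is_link_block`, the threshold of 0.5, and the sample strings are illustrative assumptions, not values taken from the patent.

```python
def is_link_block(link_texts, other_chars, threshold=0.5):
    """Naive rule: a block is a link block if anchor text makes up
    more than `threshold` of all its characters."""
    link_chars = sum(len(t) for t in link_texts)
    total = link_chars + other_chars
    return total > 0 and link_chars / total > threshold


# Two blocks with the same structure: three links plus 30 characters
# of non-link text. Only the length of the link text differs.
short_links = ["More", "Next", "Home"]
long_links = ["Full text of the annual report"] * 3

short_result = is_link_block(short_links, other_chars=30)
long_result = is_link_block(long_links, other_chars=30)
```

Both blocks contain three links and the same amount of surrounding text, yet the fixed threshold accepts the block with long link text and rejects the one with short link text.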
[0013] The fourth is that errors caused by the block segmentation problem affect the accurate identification of link blocks. When the body text and a link block are not separated by block-level elements but sit in the same block-level element node, it is especially easy to misidentify the body text as a link block, or to misidentify the link block as non-noise links within the body text.
[0014] The fifth is that the number of links is ignored.
A vector containing these four feature values is then obtained through training, and block type judgment is performed on that basis; however, in the face of complex networks, the four features designed here are not universal.



Examples


Embodiment Construction

[0100] The present invention will be further described in detail below in conjunction with the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.

[0101] Referring to Figure 1, a flowchart of the present invention, a method for identifying web page link blocks based on a block tree comprises the following steps:

[0102] Step 1: Input a collection of web pages. Step 1 includes the following sub-steps:

[0103] Step 1.1, encoding identification: first obtain the encoding format of the web page, such as UTF-8 or GB2312;
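
The source does not say how the encoding is obtained. One common approach, sketched here purely as an assumption, is to look for a UTF-8 byte order mark and then for a `charset` declaration in a `<meta>` tag near the top of the raw bytes. The function name `sniff_encoding`, the 4096-byte window, and the fallback default are hypothetical choices.

```python
import codecs
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def sniff_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of a raw HTML document (illustrative only)."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    match = _META_CHARSET.search(raw[:4096])  # only inspect the head of the file
    if match:
        name = match.group(1).decode("ascii")
        try:
            # Normalise to Python's canonical codec name, e.g. "GB2312" -> "gb2312".
            return codecs.lookup(name).name
        except LookupError:
            pass  # unknown label: fall through to the default
    return default
```

The returned name can be passed to `bytes.decode` before the character scan of step 1.2. Real pages may also declare the encoding in HTTP headers, which this sketch does not consult.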

[0104] Step 1.2, web page reading: scan the HTML document of the web page to be identified character by character, and identify the starting position and the ending position respectively;
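
A character scan of this kind can be sketched as follows. The delimiter characters in the definitions that follow were lost in this copy of the text, so the sketch assumes that the positions being identified are those of HTML tags delimited by `<` and `>`; that assumption, and the function name `scan_tags`, are not confirmed by the source.

```python
def scan_tags(html: str):
    """Return (start, end) index pairs for every '<...>' span found in a
    single left-to-right character scan. Assumes '<' opens and '>' closes."""
    spans = []
    start = None
    for i, ch in enumerate(html):
        if ch == "<":
            # A later '<' restarts the candidate, so no span contains another '<'.
            start = i
        elif ch == ">" and start is not None:
            spans.append((start, i))
            start = None
    return spans


tag_spans = scan_tags('<a href="x">hi</a>')
```

Each pair marks where one tag begins and ends. The sketch ignores comments, scripts, and `>` characters inside attribute values, all of which a production scanner would need to handle.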

[0105] Define the following concepts:

[0106] Text

[0107] The starting position starts with the character "", and there is no string of characters "" between the two;

[0108] The end position starts with the character "", and t...


Abstract

The invention provides a method for recognizing Web page link blocks based on a block tree. On this basis, it provides indexes for discriminating and evaluating link blocks, together with two basic block traversal and discrimination algorithms: a forward link block discrimination method and a backward link block discrimination method. The forward method recognizes quickly and yields coarse-grained link blocks; it can be used for the analysis of index-type Web pages and for body text extraction. The backward method makes it easy to flexibly control the granularity, number, and scale of link blocks, ensures that link blocks are fine-grained and complete, and ultimately lets the link blocks cover the links of a Web page comprehensively. The method can be used where fine-grained link blocks are required, and also for page denoising, body text extraction, automatic template generation through body text abstraction, and similar tasks. The block tree serves as a basis for Web page analysis and processing and, combined with the two traversal and discrimination methods, can be widely applied to Web data preprocessing, data mining, and other fields.
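
The abstract contrasts a coarse-grained forward method with a fine-grained backward method but does not give their rules in this excerpt. The sketch below is only one plausible reading, built on assumptions: a block tree whose nodes carry link and non-link character counts, a top-down pass that accepts the largest qualifying block on each path, and a bottom-up pass that prefers the smallest qualifying blocks. The `Block` class, the ratio test, and the threshold of 0.6 are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    name: str
    link_chars: int = 0    # anchor-text characters directly inside this block
    other_chars: int = 0   # non-link characters directly inside this block
    children: List["Block"] = field(default_factory=list)

    def totals(self):
        """Character counts for this block including all descendants."""
        link, other = self.link_chars, self.other_chars
        for child in self.children:
            child_link, child_other = child.totals()
            link, other = link + child_link, other + child_other
        return link, other


def looks_like_link_block(block, threshold=0.6):
    link, other = block.totals()
    return link > 0 and link / (link + other) >= threshold


def forward(block, threshold=0.6):
    """Top-down: stop at the first (largest) qualifying block on each path."""
    if looks_like_link_block(block, threshold):
        return [block.name]
    found = []
    for child in block.children:
        found += forward(child, threshold)
    return found


def backward(block, threshold=0.6):
    """Bottom-up: report the smallest qualifying blocks; a parent is
    reported only when none of its descendants qualifies."""
    found = []
    for child in block.children:
        found += backward(child, threshold)
    if not found and looks_like_link_block(block, threshold):
        found = [block.name]
    return found


page = Block("page", children=[
    Block("nav", children=[
        Block("menu1", link_chars=40),
        Block("menu2", link_chars=30, other_chars=5),
    ]),
    Block("body", other_chars=500, children=[
        Block("related", link_chars=50, other_chars=10),
    ]),
])

coarse = forward(page)
fine = backward(page)
```

On this sample tree the forward pass reports the navigation area as one block, while the backward pass splits it into its two menus, which matches the granularity difference the abstract describes. The forward pass can stop early on each path, which is consistent with its being the faster of the two.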

Description

Technical field

[0001] The present invention relates to the fields of web page importance calculation, web page denoising, topic-related link block extraction, web page body text identification, web page text extraction, refinement of search engine processing unit granularity, and massive web data preprocessing, and in particular to a method for identifying link blocks of a Web page based on a block tree.

Background

[0002] The World Wide Web is a huge network built on links, and links are the soul of the World Wide Web. Through the links between web pages, the pages of the World Wide Web form the most complex network in the world. Web crawlers also rely on the links between web pages to crawl network data, and the importance of a web page is often obtained through link analysis. The number of links in a web page is often between tens and thousands; in index (catalogue) type web pages in particular, links account for nearly 100% of the content. Although there are many l...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9558
Inventor: 谷琼, 王贤明, 朱莉
Owner: 湖北云服科技有限公司