
A method for identifying link blocks of web pages based on block tree

A link block identification method, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses problems such as inaccurate judgment of link blocks, interference with that judgment, and ignoring the number of links, and achieves easy and flexible control over the number of link blocks, fast recognition speed, and guaranteed fine granularity.

Active Publication Date: 2017-02-22
湖北云服科技有限公司
4 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0011] The second is that some characteristics of non-link text are ignored, such as dates, numbers, unlinked source-attribution text, and some special symbols.
In many link blocks, a large amount of non-link content such as dates appears before or after each link, which greatly interferes with the accurate identification of link blocks.
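
To illustrate the interference described in [0011], here is a minimal sketch in Python. The function name `link_text_ratio`, the `DECORATION` pattern, and the idea of discounting dates and numbers are assumptions made for illustration; the source does not say how such text should be handled.

```python
import re

# Hypothetical pattern for "decorative" non-link text: ISO-style dates,
# bare numbers, and a few separator symbols. Not taken from the patent.
DECORATION = re.compile(r"\d{4}-\d{1,2}-\d{1,2}|\d+|[\[\]()|·•:/\-]")


def link_text_ratio(link_texts, other_texts, ignore_decoration=False):
    """Return the share of a block's characters that sit inside links."""
    link_len = sum(len(t.strip()) for t in link_texts)
    other = " ".join(other_texts)
    if ignore_decoration:
        # Drop dates, numbers, and separators before measuring non-link text.
        other = DECORATION.sub("", other)
    other_len = len("".join(other.split()))  # whitespace does not count
    total = link_len + other_len
    return link_len / total if total else 0.0


links = ["Top story", "Weather", "Sports"]
dates = ["2017-02-22", "2017-02-21", "2017-02-20"]

raw = link_text_ratio(links, dates)
cleaned = link_text_ratio(links, dates, ignore_decoration=True)
```

In this example the block consists only of links and their dates, yet the raw ratio falls below one half because the dates are counted as ordinary text; discounting them restores a ratio of 1.0.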
[0012] The third is that the discriminant method is extremely sensitive to the length of the link text when the overall text is not long enough: a given threshold may work well on pages that use short link text but is very likely to misjudge pages with long link text.
Link text lengths commonly vary greatly across websites and across web pages, which makes the accurate judgment of link blocks highly uncertain. If the link text becomes shorter, a link block is likely to be mistaken for a non-link block.
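
The sensitivity described in [0012] can be reproduced with a small Python sketch. The function `is_link_block_by_ratio` and the threshold of 0.5 are hypothetical stand-ins for a length-based discriminant of the kind being criticized, not a method defined in the source.

```python
def is_link_block_by_ratio(link_texts, non_link_text, threshold=0.5):
    """Classify a block using only the share of characters inside links."""
    link_len = sum(len(t) for t in link_texts)
    total = link_len + len(non_link_text)
    return total > 0 and link_len / total >= threshold


caption = "Updated daily by our editors"  # same non-link text in both blocks

# Both blocks contain exactly three links; only the link wording differs.
short_links = ["News", "Blog", "Jobs"]
long_links = ["Latest national news", "Engineering team blog", "Open job positions"]

short_result = is_link_block_by_ratio(short_links, caption)
long_result = is_link_block_by_ratio(long_links, caption)
```

Both blocks have the same structure and the same number of links, but the fixed threshold accepts the long-text block and rejects the short-text one, which is the misjudgment the paragraph warns about.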
[0013] The fourth is that errors introduced by block segmentation affect the accurate identification of link blocks. When the body text and a link block are not separated by block-level elements but sit in the same block-level element node, it is especially easy to misidentify the body text as part of a link block, or to misidentify the link block as non-noise links within the body text.
[0014] The fifth is that the number of links is ignored.
A vector containing these four feature values is then obtained through training and used to judge the block type, but in the face of complex networks, the four features designed here are not universal.



Examples


Embodiment Construction

[0100] The present invention will be further described in detail below in conjunction with the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.

[0101] Referring to Figure 1, a flowchart of the present invention, a method for identifying web page link blocks based on a block tree comprises the following steps:

[0102] Step 1: Input a collection of web pages. Step 1 includes the following sub-steps:

[0103] Step 1.1, encoding identification: first determine the web page's encoding format (UTF-8, GB2312, etc.);
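
The source does not say how the encoding is determined in Step 1.1. One common approach, sketched below in Python under that assumption, is to check for a UTF-8 byte order mark and then look for a `charset` declaration in a `<meta>` tag near the top of the document. The function name `detect_encoding`, the 4096-byte search window, and the UTF-8 fallback are illustrative choices.

```python
import codecs
import re

# Matches both <meta charset="..."> and the older http-equiv form,
# where charset=... appears inside the content attribute.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE
)


def detect_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Guess a page's encoding from a BOM or a <meta> charset declaration."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    match = META_CHARSET.search(raw[:4096])  # declarations sit near the top
    if match:
        name = match.group(1).decode("ascii")
        try:
            return codecs.lookup(name).name  # normalize to Python's codec name
        except LookupError:
            pass  # unknown label: fall through to the default
    return default


page = '<html><head><meta charset="gb2312"></head><body>链接</body></html>'.encode("gb2312")
encoding = detect_encoding(page)
text = page.decode(encoding)
```

A production crawler would usually also consult the HTTP `Content-Type` header, which this sketch leaves out.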

[0104] Step 1.2, web page reading: scan the HTML document of the web page to be identified character by character, and identify the starting position and the ending position respectively;
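
The character scan of Step 1.2 might look like the following Python sketch. The delimiter characters are missing from the definitions in this text, so the sketch assumes that the starting and ending positions are those of HTML tags, delimited by `<` and `>`; the function name `scan_tags` is hypothetical.

```python
def scan_tags(html: str):
    """Return (start, end) index pairs for each tag found in one left-to-right scan.

    Assumption: a tag starts at "<" and ends at the next ">" with no other
    "<" in between. Comments, scripts, and ">" inside attribute values are
    not handled in this simplified sketch.
    """
    spans = []
    start = None
    for i, ch in enumerate(html):
        if ch == "<":
            start = i  # a later "<" before any ">" restarts the candidate
        elif ch == ">" and start is not None:
            spans.append((start, i))
            start = None
    return spans


html = '<div><a href="/a">A</a></div>'
spans = scan_tags(html)
tags = [html[s:e + 1] for s, e in spans]
```

Here `tags` holds the four tags of the sample fragment in document order, and each pair in `spans` gives the inclusive start and end index of one tag.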

[0105] Define the following concepts:

[0106] Word

[0107] The starting position starts with the character "", and there is no string of characters "" between the two;

[0108] The end position starts with the character "", and...


Abstract

The invention proposes a block tree-based method for identifying link blocks in Web pages. On this basis it proposes indicators for discriminating and evaluating link blocks and, combined with the block tree, two basic block traversal and discrimination algorithms: forward link block discrimination and reverse link block discrimination. The forward method has fast recognition speed and coarse link block granularity, and can be used for analyzing index-type Web pages and for text extraction. The reverse method makes it easy to flexibly control the granularity and number of link blocks and ensures that link blocks are fine-grained and complete, so that the link blocks ultimately cover all links on the page; it can be used not only where fine-grained link blocks are required, but also in page denoising, text extraction, automatic generation of page extraction templates, and similar settings. With the proposed block tree as the basis for Web page analysis and processing, combined with the two proposed traversal and discrimination methods, the approach can be widely used in Web data preprocessing and data mining.
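
The contrast between the two traversals can be sketched in Python. This is an illustration under stated assumptions, not the patented algorithm: the `Block` class, the criterion in `looks_like_link_block` (a link-character ratio plus a minimum link count), and the reading of "forward" as top-down and "reverse" as bottom-up are all hypothetical, since this excerpt does not give the actual indicators or traversal rules.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Hypothetical block tree node; counts cover this node's own content only."""
    name: str
    link_chars: int = 0   # characters inside links
    text_chars: int = 0   # characters outside links
    link_count: int = 0
    children: List["Block"] = field(default_factory=list)

    def totals(self):
        """Aggregate the counts over this node and all of its descendants."""
        lc, tc, n = self.link_chars, self.text_chars, self.link_count
        for child in self.children:
            clc, ctc, cn = child.totals()
            lc, tc, n = lc + clc, tc + ctc, n + cn
        return lc, tc, n


def looks_like_link_block(block, min_ratio=0.6, min_links=2):
    # Assumed criterion: enough links, and most characters sit inside links.
    lc, tc, n = block.totals()
    total = lc + tc
    return n >= min_links and total > 0 and lc / total >= min_ratio


def forward(block):
    """Top-down: stop at the first qualifying block, giving coarse blocks."""
    if looks_like_link_block(block):
        return [block.name]
    found = []
    for child in block.children:
        found.extend(forward(child))
    return found


def reverse(block):
    """Bottom-up: report the deepest qualifying blocks, giving fine blocks."""
    found = []
    for child in block.children:
        found.extend(reverse(child))
    if not found and looks_like_link_block(block):
        found.append(block.name)
    return found


page = Block("page", children=[
    Block("nav", children=[
        Block("menu_a", link_chars=30, link_count=5),
        Block("menu_b", link_chars=20, text_chars=2, link_count=4),
    ]),
    Block("article", link_chars=10, text_chars=900, link_count=1),
])

coarse = forward(page)
fine = reverse(page)
```

On this sample tree the top-down pass reports the single enclosing `nav` block, while the bottom-up pass reports `menu_a` and `menu_b` separately, which mirrors the coarse versus fine granularity the abstract attributes to the two methods.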

Description

Technical field

[0001] The present invention relates to the fields of web page importance calculation, web page denoising, subject-related link block extraction, web text identification, web page text extraction, refinement of search engine processing unit granularity, and massive web data preprocessing, and in particular to a method for identifying link blocks of a Web page based on a block tree.

Background technique

[0002] The World Wide Web is a huge network built on links, and links are the soul of the World Wide Web. Through the links between them, the web pages of the World Wide Web form the most complex network in the world. Web crawlers also rely on the links between web pages to crawl network data, and the importance of a web page is often obtained through link analysis. A web page often contains between tens and thousands of links; in index (catalogue) type web pages especially, links account for nearly 100% of the content. Although there are many l...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/9558
Inventor: 谷琼, 王贤明, 朱莉
Owner: 湖北云服科技有限公司