Method for identifying re-loading relation between internet news texts

A reprint-relationship recognition technology for Internet news, applied in the field of Internet technology / data mining, that achieves efficient processing, noise resistance, and efficient identification.

Status: Inactive
Publication Date: 2012-08-29
HUAZHONG UNIV OF SCI & TECH
3 Cites, 26 Cited by

AI Technical Summary

Problems solved by technology

[0014] However, although these methods have their own innovations, there is no method that can handle docu



Embodiment Construction

[0033] As shown in Figure 1:

[0034] First, an offline HTML page is input. Visually, an HTML page can be divided into several independent blocks (regions), each displaying different information. For example, a common HTML page contains the following blocks: top navigation bar, related links, body section, comments, bottom site links, etc. Details are shown in Figure 2.

[0035] For an HTML page, a topic content block is the text area containing the events described on the page; it can be understood as the "body text" part. A news web page, for example, contains not only the news story itself but often also a large amount of navigation information, related news links, advertisements, comments and so on.

[0036] Web page preprocessing, that is, the extraction of topic content blocks, removes useless structural information and noise content from web pages and extracts the text part that narrates the event...
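The visible excerpt does not specify the statistical algorithm used for topic content block extraction. As an illustration only, the sketch below swaps in a simple link-density heuristic: it splits a page into blocks at a few structural tags, discards blocks whose text is mostly link text (navigation, related links), and keeps the longest remaining block. The names `BlockCollector` and `extract_topic_block`, the tag sets, and the `max_link_ratio` threshold are all assumptions made for this example, not part of the patent.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries and tags whose content is ignored.
# Both sets are assumptions chosen for this sketch.
BLOCK_TAGS = {"div", "td", "p", "ul", "table", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and track how much of each is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []            # tuples: (text_chars, link_chars, text)
        self._current = [0, 0, []]  # running totals for the open block
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        text_len, link_len, parts = self._current
        if text_len:
            self.blocks.append((text_len, link_len, " ".join(parts)))
        self._current = [0, 0, []]

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip:
            return
        self._current[0] += len(text)
        if self._in_link:
            self._current[1] += len(text)
        self._current[2].append(text)

    def close(self):
        super().close()
        self._flush()


def extract_topic_block(html, max_link_ratio=0.3):
    """Return the longest block whose share of link text is at most max_link_ratio."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    candidates = [
        (text_len, text)
        for text_len, link_len, text in collector.blocks
        if link_len / text_len <= max_link_ratio
    ]
    return max(candidates)[1] if candidates else ""
```

A real page would need more care (nested blocks, body text spread over several paragraphs, comment sections longer than the article), so this heuristic is only a starting point for experimenting with the idea of noise filtering.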


Abstract

The invention provides a method for identifying the reprint ("re-loading") relation between Internet news texts, that is, for determining which news pages on the Internet are reprints of one another. The method comprises the following steps: extracting the main body content from a page with a statistical algorithm, which filters out noise such as advertisements and navigation; automatically identifying new words and mining the feature words of the news text, so as to make a preliminary identification of the reprint relation; and, on the basis of the preliminary identification, calculating the similarity of the news texts with a kernel function method to further determine the reprint relation between them. The site that originally published a news item can also be obtained.
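The two-stage idea in the abstract (a coarse match on feature words, then a kernel-based similarity) can be sketched as follows. The abstract names neither the feature-word mining procedure nor the kernel, so both are stand-ins here: frequent tokens play the role of feature words, and a normalized character n-gram (spectrum) kernel plays the role of the kernel function. The function names and the thresholds `coarse` and `fine` are hypothetical, and the whitespace tokenization is only suitable for the English toy input; Chinese news text would need word segmentation and new-word detection, which this sketch omits.

```python
from collections import Counter
from math import sqrt


def feature_words(text, top_k=10):
    """Stand-in for feature-word mining: the top_k most frequent longer tokens."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    counts = Counter(t for t in tokens if len(t) > 3)
    return {word for word, _ in counts.most_common(top_k)}


def jaccard(a, b):
    """Overlap of two feature-word sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ngram_counts(text, n=3):
    """Character n-gram counts after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(x, y, n=3):
    """Normalized n-gram spectrum kernel: cosine of the two n-gram count vectors."""
    cx, cy = ngram_counts(x, n), ngram_counts(y, n)
    dot = sum(cx[g] * cy[g] for g in cx.keys() & cy.keys())
    norm = sqrt(sum(v * v for v in cx.values()) * sum(v * v for v in cy.values()))
    return dot / norm if norm else 0.0


def is_reprint(a, b, coarse=0.3, fine=0.8):
    """Stage 1: cheap feature-word filter. Stage 2: kernel similarity check."""
    if jaccard(feature_words(a), feature_words(b)) < coarse:
        return False
    return spectrum_kernel(a, b) >= fine
```

The coarse filter keeps the more expensive kernel computation off pairs that share almost no vocabulary, which is one plausible reading of "preliminary identification" followed by refinement; the threshold values would have to be tuned on real data.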

Description

Technical field

[0001] The invention belongs to the field of Internet technology / data mining. It uses offline news web pages that have already been obtained to mine the reprint relationships among them, and to discover other news items that have a reprint relationship with news a specified user is interested in.

Background technology

[0002] With the continuous deepening and extensive development of Internet applications, the speed, influence and scope of Internet public opinion are constantly increasing. News related to hot events on the Internet is reprinted in large numbers, so accurately and efficiently identifying the reprint relationships among these news items is of great significance. To do so, an identification system should have the following characteristics:

[0003] First, efficiently handle documents of shorter length. The size of a news text webpage generally does not ...


Application Information

IPC(8): G06F17/30, G06F17/27
Inventor: 王君泽, 黄本雄, 刘冬一, 胡广, 温杰, 刘玮文
Owner HUAZHONG UNIV OF SCI & TECH