Method for deduplicating approximate Chinese web pages based on micro-level features

A web page deduplication technology, applied in the field of computer network intelligent information retrieval, that addresses missed detections and misjudgments when detecting documents with similar content.

Status: Inactive
Publication Date: 2010-01-06
BEIJING INSTITUTE OF TECHNOLOGY
0 Cites, 29 Cited by

AI Technical Summary

Problems solved by technology

In the worst case (all documents are approximate documents), the time complexity of the I-Match algorithm is O(n log n).
[0010] These existing detection methods have the following defects: shingle-based methods rely on exact matching, which suits completely duplicate documents but causes documents with merely similar content to be missed.
Term-based methods rely only on keyword terms, which is not sufficient for judging document similarity: web documents with different content may share the same keywords, which leads to misjudgment.
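The two failure modes described above can be illustrated with a small sketch. It is not taken from the patent: the function names `shingles`, `exact_fingerprint`, and `keyword_set`, the shingle size `k=3`, and the sample sentences are all assumptions chosen for illustration.

```python
import hashlib

def shingles(tokens, k=3):
    """Return the set of k-token shingles (contiguous token windows)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def exact_fingerprint(tokens, k=3):
    """Hash the sorted shingle set; two documents match only if the sets are identical."""
    joined = "|".join(" ".join(s) for s in sorted(shingles(tokens, k)))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

def keyword_set(tokens, stopwords=frozenset({"the", "on", "a"})):
    """Reduce a document to its bag of keywords, ignoring order and position."""
    return {t for t in tokens if t not in stopwords}

# Hypothetical sample documents (pre-tokenized).
doc_a = "the missile test launch failed on wednesday".split()
doc_b = "the missile test launch failed on thursday".split()  # near-duplicate of doc_a
doc_c = "dog bites man".split()
doc_d = "man bites dog".split()                               # same keywords, different content

# Failure mode 1: exact shingle matching misses the near-duplicate.
missed = exact_fingerprint(doc_a) != exact_fingerprint(doc_b)

# Failure mode 2: keyword-only comparison judges different content as identical.
misjudged = keyword_set(doc_c) == keyword_set(doc_d)

print(missed, misjudged)
```

Both flags come out true here, which is the motivation for combining keyword terms with position information rather than relying on either signal alone.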



Examples


Embodiment

[0117] For example, take the web page at "http://cs.taoyuan.gov.cn/news/ReadNews.asp?NewsID=4727", which reports the failed test launch of Russia's new "Bulava" intercontinental ballistic missile, and check whether any web page with approximately the same content exists.

[0118] Step 1. For the newly input web page, extract the effective information of the page to obtain the effective text, as follows:

[0119] Russia's new Bulava intercontinental ballistic missile test fails

[0120] Date published: October 26, 2006

[0121] Source: Red Net

[0122] Editor: Xiang Xia

[0123] File photo of the Russian "Bulava" (also known as "Round Hammer") intercontinental ballistic missile

[0124] File photo of the "Dmitry Donskoy" nuclear submarine used to launch the "Round Hammer" sea-based intercontinental ballistic missile

[0125] Xinhua Net, Moscow, October 25 (Reporter Yue Lianguo) The Russian Navy Press and Public Relations Bureau told the press on the 25th that a "...
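The excerpt does not state how the effective text in Step 1 is extracted. As a minimal sketch under that caveat, the following uses Python's standard `html.parser` to drop script and style content and keep the visible text lines. The class name `TextExtractor`, the helper `extract_text`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of non-content tags."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self.chunks = []       # extracted text lines, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)

def extract_text(html):
    """Return the list of non-empty visible text fragments in the page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks

# Hypothetical page fragment modelled on the example above.
sample = """<html><head><style>p{color:red}</style></head>
<body><script>var x = 1;</script>
<h1>Russia's new Bulava intercontinental ballistic missile test fails</h1>
<p>Date published: October 26, 2006</p>
<p>Source: Red Net</p></body></html>"""

lines = extract_text(sample)
for line in lines:
    print(line)
```

A production extractor would also need to discard navigation bars, advertisements, and similar noise, which this sketch does not attempt.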



Abstract

The invention discloses a method for deduplicating approximate Chinese web pages based on micro-level features, which addresses the automatic detection of Chinese web pages with similar content. The method considers both the syntactic and the semantic information of a web page and comprises the following steps: first, building a text term co-occurrence graph from the effective information extracted from the web page; second, extracting document feature vectors, which comprise keyword terms and keyword position information; finally, building an inverted index file of document keywords, making full use of the retrieval system and classification information, and matching document feature vectors against the inverted index file to detect similar web pages. The method effectively reduces the harmful effect of noise information on the accuracy of the algorithm, takes both the content and the structure of the web page text into account, and exploits the advantages of retrieval and classification systems. It achieves a deduplication accuracy above 90 percent and an average recall above 80 percent, and is especially suitable for large-scale web page deduplication.
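The three steps in the abstract can be sketched as follows. This is an illustrative reading, not the patented algorithm: the excerpt does not give the keyword scoring, the vector layout, or the matching rule, so the weighted-degree ranking, the `(position, term)` pairs, the Jaccard threshold of 0.6, and all function names are assumptions. Documents are assumed to be already segmented into tokens.

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Undirected graph; edge weight counts how often two distinct terms
    occur within `window` tokens of each other."""
    graph = defaultdict(lambda: defaultdict(int))
    for i, term in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != term:
                graph[term][other] += 1
                graph[other][term] += 1
    return graph

def feature_vector(tokens, top_n=3, window=2):
    """Assumed scoring: keep the top_n terms by weighted degree and pair each
    with its first position, sorted by position."""
    graph = cooccurrence_graph(tokens, window)
    score = {t: sum(nbrs.values()) for t, nbrs in graph.items()}
    keywords = sorted(score, key=lambda t: (-score[t], tokens.index(t)))[:top_n]
    return sorted((tokens.index(t), t) for t in keywords)

def build_index(docs):
    """Inverted index: keyword -> set of ids of documents that have it as a feature."""
    index, vectors = defaultdict(set), {}
    for doc_id, tokens in docs.items():
        vectors[doc_id] = feature_vector(tokens)
        for _, term in vectors[doc_id]:
            index[term].add(doc_id)
    return index, vectors

def find_similar(tokens, index, vectors, threshold=0.6):
    """Fetch candidates through the index, then keep those whose keyword sets
    overlap enough (Jaccard similarity, an assumed matching rule)."""
    query = {term for _, term in feature_vector(tokens)}
    candidates = set().union(*(index.get(term, set()) for term in query))
    hits = []
    for doc_id in sorted(candidates):
        stored = {term for _, term in vectors[doc_id]}
        overlap = len(query & stored) / len(query | stored)
        if overlap >= threshold:
            hits.append((doc_id, overlap))
    return hits

# Hypothetical pre-segmented documents.
docs = {
    "d1": "russia bulava missile test launch fails russia navy confirms bulava missile failure".split(),
    "d2": "city council approves new park budget city residents welcome park plan".split(),
    "d3": "missile defense budget debate continues missile defense spending rises budget".split(),
}
index, vectors = build_index(docs)
query = "russia bulava missile test fails russia navy says bulava missile failure".split()
result = find_similar(query, index, vectors)
print(result)
```

The inverted index means only documents that share at least one keyword with the query are ever compared, which is what makes this kind of approach attractive for large collections. Here `d3` is retrieved as a candidate through the shared keyword `missile` but rejected by the overlap threshold, while `d2` is never examined.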

Description

Technical field

[0001] The invention relates to a method for deduplicating approximate Chinese web pages and belongs to the technical field of computer network intelligent information retrieval.

Technical background

[0002] With the unprecedented growth of Internet technology and scale, the Internet has become one of the main channels for obtaining information. According to a survey, as of July 2007 there were more than 125 million websites in total. Because of their convenient and fast retrieval, search engines have become the main tool that network users rely on for information retrieval. The quality and efficiency of information retrieval directly affect the overall performance of a search engine. According to the statistical report released by the China Internet Network Information Center in July 2005, when users answered the question of "the biggest problem encountered when searching for information", 44.6% of users chose the option of ...


Application Information

IPC(8): G06F17/30
Inventors: 曹玉娟, 牛振东, 赵堃, 赵育民, 江鹏
Owner: BEIJING INSTITUTE OF TECHNOLOGY