Chinese Web document online clustering method based on common substrings

A clustering method based on common substrings, applied in the field of information processing. It addresses the poor performance of existing methods on Chinese information retrieval, and it retains semantic components, avoids dependence on a thesaurus, and improves clustering performance.

Inactive Publication Date: 2010-04-14
BEIHANG UNIV
0 Cites, 33 Cited by

AI Technical Summary

Problems solved by technology

[0010] The above-mentioned method is suitable for English information retrieval systems. Chinese text, however, has no delimiters between words, so the method must rely on a word segmentation system and is therefore not effective for Chinese information retrieval.


Examples

Embodiment Construction

[0039] 1. Web document preprocessing

[0040] The results returned by Chinese search engines (such as Baidu) often contain non-Chinese characters, such as English letters, spaces, punctuation marks, or garbled characters. Since the present invention focuses on clustering Chinese Web documents, the non-Chinese content in the search results must be replaced before clustering.

[0041] The preprocessing stage replaces these non-Chinese characters with separators predefined by the system. The characters to be replaced mainly include spaces, digits, uppercase and lowercase English letters, Chinese and English punctuation marks (both full-width and half-width), and Chinese pause words (for example, "ah", "de", and "le"). After preprocessing, each search engine result item contains only Chinese characters and serves as the input for common substring extraction...
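The replacement step in [0041] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the separator `|`, the function name `preprocess`, the use of the CJK Unified Ideographs range U+4E00 to U+9FFF as the definition of "Chinese character", and the reading of the pause words "ah", "de", "le" as 啊, 的, 了 are all assumptions made here.

```python
import re

# Hypothetical separator; the source only says it is "predefined by the system".
SEPARATOR = "|"

# Assumed reading of the pause words "ah", "de", "le" named in the text.
PAUSE_WORDS = ("啊", "的", "了")

# Any run of characters outside the basic CJK Unified Ideographs block.
# This covers spaces, digits, Latin letters, and both half-width and
# full-width punctuation.
NON_CHINESE_RUN = re.compile(r"[^\u4e00-\u9fff]+")


def preprocess(text: str) -> str:
    """Replace pause words and non-Chinese runs with a single separator."""
    # Simplification: pause words are replaced wherever they occur,
    # even inside longer words.
    for word in PAUSE_WORDS:
        text = text.replace(word, SEPARATOR)
    # Adjacent separators and non-Chinese characters collapse into one.
    text = NON_CHINESE_RUN.sub(SEPARATOR, text)
    return text.strip(SEPARATOR)
```

For example, `preprocess("北京的天气 weather 2010, 很好!")` yields Chinese-only segments joined by the separator, which a later substring-extraction step can treat as boundaries that no common substring may cross.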

Abstract

The invention discloses an online clustering method for Chinese Web documents based on common substrings. As the amount of information on the Internet grows rapidly, search engines have become an important tool for searching and locating information. Web document clustering automatically groups the results returned by a search engine by topic, helping users narrow the query range and quickly locate the information they need. Online clustering of Web documents has two requirements: it must handle the non-numerical, unstructured nature of Web documents, and its running time must meet the demands of online search. To address both, the invention provides a Chinese Web document online clustering method based on common substrings, comprising the following steps: (1) preprocess the first n query results returned by the search engine, deleting and replacing the non-Chinese characters they contain; (2) extract the common substrings of the Web documents using GSA; (3) from the extracted common substrings, define a weighting formula modeled on TF*IDF and build a document feature vector model; (4) compute the pairwise similarity of the Web documents on the basis of this model to obtain a similarity matrix; (5) cluster the Web documents on the basis of this matrix using an improved hierarchical clustering algorithm; and (6) generate cluster descriptions and extract labels. The method shows clear advantages in clustering performance, cluster label generation, and clustering time.
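Steps (2) through (5) can be illustrated with a small sketch. Several parts are stand-ins chosen here, since this page gives neither the patent's formulas nor its algorithms: a brute-force substring enumeration replaces the GSA step, a smoothed textbook TF*IDF replaces the patent's own weighting formula, cosine similarity is assumed as the similarity measure, and plain average-link agglomerative clustering with a stopping threshold replaces the "improved" hierarchical algorithm. All function names, the `|` separator in the input, and the parameters `min_len`, `min_docs`, and `threshold` are hypothetical.

```python
import math
from collections import Counter


def common_substrings(docs, min_len=2, min_docs=2):
    """Naive stand-in for GSA: substrings shared by at least min_docs documents."""
    per_doc, doc_freq = [], Counter()
    for doc in docs:
        counts = Counter()
        for segment in doc.split("|"):  # substrings never cross a separator
            for i in range(len(segment)):
                for j in range(i + min_len, len(segment) + 1):
                    counts[segment[i:j]] += 1
        per_doc.append(counts)
        doc_freq.update(counts.keys())  # each document counts once per substring
    vocab = sorted(s for s, df in doc_freq.items() if df >= min_docs)
    return vocab, per_doc, doc_freq


def tfidf_vectors(vocab, per_doc, doc_freq):
    """Textbook TF*IDF with a smoothed IDF (an assumption, not the patent's formula)."""
    n = len(per_doc)
    return [[c[s] * math.log(1 + n / doc_freq[s]) for s in vocab] for c in per_doc]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    return [[cosine(u, v) for v in vectors] for u in vectors]


def cluster(sim, threshold=0.2):
    """Average-link agglomerative clustering; stops when no pair reaches threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = threshold, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                link = total / (len(clusters[a]) * len(clusters[b]))
                if link >= best:
                    best, pair = link, (a, b)
        if pair is None:
            break
        a, b = pair
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters
```

A possible usage, on already preprocessed result items:

```python
docs = ["北京|天气|很好", "北京|天气|预报", "足球|比赛|结果", "足球|比赛|直播"]
vocab, per_doc, doc_freq = common_substrings(docs)
groups = cluster(similarity_matrix(tfidf_vectors(vocab, per_doc, doc_freq)))
```

The brute-force enumeration is quadratic in segment length and is only meant to show the data flow; an online system would need the more efficient extraction the abstract attributes to GSA. Step (6), label extraction, is not covered here.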

Description

Technical field

[0001] The invention belongs to the technical field of information processing. It is a data mining method and relates in particular to an online clustering method for Web documents.

Background technique

[0002] The clustering process is essentially a mapping process. Given an object set $O = \{o_1, o_2, \ldots, o_n\}$ and a class set $\pi = \{c_1, c_2, \ldots, c_m\}$, clustering is the following mapping:

[0003] [mapping formula not present in the source text]

[0004] and it satisfies:

[0005] (1) $c_i \subseteq O \quad (i = 1, 2, \ldots, t)$

[0006] (2) $\bigcup_{i=1}^{t} c$ ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 张辉, 王德庆, 王晗, 杨高
Owner: BEIHANG UNIV