Chinese Web document online clustering method based on common substrings

A clustering method based on common substrings, applied in the field of information processing. It addresses the poor performance of Chinese information retrieval, retains semantic components, avoids dependence on a thesaurus, and improves clustering performance.

Inactive Publication Date: 2010-04-14

BEIHANG UNIV

0 Cites, 33 Cited by

Problems solved by technology

[0010] The above-mentioned method is suitable for English information retrieval systems. Chinese text, however, has no spaces between words and must rely on a word segmentation system, so the method is not effective for Chinese information retrieval.

Embodiment Construction

[0039] 1. Web document preprocessing

[0040] The results returned by Chinese search engines (such as Baidu) often contain non-Chinese content, such as English characters, spaces, punctuation marks, or garbled characters. Since the present invention focuses on clustering Chinese Web documents, the non-Chinese content in the search results must be replaced before clustering.

[0041] The preprocessing stage replaces these characters with separators predefined by the system. The characters to be replaced mainly include spaces, digits, uppercase and lowercase English letters, Chinese and English punctuation marks (both full-width and half-width), and Chinese stop characters (for example "ah", "de", "le"). After preprocessing, each search engine result item contains only Chinese characters and serves as the input for common substring extraction...
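The preprocessing step described in [0040] and [0041] can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the separator `|`, the function name `preprocess`, the restriction to the basic CJK Unified Ideographs range, and the stop characters 啊, 的, 了 (an assumed rendering of "ah", "de", "le") are all hypothetical choices.

```python
import re

# Hypothetical separator; the source only says it is "predefined by the system".
SEPARATOR = "|"

# Assumed rendering of the "ah", "de", "le" examples; a real list would be longer.
STOP_CHARS = "啊的了"

# Assumption: "Chinese" means the basic CJK Unified Ideographs block (U+4E00..U+9FFF).
# Everything else (spaces, digits, Latin letters, half/full-width punctuation) matches.
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")
_STOP = re.compile("[" + STOP_CHARS + "]")


def preprocess(snippet: str) -> list[str]:
    """Replace non-Chinese runs and stop characters with the separator,
    then return the remaining Chinese-only segments."""
    marked = _NON_CHINESE.sub(SEPARATOR, snippet)
    marked = _STOP.sub(SEPARATOR, marked)
    return [segment for segment in marked.split(SEPARATOR) if segment]


if __name__ == "__main__":
    print(preprocess("北京航空航天大学 Beihang, 2010年的新闻！"))
```

Returning a list of segments rather than one joined string keeps the separators from ever appearing inside a candidate substring, which is presumably the purpose of inserting them.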

Abstract

The invention discloses an online clustering method for Chinese Web documents based on common substrings. With the rapid growth of information on the Internet, search engines have become essential for finding and locating information. Web document clustering automatically groups the results returned by a search engine by topic, helping users narrow the query scope and quickly locate the information they need. Online clustering of Web documents must, on the one hand, cope with the non-numerical and unstructured nature of Web documents and, on the other hand, run fast enough to meet users' online search requirements. To address these two requirements, the method comprises the following steps: (1) preprocess the first n query results returned by the search engine, deleting or replacing the non-Chinese characters in them; (2) extract the common substrings in the Web documents using GSA; (3) define a weighting formula modeled on TF*IDF over the extracted common substrings and build a document feature vector model; (4) compute the pairwise similarity of the Web documents on the basis of this model to obtain a similarity matrix; (5) cluster the Web documents from this matrix using an improved hierarchical clustering algorithm; and (6) generate cluster descriptions and extract labels. The method shows clear advantages in clustering performance, cluster label generation, and clustering time.
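Steps (2) through (5) of the abstract can be illustrated with a small sketch. It takes several liberties, all of them assumptions rather than the patent's method: common substrings are found by naive enumeration instead of GSA, the weight is a plain `tf * log(1 + n / df)` rather than the patent's own TF*IDF-style formula (which is not given here), similarity is cosine, and the clustering is ordinary average-linkage agglomeration with a hypothetical stopping threshold rather than the "improved" algorithm. Each document is assumed to be a list of Chinese-only segments, as produced by step (1).

```python
import math
from collections import Counter


def common_substrings(docs, min_len=2, min_docs=2):
    """Naive stand-in for GSA: substrings shared by at least min_docs documents."""
    doc_freq = Counter()
    for segments in docs:
        seen = set()
        for seg in segments:
            for i in range(len(seg)):
                for j in range(i + min_len, len(seg) + 1):
                    seen.add(seg[i:j])
        doc_freq.update(seen)  # count each substring once per document
    return {s: df for s, df in doc_freq.items() if df >= min_docs}


def feature_vectors(docs, doc_freq):
    """Sparse vectors weighted by an assumed TF*IDF variant."""
    n = len(docs)
    vectors = []
    for segments in docs:
        text = "|".join(segments)  # "|" never occurs inside a substring
        vec = {}
        for s, df in doc_freq.items():
            tf = text.count(s)
            if tf:
                vec[s] = tf * math.log(1 + n / df)
        vectors.append(vec)
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_matrix(vectors):
    return [[cosine(u, v) for v in vectors] for u in vectors]


def cluster(sim, threshold=0.3):
    """Average-linkage agglomeration; stops when the best pair falls below threshold."""
    clusters = [[i] for i in range(len(sim))]

    def link(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, (x, y) = max(
            (link(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters


if __name__ == "__main__":
    docs = [["苹果手机", "新款发布"], ["苹果手机", "价格"],
            ["足球比赛", "结果"], ["足球比赛", "直播"]]
    vectors = feature_vectors(docs, common_substrings(docs))
    print(sorted(cluster(similarity_matrix(vectors))))
```

The naive enumeration is quadratic in segment length and is only workable for short search-result snippets; a suffix-array-based extraction would be the natural replacement where clustering time matters.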

Description

Technical field

[0001] The invention belongs to the technical field of information processing; it is a data mining method and in particular relates to an online clustering method for Web documents.

Background technique

[0002] The clustering process is essentially a mapping process. Given an object set $O = \{o_1, o_2, \ldots, o_n\}$ and a class set $\pi = \{c_1, c_2, \ldots, c_m\}$, clustering is the following mapping:

[0003]

[0004] and satisfies:

[0005] (1) $c_i \subseteq O \quad (i = 1, 2, \ldots, t)$

[0006] (2) $\bigcup_{i=1}^{t} c$ ...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

Inventors: 张辉, 王德庆, 王晗, 杨高

Owner: BEIHANG UNIV
