Chinese label extraction method for clustering search results of search engine

A search engine label extraction technology, applied in special data processing applications, instruments, electrical digital data processing, etc., that addresses problems such as high computational overhead, difficulty in meeting user query needs, and low proportion of documents

Status: Inactive
Publication Date: 2011-06-01
SOUTH CHINA UNIV OF TECH +1
4 Cites, 32 Cited by

AI Technical Summary

Problems solved by technology

[0007] (1) Label noise: Retrieval results are generally clustered on titles and abstracts. However, titles and abstracts contain many words unrelated to the document's content and topic, which introduces considerable noise into label extraction.
Existing noise filtering mainly relies on simple methods such as removing HTML tags, meaningless symbols, and stop words, which do not solve the noise problem well.
[0008] (2) The label does not have a good representativeness of



Examples


Embodiment

[0078] As shown in Figure 1, the Chinese label extraction method for clustering search engine results includes the following steps:

[0079] S1. The user inputs query words. After the retrieval results are obtained, the abstracts of the first M result pages are selected as input documents, forming the document collection Snippets, where M is a positive integer;

[0080] S2. Segment the input documents into words. (In what follows, "input document" always refers to a retrieved result webpage; the input for each result webpage includes only its title and abstract, not its original content.)

[0081] Segment all input documents, dividing each into an ordered sequence of words, and obtain the part-of-speech tag of each word. These ordered word sequences form a new set R1;
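
The shape of the set R1 produced by S2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the text above does not name a segmenter, so `toy_segment` is a hypothetical stand-in that parses pre-tagged `word/tag` text, and the tag letters `n` and `v` are likewise assumed. A real Chinese segmenter with part-of-speech tagging would be passed in its place.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def toy_segment(text: str) -> List[Token]:
    """Hypothetical stand-in for a real segmenter: parses 'word/tag' items."""
    tokens: List[Token] = []
    for item in text.split():
        word, sep, tag = item.rpartition("/")
        if not sep:
            # No tag present: keep the word and leave the tag empty.
            word, tag = item, ""
        tokens.append((word, tag))
    return tokens


def build_r1(
    snippets: List[str],
    segment: Callable[[str], List[Token]] = toy_segment,
) -> List[List[Token]]:
    """Turn each input document (title plus abstract) into an ordered token sequence."""
    return [segment(doc) for doc in snippets]


# Two pre-tagged toy documents standing in for the Snippets collection.
snippets = ["搜索/n 引擎/n 聚类/v 结果/n", "聚类/v 标签/n 提取/v"]
r1 = build_r1(snippets)
```

Word order is preserved within each document, which matters later when candidate words are expanded into ordered word sequences.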

[0082] S3. Select candidate words

[0083] Extract all verbs and nouns that occur at least 3 times in the set R1 as candidate...
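
The frequency filter described in S3 might look like the following Python sketch. Several points are assumptions rather than statements from the text: frequency is counted as total occurrences across R1 (not the number of documents containing the word), nouns and verbs are identified by the exact tags `n` and `v`, and the function name `select_candidates` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def select_candidates(
    r1: List[List[Token]],
    min_freq: int = 3,
    pos_tags: Iterable[str] = ("n", "v"),  # assumed noun/verb tags
) -> Set[str]:
    """Return nouns and verbs occurring at least `min_freq` times in R1."""
    allowed = set(pos_tags)
    counts = Counter(
        word for doc in r1 for word, tag in doc if tag in allowed
    )
    return {word for word, count in counts.items() if count >= min_freq}


# Toy R1: "的" is frequent but is neither a noun nor a verb, so it is excluded.
r1 = [
    [("聚类", "v"), ("结果", "n")],
    [("聚类", "v"), ("标签", "n")],
    [("聚类", "v"), ("的", "u"), ("标签", "n")],
    [("的", "u"), ("的", "u"), ("标签", "n")],
]
candidates = select_candidates(r1)
```

Here `candidates` contains 聚类 and 标签 (three occurrences each), while 结果 falls below the threshold.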



Abstract

The invention discloses a Chinese label extraction method for clustering the search results of a search engine, comprising the following steps: S1, the user inputs search words, and input documents are formed; S2, candidate words are selected and all candidate words are scored; S3, it is judged whether any unmarked candidate words exist: if none exist, the method skips to step S8; if some exist, the candidate word with the highest score is selected and expanded into a set of ordered word sequences containing that word, and the method enters step S4; S4, the frequency of each ordered word sequence is calculated and the high-frequency word sequences are extracted; S5, the high-frequency word sequences are scored and a candidate word sequence is selected; S6, it is judged whether the candidate word sequence is accepted as a label: if so, the method enters step S7, otherwise it returns to step S3; S7, clustering is performed according to the generated labels; and S8, the operation is completed. The method reduces noise labels, and the resulting labels are more representative, concise, and complete.
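
The expansion and frequency steps (the second half of S3, and S4) can be illustrated with a short Python sketch. It is not the patented procedure: the abstract does not say how sequences are formed or what counts as high frequency, so this version assumes contiguous word windows of length 1 to `max_len` and a fixed count threshold `min_count`. The function names are hypothetical, and the scoring and acceptance rules of S5 and S6 are not sketched because the abstract gives no formulas for them.

```python
from collections import Counter
from typing import List, Tuple

Sequence = Tuple[str, ...]


def expand_candidate(word: str, docs: List[List[str]], max_len: int = 3) -> Counter:
    """Count contiguous word sequences of length 1..max_len that contain `word`."""
    counts: Counter = Counter()
    for words in docs:
        for n in range(1, max_len + 1):
            for start in range(len(words) - n + 1):
                seq = tuple(words[start:start + n])
                if word in seq:
                    # Each window is counted once, even if `word` repeats in it.
                    counts[seq] += 1
    return counts


def high_frequency(counts: Counter, min_count: int = 2) -> List[Sequence]:
    """Keep sequences seen at least `min_count` times, most frequent first."""
    return [seq for seq, c in counts.most_common() if c >= min_count]


# Toy segmented documents (words only, in their original order).
docs = [["搜索", "引擎", "聚类"], ["搜索", "引擎", "优化"], ["引擎", "聚类"]]
counts = expand_candidate("引擎", docs, max_len=2)
frequent = high_frequency(counts)
```

With this toy input, the sequences containing 引擎 that occur at least twice are kept, and the one-off sequence (引擎, 优化) is dropped.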

Description

Technical field

[0001] The invention relates to the technical field of search result clustering for search engines, and in particular to a Chinese label extraction method for clustering search engine results.

Background art

[0002] Clustering the results returned by a search engine is an important means of improving its service quality. Webpages on the same subtopic are grouped into the same category, and each category is described by a label that summarizes its topic, so that users can quickly locate the webpages on the topic they are interested in. Clustering of search engine results is an active and challenging topic in modern search engine research.

[0003] At present, the label generation methods for clustering the retrieval results of search engines can be divided into two categories: one is the method of clustering first and then extracting labels; the other is the method of extracting labels first and then dividing...


Application Information

IPC(8): G06F17/30
Inventor: 董守斌, 张丽平, 张凌, 李粤, 袁华
Owner: SOUTH CHINA UNIV OF TECH