Search algorithm for Chinese word segmentation

A search algorithm for Chinese word segmentation, applied in the field of text search engines. It balances index construction time against index space, reducing construction time and memory cost while achieving high search efficiency.

Active Publication Date: 2018-11-20
FUDAN UNIV
Cites: 4 · Cited by: 11

AI Technical Summary

Problems solved by technology

Among index structures, the inverted index has the best overall performance and is the most widely used. In practice, however, applying the inverted index model to large text collections places heavy demands on CPU resources, memory space, and I/O.



Examples


Embodiment Construction

[0049] To study the search performance of the present invention on data sets of different sizes, we constructed five data sets with data volumes of 10,000, 20,000, 50,000, 100,000, and 200,000 respectively, and ran multiple comparison experiments on each data set against the Lucene engine, which is based on inverted lists.

[0050] For each length from 2 to 4, 25 search strings are randomly generated, giving 75 search strings in total. Each search string is searched 100,000 times, and the time taken by each search is recorded, provided that the search results are correct.
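The query workload described in [0050] could be reproduced roughly as follows. This is only a sketch under assumptions: the names `make_queries` and `time_searches`, the character pool, and the fixed random seed are hypothetical, and the sketch reports an average time per search rather than logging every individual search as the text describes.

```python
import random
import time


def make_queries(alphabet, per_length=25, lengths=(2, 3, 4), seed=0):
    """Generate per_length random strings for each length in lengths."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    return [
        "".join(rng.choice(alphabet) for _ in range(n))
        for n in lengths
        for _ in range(per_length)
    ]


def time_searches(search_fn, queries, repeats):
    """Return (query, average seconds per search) for every query."""
    results = []
    for q in queries:
        start = time.perf_counter()
        for _ in range(repeats):
            search_fn(q)
        results.append((q, (time.perf_counter() - start) / repeats))
    return results


if __name__ == "__main__":
    pool = "中文分词搜索算法"  # hypothetical character pool
    queries = make_queries(pool)
    # A trivial stand-in for the real search function; the experiment
    # in the text would use repeats=100_000.
    timings = time_searches(lambda q: q in pool, queries, repeats=1000)
    print(len(queries), timings[0])
```

The `search_fn` argument is where either the index under test or the Lucene baseline would be plugged in, so that both are measured by the same loop.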

[0051] So that Lucene can perform the same task as the index of the present invention, a space is inserted between adjacent characters of each original sequence when the initial index is built, so that each character is treated as a word. Spaces are likewise inserted between the characters of each search string, giving Lucene the same search function as the present invention.
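The preprocessing in [0051] amounts to a one-line transformation applied to both the indexed text and the query. A minimal sketch, assuming a hypothetical helper name `space_out`; how the spaced text is then handed to Lucene (analyzer and query type) is not stated in the source and is not shown here.

```python
def space_out(text):
    """Insert a space between adjacent characters so that a whitespace-based
    tokenizer sees every character as its own word."""
    return " ".join(text)


if __name__ == "__main__":
    print(space_out("中文分词"))
```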

...


Abstract

The invention belongs to the technical field of text search engines and specifically relates to a search algorithm for Chinese word segmentation. The algorithm has two phases: an offline indexing phase and an online searching phase. In the offline indexing phase, the suffix string sets of all original strings are first extracted, and an improved suffix tree is then generated from those suffix string sets. In the online searching phase, the query results for a keyword are first obtained from the index model based on the suffix tree, the matching degree between the keyword and each query result is then quantified, and finally the query results are sorted from high to low by matching degree and returned. Through the improved suffix-tree-based index structure, the algorithm balances index construction time against occupied space, so that searching with this index structure is much more efficient than brute-force computation of the matching degree followed by sorting of the result set.
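The two phases can be illustrated with a small sketch. It is not the patent's improved suffix tree: it uses a plain, uncompressed suffix trie, and the matching degree shown (keyword length divided by result length) is an assumed stand-in, since the source does not give the actual formula here. The names `Node`, `build_index`, and `search` are hypothetical.

```python
class Node:
    """One trie node: child links plus ids of strings passing through it."""
    __slots__ = ("children", "ids")

    def __init__(self):
        self.children = {}
        self.ids = set()


def build_index(strings):
    """Offline phase: insert every suffix of every original string."""
    root = Node()
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.children.setdefault(ch, Node())
                node.ids.add(sid)
    return root


def search(root, strings, keyword):
    """Online phase: look up the keyword, score each hit, sort descending."""
    node = root
    for ch in keyword:
        node = node.children.get(ch)
        if node is None:
            return []
    # Assumed matching degree: fraction of the result covered by the keyword.
    scored = [(len(keyword) / len(strings[sid]), strings[sid]) for sid in node.ids]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


if __name__ == "__main__":
    corpus = ["复旦大学", "大学", "上海大学生"]
    index = build_index(corpus)
    for score, text in search(index, corpus, "大学"):
        print(f"{score:.2f} {text}")
```

Because every suffix is indexed, a keyword is found wherever it occurs as a substring, with lookup cost proportional to the keyword length. The uncompressed trie takes space quadratic in string length, which is the kind of construction cost that a more compact structure would aim to reduce.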

Description

Technical field

[0001] The invention belongs to the technical field of text search engines, and in particular relates to a search algorithm oriented to Chinese word segmentation.

Background technique

[0002] A search engine is an online information search tool that returns to the user a series of search results matching the user's search keywords. We live in an era of information explosion. Faced with a vast amount of information, quickly and accurately locating what users want is one of the most urgent needs, and information search technology has accordingly been developed and applied rapidly.

[0003] The most common form of search is text search. Whether the user's target resource is text, image, audio, or even video, as long as the input is text, the search falls within the scope of the present invention. Now, in addition to the full-site search functions provided by Google, Bing, Yahoo, etc., the demand for search in specific fields i...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F40/289
Inventor: 金城, 陶仕谦, 唐士芳, 吴渊, 张玥杰, 冯瑞, 薛向阳
Owner: FUDAN UNIV