Method and system for Chinese full-text search in database

A database full-text search technology, applied to text database indexing, digital data information retrieval, unstructured text data retrieval, and related fields. It addresses the heavy computation, large data volume, and low efficiency of existing approaches, and achieves good recognition and retrieval, faster reads and writes, and storage of large amounts of data.

Active Publication Date: 2021-03-09
HIGHGO SOFTWARE
5 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

The word segmentation method based on string matching matches the Chinese character string to be analyzed against the entries in a dictionary. If a string is found in the dictionary, a word is considered to have been recognized. This method requires a "sufficiently complete" dictionary, but new words appear on the Internet so quickly that dictionary updates struggle to keep pace.
If the text to be retrieved contains a new Internet word that is not in the dictionary, the word cannot be segmented and processed correctly, so text containing the new word cannot be retrieved and results are missed.
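The failure mode described above can be illustrated with a small sketch. It assumes forward maximum matching, one common dictionary-based strategy; the source does not name a specific matching algorithm, and the function name, the `max_len` parameter, and the sample dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary segmentation (illustrative, not the patent's method).

    At each position, take the longest substring found in the dictionary;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionary that lacks the newer word "内卷".
dictionary = {"数据库", "全文", "检索"}

print(forward_max_match("数据库全文检索", dictionary))  # ['数据库', '全文', '检索']
print(forward_max_match("内卷检索", dictionary))        # ['内', '卷', '检索']
```

In the second call the unknown word is split into single characters, so an index keyed on dictionary words would never contain it as a unit.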
[0004] The word segmentation method based on statistics uses the frequency or probability with which adjacent characters co-occur in the text to perform segmentation. It only needs to count the frequency of character groups in the corpus and does not require a dictionary. However, it often extracts character groups that co-occur frequently but are not words. It has some ability to recognize new words, but its recognition accuracy for common words is poor, the computation is time-consuming, and the amount of data produced by segmentation is relatively large, which reduces the efficiency of subsequent retrieval.
[0005] On top of word segmentation, database products commonly use an inverted index to speed up retrieval. Specifically, after the database receives the data file to be inserted, it first reads the data file to perform Chinese word segmentation. After segmentation, it must read the file again to obtain the position of each phrase in the data file and write it into the inverted index; that is, the data file is read twice. When the data file is large, or a large amount of data is inserted into the database, this approach is computationally expensive and inefficient. Moreover, when a typical inverted index stores a word position, it stores only the row in which the phrase is located, so during retrieval the data must be read out before the similarity can be calculated, and retrieval efficiency is low.


Examples

Embodiment 1

[0061] This embodiment discloses a method for Chinese full-text retrieval in a database. As shown in Figure 1, it includes the following steps:

[0062] Receive text data to be inserted into the database;

[0063] Perform binary (bigram) word segmentation on every two adjacent Chinese characters of the text data, and create an inverted index for the text data at the same time;

[0064] During segmentation, for each bigram obtained, write the bigram and its position information in the text data into the inverted index;

[0065] Receive the text to be retrieved, and perform bigram segmentation on it to obtain multiple bigrams to be retrieved;

[0066] In the database, perform a full-text search based on the inverted index and the multiple bigrams to be retrieved.
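The insertion steps above can be sketched as follows. This is a minimal illustration under several assumptions, not the patent's implementation: bigrams are taken with an overlapping sliding window, "Chinese character" is approximated by the basic CJK Unified Ideographs range, position information is a `(row, offset)` pair, and all function names are hypothetical.

```python
from collections import defaultdict


def is_cjk(ch):
    # Assumption: treat the basic CJK Unified Ideographs block as "Chinese".
    return "\u4e00" <= ch <= "\u9fff"


def bigrams_with_offsets(text):
    """Yield (bigram, offset) for every pair of adjacent Chinese characters."""
    for i in range(len(text) - 1):
        if is_cjk(text[i]) and is_cjk(text[i + 1]):
            yield text[i:i + 2], i


def build_index(rows):
    """Segment and index in one pass: each row is read only once."""
    index = defaultdict(list)
    for row_id, text in enumerate(rows):
        for gram, offset in bigrams_with_offsets(text):
            # Posting = (row containing the bigram, position within the row).
            index[gram].append((row_id, offset))
    return index


index = build_index(["数据库检索", "全文检索"])
print(index["检索"])  # [(0, 3), (1, 2)]
```

Because each posting is written while the bigram is being produced, the text does not need to be read a second time to recover positions.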

[0067] Specifically, as shown in Figure 2, the method includes process 1 of inserting new text d...

Embodiment 2

[0098] As a modification of Embodiment 1, this embodiment provides a method for Chinese full-text search in a database, as shown in Figure 5.

[0099] The method for Chinese full-text retrieval in a database comprises the following steps:

[0100] Pre-create the inverted index structure;

[0101] Receive text data to be inserted into the database;

[0102] Perform binary (bigram) word segmentation on the text data, treating every two adjacent Chinese characters as a group;

[0103] During segmentation, for each bigram obtained, write the bigram and its position information in the text data into the inverted index;

[0104] Receive the text to be retrieved, and perform bigram segmentation on it to obtain multiple bigrams to be retrieved;

[0105] In the database, a full-text search is performed based on the inverted index and the pl...
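A rough sketch of this variant follows, in which the index object exists before any text is inserted and the query is segmented into bigrams in the same way as the stored text. The matching rule used here (every query bigram must appear in the same row at consecutive offsets) is an assumption made for illustration; the source does not spell out how matches or similarity are computed, and the class and method names are hypothetical. For brevity the sketch does not filter non-Chinese characters.

```python
from collections import defaultdict


class BigramIndex:
    def __init__(self):
        # Index structure created up front, before any text is inserted.
        self.postings = defaultdict(set)  # bigram -> {(row_id, offset)}
        self.rows = []

    @staticmethod
    def bigrams(text):
        return [(text[i:i + 2], i) for i in range(len(text) - 1)]

    def insert(self, text):
        row_id = len(self.rows)
        self.rows.append(text)
        for gram, offset in self.bigrams(text):
            self.postings[gram].add((row_id, offset))
        return row_id

    def search(self, query):
        """Return (row_id, start_offset) wherever all query bigrams line up."""
        grams = self.bigrams(query)
        if not grams:  # queries shorter than two characters yield no bigrams
            return []
        first_gram = grams[0][0]
        hits = []
        for row_id, start in sorted(self.postings.get(first_gram, ())):
            if all((row_id, start + k) in self.postings.get(g, ())
                   for g, k in grams):
                hits.append((row_id, start))
        return hits


idx = BigramIndex()
idx.insert("数据库全文检索")
idx.insert("全文检索方法")
print(idx.search("全文检索"))  # [(0, 3), (1, 0)]
```

Storing the offset within the row, and not only the row, lets the adjacency check run on the index alone without reading the row text back.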

Embodiment 3

[0120] Based on the retrieval method in Embodiment 1, this embodiment provides a database Chinese full-text retrieval system.

[0121] As shown in Figure 7, the database Chinese full-text retrieval system includes a client, a database system, and a server, where:

[0122] The client receives the text to be retrieved as entered by the user, generates a retrieval request, and sends it to the server;

[0123] The server, connected to the database system, is configured to receive text data, insert it into the database, and generate the corresponding inverted index for the text data, specifically including:

[0124] Step 101: Receive text data to be inserted into the database;

[0125] Step 102: Preprocess the text data;

[0126] Step 103: Perform binary (bigram) word segmentation on the preprocessed text data, treating every two adjacent Chinese characters as a group, and create an inverted index for the text data at the same time; the inverted index structure includes a three-level index, wherein the p...

Abstract

The invention discloses a method and system for Chinese full-text retrieval in a database. The method comprises the following steps: receiving text to be retrieved; performing binary (bigram) word segmentation on every two adjacent Chinese characters of the text to obtain a plurality of bigrams, and inserting them into the data table file; creating an inverted index for the data table file, where the inverted index contains a position index for each bigram that records the bigram's position information in each piece of text data in the database, the position information including the row containing the bigram and its position within that row; and performing a full-text search of the database for the text to be retrieved according to the plurality of bigrams. The retrieval method does not need a dictionary, retrieves new words better, and achieves higher retrieval efficiency by introducing a multi-level index mechanism.

Description

Technical field

[0001] The disclosure belongs to the technical field of data retrieval, and in particular relates to a method and system for Chinese full-text retrieval in a database.

Background

[0002] Full-text retrieval is a very common information query application, and it is one of the core technologies of Internet search engines. A full-text search product is essentially a database product with embedded full-text search technology. Chinese full-text search involves Chinese word segmentation.

[0003] At present, the main Chinese word segmentation methods fall into two groups: methods based on string matching and methods based on statistics. The word segmentation method based on string matching matches the Chinese character string to be analyzed against the entries in a dictionary. If a string is found in the dictionary, it is considered that a word has been reco...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/31, G06F16/33, G06F16/338
Inventor: 卢健姜瑞海王硕张龙
Owner: HIGHGO SOFTWARE