Chinese word segmentation method and apparatus

A Chinese word segmentation technology, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the low efficiency and insufficient accuracy of existing schemes, and achieves a small amount of calculation, high accuracy, and unsupervised extraction of candidate words.

Active Publication Date: 2016-05-04
RUN TECH CO LTD BEIJING

AI Technical Summary

Problems solved by technology

[0005] The purpose of the embodiments of the present invention is to provide a Chinese word segmentation method and device that solve the problems of insufficient accuracy and low efficiency in existing Chinese word segmentation schemes.



Examples


Embodiment 1

[0043] FIG. 3 is a schematic flowchart of a Chinese word segmentation method provided by Embodiment 1 of the present invention. The method can be executed by a Chinese word segmentation device. As shown in FIG. 3, the method includes:

[0044] Step 301. Divide the text set into multiple short sentences, and number the multiple short sentences.

[0045] The text set includes at least one text.

[0046] By way of example, the device that executes the method of this embodiment may be implemented in software and/or hardware, and may be integrated into a server that provides services such as word segmentation or retrieval.

[0047] In this embodiment, the text set can be divided into n short sentences, and the short sentences can be numbered 1, 2, ..., n in sequence.

[0048] Preferably, the text set can be divided into multiple short sentences according to Chinese punctuation marks, and the multiple short sentences are numbered.
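
Step 301 can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function name `split_short_sentences` and the exact set of punctuation marks used as delimiters are assumptions, since the text only says that the split follows Chinese punctuation marks.

```python
import re

# Assumed delimiter set: common Chinese punctuation marks plus newlines.
# The source does not list the marks it uses.
_DELIMITERS = re.compile(r"[，。！？；：、\n]+")


def split_short_sentences(texts):
    """Split every text in the text set and number the pieces 1..n in order."""
    numbered = {}
    n = 0
    for text in texts:
        for piece in _DELIMITERS.split(text):
            piece = piece.strip()
            if piece:  # skip empty pieces left by trailing punctuation
                n += 1
                numbered[n] = piece
    return numbered


if __name__ == "__main__":
    for number, sentence in split_short_sentences(["我爱北京，北京很大。", "你好！"]).items():
        print(number, sentence)
```

Numbering continues across texts, so a short sentence number identifies a short sentence within the whole text set rather than within one text.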

[0049] Preferably, when t...

Embodiment 2

[0061] FIG. 4 is a schematic flowchart of a Chinese word segmentation method provided by Embodiment 2 of the present invention. This embodiment is an optimization of the above embodiment. Before the step of obtaining, for each Chinese character in the text set, the first short sentence number list corresponding to the current Chinese character, a step is added: determine the short sentence number lists and adjacent character sets corresponding to all the distinct Chinese characters in the text set. The advantage is that, when each Chinese character is processed, the short sentence number list and adjacent character set corresponding to the current Chinese character, as well as the short sentence number list corresponding to the adjacent Chinese character, can be obtained directly from the lists and sets already determined, which improves processing speed.
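
The precomputation described in this embodiment resembles building an inverted index once, before the per-character loop. The sketch below is hypothetical: the name `build_index`, the input shape (a dict from short sentence number to text), and the choice to record only right-hand neighbours in the adjacent character set are all assumptions, and the input is assumed to contain only Chinese characters.

```python
from collections import defaultdict


def build_index(short_sentences):
    """Precompute, for every distinct character, its short sentence number
    list (ascending) and the set of characters seen immediately to its right.

    short_sentences: dict mapping short sentence number -> text (assumed shape).
    """
    number_lists = defaultdict(list)
    right_neighbors = defaultdict(set)
    for number in sorted(short_sentences):
        sentence = short_sentences[number]
        for i, ch in enumerate(sentence):
            # Record each short sentence number once per character.
            if not number_lists[ch] or number_lists[ch][-1] != number:
                number_lists[ch].append(number)
            # Assumption: the adjacent character set holds right-hand neighbours.
            if i + 1 < len(sentence):
                right_neighbors[ch].add(sentence[i + 1])
    return dict(number_lists), dict(right_neighbors)


if __name__ == "__main__":
    lists, neighbors = build_index({1: "我爱北京", 2: "北京很大"})
    print(lists["北"], neighbors["北"])
```

With the index built once, each later lookup for the current character or its neighbour is a dictionary access instead of a scan over the text set.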

[0062] Further, this embodiment also optimizes the calculation proces...

Embodiment 3

[0094] FIG. 5 is a structural block diagram of a Chinese word segmentation device provided by Embodiment 3 of the present invention. The device can be implemented in software and/or hardware, and can perform word segmentation on Chinese text by executing the Chinese word segmentation method of the embodiments of the present invention. Typically, the device can be integrated into a server that provides services such as word segmentation or retrieval. As shown in FIG. 5, the device includes a text set segmentation module 501, a first short sentence number list acquisition module 502, a second short sentence number list acquisition module 503, a co-occurrence degree calculation module 504, an adjacent character set acquisition module 505, an adjacency correlation degree calculation module 506, a candidate word set adding module 507, and a word segmentation module 508.

[0095] Wherein, the text set segmentation module 501 is used to divide the text set into mult...


Abstract

Embodiments of the invention disclose a Chinese word segmentation method and apparatus. The method comprises: dividing a text set into a plurality of short sentences and numbering the short sentences; for each Chinese character in the text set, obtaining a first short sentence number list corresponding to the current Chinese character, obtaining a second short sentence number list corresponding to the adjacent Chinese character immediately to the right of the current Chinese character, and calculating a co-occurrence degree according to the first short sentence number list and the second short sentence number list; obtaining an adjacent character set corresponding to the current Chinese character, and calculating an adjacency correlation degree according to the adjacent character set; determining, according to the co-occurrence degree and the adjacency correlation degree, whether to add the word consisting of the current Chinese character and the adjacent Chinese character to a candidate word set; and performing word segmentation on the text set according to the candidate word set. The method requires little computation and is highly accurate when calculating the candidate word set, can effectively improve both the accuracy of the word segmentation result and the efficiency of word segmentation, does not depend on a corpus dictionary, and can realize unsupervised candidate word extraction.
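
The abstract says the co-occurrence degree is computed from the two short sentence number lists but does not give the formula in the text available here. The sketch below therefore uses a stand-in measure, the share of short sentences containing either character that contain both, purely as an assumption. The names `shared_count` and `co_occurrence` are hypothetical, and the adjacency correlation degree is left out because its definition is not given.

```python
def shared_count(first, second):
    """Count numbers present in both ascending lists with a two-pointer merge."""
    i = j = count = 0
    while i < len(first) and j < len(second):
        if first[i] == second[j]:
            count += 1
            i += 1
            j += 1
        elif first[i] < second[j]:
            i += 1
        else:
            j += 1
    return count


def co_occurrence(first, second):
    """Assumed measure (not taken from the source): short sentences containing
    both characters divided by short sentences containing either character."""
    shared = shared_count(first, second)
    either = len(first) + len(second) - shared
    return shared / either if either else 0.0


if __name__ == "__main__":
    # Number lists for a current character and its right-hand neighbour.
    print(co_occurrence([1, 2, 5], [2, 5, 7]))
```

Because both number lists are kept in ascending order, the intersection takes a single linear pass, which fits the small calculation amount the abstract claims.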

Description

Technical field

[0001] The embodiments of the present invention relate to the technical field of natural language, and in particular to a Chinese word segmentation method and device.

Background

[0002] Chinese word segmentation (Chinese Word Segmentation) refers to the segmentation of a sequence of Chinese characters into individual words. Chinese uses characters as the basic unit of writing, and there are no symbols used to identify word boundaries, such as spaces in English, between words. Therefore, it is a difficult problem in the analysis and processing of Chinese texts to segment each sentence in Chinese texts.

[0003] Chinese word segmentation technology mainly includes Chinese word segmentation algorithm based on mechanical matching, Chinese word segmentation method based on Statistical Language Model (SLM), and Chinese word segmentation method based on artificial intelligence technology. Among them, the word segmentation method based on the statistical ...
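
For context on the mechanical matching family named in the background, forward maximum matching is a standard dictionary-based technique of that kind. The sketch below illustrates that general technique only and is not the method of the invention; the function name `forward_max_match`, the `max_len` parameter, and the sample vocabulary are assumptions.

```python
def forward_max_match(sentence, vocabulary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    vocabulary entry, falling back to a single character."""
    words = []
    i = 0
    while i < len(sentence):
        longest = min(max_len, len(sentence) - i)
        for size in range(longest, 0, -1):
            piece = sentence[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


if __name__ == "__main__":
    print(forward_max_match("我爱北京天安门", {"北京", "天安门"}))
```

Mechanical matching of this kind depends entirely on the vocabulary it is given, which is the dependence that a method without a corpus dictionary aims to avoid.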


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
CPC: G06F40/289
Inventor: 韦强申刘鹏
Owner: RUN TECH CO LTD BEIJING