Method and apparatus for large- and small-granularity segmentation of Chinese text

A method and device for large- and small-granularity segmentation, applied in special data processing applications, instruments, and electrical digital data processing, that addresses the inability of existing techniques to provide Chinese word segmentation results at the granularity a given processing requirement calls for.

Active Publication Date: 2008-08-20
SHENZHEN TENCENT COMP SYST CO LTD

AI Technical Summary

Problems solved by technology

[0016] In summary, the disadvantage of the existing technology is that it cannot provide Chinese word segmentation results at the granularity required by different subsequent Chinese text processing tasks.

Examples

Embodiment Construction

[0029] The basic process of the embodiment of the present invention is shown in Figure 1 and includes the following basic steps:

[0030] Step 101: Formulate identification rules for pattern words and for named entity words such as person names, place names, and organization names, together with the corresponding information that distinguishes large-grained from small-grained segmentation.

[0031] Among them, the identification rules for pattern words include:

[0032] Granularity information, that is, granularity distinction points, is added to the recognition rules. The rules are then expressed as a deterministic finite automaton (DFA), so that during word segmentation the automaton can identify pattern words that match the rules. At final output, the DFA can then be used to divide the pattern words according to the large- or small-granularity requirement of the user, and the patter...
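The idea of a DFA that carries granularity distinction points can be sketched as follows. This is a minimal illustration, not the rule set of the patent: the date pattern, the state numbering, the `SPLIT_STATES` set, and the function names `match_pattern_word` and `cut_pattern_word` are all assumptions made for the example.

```python
# Hypothetical DFA for one kind of pattern word: dates such as "2008年8月20日".
# The states, transition table and split points are illustrative assumptions.


def char_class(ch):
    """Map a character to the input symbol used by the transition table."""
    return "D" if ch.isdigit() else ch


# (state, symbol) -> next state
TRANSITIONS = {
    (0, "D"): 1, (1, "D"): 1, (1, "年"): 2,
    (2, "D"): 3, (3, "D"): 3, (3, "月"): 4,
    (4, "D"): 5, (5, "D"): 5, (5, "日"): 6,
}
ACCEPTING = {6}
# Entering one of these states marks a granularity distinction point.
SPLIT_STATES = {2, 4, 6}


def match_pattern_word(text, start=0):
    """Run the DFA from `start`; return (end, split_points) or None."""
    state = 0
    splits = []
    best = None
    for i in range(start, len(text)):
        nxt = TRANSITIONS.get((state, char_class(text[i])))
        if nxt is None:
            break
        state = nxt
        if state in SPLIT_STATES:
            splits.append(i + 1)
        if state in ACCEPTING:
            best = (i + 1, list(splits))
    return best


def cut_pattern_word(text, start, granularity):
    """Return the pattern word as one token ("large") or split at the
    recorded distinction points ("small"); None if nothing matches."""
    result = match_pattern_word(text, start)
    if result is None:
        return None
    end, splits = result
    if granularity == "large":
        return [text[start:end]]
    pieces, prev = [], start
    for point in splits:
        pieces.append(text[prev:point])
        prev = point
    return pieces
```

With these assumed rules, the same recognition pass serves both outputs: the automaton runs once, and the requested granularity only decides whether the recorded distinction points are used when the token is emitted.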

Abstract

Disclosed are a device and a method for large- and small-granularity segmentation of Chinese text. The method comprises setting segmentation modes for large and small granularity, processing the Chinese text with the segmentation mode that corresponds to the input granularity, and outputting the segmented Chinese text. The invention can thus segment text at the granularity actually needed and satisfy the requirements of different subsequent Chinese text processing tasks.
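The flow in the abstract (select a mode from the input granularity, segment, output) can be sketched as below. The toy `LEXICON`, the `segment` function, and the use of forward maximum matching are assumptions chosen for brevity; the source does not state that the invention uses this matching strategy.

```python
# Toy lexicon: each large-grained word maps to its small-grained parts.
# Entries and the matching strategy are assumptions for illustration only.
LEXICON = {
    "中华人民共和国": ["中华", "人民", "共和国"],
    "北京大学": ["北京", "大学"],
    "成立": ["成立"],
}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text, granularity="large"):
    """Segment `text` with forward maximum matching, then emit each
    matched word whole ("large") or as its parts ("small")."""
    if granularity not in ("large", "small"):
        raise ValueError("granularity must be 'large' or 'small'")
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            word = text[i:i + length]
            if word in LEXICON:
                break
        else:
            # No lexicon entry starts here: emit a single character.
            word = text[i]
        if granularity == "small" and word in LEXICON:
            tokens.extend(LEXICON[word])
        else:
            tokens.append(word)
        i += len(word)
    return tokens
```

A caller such as an indexer might request `segment(text, "small")` to improve recall, while a caller that needs whole terms would request `"large"`; both calls share one lexicon lookup.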

Description

Technical field

[0001] The invention relates to the technical field of automatic text information processing, in particular to a method and device for large- and small-granularity segmentation of Chinese text.

Background art

[0002] Chinese text is written as a sequence of single characters, and unlike English, the words that carry its meaning have no explicit separators between them. To perform semantic analysis on Chinese text, the first task is therefore to add word boundary markers to the text, so that the resulting word string reflects the original meaning of the sentence.

[0003] The existing Chinese word segmentation methods can generally meet the basic requirements of Chinese word segmentation, but Chinese word segmentation is the most basic analysis and processing of text. Based on this processing, there are many other subsequent text processing op...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
Inventor: 朱鉴, 李闪
Owner: SHENZHEN TENCENT COMP SYST CO LTD