Method for discovering compound words in specific field based on statistics and rules

A technique for discovering field-specific compound words, applied in computer natural language processing. It addresses the poor recognition of compound words by existing word segmentation systems, and it reduces the amount of depth search, improves accuracy, and lowers CPU time and memory overhead.

Inactive Publication Date: 2013-09-18
瑞达信息安全产业股份有限公司

AI Technical Summary

Problems solved by technology

[0004] The purpose of the present invention is to propose a method for discovering compound words in a specific field based on statistics and rules, so as



Examples


Embodiment 1

[0047] Figure 1 is a structural diagram of the present invention. The method for discovering compound words in a specific field based on statistics and rules comprises the following steps:

[0048] A. Use an existing word segmentation system to perform atomic word segmentation and part-of-speech tagging on the domain texts;

[0049] B. Use stop words and word formation rules to filter and delete atomic words that cannot form compound words;

[0050] C. Traverse the processed atomic words in forward order and construct a directed graph of the atomic word combination relationships. The directed graph is denoted G = <V, E>, where V is the set of atomic words in the text and E records, for each atomic word in V, the set of atomic words adjacent to it;

[0051] D. Use a depth-first traversal algorithm to search the directed graph for all possible compound word combinations, and at the same time use statistical indicators and word formation rules to judge the conditions of word formati...
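Steps C and D can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the n-gram frequency threshold `min_freq`, and the length cap `max_len` are assumptions standing in for the statistical indicators, which the text above does not spell out. Each inner list is assumed to be a run of consecutive atomic words that survived the filtering in step B.

```python
from collections import Counter, defaultdict


def build_graph(runs):
    """Step C: link every atomic word to the words that directly follow it."""
    graph = defaultdict(set)
    for words in runs:
        for left, right in zip(words, words[1:]):
            graph[left].add(right)
    return graph


def count_sequences(runs, max_len):
    """Count every contiguous word sequence of length 2..max_len."""
    counts = Counter()
    for words in runs:
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def find_candidates(runs, min_freq=2, max_len=4):
    """Step D: depth-first search over the graph, pruned by frequency.

    min_freq and max_len are hypothetical placeholders for the
    statistical indicators used to decide word formation.
    """
    graph = build_graph(runs)
    counts = count_sequences(runs, max_len)
    found = set()

    def dfs(path):
        for nxt in sorted(graph.get(path[-1], ())):
            candidate = path + (nxt,)
            if counts[candidate] < min_freq:
                continue  # prune: this branch is too rare to be a compound
            found.add(candidate)
            if len(candidate) < max_len:
                dfs(candidate)

    for start in sorted(graph):
        dfs((start,))
    return found


runs = [
    ["cross", "site", "script", "attack"],
    ["cross", "site", "script"],
    ["stack", "overflow"],
    ["stack", "overflow", "attack"],
]
candidates = find_candidates(runs)
```

Pruning inside the search keeps the traversal shallow, because a rare prefix is never extended; the length cap bounds the recursion when the graph contains cycles.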

Embodiment 2

[0055] This embodiment differs from Embodiment 1 in that the word segmentation system described in step A is ICTCLAS4J, which can be deployed directly on a computer or invoked for word segmentation through its call interface. Figure 3, the processing flowchart of block 1001 in Figure 2, illustrates an embodiment that calls the ICTCLAS4J word segmentation system for the initial word segmentation. The process starts at block 2001, where the domain texts are selected and imported; the domain texts are kept together in a folder on the hard disk. In block 2005, the interface of the word segmentation system is invoked to segment the domain text and tag parts of speech. In block 2009, the word segmentation result is saved in memory.
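The flow of blocks 2001, 2005 and 2009 can be sketched as below. The sketch does not call ICTCLAS4J; the `segment` parameter is a hypothetical hook where a real segmenter would be plugged in, and `tagged_text_segmenter` is only a stand-in that parses text already written as `word/tag` tokens. The `.txt` extension and UTF-8 encoding are also assumptions.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]  # (atomic word, part-of-speech tag)


def tagged_text_segmenter(text: str) -> List[Token]:
    """Stand-in segmenter: splits whitespace-separated 'word/tag' tokens."""
    tokens = []
    for item in text.split():
        word, sep, tag = item.rpartition("/")
        if not sep:  # no tag present, keep the whole item as the word
            word, tag = item, ""
        tokens.append((word, tag))
    return tokens


def segment_folder(
    folder: str, segment: Callable[[str], List[Token]]
) -> Dict[str, List[Token]]:
    """Import domain texts from a folder, segment them, keep results in memory."""
    results = {}
    for path in sorted(Path(folder).glob("*.txt")):   # block 2001: import texts
        text = path.read_text(encoding="utf-8")
        results[path.name] = segment(text)            # block 2005: segment and tag
    return results                                    # block 2009: held in memory
```

Passing the segmenter in as a callable keeps the file handling independent of any particular word segmentation system.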

Embodiment 3

[0057] This embodiment differs from Embodiment 1 in that the stop words described in step B come from a stop word table made up of a number of Chinese characters. The table is stored as a txt file on the computer's hard disk and is read directly into memory when used.

[0058] The word formation rules described in step B include: Rule 1: numerals, pronouns, prepositions, auxiliary words, function words, conjunctions and similar parts of speech do not form compound words; Rule 2: single-character words or nouns followed by numerals do not form compound words; Rule 3: words that already have a complete meaning cannot form compound words; Rule 4: some words can only be used as prefixes; Rule 5: some words can only be used as suffixes; Rule 6: a compound word must contain at least one verb, noun or nominal component; Rule 7: the last word of a compound word must be a verb, noun or nominal component.
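The stop word table and some of the rules above can be sketched as simple predicates. This is an illustration under stated assumptions: the one-entry-per-line file layout, the function names, and the tag prefixes (`m` numeral, `r` pronoun, `p` preposition, `u` auxiliary, `c` conjunction, `n` noun, `v` verb) are assumed rather than taken from the source, and Rules 2 and 3 are left out because they need information the text does not give.

```python
# Assumed part-of-speech tag prefixes; adjust to the tag set actually in use.
EXCLUDED_TAG_PREFIXES = ("m", "r", "p", "u", "c")  # Rule 1
CONTENT_TAG_PREFIXES = ("n", "v")                  # Rules 6 and 7


def load_stop_words(path):
    """Read the stop word table, assuming one entry per line in a txt file."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def can_join(word, tag, stop_words):
    """Step B filter: drop stop words and the parts of speech named in Rule 1."""
    return word not in stop_words and not tag.startswith(EXCLUDED_TAG_PREFIXES)


def is_valid_compound(tokens, prefix_only=frozenset(), suffix_only=frozenset()):
    """Check Rules 4 to 7 on a candidate given as (word, tag) pairs.

    prefix_only and suffix_only are hypothetical word lists for Rules 4 and 5.
    """
    if len(tokens) < 2:
        return False
    words = [word for word, _ in tokens]
    if any(word in prefix_only for word in words[1:]):   # Rule 4
        return False
    if any(word in suffix_only for word in words[:-1]):  # Rule 5
        return False
    # Rule 6 (implied by Rule 7 here, kept to mirror the rule list)
    if not any(tag.startswith(CONTENT_TAG_PREFIXES) for _, tag in tokens):
        return False
    return tokens[-1][1].startswith(CONTENT_TAG_PREFIXES)  # Rule 7
```

Keeping each rule as a separate check makes it straightforward to add, remove or reorder rules for another domain.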

[0059] Figure 4 for fi...


Abstract

The invention belongs to the field of computer natural language processing and relates to a method for discovering compound words in a specific field based on statistics and rules. The method comprises the following steps: performing word segmentation and part-of-speech tagging with a word segmentation system; traversing the segmentation results and filtering them with stop words and word formation rules; traversing the remaining words to generate a directed graph of atomic words; enumerating the possible compound word combinations by depth-first traversal while constraining them with statistical indicators and the word formation rules; generating a candidate set of compound words for manual screening; and importing the accepted compound words into a dictionary file for later use. The method has the following advantages: because a directed graph of atomic words is built and compound word boundaries are found automatically by depth-first traversal, compound words of any length can be identified; the word formation rules are easy to customize and extend and are highly portable; high precision and recall are obtained together, which improves Chinese word segmentation accuracy; and the generated compound words express more precise concepts, which lays a good foundation for deeper research on Chinese information processing.

Description

Technical field

[0001] The invention belongs to the field of computer natural language processing and relates to a method for discovering compound words in a specific field based on statistics and rules.

Background

[0002] Existing general-purpose Chinese word segmentation systems are relatively mature and largely meet ordinary Chinese word segmentation needs, but their ability to segment compound words in specific fields needs to be strengthened. For example, "cross-site scripting", "stack overflow" and "denial of service" can each be regarded as a single word in the field of information security, yet a general word segmentation system produces "cross / v site / v script / n", "stack / ng overflow / v" and "deny / vd service / v". The segmentation results are individual words. Such results often split the domain vocabulary of a specific field into several words, so that the original word string...


Application Information

IPC(8): G06F17/27, G06F17/30
Inventor: 刘毅, 彭涛, 韩波, 邓院林, 曹鹏
Owner: 瑞达信息安全产业股份有限公司