Meaningful string identification method and device

A recognition method and algorithm technology, applied in the fields of instruments, computing, and electrical digital data processing, that addresses the low accuracy of meaningful string extraction and improves the probability of correct recognition.

Active Publication Date: 2014-06-18
ALIBABA GRP HLDG LTD

AI Technical Summary

Problems solved by technology

[0013] The technical problem addressed by this application is to provide a meaningful string identification method and device that solve the problem of low accuracy in meaningful string extraction.


Examples


Specific embodiments

[0044] A specific implementation of the word segmentation algorithm is as follows:

[0045] Split the text into sentences at sentence separation marks (such as periods, exclamation marks, and question marks). Read a sentence and obtain its possible candidate strings. If a candidate string contains a separator, filter that candidate string out. Continue with the next sentence and repeat this processing until all sentences have been processed.

[0046] A specific implementation of the n-gram segmentation algorithm is as follows: read the parameters N1 and N2, where N1 is the minimum and N2 the maximum number of words in a segment; split the text into sentences at sentence separation marks (such as periods, exclamation marks, and question marks); then extract every candidate string of n words from each sentence, with n ranging from N1 to N2.
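The extraction described in [0045] and [0046] can be sketched roughly as follows. This is a minimal illustration and not the patent's implementation: the function names, the separator sets, and the choice of single characters as the segmentation unit are all assumptions made for the example (the source's own separator table is truncated).

```python
import re

# Assumed sets for illustration only; the source's separator table is not reproduced here.
SENTENCE_MARKS = "。！？.!?"
SEPARATORS = set("，、；：,;: ")


def split_sentences(text, marks=SENTENCE_MARKS):
    """Split text at sentence separation marks, dropping empty pieces."""
    pattern = "[" + re.escape(marks) + "]"
    return [s for s in re.split(pattern, text) if s]


def ngram_candidates(text, n1, n2, separators=SEPARATORS):
    """Return every n-unit substring (n1 <= n <= n2) that contains no separator.

    Units are single characters here, which is an assumption of this sketch.
    """
    candidates = []
    for sentence in split_sentences(text):
        for n in range(n1, n2 + 1):
            for start in range(len(sentence) - n + 1):
                candidate = sentence[start:start + n]
                # Filter out candidates that contain a separator inside them.
                if not any(ch in separators for ch in candidate):
                    candidates.append(candidate)
    return candidates


candidates = ngram_candidates("ab,cd.ef", 1, 2)
```

Candidates that straddle a separator, such as "b," in the toy input, are dropped, which mirrors the filtering rule in [0045].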

[0047] The separator string can be preset, including the characters in Table...

Example 2

[0058] For example, the original corpus is: "Zuo Zhuan, Three Kingdoms... are all historical classics."

[0059] Still using the n-gram segmentation algorithm, the valid candidate strings extracted are: "Zuo Zhuan", "Three Kingdoms", "Guozhi", "Three Kingdoms", "Etc.", "Du", "Shili", "History", "all classics", "etc are all", "all calendars", "is history", "historical classics", "historical classics", "etc are all calendars", "all historical", "are historical classics", and "historical classics".

[0060] Step 102: a statistics step, which computes statistics on the distribution of adjacent separator strings of each valid candidate string in the original corpus;

[0061] Optionally, the statistical result of the distribution of adjacent separator strings of a valid candidate string in the corpus (also referred to herein as the separator string score) is the total number of adjacent separator strings over all instances of the valid candidate string. Generally, in a corpus, a certain...
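One possible reading of the separator string score is sketched below: for every occurrence of a candidate in the corpus, count whether the character immediately to its left or right is a separator. The function name `separator_scores`, the restriction to single-character separators, and the decision not to count the start or end of the corpus as a separator are assumptions of this sketch, not details taken from the source.

```python
def separator_scores(corpus, candidate, separators):
    """Return (left, right): how many occurrences of `candidate` have a
    separator character immediately before / immediately after them."""
    left = right = 0
    start = corpus.find(candidate)
    while start != -1:
        end = start + len(candidate)
        if start > 0 and corpus[start - 1] in separators:
            left += 1
        if end < len(corpus) and corpus[end] in separators:
            right += 1
        # Continue the search one position further so overlapping
        # occurrences are also counted.
        start = corpus.find(candidate, start + 1)
    return left, right


left_score, right_score = separator_scores(",ab,ab.cab,", "ab", set(",."))
```

In the toy corpus, "ab" occurs three times; two occurrences follow a separator and all three precede one, so the scores are 2 and 3.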

Example 7

[0085] Example 7: Example of judging meaningful strings

[0086] Suppose that the valid candidate strings and their left- and right-neighbor separator string scores are as follows:

[0087]

Valid candidate string | Left-neighbor separator string score (L) | Right-neighbor separator string score (R)
Three Kingdoms         | 17                                       | 0
Three Kingdoms         | 14                                       | 6
Zuo Zhuan              | 14                                       | 10
Etc.                   | 1                                        | 0

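[0088] Suppose the discriminant formula is:

$$F(L, R) = \begin{cases} 1, & (L > 5) \text{ and } (R > 5) \\ 0, & (L \le 5) \text{ or } (R \le 5) \end{cases}$$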

[0089] That is, F(L, R) = 1 when (L > 5) and (R > 5), and F(L, R) = 0 when (L ≤ 5) or (R ≤ 5). F(L, R) = 1 indicates a meaningful string and F(L, R) = 0 indicates that the string is not meaningful; L is the left-neighbor separator string score and R is the right-neighbor separator string score. In other words, a valid candidate string whose left- and right-neighbor separator string scores are both greater than 5 is a meaningful string; otherwise it is not.
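As a small illustration, the rule in [0089] can be applied to the scores listed in [0087]. The function name `is_meaningful` and the `threshold` parameter are hypothetical; the threshold of 5 and the score values come from the example itself.

```python
def is_meaningful(left_score, right_score, threshold=5):
    """F(L, R): 1 if both scores exceed the threshold, otherwise 0."""
    return 1 if left_score > threshold and right_score > threshold else 0


# (candidate, L, R) rows as listed in [0087]; the two "Three Kingdoms"
# rows are distinct candidates that share one English rendering.
rows = [
    ("Three Kingdoms", 17, 0),
    ("Three Kingdoms", 14, 6),
    ("Zuo Zhuan", 14, 10),
    ("Etc.", 1, 0),
]

results = [(name, is_meaningful(l, r)) for name, l, r in rows]
for name, flag in results:
    print(name, "meaningful" if flag else "not meaningful")
```

Under this rule the candidates with scores (14, 6) and (14, 10) are judged meaningful, while those with scores (17, 0) and (1, 0) are not, because their right-neighbor scores do not exceed 5.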

[0090]

[0091] This shows that...


Abstract

The invention relates to a meaningful string identification method and device. The method comprises: an extraction step of extracting valid candidate strings from a corpus; a statistics step of computing statistics on the distribution of the adjacent separator strings of the valid candidate strings in the original corpus, where the separator strings are predefined characters or character combinations; and a judging step of determining which of the valid candidate strings are meaningful strings according to the statistical result of the distribution of their adjacent separator strings in the corpus. The method and device can improve the accuracy of mining meaningful strings.

Description

Technical field

[0001] This application relates to the technical field of text information processing, and in particular to a method and device for identifying meaningful strings.

Background

[0002] In recent years, with the growing popularity of the Internet, the volume of electronic text resources has been expanding day by day, and the information contained in those texts has increased accordingly. To retrieve and mine valuable information from large amounts of data, the research and business communities have vigorously developed various text processing and data mining techniques. These methods are often based on words, so the automatic discovery of new words is an important part of text processing and data mining.

[0003] New word discovery refers to automatically or semi-automatically obtaining, from text, vocabulary that has not been registered in the thesaurus.

[0004] The current research methods for n...


Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F16/3329
Inventor: 刘健
Owner: ALIBABA GRP HLDG LTD