Corpus cleaning method and device

A corpus cleaning technique, applied in the field of language data processing, that addresses the waste of computing resources caused by natural-language defects such as unclear structural levels of expression and imprecise quantifier scope, and that improves the efficiency with which a computer understands natural language.

Status: Inactive
Publication Date: 2019-05-24
GUANGDONG XIAOTIANCAI TECH CO LTD

AI Technical Summary

Problems solved by technology

[0004] It is generally believed that natural language has certain defects with respect to logical understanding: the structural levels of its expressions are not clear enough, individualized cognitive models are not clear enough, the scope governed by quantifiers is not exact, the word order of sentence components is not fixed, and linguistic form and semantics do not correspond. Because of these defects, a computer faces a large amount of corpus that falls outside its parsing rules when it tries to understand natural language, which wastes computing resources.

Examples


Embodiment 1

[0067] The first embodiment of the present invention, as shown in Figure 1, is a corpus cleaning method comprising:

[0068] S100: obtain the sentences in the original corpus and perform syntactic analysis on them to obtain the words, their parts of speech, and the original constituent relationships;

[0069] S200: extract a combination of key relationships from the original constituent relationships, where a key relationship is a combination relationship between sentence components; then extract the main body components and the main body parts of speech from the combination of key relationships;

[0070] S300: according to the correspondence between each word's part of speech and the main body part of speech, match the word into a main body component; a word that matches successfully is a valid word;

[0071] S400: remove all words other than the valid words from the original corpus.
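Steps S100 to S400 can be illustrated with a minimal Python sketch. It is not the patented implementation: the excerpt does not name a parser or a label set, so the sketch assumes the syntactic analysis of S100 has already produced tokens, and the `Token` type, the `KEY_RELATIONS` table, and the labels such as "subject" and "noun" are all hypothetical. It also assumes that a word matches a main body component when both its constituent relation and its part of speech agree with that component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One parsed word (assumed output of S100; all labels are hypothetical)."""
    word: str
    pos: str       # part of speech, e.g. "noun"
    relation: str  # constituent relation assigned by the parser, e.g. "subject"


# Hypothetical key relationship combination:
# main body component -> main body parts of speech accepted for it.
KEY_RELATIONS = {
    "subject": {"noun", "pronoun"},
    "predicate": {"verb"},
    "object": {"noun"},
}


def extract_main_components(tokens, key_relations=KEY_RELATIONS):
    """S200: keep the key relations that actually occur in this sentence."""
    present = {t.relation for t in tokens}
    return {comp: pos for comp, pos in key_relations.items() if comp in present}


def match_valid_words(tokens, components):
    """S300: a word is valid if its part of speech fits its component."""
    return [
        t for t in tokens
        if t.relation in components and t.pos in components[t.relation]
    ]


def clean_sentence(tokens):
    """S400: drop every word that is not a valid word."""
    components = extract_main_components(tokens)
    return [t.word for t in match_valid_words(tokens, components)]


parsed = [
    Token("um", "interjection", "other"),
    Token("I", "pronoun", "subject"),
    Token("really", "adverb", "adverbial"),
    Token("want", "verb", "predicate"),
    Token("music", "noun", "object"),
]
cleaned = clean_sentence(parsed)
```

With the sample tokens, only the words attached to the subject, predicate, and object survive. In practice the relation table would come from whatever grammar the recognition rules are built on.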

[0072] Specifically, in the present invention, the original c...

Embodiment 3

[0094] The third embodiment of the present invention, as shown in Figure 3, is a corpus cleaning method comprising:

[0095] S100: obtain the sentences in the original corpus and perform syntactic analysis on them to obtain the words, their parts of speech, and the original constituent relationships;

[0096] S200: extract a combination of key relationships from the original constituent relationships, where a key relationship is a combination relationship between sentence components; then extract the main body components and the main body parts of speech from the combination of key relationships;

[0097] S300: according to the correspondence between each word's part of speech and the main body part of speech, match the word into a main body component; a word that matches successfully is a valid word;

[0098] S301: count the number of occurrences of the sentence, and add the sentence to the cleaning rule base when the count is greater than a preset va...
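The counting step S301 can be sketched as follows. This is only an illustration under stated assumptions: the excerpt is truncated, so what the cleaning rule base stores and how later steps use it are not shown. The `CleaningRuleBase` class, its `observe` method, and the threshold value are hypothetical names chosen for the example.

```python
from collections import Counter


class CleaningRuleBase:
    """Hypothetical rule base that admits sentences seen often enough."""

    def __init__(self, threshold):
        self.threshold = threshold  # the "preset value" of S301 (assumed integer)
        self.counts = Counter()     # occurrences per sentence
        self.rules = set()          # sentences admitted to the rule base

    def observe(self, sentence):
        """Count one occurrence; admit the sentence once the count exceeds the threshold."""
        self.counts[sentence] += 1
        if self.counts[sentence] > self.threshold:
            self.rules.add(sentence)
        return sentence in self.rules


rule_base = CleaningRuleBase(threshold=2)
results = [rule_base.observe("I want music") for _ in range(3)]
```

The comparison is strictly "greater than", following the wording of S301, so with a threshold of 2 a sentence is admitted on its third occurrence.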

Abstract

The invention relates to the technical field of language data processing and provides a corpus cleaning method and device. The method comprises: obtaining sentences in an original corpus and performing syntactic analysis on them to obtain the words, their parts of speech, and the original constituent relationships in the sentences; extracting a combination of key relationships from the original constituent relationships, where a key relationship is a combination relationship between sentence components; extracting the main body components and main body parts of speech from the combination of key relationships; matching the words into the main body components according to the correspondence between the words' parts of speech and the main body parts of speech, and obtaining valid words when the matching succeeds; and removing all words other than the valid words from the original corpus. By removing everything except the valid words, the method cleans out invalid corpus that does not conform to the recognition rules, which improves the efficiency with which a computer understands natural language.

Description

Technical field

[0001] The invention relates to the technical field of language data processing, in particular to a method and device for cleaning corpus.

Background

[0002] With the gradual development of wearable devices, smart homes, the Internet of Things, and related fields, building a comprehensively intelligent way of life has become a current focus, and human-computer interaction has gradually become a key link in realizing it. The traditional interaction method relies on programmers entering computer language so that the terminal can understand the user's intentions. With this approach, ordinary users cannot interact more deeply with the terminal.

[0003] Some existing artificial intelligence software products, such as Microsoft Cortana, Apple Siri, and Xiaomi Xiaoai, can carry out simple interaction with ordinary users by recognizing the natural language the user enters and understanding its semantics. Further, by...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/215, G06F16/2458
Inventor: 魏誉荧
Owner: GUANGDONG XIAOTIANCAI TECH CO LTD