Corpus cleaning method and device
A corpus cleaning technique, applied in the field of language data processing, that addresses wasted computing resources, unclear structural levels of expression, and imprecise quantifier scope, and improves the efficiency of natural language understanding.
Status: Inactive. Publication date: 2019-05-24
GUANGDONG XIAOTIANCAI TECH CO LTD
13 cites, 1 cited by
AI Technical Summary
Problems solved by technology
[0004] It is generally believed that natural language has certain defects for logical understanding: the structural levels of its expression are not clear enough, individualized cognitive models are not clear enough, the scope of quantifiers is not exact, the word order of sentence components is not fixed, and linguistic form and semantics do not correspond. Because of these defects, a computer that tries to understand natural language encounters a large amount of corpus that falls outside its parsing rules, which wastes computing resources.
Examples
Embodiment 1
[0067] In the first embodiment of the present invention, as shown in Figure 1, a corpus cleaning method includes:
[0068] S100: Obtain the sentences in the original corpus and perform syntactic analysis on them to obtain the words, their parts of speech, and the original constituent relationships among them;
[0069] S200: Extract a combination of key relationships from the original constituent relationships, where a key relationship is a combination relationship between sentence components; extract the main body components and main body parts of speech from the combination of key relationships;
[0070] S300: According to the correspondence between a word's part of speech and the main body part of speech, match the word into a main body component; a word that matches successfully is a valid word;
[0071] S400: Remove all words other than the valid words from the original corpus.
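Steps S100 to S400 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the `Token` type, the relation labels (`SBV`, `HED`, `VOB`, `ADV`), the `MAIN_BODY` table, and the rule that a word matches when its part of speech is accepted by the component of its relation are all assumptions made for illustration. The output of S100 is written by hand here; a real system would obtain it from a syntactic parser.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str       # part-of-speech tag
    relation: str  # relation label assigned by a (hypothetical) parser


# Assumed table: key relation -> (main body component, accepted parts of speech).
MAIN_BODY = {
    "SBV": ("subject", {"noun", "pronoun"}),
    "HED": ("predicate", {"verb"}),
    "VOB": ("object", {"noun", "pronoun"}),
}


def extract_key_relations(tokens):
    """S200: keep only the relations that link sentence components."""
    return [t.relation for t in tokens if t.relation in MAIN_BODY]


def match_valid_words(tokens):
    """S300: a word is valid when its part of speech is accepted by the
    main body component associated with its relation."""
    valid = []
    for t in tokens:
        component = MAIN_BODY.get(t.relation)
        if component is not None and t.pos in component[1]:
            valid.append(t)
    return valid


def clean_sentence(tokens):
    """S400: drop every word that is not a valid word."""
    return [t.word for t in match_valid_words(tokens)]


# Hand-written stand-in for the S100 parser output.
parsed = [
    Token("I", "pronoun", "SBV"),
    Token("really", "adverb", "ADV"),
    Token("want", "verb", "HED"),
    Token("apples", "noun", "VOB"),
]
key_relations = extract_key_relations(parsed)
cleaned = clean_sentence(parsed)
print(key_relations)
print(cleaned)
```

In this sketch the adverb "really" is removed because its relation is not one of the assumed key relations, so only the subject, predicate, and object words remain.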
[0072] Specifically, in the present invention, the original c...
Embodiment 3
[0094] In the third embodiment of the present invention, as shown in Figure 3, a corpus cleaning method includes:
[0095] S100: Obtain the sentences in the original corpus and perform syntactic analysis on them to obtain the words, their parts of speech, and the original constituent relationships among them;
[0096] S200: Extract a combination of key relationships from the original constituent relationships, where a key relationship is a combination relationship between sentence components; extract the main body components and main body parts of speech from the combination of key relationships;
[0097] S300: According to the correspondence between a word's part of speech and the main body part of speech, match the word into a main body component; a word that matches successfully is a valid word;
[0098] S301: Count the number of occurrences of the sentence, and add the sentence to the cleaning rule base when it is greater than a preset va...
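The source text of S301 is cut off, so the following Python sketch only illustrates the general idea of counting sentence occurrences and promoting frequent ones into a rule base. The `CleaningRuleBase` class, its `threshold` parameter, and the strict "greater than" comparison against a count threshold are assumptions, not details confirmed by the patent.

```python
from collections import Counter


class CleaningRuleBase:
    """Hypothetical rule base that admits a sentence once it has been
    seen more often than an assumed count threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()
        self.rules = set()

    def observe(self, sentence):
        """Record one occurrence; return True if the sentence is in the rule base."""
        self.counts[sentence] += 1
        if self.counts[sentence] > self.threshold:
            self.rules.add(sentence)
        return sentence in self.rules


rule_base = CleaningRuleBase(threshold=2)
results = [rule_base.observe("I want apples") for _ in range(3)]
print(results)
print(rule_base.rules)
```

With a threshold of 2, the sentence enters the rule base on its third occurrence and stays there for later lookups.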
Abstract
The invention relates to the technical field of language data processing and provides a corpus cleaning method and device. The method comprises: obtaining sentences in an original corpus and performing syntactic analysis on them to obtain the words, their parts of speech, and the original constituent relationships in the sentences; extracting a combination of key relationships from the original constituent relationships, where a key relationship is a combination relationship between sentence components; extracting the main body components and main body parts of speech from the combination of key relationships; matching the words into the main body components according to the correspondence between the words' parts of speech and the main body parts of speech, and obtaining valid words where the matching succeeds; and removing all words other than the valid words from the original corpus. By removing all words other than the valid words, the method cleans out invalid corpus that does not conform to the recognition rules, which improves the efficiency with which a computer understands natural language.
Description
Technical field
[0001] The invention relates to the technical field of language data processing, in particular to a method and device for cleaning a corpus.
Background
[0002] With the gradual development of wearable devices, smart homes, the Internet of Things, and related fields, building an intelligent lifestyle in every respect has become a current focus, and human-computer interaction has gradually become a key link in realizing it. Traditional interaction relies on programmers entering computer-language instructions so that the terminal can understand the user's intentions, which prevents ordinary users from interacting with the terminal in any depth.
[0003] Some existing artificial intelligence products, such as Microsoft Cortana, Apple Siri, and Xiaomi Xiaoai, support simple interaction with ordinary users by recognizing the natural language the user inputs and understanding its semantics. Further, by...