Corpus cleaning method and device
A corpus cleaning technique, applied in the field of language data processing, that addresses the problems of wasted computing resources, unclear hierarchy in expression structure, and imprecise quantifier scope, and improves the efficiency of language understanding.
Examples
Example 1
[0067] The first embodiment of the present invention, as shown in Figure 1, is a corpus cleaning method comprising:
[0068] S100: Obtain the sentences in the original corpus and perform syntactic analysis on them to obtain the words, their parts of speech, and the original compositional relationships between them;
[0069] S200: Extract a key relationship combination from the original compositional relationships, where a key relationship is a combination relationship between sentence components; then extract the subject components and subject parts of speech from the key relationship combination;
[0070] S300: According to the correspondence between each word's part of speech and the subject part of speech, match the word to a subject component; a word that matches successfully is a valid word;
[0071] S400: Remove all words other than the valid words from the original corpus.
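Steps S100 to S400 can be illustrated with a minimal Python sketch. It is one possible reading of the steps, not the patented implementation. It assumes the syntactic analysis of S100 has already produced, for each word, a part of speech and a relationship label; the `Token` type, the `KEY_RELATIONS` table, the label names, and the `clean_sentence` function are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One word as produced by an (assumed) upstream syntactic analysis, S100."""
    text: str
    pos: str       # part of speech, e.g. "NOUN"
    relation: str  # compositional relationship label, e.g. "nsubj"


# Hypothetical key relationship combination (S200): each subject component,
# identified here by a relationship label, maps to the parts of speech it accepts.
KEY_RELATIONS = {
    "nsubj": {"NOUN", "PRON", "PROPN"},
    "root": {"VERB"},
    "obj": {"NOUN", "PROPN"},
}


def clean_sentence(tokens, key_relations=KEY_RELATIONS):
    """Return the valid words of a sentence, in their original order.

    A word is valid when its relationship is one of the key relationships and
    its part of speech is accepted by that subject component (S300).
    All other words are dropped (S400).
    """
    valid = []
    for tok in tokens:
        allowed_pos = key_relations.get(tok.relation)
        if allowed_pos is not None and tok.pos in allowed_pos:
            valid.append(tok.text)
    return valid


sentence = [
    Token("please", "INTJ", "discourse"),
    Token("I", "PRON", "nsubj"),
    Token("really", "ADV", "advmod"),
    Token("want", "VERB", "root"),
    Token("coffee", "NOUN", "obj"),
]
cleaned = clean_sentence(sentence)
```

In this sketch the filler words "please" and "really" are discarded because their relationship labels are not in the assumed key relationship table. A real system would obtain the tokens from a parser and derive the table from the corpus.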
[0072] Specifically, in the present invention, the original c...
Example 3
[0094] The third embodiment of the present invention, as shown in Figure 3, is a corpus cleaning method comprising:
[0095] S100: Obtain the sentences in the original corpus and perform syntactic analysis on them to obtain the words, their parts of speech, and the original compositional relationships between them;
[0096] S200: Extract a key relationship combination from the original compositional relationships, where a key relationship is a combination relationship between sentence components; then extract the subject components and subject parts of speech from the key relationship combination;
[0097] S300: According to the correspondence between each word's part of speech and the subject part of speech, match the word to a subject component; a word that matches successfully is a valid word;
[0098] S301: Count the number of occurrences of the statement, and add the statement to the cleaning rule base when the count is greater than a preset va...
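The counting idea in S301 can be sketched as follows. The source text is truncated here, so this is only a guess at the general shape: the `CleaningRuleBase` class, its `threshold` parameter, and the choice to key the counts on the raw statement string are assumptions made for illustration, not details from the patent.

```python
from collections import Counter


class CleaningRuleBase:
    """Hypothetical rule base that admits a statement once it has been seen often enough."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed preset threshold
        self.counts = Counter()     # occurrences seen so far, per statement
        self.rules = set()          # statements promoted to cleaning rules

    def observe(self, statement):
        """Record one occurrence and report whether the statement is now a rule."""
        self.counts[statement] += 1
        # Strictly "greater than", matching the wording of S301.
        if self.counts[statement] > self.threshold:
            self.rules.add(statement)
        return statement in self.rules


rule_base = CleaningRuleBase(threshold=2)
history = [rule_base.observe("turn on the light") for _ in range(3)]
```

With a threshold of 2, the statement becomes a rule on its third occurrence, so frequently seen patterns can be reused without repeating the full analysis.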