75 results about "Sentence length" patented technology

In English grammar, sentence length refers to the number of words in a sentence. Most readability formulas use the number of words in a sentence to measure its difficulty. Yet in some cases, a short sentence can be harder to read than a long one.
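As a rough illustration of the definition above, sentence length can be measured by splitting text into sentences and counting whitespace-separated words. This is a minimal sketch: the function name `sentence_lengths` and the naive split on terminal punctuation are assumptions made for the example, and abbreviations such as "e.g." would confuse it.

```python
import re


def sentence_lengths(text: str) -> list[int]:
    """Return the number of whitespace-separated words in each sentence."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


lengths = sentence_lengths("Short one. This sentence has five words.")
print(lengths)
```

A readability formula would typically combine these counts with other features, which is why a short sentence can still be hard to read.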

Online traditional Chinese medicine text named entity identifying method based on deep learning

The invention discloses an online traditional Chinese medicine text named entity identifying method based on deep learning. The method includes the following steps: online traditional Chinese medicine text data are obtained through a web crawler, and named entities in the obtained data are labeled with existing terminological dictionaries and human assistance; a word2vec tool is used to learn from large-scale unlabeled linguistic data, and fixed-length word vectors are obtained and used to form a corresponding glossary; word segmentation is carried out on the online traditional Chinese medicine text data, words are converted into the fixed-length word vectors by looking them up in the glossary, the word vectors serve as input of a convolutional neural network, and a blank character is used for filling when sentence length is insufficient; output of the convolutional neural network serves as input of a bidirectional long short-term memory recurrent neural network, and an identification result for the words to be identified is output. Compared with traditional named entity identifying methods, the method reduces the complexity and workload of feature extraction, simplifies processing and remarkably improves identification efficiency.
Owner:SOUTH CHINA UNIV OF TECH
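The padding step in the abstract above (filling with a blank character when a sentence is too short) might look like the sketch below. The token `<blank>`, the function name `pad_tokens`, and the truncation of over-long sentences are assumptions for illustration; the abstract only describes filling short sentences.

```python
PAD = "<blank>"  # hypothetical padding token, not named in the abstract


def pad_tokens(tokens: list[str], max_len: int, pad: str = PAD) -> list[str]:
    """Bring a token list to exactly max_len entries."""
    # Assumption: sentences longer than max_len are truncated.
    clipped = tokens[:max_len]
    # Fill the remaining positions with the blank token.
    return clipped + [pad] * (max_len - len(clipped))


print(pad_tokens(["黄芪", "补气"], 4))
```

Every sentence then has the same length, so a batch can be stacked into a fixed-size input for the convolutional network.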

Chinese network review emotion classification method based on integrated study frame

The invention discloses a Chinese network review emotion classification method based on an integrated study frame. According to the method, a part-of-speech combination mode, an order-preserving sub-matrix mode and a frequent word sequence mode are adopted as input characteristics, in the level of characteristics, factors of the influence of Chinese word order information, interval phrase characteristics and the sentence length are considered, and the characteristic vector sparsity problem is solved through semantic similarities; the problem that many review text characteristics exist is solved, the inter-base-classifier independence is guaranteed, and the classification performance of base classifiers is improved as much as possible; a base classifier algorithm constructed based on product attributes is adopted to comprehensively review emotion information of each attribute in a text, and then the sentence-level emotional tendency of reviews is judged, so that a final classification result is more accurate. The Chinese network review emotion classification method based on the integrated study frame is applicable to e-commerce network review emotion classification in various fields, can make a potential consumer know evaluation information of a commodity before purchase and can also make a merchant better sufficiently know the consumer's opinion, and therefore the service quality is improved.
Owner:NANJING SILICON INTELLIGENCE TECH CO LTD

Translation confidence scores

A confidence scoring system can include a model trained using features extracted from translations that have received user translation ratings. The features can include, e.g., sentence length, the number of out-of-vocabulary or rare words, language model probability scores of the source or translation, or a semantic similarity between the source and a translation. Parameters of the confidence model can then be adjusted based on a comparison of the confidence model output and user translation ratings, where the user translation ratings can be selected or weighted based on a determination of individual user fluency. After the confidence model has been trained, it can produce confidence scores for new translations. If a confidence score is higher than a threshold, it can indicate the translation should be selected for automatic presentation to users. If the confidence score is below another threshold, it can indicate the translation should be updated.
Owner:META PLATFORMS INC
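The two-threshold decision at the end of the abstract above can be sketched as follows. The threshold values, the function name `route_translation`, and the "hold" outcome for scores between the thresholds are all assumptions; the abstract does not say what happens in the middle band.

```python
def route_translation(score: float,
                      present_above: float = 0.8,
                      update_below: float = 0.3) -> str:
    """Map a confidence score to an action using two thresholds."""
    if score > present_above:
        return "present"  # confident enough to show automatically
    if score < update_below:
        return "update"   # low confidence: translation should be updated
    return "hold"         # assumption: no action for the middle band


for s in (0.9, 0.5, 0.1):
    print(s, route_translation(s))
```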

Chinese reading difficulty grading method and system based on machine learning

Inactive; CN107506346A; Accurately reflect structural properties; Character and pattern recognition; Natural language data processing; Chinese characters; Feature set
The invention discloses a Chinese reading difficulty grading method and system based on machine learning. In the grading method, training samples can be updated in real time so that the feature that language changes along with times is taken into consideration and therefore the Chinese difficulty grading table and the word frequency table can be updated. The introduction of features such as semantic, sentences, texts and subjects makes it more objective by using the above features, sentence length and word length as the index of complexity and therefore structural property can be accurately reflected. By using feature set to make up for the lack of a few shallow local linguistic features, it can reflect the real process of reading comprehension and classify the reading difficulty level more accurately. By this method, the reading difficulty grading technique can be applied to Chinese, which accords with the language characteristics of Chinese. The grading system comprises a text obtaining unit, a constructing unit and a training and predicting unit, which realizes the same beneficial effects of the grading method for Chinese text reading difficulty.
Owner:北京享阅教育科技有限公司

Method and system for filtering bilingualism corpora

The invention discloses a filtering method for a bilingual corpus, comprising the following steps: A. the ratio eigenvalue of the sentence lengths of an English-Chinese bilingual sentence pair is determined; B. the number of words of each part of speech in the English-Chinese bilingual sentence pair is counted, the number of those words that match corresponding entries in a bilingual inter-translation dictionary is calculated, and the interpretation eigenvalue is determined from the part-of-speech counts and the matching counts; C. filtering and classification are carried out on the ratio eigenvalue of the sentence length and the interpretation eigenvalue, using a classification model established in advance from a training set. The invention also discloses a corresponding bilingual corpus filtering system. The method and system improve the universality, accuracy and recall of corpus filtering.
Owner:BEIJING KINGSOFT SOFTWARE +2
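A sentence-length ratio feature for an English-Chinese pair could be computed as in the sketch below. Counting English words against Chinese characters in the CJK Unified Ideographs range is an assumption made for the example; the abstract does not state the unit of length, and the classification model that consumes the feature is not shown.

```python
def length_ratio(english: str, chinese: str) -> float:
    """Ratio of English word count to Chinese character count."""
    en_len = len(english.split())
    # Count only characters in the basic CJK Unified Ideographs block.
    zh_len = sum(1 for ch in chinese if "\u4e00" <= ch <= "\u9fff")
    if zh_len == 0:
        return float("inf")  # no Chinese text: flag as an extreme ratio
    return en_len / zh_len


print(length_ratio("I love you", "我爱你"))
```

A pair whose ratio is far from the typical value for the corpus is a candidate for filtering, subject to whatever the trained classifier decides.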

Text similarity calculation method and device, computer equipment and computer storage medium

The invention discloses a text similarity calculation method and device, and relates to the technical field of text processing, which can accurately calculate the similarity between texts in a text with complex expression. The method comprises the steps of obtaining training word segmentation corpora obtained after word segmentation is conducted on text corpora with different sentence lengths; inputting the training word segmentation corpora as training data into a supervision model for training, and constructing a sentence vector conversion model which is used for converting sentences in the text corpus into sentence vectors for representing text characteristics; adjusting characteristic parameters in the sentence vector conversion model according to the sentence vector which is obtained by training and represents the text characteristics; based on the adjusted sentence vector conversion model, performing sentence vector conversion on the plurality of target texts to obtain a plurality of sentence vectors representing the characteristics of the target texts; and calculating the similarity among the plurality of target texts according to the plurality of sentence vectors representing the characteristics of the target texts.
Owner:PING AN TECH (SHENZHEN) CO LTD

Text abstraction method based on TF-IDF

The invention discloses a text abstraction method based on TF-IDF, which comprises the following steps of: carrying out Chinese word segmentation; removing unused words; computing TF-IDF of the words; computing the TF-IDF of the sentences; calculating position characteristics of the sentences; calculating the importance degree of the sentences; screening key sentences; outputting the text abstract; and taking the TF-IDF value of the keyword contained in the sentence as a weight, and giving different weights to the core word keyword and the general keyword. Meanwhile, in order to prevent the influence of the sentence length inconsistency on the result, a sliding window is introduced, the importance degree of the maximum sliding window in the sentences is used as the sentence importance degree, the sentences are ranked by combining the characteristics of the sentence length, the sentence position and the like, and a good effect is achieved on a plurality of corpora.
Owner:BEIJING UNIV OF TECH
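The sliding-window idea in the abstract above can be sketched as taking the largest sum of per-word weights over any fixed-size window, so that long sentences do not win simply by containing more words. The function name `window_importance` and the use of a plain sum inside the window are assumptions; the per-word weights stand in for TF-IDF values computed elsewhere.

```python
def window_importance(weights: list[float], window: int) -> float:
    """Largest total weight over any run of `window` consecutive words."""
    if not weights:
        return 0.0
    if len(weights) <= window:
        return sum(weights)  # sentence shorter than the window
    current = sum(weights[:window])
    best = current
    for i in range(window, len(weights)):
        # Slide by one word: add the new weight, drop the oldest.
        current += weights[i] - weights[i - window]
        best = max(best, current)
    return best


print(window_importance([1.0, 0.0, 0.0, 3.0, 2.0, 0.0], 2))
```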

Automatic generating method and device for Internet headlines

The invention discloses an automatic generating method and device for Internet headlines. The method comprises the following steps: dividing the main body of a piece of news, keeping sentences with the length within a preset length range, and marking the sentences as kept sentences; respectively computing the Similarity (s) between each kept sentence and a headline of the news and the Weight (s) of each kept sentence; computing the rank score of each kept sentence according to a formula that Rank (s) is equal to Weight (s) / Similarity (s); selecting the kept sentence with the highest rank score as the headline, wherein Rank (s) is the rank score of the kept sentence. The method and the device automatically recognize a sentence which can best reflect the value of a piece of news, and take the sentence as the headline of the news.
Owner:广州索答信息科技有限公司

Song ci poetry text message hiding technology based on hybrid encryption

The invention provides a Song ci poetry text message hiding technology based on hybrid encryption, which belongs to the information hiding and data security directions in the field of computers. The Song ci poetry text message hiding technology comprises the steps of encrypting secret information to be hidden by using an advanced encryption standard (AES) in a hybrid manner, encrypting an AES secret key by using an elliptic curve cryptography (ECC) algorithm, passing all information after encryption processing through a 140 tune name template library of the complete collection of Song ci poetry, and hiding the information by means of the system which is composed of templates, a dictionary, a steganographic device and an extractor, wherein the system can generate steganographic Song ci poetry through a random selection or template designation method according to the length of a cryptograph, and the sentence length, grammatical style and intonation sentence pattern of the steganographic Song ci poetry conform to the original Song ci poetry completely, thereby achieving the purposes of obfuscating attackers and ensuring secure transmission of the hidden information. The Song ci poetry text message hiding technology disclosed by the invention can solve the security problem of data transmission in channels, can provide double security measures of information hiding and data encryption, and has high practical application value.
Owner:NANJING UNIV OF AERONAUTICS & ASTRONAUTICS

Method for grading Chinese electronic document reading on the Internet

The invention discloses a method for grading Chinese electronic document reading on the Internet, comprising firstly determining the frequency distributions of Chinese characters, word groups and sentence structure indexes in different grades of documents; selecting the Chinese characters and the word groups for grading document reading, and avoiding the interference of often-used words and little-used words, then analyzing the word composition of a to-be-graded target document, analyzing the document to be a two-tuple vector (of words and occurrence number); calculating the sentence structure indexes of the document comprising an average paragraph length, an average sentence length, the length difference between the longest sentence and the shortest sentence and the like; and finally using the Naive Bayes method for determining the reading grade of the document based on the word composition information and the sentence structure information of the Chinese document. The reading grade of a Chinese electronic document is efficiently determined by analyzing the Chinese characters and word group composition of the document, combining with the sentence structures of the document, reasoning from the frequency distribution of each word and the structure indexes in different reading grades of documents and applying the Naive Bayes method.
Owner:NANJING UNIV
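The sentence structure indexes named in the abstract above (average paragraph length, average sentence length, and the gap between the longest and shortest sentence) can be computed as in this sketch. Treating each non-empty line as a paragraph, splitting sentences on 。！？, and measuring length in characters are assumptions; the Naive Bayes classifier that uses these indexes is not shown.

```python
import re


def structure_indexes(document: str) -> dict[str, float]:
    """Compute simple length-based structure indexes for a Chinese text."""
    paragraphs = [p.strip() for p in document.split("\n") if p.strip()]
    sentences = [s.strip() for p in paragraphs
                 for s in re.split(r"[。！？]", p) if s.strip()]
    lengths = [len(s) for s in sentences]
    if not lengths:
        return {"avg_paragraph_length": 0.0,
                "avg_sentence_length": 0.0,
                "length_spread": 0.0}
    return {
        # Paragraph length counts punctuation; sentence length does not.
        "avg_paragraph_length": sum(len(p) for p in paragraphs) / len(paragraphs),
        "avg_sentence_length": sum(lengths) / len(lengths),
        "length_spread": float(max(lengths) - min(lengths)),
    }


print(structure_indexes("今天天气好。我们去公园玩！\n你呢？"))
```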

Online service anomaly detection method based on log semantic analysis

The invention relates to an online service anomaly detection method based on log semantic analysis. According to the method, TextCNN is improved, and a variable convolution and pooling convolutional neural network (VCNN) is provided for log classification. The VCNN provided by the invention mainly considers the influence of the word embedding dimension. The multi-convolution and multi-pooling operation performed by the method is applied not only to sentence length but, even more, to the word embedding dimension, to extract rich semantic feature information and make up for the deficiency of semantic information in the word embedding dimension. In addition, the average pooling operation proposed in a pooling layer facilitates the storage of the important feature information of the extracted features. The core of the method is log classification based on log semantics and service completion strength, and performance anomaly classification based on improved Bayesian. The method is mainly suitable for detecting online service exceptions, uses logs to infer the log cluster influencing system performance, and can support exception detection and searching for the logs related to the influence on service performance.
Owner:HANGZHOU DIANZI UNIV

Electronic medical record entity relationship extraction method based on shortest dependency subtree

The invention provides an electronic medical record entity relationship extraction method based on a shortest dependency subtree. The method comprises the following steps: firstly, extracting an entity-based shortest subtree from an original sentence through dependency syntactic analysis to compress the sentence length; secondly, coding the statements through a bidirectional long short-term memory (BLSTM) neural network, and then coding the statements through the BLSTM neural network; learning final semantic representation of the sentences through a maximum pooling layer (Max Pooling), and finally classifying the sentences through a softmax classifier to obtain an entity relationship. According to the method, noise words can be deleted and statement lengths compressed. Meanwhile, the key words representing the relations between the entities are completely preserved, so that the semantic relations of the compressed statements are clearer. The problem that existing electronic medical record entity relation extraction models cannot represent the semantic information of statements well when the statements are too long is solved, and the performance of the relation extraction model is improved.
Owner:SICHUAN UNIV

Sentence alignment method for bilingual parallel corpuses

The invention provides a sentence alignment method for bilingual parallel corpuses. The method comprises the following steps of: A, obtaining a bilingual probability distribution dictionary comprising word inter-translation pairs and word inter-translation probabilities of a source language and a target language; B, constructing a dynamic plan matrix according to the quantities of sentences of the source language and target language of a to-be-aligned text, and determining evaluation scores on the basis of sentence length information, word information and the word inter-translation probability under different alignment modes according to the dynamic plan matrix and the bilingual probability distribution dictionary; C, determining an alignment path under the alignment mode, the evaluation score of which is greater than an appointed threshold value, according to the evaluation score; and D, determining an alignment path sequence of sentences of the source language and target language of the to-be-aligned text according to the alignment path. The sentence alignment method for bilingual parallel corpuses is beneficial for improving the automatic sentence alignment precision of bilingual parallel corpuses.
Owner:北京同文世纪科技有限公司

System and method for automatically generating keywords

An information handling system is disclosed for generating tags for a file, such as a document or a webpage posting. Generating tags for a file includes converting a webpage posting to a PDF document. The method further includes extracting tags provided by users. The method includes scanning data extracted from a glossary PDF document to identify keywords of the glossary PDF document in accordance with a sentence length. The method further includes extracting data from the PDF document and scanning the extracted data to identify keywords of the PDF document in accordance with a sentence length. The method further includes reapplying selected keywords to the tags of the file.
Owner:DELL PROD LP

Word segmentation method based on Bi-LSTM

The invention discloses a word segmentation method based on Bi-LSTM. The method includes the steps that training corpus data is converted into character-level corpus data; the corpus data is segmented according to sentence length, and several sentences are obtained; then the obtained sentences are grouped according to the sentence length, and a data set comprising n groups of sentences is obtained; several pieces of data are extracted from the data set to serve as iteration data; the data for each iteration is converted into a fixed-length vector and sent into a deep learning model Bi-LSTM, and parameters of the deep learning model Bi-LSTM are trained; when the change in the loss value between iterations is smaller than a set threshold, the loss no longer decreases, or the maximum number of iterations is reached, training of the deep learning model is terminated, and the trained deep learning model Bi-LSTM is obtained; corpus data to be predicted is converted into character-level corpus data, the character-level corpus data is sent into the trained deep learning model Bi-LSTM, and a word segmentation result is obtained.
Owner:北京知道未来信息技术有限公司
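Grouping sentences by length, as the abstract above describes, keeps sentences of similar length together so that little padding is needed within a batch. The sketch below is one simple way to do it; the bucket width of 10 characters and the function name `group_by_length` are assumptions, since the abstract does not say how the n groups are defined.

```python
from collections import defaultdict


def group_by_length(sentences: list[str],
                    bucket_width: int = 10) -> dict[int, list[str]]:
    """Group character-level sentences into buckets of similar length."""
    groups: dict[int, list[str]] = defaultdict(list)
    for sentence in sentences:
        # Bucket 0 holds lengths 0-9, bucket 1 holds 10-19, and so on.
        groups[len(sentence) // bucket_width].append(sentence)
    return dict(groups)


print(group_by_length(["ab", "abc", "abcdefghijkl"]))
```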

Instance-based sentence boundary determination by optimization

A method for instance-based sentence boundary determination optimizes a set of criteria based on examples in a corpus, and provides a general domain-independent framework for the task by balancing a comprehensive set of sentence complexity and quality constraints. The characteristics and style of naturally occurring sentences are simulated through the use of semantic grouping and sentence length distribution. The method is parameterized so that it easily adapts to suit a Natural Language Generation (NLG) system's generation.
Owner:IBM CORP

LSTM-based word segmentation method

The invention discloses an LSTM-based word segmentation method. The method comprises the steps of 1) converting training corpus data into character-level corpus data; 2) dividing the corpus data according to a sentence length to obtain multiple sentences, and according to the sentence length, grouping the obtained sentences to obtain a data set comprising n groups of sentences; 3) extracting multiple pieces of data from the data set to serve as iterative data; 4) converting the iterative data each time into a fixed-length vector, inputting the vector to a deep learning model LSTM, training parameters of the deep learning model LSTM, and when a loss value iterative change generated by the deep learning model is smaller than a set threshold, is no longer reduced or reaches a maximum iterative frequency, stopping the training of the deep learning model, and obtaining a trained deep learning model LSTM; and 5) converting to-be-predicted corpus data into the character-level corpus data, and inputting the character-level corpus data to the trained deep learning model LSTM to obtain a word segmentation result.
Owner:北京知道未来信息技术有限公司
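The stopping condition in step 4 of the abstract above can be read as: stop when the loss improves by less than a threshold, stops improving at all, or the iteration limit is reached. The sketch below encodes that reading; the function name `should_stop` and the comparison of only the two most recent loss values are assumptions.

```python
def should_stop(losses: list[float], threshold: float, max_iters: int) -> bool:
    """Decide whether training should end, given the loss history so far."""
    if len(losses) >= max_iters:
        return True   # maximum number of iterations reached
    if len(losses) < 2:
        return False  # not enough history to measure a change
    improvement = losses[-2] - losses[-1]
    # A small or negative improvement covers both "change below threshold"
    # and "loss no longer decreasing".
    return improvement < threshold


print(should_stop([1.0, 0.5], threshold=0.01, max_iters=100))
print(should_stop([0.5, 0.499], threshold=0.01, max_iters=100))
```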

Method for extracting webpage text based on label path and text punctuation ratio feature fusion

The invention discloses a method for extracting a webpage text based on a label path and a text punctuation ratio feature fusion. The method fuses a text punctuation ratio with label path features to propose a novel feature value, which is used to extract the text from a webpage. The method is characterized in that a text punctuation ratio feature pair is defined to measure the average sentence length of the label path, and at the same time, the position of the label path and its internal complexity are combined to give a more comprehensive feature value to judge the content of the text. By adoption of the method, it is possible to extract the webpage text more accurately without constructing an extraction template, and the application scope is wide.
Owner:SOUTH CHINA AGRI UNIV

LSTM-CNN-based word segmentation method

The invention discloses an LSTM-CNN-based word segmentation method. The method comprises the steps of converting training corpus data into character-level corpus data; performing statistics on characters of the corpus data to obtain a character set and numbering the characters to obtain a character number set; performing statistics on character tags to obtain a tag set, and numbering the tags to obtain a tag number set; dividing corpora according to sentence lengths, and grouping obtained sentences according to the sentence lengths to obtain a data set comprising n groups of sentences; randomly selecting a sentence group from the data set without replacement, wherein the characters of each sentence form a piece of data w, and the corresponding tag set is y; converting the data w into the corresponding number and tag y to be input to a LSTM-CNN of a model, and training parameters of the deep learning model; and converting to-be-predicted data into data matched with the deep learning model, and inputting the data to the trained deep learning model to obtain a word segmentation result.
Owner:北京知道未来信息技术有限公司
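The numbering steps in the abstract above (building a character number set and a tag number set, then converting a sentence w and its tags y to numbers) can be sketched as follows. The B/E/S tag labels and the first-seen numbering order are assumptions; the abstract does not name a tagging scheme.

```python
def build_index(items: list[str]) -> dict[str, int]:
    """Assign consecutive numbers to distinct items in first-seen order."""
    index: dict[str, int] = {}
    for item in items:
        if item not in index:
            index[item] = len(index)
    return index


sentence = list("我爱北京")      # character-level data w
tags = ["S", "S", "B", "E"]      # hypothetical tags y: single, begin, end

char_ids = build_index(sentence)
tag_ids = build_index(tags)
w = [char_ids[c] for c in sentence]
y = [tag_ids[t] for t in tags]
print(w, y)
```

In practice the index would be built over the whole training corpus rather than one sentence, with a reserved number for unseen characters.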

Question and answer processing method and device, computer equipment and readable storage medium

The invention relates to a question and answer processing method and device, computer equipment and a readable storage medium, and the method comprises the steps: obtaining a to-be-answered question, and determining a similar question similar to the to-be-answered question from a question and answer database; converting the similar problem into a similar problem vector; based on any two or more of a literal similarity score, a semantic similarity score, a sentence editing distance score, a sentence length score, a knowledge base similarity score, a hash similarity score, a word vector similarity score, a keyword score, a keyword editing distance score and an entity word editing distance score and a similar problem vector, determining a target question matched with the to-be-answered question from the similar questions; and determining a target answer corresponding to the target question from the question and answer database.
Owner:广东优碧胜科技有限公司

Semantic role recognition method based on phrase structure tree

The invention relates to a semantic role recognition method based on a phrase structure tree. The method comprises the steps of sentence pruning, wherein when a system inputs one sentence, phrase analysis is performed on the sentence, the analyzed result is subjected to pruning through a parenthesis or a coordination structure, the complexity of the sentence is simplified, and the length of the sentence is shortened; clause extracting processing, wherein on the basis of the phrase structure tree, clauses in the pruned sentences are extracted, the extracted clauses and the remaining portion obtained after the clauses are extracted are subjected to semantic role analysis separately, the complete sentence semantic role is obtained, and the analysis result of the semantic role is reduced; boundary correction, wherein the reduced semantic role is combined with the phrase tree to perform predicate argument boundary correction on the sentences, and finally the sentence semantic role analysis result is output. The sentence complexity is simplified, the sentence length is shortened, the complex and long sentence can be effectively processed, and the semantic role labeling condition is improved.
Owner:SHENYANG AEROSPACE UNIVERSITY

Example sentence searching method and system

Active; CN102890723A; Regularize output example sentences; Special data processing applications; User input; Calculation methods
The invention relates to the field of natural language processing, and provides a query-based example sentence searching method. The method comprises the following steps of: obtaining the query input by a user; processing the query input by the user; searching an example sentence library for example sentences matched with the query, and calculating the relevance of the query and the example sentences; adjusting the example sentence relevance scores according to a usage diversity or translation diversity principle, and sorting the example sentences; outputting the example sentences and presenting phrases in the example sentences. The invention further provides a query-based example sentence searching system. According to the scheme provided by the invention, various factors are comprehensively considered in calculating the relevance of the query and the example sentences: specifically, the features of the phrases in the example sentences that relate to the query, the syntactic features, the example sentence structure integrality feature, the sentence length feature and the digital noise feature of punctuations in the example sentences are all taken into account; and the method is superior to other relevance calculation methods.
Owner:深圳宜搜天下科技股份有限公司

Text splicing method and device thereof

The invention discloses a text splicing method and a device thereof. The method comprises the following steps of: obtaining a to-be-spliced current text fragment, obtaining an average sentence length corresponding to the current text fragment, obtaining a first semantic score, in a semantic model, of the current text fragment, and obtaining a second semantic score, in the semantic model, of a candidate sentence comprising the current text fragment, wherein the current text fragment is a starting fragment of the candidate sentence; and splicing the current text fragment according to the average sentence length, the first semantic score and the second semantic score, so as to obtain a target sentence corresponding to the current text fragment. During sentence segmentation, sentence lengths are considered, so that the sentence lengths are proper, long difficult sentences or massive short sentences are avoided, and the sentence lengths are relatively stable; and scoring of the semantic model is also considered, so that the sentence segmentation correctness can be improved, semantic meanings of the sentences are not damaged and the intelligibility of the sentences is improved.
Owner:BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

Training Method Using Specific Audio Patterns and Techniques

A method is disclosed that utilizes specific techniques, based upon empirical study, to significantly increase the ability of a trainee to remain focused on the training materials and subject matter and actually learn and retain the training subject matter. More specifically, the present invention utilizes audio and / or visual (e.g., a personal computer) elements, with a strict set of rules which must be followed regarding sentence length, narrators, and underlying music within the dialog to create a specific rhythmic “feel” to the training. As a result of using such techniques, significantly improved results over prior art training methods can be obtained.
Owner:THE TSI

Translation model optimization method for dynamically adjusting length punishment and translation length

Active; CN111178092A; The optimization method is simple; The optimization method is convenient and effective; Natural language translation; Neural architectures; Data set; Algorithm
The invention discloses a translation model optimization method for dynamically adjusting length punishment and translation length. The method comprises the steps of obtaining standard data in a specified language direction as a standard bilingual data set for various index prediction; performing word segmentation operation on the standard bilingual data set, and performing further training to obtain a new training data set; modifying a neural machine translation model decoder part, and automatically predicting the optimal length punishment value of the current sentence pair; performing length statistics to obtain a target statement sub-length; preparing an independent feedforward neural network model so that a translation finally predicted by the model tends to a translation result with the optimal length; and enabling the Transformer neural machine translation model to dynamically adjust the length penalty and the optimal translation sentence length for different sentences. According to the method, the length punishment and the dynamic adjustment of the translation length in the model translation process are realized, the realization is simple, the method is effective, the practicability is high, and the model translation quality improvement effect is obvious.
Owner:沈阳雅译网络技术有限公司
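To show why a length penalty value matters during decoding, the sketch below uses the widely used GNMT-style penalty, lp(Y) = ((5 + |Y|) / 6)^alpha, with the hypothesis score divided by lp(Y). This formula is swapped in purely for illustration: the abstract above does not state which penalty form it uses, and its per-sentence prediction of the penalty value is not modeled here. The function names are hypothetical.

```python
def length_penalty(length: int, alpha: float) -> float:
    """GNMT-style length penalty; alpha = 0 disables it."""
    return ((5 + length) / 6) ** alpha


def rescore(log_prob: float, length: int, alpha: float) -> float:
    """Length-normalized score of a hypothesis (higher is better)."""
    return log_prob / length_penalty(length, alpha)


short = (-4.0, 4)   # (total log-probability, length in tokens)
long = (-5.0, 10)
for alpha in (0.0, 1.0):
    print(alpha, rescore(*short, alpha), rescore(*long, alpha))
```

With alpha = 0 the shorter hypothesis scores higher, while with alpha = 1 the longer one does, which is the kind of trade-off a per-sentence penalty value would control.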

Mixed-corpus word segmentation method based on Bi-LSTM

The invention discloses a mixed-corpus word segmentation method based on Bi-LSTM. The method includes the steps that training mixed corpus data is converted into character-level corpus data; characters of the corpus data are counted, a character set is obtained, the characters are numbered, and a character number set is obtained; labels of the characters are counted, a label set is obtained, the labels are numbered, and a label number set is obtained; the corpus is segmented according to sentence length, the obtained sentences are grouped according to the sentence length, and a data set is obtained; a sentence group is randomly selected from the data set without replacement, multiple sentences are extracted from the sentence group, the characters of each sentence form data w, and the corresponding label set is y; the data w is converted into corresponding numbers and sent with the labels y into a model Bi-LSTM, and parameters of the deep learning model Bi-LSTM are trained; data to be predicted is converted into data matched with the deep learning model, the data is sent into the trained deep learning model Bi-LSTM, and a word segmentation result is obtained.
Owner:北京知道未来信息技术有限公司
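
The grouping and sampling steps can be sketched as follows. This is a minimal illustration, assuming sentence length is counted in characters and that each length group is visited once per pass; the helper names `group_by_length` and `iter_batches` are hypothetical.

```python
import random
from collections import defaultdict


def group_by_length(sentences):
    """Group sentences so that every group holds sentences of one length."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[len(sentence)].append(sentence)
    return dict(groups)


def iter_batches(groups, batch_size, seed=0):
    """Visit the groups in random order, each exactly once (selection
    without replacement), and yield batches drawn from a single group."""
    rng = random.Random(seed)
    lengths = list(groups)
    rng.shuffle(lengths)
    for length in lengths:
        bucket = groups[length][:]
        rng.shuffle(bucket)
        for start in range(0, len(bucket), batch_size):
            yield bucket[start:start + batch_size]


# Toy mixed Chinese/English corpus, already split into sentences.
sentences = ["我爱NLP", "hello", "你好", "ok", "深度学习"]
batches = list(iter_batches(group_by_length(sentences), batch_size=2))
```

Because every batch comes from one length group, no padding is needed inside a batch.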

Mixed corpus word segment method based on LSTM (Long Short Term Memory)-CNN (Convolutional Neural Network)

The invention discloses a mixed corpus word segment method based on an LSTM (Long Short Term Memory)-CNN (Convolutional Neural Network). The method comprises the steps of converting training mixed corpus data into character-level mixed corpus data; counting the characters of the mixed corpus data to obtain a character set, and numbering each character to obtain a character serial number set; counting the character labels to obtain a label set, and numbering the labels to obtain a label serial number set; segmenting the corpus according to sentence length, and grouping the obtained sentences according to sentence length to obtain a data set; randomly selecting a sentence subgroup from the data set and extracting a plurality of sentences from the sentence subgroup, wherein the characters of each sentence form a datum w and the corresponding label set is y; converting the datum w into the corresponding serial numbers, sending them together with the labels y to the LSTM-CNN model, and training the parameters of the deep learning model; and converting to-be-predicted mixed corpus data into data matched with the deep learning model and sending it to the trained deep learning model to obtain a word segmentation result.
Owner:北京知道未来信息技术有限公司
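
The character and label numbering steps can be sketched as below. The B/M/E tagging scheme used for the labels is an assumption, since the abstract does not name a label set, and `build_index` is a hypothetical helper.

```python
def build_index(items):
    """Assign a serial number to each distinct item in first-seen order."""
    index = {}
    for item in items:
        if item not in index:
            index[item] = len(index)
    return index


# One character-level training sentence with per-character labels.
# Assumed scheme: B = word begin, M = word middle, E = word end.
w = list("深度learning")
y = ["B", "E", "B", "M", "M", "M", "M", "M", "M", "E"]

char_index = build_index(w)    # character -> serial number
label_index = build_index(y)   # label -> serial number

w_ids = [char_index[c] for c in w]
y_ids = [label_index[t] for t in y]
```

In practice the two indices would be built over the whole corpus rather than a single sentence, and an entry for unseen characters would be reserved for prediction time.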

LSTM-based mixed corpus word segmentation method

The invention discloses an LSTM-based mixed corpus word segmentation method. According to the method, training mixed corpus data is converted into character-level mixed corpus data; the mixed corpus data is divided according to sentence length to obtain a plurality of sentences; the obtained sentences are grouped according to sentence length, and a data set comprising n groups of sentences is obtained; a plurality of pieces of data are extracted from the data set to serve as iteration data; the data for each iteration is converted into fixed-length vectors, the vectors are input into the LSTM deep learning model, and the parameters of the model are trained; when the change in the loss value between iterations is smaller than a set threshold, the loss no longer decreases, or the maximum number of iterations is reached, training is terminated and a trained LSTM deep learning model is obtained; and to-be-predicted mixed corpus data is converted into character-level corpus data, the corpus data is input into the trained LSTM model, and a word segmentation result is obtained.
Owner:北京知道未来信息技术有限公司
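
The three termination conditions can be expressed as a small check run after each iteration. This is a sketch under assumptions: "no longer lowered" is read as a single non-decreasing step, and the function name and parameters are hypothetical.

```python
def should_stop(losses, threshold, max_iters):
    """Return True when training should terminate.

    losses    -- loss values recorded so far, one per iteration
    threshold -- minimum improvement between consecutive iterations
    max_iters -- hard cap on the number of iterations
    """
    if len(losses) >= max_iters:
        return True                      # maximum number of iterations reached
    if len(losses) < 2:
        return False                     # nothing to compare yet
    previous, current = losses[-2], losses[-1]
    if current >= previous:
        return True                      # loss is no longer decreasing
    return previous - current < threshold  # improvement below the threshold


# Example checks on made-up loss histories.
still_improving = should_stop([1.0, 0.5], threshold=0.01, max_iters=100)
tiny_change = should_stop([1.0, 0.995], threshold=0.01, max_iters=100)
got_worse = should_stop([1.0, 1.2], threshold=0.01, max_iters=100)
```

A production loop would usually add a patience window so that one noisy step does not end training early.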

Corpus processing method and device and storage medium

The invention discloses a corpus processing method. The corpus processing method comprises the steps: acquiring the phoneme frequency of each phoneme and the sentence length frequency of each sentence in an original corpus, wherein the phoneme frequency of each phoneme represents the number of the same phonemes in the original corpus, and the sentence length frequency of each sentence represents the number of sentences with the same sentence length in the original corpus; and calculating a frequency parameter of each sentence according to the phoneme frequency and the sentence length frequency, and taking the frequency parameter as a score of the sentence, wherein the frequency parameter is in negative correlation with the phoneme frequency, and is in negative correlation with the sentence length frequency. The invention further discloses a corpus processing device and a storage medium. According to the corpus processing method, a reliable standard is provided for corpus selection, so that the reliability of corpus sentence selection during screening can be improved, and the screening efficiency of a large number of text corpora is effectively improved, and the corpus processing method is suitable for large-scale corpus information screening tasks.
Owner:GUANGZHOU DUOYI NETWORK TECH +2
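
One way to realize a score that falls as phoneme frequency and sentence length frequency rise is sketched below. The exact formula (mean inverse phoneme frequency divided by the length frequency) is an assumption chosen for illustration; the abstract only states the two negative correlations. Sentences are assumed to be non-empty lists of phonemes.

```python
from collections import Counter


def score_sentences(corpus):
    """Score each sentence; rarer phonemes and rarer lengths score higher."""
    phoneme_freq = Counter(p for sentence in corpus for p in sentence)
    length_freq = Counter(len(sentence) for sentence in corpus)
    scores = []
    for sentence in corpus:
        # Mean inverse phoneme frequency: drops as the phonemes get more common.
        rarity = sum(1.0 / phoneme_freq[p] for p in sentence) / len(sentence)
        # Divide by how many sentences share this length.
        scores.append(rarity / length_freq[len(sentence)])
    return scores


# Toy corpus: each sentence is a list of (made-up) phoneme symbols.
corpus = [["n", "i"], ["h", "a", "o"], ["n", "a"]]
scores = score_sentences(corpus)
```

Here the three-phoneme sentence ranks highest because its length is unique in the corpus and two of its phonemes occur only once.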