
223 results about "N-gram" patented technology

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. Depending on the application, the items can be phonemes, syllables, letters, words or base pairs. N-grams are typically collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
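
The definition above can be illustrated with a minimal Python sketch. The function name `ngrams` and the sample strings are illustrative choices, not taken from any of the patents listed here.

```python
from typing import List, Sequence, Tuple


def ngrams(items: Sequence, n: int) -> List[Tuple]:
    """Return every contiguous run of n items, in order of occurrence."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence shorter than n yields an empty list.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word-level bigrams ("shingles") and character-level trigrams.
word_bigrams = ngrams("to be or not to be".split(), 2)
char_trigrams = ngrams("gram", 3)
```

The same function works for any item type (words, letters, bytes, base pairs), because it only relies on slicing the input sequence.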

Unknown malcode detection using classifiers with optimal training sets

The present invention is directed to a method for detecting unknown malicious code, such as a virus, a worm, a Trojan horse or any combination thereof. A data set is created, consisting of a collection of files that includes a first subset of malicious code files and a second subset of benign code files; the malicious and benign files are identified by an antivirus program. All files are parsed using n-gram moving windows of several lengths, and the term frequency (TF) representation is computed for each n-gram in each file. An initial set of top features (e.g., up to 5500) is selected from all n-grams based on the document frequency (DF) measure, and feature selection methods are then used to reduce the number of top features to fit the computational resources required for classifier training. The optimal number of features is determined by evaluating the detection accuracy of several sets of reduced top features. Based on that optimal number, different data sets with different distributions of benign and malicious files are prepared for use as training and test sets. For each classifier, the detection accuracy is evaluated iteratively over all combinations of training and test set distributions: in each iteration, the classifier is trained using a specific distribution and then tested on all distributions. The distribution that yields the highest detection accuracy is selected as optimal for that classifier.
Owner:DEUTSCHE TELEKOM AG
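
The parsing and initial feature selection steps of the abstract above can be sketched as follows. This is a rough illustration under stated assumptions, not the patented method: the abstract does not say how TF is normalised (here, counts are divided by the largest count in the file), and all function names and sample bytes are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def byte_ngrams(data: bytes, n: int) -> List[bytes]:
    """Slide a window of n bytes over the file contents."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def term_frequencies(data: bytes, n: int) -> Dict[bytes, float]:
    """TF per n-gram; normalising by the file's largest count is an assumption."""
    counts = Counter(byte_ngrams(data, n))
    if not counts:
        return {}
    top = max(counts.values())
    return {gram: count / top for gram, count in counts.items()}


def top_by_document_frequency(files: List[bytes], n: int, k: int) -> List[bytes]:
    """Keep the k n-grams that occur in the most files (DF measure)."""
    df = Counter()
    for data in files:
        df.update(set(byte_ngrams(data, n)))  # count each file at most once
    # Sort by DF descending; break ties by byte value for a stable result.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]


# Toy "files"; a real data set would hold labelled benign and malicious binaries.
files = [b"\x90\x90\xeb", b"\x90\x90\x00", b"\x00\x90\x90"]
top_features = top_by_document_frequency(files, n=2, k=1)
tf = term_frequencies(b"\x90\x90\x90\xeb", n=2)
```

The later steps described in the abstract (further feature selection, and training and testing classifiers across different benign/malicious distributions) would take the selected features as input and are not shown here.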

Chinese text automatic correction method

The invention discloses a Chinese text automatic correction method. The method comprises the following steps: a) inputting a Chinese text to be corrected and performing word segmentation preprocessing on it sentence by sentence; b) searching the segmented text, sentence by sentence, for one-character words, two-character words and scattered strings of three or more characters; c) performing continuity determination on the scattered strings in the segmented text using an N-gram model, and checking each sentence for word-level errors in combination with the word-forming probability of the separate characters; and d) constructing an error correction knowledge base to generate an error correction candidate text. Because the N-gram continuity determination identifies errors and the knowledge base supplies candidate corrections, the error checking and error correcting processes are closely combined, and the method offers high error checking speed and high error correcting efficiency.
Owner:SHANGHAI INST OF TECH
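
Step c) of the abstract above can be illustrated with a simple character bigram model. This sketch is an assumption-laden simplification, not the patented method: it skips word segmentation, uses add-one smoothing and a fixed threshold chosen for the toy data, and all names and the two-sentence corpus are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def train_bigrams(corpus: List[str]) -> Tuple[Counter, Counter]:
    """Count single characters and adjacent character pairs in a reference corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def bigram_prob(prev: str, cur: str, unigrams: Counter, bigrams: Counter) -> float:
    """Estimate P(cur | prev) with add-one smoothing over the known vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)


def suspicious_pairs(text: str, unigrams: Counter, bigrams: Counter,
                     threshold: float) -> List[Tuple[int, str]]:
    """Return (position, pair) for adjacent characters the model finds unlikely."""
    return [
        (i, text[i:i + 2])
        for i in range(len(text) - 1)
        if bigram_prob(text[i], text[i + 1], unigrams, bigrams) < threshold
    ]


# Tiny reference corpus; a real system would train on a large clean corpus.
unigrams, bigrams = train_bigrams(["我们是学生", "我们是老师"])
# The last character of "学生" has been replaced by a homophone.
flagged = suspicious_pairs("我们是学声", unigrams, bigrams, threshold=0.2)
```

Flagged positions would then be passed to a correction stage, which in the abstract is driven by the error correction knowledge base.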

Character string updated degree evaluation program

There is provided a character string updated degree evaluation program that makes it possible to quantify the amount of intellectual work involved in editing and updating character strings. A text subjected to comparison is divided into common part character strings, each having a length greater than or equal to a threshold value, and non-common part character strings. The number of edited points relative to the original text and a context edit distance are calculated based on the rate of the common part character strings and their occurrence pattern. The number of edited points is obtained from the number of elements contained in the common part character string set, and the context edit distance is obtained from the change in the order of occurrence of the common part character strings. For the non-common part character strings, a new creation percentage is calculated and an N-gram analysis is performed. The new creation percentage is obtained from the total length of the elements contained in the non-common part character string set, and a new creation novelty degree is obtained from a non-partial matching rate between the non-common part character string set and an element contained in that set. The calculations for the common part character string set and for the non-common part character string set are combined to calculate a text updated degree.
Owner:NAT UNIV CORP NAGAOKA UNIV TECH
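
The split into common and non-common parts described in the abstract above can be approximated with Python's standard `difflib`. This is only a sketch under assumptions: the abstract does not specify the matching algorithm, the formula used here for the new creation percentage (non-common length divided by revised text length) is a guess at one plausible reading, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def split_common(original: str, revised: str, min_len: int) -> Tuple[List[str], List[str]]:
    """Split the revised text into common parts (>= min_len) and non-common parts."""
    matcher = SequenceMatcher(None, original, revised, autojunk=False)
    common, non_common, cursor = [], [], 0
    for block in matcher.get_matching_blocks():
        if block.size < min_len:
            continue  # short matches (and the final zero-length block) count as non-common
        if block.b > cursor:
            non_common.append(revised[cursor:block.b])
        common.append(revised[block.b:block.b + block.size])
        cursor = block.b + block.size
    if cursor < len(revised):
        non_common.append(revised[cursor:])
    return common, non_common


def new_creation_percentage(original: str, revised: str, min_len: int) -> float:
    """Share of the revised text, in percent, that lies outside the common parts."""
    if not revised:
        return 0.0
    _, non_common = split_common(original, revised, min_len)
    return 100.0 * sum(len(part) for part in non_common) / len(revised)


common, non_common = split_common("alpha beta gamma", "alpha XYZ gamma", min_len=3)
percentage = new_creation_percentage("alpha beta gamma", "alpha XYZ gamma", min_len=3)
```

The number of elements in `common` corresponds loosely to the input the abstract uses for counting edited points; the context edit distance and the N-gram novelty analysis are not modelled here.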