226 results about "N-gram" patented technology

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
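As a minimal sketch of the definition above, the following Python extracts contiguous n-grams from any sequence. The function name `ngrams` and the sample inputs are illustrative assumptions, not taken from any of the listed patents.

```python
from collections import Counter


def ngrams(items, n):
    """Return every contiguous run of n items as a tuple (hypothetical helper)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word-level bigrams (also called shingles when the items are words).
words = "to be or not to be".split()
bigram_counts = Counter(ngrams(words, 2))

# Character-level trigrams; a string is just a sequence of letters.
char_trigrams = ngrams("n-gram", 3)
```

The same function works for phonemes, syllables or base pairs, since it only assumes that the input is an indexable sequence.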

Identification and rejection of meaningless input during natural language classification

A method for identifying data that is meaningless and generating a natural language statistical model which can reject meaningless input. The method can include identifying unigrams that are individually meaningless from a set of training data. At least a portion of the unigrams identified as being meaningless can be assigned to a first n-gram class. The method also can include identifying bigrams that are entirely composed of meaningless unigrams and determining whether the identified bigrams are individually meaningless. At least a portion of the bigrams identified as being individually meaningless can be assigned to the first n-gram class.
Owner:MICROSOFT TECH LICENSING LLC

Apparatus method and medium for tracing the origin of network transmissions using n-gram distribution of data

A method, apparatus, and medium are provided for tracing the origin of network transmissions. Connection records are maintained at a computer system for storing source and destination addresses. The connection records also maintain a statistical distribution of data corresponding to the data payload being transmitted. The statistical distribution can be compared to that of the connection records in order to identify the sender. The location of the sender can subsequently be determined from the source address stored in the connection record. The process can be repeated multiple times until the location of the original sender has been traced.
Owner:THE TRUSTEES OF COLUMBIA UNIV IN THE CITY OF NEW YORK

Method, apparatus and computer program product for providing flexible text based language identification

An apparatus for providing flexible text based language identification includes an alphabet scoring element, an n-gram frequency element and a processing element. The alphabet scoring element may be configured to receive an entry in a computer readable text format and to calculate an alphabet score of the entry for each of a plurality of languages. The n-gram frequency element may be configured to calculate an n-gram frequency score of the entry for each of the plurality of languages. The processing element may be in communication with the n-gram frequency element and the alphabet scoring element. The processing element may also be configured to determine a language associated with the entry based on a combination of the alphabet score and the n-gram frequency score.
Owner:NOKIA TECH OY

Method of identifying the language of a textual passage using short word and/or n-gram comparisons

A method and system for identifying the language of a textual passage are disclosed. The method and system include parsing the textual passage into n-grams, assigning an initial weight to each n-gram, and adjusting the weight initially assigned to a word or n-gram parsed from the textual passage. The initially assigned weight is adjusted in proportion to the inverse of the number of languages within which such words or n-grams appear. Reducing the weight assigned to such words or n-grams diminishes, without completely eliminating, their importance in comparison to other words or n-grams parsed from the same textual passage when determining the language of a passage. The method and system of the present invention appropriately weight the short words or n-grams common to multiple languages without affecting the short words or n-grams that are uncommon to several languages.
Owner:JUSTSYST EVANS RES
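
The inverse weighting described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented method: the per-language profiles are modelled as plain sets of known n-grams, the initial weight is 1.0, and the names `language_scores` and `identify_language` are hypothetical.

```python
from collections import defaultdict


def language_scores(grams, profiles, initial_weight=1.0):
    """Score each language; a gram shared by k languages contributes weight / k."""
    scores = defaultdict(float)
    for gram in grams:
        langs = [lang for lang, known in profiles.items() if gram in known]
        if not langs:
            continue  # gram unknown to every profile: no evidence either way
        weight = initial_weight / len(langs)  # inverse of the language count
        for lang in langs:
            scores[lang] += weight
    return dict(scores)


def identify_language(grams, profiles):
    """Return the best-scoring language, or None if nothing matched."""
    scores = language_scores(grams, profiles)
    return max(scores, key=scores.get) if scores else None


# Toy profiles (assumed data): "in" is common to all three languages.
profiles = {
    "en": {"the", "of", "in"},
    "de": {"der", "und", "in"},
    "nl": {"de", "van", "in"},
}
result = identify_language(["in", "the", "in"], profiles)
```

Here the shared word "in" still counts, but each occurrence adds only a third of the weight that the English-only word "the" adds.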

Systems and methods for an autonomous avatar driver

The autonomous avatar driver is useful in association with language sources. A sourcer may receive dialog from the language source. It may also, in some embodiments, receive external data from data sources. A segmentor may convert characters, represent particles and split dialog. A parser may then apply a link grammar, analyze grammatical mood, tag the dialog and prune dialog variants. A semantic engine may lookup token frames, generate semantic lexicons and semantic networks, and resolve ambiguous co-references. An analytics engine may filter common words from dialog, analyze N-grams, count lemmatized words, and analyze nodes. A pragmatics analyzer may resolve slang, generate knowledge templates, group proper nouns and estimate affect of dialog. A recommender may generate tag clouds, cluster the language sources into neighborhoods, recommend social networking to individuals and businesses, and generate contextual advertising. Lastly, a response generator may generate responses for the autonomous avatar using the analyzed dialog. The response generator may also incorporate the generated recommendations.
Owner:BOTANIC TECH INC

Sentiment Classification Based on Supervised Latent N-Gram Analysis

A method for sentiment classification of a text document using high-order n-grams utilizes a multilevel embedding strategy to project n-grams into a low-dimensional latent semantic space where the projection parameters are trained in a supervised fashion together with the sentiment classification task. Using, for example, a deep convolutional neural network, the semantic embedding of n-grams, the bag-of-occurrence representation of text from n-grams, and the classification function from each review to the sentiment class are learned jointly in one unified discriminative framework.
Owner:NEC LAB AMERICA

Compression method, method for compressing entry word index data for a dictionary, and machine translation system

An n-gram statistical analysis is employed to acquire frequently appearing character strings of n characters or more, and individual character strings having n characters or more are replaced by character translation codes of 1 byte each. The correlation between the original character strings having n characters and the character translation codes is registered in a character translation code table. Assume that a character string of three characters, i.e., a character string of three bytes, "sta," is registered as 1-byte code "e5" and that a character string of four characters, i.e., a character string of four bytes, "tion," is registered as 1-byte code "f1." Then, the word "station," which consists of a character string of seven characters, i.e., seven bytes, is represented by the 2-byte code "e5 f1," so that this contributes to a compression of five bytes.
Owner:IBM CORP
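
The "station" example from this abstract can be reproduced with a small sketch. It assumes ASCII input, 1-byte codes chosen above 0x7f so they cannot collide with literal characters, and a greedy longest-match strategy; the functions `compress` and `decompress` are hypothetical and only the two table entries come from the abstract.

```python
# Character translation code table from the abstract's example.
TABLE = {"sta": 0xE5, "tion": 0xF1}


def compress(text, table):
    """Replace registered strings with their 1-byte codes (greedy, longest first)."""
    keys = sorted(table, key=len, reverse=True)
    out = bytearray()
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(ord(text[i]))  # unregistered character: copy its ASCII byte
            i += 1
    return bytes(out)


def decompress(data, table):
    """Invert compress() by expanding each code back to its string."""
    reverse = {code: s for s, code in table.items()}
    return "".join(reverse.get(b, chr(b)) for b in data)


packed = compress("station", TABLE)
saved = len("station") - len(packed)
```

With this table, `packed` is the two bytes e5 f1 and `saved` is five bytes, matching the figures given in the abstract.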

Unknown malcode detection using classifiers with optimal training sets

The present invention is directed to a method for detecting unknown malicious code, such as a virus, a worm, a Trojan Horse or any combination thereof. A Data Set is created, which is a collection of files that includes a first subset of files with malicious code and a second subset of benign code files; the malicious and benign files are identified by an antivirus program. All files are parsed using n-gram moving windows of several lengths, and the TF representation is computed for each n-gram in each file. An initial set of top features (e.g., up to 5500) of all n-grams is selected based on the DF measure, and the number of top features is reduced by feature selection methods to comply with the computation resources required for classifier training. The optimal number of features is then determined by evaluating the detection accuracy of several sets of reduced top features. Based on the optimal number, different data sets with different distributions of benign and malicious files are prepared for use as training and test sets. For each classifier, the detection accuracy is iteratively evaluated for all combinations of training and test set distributions: in each iteration, a classifier is trained using a specific distribution and the trained classifier is tested on all distributions. The optimal distribution, which results in the highest detection accuracy, is selected for that classifier.
Owner:DEUTSCHE TELEKOM AG

System and method for detecting malicious executable code

A system and method for detecting malicious executable software code. Benign and malicious executables are gathered; and each are encoded as a training example using n-grams of byte codes as features. After selecting the most relevant n-grams for prediction, a plurality of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting, are evaluated.
Owner:GEORGETOWN UNIV

Method of identifying script of line of text

A method of identifying the script of a line of text by first assigning a weight to each n-gram in a group of documents of known scripts, where each n-gram is a sequence of numbers representing k-means cluster centroids of a known script to which character segments in the documents of known scripts most closely match. A line of text is identified, where the line of text is made up of pixels. The identified line of text is cropped so that only a percentage of the pixels remain. The cropped line is vertically and horizontally rescaled into gray-scale pixels. The vertical gray-scale pixels are replaced with the sequence number of a k-means cluster centroid of a known script to which they most closely match. The n-grams of the number sequence that represents the line of text are scored against the n-gram weights of the documents of known scripts. The highest score of the line of text is identified and compared to the scores of the documents of known scripts. The script of the line of text is determined to be the script of the document against which the line of text scores the highest.
Owner:NATIONAL SECURITY AGENCY

Apparatus method and medium for detecting payload anomaly using n-gram distribution of normal data

A method, apparatus and medium are provided for detecting anomalous payloads transmitted through a network. The system receives payloads within the network and determines a length for data contained in each payload. A statistical distribution is generated for data contained in each payload received within the network, and compared to a selected model distribution representative of normal payloads transmitted through the network. The model payload can be selected such that it has a predetermined length range that encompasses the length for data contained in the received payload. Anomalous payloads are then identified based on differences detected between the statistical distribution of received payloads and the model distribution. The system can also provide for automatic training and incremental updating of models.
Owner:THE TRUSTEES OF COLUMBIA UNIV IN THE CITY OF NEW YORK
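
A rough sketch of the comparison this abstract describes: build a byte-frequency (1-gram) distribution for a payload, pick the model whose length range covers it, and flag the payload if the two distributions differ too much. The Manhattan distance, the threshold value and all function names here are assumptions made for illustration; the abstract does not specify the distance measure.

```python
from collections import Counter


def byte_distribution(payload):
    """Relative frequency of each of the 256 byte values in the payload."""
    counts = Counter(payload)
    total = len(payload)
    return [counts[b] / total if total else 0.0 for b in range(256)]


def distance(dist_a, dist_b):
    """Manhattan distance between two distributions (an assumed metric)."""
    return sum(abs(a - b) for a, b in zip(dist_a, dist_b))


def is_anomalous(payload, models, threshold):
    """models: list of (min_len, max_len, distribution) tuples, assumed layout."""
    for lo, hi, model in models:
        if lo <= len(payload) <= hi:
            return distance(byte_distribution(payload), model) > threshold
    return True  # no model covers this payload length: treat as anomalous


# Toy model "trained" on a single normal payload (assumed data).
normal = b"GET /index.html HTTP/1.1"
models = [(0, 100, byte_distribution(normal))]
```

A payload consisting only of 0x90 bytes shares no byte values with the toy model, so its distance is the maximum of 2.0 and it is flagged for any threshold below that.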

Methods and systems for implementing approximate string matching within a database

A computer-based method for character string matching of a candidate character string with a plurality of character string records stored in a database is described. The method includes a) identifying a set of reference character strings in the database, the reference character strings identified utilizing an optimization search for a set of dissimilar character strings, b) generating an n-gram representation for one of the reference character strings in the set of reference character strings, c) generating an n-gram representation for the candidate character string, d) determining a similarity between the n-gram representations, e) repeating steps b) and d) for the remaining reference character strings in the set of identified reference character strings, and f) indexing the candidate character string within the database based on the determined similarities between the n-gram representation of the candidate character string and the reference character strings in the identified set.
Owner:MASTERCARD INT INC
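
Steps b) through d) of this abstract can be illustrated with a short sketch that compares a candidate string against a set of reference strings. It assumes character bigrams and Jaccard similarity, neither of which is stated in the abstract, and the names `char_ngrams`, `jaccard` and `similarity_profile` are hypothetical.

```python
def char_ngrams(s, n=2):
    """Set of character n-grams of a lower-cased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b, n=2):
    """Jaccard similarity of the two strings' n-gram sets (assumed measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0  # both too short to yield any n-gram
    return len(ga & gb) / len(ga | gb)


def similarity_profile(candidate, references, n=2):
    """Similarity of the candidate to each reference string, in order."""
    return [jaccard(candidate, ref, n) for ref in references]


# The resulting vector could serve as an index key for the candidate.
profile = similarity_profile("night", ["nacht", "night", "day"])
```

The list of similarities plays the role of the candidate's position relative to the reference strings; how that position is then used for indexing is left open here.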

Systems and methods for interactive topic-based text summarization

Techniques for determining interactive topic-based summarization are provided. A text to be summarized is segmented. Discrete keyword, key-phrase, n-gram, sentence and other sentence constituent based summaries are generated based on statistical measures for each text segment. Interactive topic-based summaries are displayed with human sensible omitted text indicators such as alternate colors, fonts, sounds, tactile elements or other human sensible display characteristics useful in indicating omitted text. Individual and / or combinations of discrete keyword, key-phrase, n-gram, sentence, noun phrase and sentence constituent based summaries are dynamically displayed to provide an overview of topic and subtopic development within a text. A hierarchical and interactive display of texts based on the use of discrete sentence constituent based summaries which associates expansible and contractible displayed text provides contextualized access to an interactive topic-based text summary and to an original text.
Owner:PALO ALTO RES CENT INC

Efficient language identification

A system and methods of language identification of natural language text are presented. The system includes stored expected character counts and variances for a list of characters found in a natural language. Expected character counts and variances are stored for multiple languages to be considered during language identification. At run-time, one or more languages are identified for a text sample based on comparing actual and expected character counts. The present methods can be combined with upstream analyzing of Unicode ranges for characters in the text sample to limit the number of languages considered. Further, n-gram methods can be used in downstream processing to select the most probable language from among the languages identified by the present system and methods.
Owner:MICROSOFT TECH LICENSING LLC

Systems and methods for alphanumeric navigation and input

Systems and methods for simplifying text entry are provided. A visual keypad may include a plurality of user-selectable buttons corresponding to at least some of the letters of the alphabet. The layout of the visual keypad may be determined based on an n-gram table. The layout of the visual keypad may be rearranged based at least in part on the most likely next character in response to receiving a user selection of a button on the visual keypad.
Owner:UNITED VIDEO PROPERTIES
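
One way to picture the n-gram table in this abstract is a character bigram table that ranks candidate next letters after each selection. The sketch below assumes a bigram (n = 2) table built from a small word list; the word list and the function names are illustrative only.

```python
from collections import Counter, defaultdict


def build_bigram_table(words):
    """Count how often each character follows each other character."""
    table = defaultdict(Counter)
    for word in words:
        for prev, nxt in zip(word, word[1:]):
            table[prev][nxt] += 1
    return table


def rank_next_chars(table, last_char, alphabet):
    """Order the alphabet by likelihood of following last_char, ties alphabetical."""
    counts = table.get(last_char, Counter())
    return sorted(alphabet, key=lambda ch: (-counts[ch], ch))


# Toy vocabulary (assumed data).
table = build_bigram_table(["the", "then", "this", "that"])
layout_after_h = rank_next_chars(table, "h", "aeiou")
```

After the user selects "h", a keypad driven by this ranking would place "e" first, since "he" is the most frequent continuation in the toy vocabulary.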

Systems and methods for displaying interactive topic-based text summaries

Techniques for displaying interactive topic-based summarization are provided. A text to be summarized is segmented. Discrete keyword, key-phrase, n-gram, sentence and other sentence constituent based summaries are generated based on statistical measures for each text segment. Interactive topic-based summaries are displayed with human sensible omitted text indicators such as alternate colors, fonts, sounds, tactile elements or other human sensible display characteristics useful in indicating omitted text. Individual and / or combinations of discrete keyword, key-phrase, n-gram, sentence, noun phrase and sentence constituent based summaries are dynamically displayed to provide an overview of topic and subtopic development within a text. A hierarchical and interactive display of texts, based on the use of discrete sentence constituent based summaries, associates expansible and contractible displayed text to provide contextualized access to an interactive topic-based text summary and to an original text.
Owner:XEROX CORP

Language identification from short strings

Systems and processes for language identification from short strings are provided. In accordance with one example, a method includes, at a first electronic device with one or more processors and memory, receiving user input including an n-gram and determining a similarity between a representation of the n-gram and a representation of a first language. The representation of the first language is based on an occurrence of each of a plurality of n-grams in the first language and an occurrence of each of the plurality of n-grams in a second language. The method further includes determining whether the similarity between the representation of the n-gram and the representation of the first language satisfies a threshold.
Owner:APPLE INC

Method and system for organizing information

A system and method to process data, having a module stored on a server computer system for receiving a query over a network from a client computer system. A search engine utilizes the query to extract a search result from a data source. A query decomposition module decomposes the query into at least one n-gram which is a subset of the query. A processing module processes the at least one n-gram to determine at least one related search suggestion. A merging module merges the at least one related search suggestion into a ranked output data set. A transmission module transmits the search result and the at least one related search suggestion from the server computer system to the client computer system.
Owner:IAC SEARCH & MEDIA

Chinese text automatic correction method

The invention discloses a Chinese text automatic correction method. The method comprises the following steps of: a) inputting a to-be-corrected Chinese text, and performing word segmentation preprocessing on the Chinese text sentence by sentence; b) searching for one-character words, two-character words or disperse strings of three or more than three characters occurring in the text subjected to word segmentation sentence by sentence; c) performing continuous determination on the disperse strings occurring in the text subjected to word segmentation by adopting an N-gram model, and checking text word level errors for each single sentence in combination with a word forming probability of separate characters; and d) constructing an error correction knowledge base to generate an error correction candidate text. According to the Chinese text automatic correction method provided by the invention, the one-character words, two-character words or disperse strings of three or more than three characters occurring in the text subjected to word segmentation are searched for sentence by sentence, the disperse strings occurring in the text subjected to word segmentation are subjected to continuous determination by adopting the N-gram model to determine identification errors, and the error correction knowledge base is constructed to generate the error correction candidate text, so that error checking and correcting processes are combined very well, and the method has the characteristics of high error checking speed and high error correcting efficiency.
Owner:SHANGHAI INST OF TECH

Automatic Evaluation of Spoken Fluency

A procedure to automatically evaluate the spoken fluency of a speaker by prompting the speaker to talk on a given topic, recording the speaker's speech to get a recorded sample of speech, and then analyzing the patterns of disfluencies in the speech to compute a numerical score that quantifies the spoken fluency skills of the speaker. The numerical fluency score accounts for various prosodic and lexical features, including formant-based filled-pause detection, closely-occurring exact and inexact repeat N-grams, and the normalized average distance between consecutive occurrences of N-grams. The lexical features and prosodic features are combined to classify the speaker with a C-class classification and develop a rating for the speaker.
Owner:NUANCE COMM INC

Method and apparatus for programmatically generating audio file playlists

Method and apparatus for programmatically generating interesting audio file playlists. A playlist generation mechanism may use an N-gram model of audio file ordering patterns found in a collection of human-generated playlists to automatically generate new playlists. Given play histories indicating one or more played audio files as input, statistical methods may be used to look for sequences of audio files that occur a statistically significant number of times in the N-gram model for inclusion in new, interesting playlists that incorporate the human element found in the collection of playlists. In some embodiments, one or more backoff probability methods may be used to provide additional candidate audio files for playlists if there is insufficient coverage for an audio file in the N-gram model. In one embodiment, a class-based statistical model incorporating higher-level statistics for the audio files may be used to weight selection of audio file transitions from the N-gram model.
Owner:ORACLE INT CORP
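
The idea of an N-gram model over playlist orderings with a backoff can be sketched in a few lines. This is a simplified illustration, not the patented mechanism: it uses bigrams only, replaces the statistical significance test with a plain minimum count, and backs off to overall track popularity. The names `train` and `next_track` and the `min_count` parameter are assumptions.

```python
from collections import Counter, defaultdict


def train(playlists):
    """Count track-to-track transitions (bigrams) and overall track frequency."""
    bigrams = defaultdict(Counter)
    unigrams = Counter()
    for playlist in playlists:
        unigrams.update(playlist)
        for current, following in zip(playlist, playlist[1:]):
            bigrams[current][following] += 1
    return bigrams, unigrams


def next_track(last, bigrams, unigrams, min_count=2):
    """Pick a follower seen at least min_count times, else back off to popularity."""
    followers = bigrams.get(last)
    if followers:
        track, count = followers.most_common(1)[0]
        if count >= min_count:
            return track
    # Backoff: insufficient bigram coverage, so use the most played other track.
    for track, _ in unigrams.most_common():
        if track != last:
            return track
    return None


# Toy human-generated playlists (assumed data).
bigrams, unigrams = train([["a", "b", "c"], ["a", "b", "d"], ["d", "b"]])
```

In the toy data, "b" follows "a" twice, so it is chosen directly after "a"; track "c" has no recorded follower, so the backoff returns the most played track instead.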

Domain-specific sentiment classification

A domain-specific sentiment classifier that can be used to score the polarity and magnitude of sentiment expressed by domain-specific documents is created. A domain-independent sentiment lexicon is established and a classifier uses the lexicon to score sentiment of domain-specific documents. Sets of high-sentiment documents having positive and negative polarities are identified. The n-grams within the high-sentiment documents are filtered to remove extremely common n-grams. The filtered n-grams are saved as a domain-specific sentiment lexicon and are used as features in a model. The model is trained using a set of training documents which may be manually or automatically labeled as to their overall sentiment to produce sentiment scores for the n-grams in the domain-specific sentiment lexicon. This lexicon is used by the domain-specific sentiment classifier.
Owner:GOOGLE LLC

Character string updated degree evaluation program

There is provided a character string updated degree evaluation program that enables quantitative grasping of an amount of intellectual work through editing and updating of character strings. A text subjected to comparison is divided into common part character strings each having a length greater than or equal to a threshold value, and non-common part character strings. A number of edited points from the original text and a context edit distance are calculated based on the rate of the common part character strings and the occurrence pattern thereof. The number of edited points is acquired from the number of elements contained in a common part character string set, and the context edit distance is acquired from a change in the order of occurrence of the common part character strings. Calculation of a new creation percentage and analysis by an N-gram are performed on the non-common part character strings. The new creation percentage is acquired from the total length of the elements contained in a non-common part character string set, and a new creation novelty degree is acquired from a non-partial matching rate between a non-common part character string set and an element contained in the non-common part character string set. Calculations for the common part character string set and for the non-common part character string set are united, thereby calculating a text updated degree.
Owner:NAT UNIV CORP NAGAOKA UNIV TECH

System and method for context-based spontaneous speech recognition

A system and method for processing human language input uses collocation information for the language that is not limited to N-gram information for N no greater than a predetermined value. The input is preferably speech input. The system and method preferably recognize at least a portion of the input based on the collocation information.
Owner:NUSUARA TECH