105 results about "Document clustering" patented technology

Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction and fast information retrieval or filtering.

Document clustering that applies a locality sensitive hashing function to a feature vector to obtain a limited set of candidate clusters

Documents from a data stream are clustered by first generating a feature vector for each document. A set of cluster centroids (e.g., feature vectors of their corresponding clusters) are retrieved from a memory based on the feature vector of the document using a locality sensitive hashing function. The centroids may be retrieved by retrieving a set of cluster identifiers from a cluster table, the cluster identifiers each indicative of a respective cluster centroid, and retrieving the cluster centroids corresponding to the retrieved cluster identifiers from a memory. Documents may then be clustered into one or more of the candidate clusters using distance measures from the feature vector of the document to the cluster centroids.
Owner:SIEMENS CORP
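
The retrieval step in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the random-hyperplane hash, the cosine distance, and the names `LshClusterIndex`, `cluster_table`, and `assign` are all assumptions chosen for the example.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (assumed LSH family: sign of a random projection)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0


class LshClusterIndex:
    """Hypothetical cluster table: LSH key -> cluster ids -> centroids."""

    def __init__(self, dim, n_bits=4, seed=0):
        self.hyperplanes = make_hyperplanes(dim, n_bits, seed)
        self.cluster_table = defaultdict(set)  # LSH key -> cluster identifiers
        self.centroids = {}                    # cluster identifier -> centroid vector

    def add_cluster(self, cluster_id, centroid):
        self.centroids[cluster_id] = centroid
        self.cluster_table[lsh_key(centroid, self.hyperplanes)].add(cluster_id)

    def candidates(self, vec):
        """Limited set of candidate clusters sharing the document's LSH key."""
        return self.cluster_table.get(lsh_key(vec, self.hyperplanes), set())

    def assign(self, vec):
        """Pick the nearest candidate centroid, or None if no candidate exists."""
        cands = self.candidates(vec)
        if not cands:
            return None
        return min(cands, key=lambda cid: cosine_distance(vec, self.centroids[cid]))
```

Only centroids in the same hash bucket are compared, so the distance computation is limited to a small candidate set rather than every cluster.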

Method and apparatus for document clustering and document sketching

A first embodiment of the invention provides a system that automatically classifies documents in a collection into clusters based on the similarities between documents, that automatically classifies new documents into the right clusters, and that may change the number or parameters of clusters under various circumstances. A second embodiment of the invention provides a technique for comparing two documents, in which a fingerprint or sketch of each document is computed. In particular, this embodiment of the invention uses a specific algorithm to compute the document's fingerprint. One embodiment uses a sentence in the document as a logical delimiter or window from which significant words are extracted and, thereafter, a hash is computed of all pair-wise permutations. Words are extracted based on their weight in the document, which can be computed using measures such as term frequency and the inverse document frequency.
Owner:EBRARY
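
The sketching idea in the second embodiment might look roughly like the code below. It is a sketch under stated assumptions: the sentence splitter, the smoothed IDF formula, the choice of SHA-1, the `top_k` cutoff, and the Jaccard comparison are illustrative choices, not details taken from the patent.

```python
import hashlib
import math
import re
from collections import Counter
from itertools import permutations


def sentences(text):
    """Split on sentence punctuation; each sentence acts as one window."""
    return [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]


def words(sentence):
    return re.findall(r"[a-z0-9]+", sentence)


def idf_table(corpus):
    """Smoothed inverse document frequency over a reference corpus (assumed formula)."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(words(doc.lower())))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def sketch(text, idf, top_k=3):
    """Hash every ordered pair of the top_k highest-weighted words in each sentence."""
    tf = Counter(words(text.lower()))
    hashes = set()
    for sent in sentences(text):
        uniq = set(words(sent))
        top = sorted(uniq, key=lambda w: (-tf[w] * idf.get(w, 1.0), w))[:top_k]
        for a, b in permutations(top, 2):
            digest = hashlib.sha1(f"{a} {b}".encode("utf-8")).hexdigest()
            hashes.add(digest[:16])
    return hashes


def jaccard(a, b):
    """Overlap of two sketches; 1.0 means identical sets of pair hashes."""
    return len(a & b) / len(a | b) if a or b else 1.0
```

Two documents can then be compared with `jaccard(sketch(d1, idf), sketch(d2, idf))` without comparing their full text.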

Dynamic information extraction with self-organizing evidence construction

A data analysis system with dynamic information extraction and self-organizing evidence construction finds numerous applications in information gathering and analysis, including the extraction of targeted information from voluminous textual resources. One disclosed method involves matching text with a concept map to identify evidence relations, and organizing the evidence relations into one or more evidence structures that represent the ways in which the concept map is instantiated in the evidence relations. The text may be contained in one or more documents in electronic form, and the documents may be indexed on a paragraph level of granularity. The evidence relations may self-organize into the evidence structures, with feedback provided to the user to guide the identification of evidence relations and their self-organization into evidence structures. A method of extracting information from one or more documents in electronic form includes the steps of clustering the document into clustered text; identifying patterns in the clustered text; and matching the patterns with the concept map to identify evidence relations such that the evidence relations self-organize into evidence structures that represent the ways in which the concept map is instantiated in the evidence relations.
Owner:TECHTEAM GOVERNMENT SOLUTIONS

Interactive cleaning for automatic document clustering and categorization

Documents are clustered or categorized to generate a model associating documents with classes. Outlier measures are computed for the documents indicative of how well each document fits into the model. Outlier documents are identified to a user based on the outlier measures and a user-selected outlier criterion. Ambiguity measures are computed for the documents indicative of a number of classes with which each document has similarity under the model. If a document is annotated with a label class, a possible corrective label class is identified if the annotated document has higher similarity with the possible corrective label class under the model than with the annotated label class. The clustering or categorizing is repeated, adjusted based on received user input, to generate an updated model associating documents with classes. Outlier and ambiguity measures are also calculated at runtime for new documents classified using the model.
Owner:XEROX CORP

System and method for resolving entity coreference

A method and a system for coreference resolution are provided. The method includes receiving a set of document clusters, each cluster in the set of document clusters including a set of text documents. Instances of each of a set of candidate named entities are identified in the document clusters. For pairs of the candidate named entities, at least one socio-temporal feature is computed that is based on the similarity of the distributions of identified instances of the respective candidate named entities among the document clusters. A decision on merging the candidate named entities into a common real named entity is based on the socio-temporal features.
Owner:XEROX CORP

Method and platform for term extraction from large collection of documents

A method and platform for statistically extracting terms from large sets of documents is described. An importance vector is determined for each document in the set of documents based on importance values for words in each document. A binary document classification tree is formed by clustering the documents into clusters of similar documents based on the importance vector for each document. An infrastructure is built for the set of documents by generalizing the binary document classification tree. The document clusters are determined by dividing the generalized tree of the infrastructure into two parts and cutting away the upper part. Statistically significant individual key words are extracted from the clusters of similar documents. Key words are treated as seeds and terms are extracted by starting from the seeds and extending to their left or right contexts.
Owner:AGENCY FOR SCI TECH & RES

Phrase-based document clustering with automatic phrase extraction

Meaningful phrases are distinguished from chance word sequences statistically, by analyzing a large number of documents and using a statistical metric such as a mutual information metric to distinguish meaningful phrases from groups of words that co-occur by chance. In some embodiments, multiple lists of candidate phrases are maintained to optimize the storage requirement of the phrase-identification algorithm. After phrase identification, a combination of words and meaningful phrases can be used to construct clusters of documents.
Owner:MICRO FOCUS LLC
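
As a rough illustration of separating meaningful phrases from chance co-occurrences, the snippet below scores adjacent word pairs with pointwise mutual information. The whitespace tokenizer, the restriction to two-word phrases, and the `min_count` and `threshold` parameters are assumptions made for the example; the abstract only says that a mutual information metric may be used.

```python
import math
from collections import Counter


def pmi_phrases(docs, min_count=2, threshold=1.0):
    """Return bigrams whose pointwise mutual information meets the threshold."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    phrases = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # too rare to judge statistically
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi >= threshold:
            phrases[(a, b)] = pmi
    return phrases
```

Pairs that pass the filter could be added to the vocabulary alongside single words before the documents are clustered.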

Methods and apparatus for interactive document clustering

A computer-based process is described for identifying clusters of documents that have some degree of similarity from among a set of documents that permits user interaction with the process. A plurality of seed candidate documents is identified. Candidate probes based upon the seed candidate documents are generated, and information regarding the candidate probes is displayed to a user. User input regarding the candidate probes is received, and a set of probes from which to form clusters of documents are defined based upon the user input regarding the candidate probes. A probe is selected and a cluster of documents is formed from among available documents not yet clustered using the probe. The process can be repeated to generate further clusters. The process can be implemented with a computer system, and associated programming instructions can be contained within a computer readable medium.
Owner:JUSTSYST EVANS RES

Online document clustering

Documents from a data stream are clustered by first generating a feature vector for each document. A set of cluster centroids (e.g., feature vectors of their corresponding clusters) are retrieved from a memory based on the feature vector of the document and a relative age of each of the cluster centroids. The centroids may be retrieved by retrieving a set of cluster identifiers from a cluster table, the cluster identifiers each indicative of a respective cluster centroid, and retrieving the cluster centroids corresponding to the retrieved cluster identifiers from a memory. A list of cluster identifiers in the cluster table may be maintained based on the relative age of cluster centroids corresponding to the cluster identifiers. Cluster identifiers that correspond to cluster centroids with a relative age exceeding a predetermined threshold are periodically removed from the list of cluster identifiers.
Owner:SIEMENS CORP
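
The age-based maintenance of the cluster list could be sketched like this. The class name `AgingClusterTable`, the Euclidean distance, the running-mean centroid update, and the `join_threshold` parameter are assumptions for illustration; the abstract describes periodic removal, whereas this sketch simply evicts stale clusters on every insertion.

```python
import math


class AgingClusterTable:
    """Hypothetical stream clusterer that drops clusters not updated recently."""

    def __init__(self, max_age, join_threshold):
        self.max_age = max_age                # age beyond which a cluster is removed
        self.join_threshold = join_threshold  # max distance for joining a cluster
        self.centroids = {}                   # cluster id -> (centroid, size)
        self.last_seen = {}                   # cluster id -> time of last update
        self.next_id = 0

    def evict_stale(self, now):
        """Remove cluster ids whose relative age exceeds max_age."""
        stale = [cid for cid, t in self.last_seen.items() if now - t > self.max_age]
        for cid in stale:
            del self.centroids[cid]
            del self.last_seen[cid]
        return stale

    def add(self, vec, now):
        """Assign vec to the nearest live cluster, or start a new one."""
        self.evict_stale(now)
        best, best_dist = None, float("inf")
        for cid, (centroid, _) in self.centroids.items():
            d = math.dist(vec, centroid)
            if d < best_dist:
                best, best_dist = cid, d
        if best is None or best_dist > self.join_threshold:
            best = self.next_id
            self.next_id += 1
            self.centroids[best] = (list(vec), 1)
        else:
            centroid, size = self.centroids[best]
            updated = [(c * size + v) / (size + 1) for c, v in zip(centroid, vec)]
            self.centroids[best] = (updated, size + 1)
        self.last_seen[best] = now
        return best
```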

System and method for analyzing vertical public opinions based on industry

A system for analyzing vertical public opinions based on an industry comprises an acquisition and pre-treatment module for acquiring and pre-treating Internet information relevant to the consumer electronics industry and obtaining the formative information of the consumer electronics industry based on documents; a word segmentation module for matching words by means of a character string matching algorithm, and obtaining word segmentation results by amending the matching results in a word segmentation method based on understanding and statistics; an analysis module for performing document clustering and classification on the word segmentation results according to the frequency and similarity of keywords in the word segmentation results of the documents, and for obtaining analyzed and processed information after hotspot / sensitive topic analysis, orientation analysis and trend analysis of the clustered and classified results; and a display module for pushing the analyzed and processed information to users. The invention further provides a method for analyzing vertical public opinions based on an industry.
Owner:WUHAN TIPDM INTELLIGENT TECH

Search engine technology based on relevance feedback and clustering

CN101853272A (Inactive) · Benefits: meets the query requirements; relevant results are not thrown away · Classification: special data processing applications · Keywords: web page, retrieval result
The invention relates to a search engine technology based on relevance feedback and clustering. By using both user relevance feedback information and relevance ranking to direct the clustering of retrieval results, the invention ensures that the final partitioning of the retrieval results meets the user's query requirements. In the clustering process, a large number of documents and duplicate web pages that are irrelevant to the user are removed, which improves the clustering speed and optimizes the retrieval results at the same time. In the clustering process, a cluster center is not modified by a cluster irrelevant to the user, which ensures that result documents relevant to the user are not lost when noise is introduced by clustering irrelevant documents.
Owner:NORTH CHINA ELECTRIC POWER UNIV (BAODING)

Information search method and system based on interactive document clustering

The invention provides an information search method and system based on interactive document clustering. The method comprises the following steps: a document set is horizontally partitioned and preprocessed; word frequency statistics are computed, and high-frequency words constitute a characteristic word set; a vector space representation of the documents is generated, the distances between the documents are calculated, and a similarity matrix is generated; a Laplacian matrix is generated, the number of clusters and a representation matrix are determined according to the gaps between eigenvalues of the Laplacian matrix, secondary clustering is conducted, and initial distance results are obtained; users interact with the initial distance results, new characteristic words are mined through chi-square statistics, the vector space is reconstructed, and the clustering process is repeated; finally, the clustering results are shown to the users, so that the users obtain different categories of search results. The method and system adopt a semi-supervised learning approach in which the users intervene: the documents are clustered and analyzed, and the users obtain the different categories of search results.
Owner:PEKING UNIV

Document clustering methods, document cluster label disambiguation methods, document clustering apparatuses, and articles of manufacture

Document clustering methods, document cluster label disambiguation methods, document clustering apparatuses, and articles of manufacture are described. In one aspect, a document clustering method includes providing a document set comprising a plurality of documents, providing a cluster comprising a subset of the documents of the document set, using a plurality of terms of the documents, providing a cluster label indicative of subject matter content of the documents of the cluster, wherein the cluster label comprises a plurality of word senses, and selecting one of the word senses of the cluster label.
Owner:BATTELLE MEMORIAL INST

LDA fusion model and multilayer clustering-based news topic detection method

The invention belongs to the field of data mining, natural language processing and information retrieval, and provides a news topic detection method. For the defect of a TF-IDF-based vector space algorithm in semantics and the defects of time complexity and accuracy of textual level clustering, feature extraction, representation modeling, similarity calculation and quick and accurate text clustering methods for a large amount of news texts are improved. The LDA fusion model and multilayer clustering-based news topic detection method comprises the following steps of 1: building a similarity model by using a vector space model (VSM); 2: finally obtaining accurate parameter settings; 3: organically fusing two text models; 4: judging whether a topic is a new topic or not; 5: calculating the similarity until all documents are clustered; and 6: adding an ISP&AH clustering algorithm of AHC based on the step 5. The method is mainly applied to the design and manufacturing occasions.
Owner:TIANJIN UNIV

System and method for clustering documents

Provided are a system and method of clustering documents. The system includes a document DB, a document feature writing unit storing documents, a document retrieving unit, a clustering unit, and a cluster DB. The document DB stores documents. The document feature writing unit extracts attribute information of documents stored in the document database, and writes indexes with respect to the respective documents on the basis of the attribute information. The document retrieving unit retrieves documents including a query input by a user, using the indexes. The clustering unit includes a representative vector calculator calculating feature vectors and a representative vector of the retrieved documents, and a similarity calculator calculating similarities between the documents using the feature vectors and the representative vector. The cluster database stores documents clustered by the clustering unit.
Owner:LG ELECTRONICS INC

Chinese Web document online clustering method based on common substrings

The invention discloses a Chinese Web document online clustering method based on common substrings. As known to all, search engines are important in application of information searching and positioning with sharp increase of information on the internet. Web document clustering can automatically classify return results of the search engines according to different themes so as to assist users to reduce query range and fast position needed information. The Web document online clustering is characterized in that non-numerical and non-structured characteristics of Web documents are required to be met on the one hand, and clustering time is required to meet online search requirements of users on the other hand. According to the two characteristics, the invention provides the Chinese Web document online clustering method based on common substrings, and the method comprises steps as follows: (1) firstly, preprocessing the first n query results returned by the search engines so as to realize deleting and replacing operation of non-Chinese characters in the return results of the search engines, (2) extracting common substrings in the Web documents by utilizing GSA, (3) presenting a weighting calculation formula referring to TF*IDF according to the common substrings which are extracted and then building a document characteristic vector model, (4) computing pairwise similarity of the Web documents on the basis of the model to acquire a similarity matrix, (5) adopting an improved hierarchical clustering algorithm to achieve clustering of the Web documents on the basis of the matrix, and (6) executing clustering description and label extraction. The Chinese Web document online clustering method based on common substrings has obvious advantages on performance, clustering label generation and clustering time effects.
Owner:BEIHANG UNIV

Unsupervised document clustering using latent semantic density analysis

According to one embodiment, a latent semantic mapping (LSM) space is generated from a collection of a plurality of documents, where the LSM space includes a plurality of document vectors, each representing one of the documents in the collection. For each of the document vectors considered as a centroid document vector, a group of document vectors is identified in the LSM space that are within a predetermined hypersphere diameter from the centroid document vector. As a result, multiple groups of document vectors are formed. The predetermined hypersphere diameter represents a predetermined closeness measure among the document vectors in the LSM space. Thereafter, a group from the plurality of groups is designated as a cluster of document vectors, where the designated group contains a maximum number of document vectors among the plurality of groups.
Owner:APPLE INC
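
The density step can be illustrated with a short sketch. It assumes the document vectors have already been mapped into the LSM space (that mapping is not shown), uses Euclidean distance as the closeness measure, and treats the predetermined hypersphere size as a single `radius` parameter; the function name `densest_group` is hypothetical.

```python
import math


def densest_group(vectors, radius):
    """Treat each vector as a candidate centroid and keep the largest neighborhood.

    Returns (index of the chosen centroid vector, indices of its group members).
    Ties are resolved in favor of the earliest candidate.
    """
    best_center, best_members = None, []
    for i, center in enumerate(vectors):
        members = [
            j for j, v in enumerate(vectors) if math.dist(center, v) <= radius
        ]
        if len(members) > len(best_members):
            best_center, best_members = i, members
    return best_center, best_members
```

For example, `densest_group([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5]], radius=0.5)` picks the three vectors near the origin as the cluster.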

Method and system for web document clustering

A method and system for web document clustering are provided. The method comprises the steps of: inputting a plurality of web documents; collecting information on the links and the directory structure of the inputted web documents; extracting, according to the collected links and directory structure, a hierarchical structure for the plurality of web documents; and generating and outputting, based on the extracted hierarchical structure, one or more clusters of the plurality of web documents. In some embodiments, the hierarchical relations of the generated clusters can also be outputted at the same time. Compared with the prior art, the method and system for web document clustering according to the present invention can substantially improve the accuracy and efficiency of web document clustering.
Owner:NEC (CHINA) CO LTD
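
A very reduced version of the directory-structure part of this idea is sketched below. It ignores hyperlinks entirely and groups URLs only by host and leading path directories; the `depth` parameter and the rule that a last path segment containing a dot is a file name are assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def directory_key(url, depth=1):
    """Return (host, leading directories) for a URL, dropping a trailing file name."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    # Assumed heuristic: a final segment with a dot is a file, not a directory.
    dirs = parts[:-1] if parts and "." in parts[-1] else parts
    return (parsed.netloc, tuple(dirs[:depth]))


def cluster_by_directory(urls, depth=1):
    """Group URLs that share the same host and first `depth` directories."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[directory_key(url, depth)].append(url)
    return dict(clusters)
```

Raising `depth` splits clusters along deeper levels of the directory tree, which gives a simple hierarchy of clusters.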

Document clustering method

A document clustering method is used for conducting document mining on a document set using a latent Dirichlet allocation (LDA) model. The document clustering method comprises at least the following steps: training the latent Dirichlet allocation algorithm on a first document set D1 to obtain parameters beta and phi, wherein the number of topics K is preset in the algorithm; according to the parameter phi, using information entropy theory to filter the first document set D1 to obtain a second document set D2; according to the parameter beta, grouping the second document set D2 to generate a third document set D3 containing grouping information; and running the FG-Kmeans algorithm on the third document set D3 to obtain a final set of cluster centers C and a mark matrix U. According to the document clustering method, documents are grouped according to the latent Dirichlet allocation algorithm and the FG-Kmeans algorithm is used to process the grouped documents, so that the problem of high-dimensional and sparse data in document mining is addressed, and the concept of feature grouping is introduced into the feature space, so that the information contained in the feature space is richer.
Owner:南方电网互联网服务有限公司

Method for detecting burst topic in user generation text stream based on graph clustering

The invention relates to a method for detecting a burst topic in a user-generated text stream based on graph clustering and belongs to the technical field of internet data mining. The method offers a new, graph-based view of the conventional topic detection problem: detecting a burst topic in the text stream is converted into a typical graph clustering problem, so the problem can be solved by using conventional graph theory methods. The method comprises the following main steps: acquiring the text stream; detecting the burst topic; constructing a burst word graph; and clustering burst words. The method targets the detection of burst topics in user-generated text streams and performs better than conventional methods based on document clustering, probabilistic topic models and burst feature clustering.
Owner:TSINGHUA UNIV

Document clustering

Methods and systems for clustering document collections are disclosed. A system for clustering observations may include a processor and a processor-readable storage medium. The processor-readable storage medium may contain one or more programming instructions for performing a method of clustering observations. A plurality of parameter vectors and a plurality of observations may be received. A distribution may also be determined. An optimal partitioning of the observations may then be selected based on the distribution, the parameter vectors and a likelihood function.
Owner:XEROX CORP

Keyword calculation method based on document clustering

The invention relates to a keyword calculation method based on document clustering. The method comprises the following steps of: (1) obtaining a text document set; (2) performing word entry segmentation on all document contents in the document set by a word segmentation algorithm; (3) building a document vector; (4) calculating the document vector by the TF-IDF (Term Frequency-Inverse Document Frequency); (5) performing dimension compression on the document vector; (6) performing document clustering calculation; and (7) calculating representative keywords of each group of documents. The keyword calculation method has the beneficial effects that complete feasible calculation steps are provided; the document vector dimension compression is innovatively supported; and the calculation efficiency is high. When the dimension compression of the document vector is executed, a concise and efficient novel method different from any one technology in the prior art is adopted. The keyword calculation method belongs to a first technical scheme capable of calculating the representative keywords from the document set by connecting different links through feasible calculation steps.
Owner:HAINAN UNIVERSITY
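
Steps (3), (4) and (7) above can be sketched as follows. The whitespace tokenizer stands in for the word segmentation algorithm, the dimension compression and clustering steps are omitted, and the cluster labels are supplied by the caller; the function names and the `top_k` parameter are assumptions for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {w: (c / len(tokens)) * math.log(n / df[w]) for w, c in tf.items()}
        )
    return vectors


def group_keywords(vectors, labels, top_k=2):
    """Sum TF-IDF scores within each group and keep the top_k words per group."""
    totals = {}
    for vec, label in zip(vectors, labels):
        bucket = totals.setdefault(label, Counter())
        for w, score in vec.items():
            bucket[w] += score
    return {
        label: [w for w, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]]
        for label, c in totals.items()
    }
```

Words that appear in every document get an IDF of zero with this formula, so they never surface as representative keywords.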

Joint approach to feature and document labeling

Documents of a set of documents are represented by bag-of-words (BOW) vectors. L labeled topics are provided, each labeled with a word list comprising words of a vocabulary that are representative of the labeled topic and possibly a list of relevant documents. Probabilistic classification of the documents generates for each labeled topic a document vector whose elements store scores of the documents for the labeled topic and a word vector whose elements store scores of the words of the vocabulary for the labeled topic. Non-negative matrix factorization (NMF) is performed to generate a document-topic model that clusters the documents into k topics where k>L. NMF factors representing L topics of the k topics are initialized to the document and word vectors for the L labeled topics. In some embodiments the NMF factors representing the L topics initialized to the document and word vectors are frozen, that is, are not updated by the NMF after the initialization.
Owner:XEROX CORP

Generating templates from user's past documents

Automatic generation of document templates based on recognized composition element patterns in a group of clustered documents is provided. Composition elements used in documents are typically unique to a particular user or to a group of users. An automated template generation system detects composition element patterns in documents associated with a given user. Sequences of composition elements from one document are aligned with sequences of composition elements of one or more other documents. The aligned sequences are scored to generate a document distance matrix. The documents are clustered together based on the alignment scores and a document template is generated for each corresponding cluster of documents. In one or more aspects, selecting a document template and updating it results in a modified document template or, in certain cases, a new document template. The generated document templates are displayed in a user interface for selection by a user.
Owner:MICROSOFT TECH LICENSING LLC
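
The alignment-and-scoring step can be approximated with the standard library. This sketch assumes each document is already reduced to a list of composition element names, and it substitutes `difflib.SequenceMatcher` for whatever alignment algorithm the patent actually uses; the function name `distance_matrix` is hypothetical.

```python
from difflib import SequenceMatcher


def distance_matrix(sequences):
    """Pairwise distances between element sequences: 1 minus the alignment ratio."""
    n = len(sequences)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matcher = SequenceMatcher(None, sequences[i], sequences[j], autojunk=False)
            # ratio() is 1.0 for identical sequences and 0.0 when nothing matches.
            dist[i][j] = dist[j][i] = 1.0 - matcher.ratio()
    return dist
```

The resulting matrix can be passed to any clustering routine that accepts precomputed distances, with one template derived per cluster.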

Method and system relating to re-labelling multi-document clusters

Individuals receive an overwhelming barrage of information which must be filtered, processed, analysed, reviewed, consolidated and distributed or acted upon. However, prior art tools for automatically processing content, for example returning search results from an Internet or database search, are ineffective. Prior art search techniques merely provide large numbers of “hits” with at most removal of multiple occurrences of identical items. However, it would be beneficial to present searches as a series of multi-document clusters wherein occurrences of commonly themed content are clustered, allowing the user to rapidly see the number of different themes and review a selected theme. Further, it would be beneficial, in repeated searches, for new clusters to be identified automatically and for new items of content related to existing clusters to be associated with those clusters.
Owner:WHYZ TECH
