119 results about "Document similarity" patented technology

Document similarity (or distance between documents) is one of the central themes in information retrieval. How do humans usually judge how similar two documents are? Documents are usually treated as similar if they are semantically close and describe similar concepts. On the other hand, “similarity” can also be used in the context of duplicate detection.

Document similarity detection and classification system

Inactive | US20050060643A1 | Natural language data processing | Data switching networks | Document similarity | Document preparation

A document similarity detection and classification system is presented. The system employs a case-based method of classifying electronically distributed documents in which content chunks of an unclassified document are compared to the sets of content chunks comprising each of a set of previously classified sample documents in order to determine a highest level of resemblance between an unclassified document and any of a set of previously classified documents. The sample documents have been manually reviewed and annotated to distinguish document classifications and to distinguish significant content chunks from insignificant content chunks. These annotations are used in the similarity comparison process. If a significant resemblance level exceeding a predetermined threshold is detected, the classification of the most significantly resembling sample document is assigned to the unclassified document. Sample documents may be acquired to build and maintain a repository of sample documents by detecting unclassified documents that are similar to other unclassified documents and subjecting at least some similar documents to a manual review and classification process. In a preferred embodiment the invention may be used to classify email messages in support of a message filtering or classification objective.

Owner:GLASS JEFFREY B MR
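
The chunk-comparison idea in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: content chunks are modeled as word trigrams, resemblance as the fraction of a sample's significant chunks found in the unclassified document, and the names `chunks`, `resemblance`, `classify`, and the threshold value are hypothetical. The abstract does not specify any of these details.

```python
def chunks(text, size=3):
    """Split text into overlapping word n-grams (an assumed stand-in for 'content chunks')."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def resemblance(doc_chunks, significant_chunks):
    """Fraction of a sample's significant chunks that also occur in the document."""
    if not significant_chunks:
        return 0.0
    return len(doc_chunks & significant_chunks) / len(significant_chunks)


def classify(text, samples, threshold=0.5):
    """Return the label of the most closely resembling sample, or None below the threshold."""
    doc_chunks = chunks(text)
    best_label, best_score = None, 0.0
    for label, significant_chunks in samples:
        score = resemblance(doc_chunks, significant_chunks)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Hypothetical, manually labeled sample documents (only significant chunks are kept).
samples = [
    ("spam", chunks("win a free prize now click here")),
    ("newsletter", chunks("your weekly engineering digest is ready")),
]

print(classify("hello friend win a free prize now click here today", samples))
print(classify("meeting moved to three pm tomorrow", samples))
```

A document that resembles no sample closely enough stays unclassified, which mirrors the abstract's point that such documents can be routed to manual review.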

Device system and method for determining document similarities and differences

Inactive | US20050010863A1 | Data processing applications | Digital computer details | Document similarity | Paper document

A device, system and method of outputting information is disclosed to compare at least two documents, namely, at least a first document and a second document, to facilitate visual mapping and comparison of these documents. These documents comprise document subsections and the subsections comprise document subsection headers associated therewith. At least one of the first document subsection headers is juxtaposed relative to an output of second document subsection headers mapping thereto, to visually emphasize a header mapping. This header mapping is established by: mapping the first document subsections relative to the second document subsections based on identifying substantial similarities therebetween, to establish a subsection mapping therebetween; and, in relation to the subsection mapping and the association between the document subsections and the subsection headers, further mapping the first document subsection headers relative to the second document subsection headers. Several closely-related devices, systems and methods for outputting information to compare at least two documents, establishing a mapping to compare at least two documents, and highlighting similar text segments, are also disclosed.

Owner:MOUNTAVOR INC DBA DEALSUMM

Document similarity scoring and ranking method, device and computer program product

Inactive | US7689559B2 | Avoids large and wasted effort | Small similarity score | Data processing applications | Web data indexing | Document similarity | Collation

A device, computer program product and a method for searching, navigating or retrieving documents in a set of electronic documents, including performing a link analysis of the set of electronic documents. The link analysis includes one of analyzing at least two of the set of documents with at least a portion of a similarity graph constructed among the set of documents and analyzing the at least two of the set of documents with the at least a portion of the similarity graph and at least a portion of a hyperlink graph constructed from hyperlinks between the set of documents. Also described is a method for building a similarity matrix.

Owner:TELENOR AS
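
As a rough illustration of link analysis over a similarity graph, the sketch below builds a similarity matrix and ranks documents with a PageRank-style power iteration. This is only one common form of link analysis; the abstract does not say which method the patent uses. The Jaccard measure over word sets, the damping factor, and the function names are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(docs):
    """Symmetric matrix of pairwise similarities with a zero diagonal."""
    sets = [set(d.lower().split()) for d in docs]
    n = len(sets)
    return [[0.0 if i == j else jaccard(sets[i], sets[j]) for j in range(n)]
            for i in range(n)]


def rank(matrix, damping=0.85, iterations=50):
    """PageRank-style power iteration over the weighted similarity graph."""
    n = len(matrix)
    scores = [1.0 / n] * n
    row_sums = [sum(row) for row in matrix]
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                if row_sums[i] == 0:
                    # A document similar to nothing spreads its score evenly.
                    new[j] += damping * scores[i] / n
                else:
                    new[j] += damping * scores[i] * matrix[i][j] / row_sums[i]
        scores = new
    return scores


docs = [
    "graph link analysis",
    "link analysis of documents",
    "graph analysis of documents",
    "banana bread recipe",
]
matrix = similarity_matrix(docs)
scores = rank(matrix)
for doc, score in sorted(zip(docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```

The abstract also mentions combining the similarity graph with a hyperlink graph; under the same assumptions, that could be approximated by adding hyperlink weights to `matrix` before ranking.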

Device system and method for determining document similarities and differences

Inactive | US7260773B2 | Data processing applications | Digital computer details | Document similarity | Documentation

Two documents are processed to facilitate visual mapping and comparison. These documents comprise document subsections and the subsections comprise document subsection headers associated therewith. At least one of the first document subsection headers is juxtaposed relative to an output of second document subsection headers mapping thereto, to visually emphasize a header mapping. This header mapping is established by: mapping the first document subsections relative to the second document subsections based on identifying substantial similarities therebetween, to establish a subsection mapping therebetween; and, in relation to the subsection mapping and the association between the document subsections and the subsection headers, further mapping the first document subsection headers relative to the second document subsection headers. Several closely-related devices, systems and methods for outputting information to compare at least two documents, establishing a mapping to compare at least two documents, and highlighting similar text segments, are also disclosed.

Owner:MOUNTAVOR INC DBA DEALSUMM

Document similarity scoring and ranking method, device and computer program product

Inactive | US20070185871A1 | Avoids large | Avoids wasted computational effort | Data processing applications | Web data indexing | Hyperlink | Electronic document

A device, computer program product and a method for searching, navigating or retrieving documents in a set of electronic documents, including performing a link analysis of the set of electronic documents. The link analysis includes one of analyzing at least two of the set of documents with at least a portion of a similarity graph constructed among the set of documents and analyzing the at least two of the set of documents with the at least a portion of the similarity graph and at least a portion of a hyperlink graph constructed from hyperlinks between the set of documents. Also described is a method for building a similarity matrix.

Owner:TELENOR AS

Method and system for machine-learning based optimization and customization of document similarities calculation

Inactive | US20120136812A1 | Digital data processing details | Digital computer details | Learning based | Document similarity

One embodiment of the present invention provides a system for optimizing and customizing document-similarity calculation. During operation, the system presents a collection of similar documents to a user, collects feedback on the similarity of the documents from the user, generates generic rules for calculating document similarity, and filters documents with customized similarity calculation based on the feedback provided by the user.

Owner:PALO ALTO RES CENT INC

Document similarity calculation apparatus, clustering apparatus, and document extraction apparatus

Inactive | US7451139B2 | Efficient extraction | Improve performance | Data processing applications | Digital data information retrieval | Document similarity | Document preparation

An input section inputs a document set. A normalization section calculates a similarity as a relative value between documents, with respect to combinations of a plurality of documents, in the document set. The normalization section employs the tf·idf method. In the tf·idf method, a document vector and the significance of a word included in the document are used to perform normalization to convert each similarity to an absolute value.

Owner:FUJITSU LTD

Document similarity calculating method and similar document whole-network retrieval tracking method

Inactive | CN106095737A | Accurate processing of similarity | Accurate analysis and statistics | Semantic analysis | Text database indexing | Cosine similarity | Two-vector

The invention relates to a document similarity calculating method and a similar document whole-network retrieval tracking method. The technical scheme is characterized in that the document similarity calculating method includes: S01, performing word segmentation on an original document and a target document to acquire respective word segmentation sets; S02, performing preprocessing and feature weighting: utilizing TF-IDF technology to calculate weight of each word segmentation, extracting core key words, utilizing Word2vec to dig correlation degree among different word segmentation in the documents, and performing semantic analysis on each document; S03, adopting a vector space model and a cosine similarity algorithm: utilizing a cosine value of an included angle of two vectors in vector space to evaluate similarity of the documents, wherein the cosine value is between 0 and 1, and the greater the cosine value is, the higher the similarity of the documents is. The document similarity calculating method and the similar document whole-network retrieval tracking method are suitable for news information redistribution tracking and transmissibility statistics.

Owner:HANGZHOU FANEWS TECH
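
Steps S02 and S03 of the abstract above (TF-IDF weighting followed by cosine similarity in a vector space model) can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization stands in for word segmentation, the smoothed IDF formula is one common variant chosen here, the Word2vec step is omitted, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumed variant) keeps every weight positive.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the cat sat on the mat".split(),
    "the cat lay on the mat".split(),
    "stock prices fell sharply today".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shares most terms with the first document
print(cosine(vecs[0], vecs[2]))  # shares no terms with the first document
```

Because all TF-IDF weights are non-negative, the cosine value stays between 0 and 1, which matches the range stated in the abstract.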

Universal Document Similarity

Inactive | US20130054612A1 | Digital data information retrieval | Semantic analysis | Document similarity | More language

Described herein are methods for finding substantially similar / different sources (files and documents), and estimating similarity or difference between given sources. Similarity and difference may be found across a variety of formats. Sources may be in one or more languages such that similarity and difference may be found across any number and types of languages. A variety of characteristics may be used to arrive at an overall measure of similarity or difference including determining or identifying syntactic roles, semantic roles and semantic classes in reference to sources.

Owner:ABBYY DEV INC

Document citation network visualization and document recommendation method and system

Active | CN105589948A | Avoid the "cold start" problem | Guaranteed accuracy | Web data indexing | Special data processing applications | Time information | Document similarity

The invention discloses a document citation network visualization and document recommendation method and system, and relates to the fields of document influence analysis and information visualization. The method comprises the following steps of calculating the importance degree of documents according to authors, time information, citation numbers and other inherent attributes, document similarities and transfer values generated by introducing behavior quantitative analysis, and sorting the documents; then performing clustering on the sorted documents and performing visualization on the clustered result to establish a dual-layer network model, and displaying important documents in a clear manner; and finally, recommending the documents in the center of clustering displayed in the visualization manner to a user. The document citation network visualization and document recommendation method and system are high in usability and can help science researchers to rapidly screen out most authoritative papers.

Owner:CHONGQING UNIV OF POSTS & TELECOMM

Text semantic analysis method

Pending | CN109271626A | Overcome deficiencies | Improve accuracy | Semantic analysis | Special data processing applications | Object structure | Document similarity

A text semantic analysis method and system can realize semantic analysis of text data based on the lexical level and the sentence level. For semantic analysis at the lexical level, the invention first adopts an improved word segmentation algorithm to solve the problem that English words are segmented only by spaces. Second, based on the word segmentation, TF-IDF modeling is performed to obtain weight values. The text is then vectorized by weighting and summing the weight values and the word vectors trained by Word2Vec, and finally the document similarity is solved. At the same time, the invention considers the contribution degree of the vocabulary to the document content and the semantic status to calculate the similarity degree of the documents; the result has higher accuracy and provides a good foundation for subsequent text clustering. For sentence-level semantic analysis, the invention extracts subject-predicate-object structures based on text segmentation, part-of-speech tagging, syntactic analysis and dependency relations. The invention realizes the extraction of subject-predicate-object structures of various sentence types in all aspects, and realizes the noun expansion function, which is more consistent with the manual extraction result.

Owner:BEIJING UNIV OF TECH
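
The lexical-level step described above (vectorizing a text as a weighted sum of word vectors, then comparing documents) might look like the sketch below. The two-dimensional `embeddings` and the `weights` table are toy values invented for illustration; in the method described they would come from a trained Word2Vec model and from TF-IDF modeling respectively. Using cosine as the final comparison is also an assumption, since the abstract does not name the measure.

```python
import math


def document_vector(tokens, embeddings, weights):
    """Sum each token's word vector scaled by its weight; unknown tokens are skipped."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for token in tokens:
        if token in embeddings:
            w = weights.get(token, 0.0)
            for k, value in enumerate(embeddings[token]):
                total[k] += w * value
    return total


def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-ins for trained word vectors and TF-IDF weights (hypothetical values).
embeddings = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "stock": [0.0, 1.0],
    "the": [0.5, 0.5],
}
weights = {"cat": 2.0, "dog": 2.0, "stock": 2.0, "the": 0.1}

vec_a = document_vector(["the", "cat"], embeddings, weights)
vec_b = document_vector(["the", "dog"], embeddings, weights)
vec_c = document_vector(["the", "stock"], embeddings, weights)
print(cosine(vec_a, vec_b))
print(cosine(vec_a, vec_c))
```

The low weight on "the" shows why the weighting matters: a frequent, uninformative word contributes little to the document vector even though it appears in every text.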

Document processing device and document processing method

Inactive | US20090265344A1 | More accuracy | Accurate calculation | Web data indexing | Special data processing applications | Document similarity | Document preparation

An object of the present invention is to provide a document processing device and document processing method that can provide a search result satisfactory to a user with respect to WWW documents in which a number of links among WWW documents is low and a number of accesses by users is low. An access pattern collection unit 101 generates an access user vector uj of one WWW document Dj and an access user vector uje of another document Dje. A user similarity computing unit 105 computes a document similarity sim (uj, uje) which indicates a user similarity between the WWW document Dj and WWW document Dje. A keyword vector smoothing unit 106 acquires a smoothed keyword weight vector w′j by correcting a keyword weight vector wj in one document, using the computed document similarity sim (uj, uje). A rearranging unit 110 calculates an evaluation value B_SCORE for input information for searching, based on the smoothed keyword weight vector w′j.

Owner:NTT DOCOMO INC

Topic model-based judgment document similarity analysis method

Inactive | CN107291688A | Improve accuracy | Improve applicability | Semantic analysis | Special data processing applications | Document similarity | Input selection

The invention discloses a topic model-based judgment document similarity analysis method. A semantic-based semi-automatic and universal similarity analysis method is proposed for judgment documents by adopting an LDA (Latent Dirichlet Allocation) topic model in machine learning. The method mainly comprises the steps of selecting corpora; establishing a similarity tag; performing text preprocessing; performing input selection; performing parameter setting; performing iterative training; generating a model; applying the model; and the like. Based on a general similarity analysis method, the characteristics of rich specialized vocabularies and complex semantics in contents of the judgment documents are fully considered, and the semi-structured characteristics of the judgment documents are utilized, so that the accuracy and applicability of judgment document similarity analysis are improved.

Owner:NANJING UNIV

Systems and methods to control work progress for content transformation based on natural language processing and/or machine learning

Inactive | US20140108103A1 | Resources | Longest common substring problem | Documentation procedure

Systems and methods are provided to compute indicators of completeness of the work output of a transformation of text-based content, worker capacity in performing the transformation, and / or the degree of matching between a unit of work and a worker, based on information collected about complexity of works, times and throughput of workers, rating of work outputs and using natural language processing techniques and machine learning techniques, such as language detection, longest common substring, length ratio, document similarity, etc. The indicators are utilized to optimize job pickup and output submission for online crowdsourcing tasks related to transformation of text-based content, such as transcription, translation, proofreading, etc.

Owner:GENGO
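
Two of the techniques the abstract names, longest common substring and length ratio, are simple enough to sketch directly. How the patent combines them into its indicators is not stated, so the functions below only compute the raw quantities; the names and the character-level comparison are assumptions.

```python
def longest_common_substring(a, b):
    """Return the longest run of characters that appears contiguously in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def length_ratio(source, output):
    """Length of the work output relative to the source text (0.0 for an empty source)."""
    return len(output) / len(source) if source else 0.0


print(longest_common_substring("abcdef", "zcdemf"))
print(length_ratio("hello world", "hello"))
```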

Document similarity calculation method and near-duplicate document detection method and device

Inactive | CN104252445A | Improve detection accuracy | Efficient identification | Natural language data processing | Special data processing applications | Computation complexity | Document similarity

The invention relates to a document similarity calculation method and a near-duplicate document detection method and device. The calculation method comprises the following steps: performing word segmentation processing on two documents to be detected to obtain respective participle sets of the documents to be detected; calculating the edit similarity of all participle pairs in the two participle sets, wherein two participles in each participle pair come from the two participle sets respectively; establishing edges among the participle pairs of which the edit similarity meets a certain requirement in all the participle pairs to obtain a weighted bi-graph, wherein the edit similarity is the weight of the edge of the corresponding participle pair; calculating the maximum weighted matching value of the weighted bi-graph; calculating the similarity between the documents to be detected by using the maximum weighted matching value. By adopting the document similarity calculation method and the near-duplicate document detection method and device provided by the invention, high accuracy is achieved, near-duplicate texts comprising participle set edit errors can be identified effectively, the near-duplicate document detection accuracy is increased, the calculation complexity is lowered, and the calculation efficiency is optimized.

Owner:HUAWEI TECH CO LTD +1
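
The steps in the abstract above (pairwise edit similarity, a weighted bipartite graph, maximum weighted matching) can be sketched for small inputs as follows. Several details are assumptions: edit similarity is taken as one minus the normalized Levenshtein distance, the edge requirement is modeled as a `min_sim` cutoff, the final score is normalized by the larger set size, and the matching is found by brute force over permutations, which is only practical for a handful of words. A production implementation would use a polynomial-time assignment algorithm instead.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """1.0 for identical strings, 0.0 for strings with nothing aligned."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def document_similarity(tokens_a, tokens_b, min_sim=0.6):
    """Maximum-weight bipartite matching value, normalized by the larger set size."""
    if not tokens_a or not tokens_b:
        return 0.0
    small, large = sorted((tokens_a, tokens_b), key=len)
    # Edge weights; pairs below the cutoff get no edge (weight 0).
    weights = []
    for x in small:
        row = []
        for y in large:
            s = edit_similarity(x, y)
            row.append(s if s >= min_sim else 0.0)
        weights.append(row)
    # Brute-force search over all one-to-one assignments (small inputs only).
    best = max(
        sum(weights[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(len(large)), len(small))
    )
    return best / len(large)


original = ["near", "duplicate", "document"]
with_typos = ["near", "duplicat", "documnet"]
print(document_similarity(original, with_typos))
```

Matching words by edit similarity rather than exact equality is what lets the score stay high when the second text contains misspellings.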

System and method for measuring SVG document similarity

Inactive | US20070083808A1 | Improve performance | Effective application | Digital data information retrieval | Data processing applications | Document similarity | Data mining

A system and method for measuring the similarity between SVG documents. The present invention involves reducing the respective documents to their minimal logical representations and then analyzing the representations using tree isomorphism techniques. Applications can then use this comparison data to more efficiently perform actions such as content compression, content streaming and content searching.

Owner:WSOU INVESTMENTS LLC
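
A minimal sketch of the idea in the abstract above: reduce each SVG document to a bare element tree, then test the trees for isomorphism. The reduction used here (keep tag names only, drop namespaces, attributes and text, ignore sibling order) is an assumption about what a "minimal logical representation" might be, and the check is an exact yes/no test rather than a graded similarity score.

```python
import xml.etree.ElementTree as ET


def canonical(elem):
    """Canonical string of an unordered labeled tree: tag name plus sorted child encodings."""
    tag = elem.tag.rsplit("}", 1)[-1]  # strip the "{namespace}" prefix if present
    children = sorted(canonical(child) for child in elem)
    return "(" + tag + "".join(children) + ")"


def isomorphic(svg_a, svg_b):
    """True if both documents reduce to the same tag tree, ignoring sibling order."""
    return canonical(ET.fromstring(svg_a)) == canonical(ET.fromstring(svg_b))


a = ('<svg xmlns="http://www.w3.org/2000/svg">'
     '<g><rect width="1"/><circle r="2"/></g><path d="M0 0"/></svg>')
b = ('<svg xmlns="http://www.w3.org/2000/svg">'
     '<path d="M1 1"/><g><circle r="5"/><rect width="9"/></g></svg>')
c = ('<svg xmlns="http://www.w3.org/2000/svg">'
     '<g><rect/></g><circle/><path/></svg>')

print(isomorphic(a, b))
print(isomorphic(a, c))
```

Sorting the child encodings before joining them is what makes the comparison independent of element order, so two trees get the same string exactly when they have the same shape and tags.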

Document detection method and system

Active | CN103294671A | Improve accuracy | Improve efficiency | Special data processing applications | Document similarity | The Internet

An embodiment of the invention provides a detection method and system, relates to the field of data processing technology, and solves the problem that the conventional approximate duplicated document detection method cannot meet higher requirements in terms of precision ratio and recall ratio. In the embodiment of the invention, a method combining multi-feature fingerprint inquiry and document similarity comparison is adopted, multi-feature fingerprints can accurately reflect discriminative features of a web page document to be detected and other web page documents, and records in accordance with conditions can be rapidly inquired according to a corresponding relation of the feature fingerprints and approximate duplicated documents in an existing database, so that the accuracy rate and the efficiency of approximate duplicated document detection can be improved. With the adoption of the detection method of the document similarity comparison, the situation that a web page document to be detected surely belongs to an approximate duplicated document but cannot be detected by the multi-feature fingerprint inquiry due to the fact that the database is defective can be prevented, so that the recall ratio of the approximate duplicated document detection is improved.

Owner:SHENZHEN SHI JI GUANG SU INFORMATION TECH

Determining a document similarity metric

Active | US7565348B1 | Data processing applications | Digital data information retrieval | Document similarity | Theoretical computer science

To perform multi-pattern searching, a preprocessing engine populates a SUFFIX table, a PREFIX table and a PATTERN table. The SUFFIX table combines data conventionally stored in SHIFT and HASH tables. Pointers in the SUFFIX table refer to corresponding segments in the PREFIX table. Each PREFIX table segment is sorted by a prefix hash. A PATTERN table includes a hash of each full pattern sorted and grouped into segments, with each segment corresponding to a suffix hash and prefix hash combination. Pointers in the PREFIX table refer to corresponding segments in the PATTERN table. The PREFIX and PATTERN tables can be kept in secondary storage, allowing potentially billions of patterns to be used. After preprocessing, patterns are evaluated against a source file. A document metric is determined to qualitatively describe the similarity between the source file and each pattern file.

Owner:PALAMIDA

Document similarity calculation apparatus, clustering apparatus, and document extraction apparatus

Inactive | US20030172058A1 | Digital data information retrieval | Digital data processing details | Document similarity | Document preparation

An input section inputs a document set. A normalization section calculates a similarity as a relative value between documents, with respect to combinations of a plurality of documents, in the document set. The normalization section employs the tf·idf method. In the tf·idf method, a document vector and the significance of a word included in the document are used to perform normalization to convert each similarity to an absolute value.

Owner:FUJITSU LTD

Measure of similarity of documentation based on document structure

Active | CN1959671A | Improve accuracy | Solve the problem of losing information about the distribution of words in various parts of the document | Special data processing applications | Documentation procedure | Structure analysis

A method for measuring file similarity based on file structure includes finding out subsubject sequence of each file from X and Y files to be compared by utilizing file structure analysis means, utilizing similarity measurement means to calculate similarity value between two subsubjects of different files, setting up weighted bipartite graph G={ X,Y, E } on obtained subsubject sequence and similarity value, solving out optimum match for weighted bipartite graph and carrying out normalization treatment on total weight value of optimum match so as to obtain similarity value of X and Y files.

Owner:PEKING UNIV

Judgement document similarity calculation method, judgement document similarity search device and computer equipment

Inactive | CN106933787A | Improve accuracy | Natural language data processing | Special data processing applications | Document similarity | Data mining

The invention provides a judgement document similarity calculation method, a judgement document similarity search device, and computer equipment. The judgement document similarity calculation method comprises the following steps of: obtaining at least two judgement documents; extracting judgement keywords of one or more defendants in each judgement document; and determining the similarity between corresponding judgement documents according to the similarity between the judgement keywords corresponding to the defendants in different judgement documents. According to the judgement document similarity calculation method, the judgement document similarity calculation correctness is improved.

Owner:SHANGHAI XIAOI ROBOT TECH CO LTD

Fast document type recognition method based on full-sized feature extraction

Inactive | CN105426884A | Fast operation | Avoid influence | Character and pattern recognition | Document similarity | Scale variation

The invention provides a fast document type recognition method based on full-sized feature extraction. The method comprises the following steps of document image preprocessing, including image zooming, graying and noise filtering; document image feature extraction, including Hessian matrix building, scale space generation, primary determination of feature points, precise positioning of the feature points, main direction determination of selected feature points, feature point descriptor construction and feature value string generation; document image feature value comparison, including document similarity calculation and comparison algorithm optimization. According to the method, an image is preprocessed through software, and additional hardware equipment does not need to be added. According to the method, the scale-invariant feature is creatively introduced for improving a typical SURF feature extraction algorithm, so that the problem of matching failure due to error amplification of the SURF algorithm caused by scale variations is fundamentally solved. The method has the advantage that a multi-thread technology and a large cache are used for solving the problems of large data volume calculation during comparison and the harsh time requirement of a user on an electronic government affair platform.

Owner:FOSHAN UNIVERSITY

System and method for measuring SVG document similarity

Inactive | US7403951B2 | Improve performance | Data processing applications | Digital data information retrieval | Document similarity | Application software

A system and method for measuring the similarity between SVG documents. The present invention involves reducing the respective documents to their minimal logical representations and then analyzing the representations using tree isomorphism techniques. Applications can then use this comparison data to more efficiently perform actions such as content compression, content streaming and content searching.

Owner:WSOU INVESTMENTS LLC

Document similarity distinguishing method based on Fourier transform

Active | CN103324664A | Avoid problems with high feature difficulty | Simple calculation | Special data processing applications | Fast Fourier transform | Weight coefficient

The invention provides a document similarity distinguishing method based on Fourier transform. The method comprises the following steps: acquiring the keyword sequence Ks of a document collection S and a corresponding keyword frequency collection Ns, as well as a keyword sequence Ks' of the detection document s' relative to the document collection S and its corresponding keyword frequency collection Ns'; calculating the weight coefficient of each of the keyword sequences Ks and Ks' as well as the weight sequence FKs of the keyword sequence Ks and the weight sequence FKs' of the keyword sequence Ks'; applying a Fourier transform to the weight sequences FKs and FKs'; calculating the threshold value Omega S of the similarity distance between the detection document s' and a random document in the document collection S; calculating the similarity distance D (s', si) between the detection document s' and the documents si in the document collection S, and comparing the similarity distance D with the threshold value Omega S; judging whether the detection document s' and the document collection S are similar or not. The document similarity distinguishing method based on Fourier transform provided by the invention can not only reduce the requirements on the document representation method when calculating similarity, but can also reduce the complexity of calculation and improve the computational efficiency.

Owner:STATE GRID CORP OF CHINA +4

Document similarity calculation method facing cloud storage based on fully homomorphic password technology

Active | CN104967693A | Improve security | Prevent external attacks | Transmission | Ciphertext | Document similarity

The invention discloses a document similarity calculation method for cloud storage based on fully homomorphic encryption. The method comprises the following steps: a data owner uploads a document ID, the encrypted document ciphertext and the ciphertext of the document hash value to a cloud server; a public key certificate is issued to the cloud service provider and the data user; the data user encrypts the simhash value of the document whose similarity is to be calculated and uploads it to the cloud service provider; the cloud service provider performs a fully homomorphic addition of the simhash ciphertext of the document to be compared and the simhash ciphertext of the data owner's document, and returns the result to the data owner; the data owner obtains the Hamming distances between the documents and returns the top-ranked document ID by distance to the cloud service provider. With this method, the calculation is performed on ciphertext. During the calculation, no information related to the documents is revealed to the cloud service provider or to other attackers, so the data owner's data and the data user's query data are both kept secret.
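
The distance step can be sketched in plaintext Python. This sketch contains no encryption at all: it only illustrates why adding two simhash values bit by bit is enough to recover a Hamming distance, under the assumption (not stated in the abstract) that the homomorphic addition acts on individual bits modulo 2. All function names and the 8-bit width are hypothetical.

```python
def to_bits(signature, width):
    # Little-endian list of the signature's bits.
    return [(signature >> i) & 1 for i in range(width)]


def add_bits_mod2(bits_a, bits_b):
    # Bitwise addition modulo 2 (equivalent to XOR). In the patented scheme
    # this would be performed on ciphertexts; here it is done in the clear.
    return [(a + b) % 2 for a, b in zip(bits_a, bits_b)]


def hamming_distance(sig_a, sig_b, width):
    # The number of 1 bits in the mod-2 sum is the Hamming distance.
    return sum(add_bits_mod2(to_bits(sig_a, width), to_bits(sig_b, width)))


def rank_ids(query_sig, stored_sigs, width=8, top_n=1):
    # Return the IDs of the stored documents closest to the query,
    # nearest first; ties are broken by document ID.
    distances = {doc_id: hamming_distance(query_sig, sig, width)
                 for doc_id, sig in stored_sigs.items()}
    return sorted(distances, key=lambda doc_id: (distances[doc_id], doc_id))[:top_n]
```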

Owner:武汉盛金源电子科技有限公司

Document similarity determining method based on maximum likelihood estimation

Active | CN104636325A | High precision | High similarity natural precision | Special data processing applications | Feature extraction | Value set

The invention discloses a document similarity determining method based on maximum likelihood estimation. The method includes the following steps: first, text features are extracted; second, the text feature sets are mapped to numerical values, giving the numerical value set Sd corresponding to each document; third, each numerical value set Sd is represented by a minwise fingerprint; fourth, the similarity a of two documents is calculated from the documents' minwise fingerprints and a maximum likelihood function. The method uses the probabilities of the possible outcomes (<, > and =) of a hash value comparison, designs a likelihood function that combines these probabilities, and thereby establishes a maximum likelihood minwise hash estimator. The method is also extended to determining the similarity of three documents, yielding high-precision text similarity. Because the mean variance obtained through the maximum likelihood method is minimal, the precision of the obtained similarity is inherently higher than that of the plain minwise method.
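
For orientation, the first three steps and the plain minwise baseline can be sketched in Python as follows. This is not the patented maximum likelihood estimator: it uses only the "=" outcome of each hash comparison, which is the standard minwise estimate that the abstract compares against. Character 3-grams as text features, MD5 as the numerical mapping, and linear hash functions modulo a prime are all assumptions of this sketch, as are the function names.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # modulus for the linear hash functions


def feature_values(text, n=3):
    # Steps 1-2 (assumed form): character n-grams mapped to integers.
    grams = {text[i:i + n] for i in range(max(1, len(text) - n + 1))}
    return {int(hashlib.md5(g.encode("utf-8")).hexdigest(), 16) % PRIME for g in grams}


def make_hashes(k, seed=0):
    # k hash functions h(x) = (a * x + b) mod PRIME with fixed random a, b.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minwise_fingerprint(values, hashes):
    # Step 3: keep the minimum hash value under each hash function.
    # `values` must be a non-empty set of integers.
    return [min((a * v + b) % PRIME for v in values) for a, b in hashes]


def estimate_similarity(fp_a, fp_b):
    # Baseline estimate: fraction of positions where the minima coincide.
    matches = sum(1 for x, y in zip(fp_a, fp_b) if x == y)
    return matches / len(fp_a)
```

Both fingerprints must be built with the same list of hash functions; more hash functions give a less noisy estimate at the cost of a longer fingerprint.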

Owner:CENT SOUTH UNIV

Large-scale document similarity detection method

Active | CN108595517A | Improve accuracy | Guaranteed detection volume | Special data processing applications | Data dredging | Hash function

The invention provides a large-scale document similarity detection method. The method comprises the steps of: S1, calculating the similarity of the other information of the documents in a document set; S2, associating each document's content with a signature S and an f-dimensional vector V; S3, performing word segmentation on the document content; S4, calculating a combined weight for each feature word x; S5, mapping the feature word x to a signature h by using a hash function, traversing all bits of h, and adjusting V; S6, traversing V, adjusting the signature S, and finally generating the signature value of S corresponding to the document content; S7, dividing the signature value into n blocks, mapping the blocks to buckets by using the hash function, and judging whether double hashing is performed; S8, taking the documents in the same bucket as candidate pairs and calculating their similarity; and S9, judging whether the documents are similar. The method has high detection accuracy and high execution efficiency, and can be widely used in large-scale Internet data mining.
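
Steps S5 to S8 resemble the well-known simhash signature with block-based bucketing, which can be sketched in Python as below. The sketch assumes f = 64, MD5 as the word hash, and feature weights that are already given (step S4's weighting formula is not stated in the abstract); it uses the block value itself as the bucket key and omits the double-hashing decision. All names are hypothetical.

```python
import hashlib
from collections import defaultdict

F = 64  # signature length in bits (the "f" of step S2, assumed to be 64)


def word_hash(word):
    # S5: map a feature word to an F-bit value h.
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")


def simhash(weighted_words):
    # S5: for each bit of h, add the word's weight to V if the bit is 1,
    # otherwise subtract it.
    v = [0.0] * F
    for word, weight in weighted_words.items():
        h = word_hash(word)
        for i in range(F):
            v[i] += weight if (h >> i) & 1 else -weight
    # S6: a positive entry of V sets the corresponding signature bit.
    signature = 0
    for i in range(F):
        if v[i] > 0:
            signature |= 1 << i
    return signature


def blocks(signature, n=4):
    # S7: split the signature into n equal blocks, tagged with their position.
    width = F // n
    mask = (1 << width) - 1
    return [(j, (signature >> (j * width)) & mask) for j in range(n)]


def candidate_pairs(signatures, n=4):
    # S7-S8: documents sharing any (position, block value) bucket are candidates.
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for key in blocks(sig, n):
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

Only the candidate pairs need a full similarity comparison (for example, the Hamming distance between their signatures), which is what keeps the approach tractable on large collections.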

Owner:NANJING UNIV OF POSTS & TELECOMM

Method for automatically classifying documents based on K nearest neighbor algorithm under power cloud environment

Inactive | CN102147813A | Improve execution efficiency | Reduced execution time | Special data processing applications | Document similarity | Near neighbor

The invention discloses a method for automatically classifying documents based on a K nearest neighbor algorithm in a power cloud environment. The method improves the MapReduce programming framework of cloud computing: the Map function calculates document similarity, and the Reduce function selects the K samples with the highest similarity, tallies the weights of the classes to which those nearest neighbors belong, and outputs the class with the largest weight, thereby classifying the documents automatically. The method can quickly classify a large quantity of documents, so the execution time of the classification task is greatly shortened and classification efficiency is improved; the method is also robust.
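
The division of work between the Map and Reduce functions can be sketched in plain Python, without any Hadoop dependency. The sketch assumes documents are term-frequency dictionaries, that similarity is cosine similarity, and that each neighbor's vote is weighted by its similarity; the abstract fixes none of these choices. The names `map_phase` and `reduce_phase` are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def cosine(vec_a, vec_b):
    # Cosine similarity between two sparse term-frequency dictionaries.
    dot = sum(value * vec_b.get(term, 0) for term, value in vec_a.items())
    norm_a = math.sqrt(sum(x * x for x in vec_a.values()))
    norm_b = math.sqrt(sum(x * x for x in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_phase(query_id, query_vec, training):
    # Map: emit (query id, (similarity, label)) for every training sample.
    for label, vec in training:
        yield query_id, (cosine(query_vec, vec), label)


def reduce_phase(query_id, scored, k):
    # Reduce: keep the K most similar samples, sum the similarity weight
    # per class, and output the class with the largest total weight.
    nearest = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    weights = defaultdict(float)
    for similarity, label in nearest:
        weights[label] += similarity
    return query_id, max(weights, key=weights.get)
```

In a real MapReduce job the framework would group the mapper output by query ID before calling the reducer; here the caller passes the grouped values directly.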

Owner:JIANGSU ELECTRIC POWER CO

Cross-language topic detection method and system

Active | CN106202065A | Realize detection | Improve accuracy | Natural language translation | Special data processing applications | Document similarity | Algorithm

The invention discloses a cross-language topic detection method and system. The method comprises the following steps: building a comparable corpus of a first language and a second language; building a first language topic model and a second language topic model on the basis of the comparable corpus; and determining the alignment of first language topics and second language topics through a similarity judgment based on the document-topic probability distributions generated by the two topic models, so as to realize cross-language topic detection. The system comprises a first generation module, a second generation module and a detection module. The method and system improve the accuracy of cross-language document similarity calculation; by building topic models based on LDA (latent Dirichlet allocation), cross-language topic detection is realized through cross-language topic alignment.
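
One way to picture the alignment step is the Python sketch below. It assumes that row d of each document-topic matrix describes the same comparable document pair, that a topic is characterized by its column of probabilities across those documents, and that topics are matched by highest cosine similarity. These are illustrative assumptions; the abstract does not say which similarity measure or matching rule is used, and `align_topics` is a hypothetical name.

```python
import math


def column(matrix, j):
    # Probabilities of topic j across all documents.
    return [row[j] for row in matrix]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align_topics(doc_topic_a, doc_topic_b):
    # doc_topic_x[d][t] = P(topic t | document d) from the topic model of
    # language x. For each first-language topic, pick the second-language
    # topic whose column is most similar.
    n_a = len(doc_topic_a[0])
    n_b = len(doc_topic_b[0])
    alignment = {}
    for i in range(n_a):
        col_a = column(doc_topic_a, i)
        scores = [cosine(col_a, column(doc_topic_b, j)) for j in range(n_b)]
        best = max(range(n_b), key=scores.__getitem__)
        alignment[i] = (best, scores[best])
    return alignment
```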

Owner:MINZU UNIVERSITY OF CHINA

News and case similarity calculation method based on asymmetric twin network

Active | CN110717332A | Solve the problem of redundant content | Easy to learn | Semantic analysis | Text processing | Document similarity | Theoretical computer science

The invention relates to a news and case similarity calculation method based on an asymmetric twin network, and belongs to the technical field of natural language processing. The method comprises the following steps: first, the sentences most relevant to the news title are selected to represent the document, by calculating the similarity between each sentence in the text and the title, so that redundant sentences in the news text are removed; next, the document and the case are modeled with an asymmetric twin network, and, because the case elements carry the key semantic information of the case, the case elements are fused into the asymmetric twin network as supervision information to encode the news document and the case description; finally, the correlation between the news and the case is judged by calculating document similarity. Because similarity is calculated on the news text and the case description with an asymmetric twin network, semantic encoding can model the unbalanced news text and case description, and the accuracy of similarity calculation can be improved.

Owner:KUNMING UNIV OF SCI & TECH
