119 results about "Document similarity" patented technology

Document similarity (or distance between documents) is one of the central themes in information retrieval. How do humans usually judge how similar two documents are? Documents are usually treated as similar if they are semantically close and describe similar concepts. On the other hand, "similarity" can also be used in the context of duplicate detection.
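
The two senses of similarity mentioned above can be contrasted with a minimal sketch. It assumes whitespace tokenization, uses bag-of-words cosine similarity as a crude stand-in for topical closeness, and uses Jaccard overlap of word shingles as a common duplicate-detection measure. The function names and the shingle size `k=3` are illustrative choices, not taken from any of the patents listed here.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between bag-of-words count vectors (ignores word order)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def shingles(text, k=3):
    """Set of overlapping k-word sequences; sensitive to word order."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}

def jaccard(a, b, k=3):
    """Jaccard overlap of shingle sets, a typical near-duplicate signal."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

Two texts that use the same words in a different order score as identical under the bag-of-words cosine but share only some of their shingles, which is why duplicate detection usually relies on order-sensitive measures.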

Document similarity detection and classification system

A document similarity detection and classification system is presented. The system employs a case-based method of classifying electronically distributed documents in which content chunks of an unclassified document are compared to the sets of content chunks comprising each of a set of previously classified sample documents in order to determine a highest level of resemblance between an unclassified document and any of a set of previously classified documents. The sample documents have been manually reviewed and annotated to distinguish document classifications and to distinguish significant content chunks from insignificant content chunks. These annotations are used in the similarity comparison process. If a significant resemblance level exceeding a predetermined threshold is detected, the classification of the most significantly resembling sample document is assigned to the unclassified document. Sample documents may be acquired to build and maintain a repository of sample documents by detecting unclassified documents that are similar to other unclassified documents and subjecting at least some similar documents to a manual review and classification process. In a preferred embodiment the invention may be used to classify email messages in support of a message filtering or classification objective.
Owner:GLASS JEFFREY B MR
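
The case-based classification described in this abstract can be sketched roughly as follows. This is a simplified illustration, not the patented method: it assumes chunks are normalized lines, measures resemblance as the Jaccard overlap with each sample's significant chunks, and uses hypothetical names (`Sample`, `chunks`, `resemblance`, `classify`) and an arbitrary default threshold of 0.5.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    label: str              # classification assigned during manual review
    significant: frozenset  # chunks a reviewer annotated as significant

def chunks(text):
    """Split text into normalized, non-empty lines (an assumed chunking rule)."""
    return frozenset(line.strip().lower() for line in text.splitlines() if line.strip())

def resemblance(doc_chunks, sample):
    """Jaccard overlap between the document's chunks and the sample's significant chunks."""
    union = doc_chunks | sample.significant
    if not union:
        return 0.0
    return len(doc_chunks & sample.significant) / len(union)

def classify(text, samples, threshold=0.5):
    """Return the label of the most similar sample, or None if no sample exceeds the threshold."""
    doc_chunks = chunks(text)
    best = max(samples, key=lambda s: resemblance(doc_chunks, s), default=None)
    if best is None or resemblance(doc_chunks, best) <= threshold:
        return None
    return best.label
```

For example, with one sample labeled "spam" whose significant chunks are "win a prize now" and "click here", a message containing both lines plus a greeting has a resemblance of 2/3 and is classified as "spam".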

Document similarity calculation method and near-duplicate document detection method and device

The invention relates to a document similarity calculation method and a near-duplicate document detection method and device. The calculation method comprises the following steps: performing word segmentation on the two documents to be compared to obtain a token set for each document; calculating the edit similarity of all token pairs across the two token sets, where the two tokens in each pair come from different token sets; creating edges between the token pairs whose edit similarity meets a given requirement to obtain a weighted bipartite graph, where the edit similarity of a token pair is the weight of the corresponding edge; calculating the maximum weighted matching value of the weighted bipartite graph; and calculating the similarity between the two documents from the maximum weighted matching value. According to the abstract, the method achieves high accuracy, effectively identifies near-duplicate texts whose token sets contain editing errors, increases near-duplicate detection accuracy, lowers computational complexity, and improves computational efficiency.
Owner:HUAWEI TECH CO LTD +1
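
The word segmentation, pairwise edit similarity, and maximum weighted matching steps in this abstract can be illustrated with a small sketch. Several parts are assumptions rather than details from the patent: whitespace splitting stands in for word segmentation, edit similarity is taken as 1 minus the Levenshtein distance divided by the longer length, the edge threshold `min_sim=0.6` is arbitrary, the final score is normalized by the size of the larger token set, and the matching is found by brute force, which is exact but only practical for very small sets (a real system would use something like the Hungarian algorithm).

```python
from itertools import permutations

def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def max_weight_matching(left, right, min_sim=0.6):
    """Exact maximum weighted bipartite matching by brute force (tiny inputs only)."""
    if len(left) > len(right):
        left, right = right, left
    weights = {}
    for a in left:
        for b in right:
            s = edit_similarity(a, b)
            weights[a, b] = s if s >= min_sim else 0.0  # 0.0 means "no edge"
    return max(sum(weights[a, b] for a, b in zip(left, perm))
               for perm in permutations(right, len(left)))

def document_similarity(doc_a, doc_b, min_sim=0.6):
    """Matching value normalized by the larger token set (an assumed normalization)."""
    tokens_a = sorted(set(doc_a.lower().split()))
    tokens_b = sorted(set(doc_b.lower().split()))
    size = max(len(tokens_a), len(tokens_b))
    if size == 0:
        return 0.0
    return max_weight_matching(tokens_a, tokens_b, min_sim) / size
```

Because a misspelled token still matches its original with a high edge weight, a document with a few typos scores close to 1.0 against its source instead of being penalized as if the words were entirely different.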

Fast document type recognition method based on full-sized feature extraction

The invention provides a fast document type recognition method based on full-sized feature extraction. The method comprises the following steps of document image preprocessing, including image zooming, graying and noise filtering; document image feature extraction, including Hessian matrix building, scale space generation, primary determination of feature points, precise positioning of the feature points, main direction determination of selected feature points, feature point descriptor construction and feature value string generation; document image feature value comparison, including document similarity calculation and comparison algorithm optimization. According to the method, an image is preprocessed through software, and additional hardware equipment does not need to be added. According to the method, the scale-invariant feature is creatively introduced for improving a typical SURF feature extraction algorithm, so that the problem of matching failure due to error amplification of the SURF algorithm caused by scale variations is fundamentally solved. The method has the advantage that a multi-thread technology and a large cache are used for solving the problems of large data volume calculation during comparison and the harsh time requirement of a user on an electronic government affair platform.
Owner:FOSHAN UNIVERSITY

Document similarity distinguishing method based on Fourier transform

The invention provides a document similarity distinguishing method based on Fourier transform. The method comprises the following steps: acquiring the keyword sequence Ks of a document collection S and the corresponding keyword frequency collection Ns, as well as the keyword sequence Ks' of the detection document s' relative to the document collection S and its corresponding keyword frequency collection Ns'; calculating the weight coefficient of each keyword in the keyword sequences Ks and Ks' to obtain the weight sequence FKs of the keyword sequence Ks and the weight sequence FKs' of the keyword sequence Ks'; applying a Fourier transform to the weight sequences FKs and FKs'; calculating the threshold value Omega S for the similarity distance between the detection document s' and any document in the document collection S; calculating the similarity distance D(s', si) between the detection document s' and each document si in the document collection S, and comparing the similarity distance D with the threshold value Omega S; and judging whether the detection document s' and the document collection S are similar. According to the abstract, the method not only relaxes the requirements on how documents are represented when calculating similarity, but also reduces computational complexity and improves computational efficiency.
Owner:STATE GRID CORP OF CHINA +4
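
The general shape of the Fourier-based comparison in this abstract can be sketched as below. The abstract does not define the weight coefficients, the distance D, or how Omega S is derived, so this sketch makes its own assumptions: weights are keyword frequencies normalized to sum to 1, the distance is the Euclidean distance between magnitude spectra of a plain discrete Fourier transform, and the threshold is simply passed in as a parameter. All function names are hypothetical.

```python
import cmath
import math
from collections import Counter

def weight_sequence(text, keywords):
    """Normalized keyword frequencies in a fixed keyword order (assumed weighting)."""
    counts = Counter(text.lower().split())
    total = sum(counts[k] for k in keywords)
    return [counts[k] / total if total else 0.0 for k in keywords]

def dft(seq):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(seq)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(seq))
            for k in range(n)]

def spectral_distance(a, b):
    """Euclidean distance between magnitude spectra (an assumed form of D)."""
    return math.sqrt(sum((abs(x) - abs(y)) ** 2 for x, y in zip(dft(a), dft(b))))

def is_similar(doc, collection, keywords, threshold):
    """True if any document in the collection lies within the given distance threshold."""
    target = weight_sequence(doc, keywords)
    return any(spectral_distance(target, weight_sequence(d, keywords)) <= threshold
               for d in collection)
```

With keywords `["patent", "similarity", "fourier"]`, a detection document with the same keyword proportions as a collection document has distance 0, while a document that only repeats one keyword lands far from it.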