Document similarity measurement method and system based on keyword sequence structure
A keyword-sequence and document-similarity technology, applied in unstructured text data retrieval, text database querying, special data processing applications, and similar fields.
Examples
Embodiment 1
[0037] The present invention provides a method for measuring document similarity based on keyword position structure distribution, the method comprising:
[0038] S1: store two documents W_1 and W_2, each of which has multiple natural paragraphs, and perform word segmentation and stop-word removal on each of the two stored documents separately.
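Step S1 can be illustrated with a minimal sketch. The source does not name a segmenter, a stop-word list, or a paragraph delimiter, so everything below is an assumption: paragraphs are taken to be separated by blank lines, a regular expression stands in for a real word segmenter, and `STOP_WORDS`, `split_paragraphs`, and `preprocess` are hypothetical names.

```python
import re

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "to"}


def split_paragraphs(document: str) -> list[str]:
    """Split a document into natural paragraphs (assumed: blank-line separated)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def preprocess(document: str) -> list[list[str]]:
    """Return one token list per paragraph, lowercased, with stop words removed."""
    result = []
    for paragraph in split_paragraphs(document):
        # Regex tokenization is a stand-in for a real word segmenter.
        tokens = re.findall(r"\w+", paragraph.lower())
        result.append([t for t in tokens if t not in STOP_WORDS])
    return result
```

For text in a language without whitespace word boundaries, the regex step would have to be replaced by a dedicated word segmenter; the rest of the sketch would stay the same.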
[0039] S2: set the keyword sequence, and search documents W_1 and W_2 for the set of positions at which all keywords in the keyword sequence appear in sequence order;
[0040] An occurrence of the keyword sequence S in W_1 means that the m keywords of S appear once, in sequence, in document W_1. Searching document W_1 for one occurrence of the keyword sequence S yields the occurrence positions of the m keywords, recorded as Ponit = {p_1, p_2, ..., p_m}; all occurrences together form the set of occurrences of S in the document, where p_i is, for keyword i in document W_1, a certain occurr...
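One possible reading of the occurrence definition can be sketched as follows. It assumes that an occurrence is any tuple of strictly increasing token positions (p_1, ..., p_m) at which the i-th token equals the i-th keyword, and that positions are indices into the token list left after stop-word removal. The source text is truncated at this point, so the exact definition may differ; `find_occurrences` is a hypothetical name.

```python
def find_occurrences(tokens: list[str], keywords: list[str]) -> list[tuple[int, ...]]:
    """Enumerate every (p_1, ..., p_m) with p_1 < ... < p_m and tokens[p_i] == keywords[i]."""
    if not keywords:
        return []
    occurrences: list[tuple[int, ...]] = []

    def extend(prefix: list[int], i: int) -> None:
        # All m keywords have been placed: record one complete occurrence.
        if i == len(keywords):
            occurrences.append(tuple(prefix))
            return
        # The next keyword must appear strictly after the previous one.
        start = prefix[-1] + 1 if prefix else 0
        for p in range(start, len(tokens)):
            if tokens[p] == keywords[i]:
                extend(prefix + [p], i + 1)

    extend([], 0)
    return occurrences


tokens = ["search", "engine", "index", "search", "index"]
print(find_occurrences(tokens, ["search", "index"]))
# [(0, 2), (0, 4), (3, 4)]
```

Enumerating every increasing tuple can grow quickly when keywords repeat often, so a practical variant might restrict occurrences to a window or to non-overlapping matches; that choice is not specified in the visible text.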
Embodiment 2
[0054] The present invention also provides a document similarity measurement system based on keyword sequence structure, comprising:
[0055] Document preprocessing module 1, used to store two documents W_1 and W_2, each of which has multiple natural paragraphs, and to perform word segmentation and stop-word removal on each of the two stored documents separately;
[0056] Occurrence position statistics module 2, used to set the keyword sequence and to search documents W_1 and W_2 for the set of positions at which all keywords in the keyword sequence appear in sequence order;
[0057] An occurrence of the keyword sequence S in W_1 means that the m keywords of S appear once, in sequence, in document W_1. Finding one occurrence of the keyword sequence S in document W_1 yields the occurrence positions of the m keywords, Ponit = {p_1, p_2, ..., p_m}; all occurrences together form the set of occurrences of S in the document, where p_i, for keyword i, I...
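The two modules described so far could be wired together roughly as below. This is a sketch under stated assumptions, not the patented system: the class names, the stop-word set, the regex tokenizer, and the reading of an occurrence as a strictly increasing tuple of token positions are all hypothetical, and the remaining modules of the system are not visible in the text and are omitted.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DocumentPreprocessingModule:
    """Module 1: stores documents as token lists (stop-word set is hypothetical)."""
    stop_words: frozenset = frozenset({"a", "an", "the", "of", "in", "and"})
    store: dict = field(default_factory=dict)

    def add(self, name: str, text: str) -> None:
        # Regex tokenization stands in for a real word segmenter.
        tokens = re.findall(r"\w+", text.lower())
        self.store[name] = [t for t in tokens if t not in self.stop_words]


@dataclass
class OccurrencePositionStatisticsModule:
    """Module 2: holds the keyword sequence and collects its occurrences."""
    keywords: tuple

    def occurrences(self, tokens: list) -> list:
        if not self.keywords:
            return []
        partial = [()]
        for keyword in self.keywords:
            # Extend each partial occurrence with every later position of the keyword.
            partial = [
                prefix + (p,)
                for prefix in partial
                for p, token in enumerate(tokens)
                if token == keyword and (not prefix or p > prefix[-1])
            ]
        return partial


module1 = DocumentPreprocessingModule()
module1.add("W1", "The search engine builds an index; search the index.")
module1.add("W2", "An index of the search results.")
module2 = OccurrencePositionStatisticsModule(("search", "index"))
print(module2.occurrences(module1.store["W1"]))  # [(0, 3), (0, 5), (4, 5)]
print(module2.occurrences(module1.store["W2"]))  # []
```

In W2 the keyword "index" precedes "search", so no in-order occurrence exists and the result is empty.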