Full text query and search systems and methods of use

A full-text retrieval system and search engine technology, applied in the field of information technology and software. It addresses known problems of keyword search: topics that are difficult to define with a few keywords, results that do not reflect the user's intent, and an excessive number of hits.

Status: Inactive. Publication date: 2007-12-12
英孚威尔公司

AI Technical Summary

Problems solved by technology

In most cases, it may be difficult to completely define a related topic with a small number of keywords
[0010] 2) The "hit-heavy" problem: i.e., reporting many irrelevant results
[0011] 3) The rating of "hits" may not fulfi...


Examples

example 1

[0118] Example 1: Implementation of Theoretical Model

[0119] Details of a specific embodiment of the search engine of the present invention will be disclosed in this section.

[0120] 1. Introduction to the FlatDB programs

[0121] FlatDB is a set of C programs for working with flat-file databases. That is, they are tools for handling flat text files with very large contents. The file format can vary (tabular, XML, FASTA, or any other form) as long as each entry has a unique primary key. Typical applications include large sequence databases (GenPept, dbEST), the human genome and other genome databases, PubMed, Medline, etc.

[0122] The tool set includes an indexing program, a retrieval program, an insertion program, an update program, and a deletion program. For very large entries, there is also a program for retrieving a specific part of an entry. Unlike SQL, FlatDB does not support joins between different files. For example, ...
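
The indexing and retrieval programs described above can be illustrated with a minimal sketch. This is not the FlatDB code (which the text says is written in C, and whose interfaces are not given here); it only assumes FASTA-style entries whose header line begins with ">" followed by the primary key. The function names `build_index` and `fetch` are hypothetical.

```python
import io

def build_index(stream):
    """Map each entry's primary key to the byte offset of its header line."""
    index = {}
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:  # skip malformed headers that carry no key
                index[fields[0].decode("ascii")] = offset
    return index

def fetch(stream, index, key):
    """Return one whole entry by seeking directly to its recorded offset."""
    stream.seek(index[key])
    lines = [stream.readline()]  # the header line
    while True:
        line = stream.readline()
        if not line or line.startswith(b">"):
            break  # end of file or start of the next entry
        lines.append(line)
    return b"".join(lines).decode("ascii")

# A tiny in-memory stand-in for a flat file (made-up data).
flat_file = io.BytesIO(b">id1 first entry\nAAAA\nCCCC\n>id2 second entry\nGGGG\n")
index = build_index(flat_file)
entry = fetch(flat_file, index, "id2")
print(index)
print(entry)
```

Because the index stores byte offsets, retrieval reads only the requested entry rather than scanning the whole file, which is the point of indexing a very large flat file.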

example 2

[0219] Example 2: A database example for Medline

[0220] The following database files have been processed:

[0221] 1) Medline.raw: the raw database, downloaded from NLM in XML format.

[0222] 2) Medline.fasta: the processed database.

[0223] Parsed entries follow the FASTA format:

[0224] >primary_id authors. (year) title. journal. volume: first page-last page

[0225] word1(freq)word2(freq)...

[0226] Words are picked out by features.
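
As a rough illustration of the entry layout above, the following sketch parses one header line and one word(freq) line. The function name `parse_entry`, the regular expression, and the sample entry are assumptions made for this example; the source does not specify how words or counts are delimited beyond the pattern shown.

```python
import re

# Matches one "word(freq)" pair; assumes words contain no whitespace.
WORD_FREQ = re.compile(r"(\S+?)\((\d+)\)")

def parse_entry(header, body):
    """Split a '>primary_id citation' line and a word(freq) line into fields."""
    primary_id, _, citation = header[1:].partition(" ")
    freqs = {word: int(count) for word, count in WORD_FREQ.findall(body)}
    return primary_id, citation, freqs

# Made-up sample entry in the layout described above.
header = ">12345 Smith J. (2001) A title. J Example. 12: 100-110"
body = "gene(3)binding(1)protein(2)"
primary_id, citation, freqs = parse_entry(header, body)
print(primary_id, freqs)
```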

[0227] 3) Medline.pid2bid: mapping from primary_id (pid) to binary_id (bid)

[0228] Medline.bid2pid: mapping from binary_id to primary_id

[0229] primary_id is the identifier defined in the FASTA file; it is the unique identifier used by Medline. binary_id is an identifier that we assign in order to save space.

[0230] Medline.pid2bid is a tabular file. Format: primary_id binary_id (sorted by primary_id)

[0231] Medline.bid2pid is a tabular file. Format: binary_id primary_id (sorted by binary_id)
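
The two mapping files can be pictured with a small sketch. It assumes that binary_ids are consecutive integers assigned in primary_id order, that columns are tab-separated, and that each table is ordered by its first column; none of these details is confirmed by the text. The names `build_id_maps` and `to_table` and the sample identifiers are hypothetical.

```python
def build_id_maps(primary_ids):
    """Assign a compact integer binary_id to every primary_id, in both directions."""
    pid2bid = {pid: bid for bid, pid in enumerate(sorted(primary_ids))}
    bid2pid = {bid: pid for pid, bid in pid2bid.items()}
    return pid2bid, bid2pid

def to_table(mapping):
    """Render a mapping as tab-separated lines ordered by the first column."""
    return "\n".join(f"{key}\t{value}" for key, value in sorted(mapping.items()))

# Made-up identifiers for illustration only.
pid2bid, bid2pid = build_id_maps(["PMID300", "PMID100", "PMID200"])
pid_table = to_table(pid2bid)
bid_table = to_table(bid2pid)
print(pid_table)
print(bid_table)
```

Keeping each table ordered by its lookup column would allow a binary search over the file instead of loading it whole, which fits the stated goal of saving space.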

[0232] ...

example 3

[0257] Example 3: How to generate a phrase dictionary

[0258] 1. Theoretical Aspects of Phrase Search

[0259] A phrase search is a search performed with a string of words rather than a single word. Example: a person might look up information about teenage abortion. Taken alone, each of these words has a different meaning and retrieves many unrelated documents, but combined in order they express the idea of "teenage abortion" very precisely. From this perspective, a phrase contains more information than the combination of its individual words.

[0260] In order to perform a phrase search, we need to first generate a phrase dictionary, and a distribution function for any given database, just as we have for individual words. A programmatic method for generating a phrase distribution for any given text database is disclosed herein. From a completely theoretical point of view, for any 2 words, 3 words, ..., K words, the frequency of each ca...
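
The purely theoretical starting point, counting every run of 2 to K consecutive words, can be sketched as follows. This is only the exhaustive count; the source text is cut off before it explains how the dictionary is actually built or pruned, so nothing here should be read as the patent's procedure. The function name `phrase_counts`, the whitespace tokenization, and the sample documents are assumptions.

```python
from collections import Counter

def phrase_counts(documents, max_words):
    """Count every run of 2..max_words consecutive words across all documents."""
    counts = Counter()
    for text in documents:
        words = text.lower().split()  # naive whitespace tokenization
        for n in range(2, max_words + 1):
            for start in range(len(words) - n + 1):
                counts[" ".join(words[start:start + n])] += 1
    return counts

docs = ["teenage abortion rates", "rates of teenage abortion"]
counts = phrase_counts(docs, 3)
print(counts.most_common(3))
```

The number of distinct candidates grows quickly with K and with database size, so a practical phrase dictionary would need some frequency threshold or other filter on top of this count.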


Abstract

The invention is a method for textual searching of text-based databases including databases of compiled internet content, scientific literature, abstracts for books and articles, newspapers, journals, and the like. Specifically, the algorithm supports searches using full-text or webpage as query and keyword searches allowing multiple entries and an information-content based ranking system (Shannon Information score) that uses p-values to represent the likelihood that a hit is due to random matches. Additionally, users can specify the parameters that determine hits and their ranking with scoring based on phrase matches and sentence similarities.
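
To give a sense of what an information-content based score can look like, here is a minimal sketch that uses the standard Shannon definition: a word's information is the negative base-2 logarithm of its relative frequency, so rare words weigh more than common ones. Scoring a hit as the sum over words shared by query and entry is an assumption made for this example, not the patent's formula, and the p-value computation is not covered. All names and counts are hypothetical.

```python
import math

def word_information(word_counts):
    """Shannon information of each word, in bits: -log2(relative frequency)."""
    total = sum(word_counts.values())
    return {word: -math.log2(count / total) for word, count in word_counts.items()}

def shared_information(query_words, entry_words, info):
    """Sum the information of the words that query and entry have in common."""
    shared = set(query_words) & set(entry_words)
    return sum(info[word] for word in shared if word in info)

# Made-up corpus counts: common words carry little information.
counts = {"the": 8, "gene": 4, "kinase": 2, "zebrafish": 2}
info = word_information(counts)
query = ["the", "kinase", "gene"]
score_a = shared_information(query, ["the", "gene"], info)
score_b = shared_information(query, ["kinase", "gene"], info)
print(score_a, score_b)
```

Both entries share two words with the query, yet the second scores higher because "kinase" is rarer than "the". This is the kind of distinction that a plain match count cannot make.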

Description

Technical field

[0001] The invention relates to the fields of information technology and software, and in particular to an information retrieval method with ranking for text-based databases.

Background art

[0002] Most traditional online computer-based search methods for text content databases are keyword-based: a database and its corresponding dictionary are first established. An index file of the database is associated with the dictionary and records the occurrences of each keyword and their positions in the database. When a query contains a keyword, all entries in the database containing that keyword are returned. In an "advanced search", a user can also specify excluded words; entries in which those words occur are not reported as hits.

[0003] The main problem with keyword-based search engines is how to rank hits when many entries contain the keyword. Consider first the c...
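
The traditional keyword approach described in the background (an index recording where each keyword occurs, plus optional excluded words) can be sketched as follows. This illustrates the prior-art scheme, not the invention; the function names, the whitespace tokenization, and the sample entries are assumptions.

```python
from collections import defaultdict

def build_inverted_index(entries):
    """Record, for every word, the (entry id, word position) pairs where it occurs."""
    index = defaultdict(list)
    for entry_id, text in entries.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((entry_id, position))
    return index

def keyword_search(index, keyword, excluded=()):
    """Return ids of entries containing the keyword and none of the excluded words."""
    hits = {entry_id for entry_id, _ in index.get(keyword.lower(), [])}
    for word in excluded:
        hits -= {entry_id for entry_id, _ in index.get(word.lower(), [])}
    return sorted(hits)

# Made-up entries for illustration only.
entries = {
    1: "protein kinase inhibitor",
    2: "kinase gene expression",
    3: "gene therapy",
}
index = build_inverted_index(entries)
all_hits = keyword_search(index, "kinase")
filtered_hits = keyword_search(index, "kinase", excluded=["gene"])
print(all_hits, filtered_hits)
```

Every matching entry comes back with equal standing, which makes the ranking problem raised in [0003] concrete: the index says where a keyword occurs but not which hit matters most.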


Application Information

IPC(8): G06F17/00
Inventors: 唐元华, 胡前进, 杨永红
Owner: 英孚威尔公司