Method for establishing and searching feature matrix of Web document based on semantics

A feature-matrix technique for Web documents, applied in the field of information retrieval, that addresses problems such as the difficulty of improving retrieval at the semantic level and the loss of semantic information.

Inactive Publication Date: 2008-08-27
EAST CHINA NORMAL UNIV
0 Cites, 79 Cited by

AI Technical Summary

Problems solved by technology

However, the traditional LSA model does not consider this relationship at the conceptual level, so it is difficult to improve at the semantic level, and a large amount of semantic information is lost.



Examples


Embodiment 1

[0085] Embodiment 1. Establishing the feature matrix of a semantics-based Web document

[0086] Assume that five Web documents have been collected from the Internet (step one), with the following contents:

[0087] Document 1: Public transit

[0088] Train, plane, car, bus, subway

[0089] Document 2: Traffic Jam

[0090] Document 3: Transportation Industry

[0091] Document 4: lifeline of public transportation

[0092] Document 5: Buses and subways are the main means of transportation

[0093] First, use a word segmentation tool to count the word frequencies of nouns, pronouns, locative words, personal names, place names, institution names, and other proper names in each document (step two). This yields a keyword-document term frequency matrix (Table 3 below, corresponding to steps three, four, and five).
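The counting step can be sketched roughly as follows. This is a minimal illustration, not the patented procedure: the segmented keyword lists are hypothetical English stand-ins for the output of a word segmentation tool, and the formula idf_i = log(N / n_i) is an assumed, commonly used definition, since the source does not show how idf_i is computed.

```python
import math
from collections import Counter

# Hypothetical, already-segmented keyword lists standing in for the five
# example documents; a real system would obtain them from a word segmenter.
docs = [
    ["public", "transit", "train", "plane", "car", "bus", "subway"],
    ["traffic", "jam"],
    ["transportation", "industry"],
    ["lifeline", "public", "transportation"],
    ["bus", "subway", "means", "transportation"],
]

def term_frequency_matrix(docs):
    """Return the vocabulary and a keyword -> per-document frequency row."""
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    tf = {w: [c[w] for c in counts] for w in vocab}
    return vocab, tf

def document_frequency(tf):
    """n_i: the number of documents in which keyword i occurs."""
    return {w: sum(1 for f in row if f > 0) for w, row in tf.items()}

def inverse_document_frequency(n, num_docs):
    """Assumed definition idf_i = log(N / n_i); the source does not give it."""
    return {w: math.log(num_docs / n_i) for w, n_i in n.items()}

vocab, tf = term_frequency_matrix(docs)
n = document_frequency(tf)
idf = inverse_document_frequency(n, len(docs))

for w in vocab:
    print(f"{w:15s} {tf[w]}  n_i={n[w]}  idf_i={idf[w]:.3f}")
```

Each printed row corresponds to one keyword row of a table like Table 3: the per-document frequencies, followed by n_i and idf_i.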

[0094] Table 3. Keyword-document term frequency matrix with $n_i$ and $idf_i$

[0095] Keyword\document (word frequency)

[...

Embodiment 2

[0100] Embodiment 2. Semantics-based retrieval method for Web documents

[0101] Assume that the query is "public transportation", and that the data source being searched is the five documents whose feature matrix was established in Embodiment 1.

[0102] Establish the ontology: assume that the established traffic ontology is as shown in Figure 5 (corresponding to the first preparation step):

[0103] According to $SN(N_1, N_2) = \dfrac{Depth(com\_parent(N_1, N_2))}{Height(root \ldots}$ ...

Embodiment 3

[0129] Embodiment 3. Using the traditional LSA algorithm

[0130] Suppose there are five documents, and their contents are:

[0131] Document 1: Public Transportation

[0132] train, plane, car, bus, subway

[0133] Document 2: Traffic Jam

[0134] Document 3: Transportation Industry

[0135] Document 4: The lifeblood of public transport

[0136] Document 5: Buses and subways are the main means of transportation

[0137] Suppose the search content is: public transportation

[0138] First, use a word segmentation tool to count the word frequencies of nouns, pronouns, locative words, personal names, place names, institution names, and other proper names in each document. This yields a keyword-document term frequency matrix.
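For orientation, the traditional LSA baseline can be sketched with a truncated singular value decomposition. This is an illustration under stated assumptions, not the computation in the patent: the keyword lists are hypothetical English stand-ins for the segmented documents, raw term frequencies are used without idf weighting, the rank k = 2 is arbitrary, and the query is folded in with the standard formula q_hat = q^T U_k S_k^{-1}.

```python
import numpy as np

# Hypothetical segmented keywords standing in for the five example documents.
docs = [
    ["public", "transportation", "train", "plane", "car", "bus", "subway"],
    ["traffic", "jam"],
    ["transportation", "industry"],
    ["lifeblood", "public", "transportation"],
    ["bus", "subway", "means", "transportation"],
]
vocab = sorted({w for d in docs for w in d})

# Keyword-document term frequency matrix: one row per keyword,
# one column per document.
A = np.array([[d.count(w) for d in docs] for w in vocab], dtype=float)

def lsa_scores(A, q, k=2):
    """Cosine similarity between a query and each document in a rank-k space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T   # rows of Vk are documents
    q_hat = (q @ Uk) / sk                       # fold the query into the space
    dots = Vk @ q_hat
    norms = np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat)
    return dots / (norms + 1e-12)               # small guard against zero norms

# Query "public transportation" as a keyword vector over the same vocabulary.
query = np.array([1.0 if w in ("public", "transportation") else 0.0 for w in vocab])
scores = lsa_scores(A, query)
ranking = np.argsort(-scores) + 1               # document numbers, best first
print(scores.round(3), ranking)
```

Because this baseline matches only on co-occurrence statistics, it has no access to ontology-level relations between concepts, which is the gap the semantic method is meant to address.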

[0139] Table 5. Keyword-document term frequency matrix with $n_i$ and $idf_i$

[0140] Keyword\document (word frequency) | Document 1 | Document 2 | Document 3 | Document 4 | Document 5 | $n_i$ | idf...


Abstract

The invention relates to a method for establishing and retrieving a feature matrix of a semantics-based Web document, and belongs to the technical field of information retrieval. When the feature matrix of a Web document is established, the position information and the special presentation-form information specific to Web documents are added to the indexing process of the existing LSA model, which effectively improves the existing LSA method. Retrieval proceeds as follows: first, the concepts in the query sentence are semantically expanded according to an ontology; second, a query vector is generated from the query concepts and their expanded concepts, with the vector values reflecting the similarity between each query concept and its expanded concepts, which compensates to some extent for the semantic loss of the existing LSA model. The method provides scientific indexing and effective retrieval of unstructured document information, allows unstructured information to be retrieved anywhere and at any time, and helps users obtain the information they need conveniently and promptly.
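The two retrieval steps described in the abstract can be sketched as follows. This is a hedged illustration only: the similarity values, the `threshold` parameter, and the function names are hypothetical, and in the patent the similarities are derived from its own ontology rather than from a hand-written table.

```python
# Hypothetical ontology-derived similarities between a query concept and
# related concepts; these numbers are illustrative assumptions only.
SIMILARITY = {
    "public transportation": {"bus": 0.8, "subway": 0.8, "train": 0.6, "car": 0.3},
}

def expand_query(concepts, similarity, threshold=0.5):
    """Step 1: add related concepts whose similarity reaches the threshold.

    Query concepts get weight 1.0; expanded concepts get their similarity.
    The threshold is an assumption introduced for this sketch.
    """
    weights = {}
    for concept in concepts:
        weights[concept] = 1.0
        for related, sim in similarity.get(concept, {}).items():
            if sim >= threshold:
                weights[related] = max(weights.get(related, 0.0), sim)
    return weights

def query_vector(weights, vocab):
    """Step 2: lay the weights out over the keyword vocabulary."""
    return [weights.get(term, 0.0) for term in vocab]

vocab = ["bus", "car", "public transportation", "subway", "train"]
weights = expand_query(["public transportation"], SIMILARITY)
q = query_vector(weights, vocab)
print(q)
```

The resulting vector can then be compared against the document feature matrix in the same way as an ordinary keyword query, with expanded concepts contributing in proportion to their similarity.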

Description

Technical field

[0001] The invention relates to a method for establishing and retrieving a feature matrix of a semantics-based Web document, and belongs to the technical field of information retrieval.

Background

[0002] As database technology has developed, the retrieval of formatted data has become relatively mature, and document retrieval based on string matching can already be realized. However, there is no effective retrieval method for the large volume of unformatted documents (mainly data that is not held in databases, such as Web documents). How to let users find the information they need most effectively and most accurately in a vast collection of free text has become a hot topic in the field of Chinese information retrieval.

[0003] The development of Web search engine technology makes it possible to retrieve massive amounts of Web page information on the Internet. However, this kind of retrieval also has its own disadvantages: the basic ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 顾君忠, 杨静, 李子成, 贺梁, 吕钊, 王麒, 江开忠
Owner: EAST CHINA NORMAL UNIV