Automatic metadata extraction method based on multiple rules in network search

An automatic extraction technology applied in the field of network search. It addresses the problems of cumbersome rules and poor extraction results, and it improves extraction accuracy with good practical value.

Inactive Publication Date: 2008-01-09
PEKING UNIV
Cites: 0, Cited by: 36

AI Technical Summary

Problems solved by technology

However, this makes the final rule set very cumbersome, and because words expressing a given semantic meaning keep emerging endlessly, the final extraction results are not very good. The problem is more pronounced in Chinese-language extraction.



Examples


Embodiment Construction

[0029] A specific implementation of the present invention is described in detail below, using the integration of educational resources as an example.

[0030] This embodiment describes a method for extracting metadata from the web pages of resource websites in the context of educational resource integration. Educational resource integration aims to provide an integrated platform of educational resources for online learners and teachers. As an important part of that work, the metadata extraction step must achieve good extraction accuracy on semi-structured web pages and be able to process loosely structured documents.

[0031] As shown in Figure 1, in this embodiment, the extraction of metadata includes the following steps:

[0032] 1. Rough web page preprocessing

[0033] Typically, extraction algorithms are better suited to processing XML documents and well-formed HTML documents. The "well-formed" mentioned here means that w...
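
The source text is cut off before it defines "well-formed", so the following is only a minimal sketch of what a preprocessing pass of this kind might look like. It assumes that normalization means lowercase tag names, quoted attribute values, and balanced, properly nested tags. The `Normalizer` class and `normalize` function are hypothetical names introduced for illustration; they use only the Python standard library and are not the patent's own implementation. Comments and doctype declarations are simply dropped in this sketch.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Normalizer(HTMLParser):
    """Re-emit rough HTML with lowercase tags, quoted attributes, and balanced tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the normalized document
        self.stack = []  # elements that are currently open

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        rendered = "".join(
            f' {name}="{escape(value or "", quote=True)}"' for name, value in attrs
        )
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{rendered}/>")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close every element left open inside the one being closed.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()  # flush any buffered text
        while self.stack:  # close whatever is still open at end of input
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


def normalize(html_text):
    """Return a normalized version of a rough HTML fragment."""
    parser = Normalizer()
    parser.feed(html_text)
    return parser.result()


rough = "<DIV class=a><P>Title<BR>x</DIV>"
print(normalize(rough))
```

In this sketch, the unclosed `<P>` is closed when its parent `<DIV>` ends, and the unquoted attribute value is quoted, so the output can be handed to a stricter XML-style parser in the later steps.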


Abstract

The method includes the following steps: (1) preprocess rough web pages, normalizing all pages into a more standard format; (2) perform preliminary positioning of the content in the web page file that contains the information to be extracted; (3) extract metadata, according to specified rules, from the content selected by preliminary positioning. The invention first separates the core area from the surrounding clutter at a coarse granularity, and then applies rule-based extraction to the core area only, which greatly improves extraction accuracy. The invention can also extract metadata from a web page according to multiple rules: the rules are matched in an order determined by their given priorities, and extraction is refined using the two-stage method.
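
As an illustration of the two-stage idea summarized above, here is a minimal sketch: stage one locates the core area of a page, and stage two applies prioritized rules to that area only. The abstract does not specify how rules are represented, so this sketch assumes each rule is a regular expression with one capturing group, that a lower priority number is tried first, and that the core area is delimited by two literal marker strings. The names `Rule`, `locate_core`, and `extract`, and the sample page, are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    field: str     # metadata field this rule fills
    priority: int  # lower number is tried earlier (assumed convention)
    pattern: str   # regular expression with one capturing group


def locate_core(page, start_marker, end_marker):
    """Stage 1: return the text between two markers, or None if either is missing."""
    start = page.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = page.find(end_marker, start)
    if end == -1:
        return None
    return page[start:end]


def extract(core, rules):
    """Stage 2: try rules in priority order; the first match for a field wins."""
    result = {}
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.field in result:
            continue  # a higher-priority rule already filled this field
        match = re.search(rule.pattern, core, re.S)
        if match:
            result[rule.field] = match.group(1).strip()
    return result


page = (
    '<div id="nav">Author: webmaster</div>'
    '<div id="main"><h1>Linear Algebra</h1><span>Author: Li Hua</span></div>'
    '<div id="footer">Author: webmaster</div>'
)
rules = [
    Rule("title", 1, r"<title>(.*?)</title>"),  # preferred, but absent from the core area
    Rule("title", 2, r"<h1>(.*?)</h1>"),        # fallback rule for the same field
    Rule("author", 1, r"Author:\s*([^<]+)"),
]
core = locate_core(page, '<div id="main">', '<div id="footer">')
print(extract(core, rules))
```

Restricting the rules to the core area is what keeps the "Author: webmaster" strings in the navigation and footer from being matched, and the priority ordering lets a fallback rule fill a field when the preferred rule finds nothing.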

Description

Technical field:

[0001] The invention belongs to the technical field of network search, and in particular relates to a method for subject search on Internet pages.

Background technique:

[0002] Metadata is data that describes data, or "data about data". It is used to describe the characteristics and attributes of data, and it is also a tool for describing, organizing, and discovering Internet information resources. Every field has some large-scale resource publishing websites. By extracting metadata from these resource websites, a large number of useful resources can be collected to help different users build databases in specific fields. The applications of metadata extraction are therefore very broad.

[0003] Metadata extraction plays a fundamental data-preparation role in the overall process of information organization and retrieval. The data source of the extraction process first undergoes necessary ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventors: 张铭, 杨宇
Owner: PEKING UNIV