Webpage metadata automatic extraction method and system based on multi-page comparison

A technology for the automatic extraction of webpage metadata, used in electrical digital data processing, special data processing applications, instruments, etc.

Inactive Publication Date: 2011-01-26
上海华燕房盟网络科技股份有限公司

AI Technical Summary

Problems solved by technology

[0004] The purpose of the present invention is to provide a method and system that solve the problem of automatically extracting webpage metadata by comparing multiple pages.

Embodiment Construction

[0114] The specific implementation of the present invention will be described in detail below in conjunction with an example of integrating housing information.

[0115] This specific embodiment describes a method for extracting metadata from the house listing pages of real estate websites on the Internet. The goal of integrating housing information is to give house seekers a single integrated platform: they only need to search on one website to find listings from all websites on the Internet. As an important part of this integration, the metadata extraction step must achieve good extraction accuracy on semi-structured web pages and be able to process loosely structured documents.

[0116] In this specific embodiment, the extraction of metadata includes the following steps:

[0117] 1. Configure web page collector

[0118] Here, the websites from which web pages are to be collected must be defined, and each website needs to defi...

Abstract

The invention provides a method and system for automatic webpage metadata extraction based on multi-page comparison, belonging to the field of Internet information processing. Internet pages are written in loose HTML (Hypertext Markup Language): the HTML syntax is not strictly validated, and semantics and presentation are mixed together, which makes webpage data extraction very difficult. The invention addresses this problem. Based on the hypothesis that dynamic pages are generated by filling the same template with different data, the template that generated a group of pages can be derived by comparing a plurality of similar pages. The system comprises the following components: (1) a webpage collector, which captures webpages from preset websites; (2) a webpage classifier, which groups similar pages together; (3) a webpage metadata analysis module, which derives the template and extracts the metadata; (4) a webpage metadata storage, which stores and indexes the metadata; and (5) a metadata search engine, which retrieves and displays the metadata.
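The template hypothesis in the abstract can be illustrated with a minimal sketch. The derivation procedure actually used by the invention is not shown in this excerpt, so the code below substitutes a simple stand-in: it tokenizes each page into tags and text, keeps the tokens that all pages share in order (using Python's `difflib.SequenceMatcher`), and treats whatever falls between those shared tokens as data. The function names `tokenize`, `derive_template`, and `extract_fields`, and the sample pages, are hypothetical.

```python
import re
from difflib import SequenceMatcher

# A token is either one HTML tag or one run of text between tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag tokens and whitespace-stripped text tokens."""
    return [t.strip() for t in TOKEN_RE.findall(html) if t.strip()]


def common_tokens(a, b):
    """Return the tokens of `a` that lie in matching blocks shared with `b`."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return shared


def derive_template(pages):
    """Assumed stand-in for template derivation: keep only tokens common to all pages."""
    token_lists = [tokenize(p) for p in pages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        template = common_tokens(template, tokens)
    return template


def extract_fields(template, page):
    """Return the token runs of `page` that fall between template tokens."""
    tokens = tokenize(page)
    matcher = SequenceMatcher(None, template, tokens, autojunk=False)
    fields, pos = [], 0
    for block in matcher.get_matching_blocks():
        if block.b > pos:  # tokens skipped by the template are data
            fields.append(" ".join(tokens[pos:block.b]))
        pos = block.b + block.size
    return fields


# Hypothetical listing pages generated from one template.
pages = [
    "<html><body><h1>Sunny Flat</h1><p>Price: <b>1200</b></p></body></html>",
    "<html><body><h1>Garden House</h1><p>Price: <b>3500</b></p></body></html>",
    "<html><body><h1>City Loft</h1><p>Price: <b>2100</b></p></body></html>",
]
template = derive_template(pages)
print(extract_fields(template, pages[1]))  # ['Garden House', '3500']
```

One limitation of this sketch: if every compared page happens to carry the same value in a slot, that value is absorbed into the template, which is one reason comparing more pages helps.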

Description

Technical field

[0001] The invention belongs to the technical field of Internet information processing, and in particular relates to a method and system for automatically extracting web page metadata.

Background technique

[0002] With the rapid development of Internet technology, the amount of information on the Internet is growing exponentially. Information retrieval based on keyword matching at the granularity of whole web pages can no longer satisfy people's growing demand for information. For example, if someone wants to find cars priced between 100,000 and 200,000 on the Internet, a traditional search engine can hardly complete this search. To meet this demand, the metadata in web pages must be extracted, stored and indexed. However, it is not easy to extract metadata from web pages, because Internet pages are organized with loose HTML, and HTML syntax verification is not strict, the st...
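The car-price example in the background shows why structured metadata matters: a price range cannot be expressed as a keyword match, but it is a simple lookup once the price has been extracted into a numeric field. The sketch below is only an illustration under assumed data; the storage and index design of the invention is not described in this excerpt, and the `PriceIndex` class and the sample records are hypothetical.

```python
from bisect import bisect_left, bisect_right


class PriceIndex:
    """Hypothetical sorted index over an extracted numeric `price` field."""

    def __init__(self, records):
        # Sort once so that range queries reduce to two binary searches.
        self._records = sorted(records, key=lambda r: r["price"])
        self._prices = [r["price"] for r in self._records]

    def range(self, low, high):
        """Return records with low <= price <= high, in ascending price order."""
        start = bisect_left(self._prices, low)
        end = bisect_right(self._prices, high)
        return self._records[start:end]


# Assumed metadata records, as they might look after extraction.
listings = [
    {"name": "Car A", "price": 95_000},
    {"name": "Car B", "price": 150_000},
    {"name": "Car C", "price": 200_000},
    {"name": "Car D", "price": 260_000},
]
index = PriceIndex(listings)
matches = [r["name"] for r in index.range(100_000, 200_000)]
print(matches)  # ['Car B', 'Car C']
```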

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 甘雨李沛剡
Owner: 上海华燕房盟网络科技股份有限公司