Automatic template information locating method for structured web pages

An information positioning technology for structured web pages, applied in special data processing applications, instruments, and electrical digital data processing. It addresses inaccurate matching, the impossibility of exhaustively enumerating values such as recruitment positions, and omissions in search results.

Active Publication Date: 2008-05-14
北京酷讯科技有限公司

AI Technical Summary

Problems solved by technology

The biggest problem is that regular expressions can only match information listed in advance. For example, a regular expression can find recruitment positions that were enumerated beforehand, such as "teacher" and "secretary", but it will not find "sales manager" or "network engineer". In practice it is impossible to enumerate every recruitment position. In addition, a job posting may contain no job title at all, only a paragraph describing the work, and a regular expression cannot find such information. The result is search omissions caused by inaccurate matching and by the inability to judge which content is reasonable.
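
The limitation described above can be illustrated with a minimal Python sketch. The pattern, function name, and sample postings are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
import re

# Hypothetical enumerated pattern: only job titles listed in advance can match.
LISTED_TITLES = re.compile(r"\b(teacher|secretary)\b", re.IGNORECASE)

def find_listed_titles(text):
    """Return every pre-listed job title found in the text, lowercased."""
    return [m.group(0).lower() for m in LISTED_TITLES.finditer(text)]

# Illustrative postings: one listed title, one unlisted title, one with no title.
postings = [
    "Position: Teacher, full time",
    "Position: Sales Manager, Beijing",
    "We need someone to maintain our office network and servers.",
]
results = [find_listed_titles(p) for p in postings]
```

Only the first posting yields a match. The second contains a title that was never enumerated, and the third describes a job without naming one, so both are missed.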

Embodiment Construction

[0031] The present invention will be further described in detail below in conjunction with the accompanying drawings:

[0032] Today, most information publishing websites use programs to publish information to web pages automatically. Pages of this kind generally have a relatively fixed form, so it is possible to extract the fixed structure of the page, identify the location of the attribute of interest, and then accurately extract the content of interest.
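
As a rough illustration of what "fixed structure" can mean, the sketch below reduces a page to its sequence of opening tags and compares two pages. Treating the tag sequence as the page structure is an assumption made here for illustration, not something the text specifies; the class, function, and sample pages are hypothetical.

```python
from html.parser import HTMLParser

class TagSkeleton(HTMLParser):
    """Record the opening tags of a page in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def skeleton(html):
    """Return the page's opening-tag sequence as a tuple."""
    parser = TagSkeleton()
    parser.feed(html)
    parser.close()
    return tuple(parser.tags)

# Two hypothetical pages generated by the same publishing program.
page_a = "<div><h1>Teacher</h1><p>Work location: Beijing</p></div>"
page_b = "<div><h1>Sales manager</h1><p>Work location: Shanghai</p></div>"
same_template = skeleton(page_a) == skeleton(page_b)
```

The two pages differ in content but share one skeleton, which is the property that makes position-based extraction feasible.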

[0033] Generally speaking, each attribute to be extracted has corresponding keywords. For example, in a recruitment information web page, the attribute "work location" has the corresponding keywords "work location", "work city", and so on. For a particular attribute the number of corresponding keywords is small, which is determined by the characteristics of natural language. The value of the attribute can be "Beijing", "Guangzhou", "Shanghai", etc., and there may be more t...
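
Because each attribute has only a few keywords, the keywords themselves can be enumerated in a regular expression even though the values cannot. The following sketch assumes a small hypothetical keyword table; the names `ATTRIBUTE_KEYWORDS`, `keyword_pattern`, and `locate_keyword`, and the keyword "job title", are illustrative and not from the patent.

```python
import re

# Hypothetical keyword table: each attribute maps to the few labels that announce it.
ATTRIBUTE_KEYWORDS = {
    "work_location": ["work location", "work city"],
    "position": ["recruitment position", "job title"],
}

def keyword_pattern(attribute):
    """Compile one case-insensitive alternation over an attribute's keywords."""
    alternatives = "|".join(re.escape(k) for k in ATTRIBUTE_KEYWORDS[attribute])
    return re.compile(alternatives, re.IGNORECASE)

def locate_keyword(attribute, text):
    """Return the (start, end) span of the first keyword occurrence, or None."""
    match = keyword_pattern(attribute).search(text)
    return match.span() if match else None

page_text = "Company: Example Ltd. Work city: Guangzhou. Salary: negotiable."
span = locate_keyword("work_location", page_text)
```

The pattern finds the label "Work city" without knowing anything about the value that follows it, so new cities need no change to the expression.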

Abstract

The invention discloses an automatic template information positioning method for structured web pages. Existing positioning methods match inaccurately and have difficulty judging which content is reasonable. To solve these problems, the invention locates the attribute keyword with a regular expression, determines the distance between the attribute keyword and the attribute value, and finally locates the whole attribute value from the attribute keyword and that distance. The invention can position the searched information accurately and effectively, and is suitable for various web information search engines.
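
The keyword-plus-distance idea in the abstract can be sketched as follows. The text available here does not say how the distance is measured, so this sketch assumes it is the number of text nodes between the keyword and the value; that unit, the helper names, and the sample pages are all hypothetical.

```python
import re
from html.parser import HTMLParser

class TextNodes(HTMLParser):
    """Collect the non-empty text nodes of a page in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(text)

def text_nodes(html):
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    return parser.nodes

def keyword_index(nodes, keyword_re):
    """Index of the first text node containing the attribute keyword, or None."""
    for i, node in enumerate(nodes):
        if keyword_re.search(node):
            return i
    return None

def learn_distance(sample_html, keyword_re, known_value):
    """Measure the keyword-to-value offset on a page whose value is known."""
    nodes = text_nodes(sample_html)
    k = keyword_index(nodes, keyword_re)
    if k is None or known_value not in nodes:
        return None
    return nodes.index(known_value) - k

def extract_value(html, keyword_re, distance):
    """Locate the keyword, then read the node at the learned offset."""
    nodes = text_nodes(html)
    k = keyword_index(nodes, keyword_re)
    if k is None:
        return None
    target = k + distance
    return nodes[target] if 0 <= target < len(nodes) else None

KEYWORD = re.compile(r"work (location|city)", re.IGNORECASE)
sample = "<table><tr><td>Work location</td><td>:</td><td>Beijing</td></tr></table>"
other = "<table><tr><td>Work location</td><td>:</td><td>Shanghai</td></tr></table>"
distance = learn_distance(sample, KEYWORD, "Beijing")
value = extract_value(other, KEYWORD, distance)
```

The offset is learned once from a page with a known value and then reused on another page from the same template, so the value "Shanghai" is found without ever appearing in a pattern.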

Description

Technical field

[0001] The invention relates to an automatic template information positioning method for structured web pages.

Background

[0002] Web page information extraction is an important topic in the field of Internet information mining. The problem it addresses is how to extract specified information from a web page, for example extracting the company and position from every recruitment page of a recruitment information publishing website. Past technology uses regular expressions to match the specified information in the web page and then extracts the most reasonable content from the matches. This method has many flaws. The biggest problem is that regular expressions can only match pre-listed information. For example, a regular expression can find pre-listed information, such as the "recruitment position" values "teacher" and "s...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 陈华
Owner: 北京酷讯科技有限公司