WEB page information sensing and collecting method

A page information collection method and technology, applied in special data processing applications, website content management, instruments, etc. It addresses problems such as the fast page refresh rate of mainstream websites, the inability to automatically identify post URLs, and the difficulty of defining what should be crawled. Its effects are reducing customization workload and maintenance costs, overcoming the inability to collect information, and avoiding the risk of information loss.

Active Publication Date: 2015-02-18
南京烽火星空通信发展有限公司

AI Technical Summary

Problems solved by technology

[0010] 1. Mainstream websites refresh their pages quickly, so information loss is serious. If the current web page contains post URLs in several formats, important post information is easily missed;
[0011] 2. Each website needs its own customized rules to identify the post URLs to be collected, which requires a large amount of script customization, a heavy workload, and difficult maintenance;
[0012] 3. Unneeded parts, such as advertisements and promotional external-link URLs, are difficult to define;
[0013] 4. After a website is redesigned, the URLs of posts on the redesigned site cannot be recognized automatically.



Examples


Embodiment Construction

[0039] Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

[0040] As shown in Figure 1, the present invention provides a WEB page information sensing and collecting method which, in practical application, comprises the following steps:

[0041] Step 001. Starting from the entry page of the website to be collected, load the site page by page and obtain all link URLs on each page; filter out non-post resources such as CSS, JS, pictures, audio, or video to obtain the full set of URLs of the website to be collected; then proceed to step 002;
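The link extraction and filtering in Step 001 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patent's implementation: the names `LinkCollector`, `extract_candidate_urls`, and `NON_POST_EXTENSIONS` are hypothetical, and the extension list is an assumption, since the text names only the resource categories (CSS, JS, pictures, audio, video).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed extension list; the source only names the categories to filter out.
NON_POST_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".mp3", ".mp4"}


class LinkCollector(HTMLParser):
    """Collects the raw value of every href/src attribute on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_candidate_urls(html, page_url):
    """Return de-duplicated absolute http(s) URLs, minus non-post resources."""
    collector = LinkCollector()
    collector.feed(html)
    seen, result = set(), []
    for raw in collector.links:
        absolute = urljoin(page_url, raw)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip javascript:, mailto:, and similar pseudo-links
        path = parsed.path.lower()
        if any(path.endswith(ext) for ext in NON_POST_EXTENSIONS):
            continue  # skip stylesheets, scripts, and media files
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

Calling `extract_candidate_urls(page_html, page_url)` on each loaded page and merging the results would give the full URL set that Step 002 then examines. Filtering by file extension is the simplest reading of the step; a production collector might also inspect the response content type.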

[0042] Step 002. Determine whether a URL rule exists for the website to be collected and, at the same time, whether full URL records exist for it; according to the result, enter step 003 and step 005 respectively for parallel processing, or enter step 004 and step 005 respectively 006 for parallel proces...



Abstract

The invention relates to a WEB page information sensing and collecting method. The method collects information through in-page URL (uniform resource locator) proportion analysis and an automatic sensing and learning mechanism. It effectively avoids the risk of information loss caused by manually customized site URL rules, greatly reduces the workload and maintenance cost of customizing scripts for many sites, and overcomes the inability to collect information after a website is redesigned. In addition, URL rules are generated intelligently through intelligent incremental merging, which effectively ensures the accuracy of the information sensed and obtained from the page.
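One possible reading of "URL proportion analysis" is sketched below. The listing does not define the analysis, so everything here is an assumption made for illustration: URLs are generalized by replacing digit runs in the path with a `{num}` placeholder, the share of each resulting pattern on the page is computed, and patterns above a hypothetical `threshold` are treated as candidate post URL rules. The function names are likewise hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_pattern(url):
    """Generalize a URL: host plus path, with digit runs replaced by {num}."""
    parsed = urlparse(url)
    return parsed.netloc + re.sub(r"\d+", "{num}", parsed.path)


def pattern_proportions(urls):
    """Map each generalized pattern to its share of the URLs on the page."""
    counts = Counter(url_pattern(u) for u in urls)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}


def dominant_patterns(urls, threshold=0.5):
    """Patterns whose share meets the (assumed) threshold: candidate rules."""
    return [p for p, share in pattern_proportions(urls).items() if share >= threshold]
```

For a forum index page where most links point to threads, the thread pattern would dominate while navigation and advertisement links fall below the threshold. The right threshold, and whether one page should yield one rule or several, are design choices the source does not state.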

Description

Technical field

[0001] The invention relates to a method for sensing and collecting WEB page information.

Background technique

[0002] With the advancement of science and technology, Internet information has entered an era of explosion and diversity, and the Internet has become a huge information database. Given the massive, diverse, and complex information on the Internet, relying only on manual collection, sorting, and tracking of the latest information is unscientific, inefficient, and unable to meet actual needs. Automatic collection of Internet information saves users considerable resources in information collection, resource integration, capital utilization, and human investment. It is widely used in industry portal information collection, competitor intelligence data collection, website content system construction, vertical search, public opinion monitoring, scientific research, and other fields.

[0003] General web scr...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/958
Inventor: 瞿伟, 史波良
Owner: 南京烽火星空通信发展有限公司