Method and device for extracting page information

The technology uses page information and positioning information, and is applied in fields such as special data processing applications, instruments, and electrical digital data processing. It addresses the difficulty and low efficiency of extracting page information, and achieves efficient and accurate extraction.

Active Publication Date: 2016-05-18
ALIBABA (CHINA) CO LTD

AI Technical Summary

Problems solved by technology

[0004] However, a large number of new webpages are generated every day on the Internet, and it is difficult to extract page information from the newly added webpages by using the original pre

Examples

Embodiment 1

[0070] Referring to Figure 1A, this embodiment of the present invention provides a method for extracting page information.

[0071] In this embodiment, before the page information of the webpage to be processed is extracted, two things must be prepared offline: a preset algorithm library and the transcoding configuration information of the webpage to be processed. The transcoding configuration information includes the positioning information and the data structure type of each service block of the webpage to be processed. The preset algorithm library maps each data structure type to its corresponding recognition algorithm. As shown in Figure 1B, the offline configuration operation proceeds as follows:
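The two offline artifacts described above can be pictured as plain lookup tables. The following Python sketch is only an illustration under stated assumptions: the names `BlockConfig`, `ALGORITHM_LIBRARY`, `TRANSCODING_CONFIG`, and `validate_config`, the type names "text" and "list", the path-like locators, and the example address are all hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class BlockConfig:
    """Hypothetical record for one service block of a page."""
    name: str
    locator: str          # positioning information, assumed here to be a path-like string
    structure_type: str   # data structure type; the type names below are illustrative


# A recognition algorithm takes a block's source fragment and returns extracted items.
Recognizer = Callable[[str], List[str]]


def recognize_text(fragment: str) -> List[str]:
    """Treat the whole fragment as a single piece of text."""
    text = fragment.strip()
    return [text] if text else []


def recognize_list(fragment: str) -> List[str]:
    """Treat each non-empty line of the fragment as one list entry."""
    return [line.strip() for line in fragment.splitlines() if line.strip()]


# Preset algorithm library: data structure type -> recognition algorithm.
ALGORITHM_LIBRARY: Dict[str, Recognizer] = {
    "text": recognize_text,
    "list": recognize_list,
}

# Transcoding configuration, keyed by a made-up page address.
TRANSCODING_CONFIG: Dict[str, List[BlockConfig]] = {
    "http://example.com/news": [
        BlockConfig("headline", "./body/h1", "text"),
        BlockConfig("links", "./body/ul", "list"),
    ],
}


def validate_config(config: Dict[str, List[BlockConfig]],
                    library: Dict[str, Recognizer]) -> List[str]:
    """Return the structure types used in the config that the library does not cover."""
    missing = set()
    for blocks in config.values():
        for block in blocks:
            if block.structure_type not in library:
                missing.add(block.structure_type)
    return sorted(missing)
```

Keeping the library keyed by data structure type, rather than by page, is what would let a newly added page be supported by writing only a new configuration entry, provided its blocks use types the library already knows.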

[0072] S1: Obtain the DOM (Document Object Model) tree of the webpage to be processed, and divide the DOM tree according to business type to obtain each bus...

Embodiment 2

[0124] Referring to Figure 2, this embodiment of the present invention provides a device for extracting page information. The device is configured to execute the method for extracting page information provided in Embodiment 1. The device includes:

[0125] The first obtaining module 201 is used to obtain the source code and the Document Object Model (DOM) tree of the webpage to be processed, and to obtain the transcoding configuration information of the webpage to be processed from the server. The transcoding configuration information includes the positioning information and data structure type of each service block of the webpage to be processed;

[0126] When a user browses a webpage through a terminal, the terminal sends a webpage acquisition request to the device. The request carries the address of the webpage and the terminal identifier. The device receives the webpage acquisition request sent ...
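As a rough illustration of the request described above, the sketch below models a webpage acquisition request that carries a webpage address and a terminal identifier. The field names `page_url` and `terminal_id`, the dictionary payload, and the validation rules are assumptions made for this example; the patent text does not specify a wire format.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class PageRequest:
    """Hypothetical in-memory form of a webpage acquisition request."""
    page_url: str     # address of the webpage the user is browsing
    terminal_id: str  # identifier of the requesting terminal


def parse_request(payload: dict) -> PageRequest:
    """Build a PageRequest from a payload, rejecting incomplete requests."""
    url = payload.get("page_url", "")
    terminal_id = payload.get("terminal_id", "")
    parts = urlsplit(url)
    # Assumption: only absolute http(s) addresses are accepted.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("request must carry an absolute http(s) webpage address")
    if not terminal_id:
        raise ValueError("request must carry a terminal identifier")
    return PageRequest(page_url=url, terminal_id=terminal_id)
```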

Abstract

The invention provides a method and a device for extracting page information. The method comprises the following steps: the source code and the DOM tree of a webpage to be processed are obtained; transcoding configuration information of the webpage is obtained from a server, the transcoding configuration information comprising the positioning information and data structure type of each service block of the webpage; for each service block, the corresponding DOM node is obtained from the DOM tree according to the positioning information; the recognition algorithm corresponding to the service block is obtained from a preset algorithm library according to the data structure type of the service block; and the page information is extracted from the webpage source code corresponding to that DOM node according to the recognition algorithm. Using the preset algorithm library and the transcoding configuration information of the webpage, the method and the device extract page information efficiently and accurately, and page information can also be extracted successfully from newly added webpages.
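The sequence of steps in the abstract can be sketched end to end in a few lines of Python. This is a minimal illustration, not the patented implementation, and it rests on several assumptions: the page is well-formed markup that `xml.etree.ElementTree` can parse, positioning information is expressed as an ElementTree path, the data structure types are named "text" and "list", and the recognition algorithms work on the parsed node rather than on the raw source fragment. All names (`PAGE_SOURCE`, `CONFIG`, `ALGORITHMS`, `extract_page_information`) are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple

# A tiny well-formed page standing in for the webpage source code.
PAGE_SOURCE = """<html><body>
<h1>Daily news</h1>
<ul><li>First story</li><li>Second story</li></ul>
</body></html>"""


def extract_text(node: ET.Element) -> List[str]:
    """Recognition algorithm for a 'text' block: all text under the node."""
    text = "".join(node.itertext()).strip()
    return [text] if text else []


def extract_list(node: ET.Element) -> List[str]:
    """Recognition algorithm for a 'list' block: one entry per child element."""
    return ["".join(child.itertext()).strip() for child in node]


# Preset algorithm library: data structure type -> recognition algorithm.
ALGORITHMS: Dict[str, Callable[[ET.Element], List[str]]] = {
    "text": extract_text,
    "list": extract_list,
}

# Transcoding configuration: (block name, positioning information, data structure type).
CONFIG: List[Tuple[str, str, str]] = [
    ("headline", "./body/h1", "text"),
    ("stories", "./body/ul", "list"),
]


def extract_page_information(source: str,
                             config: List[Tuple[str, str, str]],
                             algorithms: Dict[str, Callable[[ET.Element], List[str]]]
                             ) -> Dict[str, List[str]]:
    """Locate each service block in the tree and apply its recognition algorithm."""
    root = ET.fromstring(source)                  # build the tree from the page source
    result: Dict[str, List[str]] = {}
    for name, locator, structure_type in config:
        node = root.find(locator)                 # node for this service block
        algorithm = algorithms.get(structure_type)
        if node is None or algorithm is None:
            continue                              # skip blocks that cannot be resolved
        result[name] = algorithm(node)
    return result
```

Calling `extract_page_information(PAGE_SOURCE, CONFIG, ALGORITHMS)` on the sample page yields the headline and the two list entries. Real webpages are rarely well-formed XML, so a production system would need a tolerant HTML parser in place of `ElementTree`.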
