
A method and apparatus for extracting information

A technique for extracting page body information by dividing page data into data blocks, applied in the computer field. It addresses the growing difficulty of extracting page body information and improves the accuracy, automation, and flexibility of extraction.

Active Publication Date: 2019-02-12
BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

AI Technical Summary

Problems solved by technology

The difficulty of extracting the main body information of the page increases accordingly



Embodiment Construction

[0035] The application is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described here only explain the invention and do not limit it. For convenience of description, the drawings show only the parts related to the invention.

[0036] Where no conflict arises, the embodiments in the present application and the features in those embodiments can be combined with each other. The present application is described in detail below with reference to the accompanying drawings and embodiments.

[0037] The content included in the drawings is part of the DOM data of the page (not all of it is shown). The program code therein (including HTML and CSS, and possibly JavaScript, etc.) is well known to those skilled in the art, and wil...


Abstract

Embodiments of the present application disclose a method and apparatus for extracting information. A specific embodiment of the method includes: acquiring the DOM data of a target page, wherein the target page includes page body information, and the page body information includes at least one of a text set and an image set; deleting from the DOM data the data that meets a preset deletion condition to obtain target data, wherein the target data comprises page body data, and the page body data comprises at least one of a text node set corresponding to the text set and a URL set of the image set; dividing the target data into blocks to obtain a data block set; determining a target data block from the data block set, wherein the target data block is the data block in the set that has the greatest probability of including page body data; and extracting at least one of the text nodes and the URLs from the target data block. The embodiment improves the flexibility of information extraction and helps improve its accuracy and automation.
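
The steps in the abstract can be sketched in Python using only the standard library. This is a minimal illustration, not the patented implementation: the deletion condition (`DROP_TAGS`), the block boundaries (`BLOCK_TAGS`), and the use of total text length as a stand-in for "greatest probability of including page body data" are all assumptions, and every function name here is hypothetical.

```python
from html.parser import HTMLParser

# Assumed deletion condition: tags that rarely hold page body content.
DROP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Assumed block boundaries for the block-division step.
BLOCK_TAGS = {"div", "article", "section", "main"}
# Elements that never have a closing tag.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child elements
        self.texts = []     # text nodes directly under this element


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Walk up to the nearest open element with this tag; ignore strays.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.texts.append(text)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def prune(node):
    """Step 2: delete subtrees that meet the (assumed) deletion condition."""
    node.children = [c for c in node.children if c.tag not in DROP_TAGS]
    for child in node.children:
        prune(child)


def split_blocks(root):
    """Step 3: treat each innermost block-level element as one data block."""
    blocks = []
    for n in iter_nodes(root):
        if n.tag not in BLOCK_TAGS:
            continue
        nested = any(d.tag in BLOCK_TAGS
                     for c in n.children for d in iter_nodes(c))
        if not nested:
            blocks.append(n)
    return blocks


def block_texts(block):
    return [t for n in iter_nodes(block) for t in n.texts]


def block_image_urls(block):
    return [n.attrs["src"] for n in iter_nodes(block)
            if n.tag == "img" and n.attrs.get("src")]


def score(block):
    # Assumed proxy for body-content probability: total text length.
    return sum(len(t) for t in block_texts(block))


def extract_body(html):
    builder = TreeBuilder()
    builder.feed(html)          # Step 1: acquire DOM data
    builder.close()
    prune(builder.root)         # Step 2: delete data meeting the condition
    blocks = split_blocks(builder.root)  # Step 3: divide into blocks
    if not blocks:
        return [], []
    target = max(blocks, key=score)      # Step 4: pick the target block
    return block_texts(target), block_image_urls(target)  # Step 5: extract
```

Calling `extract_body(page_html)` returns a pair: the text nodes and the image URLs of the highest-scoring block. A real system would likely replace the text-length score with a trained or rule-based probability estimate, which the abstract leaves unspecified.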

Description

Technical field

[0001] The embodiments of the present application relate to the field of computer technologies, and in particular to methods and devices for extracting information.

Background

[0002] As the amount of Internet data grows, the number of website pages increases, the amount of information grows, and pages become more complex. The difficulty of extracting the main body information of a page increases accordingly. The main body information of a page is usually the main content that we want to obtain when we retrieve the page, and it is usually very helpful for extracting the most meaningful information on the page.

[0003] Usually, when obtaining the main body information of the page, a step of eliminating irrelevant parts is also involved, so as to facilitate the extraction of the main body information of the ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F16/957
Inventor: 杨森魏晨辉
Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD