Chinese webpage interest point retrieval method and device, and electronic equipment

By introducing word embedding vectors and positional encoding into the BERT model, combined with masked language modeling and node modeling, the problem of existing pre-trained models ignoring the positional structure of web pages is solved, improving the accuracy and efficiency of information extraction in Chinese web page pre-trained models, and making it suitable for question answering and retrieval tasks based on points of interest.

CN116304385B · Active · Publication Date: 2026-06-26 · BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD
Filing Date: 2023-02-24
Publication Date: 2026-06-26


IPC: G06F16/9537; G06F16/9532; G06F16/951; G06F40/284; G06F16/353; G06N3/0455; G06N3/0499; G06N5/04; G06N3/048; G06F18/2433; G06F18/2415

AI Tagging

Technology Topics

Information presentation Engineering

Technical Efficacy Phrases

avoid demand; improve accuracy


AI Technical Summary

Technical Problem

Existing pre-trained models neglect the positional and structural information of web page data during web page information retrieval and extraction, resulting in overly simplistic model representations that affect the accuracy of information extraction in downstream tasks. Furthermore, the word embedding vector encoding is too simplistic, leading to poor training performance.

Method used

Based on the BERT model, word embedding vector encoding and positional encoding are introduced. By retaining HTML tag data, a Chinese webpage pre-trained model is constructed to perform masked language modeling and masked node modeling. Fine-tuning is carried out in combination with attribute extraction and question answering tasks to improve the model's output of attribute information of interest points.

Benefits of technology

This improves the accuracy and efficiency of the model's output of point-of-interest (POI) attribute information, better meeting user needs and enhancing the accuracy of POI data in map software.



Patent Text Reader

Abstract

The present disclosure provides a method and device for retrieving points of interest from Chinese webpages, and an electronic device, and relates to the technical fields of neural networks and cloud computing. It addresses the technical problem that, during model training, only the plain text content of a webpage is pre-trained and the positional and structural information in the webpage data is ignored, so that the representation learned by the model is too limited and the accuracy of information extraction in downstream tasks suffers. The specific implementation is: in response to a user's instruction selecting a point of interest, obtaining target webpage data of a target webpage containing the point of interest; inputting the target webpage data into a Chinese webpage pre-training model to obtain target attribute information corresponding to the point of interest; and displaying the target attribute information.


Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and more particularly to the field of neural network technology and cloud computing technology, specifically to a method, apparatus and electronic device for retrieving points of interest on Chinese web pages.

Background Technology

[0002] In recent years, pre-trained models have been widely used in the field of natural language processing, especially the BERT pre-trained model, whose high performance in downstream tasks has made it almost the mainstream framework for pre-trained models. Therefore, web page extraction and understanding tasks, which also belong to the field of natural language processing, are well-suited for fine-tuning using BERT as a pre-trained model.

[0003] However, most existing techniques for web page information retrieval and extraction using pre-trained models remove HTML tags during data preprocessing, pre-training only on the plain text content of the web page. This approach easily overlooks the positional and structural information within the web page data, resulting in overly simplistic model representations and impacting the accuracy of information extraction in downstream tasks. Furthermore, most pre-trained web page models suffer from insufficiently fine-grained input processing and overly simplistic word embedding vector encoding, leading to poor training performance.

Summary of the Invention

[0004] This disclosure provides a method, apparatus, and electronic device for retrieving points of interest (POIs) on Chinese web pages.

[0005] According to a first aspect of this disclosure, a method for retrieving points of interest (POIs) on Chinese web pages is provided, comprising:

[0006] In response to the user's instruction to select a point of interest, obtain target webpage data of the target webpage containing the point of interest;

[0007] The target webpage data is input into the Chinese webpage pre-training model to obtain the target attribute information corresponding to the interest point. The Chinese webpage pre-training model is obtained by pre-training using sample webpage data containing hypertext markup language.

[0008] Display target attribute information.

[0009] In one possible implementation, before inputting the target webpage data into the Chinese webpage pre-training model to obtain the target attribute information corresponding to the interest point, the method further includes:

[0010] Extract the target hypertext markup language and target Chinese word segmentation from the target webpage data;

[0011] The target path language of the target webpage is determined based on the target hypertext markup language and the target Chinese word segmentation.

[0012] In one possible implementation, the method provided by this embodiment of the invention involves inputting target webpage data into a Chinese webpage pre-training model to obtain target attribute information corresponding to points of interest, including:

[0013] The target Chinese word segmentation and target path language are input into the Chinese webpage pre-trained model to obtain the target attribute information corresponding to the points of interest.

[0014] In one possible implementation, the method provided by this embodiment of the invention trains the Chinese webpage pre-training model according to the following method:

[0015] Obtain sample webpage data for training Chinese webpage pre-trained models;

[0016] Determine the word embedding vector encoding and position encoding of the sample webpages based on the sample webpage data;

[0017] The word embedding vector encoding and position encoding are input into the initial Chinese webpage pre-training model for training, thus obtaining the Chinese webpage pre-training model.

[0018] In one possible implementation, the method provided by this embodiment of the invention, which determines the word embedding vector encoding and position encoding of a sample webpage based on sample webpage data, includes:

[0019] Extract the hypertext markup language and Chinese word segmentation from the sample webpage data;

[0020] The path language of the sample webpage is determined based on Hypertext Markup Language and Chinese word segmentation;

[0021] The word embedding vector encoding and position encoding of the sample webpage are determined based on the path language.

[0022] In one possible implementation, the method provided by this embodiment of the invention involves inputting word embedding vector encoding and position encoding into an initial Chinese webpage pre-training model for training to obtain the Chinese webpage pre-training model, including:

[0023] The word embedding vector encoding and position encoding are input into the initial Chinese webpage pre-trained model;

[0024] Masked language modeling and masked node modeling are performed using the word embedding vector encoding and position encoding.

[0025] In one possible implementation, after performing masked language modeling and masked node modeling using word embedding vector encoding and positional encoding, the method further includes:

[0026] A downstream task is set for the Chinese webpage pre-training model, so that when target webpage data of a target webpage containing a point of interest is input into the Chinese webpage pre-training model, it outputs the target attribute information corresponding to the point of interest.

[0027] According to a second aspect of this disclosure, a Chinese webpage point of interest retrieval device is provided, comprising:

[0028] The acquisition unit is used to acquire target webpage data containing the target webpage of the interest point in response to the user's instruction to select the interest point;

[0029] The processing unit is used to input the target webpage data into the Chinese webpage pre-training model to obtain the target attribute information corresponding to the interest point. The Chinese webpage pre-training model is obtained by pre-training using sample webpage data containing hypertext markup language.

[0030] The display unit is used to display target attribute information.

[0031] In one possible implementation, the processing unit in the apparatus provided by the embodiments of the present invention is further configured to:

[0032] Extract the target hypertext markup language and target Chinese word segmentation from the target webpage data;

[0033] The target path language of the target webpage is determined based on the target hypertext markup language and the target Chinese word segmentation.

[0034] In one possible implementation, the processing unit in the apparatus provided by the embodiments of the present invention is further configured to:

[0035] The target Chinese word segmentation and target path language are input into the Chinese webpage pre-trained model to obtain the target attribute information corresponding to the points of interest.

[0036] In one possible implementation, the apparatus provided by this embodiment of the invention trains a pre-trained model of Chinese web pages according to the following method:

[0037] Obtain sample webpage data for training Chinese webpage pre-trained models;

[0038] Determine the word embedding vector encoding and position encoding of the sample webpages based on the sample webpage data;

[0039] The word embedding vector encoding and position encoding are input into the initial Chinese webpage pre-training model for training, thus obtaining the Chinese webpage pre-training model.

[0040] In one possible implementation, the processing unit in the apparatus provided by the embodiments of the present invention is further configured to:

[0041] Extract the hypertext markup language and Chinese word segmentation from the sample webpage data;

[0042] The path language of the sample webpage is determined based on Hypertext Markup Language and Chinese word segmentation;

[0043] The word embedding vector encoding and position encoding of the sample webpage are determined based on the path language.

[0044] In one possible implementation, the processing unit in the apparatus provided by the present invention is further configured to: input word embedding vector encoding and position encoding into an initial Chinese webpage pre-training model;

[0045] Perform masked language modeling and masked node modeling using the word embedding vector encoding and position encoding.

[0046] In one possible implementation, the processing unit in the apparatus provided by the embodiments of the present invention is further configured to:

[0047] A downstream task is set for the Chinese webpage pre-training model, so that when target webpage data of a target webpage containing a point of interest is input into the Chinese webpage pre-training model, it outputs the target attribute information corresponding to the point of interest.

[0048] According to a third aspect of this disclosure, an electronic device is provided, comprising:

[0049] At least one processor; and

[0050] A memory communicatively connected to the at least one processor; wherein,

[0051] The memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the method described in any one of the first aspects.

[0052] According to a fourth aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause the computer to perform the method described in any one of the first aspects.

[0053] According to a fifth aspect of this disclosure, a computer program product is provided, comprising a computer program / instructions that, when executed by a processor, implement the steps of the method described in any one of the first aspects.

[0054] In the embodiments of this disclosure, first, in response to a user's instruction to select a point of interest, target webpage data of the target webpage containing the point of interest is obtained. Then, the target webpage data is input into a Chinese webpage pre-trained model to obtain the target attribute information corresponding to the point of interest. Finally, the target attribute information is displayed. The embodiments of this disclosure use Chinese webpages as a dataset to train the model and use this model to implement point-of-interest (POI)-based question answering and retrieval tasks. This allows POI data in map software to be supplemented and updated, improving data accuracy and addressing user needs.

[0055] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Attached Figure Description

[0056] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:

[0057] Figure 1 is a flowchart illustrating a Chinese webpage point of interest retrieval method according to an embodiment of this disclosure;

[0058] Figure 2 is a flowchart illustrating a training method for a Chinese webpage pre-training model according to an embodiment of this disclosure;

[0059] Figure 3 is an example diagram illustrating the conversion from a source webpage to HTML source code and a DOM structure tree, provided according to embodiments of this disclosure;

[0060] Figure 4 is a schematic diagram of the structure of a Chinese webpage pre-training model framework provided according to an embodiment of this disclosure;

[0061] Figure 5 is a schematic diagram of the training process of a Chinese webpage pre-training model according to an embodiment of the present disclosure;

[0062] Figure 6 is a block diagram of a Chinese webpage point of interest retrieval device according to an embodiment of the present disclosure;

[0063] Figure 7 is a block diagram of an electronic device used to implement the Chinese webpage point of interest retrieval method according to embodiments of this disclosure.

Detailed Implementation

[0064] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0065] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0066] The acquisition, storage, and application of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0067] The following are explanations of some of the words that appear in the text:

[0068] 1. In the embodiments of this disclosure, the term "and / or" describes the relationship between associated objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. The character " / " generally indicates that the preceding and following associated objects have an "or" relationship.

[0069] 2. In this embodiment of the disclosure, the term "HTML" refers to Hyper Text Markup Language (HTML), a markup language. It includes a series of tags. These tags can unify the format of documents on the network, connecting scattered Internet resources into a logical whole.

[0070] 3. In the embodiments of this disclosure, the term "POI" refers to a Point of Interest (POI) on a map. In layman's terms, it refers to some keywords that users search for in map software. For example, a certain city's health bureau is a POI. Users want to know relevant information about this POI, including its location, contact number, business hours, etc.

[0071] 4. In the embodiments of this disclosure, the term "BERT" (Bidirectional Encoder Representations from Transformers) refers to an autoencoding language model that is pre-trained with two tasks. One task trains the language model using masked language modeling (MaskLM); the other adds a sentence-level continuity prediction task on top of the bidirectional language model, that is, predicting whether the two text segments input to BERT are consecutive.

[0072] 5. In the embodiments of this disclosure, the term "XPath" refers to XML Path Language, which is a language used to determine the location of a part in an XML document.

[0073] 6. In the embodiments of this disclosure, the term "token" refers to a mark or a code. Before some data is transmitted, the code must be checked. Different codes are authorized to perform different data operations.

[0074] As discussed in the background section, most existing techniques for web page information retrieval and extraction using pre-trained models remove HTML tags during data preprocessing, pre-training only on the plain text content of the web page. This approach easily overlooks the positional and structural information within the web page data, resulting in overly simplistic model representations and impacting the accuracy of information extraction in downstream tasks. Furthermore, most pre-trained web page models suffer from insufficiently fine-grained input processing and overly simplistic word embedding vector encoding, leading to poor training performance.

[0075] However, if HTML tag data is retained, the model training process also needs to consider the webpage structure. Webpage data differs from text data; it contains many HTML document tags, which contain topological information about the text structure within the webpage. Therefore, to complete the webpage extraction and understanding task, the pre-training process must consider not only the webpage text content data but also the spatial location information of the text within the webpage and the DOM tree structure information of the webpage's HTML document.

[0076] Based on this, in some embodiments of this disclosure, the BERT model is used as the basic framework, Chinese web pages are used as the dataset, an improved word embedding vector encoding is introduced, a new pre-training strategy is designed, and the model, after being fine-tuned on attribute extraction and question answering tasks, can be applied to POI-type question answering and retrieval tasks.

[0077] The technical solutions provided by the embodiments of this disclosure are described below with reference to the accompanying drawings.

[0078] Figure 1 is a flowchart illustrating a Chinese webpage point of interest retrieval method provided in this embodiment of the disclosure. As shown in Figure 1, the method includes:

[0079] S110: In response to the user's instruction to select a point of interest, obtain target webpage data of the target webpage containing the point of interest.

[0080] In this embodiment of the disclosure, in response to a user's instruction to select a point of interest, target webpage data of the target webpage containing the point of interest is obtained.

[0081] In one example, a user searches for keywords in a map application, such as "Health Bureau of a certain city." The user wants to know relevant information about this Point of Interest (POI), including its location, contact number, and business hours. In this case, the target webpage data containing the keyword is retrieved to provide the user with information such as location, contact number, and business hours.

[0082] S120: Input the target webpage data into the Chinese webpage pre-trained model to obtain the target attribute information corresponding to the interest points.

[0083] In this embodiment, the target hypertext markup language and target Chinese word segmentation of the target webpage are extracted from the target webpage data, and the target path language of the target webpage is determined based on the target hypertext markup language and target Chinese word segmentation. Then, the target Chinese word segmentation and target path language are input into a Chinese webpage pre-training model to obtain the target attribute information corresponding to the points of interest. Thus, preprocessing the target webpage data before inputting it into the Chinese webpage pre-training model can improve the model's computation speed and increase output efficiency.

[0084] In this scheme, the training method for the Chinese webpage pre-trained model is as follows:

[0085] Obtain sample webpage data for training Chinese webpage pre-trained models;

[0086] The word embedding vector encoding and position encoding of the sample webpage are determined based on the sample webpage data. Specifically, when determining the word embedding vector encoding and position encoding, the hypertext markup language and Chinese word segmentation of the sample webpage are extracted from the sample webpage data. Then, the path language of the sample webpage is determined based on the hypertext markup language and Chinese word segmentation. Finally, the word embedding vector encoding and position encoding of the sample webpage are determined based on the path language. This enriches the dimensions of the encoding and makes the final model output data more accurate.

[0087] Finally, the word embedding vector encoding and positional encoding are input into the initial Chinese webpage pre-training model for training, resulting in the Chinese webpage pre-training model. During training, the word embedding vector encoding and positional encoding are first input into the initial Chinese webpage pre-training model. Then, they are used for masked language modeling and masked node modeling. Finally, downstream tasks are set for the Chinese webpage pre-training model (that is, fine-tuning), so that when the model is given target webpage data of a target webpage containing a point of interest, it outputs the target attribute information corresponding to the point of interest. In this way, the trained model can output the corresponding attribute information for points of interest, providing the services needed by users.

[0088] S130: Display the target attribute information.

[0089] In this embodiment of the disclosure, the target attribute information is determined by step S120 and displayed to the user, thereby effectively improving the user experience.

[0090] Figure 2 This is a flowchart illustrating a training method for a Chinese webpage pre-training model provided in this embodiment of the disclosure. The method includes:

[0091] S210, Obtain sample webpage data for training Chinese webpage pre-training models.

[0092] In this embodiment of the disclosure, when training the Chinese webpage pre-training model, sample webpage data is first obtained, and then the data is preprocessed.

[0093] S220, Determine the word embedding vector encoding and position encoding of the sample webpage based on the sample webpage data.

[0094] In this embodiment of the disclosure, the preprocessing process mainly involves extracting the simplest expression of each HTML node in the HTML webpage, performing Chinese word segmentation on the sample webpage data, determining the path language XPath of the sample webpage based on the HTML and Chinese word segmentation, and finally determining the word embedding vector encoding and position encoding of the sample webpage based on the path language.

[0095] Specifically, when constructing the XPath encoding, note that web page data contains more information than ordinary text data. Besides the text content, it also includes HTML tag information and the positional and structural information between HTML nodes. To enable the model to learn the positional and structural information inherent in HTML, an XPath word embedding vector encoding is introduced. XPath is a path language that describes the location of HTML nodes. As shown in Figure 3, “html/body/div/div/p[15]” is the XPath expression corresponding to the node “fax number: XXXX-XXXXXXX”, which represents the nodes traversed from the root node to the current node. Each HTML node has a simplest XPath expression that represents the position of the node in the Document Object Model (DOM) tree. Therefore, the simplest XPath expression of each HTML node is obtained in advance during data preprocessing, and word embedding vector encoding is also performed on the XPath expression.
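The preprocessing step described above can be illustrated with a minimal sketch. It uses only Python's standard `html.parser` module and assumes well-formed HTML; the class name `XPathTextExtractor`, the helper `extract_xpaths`, and the `VOID_TAGS` set are hypothetical names introduced for illustration and are not taken from the disclosure. An index is omitted from a step when the element is the only child with that tag name, which is one plausible reading of the "simplest" XPath expression.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed, partial list).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class XPathTextExtractor(HTMLParser):
    """Collect (xpath, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []           # open elements: (tag, index, sibling_counts)
        self.child_counts = [{}]  # child_counts[-1]: tag -> count under current element
        self.raw_nodes = []       # (snapshot of the open-element stack, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        siblings = self.child_counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        self.stack.append((tag, siblings[tag], siblings))
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.raw_nodes.append((list(self.stack), text))

    def nodes(self):
        """Build XPath strings once parsing is done and sibling counts are final."""
        result = []
        for path, text in self.raw_nodes:
            steps = [
                tag if siblings[tag] == 1 else f"{tag}[{idx}]"
                for tag, idx, siblings in path
            ]
            result.append(("/".join(steps), text))
        return result


def extract_xpaths(html):
    parser = XPathTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes()
```

For a page whose `div` holds two `p` children, the two text nodes receive the paths `html/body/div/p[1]` and `html/body/div/p[2]`. A production pipeline would also need to handle malformed markup and skip `script` and `style` content, which this sketch does not attempt.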

[0096] For the $i$-th input node $x_i$ in the web page data, its corresponding XPath expression $p_i$ is first obtained through preprocessing, and $p_i$ is then split on the "/" symbol into a sequence of sub-fragments.

[0097] Here $d$ is the depth of the current node $x_i$ in the DOM structure tree, and each sub-fragment of the XPath expression is characterized by its label and its index. For nodes that do not have a corresponding XPath expression, such as HTML tags, the [PAD] identifier is used to represent their XPath.
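As a rough illustration of the splitting step, the sketch below turns an XPath string into a list of (label, index) pairs and pads it with a [PAD] entry. The function name `split_xpath`, the `max_depth` parameter, and the choice of index 0 for a step that carries no explicit index are assumptions made for this example; the disclosure does not specify them.

```python
import re

PAD = "[PAD]"
# One XPath step: a tag name optionally followed by a numeric index, e.g. "p[15]".
_STEP = re.compile(r"^([A-Za-z][\w-]*)(?:\[(\d+)\])?$")


def split_xpath(xpath, max_depth=None):
    """Split 'html/body/div/p[15]' into [(label, index), ...].

    A step without an explicit index gets index 0 (an assumption).
    An empty xpath yields only padding, mirroring the [PAD] convention.
    """
    fragments = []
    if xpath:
        for step in xpath.strip("/").split("/"):
            match = _STEP.match(step.strip())
            if match is None:
                raise ValueError(f"unsupported XPath step: {step!r}")
            label, index = match.group(1), match.group(2)
            fragments.append((label, int(index) if index else 0))
    if max_depth is not None:
        # Truncate or pad so every node has the same number of fragments.
        fragments = fragments[:max_depth]
        fragments += [(PAD, 0)] * (max_depth - len(fragments))
    return fragments
```

Applied to the expression from Figure 3, `split_xpath("html/body/div/div/p[15]")` gives five fragments whose last entry is `("p", 15)`. Padding to a fixed `max_depth` is a common way to batch nodes of different depths, though the source does not say how depth differences are handled.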

[0098] By constructing word embedding vector matrices for the labels and indices of the XPath expression fragments respectively, we can obtain a word embedding vector for an XPath expression fragment:

[0099]

[0100] For node $x_i$, the word embedding vector encoding of the overall XPath expression is as follows:

[0101]

[0102] When constructing positional encoding, because HTML web pages differ from plain text, their DOM tree structure contains more positional information. Therefore, based on the original positional encoding, a more multi-dimensional positional encoding was developed. The token's depth and tag index were introduced. Depth represents the depth of the HTML node to which the current token belongs in the DOM tree structure, and the tag index represents the index of the HTML node to which the current token belongs. Tokens with the same tag index indicate that they belong to the same HTML node. Combining the depth and tag index information with the token's positional index yields a multi-dimensional positional encoding of the input data.

[0103] $E_{pos} = [P_1, P_2, P_3]$;

[0104] Where $P_1$, $P_2$, and $P_3$ represent the depth, label index, and position index of the input token, respectively.
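To make the three components concrete, the following sketch derives one (depth, label index, position index) triple per token. It assumes the input is already a list of HTML nodes in document order, each given as a (depth, tokens) pair; the function name `position_triples` and this input layout are illustrative assumptions.

```python
def position_triples(nodes):
    """Return one (P1, P2, P3) triple per token.

    nodes: list of (depth, tokens) pairs in document order, where depth is
    the node's depth in the DOM tree and tokens are the node's tokens.
    P1 = depth of the owning node, P2 = index of the owning node,
    P3 = running position of the token in the whole input.
    """
    triples = []
    position = 0
    for tag_index, (depth, tokens) in enumerate(nodes):
        for _ in tokens:
            triples.append((depth, tag_index, position))
            position += 1
    return triples
```

Tokens that share the same second component belong to the same HTML node, which matches the description of the label index above. How each component is mapped to a vector (for example through separate lookup tables) is not stated in the source and is left out here.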

[0105] S230, the word embedding vector encoding and position encoding are input into the initial Chinese webpage pre-training model for training, to obtain the Chinese webpage pre-training model.

[0106] In this embodiment of the disclosure, the framework of the Chinese webpage pre-training model is shown in Figure 4. The word embedding vector encoding and positional encoding are first input into the initial Chinese webpage pre-trained model, and are then used for masked language modeling and masked node modeling. Specifically, when implementing masked language modeling and masked node modeling, the input tokens are first embedded to obtain the text embedding $E_{text}$. Then the position embedding $E_{pos}$ and the segment embedding $E_{seg}$ of the input content are obtained in turn, and the input word embedding vector is obtained by combining them:

[0107] $E_{word} = E_{text} + E_{pos} + E_{seg} + E_{xpath}$;
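The sum above is element-wise, so it only makes sense if the four embeddings share one dimension. The short sketch below shows the operation on plain Python lists; the function name `combine_embeddings` is hypothetical, and the assumption that each component has already been projected to the same hidden size is not stated in the source.

```python
def combine_embeddings(e_text, e_pos, e_seg, e_xpath):
    """Element-wise sum of four equally sized embedding vectors."""
    parts = (e_text, e_pos, e_seg, e_xpath)
    if len({len(p) for p in parts}) != 1:
        raise ValueError("all embeddings must have the same dimension")
    return [sum(values) for values in zip(*parts)]
```

For instance, `combine_embeddings([1.0, 2.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0])` returns `[2.5, 4.5]`.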

[0108] Afterwards, a Text Extractor separates the HTML tag sequence $T = (v_{CLS}, t_0, t_1, \ldots, t_n)$ from the plain text sequence $X = (x_{CLS}, x_0, x_1, \ldots, x_n)$. The two sequences are processed by the Transformer Encoder (the main component of the BERT model) to obtain the corresponding hidden layer representation $h$, and then masked language modeling and masked node modeling are performed.

[0109] 1. Masked Language Modeling

[0110] Although the input encodes both HTML tags and text content, only tokens in the plain text content are selected for masking, at a rate of 15%. The model then predicts the masked words, ignoring tokens whose content is HTML tags. The objective function for this task is:

[0111]

[0112] Where $x$ represents a text sequence in the input that does not contain HTML tags. The masked tokens are obtained by randomly selecting tokens and performing a masking operation, and they are reconstructed from the remaining, unmasked tokens.
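The selection rule can be sketched as follows: only positions holding plain text are candidates, and roughly 15% of them are replaced by a mask symbol. The function name `mask_text_tokens`, the `is_html_tag` flag list, and the rule of masking at least one token are assumptions for illustration. The sketch also always substitutes `[MASK]`, whereas standard BERT pre-training sometimes keeps or randomly replaces the selected token; the source does not say which variant is used.

```python
import random

MASK = "[MASK]"


def mask_text_tokens(tokens, is_html_tag, mask_rate=0.15, seed=None):
    """Mask about mask_rate of the plain-text tokens, never an HTML tag token.

    Returns (masked_tokens, labels) where labels maps each masked position
    to the original token the model should reconstruct.
    """
    rng = random.Random(seed)
    candidates = [i for i, is_tag in enumerate(is_html_tag) if not is_tag]
    # Mask at least one token whenever any text token exists (assumption).
    n_mask = max(1, round(len(candidates) * mask_rate)) if candidates else 0
    masked_positions = sorted(rng.sample(candidates, n_mask))
    masked = list(tokens)
    labels = {}
    for i in masked_positions:
        labels[i] = masked[i]
        masked[i] = MASK
    return masked, labels
```

The `labels` dictionary plays the role of the prediction targets: the loss is computed only at those positions, and HTML tag positions never appear in it.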

[0113] 2. Masked Node Modeling

[0114] For the masked node modeling task, the HTML tags are extracted from the input sequence to form a node sequence $T = (v_{CLS}, t_0, t_1, \ldots, t_n)$. The $j$-th tag token is randomly masked to obtain $T_{mask} = (v_{CLS}, t_0, t_1, \ldots, v_{mask}, \ldots, t_n)$, which can be seen as the context of $t_j$. $T_{mask}$ is fed into the model's encoder and then processed by $\mathrm{MASK}(\cdot)$ to obtain the hidden layer mask representation of $t_j$ predicted from the context:

[0115]

[0116] Then the context representation $h_j$ of the original sequence is obtained in the same way:

[0117] $h_j = \mathrm{MASK}(\mathrm{Encoder}(T))$

[0118] Then a cosine similarity function is used to measure how closely the predicted representation of the masked $t_j$ approximates the original representation, and this measure serves as the loss function for this pre-training task:

[0119]

[0120] Combining the two pre-training tasks, the final loss function of the webpage pre-trained model is:

[0121] $L = L_{mlm} + L_{mnp}$
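The masked node loss and the combined loss can be sketched numerically. The exact form of the cosine-based loss is not reproduced in this text, so the sketch assumes the common choice of one minus the cosine similarity; the function names and the toy vectors are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for a zero vector, to avoid dividing by zero
    return dot / (norm_a * norm_b)


def masked_node_loss(h_pred: Sequence[float], h_orig: Sequence[float]) -> float:
    # Assumed form: the loss falls to 0 as the prediction aligns with h_j.
    return 1.0 - cosine_similarity(h_pred, h_orig)


def total_loss(l_mlm: float, l_mnp: float) -> float:
    # L = L_mlm + L_mnp, an unweighted sum of the two task losses.
    return l_mlm + l_mnp


l_mnp_aligned = masked_node_loss([1.0, 0.0], [2.0, 0.0])     # same direction
l_mnp_orthogonal = masked_node_loss([1.0, 0.0], [0.0, 3.0])  # unrelated direction
```

Under this assumed form, a predicted representation that points in the same direction as the original one incurs no loss regardless of its magnitude.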

[0122] S240: Set downstream tasks for the Chinese webpage pre-training model.

[0123] In this step, a downstream task is set for the Chinese webpage pre-training model, so that when target webpage data of a target webpage containing a point of interest is input into the model, the model outputs the target attribute information corresponding to the point of interest.

[0124] Specifically, the model is fine-tuned using two downstream tasks: attribute extraction and question answering.

[0125] For attribute extraction tasks, which involve extracting DOM node content corresponding to specific attributes from a webpage, this task can be equivalently modeled as a node classification task. Each DOM node is assigned an attribute; if no attribute exists, it is assigned the null value "none". The hidden layer output `h` of the last layer of the pre-trained model is used as the representation of the DOM node. Then, a multilayer perceptron (MLP) classifier is used to calculate the score `s = MLP(h)` for each attribute type, and the cross-entropy function is used to calculate the loss. Finally, during the inference phase, after passing through a softmax layer, the node attribute is predicted based on the highest score, thus obtaining the DOM node content containing the desired attribute. For each attribute to be extracted, an attribute category needs to be predefined. For example, to extract phone information from a webpage, a "phone" attribute category needs to be defined, and the fine-tuned model is specifically designed for extracting phone information from webpages. Therefore, for each attribute to be extracted, a pre-trained model needs to be fine-tuned to complete the specific extraction task.
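The node classification step can be sketched as follows. This is a toy illustration only: the two-layer perceptron with a ReLU hidden layer, the identity weights, and the label set `["none", "phone"]` are assumptions chosen to keep the example readable, not parameters from the disclosure.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def linear(x: Sequence[float], weights: Matrix, bias: Sequence[float]) -> List[float]:
    # One dense layer: out[j] = sum_i weights[j][i] * x[i] + bias[j]
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp_scores(h: Sequence[float], w1: Matrix, b1: Sequence[float],
               w2: Matrix, b2: Sequence[float]) -> List[float]:
    """s = MLP(h): one ReLU hidden layer followed by an output layer (assumed shape)."""
    hidden = [max(0.0, v) for v in linear(h, w1, b1)]
    return linear(hidden, w2, b2)


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], target: int) -> float:
    # Training loss for a node whose gold attribute has index `target`.
    return -math.log(probs[target])


LABELS = ["none", "phone"]  # hypothetical attribute categories
W1, B1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]  # toy weights, not trained values
W2, B2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]


def predict_attribute(h: Sequence[float]) -> str:
    """Inference: softmax over the scores, then pick the highest-scoring attribute."""
    probs = softmax(mlp_scores(h, W1, B1, W2, B2))
    return LABELS[probs.index(max(probs))]


phone_node = predict_attribute([0.2, 1.5])
other_node = predict_attribute([2.0, 0.1])
```

In practice `h` would be the last-layer hidden output of the pre-trained model for a DOM node, and the weights would be learned during fine-tuning for the attribute in question.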

[0126] For question answering tasks, the goal is to extract the best answer to a given question from an HTML webpage. Unlike node classification tasks, question answering requires the model to treat a segment of text in the webpage as the answer, and this segment may span multiple HTML nodes. There are also yes/no questions, for which special "yes" and "no" tags are added to the document in advance. The model input consists of the question about the webpage concatenated with the webpage data. The output h of the last hidden layer is used as the representation of the DOM nodes, and a binary classifier is used to obtain two scores, $s_s$ and $s_e$, for each node, indicating whether the node is the start or the end position of the answer. The start and end positions of the answer are determined from these scores, and the loss is calculated using the cross-entropy function.
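One way to turn the per-node start and end scores into an answer span is sketched below. The text only states that the positions are determined from the scores, so the decoding rule used here (maximize the sum of the start and end scores subject to the start not following the end) is an assumption, and `best_span` is a hypothetical name.

```python
from typing import Optional, Sequence, Tuple


def best_span(start_scores: Sequence[float], end_scores: Sequence[float],
              max_len: Optional[int] = None) -> Tuple[int, int]:
    """Return (start, end) node indices maximizing s_s[start] + s_e[end], start <= end."""
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(start_scores)
    best, best_score = (0, 0), float("-inf")
    for s in range(n):
        # Optionally limit how many nodes an answer may span.
        last = n if max_len is None else min(n, s + max_len)
        for e in range(s, last):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Toy scores for four DOM nodes (illustrative values only).
span = best_span([0.1, 2.0, 0.3, 0.2], [0.0, 0.5, 1.8, 0.4])
```

Here the answer would cover nodes 1 through 2, which matches the case described above where an answer spans more than one HTML node.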

[0127] The fine-tuned model can be applied to POI-based question-answering and retrieval tasks. Data crawling code is written to crawl relevant Chinese web pages based on POI keywords, and the web page data is preprocessed. The required POI attribute extraction models are then selected, for example extraction models for attributes such as phone number, time, and office location. After the results are obtained, each POI name is matched with its corresponding attributes. The matching method mainly calculates the distance between two nodes based on their XPath representations; if the distance is small, the attribute is taken to describe that POI, which completes one match. The specific model training process is shown in Figure 5, in which the BERT model is used. BERT (Bidirectional Encoder Representations from Transformers) is an autoencoding language model.
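The XPath-based matching of a POI name to an attribute can be sketched as below. The distance measure is not specified in the text; this sketch assumes a tree distance that counts the steps from each node up to their deepest common ancestor. The function names and the example paths are hypothetical.

```python
from typing import Dict, List


def xpath_steps(xpath: str) -> List[str]:
    # "/html/body/div[2]/p[1]" -> ["html", "body", "div[2]", "p[1]"]
    return [step for step in xpath.split("/") if step]


def xpath_distance(a: str, b: str) -> int:
    """Steps from each node up to the deepest common ancestor (assumed measure)."""
    steps_a, steps_b = xpath_steps(a), xpath_steps(b)
    common = 0
    for x, y in zip(steps_a, steps_b):
        if x != y:
            break
        common += 1
    return (len(steps_a) - common) + (len(steps_b) - common)


def closest_poi(poi_xpaths: Dict[str, str], attribute_xpath: str) -> str:
    """Return the POI name whose node lies nearest to the attribute node."""
    return min(poi_xpaths,
               key=lambda name: xpath_distance(poi_xpaths[name], attribute_xpath))


poi_nodes = {
    "POI A": "/html/body/div[1]/h2",
    "POI B": "/html/body/div[2]/h2",
}
phone_xpath = "/html/body/div[2]/p[1]"
matched = closest_poi(poi_nodes, phone_xpath)
```

A threshold on the distance could additionally be applied so that an attribute far from every POI name is left unmatched; the value of such a threshold is not given in the text.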

[0128] Based on the same inventive concept, this disclosure also provides a Chinese webpage point of interest retrieval device. As shown in Figure 6, the Chinese webpage point of interest retrieval device 600 may include:

[0129] The acquisition unit 601 is configured to acquire, in response to a user's instruction to select a point of interest, target webpage data of the target webpage containing the point of interest;

[0130] The processing unit 602 is used to input the target webpage data into the Chinese webpage pre-training model to obtain the target attribute information corresponding to the interest point. The Chinese webpage pre-training model is obtained by pre-training using sample webpage data containing hypertext markup language.

[0131] Display unit 603 is used to display target attribute information.

[0132] In one possible implementation, the processing unit 602 in the apparatus provided by the embodiments of the present invention is further configured to:

[0133] Extract the target hypertext markup language and target Chinese word segmentation from the target webpage data;

[0134] The target path language of the target webpage is determined based on the target hypertext markup language and the target Chinese word segmentation.
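Deriving a path for each piece of text from the HTML can be sketched with the Python standard library. This is an illustration under assumptions: the class name `XPathTextExtractor`, the handling of void tags, and the sample page are hypothetical, and the Chinese word segmentation of each text node is left out because it would need a separate segmenter.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class XPathTextExtractor(HTMLParser):
    """Collect (xpath, text) pairs for every non-empty text node."""

    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []                   # XPath steps of the open elements
        self.counters: List[Dict[str, int]] = [{}]   # per-level sibling counts by tag
        self.results: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        siblings = self.counters[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        if tag in self.VOID_TAGS:
            return  # void elements cannot contain text, so they are not pushed
        self.stack.append(f"{tag}[{siblings[tag]}]")
        self.counters.append({})

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS:
            return
        # Pop only when the end tag matches the innermost open element.
        if self.stack and self.stack[-1].split("[")[0] == tag:
            self.stack.pop()
            self.counters.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.results.append(("/" + "/".join(self.stack), text))


page = "<html><body><div><h2>某餐厅</h2><p>电话: 010-1234</p></div></body></html>"
extractor = XPathTextExtractor()
extractor.feed(page)
extractor.close()
pairs = extractor.results
```

Each pair ties a text node to its position in the document object model tree, which is the information the path language is meant to supply alongside the segmented text.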

[0135] In one possible implementation, the processing unit 602 in the apparatus provided by the embodiments of the present invention is further configured to:

[0136] The target Chinese word segmentation and target path language are input into the Chinese webpage pre-trained model to obtain the target attribute information corresponding to the points of interest.

[0137] In one possible implementation, the processing unit 602 in the apparatus provided by the present invention trains the Chinese webpage pre-training model according to the following method:

[0138] Obtain sample webpage data for training Chinese webpage pre-trained models;

[0139] Determine the word embedding vector encoding and position encoding of the sample webpages based on the sample webpage data;

[0140] The word embedding vector encoding and position encoding are input into the initial Chinese webpage pre-training model for training, thus obtaining the Chinese webpage pre-training model.

[0141] In one possible implementation, the processing unit 602 in the apparatus provided by the embodiments of the present invention is further configured to:

[0142] Extract the hypertext markup language and Chinese word segmentation from the sample webpage data;

[0143] The path language of the sample webpage is determined based on Hypertext Markup Language and Chinese word segmentation;

[0144] The word embedding vector encoding and position encoding of the sample webpage are determined based on the path language.

[0145] In one possible implementation, the processing unit 602 in the apparatus provided by the present invention is further configured to: input word embedding vector encoding and position encoding into an initial Chinese webpage pre-training model;

[0146] Perform masked language modeling and masked node modeling using the word embedding vector encoding and the position encoding.

[0147] In one possible implementation, the processing unit 602 in the apparatus provided by the embodiments of the present invention is further configured to:

[0148] A downstream task is set for the Chinese webpage pre-training model, so that when target webpage data of a target webpage containing a point of interest is input into the model, the model outputs the target attribute information corresponding to the point of interest.

[0149] According to embodiments of this disclosure, this disclosure also provides an electronic device, a non-transitory computer-readable storage medium, and a computer program product.

[0150] Figure 7 shows a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.

[0151] As shown in Figure 7, the device 700 includes a computing unit 701, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 702 or a computer program loaded into random access memory (RAM) 703 from storage unit 708. RAM 703 may also store various programs and data required for the operation of device 700. The computing unit 701, ROM 702, and RAM 703 are interconnected via bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

[0152] Multiple components in electronic device 700 are connected to I / O interface 705, including: input unit 706, such as keyboard, mouse, etc.; output unit 707, such as various types of displays, speakers, etc.; storage unit 708, such as disk, optical disk, etc.; and communication unit 709, such as network card, modem, wireless transceiver, etc. Communication unit 709 allows device 700 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0153] The computing unit 701 can be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as the Chinese web page interest point retrieval method. For example, in some embodiments, the Chinese web page interest point retrieval method can be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program can be loaded and/or installed on device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by the computing unit 701, one or more steps of the Chinese web page interest point retrieval method described above can be performed. Alternatively, in other embodiments, the computing unit 701 can be configured to perform the Chinese web page interest point retrieval method by any other suitable means (e.g., by means of firmware).

[0154] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0155] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0156] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0157] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0158] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or middleware components (e.g., application servers), or frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), the Internet, and blockchain networks.

[0159] Computer systems can include clients and servers. Clients and servers are generally geographically separated and typically interact via communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. A server can be a cloud server, also known as a cloud computing server or cloud host, a hosting product within the cloud computing service ecosystem, addressing the shortcomings of traditional physical hosts and VPS (Virtual Private Server, or simply "VPS") services, such as high management difficulty and weak business scalability. Servers can also be servers for distributed systems or servers incorporating blockchain technology.

[0160] It should be understood that the various forms of processes shown above may be used with steps reordered, added, or deleted. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.

[0161] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A method for retrieving points of interest (POIs) on Chinese web pages, characterized in that it comprises: in response to a user's instruction to select a point of interest, obtaining target webpage data of a target webpage containing the point of interest; extracting the target Hypertext Markup Language (HTML) and the target Chinese word segmentation of the target webpage from the target webpage data; determining the target path language of the target webpage based on the target hypertext markup language and the target Chinese word segmentation, wherein the target path language is the XML Path Language (XPath), which is used to represent the position information of the target Chinese word segmentation in the document object model tree; inputting the target Chinese word segmentation and the target path language into the Chinese webpage pre-training model to obtain the target attribute information corresponding to the point of interest, wherein the Chinese webpage pre-training model is obtained by pre-training on sample webpage data containing hypertext markup language through a dual-task pre-training process of masked language modeling and masked node modeling, the masked language modeling masking only the tokens of plain text content and the masked node modeling masking the node sequence composed of HTML tags; and displaying the target attribute information.

2. The method according to claim 1, characterized in that, The Chinese webpage pre-trained model is trained using the following method: Obtain sample webpage data for training the Chinese webpage pre-training model; The word embedding vector encoding and position encoding of the sample webpage are determined based on the sample webpage data; The word embedding vector encoding and the position encoding are input into the initial Chinese webpage pre-training model for training to obtain the Chinese webpage pre-training model.

3. The method according to claim 2, characterized in that, The step of determining the word embedding vector encoding and position encoding of the sample webpage based on the sample webpage data includes: Extract the hypertext markup language and Chinese word segmentation of the sample webpage from the sample webpage data; The path language of the sample webpage is determined based on the hypertext markup language and the Chinese word segmentation. The word embedding vector encoding and position encoding of the sample webpage are determined based on the path language.

4. The method according to claim 3, characterized in that, The step of inputting the word embedding vector encoding and the position encoding into the initial Chinese webpage pre-training model for training to obtain the Chinese webpage pre-training model includes: The word embedding vector encoding and the position encoding are input into the initial Chinese webpage pre-training model; The word embedding vector encoding and the position encoding are used for masked language modeling and masked node modeling.

5. The method according to claim 4, characterized in that, After performing masked language modeling and masked node modeling using the word embedding vector encoding and the positional encoding, the method further includes: A downstream task is set for the Chinese webpage pre-training model, so that when the Chinese webpage pre-training model is input with target webpage data containing the target webpage of the interest point, it outputs the target attribute information corresponding to the interest point.

6. A Chinese webpage point of interest retrieval device, characterized in that it comprises: an acquisition unit configured to acquire, in response to a user's instruction to select a point of interest, target webpage data of a target webpage containing the point of interest; a processing unit configured to extract the target hypertext markup language (HTML) and the target Chinese word segmentation of the target webpage from the target webpage data; determine the target path language of the target webpage based on the target hypertext markup language and the target Chinese word segmentation, wherein the target path language is the XML Path Language (XPath), which is used to represent the position information of the target Chinese word segmentation in the document object model tree; and input the target Chinese word segmentation and the target path language into a Chinese webpage pre-training model to obtain the target attribute information corresponding to the point of interest, wherein the Chinese webpage pre-training model is obtained by pre-training on sample webpage data containing hypertext markup language through a dual-task pre-training process of masked language modeling and masked node modeling, the masked language modeling masking only the tokens of plain text content and the masked node modeling masking the node sequence composed of HTML tags; and a display unit configured to display the target attribute information.

7. The apparatus according to claim 6, characterized in that, The processing unit trains the Chinese webpage pre-training model according to the following method: Obtain sample webpage data for training the Chinese webpage pre-training model; The word embedding vector encoding and position encoding of the sample webpage are determined based on the sample webpage data; The word embedding vector encoding and the position encoding are input into the initial Chinese webpage pre-training model for training to obtain the Chinese webpage pre-training model.

8. The apparatus according to claim 7, characterized in that, The processing unit is also used for: Extract the hypertext markup language and Chinese word segmentation of the sample webpage from the sample webpage data; The path language of the sample webpage is determined based on the hypertext markup language and the Chinese word segmentation. The word embedding vector encoding and position encoding of the sample webpage are determined based on the path language.

9. The apparatus according to claim 8, characterized in that, The processing unit is further configured to: input the word embedding vector encoding and the position encoding into the initial Chinese webpage pre-training model; The word embedding vector encoding and the position encoding are used for masked language modeling and masked node modeling.

10. The apparatus according to claim 9, characterized in that, The processing unit is also used for: A downstream task is set for the Chinese webpage pre-training model, so that when the Chinese webpage pre-training model is input with target webpage data containing the target webpage of the interest point, it outputs the target attribute information corresponding to the interest point.

11. An electronic device, characterized in that it comprises: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the Chinese webpage interest point retrieval method as described in any one of claims 1 to 5.

12. A computer storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the Chinese webpage point of interest retrieval method as described in any one of claims 1 to 5.

13. A computer program product comprising a computer program / instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1 to 5.

Citation Information

Patent Citations

CN111737623A
CN114429106A
