Web table abnormal data discovery method based on text semantic mapping relationship

CN115659989B · Active · Publication Date: 2026-06-23 · SOUTHEAST UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SOUTHEAST UNIV
Filing Date: 2022-10-06
Publication Date: 2026-06-23

Application Information

IPC: G06F40/30; G06F40/284; G06F40/18; G06F18/22; G06F18/214; G06F18/24; G06N3/04; G06N3/08

AI Tagging

Application Domain

Semantic analysis; Text processing

Technology Topics

Semantic vector; Semantic representation

Technical Efficacy Phrases

Improve accuracy; solve the problem that fuzzy semantic information is difficult to recognize

Abstract

The application discloses a Web table abnormal data discovery method based on a text semantic mapping relationship. The application aims at discovering abnormal data with fuzzy or even wrong semantic information in a Web table. The method mainly comprises three parts: a semantic representation module, a column type inference module and an error discovery module. First, the semantic representation module represents the meaning of cell text. For a cell in a table, the string text in the cell is represented as a semantic vector according to context information. Then, the column type inference module infers the type of the column where the cell is located, and obtains the mode information of the column. Finally, based on the mapping relationship between the column type and the semantic vector of the cell text of the main column cell and the target cell, abnormal data in the table is discovered and labeled.

Description

Technical Field

[0001] This invention relates to the field of data anomaly detection and its applications, and in particular to a method for discovering abnormal data in Web tables based on text semantic mapping relationships.

Background Technology

[0002] With the rapid development of the World Wide Web, various information websites have gradually integrated into people's lives, becoming indispensable tools for obtaining various kinds of information daily. Relationship tables containing semantic information within web pages are called Web tables. The vast number of Web tables not only facilitates knowledge acquisition but also serves as an important data source for numerous machine learning and training tasks. However, because Web semantic tables are open to users and can be edited by anyone, they contain a large amount of abnormal data and even maliciously tampered information. Effectively identifying abnormal data in Web tables has significant practical implications.

[0003] Traditional table anomaly handling techniques mainly include integrity constraint-based and rule-based anomaly detection methods, as well as attack-based and machine learning-based methods. Integrity constraint-based methods handle anomalies using pre-built constraints such as functional dependencies, inclusion dependencies, and conditional functional dependencies. They require a large amount of constraint information and are difficult to extend to rich and varied Web tables. Rule-based methods encode the correct values directly, but those values depend on high-quality external resources: if the external database lacks the relevant knowledge, errors in the data cannot be detected. Moreover, Web tables often lack predefined standard schemas, so traditional techniques, which rely on predefined and explicit relational schema information, struggle with ambiguous or even erroneous semantic information in Web tables. Machine learning-based methods, for their part, depend on feature engineering, require large amounts of labeled data, and do not yet offer a complete anomaly detection method designed for Web tables.

[0004] To address the challenges of traditional table anomaly handling techniques, semantic models can be introduced as a new auxiliary tool to assist in the mining and discovery of ambiguous or erroneous information in Web tables. Table data processing is closely related to general natural language processing problems, both requiring semantic learning and processing of the textual expressions. However, unlike widely studied descriptive text, tables, with their row and column layout, present different semantic meanings, necessitating the development of processing models specifically for table formats. Furthermore, identifying anomalous data based on mining and utilizing the semantic mapping relationships within tables presents a new challenge. Therefore, designing semantic models tailored to the semantic features of tables to achieve anomaly handling in Web tables is a crucial problem that needs to be solved.

Summary of the Invention

[0005] Objective of the Invention: To address the problems of the prior art described above, the present invention proposes a method for discovering abnormal data in Web tables based on text semantic mapping relationships. It targets two weaknesses of traditional table anomaly handling techniques: errors are hard to recognize in tables that lack schema information, and fuzzy or incorrect semantic information is hard to handle. The method converts the text strings in cells into semantic vectors in the text semantic space and uses them to infer the pattern information of columns. Errors in the table are then discovered by means of relational mapping.

[0006] Technical Solution: To achieve the objective of the present invention, the technical solution adopted by the present invention is: A method for discovering abnormal data in Web tables based on text semantic mapping relationships, the method comprising the following steps:

[0007] Step 1. Given the Web table data T to be processed, where T = {c_{i,j} | 0 ≤ i < R, 0 ≤ j < C}, R and C are the numbers of rows and columns of T, and c_{i,j} is the string text of a cell, each cell string consisting of one or more English words, c_{i,j} = (x_1, x_2, ..., x_n). Pre-train the semantic model M_SR on a table data set; during training, splice all string texts in the same row and in the same column as each cell to form the training set. Input a cell c_{i,j} of table T into the model M_SR and output its corresponding semantic vector v_{i,j};

[0008] Step 2. Train the column type inference model M_CTI on a large amount of Web table data. In the string text semantic space, classify directly according to the existing columns, training M_CTI as a multi-class classifier. Input the table T, after processing by the semantic model M_SR, into the model M_CTI, and output the column type inference result H = {h_j | 0 ≤ j < C};

[0009] Step 3. Establish an error discovery model M_ED according to the mapping relationship between the core column and the column where the target cell is located. The inputs of the model are the cell semantic vector v_{i,j} obtained in Step 1 and the column type inference result H obtained in Step 2; the output is the predicted cell semantic vector. Calculate the cosine similarity between the predicted value and the actual value v_{i,j}; cells whose similarity is lower than the threshold are considered to contain abnormal data.

[0010] Furthermore, in Step 1, the semantic model M_SR is pre-trained on a table data set; during training, all string texts in the same row and in the same column as each cell are concatenated to form the training set; a cell c_{i,j} of table T is input into M_SR, and its corresponding semantic vector v_{i,j} is output. The specific steps are as follows:

[0011] Step 101. Take any column j of table T to form a column cell data set {c_{i,j} | 0 ≤ i < R}_j. Randomly permute the sets generated from multiple columns to form a new ordered data set. Break up its cells and concatenate the cell texts to obtain the column data training set Set_C, as shown in the following formula:

[0012]

[0013] Step 102. Process the row cell data according to the method described in Step 101 to obtain the row data training set Set_R;

[0014] Step 103. Generate the training set Set_W = Set_C ∪ Set_R, where Set_C is the training set constructed by the column-based data collection method and Set_R is the training set constructed by the row-based data collection method;

[0015] Step 104. Train a Word2Vec model on the constructed training set to obtain a dictionary that maps text to semantic vectors. For the input cell content c_{i,j} = (x_1, x_2, ..., x_n), look up each word in the dictionary to obtain its semantic vector g(x_k) = v_k, and obtain the semantic representation vector of cell c_{i,j} by averaging. The calculation formula is as follows:

[0016] v_{i,j} = \frac{1}{n} \sum_{k=1}^{n} g(x_k)
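Steps 101 to 104 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes that "randomly permute" means shuffling the cells inside each column or row, and it replaces the trained Word2Vec dictionary with a small hand-written lookup (`toy_embedding`). The function names `build_training_set` and `cell_vector` are hypothetical.

```python
import random

def build_training_set(table, seed=0):
    """Build Set_W = Set_C ∪ Set_R: one word sequence per column and per row."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*table)]
    rows = [list(row) for row in table]
    sequences = []
    for cells in columns + rows:
        cells = [c for c in cells if c]   # skip empty cells
        rng.shuffle(cells)                # assumed reading of "randomly permute"
        # Break the cells up and concatenate their words into one sequence.
        sequences.append([w for c in cells for w in c.split()])
    return sequences

def cell_vector(cell_text, embedding):
    """Average the word vectors g(x_k) of a cell to obtain v_{i,j}."""
    vectors = [embedding[w] for w in cell_text.split() if w in embedding]
    if not vectors:
        return None                       # no known word in this cell
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

table = [["Player O", "Team R"], ["Player M", "Team L"]]
sequences = build_training_set(table)
# Hypothetical 2-dimensional vectors standing in for a trained Word2Vec dictionary.
toy_embedding = {"Player": [1.0, 0.0], "M": [0.0, 1.0]}
print(len(sequences))
print(cell_vector("Player M", toy_embedding))
```

In practice the sequences in `sequences` would be passed to a Word2Vec trainer, and the resulting dictionary would take the place of `toy_embedding`.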

[0017] In Step 2, the column type inference model M_CTI is trained as a multi-class classifier; the table T, after processing by the semantic model M_SR, is input into M_CTI, and the column type inference result H = {h_j | 0 ≤ j < C} is output. The specific steps are as follows:

[0018] Step 201. Let the data of a column of table T be C_j = {c_{i,j} | 0 ≤ i < R}_j. Traverse all the cell data in this column; if a cell's string text is not empty, process it with the semantic model M_SR, obtaining the set of semantic representation vectors C'_j = {v_{i,j} | 0 ≤ i < R}_j for the cells in this column;

[0019] Step 202. Randomly select l semantic vectors from C'_j and denote them as C''_j = {v_1, v_2, ..., v_l};

[0020] Step 203. Use the deep learning language model Transformer as the main body of the column type inference model M_CTI. Insert the [CLS] label at the beginning of the data C''_j and the [SEP] label at its end, and use the result, after tokenization, as the input text;

[0021] Step 204. Send the input text into the word vector layer (TokenEmbedding) to convert each word into a word embedding vector with the same dimension;

[0022] Step 205. Send the input text into the position vector layer (PositionEmbedding) to convert each word into a position embedding vector. Specifically, mark the position of the [CLS] label as E0, mark the position of the [SEP] label as E2, and the corresponding position embedding vectors of all other inputs are E1;

[0023] Step 206. Add the Token Embedding and Position Embedding vectors to form the input of the M_CTI model. Take the output vector at the [CLS] label position as the object of further processing: feed it into a two-layer multi-layer perceptron and then compute the normalized exponential function (Softmax function) of the result:

[0024]

[0025] O = Softmax(H_1 W_o) + b_o

[0026] Step 207. Formulate model training as a multi-class classification problem, using the rectified linear unit (ReLU) as the activation function and the cross-entropy loss function:

[0027]

[0028] where M is the number of classes, that is, the total number of column types counted, y_{ic} is the classification label, and p_{ic} is the probability that the model outputs for the corresponding class;

[0029] Step 208. Obtain the hidden-layer result h_j ∈ R^{d_2} from the multi-layer perceptron, and use this value as the semantic representation vector of the data in this column;
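The classification head of Steps 206 to 208 can be sketched in plain Python as follows. This is an illustration under assumptions, not the patented model: the Transformer body is omitted and `cls_output` stands for its [CLS] output, all weights are hand-picked toy values, and the output bias is added before Softmax (the common convention, so that the outputs sum to one) rather than after it as the formula above is written. The function names are hypothetical.

```python
import math

def relu(xs):
    return [max(0.0, x) for x in xs]

def softmax(xs):
    m = max(xs)                                   # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weights, bias):
    """Row vector x times a weight matrix (list of rows), plus bias."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

def classify_column(cls_output, w1, b1, w_o, b_o):
    hidden = relu(linear(cls_output, w1, b1))     # hidden layer, reused as h_j
    probs = softmax(linear(hidden, w_o, b_o))     # assumption: bias inside Softmax
    return hidden, probs

def cross_entropy(probs, true_class):
    """Cross-entropy loss for one sample with a one-hot label."""
    return -math.log(probs[true_class])

# Toy values: 2-dimensional [CLS] output, 2 hidden units, 3 column types.
cls_output = [1.0, -1.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_o, b_o = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [0.0, 0.0, 0.0]
hidden, probs = classify_column(cls_output, w1, b1, w_o, b_o)
print(hidden, probs.index(max(probs)), cross_entropy(probs, 0))
```

The `hidden` vector plays the role of h_j in Step 208: after training, it is kept as the column's semantic representation, while `probs` is only needed for the classification loss.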

[0030] Furthermore, in Step 3, the error discovery model M_ED is established according to the mapping relationship between the core column and the column where the target cell is located. The specific steps are as follows:

[0031] Step 301. Define the leftmost column of the table as the core column, the leftmost entity in each row as the core entity, and the set of core entities in table T as {c_{i,0} | 0 ≤ i < R};

[0032] Step 302. In the table T = {c_{i,j} | 0 ≤ i < R, 0 ≤ j < C}, combine the semantic representation results {v_{i,j} | 0 ≤ i < R, 0 ≤ j < C} obtained in Step 1 and the column type inference results H = {h_j | 0 ≤ j < C} obtained in Step 2, and establish a mapping relationship h based on the core entity in each row:

[0033] h(h_0, h_1) ≈ h(v_{i,0}, h_{i,j})

[0034] Step 303. Establish the error discovery model M_ED with a long short-term memory (LSTM) neural network as the core model and the sequence-to-sequence (Seq2Seq) model as the overall framework. Model the mapping relationship between the column types of the two columns, and compute the mapping relationship between the cell entities within a row to obtain the result. The inputs of the model are the column type inference result H and the cell semantic vectors v_{i,j} of table T, and the output is the predicted cell semantic vector;

[0035] Step 304. In the encoding part (Encoder) of the M_ED model, the input is the column type inference vector H, and the two columns to be processed form the inputs {h_0, h_j} at different time steps. At each time step the model outputs the hidden layer vectors {x_1, x_2}, the information vector, and the previous step information. The final hidden layer vector x_2 is extracted as the input to the decoding part;

[0036] Step 305. In the decoding part (Decoder) of the M_ED model, the input is the pair of core entity and attribute entity c_{i,0} and c_{i,j}. The former constitutes the input sequence {v_{i,0}, x_<go>}, where x_<go> is a predefined flag vector. The output x_2 of the encoding part is used as the hidden layer input, and the model output is the information vector {o_1, o_2}. Let...

[0037] Step 306. Calculate the cosine similarity between the predicted value of the cell semantic vector and the actual value v_{i,j}. Cells whose similarity is below a set threshold are considered to contain abnormal data;

[0038] Step 307. The M_ED model outputs the cells of the table whose actual and predicted values do not match, indicating possible abnormal data in the table.
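The comparison in Steps 306 and 307 amounts to a thresholded cosine similarity. The sketch below shows one way to write it in Python. The vectors, the cell positions, and the threshold value 0.8 are all made up for illustration; the source does not state a threshold. `flag_abnormal_cells` is a hypothetical helper name.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0                     # treat a zero vector as dissimilar
    return dot / (norm_u * norm_v)

def flag_abnormal_cells(predicted, actual, threshold=0.8):
    """Return the positions (i, j) whose predicted and actual vectors disagree."""
    flagged = []
    for position, pred_vec in predicted.items():
        if cosine_similarity(pred_vec, actual[position]) < threshold:
            flagged.append(position)
    return sorted(flagged)

# Toy predicted and actual semantic vectors for two cells of column 1.
predicted = {(0, 1): [1.0, 0.0], (1, 1): [0.0, 1.0]}
actual = {(0, 1): [0.9, 0.1], (1, 1): [1.0, 0.0]}
print(flag_abnormal_cells(predicted, actual))  # [(1, 1)]
```

Only cell (1, 1) is flagged, because its predicted and actual vectors are orthogonal, while cell (0, 1) has a similarity close to 1.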

[0039] Beneficial effects: Compared with the prior art, the technical solution of the present invention has the following beneficial technical effects:

[0040] (1) It can identify anomalous data in tables lacking explicit schema information. Since Web tables are constructed by users in an open and loosely defined environment without predefined standard schemas, the model must understand the semantic information within the table. However, traditional anomaly detection methods struggle to handle semantically diverse table data and require numerous constraints or external database support. This invention trains a semantic model to recognize cell semantics and column types in the table, enabling it to identify anomalous data in tables lacking standard schema information.

[0041] (2) It can improve the accuracy of Web table anomaly detection. The Web table anomaly detection method based on text semantic mapping relationship solves the problem of difficulty in identifying ambiguous semantic information in traditional anomaly detection methods. It combines cell semantics and column type identification, and uses a relation mapping-based method to infer whether the cell content is incorrect, effectively identifying semantically ambiguous text data.

Description of the Drawings

[0042] Figure 1 is a flowchart of the method for discovering abnormal data in Web tables based on text semantic mapping relationships;

[0043] Figure 2 is an example diagram of the method for discovering abnormal data in Web tables based on text semantic mapping relationships.

Detailed Implementation

[0044] The present invention will be further illustrated below in conjunction with the accompanying drawings and specific embodiments.

[0045] The objective of the present invention is to solve the problem of discovering abnormal data in Web tables based on text semantic mapping relationships. Since Web tables are constructed in an open and loose environment, there are often some incorrect or abnormal data. The Web tables processed by the present invention are semantic tables containing entities and the relationships between entities. The content of a certain cell represents an entity. For example, for the cell "Player M", it can be known whether it represents "Star Player M" in combination with the context. And the whole table represents the relationships between entities. For example, the team where "Player O" is located is "Team R". At the same time, the team of Player M in the table may be incorrectly filled as "Team L". The present invention discovers abnormal data in such Web tables, finds the abnormal data in the table and marks them.

[0046] The present invention constructs a model for discovering abnormal data in Web tables based on text semantic mapping relationships. The string text in the cell is converted into a semantic space vector through a semantic representation module, and then the column type inference module is used to infer and represent the column types of the table. The error discovery module combines the above modules to discover the error data in the table based on the mapping relationship. Therefore, the specific implementation steps of the present invention are as follows:

[0047] Step 1. Given the Web table data T to be processed, where T = {c_{i,j} | 0 ≤ i < R, 0 ≤ j < C}, R and C are the numbers of rows and columns of T, and c_{i,j} is the string text of a cell, each cell string consisting of one or more English words, c_{i,j} = (x_1, x_2, ..., x_n). Pre-train the semantic model M_SR on a table data set; during training, splice all string texts in the same row and in the same column as each cell to form the training set. Input a cell c_{i,j} of table T into the model M_SR and output its corresponding semantic vector v_{i,j}. Example: as shown in Figure 2, the table has three rows and two columns. The first column is the name of the star player, and the second column is the player's team. The team recorded for "Player M" is abnormal: the table gives "Team L", but it should be "Team B". The table T is input into the M_SR model, which outputs a semantic vector for the string text of each cell;

[0048] Step 2. Train the column type inference model M_CTI on a large amount of Web table data. In the string text semantic space, classify directly according to the existing columns, training M_CTI as a multi-class classifier. Input the table T, after processing by the semantic model M_SR, into the model M_CTI, and output the column type inference result H = {h_j | 0 ≤ j < C}. For example, input the star player column and the team column into M_CTI, where each cell in each column is a semantic vector produced by the M_SR model; one column type semantic vector is output for each of the two columns;

[0049] Step 3. Establish an error discovery model M_ED according to the mapping relationship between the core column and the column where the target cell is located. The inputs of the model are the cell semantic vector v_{i,j} obtained in Step 1 and the column type inference result H obtained in Step 2; the output is the predicted cell semantic vector. Calculate the cosine similarity between the predicted value and the actual value v_{i,j}; cells whose similarity is lower than the threshold are considered to contain abnormal data. For example, after obtaining the semantic vector of "Player M" and the column type vectors of its own column and of the team column, predict the semantic vector of the team of Player M, and calculate the cosine similarity between this prediction and the semantic vector of the recorded value "Team L"; if the similarity is below the threshold, the cell contains abnormal data.

[0050] Furthermore, in Step 1, the semantic model M_SR is pre-trained on a table data set; during training, all string texts in the same row and in the same column as each cell are concatenated to form the training set; a cell c_{i,j} of table T is input into M_SR, and its corresponding semantic vector v_{i,j} is output. The specific steps are as follows:

[0051] Step 101. Take any column j of table T to form a column cell data set {c_{i,j} | 0 ≤ i < R}_j. Randomly permute the sets generated from multiple columns to form a new ordered data set. Break up its cells and concatenate the cell texts to obtain the column data training set Set_C, as shown in the following formula:

[0052]

[0053] As shown in Figure 2, the first column and the second column are each processed to obtain a column data training set;

[0054] Step 102. Process the row cell data row by row according to the method described in Step 101 to obtain the row data training set Set_R; for example, processing the second row may yield the row data training sequence "Team L, Player M";

[0055] Step 103. Generate the training set Set_W = Set_C ∪ Set_R, where Set_C is the training set constructed by the column-based data collection method and Set_R is the training set constructed by the row-based data collection method;

[0056] Step 104. Train a Word2Vec model on the constructed training set to obtain a dictionary that maps text to semantic vectors. For the input cell content c_{i,j} = (x_1, x_2, ..., x_n), look up each word in the dictionary to obtain its semantic vector g(x_k) = v_k, and obtain the semantic representation vector of cell c_{i,j} by averaging. The calculation formula is as follows:

[0057] v_{i,j} = \frac{1}{n} \sum_{k=1}^{n} g(x_k)

[0058] In Step 2, the column type inference model M_CTI is trained as a multi-class classifier; the table T, after processing by the semantic model M_SR, is input into M_CTI, and the column type inference result H = {h_j | 0 ≤ j < C} is output. The specific steps are as follows:

[0059] Step 201. Let the cell data of a column of table T be C_j = {c_{i,j} | 0 ≤ i < R}_j. Traverse all the cell data in this column; if a cell's string text is not empty, process it with the semantic model M_SR, obtaining the semantic representation vector set C'_j = {v_{i,j} | 0 ≤ i < R}_j. For example, processing the cell data of the first column with the semantic model yields the set of semantic vectors corresponding to the three cells in that column;

[0060] Step 202. Randomly select l semantic vectors from C'_j and denote them as C''_j = {v_1, v_2, ..., v_l};

[0061] Step 203. Use the deep learning language model Transformer as the main body of the column type inference model M_CTI. Insert a [CLS] tag at the beginning of the data C''_j and a [SEP] tag at its end, and use the result, after word segmentation, as the input text;

[0062] Step 204. Feed the input text into the TokenEmbedding layer, and convert each word into a word embedding vector of the same dimension;

[0063] Step 205. Feed the input text into the PositionEmbedding layer and convert each word into a position embedding vector. Specifically, label the [CLS] tag position as E0, label the [SEP] tag position as E2, and label the corresponding position embedding vectors of all other inputs as E1.

[0064] Step 206. Sum the Token Embedding and Position Embedding vectors to form the input of the M_CTI model. Take the output vector at the [CLS] tag position as the object of further processing: feed it into a two-layer multilayer perceptron and then compute the normalized exponential function (Softmax function) of the result:

[0065]

[0066] O = Softmax(H_1 W_o) + b_o

[0067] Step 207. Formulate model training as a multi-class classification problem, using the rectified linear unit (ReLU) as the activation function and the cross-entropy loss function:

[0068]

[0069] where M is the number of classes, i.e., the total number of column types counted, y_{ic} is the classification label, and p_{ic} is the probability that the model outputs for the corresponding class;

[0070] Step 208. Obtain the hidden-layer result h_j ∈ R^{d_2} from the multilayer perceptron model, and use this value as the semantic representation vector of the data in this column;
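The input construction of Steps 202 to 206 can be sketched as follows. This is a simplified reading, not the patented implementation: it assumes that the sampled cell vectors already have the model dimension and act directly as token embeddings, and that the [CLS] and [SEP] embeddings and the three position-type embeddings E0, E1, E2 are given. All numeric values and the function names `sample_column` and `build_model_input` are hypothetical.

```python
import random

def sample_column(vectors, l, seed=0):
    """Step 202: randomly pick l semantic vectors from a column (all if fewer)."""
    rng = random.Random(seed)
    return rng.sample(vectors, min(l, len(vectors)))

def build_model_input(column_vectors, token_embedding, position_embedding):
    """Sum token and position-type embeddings for [CLS] v_1 ... v_l [SEP]."""
    tokens = ([token_embedding["[CLS]"]] + list(column_vectors)
              + [token_embedding["[SEP]"]])
    # Step 205: E0 for [CLS], E1 for every cell vector, E2 for [SEP].
    kinds = ["E0"] + ["E1"] * len(column_vectors) + ["E2"]
    return [[t + p for t, p in zip(token, position_embedding[kind])]
            for token, kind in zip(tokens, kinds)]

# Hypothetical 2-dimensional embeddings, chosen so the sums are easy to read.
special = {"[CLS]": [0.0, 0.0], "[SEP]": [9.0, 9.0]}
positions = {"E0": [10.0, 10.0], "E1": [20.0, 20.0], "E2": [30.0, 30.0]}
column = [[1.0, 1.0], [2.0, 2.0]]
model_input = build_model_input(column, special, positions)
print(model_input)
```

The resulting list has l + 2 rows, one per input position, and would be passed to the Transformer body, whose [CLS] output feeds the multilayer perceptron of Step 206.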

[0071] Furthermore, in Step 3, the error discovery model M_ED is established according to the mapping relationship between the core column and the column where the target cell is located. The specific steps are as follows:

[0072] Step 301. Define the leftmost column of the table as the core column, the leftmost entity in each row as the core entity, and the set of core entities in table T as {c_{i,0} | 0 ≤ i < R}; for example, in the table of Figure 2, the column containing "Player M" is the core column, and "Player M" is a core entity;

[0073] Step 302. In the table T = {c_{i,j} | 0 ≤ i < R, 0 ≤ j < C}, combine the semantic representation results {v_{i,j} | 0 ≤ i < R, 0 ≤ j < C} obtained in Step 1 and the column type inference results H = {h_j | 0 ≤ j < C} obtained in Step 2, and establish a mapping relationship h based on the core entity in each row:

[0074] h(h_0, h_1) ≈ h(v_{i,0}, h_{i,j})

[0075] Step 303. Establish the error discovery model M_ED with a long short-term memory (LSTM) neural network as the core model and the sequence-to-sequence (Seq2Seq) model as the overall framework. Model the mapping relationship between the column types of the two columns, and compute the mapping relationship between the cell entities within a row to obtain the result. The inputs of the model are the column type inference result H and the cell semantic vectors v_{i,j} of table T, and the output is the predicted cell semantic vector;

[0076] Step 304. In the encoding part (Encoder) of the M_ED model, the input is the column type inference vector H, and the two columns to be processed form the inputs {h_0, h_j} at different time steps. At each time step the model outputs the hidden layer vectors {x_1, x_2}, the information vector, and the previous step information. The final hidden layer vector x_2 is extracted as the input of the decoding part;

[0077] Step 305. In the decoding part (Decoder) of the M_ED model, the input is the pair of core entity and attribute entity c_{i,0} and c_{i,j}. The former constitutes the input sequence {v_{i,0}, x_<go>}, where x_<go> is a predefined flag vector. The output x_2 of the encoding part is used as the hidden layer input, and the model output is the information vector {o_1, o_2}. Let...

[0078] Step 306. Calculate the cosine similarity between the predicted value of the cell semantic vector and the actual value v_{i,j}. Cells whose similarity is below a set threshold are considered to contain abnormal data; for example, in Figure 2 the predicted semantic vector for the cell corresponds to "Team B", and its similarity to the actual value "Team L" is below the threshold, so the cell is considered abnormal data;

[0079] Step 307. The M_ED model outputs the cells of the table whose actual and predicted values do not match, indicating possible abnormal data in the table.
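The data flow of Steps 304 and 305 can be traced with a tiny, untrained LSTM encoder-decoder in plain Python. The sketch only shows which vectors enter at which time step; it makes several assumptions that the source does not confirm: column type vectors and cell vectors share one dimension, biases are omitted, the decoder starts from both the final hidden state and the final cell state of the encoder, and the second decoder output o_2 is taken as the predicted vector (as claim 1 states). `LSTMCell` and `predict_cell_vector` are hypothetical names.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LSTMCell:
    """A tiny LSTM cell with randomly initialised (untrained) weights, no biases."""
    def __init__(self, dim, seed):
        rng = random.Random(seed)
        self.dim = dim
        # One dim x 2*dim matrix per gate: input, forget, output, candidate.
        self.weights = {gate: [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)]
                               for _ in range(dim)] for gate in "ifog"}

    def step(self, x, hidden, cell):
        z = list(x) + list(hidden)   # concatenate input and previous hidden state
        pre = {gate: [sum(w * v for w, v in zip(row, z)) for row in rows]
               for gate, rows in self.weights.items()}
        i = [sigmoid(a) for a in pre["i"]]
        f = [sigmoid(a) for a in pre["f"]]
        o = [sigmoid(a) for a in pre["o"]]
        g = [math.tanh(a) for a in pre["g"]]
        new_cell = [f[k] * cell[k] + i[k] * g[k] for k in range(self.dim)]
        new_hidden = [o[k] * math.tanh(new_cell[k]) for k in range(self.dim)]
        return new_hidden, new_cell

def predict_cell_vector(h_core, h_target, v_core, go_vector, encoder, decoder):
    dim = len(h_core)
    hidden, cell = [0.0] * dim, [0.0] * dim
    for h in (h_core, h_target):      # encoder time steps: h_0, then h_j
        hidden, cell = encoder.step(h, hidden, cell)
    outputs = []
    for x in (v_core, go_vector):     # decoder time steps: v_{i,0}, then x_<go>
        hidden, cell = decoder.step(x, hidden, cell)
        outputs.append(hidden)
    return outputs[-1]                # o_2, used as the predicted cell vector

encoder, decoder = LSTMCell(2, seed=0), LSTMCell(2, seed=1)
prediction = predict_cell_vector([0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [1.0, 1.0],
                                 encoder, decoder)
print(prediction)
```

With trained weights, `prediction` would be compared with the actual vector v_{i,j} by cosine similarity, as described in Step 306.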

Claims

1. A method for detecting abnormal data in Web tables based on text semantic mapping relationships, characterized in that the method includes the following steps:
Step 1. Given the Web table data T to be processed, where T = {c_{i,j} | 0 ≤ i < R, 0 ≤ j < C}, R and C represent the number of rows and columns of the table data T respectively, c_{i,j} represents the string text of a cell, and each cell string text consists of one or more English words, c_{i,j} = (x_1, x_2, ..., x_n); pre-train the semantic model M_SR using the table data set, concatenating all string texts in the same row and column as the training set during the training process; input a cell c_{i,j} of the table data T into the model M_SR and output its corresponding semantic vector v_{i,j};
Step 2. Train the column type inference model M_CTI based on massive amounts of Web table data; within the semantic space of string text, classify directly according to the existing columns and train M_CTI using a multi-classification training method; input the table data T processed by the semantic model M_SR into the model M_CTI and output the column type inference result H = {h_j | 0 ≤ j < C};
Step 3. Establish an error detection model M_ED based on the mapping relationship between the core column and the column containing the target cell; the input of M_ED is the semantic vector v_{i,j} corresponding to the cell obtained in Step 1 and the column type inference result H obtained in Step 2, and the output is the predicted cell data semantic vector; the cosine similarity between the predicted cell data semantic vector and v_{i,j} is calculated, and cells with a similarity below a threshold are considered to contain abnormal data;
Step 301. Define the leftmost column of the table as the core column, and the leftmost entity in each row as the core entity {c_{i,0} | 0 ≤ i < R};
Step 302. In the table data T, combine the v_{i,j} obtained in Step 1 and the column type inference result H obtained in Step 2, and establish a mapping relationship based on the core entity in each row;
Step 303. Establish the error detection model M_ED using a long short-term memory artificial neural network as the core model and a sequence-to-sequence model as the overall framework; model the mapping relationship between the column types of the two columns, and calculate the mapping relationship between the cell entities in the row to obtain the result;
Step 304. In the encoding part of the M_ED model, the input is the column type inference result H, and the two columns of data to be processed form the inputs {h_0, h_j} at different time steps; the output consists of the hidden layer vector, information vector, and previous step information at each time step; the final hidden layer vector is extracted as the input to the decoding part;
Step 305. In the decoding part of the M_ED model, the input is the pair of core entity and attribute entity c_{i,0} and c_{i,j}; the former constitutes the input sequence {v_{i,0}, x_<go>}, where x_<go> is a predefined flag vector; the final hidden layer vector in the output of the encoding part is used as the hidden layer input, and the output is the information vector {o_1, o_2}; the predicted cell data semantic vector is set equal to o_2.

2. The method for detecting abnormal data in Web tables based on text semantic mapping relationships according to claim 1, characterized in that, in Step 1, the semantic model M_SR is pre-trained using a table data set; during training, all string texts in the same row and column of each cell are concatenated as the training set; a cell c_{i,j} of the table data T is input into the model M_SR, and its corresponding semantic vector v_{i,j} is output; the specific steps are as follows:
Step 101. Take any column j from the table data T to form a column cell data set {c_{i,j} | 0 ≤ i < R}_j; randomly permute the sets generated from multiple columns to form a new ordered data set; break up its cells and concatenate the cell texts to obtain the column data training set Set_C;
Step 102. Process the row cell data according to the method described in Step 101 to obtain the row data training set Set_R;
Step 103. Generate the training set Set_W = Set_C ∪ Set_R;
Step 104. Train the Word2Vec model using the constructed training set to obtain a dictionary mapping text to semantic vectors; input c_{i,j} = (x_1, x_2, ..., x_n), obtain through the mapping dictionary the semantic vector g(x_k) = v_k corresponding to each string, and obtain the semantic vector of c_{i,j} by averaging, calculated using the following formula: v_{i,j} = \frac{1}{n} \sum_{k=1}^{n} g(x_k).

3. The method for detecting abnormal data in Web tables based on text semantic mapping relationships according to claim 1, characterized in that, in Step 2, the column type inference model M_CTI is trained using a multi-classification training method; the table data T processed by the semantic model M_SR is input into the model M_CTI, and the column type inference result H = {h_j | 0 ≤ j < C} is output; the specific steps are as follows:
Step 201. The data in a column of the table data T is C_j = {c_{i,j} | 0 ≤ i < R}_j; iterate through all cells in the column; if the cell string text is not empty, process it with the semantic model M_SR to obtain the semantic vector set C'_j = {v_{i,j} | 0 ≤ i < R}_j corresponding to all cells in the column;
Step 202. Randomly select l semantic vectors from C'_j, denoted as C''_j = {v_1, v_2, ..., v_l};
Step 203. Use a deep learning language model as the main body of the column type inference model M_CTI; insert a [CLS] tag at the beginning of the data C''_j and a [SEP] tag at its end, and use the result as input text after word segmentation;
Step 204. Feed the input text into the word vector layer and convert each word into a word embedding vector of the same dimension;
Step 205. Feed the input text into the position vector layer and convert each word into a position embedding vector; specifically, label the [CLS] tag position as E0, label the [SEP] tag position as E2, and label the corresponding position embedding vectors of all other inputs as E1;
Step 206. Sum the vectors from the word vector layer and the position vector layer to form the input of the M_CTI model; take the output vector at the [CLS] tag position as the object of further processing, feed it into a two-layer multilayer perceptron, and then calculate the normalized exponential function of the result: O = Softmax(H_1 W_o) + b_o;
Step 207. Formulate model training as a multi-class classification problem, using a rectified linear unit as the activation function and a cross-entropy loss function;
Step 208. Obtain the hidden-layer result h_j from the multilayer perceptron model.