Method for automated determination of open source license compatibility and program product thereof

By extracting and fusing multi-dimensional features, the problem of misjudgment in open-source software license compatibility detection in existing technologies has been solved, and high-precision attitude recognition of custom licenses has been achieved, improving the accuracy and interpretability of detection.

CN122241657A · Pending · Publication Date: 2026-06-19 · ZHEJIANG UNIV +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHEJIANG UNIV
Filing Date
2026-03-04
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies, when detecting open-source software license compatibility, rely on a single semantic embedding vector, which cannot cover the actual attitude of the license text data towards the various standard clauses. They lack generalization ability and are prone to misjudgment, especially in handling incompatibility issues with custom licenses.

Method used

By extracting and fusing multi-dimensional features, including textual semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features, graph attention neural networks and multi-head attention mechanisms are used to generate license attribute features that cover deep associations. Combined with a pre-trained recognition model, the attitude type of license text data towards each standard clause is identified.

Benefits of technology

It improves the accuracy of attitude recognition for various standard clauses in open source licenses, especially for new or custom licenses that have not been seen before, and enables interpretable and verifiable license compatibility testing.



Abstract

This application discloses an automated method and program product for determining open-source license compatibility, applied in the field of computer software technology. The method includes: acquiring license text data; extracting features from the license text data to obtain text semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features; fusing the text semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features to obtain license attribute features; and performing recognition processing based on the license attribute features to obtain the attitude type of the license text data towards each standard clause; wherein the attitude type includes any one of mandatory type, permitted type, or prohibited type. This application can effectively improve the accuracy of open-source license attitude recognition towards each standard clause, especially for new or custom licenses that have not been seen before.

Description

Technical Field

[0001] This application relates to the field of computer software technology, and in particular to an automated method for determining open-source license compatibility and its program product.

Background Technology

[0002] Open source software (OSS) license incompatibility refers to conflicts between multiple licenses in the same project. Two licenses are incompatible if no license for the newly defined software can integrate both without conflicting rights or obligations.

[0003] In related technologies, sentence-level and paragraph-level semantic embedding vectors are extracted from license text data, and the semantic embedding vectors are input into a pre-trained neural network model to obtain the attitude type of the license text data towards each standard clause, thereby realizing license compatibility detection.

[0004] However, the related technologies rely solely on semantic embedding vectors when detecting license compatibility. Because license texts are expressed in diverse ways, and custom licenses are especially flexible, this reliance on a single feature often fails to cover the actual attitude of the license text data towards the various standard clauses, lacks generalization ability, and is prone to misjudgment.

Summary of the Invention

[0005] This application provides an automated method and program product for determining open source license compatibility, which improves the accuracy of identifying the attitude of open source licenses towards various standard clauses.

[0006] In one aspect, this application provides an open-source license compatibility determination method, including the following steps:

[0007] Obtain license text data;

[0008] Feature extraction is performed on the license text data to obtain text semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features. The text semantic features indicate the semantics of the license text data; the clause attribute features indicate the content in the license text data corresponding to each standard clause; the local attitude tendency features indicate keywords in the license text data associated with attitude tendency; and the global attitude tendency features indicate the similarity between the license text data and multiple standard license texts.

[0009] The text semantic features, the clause attribute features, the local attitude tendency features, and the global attitude tendency features are fused to obtain the license attribute features;

[0010] Based on the license attribute features, an identification process is performed to obtain the attitude type of the license text data towards each of the standard clauses; wherein, the attitude type includes any one of the mandatory type, permitted type, or prohibited type.

[0011] Further, in one embodiment, performing feature extraction on the license text data to obtain text semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features includes:

[0012] The license text data is segmented and encoded to obtain the text semantic features;

[0013] The license text data is subjected to sequence labeling, mapping, and graph modeling to obtain the clause attribute features;

[0014] Keyword extraction is performed on the license text data to obtain the local attitude tendency features;

[0015] The global attitude tendency feature is obtained by comparing the license text data with multiple standard license texts.

[0016] Furthermore, in one embodiment, the step of segmenting and encoding the license text data to obtain the text semantic features includes:

[0017] The license text data is segmented to obtain input identifiers and attention masks;

[0018] The input identifier and the attention mask are encoded to obtain the text semantic features.

[0019] Further, in one embodiment, the clause attribute features include clause embedding features and clause association features; the step of performing sequence labeling, mapping, and graph modeling processing on the license text data to obtain the clause attribute features includes:

[0020] The license text data is subjected to sequence labeling and length compensation processing to obtain a clause number sequence; wherein, the clause number sequence includes the content in the license text data corresponding to each standard clause;

[0021] The clause number sequence is mapped to a high-dimensional vector to obtain the clause embedding feature;

[0022] Using the content in the license text data corresponding to each standard clause as nodes, and the correlation between the content in the license text data corresponding to each standard clause as edges, the clause embedding features are processed using a graph attention neural network to obtain the clause association features.
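The graph step above can be illustrated with a minimal single-head graph attention layer. This is a sketch under stated assumptions, not the application's implementation: the class name `GraphAttentionLayer`, the dense adjacency matrix, and the feature sizes are hypothetical, and the application does not say how the "correlation" edges are computed, so the adjacency here is supplied by the caller.

```python
import torch
from torch import nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Single-head graph attention over clause nodes (illustrative only)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: [nodes, in_dim] clause embedding features, one row per clause node.
        # adj: [nodes, nodes], non-zero where an edge exists. Self-loops are
        # required so that every row has at least one neighbour.
        h = self.proj(x)
        n = h.size(0)
        # pairs[i, j] = concat(h[i], h[j])
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)],
            dim=-1,
        )
        scores = F.leaky_relu(self.attn(pairs).squeeze(-1), negative_slope=0.2)
        # Nodes without an edge receive zero attention weight.
        scores = scores.masked_fill(adj == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        return weights @ h


# Hypothetical usage: 23 clause nodes, one assumed edge between clauses 3 and 10.
layer = GraphAttentionLayer(in_dim=16, out_dim=8)
x = torch.randn(23, 16)
adj = torch.eye(23)
adj[3, 10] = adj[10, 3] = 1
clause_association = layer(x, adj)  # [23, 8]
```

A node connected only to itself keeps its own projected embedding, while connected nodes receive an attention-weighted mix of their neighbours.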

[0023] Furthermore, in one embodiment, the keyword extraction processing of the license text data to obtain the local attitude tendency features includes:

[0024] From the license text data, extract keywords related to the attitude of the terms as target keywords;

[0025] The frequency and position of the target keywords in the license text data are determined as the local attitude tendency features.
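As a rough illustration of the keyword step, the sketch below counts attitude-related keywords and records their character offsets. The keyword list and the function name `keyword_features` are assumptions made for the example; the application does not enumerate its keywords or specify how positions are encoded.

```python
import re

# Hypothetical keyword set; the application does not list its target keywords.
ATTITUDE_KEYWORDS = ("must not", "may not", "must", "may", "shall")

# Longest alternatives first, so "must not" is not counted as a bare "must".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(ATTITUDE_KEYWORDS, key=len, reverse=True))
    + r")\b"
)


def keyword_features(text: str) -> dict:
    """Return the frequency and character positions of each attitude keyword."""
    features = {kw: {"count": 0, "positions": []} for kw in ATTITUDE_KEYWORDS}
    for match in _PATTERN.finditer(text.lower()):
        entry = features[match.group(1)]
        entry["count"] += 1
        entry["positions"].append(match.start())
    return features


sample = "You must not misrepresent the origin. You may use it. Altered versions must be marked."
feats = keyword_features(sample)
```

Matching multi-word keywords first keeps prohibitions ("must not") separate from obligations ("must"), which matters when the counts feed an attitude classifier.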

[0026] Further, in one embodiment, the standard license text includes a first standard text and a second standard text; the step of comparing the license text data with multiple standard license texts to obtain the global attitude tendency feature includes:

[0027] The text similarity between the license text data and the first standard text is used as the first similarity.

[0028] The text similarity between the license text data and the second standard text is used as the second similarity.

[0029] The mean of the first similarity and the second similarity is determined as the global attitude tendency feature.
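The averaging described above can be sketched as follows. The application does not name a text similarity measure, so this example assumes Python's `difflib.SequenceMatcher` ratio as a stand-in; the function name `global_attitude_feature` is likewise hypothetical.

```python
from difflib import SequenceMatcher
from statistics import mean


def global_attitude_feature(license_text: str, standard_texts: list) -> float:
    """Mean similarity between a license and a set of standard license texts."""
    similarities = [
        # Assumed similarity measure: character-level matching ratio in [0, 1].
        SequenceMatcher(None, license_text, reference).ratio()
        for reference in standard_texts
    ]
    return mean(similarities)


# Toy inputs: identical to the first reference, disjoint from the second.
score = global_attitude_feature("abc", ["abc", "xyz"])
```

In practice the references would be full standard license texts, and an embedding-based similarity could replace the character-level ratio without changing the averaging step.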

[0030] Further, in one embodiment, the clause attribute features include clause embedding features and clause association features; fusing the text semantic features, the clause attribute features, the local attitude tendency features, and the global attitude tendency features to obtain the license attribute features includes:

[0031] Using the text semantic features as the query and the clause embedding features as the key and value, a multi-head attention mechanism is used to fuse the text semantic features and the clause embedding features to obtain a first fused feature;

[0032] Using the text semantic features as the query and the clause association features as the key and value, a multi-head attention mechanism is used to fuse the text semantic features and the clause association features to obtain a second fused feature;

[0033] The license attribute feature is obtained by concatenating the first fused feature, the second fused feature, the clause attribute feature, the local attitude tendency feature, and the global attitude tendency feature, and then performing average pooling over the time dimension.
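The fusion steps above might be sketched in PyTorch as follows. Several details are assumptions rather than disclosures: the tensor shapes, the hidden size, the number of heads, the class name `FeatureFusion`, and in particular the choice to broadcast the local and global features along the time dimension so that they can be concatenated with the sequence features.

```python
import torch
from torch import nn


class FeatureFusion(nn.Module):
    """Cross-attention fusion followed by concatenation and mean pooling."""

    def __init__(self, hidden: int = 64, heads: int = 4):
        super().__init__()
        self.attn_embed = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.attn_assoc = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, text, clause_embed, clause_assoc, local_feat, global_feat):
        # text, clause_embed, clause_assoc: [batch, seq, hidden]
        # local_feat: [batch, k]; global_feat: [batch, 1]
        # Text semantic features act as the query; clause features as key and value.
        fused1, _ = self.attn_embed(query=text, key=clause_embed, value=clause_embed)
        fused2, _ = self.attn_assoc(query=text, key=clause_assoc, value=clause_assoc)
        seq_len = text.size(1)
        # Assumption: repeat the per-license features at every time step.
        local_seq = local_feat.unsqueeze(1).expand(-1, seq_len, -1)
        global_seq = global_feat.unsqueeze(1).expand(-1, seq_len, -1)
        stacked = torch.cat(
            [fused1, fused2, clause_embed, clause_assoc, local_seq, global_seq],
            dim=-1,
        )
        # Average pooling over the time (sequence) dimension.
        return stacked.mean(dim=1)


batch, seq, hidden, k = 2, 10, 64, 5
fusion = FeatureFusion(hidden=hidden, heads=4)
license_features = fusion(
    torch.randn(batch, seq, hidden),
    torch.randn(batch, seq, hidden),
    torch.randn(batch, seq, hidden),
    torch.randn(batch, k),
    torch.randn(batch, 1),
)
```

Under these assumptions the pooled license attribute feature has width `4 * hidden + k + 1` per license.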

[0034] In another aspect, embodiments of this application provide an open-source license compatibility determination device, including:

[0035] The acquisition module is used to obtain license text data;

[0036] The first processing module is used to extract features from the license text data to obtain text semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features; wherein, the text semantic features are used to indicate the semantics of the license text data, the clause attribute features are used to indicate the content in the license text data corresponding to each standard clause, the local attitude tendency features are used to indicate keywords in the license text data associated with attitude tendency, and the global attitude tendency features are used to indicate the similarity between the license text data and multiple standard license texts;

[0037] The second processing module is used to fuse the text semantic features, the clause attribute features, the local attitude tendency features and the global attitude tendency features to obtain the license attribute features.

[0038] The third processing module is used to perform identification processing based on the license attribute features to obtain the attitude type of the license text data towards each of the standard clauses; wherein the attitude type includes any one of mandatory type, permitted type or prohibited type.

[0039] In another aspect, embodiments of this application provide an electronic device, including:

[0040] At least one processor;

[0041] At least one memory for storing at least one program;

[0042] The above-described open-source license compatibility determination method is implemented when at least one of the programs is executed by at least one of the processors.

[0043] In another aspect, embodiments of this application provide a computer program product, including a computer program, characterized in that the computer program, when executed by a processor, implements the above-described open-source license compatibility determination method.

[0044] According to an embodiment of this application, an automated open-source license compatibility determination method and its program product are provided. The method obtains license text data and extracts features from it to obtain text semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features. The text semantic features indicate the semantics of the license text data; the clause attribute features indicate the content in the license text data corresponding to each standard clause; the local attitude tendency features indicate keywords in the license text data associated with attitude tendency; and the global attitude tendency features indicate the similarity between the license text data and multiple standard license texts. These four features are fused to obtain license attribute features, and recognition processing is performed based on the license attribute features to obtain the attitude type of the license text data towards each standard clause, where the attitude type includes any one of mandatory type, permitted type, or prohibited type. This application thus extracts and fuses features from multiple dimensions, mining shallow features from four complementary dimensions: semantic understanding, clause characteristics, local attitude tendency, and global attitude tendency. Through feature interaction, license attribute features covering deep associations are generated, thereby identifying the must / allow / prohibit attitude of license text data towards each standard clause. This can effectively improve the accuracy of open source license attitude identification for each standard clause, especially for new or custom licenses that have not been seen before.

Attached Figure Description

[0045] Figure 1 is a flowchart of an open-source license compatibility determination method provided in this application;

[0046] Figure 2 is a schematic diagram illustrating the principle of feature extraction provided in this application;

[0047] Figure 3 is a flowchart of the feature fusion provided in this application;

[0048] Figure 4 is a structural diagram of an open-source license compatibility determination device provided in this application;

[0049] Figure 5 is a structural diagram of an electronic device provided in this application.

Detailed Implementation

[0050] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0051] The present application will be further described below with reference to the accompanying drawings and specific embodiments. The described embodiments should not be considered as limitations on the present application, and all other embodiments obtained by those skilled in the art without inventive effort are within the scope of protection of the present application.

[0052] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.

[0053] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0054] Open source software (OSS) refers to software whose source code is publicly available and allows users to use, modify, and distribute it. OSS licenses specify the conditions that must be followed when reusing, distributing, and modifying the software. In addition to widely used standard licenses (such as the MIT license), developers can also customize their own licenses, often referred to as custom licenses, which are more flexible in their description. The existence of custom licenses presents challenges in understanding licenses and their compatibility, especially license incompatibility. License incompatibility refers to conflicts between multiple licenses within the same project. Two licenses are incompatible if there is no license in the newly defined software that can integrate two licenses without conflicting rights / obligations.

[0055] Currently, most research focuses on checking compatibility between common, known OSS licenses. German et al. developed a license compatibility identification model and conducted case studies on how to resolve license incompatibilities across different software systems. Subsequent research proposed methods for understanding and checking license compatibility between Fedora Linux distributions, Android applications, and Java applications. Wolter et al. investigated license inconsistencies in GitHub repositories and found that many existing repositories do not fully declare all licenses in their source code. However, these studies only address known OSS licenses and cannot handle custom license incompatibilities.

[0056] Other studies have explored the use of deep learning (DL) methods for fine-grained analysis of license terms in order to detect incompatibilities between arbitrary licenses. It should be understood that a license term is a formalized, standardized description of a condition of software use, such as commercial use, while a license term entity is a specific expression of a license term in real-world license text, such as "sales" or "offering for sale".

[0057] For example, Xu et al. proposed LiDetector, a method based on Natural Language Processing (NLP) to interpret any OSS license and detect compatibility. It categorizes license terms into 23 classes, as shown in Table 1, where each term represents a possible action a user might take. To better understand these terms and facilitate incompatibility detection, Xu et al. divided the license terms into two main categories: rights and obligations. The method identifies the 23 predefined license terms (i.e., standard terms) in the license text and determines the attitude type towards each identified term. Attitude types fall into three categories: MUST, CAN, and CANNOT, thereby achieving license compatibility detection.

[0058] Table 1: Comparison of Standard Clauses for 23 Types of Open Source Licenses

[0059]

[0060] Accordingly, related technologies extract sentence-level and paragraph-level semantic embedding vectors from license text data and input these vectors into a pre-trained neural network model to obtain the attitude type of the license text data towards each standard clause, thereby achieving license compatibility detection. However, these technologies rely solely on semantic embedding vectors when detecting license compatibility. Due to the diverse textual expressions of licenses, especially the high flexibility of custom licenses, this reliance on a single feature often fails to encompass the actual attitude of the license text data towards each standard clause, lacks generalization ability, and is prone to misjudgment.

[0061] In response to this, this application proposes an automated open-source license compatibility determination method and its program product, aiming to effectively improve the accuracy of open-source license attitude identification for various standard clauses.

[0062] Referring to Figure 1, which is a flowchart of an open-source license compatibility determination method provided in this application, the method includes the following steps S101-S104.

[0063] S101, Obtain license text data.

[0064] S102, feature extraction is performed on the license text data to obtain text semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features. Among them, text semantic features are used to indicate the semantics of the license text data, clause attribute features are used to indicate the content in the license text data corresponding to each standard clause, local attitude tendency features are used to indicate the keywords in the license text data associated with attitude tendency, and global attitude tendency features are used to indicate the similarity between the license text data and multiple standard license texts.

[0065] In this step, feature extraction is performed on the license text data. Specifically, semantic features at the sentence, term, and structural levels are extracted from the license text data to capture contextual semantics, yielding the text semantic features. The segments of the license text data that correspond to each standard clause are marked, and features are extracted from them to obtain the clause attribute features. Keywords associated with attitude tendencies are identified in the license text data, and features are extracted from them to obtain the local attitude tendency features. The license text data is compared with multiple standard license texts to measure its similarity to them, yielding the global attitude tendency features. In this way, shallow features associated with the rights and obligations attitudes of license clauses can be initially mined, providing data support for subsequent fusion from multiple perspectives: semantic understanding, clause characteristics, local attitude tendencies, and global attitude tendencies. It should be understood that local attitude tendencies are obtained solely from the license text data without relying on external data, while global attitude tendencies are obtained by combining the license text data with external data (such as standard license texts).

[0066] S103, the text semantic features, clause attribute features, local attitude tendency features and global attitude tendency features are fused to obtain the license attribute features.

[0067] In this step, textual semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features are fused to achieve feature interaction and fusion between different shallow features. Based on this, we can further mine the deep features related to the rights and obligations attitudes of the license clauses, thereby obtaining license attribute features and providing accurate feature information for subsequent model recognition.

[0068] S104, recognition processing is performed based on the license attribute features to obtain the attitude type of the license text data towards each standard clause; wherein, the attitude type includes any one of the mandatory type, permitted type or prohibited type.

[0069] In this step, the license attribute features are input into a pre-trained recognition model to perform parallel recognition of attitudes towards 23 clauses, obtaining the attitude type of the license text data towards each standard clause. Thus, by performing recognition operations based on deep features, interpretable, verifiable, and industry-consensus-compliant license compatibility test results can be obtained.

[0070] The pre-trained recognition model refers to the algorithm model trained using multiple preset samples and their corresponding label information. The preset samples are license attribute feature samples obtained by feature extraction and feature fusion from license text samples. Feature extraction is similar to step S102, and feature fusion is similar to step S103. The corresponding label information refers to the attitude type of the license text sample towards each standard clause. It should be understood that the type of algorithm model can be flexibly set according to the actual situation. For example, the algorithm model can be a traditional machine learning model such as Support Vector Machine (SVM) or Random Forest (RF), or a deep learning model such as Convolutional Neural Network (CNN) or Transformer.

[0071] In one example, the recognition model is a Multilayer Perceptron (MLP). The output of the MLP (sentence-level representation vectors) is reshaped into a tensor of shape [batch_size, 23, 3], where batch_size is the number of samples in a batch, and the tensor holds the logits (raw output values) of each sample in the batch for each of the 23 clauses over the three attitude categories. The training strategy is as follows: the loss function is cross-entropy loss, computed against the attitude labels in the original dataset, and clauses not mentioned in the labels are masked so that only clauses actually present participate in the loss calculation; the optimizer is AdamW with a learning rate of 5e-5; gradient accumulation is used to simulate the effect of large-batch training; mixed precision training is supported to accelerate training and save GPU memory; a custom TrainerCallback is implemented to print the loss values in real time during training for easy monitoring; and training ends when the maximum number of training iterations is reached. In practical applications, the forward propagation of the multilayer perceptron yields three-dimensional logits (raw output values) for each of the 23 clauses. argmax (the index of the maximum value) is then applied to the logits of each clause to obtain the attitude type (MUST / CANNOT / CAN). The output is a list of predicted attitudes for the 23 clauses corresponding to the input license text.
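The reshaping, masked loss, and argmax decoding described above can be sketched as follows. The ignore label `-100`, the function names, and the flat `[batch_size, 69]` head output are assumptions chosen for illustration; the application does not state how unmentioned clauses are encoded during training or how class indices map to MUST / CANNOT / CAN.

```python
import math

import torch
import torch.nn.functional as F

NUM_CLAUSES, NUM_ATTITUDES = 23, 3
IGNORE = -100  # assumed label for clauses the license does not mention


def masked_attitude_loss(flat_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over clauses, skipping clauses labelled IGNORE."""
    logits = flat_logits.view(-1, NUM_CLAUSES, NUM_ATTITUDES)
    return F.cross_entropy(
        logits.reshape(-1, NUM_ATTITUDES),
        labels.reshape(-1),
        ignore_index=IGNORE,
    )


def predict_attitudes(flat_logits: torch.Tensor) -> torch.Tensor:
    """Return the index of the highest-scoring attitude for each clause."""
    logits = flat_logits.view(-1, NUM_CLAUSES, NUM_ATTITUDES)
    return logits.argmax(dim=-1)


# Toy batch of two licenses; only clause 0 of the first license is labelled.
flat_logits = torch.zeros(2, NUM_CLAUSES * NUM_ATTITUDES)
labels = torch.full((2, NUM_CLAUSES), IGNORE, dtype=torch.long)
labels[0, 0] = 1
loss = masked_attitude_loss(flat_logits, labels)
predictions = predict_attitudes(flat_logits)  # [2, 23]
```

With all-zero logits the three classes are equally likely, so the loss over the single labelled clause equals ln 3, which is a convenient sanity check that the mask is working.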

[0072] In one example, the license text is:

[0073] “This software is provided 'as-is', without any express or implied warranty.

[0074] In no event will the authors be held liable for any damages arising from the use of this software.

[0075] Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

[0076] 1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.

[0077] 2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.

[0078] 3. This notice may not be removed or altered from any source distribution.”

[0079] The method described in this application yields the following identification results: [1, 1, 1, 2, 0, 0, 0, 0, 1, 0, 3, 0, 2, 0, 1, 0, 2, 0, 0, 0, 2, 0, 0], representing the license's attitude towards each of the 23 clauses. Here, 1 represents CAN, 2 represents CANNOT, 3 represents MUST, and 0 represents that the license does not address that clause. For example, the license text's attitude towards the fourth clause (Hold Liable) is 2 (CANNOT), indicating that the software provider assumes no further responsibility.
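To make the result vector easier to read, the following sketch decodes it using the code meanings given above. The zero-based clause indexing (so that the fourth clause has index 3) and the names `ATTITUDE_NAMES` and `decode` are assumptions made for the example.

```python
from collections import Counter

# Code meanings as stated in the text; 0 means the clause is not addressed.
ATTITUDE_NAMES = {0: "NOT_ADDRESSED", 1: "CAN", 2: "CANNOT", 3: "MUST"}

result = [1, 1, 1, 2, 0, 0, 0, 0, 1, 0, 3, 0, 2, 0, 1, 0, 2, 0, 0, 0, 2, 0, 0]


def decode(codes: list) -> dict:
    """Map each zero-based clause index to its attitude name."""
    return {clause_id: ATTITUDE_NAMES[code] for clause_id, code in enumerate(codes)}


decoded = decode(result)
summary = Counter(decoded.values())
```

For this license the summary shows five CAN, four CANNOT, and one MUST attitude, with the remaining thirteen clauses not addressed.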

[0080] Therefore, this embodiment of the application extracts and fuses features from multiple dimensions, mining shallow features from four complementary dimensions: semantic understanding, clause characteristics, local attitude tendency, and global attitude tendency. Through feature interaction, license attribute features covering deep associations are generated, thereby identifying the must / allow / prohibit attitude of license text data towards each standard clause. This can effectively improve the accuracy of open source license attitude identification towards each standard clause, especially for new or custom licenses that have not been seen before.

[0081] In some optional embodiments, after step S101, the method further includes:

[0082] Perform data cleaning on the license text data;

[0083] Standardize the license text data;

[0084] Named entity tagging is applied to the license text data.

[0085] Here, to ensure data quality, preprocessing operations are performed, which may include at least one of the following: cleaning and removing punctuation marks from the license text data; standardizing the license text data; and using the method previously studied by Xu et al. to perform named entity tagging on the license text data, where if the license text involves one of the 23 license clauses, it is tagged with the clause ID (0-22), and non-clause parts are tagged with 23.
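A minimal cleaning and standardization pass might look like the sketch below. The specific choices (lowercasing, replacing ASCII punctuation with spaces, collapsing whitespace) and the function name `clean_license_text` are assumptions; the application does not define what standardization involves, and the named entity tagging step is not shown.

```python
import re
import string

# Replace each ASCII punctuation mark with a space so that hyphenated words
# such as "as-is" split into separate tokens instead of fusing together.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_license_text(text: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", no_punct).strip()


cleaned = clean_license_text("This software is provided 'as-is',\n without any warranty.")
```

Whether punctuation should be removed before tokenization depends on the downstream tokenizer, so a real pipeline would want to validate this choice against the encoder it uses.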

[0086] In some alternative embodiments, referring to Figure 2, the above step S102 includes the following steps S201-S204.

[0087] S201, perform word segmentation and encoding on the license text data to obtain text semantic features.

[0088] In this step, the license text data is segmented and encoded to extract semantic features at the sentence, term, and structural levels to capture contextual semantics and obtain textual semantic features.

[0089] Furthermore, in some optional embodiments, step S201 above includes:

[0090] The license text data is segmented to obtain the input identifier and attention mask;

[0091] The input identifier and attention mask are encoded to obtain the text semantic features.

[0092] First, a pre-trained tokenizer is used to segment the license text data, generating the input_ids and attention_mask sequences. Then, a base model is constructed and pre-trained specifically on legal text, resulting in a pre-trained base model. This allows the pre-trained base model to accurately understand the specialized legal terminology in the license, avoiding the semantic biases common to general-purpose models. The type of base model can be flexibly set according to the actual situation; for example, it could be nlpaueb/legal-bert-base-uncased, but it is not limited to this. Finally, the input identifiers and attention masks are input into the pre-trained base model for encoding, obtaining context-aware sequence representations, i.e., the text semantic features. This captures the contextual semantic information and long-distance dependencies in the license text data and locates, from a semantic understanding perspective, shallow features related to the rights and obligations attitudes of the license clauses, providing semantic-level data for subsequent compatibility detection.
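To show how input identifiers and attention masks relate, here is a toy example that uses a whitespace tokenizer and a tiny vocabulary in place of the pre-trained tokenizer. Everything in it (the vocabulary, the reserved IDs, the function names) is hypothetical; a real pipeline would obtain both sequences from the tokenizer that ships with the chosen base model.

```python
PAD_ID, UNK_ID = 0, 1  # reserved IDs in this toy vocabulary


def build_vocab(corpus: list) -> dict:
    """Assign an integer ID to every whitespace-separated token in the corpus."""
    vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID}
    for text in corpus:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text: str, vocab: dict, max_length: int):
    """Return (input_ids, attention_mask), padded or truncated to max_length."""
    ids = [vocab.get(token, UNK_ID) for token in text.lower().split()][:max_length]
    mask = [1] * len(ids)  # 1 marks a real token
    pad = max_length - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad  # 0 marks padding


vocab = build_vocab(["permission is granted"])
input_ids, attention_mask = encode("Permission is revoked", vocab, max_length=5)
```

The attention mask tells the encoder which positions hold real tokens, so padded positions do not influence the context-aware representations.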

[0093] S202, sequence labeling, mapping and graph modeling are performed on the license text data to obtain the clause attribute features.

[0094] In this step, sequence labeling, mapping, and graph modeling are performed on the license text data. The aim is to mark the corresponding fragment of each standard clause in the license text data and capture the correlation between different fragments, thereby forming clause attribute features.

[0095] Furthermore, in some optional embodiments, the aforementioned clause attribute features include clause embedding features and clause association features; the aforementioned step S202 includes:

[0096] The license text data is subjected to sequence labeling and length compensation to obtain the clause number sequence; the clause number sequence includes the content in the license text data corresponding to each standard clause;

[0097] The clause number sequence is mapped to a high-dimensional vector to obtain the clause embedding features;

[0098] Using the content in the license text data corresponding to each standard clause as nodes, and the correlation between the content in the license text data corresponding to each standard clause as edges, a graph attention neural network is used to process the clause embedding features to obtain clause association features.

[0099] Here, the clause characteristic dimension can be divided into two levels: the internal characteristics of clause content and the inter-clause characteristics. At the internal characteristics level, sequence labeling is performed on the license text data to automatically identify non-standard license text with disordered numbering, scattered paragraphs, and uneven lengths and map it to the standard clauses, resulting in a clause number (ID) sequence. This sequence includes the fragments of the license text data corresponding to each standard clause, arranged according to clause number. For example, the license text data might include the fragment "In no event will the authors be held liable for any damages arising from the use of this software.", which corresponds to clause number 3 (Hold Liable: holding authors liable for subsequent consequences) out of 23 standard clauses. It should be understood that if a standard clause has no corresponding fragment in the text, it is marked as zero. Subsequently, an independent embedding layer (nn.Embedding) maps each item in the clause number sequence into a high-dimensional vector, obtaining the clause embedding features. These features include the embedding vectors of the fragments corresponding to each standard clause in the license text data, arranged according to clause number, and indicate the characteristics of each such fragment. In this way, shallow features related to the rights and obligations of the license clauses can be captured at the level of the internal characteristics of the clauses themselves. This provides data support for subsequent compatibility testing at that level, helping the model to focus on the clauses that require attention. In addition, the embedding vector format provides a structured basis for subsequent processing.

[0100] Optionally, to ensure the uniformity of feature dimensions and improve the accuracy of the clause number sequence, a length compensation process is performed on the clause number sequence to obtain the compensated clause number sequence as the final clause number sequence. In this length compensation process, the clause number sequence is padded or truncated to a fixed length, and 24 (i.e., the total number of clauses + 1) is used as the padding parameter padding_idx for padding to ensure that the padded part does not participate in gradient updates, thus obtaining the compensated clause number sequence.
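
The length compensation and the embedding lookup described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the applicant's implementation: the fixed length of 5, the embedding size of 4, the right-side padding, and the helper names compensate_length, build_embedding_table, and embed are all hypothetical. The all-zero row at index 24 imitates the behavior of padding_idx in PyTorch's nn.Embedding, where the padding row does not receive gradient updates.

```python
import random

NUM_CLAUSES = 23               # number of standard clauses, per the description
PADDING_IDX = NUM_CLAUSES + 1  # 24, the padding value named in the description


def compensate_length(clause_ids: list, fixed_len: int) -> list:
    """Truncate or right-pad a clause number sequence to fixed_len."""
    clipped = clause_ids[:fixed_len]
    return clipped + [PADDING_IDX] * (fixed_len - len(clipped))


def build_embedding_table(dim: int, seed: int = 0) -> list:
    """One row per ID 0..24; the padding row is fixed to all zeros."""
    rng = random.Random(seed)
    table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
             for _ in range(PADDING_IDX + 1)]
    table[PADDING_IDX] = [0.0] * dim
    return table


def embed(clause_ids: list, table: list) -> list:
    """Look up one embedding vector per clause number."""
    return [table[i] for i in clause_ids]


padded = compensate_length([3, 0, 7], fixed_len=5)               # short: padded with 24
truncated = compensate_length([1, 2, 3, 4, 5, 6], fixed_len=5)   # long: cut to 5
vectors = embed(padded, build_embedding_table(dim=4))
print(padded, truncated)
```

Note that the table needs 25 rows (IDs 0 to 24) so that the padding value itself is a valid index.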

[0101] At the level of inter-clause characteristics, the content in the license text data corresponding to each standard clause (i.e., the embedding vectors in the clause embedding features) is used as nodes, and the relationships between the content in the license text data corresponding to each standard clause are used as edges. A Graph Attention Network (GAT) is used to model and process the clause embedding features, allowing each node to aggregate information from its neighboring nodes, thereby generating feature information containing strong dependencies, i.e., clause association features. In this way, shallow features related to the rights and obligations of the license clauses can be captured at the level of inter-clause characteristics. This provides data support at the level of inter-clause characteristics for subsequent compatibility testing, helping the model to focus not only on a single fragment of content during compatibility testing, but also on the relationships between different fragments of content and the importance of different relationships for compatibility testing.
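
The neighbor aggregation performed by a graph attention layer can be illustrated with a single-head layer written in plain Python. This is a minimal sketch of the standard GAT update (shared linear projection, LeakyReLU-scored attention over neighbors, softmax, weighted sum), not the network used in this application. The adjacency NEIGHBORS, the weights W and a, and the three example node vectors are hypothetical values; the description does not state how edges are constructed.

```python
import math


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(H, neighbors, W, a):
    """Single-head graph attention; returns (updated features, attention weights).

    H: node feature vectors; neighbors: node -> list of neighbor indices
    (self included); W: shared projection; a: attention vector over [Wh_i || Wh_j].
    """
    Z = [matvec(W, h) for h in H]
    out, attn = [], []
    for i in range(len(H)):
        nbrs = neighbors[i]
        scores = [leaky_relu(sum(p * q for p, q in zip(a, Z[i] + Z[j]))) for j in nbrs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]   # numerically stable softmax
        alphas = [e / sum(exps) for e in exps]
        out.append([sum(al * Z[j][k] for al, j in zip(alphas, nbrs))
                    for k in range(len(Z[i]))])
        attn.append(alphas)
    return out, attn


# Hypothetical example: three clause nodes with 2-dimensional embeddings.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
NEIGHBORS = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
W = [[1.0, 0.5], [0.0, 1.0]]
a = [0.3, -0.2, 0.5, 0.1]
clause_assoc, attn = gat_layer(H, NEIGHBORS, W, a)
```

Each row of attn sums to 1, so every updated node vector is a weighted average of its projected neighbors; the learned weights decide which inter-clause relationships matter more.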

[0102] S203, keyword extraction processing is performed on the license text data to obtain local attitude tendency features.

[0103] In this step, keyword extraction is performed on the license text data. Keywords associated with attitude tendencies are identified from the license text data, and feature extraction is performed accordingly to obtain local attitude tendency features.

[0104] Furthermore, in some optional embodiments, step S203 above includes:

[0105] Keywords related to the attitude of the clauses are extracted from the license text data as target keywords;

[0106] The frequency and location of target keywords in the license text data are determined as keyword features.

[0107] First, keywords related to the attitude of the license clauses, such as "must", "shall", "may", "prohibit", and "liability", are extracted from the license text data to obtain the target keywords. Then, the frequency of occurrence of each target keyword in the license text data is counted and its positions are located. The obtained frequencies and positions are determined as keyword features. This effectively locates the core signals of local attitude tendency, capturing shallow features related to the rights and obligations of the license clauses at the local attitude tendency level. This provides data support for subsequent compatibility testing at the local attitude tendency level, helping to improve the model's sensitivity to legal terminology during compatibility testing.
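
A minimal sketch of the frequency and position statistics described above is given below. The keyword list is taken from the examples in the description; the word-level tokenization, the use of token indices as positions, and the function name keyword_features are assumptions made for illustration.

```python
import re

# Keywords named in the description; a practical list would likely be longer.
TARGET_KEYWORDS = ("must", "shall", "may", "prohibit", "liability")


def keyword_features(text: str) -> dict:
    """Map each target keyword to its frequency and token positions in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    features = {}
    for kw in TARGET_KEYWORDS:
        positions = [i for i, tok in enumerate(tokens) if tok == kw]
        features[kw] = {"frequency": len(positions), "positions": positions}
    return features


sample = "You may copy the software. You must keep this notice. You may not sublicense."
feats = keyword_features(sample)
print(feats["may"], feats["must"])
```

This sketch matches whole words only, so an inflected form such as "prohibited" would not be counted under "prohibit"; stemming or pattern matching would be needed to cover such variants.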

[0108] S204, the license text data is compared with multiple standard license texts to obtain the global attitude tendency features.

[0109] In this step, by comparing the license text data with multiple standard license texts, the similarity between the license text data and each standard license text can be determined, thereby obtaining the global attitude tendency features.

[0110] Furthermore, in some optional embodiments, the aforementioned standard license text includes a first standard text and a second standard text; step S204 includes:

[0111] The text similarity between the license text data and the first standard text is used as the first similarity.

[0112] The text similarity between the license text data and the second standard text is used as the second similarity.

[0113] The mean of the first and second similarities is determined as the global attitude tendency feature.

[0114] First, different standard license texts are obtained: a first standard text and a second standard text. These two standard texts can be flexibly set according to the actual situation; for example, the first standard text could be MIT, and the second standard text could be GPL, but it is not limited to these. Then, the Sentence-BERT model is used to calculate the text similarity between the license text data and the first standard text as the first similarity, and the text similarity between the license text data and the second standard text as the second similarity. Finally, the mean of the first and second similarities is determined as the global attitude tendency feature. This feature indicates the similarity between the license text data and the standard license texts in terms of attitude tendency. The higher the similarity, the higher the consistency between the clauses of the license text data and those of the standard license texts; conversely, the lower the similarity, the lower the consistency. In this way, the similarity between the license text data and each of the two standard license texts is determined separately, and the two results are averaged to obtain a more accurate similarity result. This avoids misjudgment of tendency caused by bias in the choice of a single standard and accurately captures the shallow features of the global attitude tendency of the license text data, providing a data basis at the global attitude tendency level for subsequent compatibility testing.
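
The averaging of the two similarities can be sketched as follows. To keep the example self-contained, Sentence-BERT is replaced by a simple bag-of-words cosine similarity, which is a plainly weaker stand-in and not the model named above; the function names bow_cosine and global_attitude_feature and the short example strings are hypothetical.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors (stand-in for Sentence-BERT)."""
    ca = Counter(re.findall(r"[a-z]+", a.lower()))
    cb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def global_attitude_feature(license_text, first_standard, second_standard,
                            similarity=bow_cosine):
    """Mean of the similarities to the first and second standard texts."""
    first = similarity(license_text, first_standard)
    second = similarity(license_text, second_standard)
    return (first + second) / 2.0


# Shortened placeholder strings, not full license texts.
custom = "permission is granted to use and copy this software"
first_text = "permission is hereby granted free of charge to use copy and modify the software"
second_text = "you may copy and distribute the program under the terms of this license"
feature = global_attitude_feature(custom, first_text, second_text)
print(round(feature, 3))
```

Passing the similarity function as a parameter lets a sentence-embedding model be substituted without changing the averaging step.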

[0115] In some optional embodiments, referring to Figure 3, the above step S103 includes the following steps S301-S303.

[0116] S301, using the textual semantic features as the query and the clause embedding features as the key and value, a multi-head attention mechanism is used to fuse the textual semantic features and the clause embedding features to obtain the first fused feature.

[0117] In this step, textual semantic features are used as queries, and clause embedding features are used as keys and values. A multi-head attention mechanism is employed to fuse the textual semantic features and clause embedding features, resulting in the first fused feature. This first fused feature represents a fusion of the semantic understanding level and the level of the internal characteristics of clause content, thereby achieving feature interaction and fusion between the internal characteristics dimension of the clause content and the semantic understanding dimension. In this process, the clause embedding features provide the textual semantic features with information about the internal characteristics of the clause content, while the textual semantic features provide the clause embedding features with contextual information. This allows the feature information from the two dimensions to mutually perceive the information in each other that is related to the rights and obligations of the license clauses, thus better uncovering the deep features hidden in the data of each dimension that are associated with the rights and obligations of the license clauses.

[0118] S302, using the textual semantic features as the query and the clause association features as the key and value, a multi-head attention mechanism is used to fuse the textual semantic features and the clause association features to obtain a second fused feature.

[0119] In this step, textual semantic features are used as queries, and clause association features are used as keys and values. A multi-head attention mechanism is employed to fuse the textual semantic features and clause association features, resulting in a second fused feature. This second fused feature represents a fusion of the semantic understanding level and the inter-clause characteristic level, thereby achieving feature interaction and fusion between the inter-clause characteristic dimension and the semantic understanding dimension. In this process, the clause association features provide the textual semantic features with information about the inter-clause characteristics, while the textual semantic features provide the clause association features with contextual information. This allows the feature information of the two dimensions to mutually perceive the information in each other that is related to the rights and obligations attitudes of the license clauses, thus better uncovering the deep features hidden in the data of each dimension that are related to the rights and obligations attitudes of the license clauses.
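
The query/key/value arrangement used in steps S301 and S302 can be illustrated with a small multi-head cross-attention written in plain Python. This is a simplified sketch: the learned query, key, value, and output projections of a real multi-head attention layer are omitted, and the feature size of 4, the two heads, and all example values are hypothetical.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values, num_heads):
    """Multi-head scaled dot-product attention without learned projections.

    queries: [T_q][d]; keys and values: [T_k][d]; d must divide by num_heads.
    Returns one fused vector of size d per query position.
    """
    d = len(queries[0])
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    hd = d // num_heads
    fused = []
    for q in queries:
        row = []
        for h in range(num_heads):
            lo, hi = h * hd, (h + 1) * hd          # slice handled by this head
            scores = [sum(x * y for x, y in zip(q[lo:hi], k[lo:hi])) / math.sqrt(hd)
                      for k in keys]
            weights = softmax(scores)
            row += [sum(w * v[lo + j] for w, v in zip(weights, values))
                    for j in range(hd)]
        fused.append(row)
    return fused


# Hypothetical features: 2 text positions, 3 clause nodes, feature size 4.
text_semantic = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]]
clause_embedding = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
clause_association = [[0.2, 0.8, 0.1, 0.9], [0.7, 0.3, 0.6, 0.4], [0.5, 0.5, 0.5, 0.5]]

first_fused = cross_attention(text_semantic, clause_embedding, clause_embedding, 2)
second_fused = cross_attention(text_semantic, clause_association, clause_association, 2)
```

Because the textual semantic features act as the query, each fused feature has one vector per text position, which is what allows the later pooling over the time dimension.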

[0120] Optionally, the data objects to be processed can be normalized before multi-head attention processing.

[0121] S303, the first fused feature, the second fused feature, the clause attribute features, the local attitude tendency features, and the global attitude tendency features are concatenated and average-pooled over the time dimension to obtain the license attribute features.

[0122] In this step, the first fused feature, the second fused feature, the clause attribute feature, the local attitude tendency feature, and the global attitude tendency feature are concatenated to obtain a concatenated feature sequence. Then, average pooling is performed on the concatenated feature sequence along the time dimension to obtain a sentence-level representation vector, i.e., the license attribute feature. Thus, a more robust license attribute feature representation is constructed through the complementary integration of multi-dimensional features, providing a more reliable data foundation for subsequent compatibility testing.
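
The concatenation and average pooling can be sketched as below. The description does not state along which axis the features are concatenated or how features without a time dimension are handled, so this sketch makes two explicit assumptions: sequence-shaped features share the same number of time steps and are joined along the feature axis, and features without a time axis (for example, keyword statistics or the similarity mean) are repeated at every step. The function name concat_and_pool and all values are hypothetical.

```python
def concat_and_pool(sequence_features, global_features):
    """Concatenate per-step features, tile global ones, then mean-pool over time.

    sequence_features: list of [T][d_i] features sharing the same T.
    global_features: list of flat vectors repeated at every step (an assumption).
    """
    T = len(sequence_features[0])
    assert all(len(f) == T for f in sequence_features), "sequences must share T"
    tiled = [x for g in global_features for x in g]
    steps = []
    for t in range(T):
        step = []
        for f in sequence_features:
            step.extend(f[t])            # join along the feature axis
        steps.append(step + tiled)
    width = len(steps[0])
    # Average pooling along the time dimension gives one vector per license.
    return [sum(s[k] for s in steps) / T for k in range(width)]


first_fused = [[1.0, 2.0], [3.0, 4.0]]    # T = 2 time steps
second_fused = [[0.0, 1.0], [2.0, 3.0]]
keyword_vector = [2.0, 1.0]               # e.g. keyword frequencies
similarity_mean = [0.6]                   # global attitude tendency feature
license_attribute = concat_and_pool([first_fused, second_fused],
                                    [keyword_vector, similarity_mean])
print(license_attribute)
```

Clause attribute features that have a time axis would be passed in sequence_features in the same way as the two fused features.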

[0123] Furthermore, referring to Figure 4, this application also provides an open-source license compatibility determination device, including:

[0124] The acquisition module 401 is used to obtain license text data;

[0125] The first processing module 402 is used to extract features from the license text data to obtain text semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features; wherein the text semantic features are used to indicate the semantics of the license text data, the clause attribute features are used to indicate the content in the license text data corresponding to each standard clause, the local attitude tendency features are used to indicate the keywords in the license text data that are associated with the attitude tendency, and the global attitude tendency features are used to indicate the similarity between the license text data and multiple standard license texts;

[0126] The second processing module 403 is used to fuse the text semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features to obtain license attribute features;

[0127] The third processing module 404 is used to perform identification processing based on the license attribute features to obtain the attitude type of the license text data towards each standard clause; wherein the attitude type includes any one of the mandatory type, permitted type or prohibited type.

[0128] The content of the above method embodiments is applicable to the device embodiments. The specific functions implemented by the device embodiments are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.

[0129] In addition, referring to Figure 5, this application provides an electronic device, including:

[0130] At least one processor 501;

[0131] At least one memory 502 is used to store at least one program;

[0132] When at least one program is executed by at least one processor 501, the at least one processor 501 implements the above-described open-source license compatibility determination method.

[0133] The aforementioned memory 502, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory 502 may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 502 may optionally include memories remotely located relative to processor 501, and these remote memories can be connected to processor 501 via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0134] The aforementioned memory 502 can be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 502 can store the operating system and other application programs. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 502 and is called and executed by the processor 501.

[0135] The processor 501 described above can be implemented using a general-purpose central processing unit (CPU), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application.

[0136] In some embodiments, the above-described device may further include:

[0137] An input/output interface, used to implement information input and output;

[0138] A communication interface, used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB or Ethernet cable) or wireless means (such as mobile network, Wi-Fi, or Bluetooth);

[0139] A bus, used to transmit information between the components of the device (such as processor 501, memory 502, the input/output interface, and the communication interface);

[0140] The processor 501, memory 502, input / output interface, and communication interface can communicate with each other within the device via a bus.

[0141] The content of the above method embodiments is applicable to this device embodiment. The specific functions implemented in this device embodiment are the same as those in the above method embodiments, and the beneficial effects achieved are also the same as those achieved in the above method embodiments.

[0142] Finally, this application provides a computer program product, including a computer program, characterized in that the computer program, when executed by a processor, implements the above-described open-source license compatibility determination method.

[0143] The content of the above method embodiments is applicable to the embodiments of this program product. The specific functions implemented by the embodiments of this program product are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.

[0144] Although embodiments of this application have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of this application, the scope of which is defined by the claims and their equivalents.

[0145] The above is a detailed description of the preferred embodiments of this application, but this application is not limited to the embodiments described. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of this application, and these equivalent modifications or substitutions are all included within the scope defined by the claims of this application.

Claims

1. A method for determining open-source license compatibility, characterized in that it includes the following steps: obtaining license text data; performing feature extraction on the license text data to obtain text semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features, wherein the text semantic features indicate the semantics of the license text data, the clause attribute features indicate the content in the license text data corresponding to each standard clause, the local attitude tendency features indicate keywords in the license text data associated with attitude tendency, and the global attitude tendency features indicate the similarity between the license text data and multiple standard license texts; fusing the text semantic features, the clause attribute features, the local attitude tendency features, and the global attitude tendency features to obtain license attribute features; and performing identification processing based on the license attribute features to obtain the attitude type of the license text data towards each of the standard clauses; wherein the attitude type includes any one of a mandatory type, a permitted type, or a prohibited type.

2. The method according to claim 1, characterized in that performing feature extraction on the license text data to obtain the text semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features includes: segmenting and encoding the license text data to obtain the text semantic features; performing sequence labeling, mapping, and graph modeling on the license text data to obtain the clause attribute features; performing keyword extraction on the license text data to obtain the local attitude tendency features; and comparing the license text data with multiple standard license texts to obtain the global attitude tendency features.

3. The method according to claim 2, characterized in that segmenting and encoding the license text data to obtain the text semantic features includes: segmenting the license text data to obtain input identifiers and attention masks; and encoding the input identifiers and the attention masks to obtain the text semantic features.

4. The method according to claim 2, characterized in that the clause attribute features include clause embedding features and clause association features, and performing sequence labeling, mapping, and graph modeling on the license text data to obtain the clause attribute features includes: performing sequence labeling and length compensation processing on the license text data to obtain a clause number sequence, wherein the clause number sequence includes the content in the license text data corresponding to each standard clause; mapping the clause number sequence to high-dimensional vectors to obtain the clause embedding features; and, using the content in the license text data corresponding to each standard clause as nodes and the correlation between the content in the license text data corresponding to each standard clause as edges, processing the clause embedding features with a graph attention neural network to obtain the clause association features.

5. The method according to claim 2, characterized in that performing keyword extraction on the license text data to obtain the local attitude tendency features includes: extracting, from the license text data, keywords related to the attitude of the clauses as target keywords; and determining the frequency and position of the target keywords in the license text data as keyword features.

6. The method according to claim 2, characterized in that the standard license texts include a first standard text and a second standard text, and comparing the license text data with multiple standard license texts to obtain the global attitude tendency features includes: using the text similarity between the license text data and the first standard text as a first similarity; using the text similarity between the license text data and the second standard text as a second similarity; and determining the mean of the first similarity and the second similarity as the global attitude tendency feature.

7. The method according to claim 1, characterized in that the clause attribute features include clause embedding features and clause association features, and fusing the text semantic features, the clause attribute features, the local attitude tendency features, and the global attitude tendency features to obtain the license attribute features includes: using the text semantic features as the query and the clause embedding features as the key and value, fusing the text semantic features and the clause embedding features with a multi-head attention mechanism to obtain a first fused feature; using the text semantic features as the query and the clause association features as the key and value, fusing the text semantic features and the clause association features with a multi-head attention mechanism to obtain a second fused feature; and concatenating the first fused feature, the second fused feature, the clause attribute features, the local attitude tendency features, and the global attitude tendency features and performing average pooling over the time dimension to obtain the license attribute features.

8. An open-source license compatibility determination device, characterized in that it includes: an acquisition module, used to obtain license text data; a first processing module, used to extract features from the license text data to obtain text semantic features, clause attribute features, local attitude tendency features, and global attitude tendency features, wherein the text semantic features are used to indicate the semantics of the license text data, the clause attribute features are used to indicate the content in the license text data corresponding to each standard clause, the local attitude tendency features are used to indicate keywords in the license text data associated with attitude tendency, and the global attitude tendency features are used to indicate the similarity between the license text data and multiple standard license texts; a second processing module, used to fuse the text semantic features, the clause attribute features, the local attitude tendency features, and the global attitude tendency features to obtain license attribute features; and a third processing module, used to perform identification processing based on the license attribute features to obtain the attitude type of the license text data towards each of the standard clauses; wherein the attitude type includes any one of a mandatory type, a permitted type, or a prohibited type.

9. An electronic device, characterized in that it includes: at least one processor; and at least one memory for storing at least one program; wherein, when the at least one program is executed by the at least one processor, the open-source license compatibility determination method as described in any one of claims 1-7 is implemented.

10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the open-source license compatibility determination method as described in any one of claims 1-7.