Method and apparatus for recommending harmonized system code for goods on basis of large language model
The tariff code recommendation system built using a large language model automates the processing of customs commodity tariff code declarations, solving the inefficiencies of manual review and traditional rule engines. It achieves efficient and accurate tariff code recommendation and anomaly detection, thus maintaining fair trade and tax order.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- NUCTECH JIANGSU CO LTD
- Filing Date
- 2025-06-27
- Publication Date
- 2026-07-02
Abstract
Description
A method and apparatus for recommending tariff codes for goods based on a large language model.
[0001] Cross-references to related applications
[0002] This application claims priority to Chinese Patent Application No. 202411943189.8, filed on December 26, 2024, entitled “Method and Apparatus for Recommending Tariff Numbers of Goods Based on a Large Language Model”, the entire contents of which are incorporated herein by reference.
Technical Field
[0003] This disclosure relates to the field of large language model technology, and more specifically, to a method and apparatus for recommending tariff codes for goods based on large language models.
Background Technology
[0004] In international trade, customs tariff codes cover a variety of categories and subcategories. When import and export companies declare tariff codes, they often make mistakes, omissions, or intentional misrepresentations. Abnormal tariff code declarations can lead to the loss of customs revenue, disrupt market order, and even involve illegal activities such as smuggling and tax evasion.
[0005] Currently, customs review of commodity tariff codes relies mainly on manual review and rule engines, and the declarations under review are diverse and complex. Manual review is time-consuming, labor-intensive, inefficient, and of limited effectiveness. Traditional rule engines, meanwhile, use predefined rule sets to detect abnormal tariff code declarations. These rules are defined manually from experience, so they cannot cover complex abnormal situations, and the rule sets require continuous updating and maintenance to adapt to new trade patterns.
[0006] To ensure fair trade and effective taxation, customs authorities need an efficient and rapid method to obtain correct tariff codes for goods and to correct any irregularities in tariff code declarations. Therefore, a new method and apparatus for recommending tariff codes for goods is required.
Summary of the Invention
[0007] Embodiments of this application provide a method and apparatus for recommending commodity tariff codes based on a large language model, which can recommend the correct commodity tariff codes based on user queries, or determine whether the declared commodity tariff codes are abnormal based on user declaration information.
[0008] According to a first aspect of this application, a method for recommending tariff codes for goods based on a large language model is provided, comprising: receiving goods information input by a user; inputting the goods information into a tariff code recommendation large language model to obtain an initial recommended tariff code for the goods; inputting query rewriting information based on the goods information into a query rewriting large language model to obtain a goods name and goods specifications; inputting the goods name and goods specifications into a hybrid retrieval module to obtain a tariff code hybrid retrieval result; and inputting the initial recommended tariff code and the tariff code hybrid retrieval result into a decision fusion module to obtain a final recommended tariff code for the goods.
[0009] According to one embodiment of the first aspect of this application, the product information is a user question text about the product.
[0010] According to an embodiment of the first aspect of this application, inputting the product information into the tariff code recommendation large language model to obtain an initial recommended tariff code for the product includes: determining, by the tariff code recommendation large language model, whether the product description in the user question text is complete; and if the product description is complete, outputting, by the tariff code recommendation large language model, the initial recommended tariff code for the product based on the product description.
[0011] According to an embodiment of the first aspect of this application, inputting the product information into the tariff code recommendation large language model to obtain an initial recommended tariff code for the product further includes: if the product description is incomplete, the tariff code recommendation large language model engages in one or more rounds of dialogue with the user to supplement the product description until the tariff code recommendation large language model determines that the product description is complete; and the tariff code recommendation large language model outputs the initial recommended tariff code for the product based on the product description; wherein, in each round of dialogue, the tariff code recommendation large language model outputs a prompt question, and the user inputs additional user question text in response to the prompt question.
[0012] According to an embodiment of the first aspect of this application, inputting query rewriting information based on the product information into a query rewriting large language model to obtain the product name and product specifications includes: inputting the user question text into the query rewriting large language model to obtain the product name and product specifications.
[0013] According to an embodiment of the first aspect of this application, inputting query rewriting information based on the product information into a query rewriting large language model to obtain the product name and product specifications includes: inputting the user question text, along with one or more prompt questions and one or more additional user question texts from one or more rounds of dialogue, into the query rewriting large language model to obtain the product name and product specifications.
[0014] According to an embodiment of the first aspect of this application, the commodity information is the declared commodity name, declared commodity specifications, and declared commodity tariff code filled in by the enterprise when declaring the commodity.
[0015] According to an embodiment of the first aspect of this application, inputting the commodity information into the tariff code recommendation language model to obtain the initial recommended tariff code of the commodity includes: inputting the declared commodity name and the declared commodity specifications into the tariff code recommendation language model to obtain the initial recommended tariff code of the commodity.
[0016] According to an embodiment of the first aspect of this application, inputting query rewriting information based on the product information into a query rewriting large language model to obtain the product name and product specifications includes: inputting the declared product name and the declared product specifications into the query rewriting large language model to obtain the product name and the product specifications.
[0017] According to an embodiment of the first aspect of this application, the method further includes: comparing the declared commodity tariff code with the final recommended tariff code; if the declared commodity tariff code matches the final recommended tariff code, returning "no error" to the user; and if the declared commodity tariff code does not match the final recommended tariff code, returning "an error" to the user.
[0018] According to an embodiment of the first aspect of this application, inputting the product name and product specifications into a hybrid retrieval module to obtain a tariff code hybrid retrieval result includes: the hybrid retrieval module vectorizing the product name and product specifications to obtain a vectorized product name and vectorized product specifications; the hybrid retrieval module performing a similarity calculation on the vectorized product name based on a tariff code vector database to obtain the top K entries in the tariff code vector database that are most similar to the vectorized product name; and the hybrid retrieval module performing a similarity calculation on the vectorized product specifications based on the top K entries to obtain the first entry among the top K entries that is most similar to the vectorized product specifications; wherein the tariff code hybrid retrieval result includes the product name, product specifications, and product tariff code in the first entry, as well as the calculated similarity value.
[0019] According to an embodiment of the first aspect of this application, when the product name is empty, the hybrid retrieval module returns empty and the tariff code hybrid retrieval result is empty; and when the product name is not empty and the product specification is empty, K = 1.
[0020] According to an embodiment of the first aspect of this application, inputting the initial recommended tariff code and the tariff code hybrid retrieval result into a decision fusion module to obtain the final recommended tariff code of the product includes: when the tariff code hybrid retrieval result is not empty and the similarity value is greater than a first threshold, the decision fusion module outputs the product tariff code as the final recommended tariff code of the product; and when the tariff code hybrid retrieval result is empty or the similarity value is less than or equal to the first threshold, the decision fusion module outputs the initial recommended tariff code as the final recommended tariff code of the product.
[0021] According to an embodiment of the first aspect of this application, the tariff code recommendation large language model is constructed as follows: a general-purpose large language model is selected as the base model; the general-purpose large language model is trained a first time using a tariff code pre-training dataset, wherein the training method is full-parameter fine-tuning; the difference curve of the loss function during the first training is monitored, and the first training is stopped when the difference curve converges; the general-purpose large language model is trained a second time using a tariff code fine-tuning dataset, wherein the training method is partial-parameter fine-tuning; the difference curve of the loss function during the second training is monitored, and the second training is stopped when the difference curve converges; and the general-purpose large language model is trained a third time using a tariff code preference optimization dataset to obtain the tariff code recommendation large language model.
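As an illustration of the stopping rule only, the following minimal sketch treats the "difference curve" as the sequence of absolute differences between successive loss values and declares convergence once the last few differences all fall below a tolerance. This reading, the function name `has_converged`, and the `window` and `tol` parameters are assumptions made for the example; the application does not specify how convergence is measured.

```python
from typing import Sequence


def has_converged(losses: Sequence[float], window: int = 3, tol: float = 1e-3) -> bool:
    """Return True when the last `window` loss differences are all below `tol`.

    Assumption: "difference curve" means |loss[i+1] - loss[i]| over training.
    """
    if len(losses) < window + 1:
        return False  # not enough points to form `window` differences
    recent = losses[-(window + 1):]
    diffs = [abs(b - a) for a, b in zip(recent, recent[1:])]
    return max(diffs) < tol


# Hypothetical loss history recorded once per evaluation step.
history = [1.0, 0.5, 0.3, 0.2999, 0.2998, 0.2997]
stop_training = has_converged(history)
```

A training loop would call such a check after each evaluation step and end the current training stage when it returns true.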
[0022] According to an embodiment of the first aspect of this application, the tariff code pre-training dataset includes the Customs Import and Export Tariff of the People's Republic of China, historical declaration data of commodity tariff codes, and Internet data; the tariff code fine-tuning dataset includes a predefined tariff code multi-turn question-and-answer dataset; and the tariff code preference optimization dataset includes a predefined tariff code preference question-and-answer dataset.
[0023] According to an embodiment of the first aspect of this application, the query rewriting large language model is constructed in the following manner: dividing the query rewriting dataset into a training set and a test set; selecting multiple general-purpose large language models as base models; training the multiple general-purpose large language models respectively using the training set, wherein the training method is partial parameter fine-tuning; for each general-purpose large language model, monitoring the difference curve of the loss function during training, and stopping training when the difference curve converges; testing each trained general-purpose large language model using the test set, and selecting the trained general-purpose large language model with the highest test score as the query rewriting large language model.
[0024] According to an embodiment of the first aspect of this application, the query rewriting dataset includes a predefined tariff code query rewriting multi-turn question-and-answer dataset.
[0025] According to an embodiment of the first aspect of this application, the tariff code vector database is constructed by: cleaning historical declaration data of commodity tariff codes to obtain a database including commodity names, cleaned commodity specifications, and commodity tariff codes; and vectorizing the commodity names and cleaned commodity specifications respectively to obtain the tariff code vector database including commodity names, cleaned commodity specifications, vectorized commodity names, vectorized cleaned commodity specifications, and commodity tariff codes.
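The construction of the tariff code vector database can be sketched as follows. The cleaning rule (trim, lowercase, collapse whitespace), the de-duplication step, and the character-bigram `embed` function are all assumptions standing in for whatever cleaning procedure and text vector representation model an implementation would use; the field names are likewise hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def clean_spec(text: str) -> str:
    """Hypothetical cleaning rule: trim, lowercase, collapse whitespace."""
    return " ".join(text.lower().split())


def embed(text: str) -> Counter:
    """Toy stand-in for a text embedding model: character-bigram counts."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def build_vector_db(records: Iterable[Dict[str, str]]) -> List[Dict]:
    """Clean historical declaration records and attach name/spec vectors."""
    db, seen = [], set()
    for r in records:
        name = r["name"].strip()
        spec = clean_spec(r["spec"])
        code = r["code"].strip()
        key = (name, spec, code)
        if not name or not code or key in seen:
            continue  # assumption: drop unusable rows and exact duplicates
        seen.add(key)
        db.append({
            "name": name,
            "spec": spec,
            "code": code,
            "name_vec": embed(name),
            "spec_vec": embed(spec),
        })
    return db
```

Each resulting entry carries the five fields named above: the commodity name, the cleaned specification, both vectors, and the tariff code.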
[0026] According to a second aspect of this application, a tariff code recommendation device for goods based on a large language model is provided, comprising: a processor and a memory storing instructions that, when executed by the processor, cause the processor to perform the method of the first aspect of this application.
[0027] According to a third aspect of this application, a computer-readable storage medium is provided having instructions stored thereon that, when executed by a computer, cause the computer to perform the method of the first aspect of this application.
[0028] The method and apparatus for recommending tariff codes for goods based on a large language model according to embodiments of this application obtain a tariff code recommendation large language model by training a general-purpose large language model using a tariff code pre-training dataset, a tariff code fine-tuning dataset, and a tariff code preference optimization dataset; obtain a query rewriting large language model by training multiple general-purpose large language models using a query rewriting dataset; and obtain a tariff code vector database by cleaning and vectorizing historical declaration data of commodity tariff codes. Further, by receiving product information input by a user and inputting the product information into the tariff code recommendation large language model, an initial recommended tariff code for the product can be obtained. Inputting query rewriting information based on the product information into the query rewriting large language model yields the product name and product specifications, and inputting the product name and product specifications into the hybrid retrieval module yields a tariff code hybrid retrieval result. Finally, inputting the initial recommended tariff code and the tariff code hybrid retrieval result into the decision fusion module yields the final recommended tariff code for the product. Furthermore, when the product information consists of the declared product name, specifications, and tariff code provided by the enterprise during the declaration process, the declared tariff code can be compared with the final recommended tariff code to determine whether there are any anomalies. Therefore, the method can recommend the correct tariff code based on user queries or determine whether the declared tariff code is abnormal based on the user's declaration information. This provides customs authorities with more effective tariff code recommendation and anomaly detection tools, thereby maintaining fair trade and the stability of tax collection.
Attached Figure Description
[0029] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the embodiments of this application will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on the drawings without creative effort.
[0030] Figure 1 is a flowchart of a tariff code recommendation method for goods based on a large language model according to an embodiment of this application;
[0031] Figure 2 is an architecture diagram of tariff code recommendation for goods according to an embodiment of this application;
[0032] Figure 3 is a flowchart of a hybrid retrieval module according to an embodiment of this application;
[0033] Figure 4 is a flowchart of a decision fusion module according to an embodiment of this application;
[0034] Figure 5 is a flowchart of a method for constructing a tariff code recommendation large language model according to an embodiment of this application;
[0035] Figure 6 is a flowchart of a method for constructing a query rewriting large language model according to an embodiment of this application;
[0036] Figure 7 is a flowchart of a method for constructing a tariff code vector database according to an embodiment of this application; and
[0037] Figure 8 is a schematic diagram of the hardware structure of a tariff code recommendation device for goods based on a large language model according to an embodiment of this application.
Detailed Implementation
[0038] The features and exemplary embodiments of various aspects of this application will now be described in detail. To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are intended only to explain this application, not to limit it. For those skilled in the art, this application can be implemented without some of these specific details. The following description of the embodiments is provided merely to give a better understanding of this application by illustrating examples of it.
[0039] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes said element.
[0040] Furthermore, the features, structures, or characteristics described below may be combined in any suitable manner in one or more embodiments.
[0041] Large language models have been a focus of research for scholars in recent years. They are artificial intelligence models based on deep learning technology that solve various natural language processing tasks through dialogue, such as text generation, summary extraction, question answering systems, and so on.
[0042] Embodiments of this application provide a method and apparatus for recommending commodity tariff codes based on a large language model, which can recommend the correct commodity tariff codes based on user queries, or determine whether the declared commodity tariff codes are abnormal based on user declaration information.
[0043] Figure 1 is a flowchart of a commodity tariff code recommendation method based on a large language model according to an embodiment of the present application; Figure 2 is an architecture diagram of commodity tariff code recommendation according to an embodiment of the present application; Figure 3 is a flowchart of a hybrid retrieval module according to an embodiment of the present application; and Figure 4 is a flowchart of a decision fusion module according to an embodiment of the present application.
[0044] As shown in Figure 1, the tariff code recommendation method for goods based on a large language model according to an embodiment of this application includes the following steps S110 to S150.
[0045] S110: Receive product information input by the user.
[0046] In one embodiment, the product information can be user question text about the product. For example, the user question text could be "What is the tariff code for walnuts?", "What is the tariff code for shelled walnuts?", "What is the tariff code for durian?", "What is the tariff code for frozen durian?", and so on.
[0047] Alternatively, in one embodiment, the commodity information may be the declared commodity name, declared commodity specifications, and declared commodity tariff code filled in by the enterprise when declaring the commodity. For example, the commodity information may be "walnut" (declared commodity name), "shelled" (declared commodity specification), and "0802320000" (declared commodity tariff code) filled in by the enterprise when declaring the commodity.
[0048] S120: Input the product information into the tariff code recommendation large language model to obtain the initial recommended tariff code for the product.
[0049] In one embodiment, when the product information is user question text about the product, step S120 may include: determining, by the tariff code recommendation large language model, whether the product description in the user question text is complete; and if the product description is complete, outputting, by the tariff code recommendation large language model, an initial recommended tariff code for the product based on the product description.
[0050] For example, if the user question text is "What is the tariff code for shelled walnuts?", the tariff code recommendation large language model determines that the product description in the user question text is complete. Specifically, because the user question text contains the product name "walnuts" and the product specification "shelled", the tariff code recommendation large language model can output an initial recommended tariff code for the product "walnuts" based on that name and specification, such as "0802320000". In one embodiment, the initial recommended tariff code is a ten-digit tariff code.
[0051] In one embodiment, when the product information is user question text about the product, step S120 may further include: if the product description is incomplete, the tariff code recommendation large language model engages in one or more rounds of dialogue with the user to supplement the product description until the tariff code recommendation large language model determines that the product description is complete; and the tariff code recommendation large language model outputs an initial recommended tariff code for the product based on the product description; wherein, in each round of dialogue, the tariff code recommendation large language model outputs a prompt question, and the user inputs additional user question text in response to the prompt question.
[0052] For example, if the user question text is "What is the tariff code for walnuts?", the tariff code recommendation large language model determines that the product description in the user question text is incomplete. Specifically, because the user question text contains only the product name "walnuts" and not the product specifications, the tariff code recommendation large language model cannot output an initial recommended tariff code for the product "walnuts" based solely on the product name. For instance, the tariff code for shelled walnuts differs from that for walnuts in shell.
[0053] In this scenario, the tariff code recommendation large language model needs to engage in one or more rounds of dialogue with the user to supplement the product description until the model determines that the product description is complete. Specifically, in each round of dialogue, the tariff code recommendation large language model outputs a prompt question, and the user inputs additional user question text in response to the prompt question.
[0054] For example, based on the user question text "What is the tariff code for walnuts?", the tariff code recommendation large language model can output the prompt question "Please supplement the processing method of walnuts" in the first round of dialogue. The user can then input additional user question text in response to this prompt question in the first round of dialogue, such as "The processing method of walnuts is shelling." At this point, the tariff code recommendation large language model can determine that the product description is complete, because it is now able to output the initial recommended tariff code for the product "walnuts" based on the product name "walnuts" and the product specification "shelling", for example, "0802320000". Furthermore, if after the first round of dialogue the tariff code recommendation large language model still determines that the product description is incomplete, it can engage in a second round of dialogue with the user, a third round, and so on, until it determines that the product description is complete.
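The dialogue loop described above can be sketched as follows. The completeness check is passed in as a callable (`missing_detail`) that stands in for the large language model's judgement, and `ask_user` stands in for the user interface; both names, the keyword-based toy check, and the `max_rounds` safety limit are assumptions made for illustration and are not part of the described method.

```python
from typing import Callable, List, Optional, Tuple


def collect_description(
    question: str,
    missing_detail: Callable[[List[str]], Optional[str]],
    ask_user: Callable[[str], str],
    max_rounds: int = 5,  # assumption: the text sets no upper bound
) -> Tuple[List[str], bool]:
    """Run clarification rounds until the description is judged complete.

    `missing_detail` returns a prompt question while something is missing,
    or None once the description is complete.
    """
    transcript = [question]
    for _ in range(max_rounds):
        prompt = missing_detail(transcript)
        if prompt is None:
            return transcript, True
        transcript.append(prompt)            # model's prompt question
        transcript.append(ask_user(prompt))  # additional user question text
    return transcript, missing_detail(transcript) is None


def toy_missing_detail(transcript: List[str]) -> Optional[str]:
    """Toy stand-in for the model: complete once "shell..." is mentioned."""
    if any("shell" in turn.lower() for turn in transcript):
        return None
    return "Please supplement the processing method of walnuts"


transcript, complete = collect_description(
    "What is the tariff code for walnuts?",
    toy_missing_detail,
    lambda prompt: "The processing method of walnuts is shelling.",
)
```

The returned transcript holds the original question, each prompt question, and each additional user question text in order, which is the material that step S130 passes to query rewriting when the first question was incomplete.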
[0055] In one embodiment, if the product information is the declared product name, declared product specifications, and declared product tariff code filled in by the enterprise when declaring the product, step S120 may include: inputting the declared product name and declared product specifications into the tariff code recommendation language model to obtain the initial recommended tariff code for the product.
[0056] For example, the declared product name could be "walnut" and the declared product specification could be "shelled". Specifically, the tariff code recommendation large language model can output an initial recommended tariff code for the product "walnut" based on the product name "walnut" and the product specification "shelled", such as "0802320000".
[0057] S130: Input query rewriting information based on the product information into the query rewriting large language model to obtain the product name and product specifications.
[0058] In one embodiment, if the product information is user question text about the product, and if the product description in the user question text is complete, step S130 may include: inputting the user question text into the query rewriting large language model to obtain the product name and product specifications.
[0059] For example, the user question text "What is the tariff code for shelled walnuts?" can be input into the query rewriting large language model to obtain the product name "walnuts" and the product specification "shelled".
[0060] Alternatively, in one embodiment, where the product information is user question text about the product, and where the product description in the user question text is incomplete, step S130 may include: inputting the user question text, along with one or more prompt questions and one or more additional user question texts from one or more rounds of dialogue, into the query rewriting large language model to obtain the product name and product specifications.
[0061] For example, the user question text "What is the tariff code for walnuts?", the prompt question in the first round of dialogue "Please supplement the processing method of walnuts", and the additional user question text in the first round of dialogue "The processing method of walnuts is shelling" can be input into the query rewriting large language model to obtain the product name "walnuts" and the product specification "shelling".
[0062] Alternatively, in one embodiment, if the product information is the declared product name, declared product specifications, and declared product tariff code filled in by the enterprise when declaring the product, step S130 may include: inputting the declared product name and declared product specifications into the query rewriting large language model to obtain the product name and product specifications.
[0063] For example, the declared product name "walnut" and the declared product specification "shelled" can be input into the query rewriting large language model to obtain the product name "walnut" and the product specification "shelled".
[0064] S140: Input the product name and product specifications into the hybrid retrieval module to obtain the tariff code hybrid retrieval result.
[0065] In one embodiment, step S140 may include: the hybrid retrieval module vectorizing the product name and product specifications to obtain a vectorized product name and vectorized product specifications; the hybrid retrieval module calculating the similarity of the vectorized product name based on the tariff code vector database to obtain the top K entries in the tariff code vector database that are most similar to the vectorized product name; and the hybrid retrieval module calculating the similarity of the vectorized product specifications based on the top K entries to obtain the first entry among the top K entries that is most similar to the vectorized product specifications; wherein the tariff code hybrid retrieval result includes the product name, product specifications, and product tariff code in the first entry, as well as the calculated similarity value.
[0066] For example, the hybrid retrieval module can use a text vector representation model to vectorize the product name and the product specifications separately, thereby obtaining the vectorized product name and the vectorized product specifications. For instance, the text vector representation model could be a BGE-small model, a BGE-large model, a TF-IDF model, a word2vec model, etc. This application does not impose any limitations on this.
[0067] Furthermore, the hybrid retrieval module can compare the vectorized product name with entries in the tariff code vector database (e.g., perform similarity calculations) to obtain the top K entries in the tariff code vector database that are most similar to the vectorized product name. Here, K can be an integer greater than or equal to 1, such as 3, 5, 7, etc. In addition, similarity calculations can include, for example, calculating cosine similarity, Euclidean similarity, Euclidean distance, Manhattan distance, etc.
[0068] After retrieving the top K entries in the tariff code vector database that are most similar to the vectorized product name, the hybrid retrieval module can compare the vectorized product specifications with these top K entries (e.g., perform similarity calculations) to obtain the first entry among the top K entries that is most similar to the vectorized product specifications. Thus, the hybrid retrieval module can return the tariff code hybrid retrieval result, which includes the product name (e.g., "walnut"), product specification (e.g., "shelled"), and product tariff code (e.g., "0802320000") in the first entry, along with the calculated similarity value (e.g., 1).
[0069] Furthermore, in one embodiment, if the product name is empty, the hybrid retrieval module returns empty, and the tariff code hybrid retrieval result is also empty. For example, when the product name input to the hybrid retrieval module is empty, the hybrid retrieval module cannot perform a similarity calculation for the vectorized product name; that is, it cannot compare and retrieve in the tariff code vector database. In this case, the hybrid retrieval module returns empty by default and sets the tariff code hybrid retrieval result to empty.
[0070] Further, in one embodiment, when the product name is not empty and the product specification is empty, K = 1. For example, after the hybrid retrieval module obtains the top K entries in the tariff code vector database that are most similar to the vectorized product name, if the product specification input to the hybrid retrieval module is empty, the hybrid retrieval module cannot perform a similarity calculation for the vectorized product specifications; that is, it cannot compare and retrieve among the top K entries. Therefore, K can be set to 1. That is, when performing the similarity calculation for the vectorized product name, only the top 1 entry in the tariff code vector database that is most similar to the vectorized product name is obtained, and the product name (e.g., "walnut"), product specification (e.g., "shelled"), and product tariff code (e.g., "0802320000") in the top 1 entry, along with the calculated similarity value (e.g., 1), are used as the tariff code hybrid retrieval result.
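A minimal sketch of the two-stage retrieval in step S140, including the empty-name and empty-specification cases, is given below. The character-bigram `embed` function is a toy stand-in for a text vector representation model, cosine similarity is one of the measures the text lists, and the placeholder tariff codes other than "0802320000" are invented for illustration. Which similarity value is reported is not fully specified; this sketch assumes the specification similarity when a specification is given and the name similarity otherwise.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def embed(text: str) -> Counter:
    """Toy stand-in for a text embedding model: character-bigram counts."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_retrieve(name: str, spec: str, db: List[Dict[str, str]], k: int = 3) -> Optional[Dict]:
    """Stage 1: top-K by name similarity. Stage 2: best of those by spec similarity."""
    if not name:
        return None  # empty name: the retrieval result is empty
    if not spec:
        k = 1        # empty specification: K = 1
    name_vec = embed(name)
    top_k = sorted(db, key=lambda e: cosine(name_vec, embed(e["name"])), reverse=True)[:k]
    if not top_k:
        return None
    if spec:
        spec_vec = embed(spec)
        best = max(top_k, key=lambda e: cosine(spec_vec, embed(e["spec"])))
        score = cosine(spec_vec, embed(best["spec"]))
    else:
        best = top_k[0]
        score = cosine(name_vec, embed(best["name"]))
    return {**best, "similarity": score}


DB = [
    {"name": "walnut", "spec": "shelled", "code": "0802320000"},   # code from the example above
    {"name": "walnut", "spec": "in shell", "code": "0000000001"},  # placeholder code
    {"name": "durian", "spec": "frozen", "code": "0000000002"},    # placeholder code
]
result = hybrid_retrieve("walnut", "shelled", DB, k=2)
```

A production system would precompute and store the entry vectors in the tariff code vector database rather than embedding each entry at query time, as is done here for brevity.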
[0071] S150: Input the initial recommended tariff code and the tariff code hybrid retrieval result into the decision fusion module to obtain the final recommended tariff code for the product.
[0072] In one embodiment, step S150 may include: if the tariff code hybrid retrieval result is not empty and the similarity value is greater than a first threshold, the decision fusion module outputs the product tariff code as the final recommended tariff code for the product; and if the tariff code hybrid retrieval result is empty or the similarity value is less than or equal to the first threshold, the decision fusion module outputs the initial recommended tariff code as the final recommended tariff code for the product.
[0073] For example, when the tariff code hybrid retrieval result is not empty and the similarity value (e.g., 0.96) is greater than the first threshold (e.g., 0.95), the decision fusion module can output the product tariff code (e.g., "0802320000") in the top 1 entry of the tariff code hybrid retrieval result as the final recommended tariff code for the product.
[0074] As another example, when the tariff code hybrid retrieval result is empty or the similarity value (e.g., 0.94) is less than or equal to the first threshold (e.g., 0.95), the decision fusion module can output the initial recommended tariff code (e.g., "0802320000") from the tariff code recommendation large language model as the final recommended tariff code for the product.
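The threshold rule of the decision fusion module can be written as a short sketch. The function name `fuse`, the dictionary layout of the retrieval result, and the use of `None` for an empty result are assumptions made for illustration; the default threshold of 0.95 is the example value given above.

```python
def fuse(initial_code, retrieval_result, threshold=0.95):
    # retrieval_result is None when the hybrid retrieval result is empty (assumed encoding).
    if retrieval_result is not None and retrieval_result["similarity"] > threshold:
        # Retrieved entry is similar enough: trust the database code.
        return retrieval_result["code"]
    # Otherwise fall back to the code proposed by the recommendation model.
    return initial_code
```

Note that a similarity exactly equal to the threshold falls back to the initial recommended tariff code, matching the "less than or equal to" condition.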
[0075] In one embodiment, when the product information consists of the declared product name, declared product specifications, and declared product tariff code filled in by an enterprise when declaring the product, the method for recommending tariff codes for goods based on a large language model according to the embodiments of this application further includes: comparing the declared product tariff code with the final recommended tariff code; returning no error to the user if the two are consistent; and returning an error to the user if the two are inconsistent.
[0076] For example, if the tariff code for goods declared by a company matches the final recommended tariff code mentioned above, it indicates that the company's declared tariff code is correct. In this case, the user can be notified that the company's declared tariff code is correct. Conversely, if the tariff code for goods declared by a company does not match the final recommended tariff code mentioned above, it indicates that the company's declared tariff code is incorrect. In this case, the user can be notified that the company's declared tariff code is incorrect. This provides users with a more accurate tool for detecting tariff code anomalies.
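As a hedged illustration of this comparison step, the sketch below returns a small report rather than a bare flag. The function name `check_declaration` and the shape of the returned dictionary are hypothetical; stripping surrounding whitespace before comparing is an added assumption.

```python
def check_declaration(declared_code, final_code):
    # Compare the declared tariff code with the final recommended tariff code.
    consistent = declared_code.strip() == final_code.strip()
    return {"error": not consistent, "recommended": final_code}
```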
[0077] The tariff code recommendation method for goods based on a large language model according to embodiments of this application obtains an initial recommended tariff code by receiving product information input by a user and inputting the product information into a tariff code recommendation large language model. Inputting query rewrite information based on the product information into the query rewrite large language model yields the product name and specifications, and inputting the product name and specifications into a hybrid retrieval module yields a hybrid tariff code retrieval result. Finally, inputting the initial recommended tariff code and the hybrid tariff code retrieval result into a decision fusion module yields the final recommended tariff code for the product. Furthermore, when the product information consists of the declared product name, declared product specifications, and declared tariff code filled in by the enterprise during product declaration, the declared tariff code can be compared with the final recommended tariff code to determine if there are any anomalies in the declared tariff code. Therefore, it is possible to recommend the correct tariff code based on the user's query, or to determine whether the declared tariff code is abnormal based on the user's declaration information, thereby providing customs departments with more effective tariff code recommendation and anomaly detection tools, and thus maintaining fair trade and the stability of tax collection.
[0078] Figure 5 is a flowchart of a method for constructing a tariff code recommendation large language model according to an embodiment of this application. As shown in Figure 5, the method for constructing a tariff code recommendation large language model according to an embodiment of this application includes the following steps S510 to S560.
[0079] S510: Select a general-purpose large language model as the base model.
[0080] In one embodiment, the general large language model may include, for example, GLM1, GLM2, GLM3, GLM4, QWEN, etc. This application does not impose any limitations on this.
[0081] S520: Train the general-purpose large language model for the first time using the tariff code pre-training dataset, where the training method is full parameter fine-tuning.
[0082] In one embodiment, the tariff code pre-training dataset may include the Customs Import and Export Tariff of the People's Republic of China, historical declaration data of commodity tariff codes, internet data, etc. For example, internet data may include books related to tariff codes, tariff code exercises, general knowledge, etc.
[0083] Furthermore, the tariff code pre-training dataset is a cleaned dataset, from which noise and irrelevant content have been removed to ensure data quality. For example, tariff code exercises obtained from the internet contain many HTML tags and non-textual content, which need to be cleaned. After cleaning, the text can be segmented and serialized: an appropriate tokenizer (e.g., the GLM tokenizer) is used to segment the text, and the segmented text is converted into tokens that a general-purpose large language model can process. For example, the format of the tariff code pre-training dataset can be: {"text":"context1"}.
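A minimal sketch of the cleaning step is given below, assuming that noise consists mainly of HTML tags, HTML entities, and redundant whitespace. The helper names `clean_text` and `to_record` are hypothetical, a regular expression is used instead of a full HTML parser for brevity, and tokenization is left to whichever tokenizer accompanies the chosen base model.

```python
import html
import json
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag matcher, sufficient for a sketch

def clean_text(raw):
    # Remove tags, decode entities such as &amp;, and collapse whitespace.
    no_tags = TAG_RE.sub(" ", raw)
    unescaped = html.unescape(no_tags)
    return re.sub(r"\s+", " ", unescaped).strip()

def to_record(raw):
    # Serialize one cleaned passage in the {"text": ...} format described above.
    return json.dumps({"text": clean_text(raw)}, ensure_ascii=False)
```

Each call to `to_record` produces one line suitable for a line-delimited JSON file.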
[0084] Furthermore, during the first training of the general-purpose large language model using the tariff code pre-training dataset, the training method can be full parameter fine-tuning; that is, all parameters of the general-purpose large language model are fine-tuned.
[0085] S530: Monitor the difference curve of the loss function during the first training period, and stop the first training when the difference curve converges.
[0086] In one embodiment, the first training of a general large language model can be considered complete when the difference curve of the loss function of the first training converges (e.g., when the difference no longer decreases significantly).
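One possible reading of this stopping criterion is sketched below: training stops once the change in loss between consecutive evaluations has stayed below a tolerance for several evaluations in a row. The function name `should_stop` and the parameters `tol` and `patience` are assumptions, as are their default values; the application does not specify them.

```python
def should_stop(losses, tol=1e-3, patience=3):
    # losses: loss values recorded so far, oldest first.
    if len(losses) <= patience:
        return False  # not enough history to judge convergence
    recent = losses[-(patience + 1):]
    # Differences between consecutive loss values over the recent window.
    diffs = [abs(recent[i + 1] - recent[i]) for i in range(patience)]
    return all(d < tol for d in diffs)
```

The same check could be reused for the second training and for each candidate model of the query rewriting stage.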
[0087] S540: Train the general-purpose large language model a second time using the tariff code fine-tuning dataset, where the training method is partial parameter fine-tuning.
[0088] In one embodiment, the tariff code fine-tuning dataset includes a predefined tariff code multi-turn question-and-answer dataset. For example, the format of the predefined tariff code multi-turn question-and-answer dataset could be: {"user":"query1","gpt":"response1", ..., "user":"queryn","gpt":"responsen"}, where query1, query2, ..., queryn are questions asked by the user, and response1, response2, ..., responsen are the responses of the general-purpose large language model. For example, query1 could be "What is the tariff code for walnuts?", response1 could be "Please specify the processing method of the walnuts", query2 could be "The processing method for the walnuts is shelling", and response2 could be "The tariff code for shelled walnuts is 0802320000".
[0089] Furthermore, during the second training of the general-purpose large language model using the tariff code fine-tuning dataset, the training method can be partial parameter fine-tuning; that is, only a subset of the parameters of the general-purpose large language model is fine-tuned.
[0090] S550: Monitor the difference curve of the loss function during the second training, and stop the second training when the difference curve converges.
[0091] In one embodiment, the second training of a general large language model can be considered complete when the difference curve of the loss function of the second training converges (e.g., when the difference no longer decreases significantly).
[0092] S560: Train the general-purpose large language model a third time using the tariff code preference optimization dataset to obtain the tariff code recommendation large language model.
[0093] In one embodiment, the tariff code preference optimization dataset includes a predefined tariff code preference question-and-answer dataset. For example, the format of the predefined tariff code preference question-and-answer dataset could be: {"user":"query1","chosen":"response1","rejected":"response2"}, where query1 is the user's question text, response1 is the type of response that the general-purpose large language model should prefer (chosen), and response2 is the type of response that it should avoid (rejected). For example, query1 could be "What is the tariff code for shelled walnuts?", response1 could be "Referring to the tariff code and chapter name, and because the walnuts are processed by shelling, the tariff code is 0802320000", and response2 could be "The tariff code for shelled walnuts is 0802320000". Considering practical application scenarios, response1 is better than response2 because it also states the basis for the classification.
[0094] Figure 6 is a flowchart of a method for constructing a query rewriting large language model according to an embodiment of the present application. As shown in Figure 6, the method for constructing a query rewriting large language model according to an embodiment of the present application includes the following steps S610 to S650.
[0095] S610: Divide the query rewrite dataset into a training set and a test set.
[0096] In one embodiment, the query rewriting dataset may include a predefined multi-turn question-and-answer dataset for tariff code query rewriting. For example, the format of this dataset may be: {"inputs": "q1r1,q2r2,...,qn", "outputs": "query_new"}, where q1r1,q2r2,...,qn is the concatenation of each round of user input and each round of output from the large language model (excluding the last round's output) in one or more rounds of dialogue between the user and the tariff code recommendation large language model, and query_new is the new query. This new query can improve the accuracy and efficiency of retrieval against the tariff code vector database.
[0097] For example, an entry in the predefined multi-turn question-and-answer dataset for tariff code query rewriting could be {"inputs": "What is the ten-digit tariff code for walnuts? / Please add the processing method of walnuts / The processing method is shelling", "outputs": {"commodity name": "walnuts", "declaration element": "shelling"}}.
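The construction of one such entry can be sketched as follows. The helper names `build_rewrite_input` and `make_record`, the representation of a dialogue as a list of (user text, model reply) pairs, and the " / " separator (taken from the example entry) are assumptions for illustration only.

```python
import json

def build_rewrite_input(turns):
    # turns: list of (user_text, model_reply) pairs, oldest first.
    parts = []
    for i, (user_text, reply) in enumerate(turns):
        parts.append(user_text)
        # The output of the last round is excluded from the concatenation.
        if i < len(turns) - 1 and reply:
            parts.append(reply)
    return " / ".join(parts)

def make_record(turns, commodity_name, declaration_element):
    # Pair the concatenated dialogue with the structured rewritten query.
    return json.dumps({
        "inputs": build_rewrite_input(turns),
        "outputs": {"commodity name": commodity_name,
                    "declaration element": declaration_element},
    }, ensure_ascii=False)
```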
[0098] In one embodiment, the ratio of the training set to the test set of the query rewriting dataset can be 9:1. The ratio can also be any other suitable ratio, and this application does not impose any restrictions on it.
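A 9:1 split might be realized as in the sketch below. Shuffling before splitting and the fixed random seed are assumptions added for reproducibility; the function name `split_dataset` is hypothetical.

```python
import random

def split_dataset(records, train_ratio=0.9, seed=0):
    # Shuffle a copy so the caller's data is left untouched.
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```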
[0099] S620: Select multiple general-purpose large language models as base models.
[0100] In one embodiment, multiple general-purpose large language models can be selected as base models. For example, the general-purpose large language models may include GLM1, GLM2, GLM3, GLM4, QWEN, etc. This application does not impose any limitations on this.
[0101] S630: Train the multiple general-purpose large language models separately using the training set, where the training method is partial parameter fine-tuning.
[0102] In one embodiment, the multiple general-purpose large language models can be trained simultaneously using the training set. Furthermore, during this training, the training method can be partial parameter fine-tuning; that is, only a subset of the parameters of each general-purpose large language model is fine-tuned.
[0103] S640: For each general large language model, monitor the difference curve of the loss function during training, and stop training when the difference curve converges.
[0104] In one embodiment, training for the general large language model can be considered complete when the difference curve of the loss function trained for each general large language model converges (e.g., when the difference no longer decreases significantly).
[0105] S650: Test each trained general-purpose large language model using a test set, and select the trained general-purpose large language model with the highest test score as the query rewriting large language model.
[0106] In one embodiment, the trained general-purpose large language model with the highest test score can be selected as the query rewriting large language model, thereby enabling more accurate output when using the query rewriting large language model in the future.
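The selection step can be sketched as follows, under the assumption that the test score is exact-match accuracy on the test set; the application does not state which metric is used. Candidate models are represented here as plain callables, and the names `exact_match_score` and `select_best_model` are hypothetical.

```python
def exact_match_score(model, test_set):
    # test_set: list of (inputs, expected_output) pairs.
    if not test_set:
        return 0.0
    hits = sum(1 for inputs, expected in test_set if model(inputs) == expected)
    return hits / len(test_set)

def select_best_model(candidates, test_set):
    # candidates: dict mapping a model name to a callable that rewrites a query.
    scores = {name: exact_match_score(fn, test_set) for name, fn in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores
```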
[0107] Figure 7 is a flowchart of a method for constructing a tariff code vector database according to an embodiment of the present application. As shown in Figure 7, the method for constructing a tariff code vector database according to an embodiment of the present application includes the following steps S710 to S720.
[0108] S710: Clean the historical declaration data of commodity tariff codes to obtain a database including commodity names, cleaned commodity specifications, and commodity tariff codes.
[0109] In one embodiment, in the historical declaration data of commodity tariff codes, because commodity tariff codes are standardized, the same commodity with different specifications (e.g., shape, processing method, use, or storage method) will correspond to different commodity tariff codes. Furthermore, the historical declaration data also includes other commodity specifications (e.g., brand, packaging, contract date, etc.) that do not affect the commodity tariff code. Therefore, it is necessary to clean the historical declaration data of commodity tariff codes to obtain a database including commodity names, cleaned commodity specifications (which affect the commodity tariff code), and the commodity tariff codes.
[0110] S720: Vectorize the product name and the cleaned product specifications respectively to obtain a tariff code vector database including the product name, the cleaned product specifications, the vectorized product name, the vectorized cleaned product specifications, and the product tariff code.
[0111] In one embodiment, a text vector representation model can be used to vectorize both the product name and the cleaned product specifications, thereby obtaining vectorized product names and vectorized cleaned product specifications. For example, the text vector representation model can be a BGE-small model, a BGE-large model, a TF-IDF model, a word2vec (word-to-vector) model, etc. This application does not impose any limitations on this. Because the tariff code vector database includes vectorized product names and vectorized cleaned product specifications, it supports the aforementioned similarity calculations of the hybrid retrieval module.
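Steps S710 and S720 can be illustrated together with the sketch below. It assumes, purely for illustration, that each historical declaration stores its specifications as labeled fields, so that cleaning reduces to keeping the fields that affect the tariff code. The field names in `RELEVANT_FIELDS`, the character-count `embed` function (a stand-in for a real text vector representation model), and the record layout are all hypothetical.

```python
from collections import Counter

# Specification fields assumed to affect the tariff code (illustrative only).
RELEVANT_FIELDS = {"shape", "processing method", "use", "storage method"}

def embed(text):
    # Toy stand-in for a text embedding model: character counts.
    return Counter(text.lower())

def clean_spec(spec_fields):
    # Keep only the fields that influence classification; drop brand, packaging, etc.
    kept = [value for key, value in spec_fields.items() if key in RELEVANT_FIELDS]
    return "; ".join(kept)

def build_vector_db(declarations):
    db = []
    for d in declarations:
        spec = clean_spec(d["specs"])
        db.append({
            "name": d["name"],
            "spec": spec,
            "name_vec": embed(d["name"]),
            "spec_vec": embed(spec),
            "code": d["code"],
        })
    return db
```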
[0112] This application also provides a tariff code recommendation device for goods based on a large language model, comprising: a processor and a memory storing instructions, which, when executed by the processor, cause the processor to perform the aforementioned tariff code recommendation method for goods based on a large language model.
[0113] Figure 8 is a schematic diagram of the hardware structure of a tariff code recommendation device for goods based on a large language model according to an embodiment of this application. As shown in Figure 8, the tariff code recommendation device for goods based on a large language model may include a processor 81 and a memory 82 storing computer program instructions.
[0114] Specifically, the processor 81 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement the embodiments of this application.
[0115] Memory 82 may include a large-capacity memory for data or instructions. Where appropriate, memory 82 may include removable or non-removable (or fixed) media. In a particular embodiment, memory 82 is a non-volatile solid-state memory. Memory 82 may include read-only memory (ROM), random access memory (RAM), disk storage media devices, optical storage media devices, flash memory devices, electrical, optical, or other physical / tangible memory storage devices. Thus, typically, memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software including computer-executable instructions, and when the software is executed (e.g., by one or more processors), it is operable to perform the above-described method for recommending tariff codes for goods based on a large language model.
[0116] The processor 81 reads and executes computer program instructions stored in the memory 82 to implement the tariff code recommendation method for goods based on a large language model in the above embodiments.
[0117] In one example, the tariff code recommendation device for goods based on a large language model may further include a communication interface 83 and a bus 84. As shown in Figure 8, the processor 81, memory 82, and communication interface 83 are connected via bus 84 and communicate with each other.
[0118] Bus 84 includes hardware, software, or both, coupling the components of the tariff code recommendation device for goods based on a large language model to each other. By way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), or other suitable buses, or combinations of two or more of these. Where appropriate, bus 84 may include one or more buses. Although specific buses are described and illustrated in embodiments of this application, any suitable bus or interconnect is contemplated herein.
[0119] This application also provides a computer-readable storage medium storing instructions thereon, which, when executed by a computer, cause the computer to perform the above-described method for recommending tariff codes for goods based on a large language model.
[0120] Examples of computer-readable storage media include non-transitory computer-readable storage media such as portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, etc.
[0121] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.
[0122] The above description is merely a specific embodiment of this application. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the device described above can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here. It should be understood that the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the protection scope of this application.
Claims
1. A method for recommending tariff codes for goods based on a large language model, comprising: receiving product information; inputting the product information into a tariff code recommendation large language model to obtain an initial recommended tariff code for the product; inputting query rewriting information based on the product information into a query rewriting large language model to obtain a product name and product specifications; inputting the product name and the product specifications into a hybrid retrieval module to obtain a tariff code hybrid retrieval result; and inputting the initial recommended tariff code and the tariff code hybrid retrieval result into a decision fusion module to obtain a final recommended tariff code for the product.
2. The method of claim 1, wherein the product information is a user question text about the product.
3. The method of claim 2, wherein inputting the product information into the tariff code recommendation large language model to obtain the initial recommended tariff code for the product comprises: determining, by the tariff code recommendation large language model, whether the product description in the user question text is complete; and if the product description is complete, outputting, by the tariff code recommendation large language model, the initial recommended tariff code for the product based on the product description.
4. The method of claim 3, wherein inputting the product information into the tariff code recommendation large language model to obtain the initial recommended tariff code for the product further comprises: if the product description is incomplete, conducting, by the tariff code recommendation large language model, one or more rounds of dialogue with the user to supplement the product description until the tariff code recommendation large language model determines that the product description is complete; and outputting, by the tariff code recommendation large language model, the initial recommended tariff code for the product based on the product description; wherein, in each round of dialogue, the tariff code recommendation large language model outputs a prompt question, and the user inputs an additional user question text in response to the prompt question.
5. The method of claim 3, wherein inputting the query rewriting information based on the product information into the query rewriting large language model to obtain the product name and the product specifications comprises: inputting the user question text into the query rewriting large language model to obtain the product name and the product specifications.
6. The method of claim 4, wherein inputting the query rewriting information based on the product information into the query rewriting large language model to obtain the product name and the product specifications comprises: inputting the user question text, together with the one or more prompt questions and the one or more additional user question texts from the one or more rounds of dialogue, into the query rewriting large language model to obtain the product name and the product specifications.
7. The method of claim 1, wherein the product information is the declared product name, declared product specifications, and declared product tariff code filled in by an enterprise when declaring the product.
8. The method of claim 7, wherein inputting the product information into the tariff code recommendation large language model to obtain the initial recommended tariff code for the product comprises: inputting the declared product name and the declared product specifications into the tariff code recommendation large language model to obtain the initial recommended tariff code for the product.
9. The method of claim 7, wherein inputting the query rewriting information based on the product information into the query rewriting large language model to obtain the product name and the product specifications comprises: inputting the declared product name and the declared product specifications into the query rewriting large language model to obtain the product name and the product specifications.
10. The method of claim 7, further comprising: comparing the declared product tariff code with the final recommended tariff code; returning no error to the user if the declared product tariff code is consistent with the final recommended tariff code; and returning an error to the user if the declared product tariff code is inconsistent with the final recommended tariff code.
11. The method of claim 1, wherein inputting the product name and the product specifications into the hybrid retrieval module to obtain the tariff code hybrid retrieval result comprises: performing, by the hybrid retrieval module, vectorization on the product name and the product specifications to obtain a vectorized product name and vectorized product specifications; performing, by the hybrid retrieval module, similarity calculation on the vectorized product name against a tariff code vector database to obtain the top K entries in the tariff code vector database that are most similar to the vectorized product name; and performing, by the hybrid retrieval module, similarity calculation on the vectorized product specifications against the top K entries to obtain the top 1 entry among the top K entries that is most similar to the vectorized product specifications; wherein the tariff code hybrid retrieval result includes the product name, product specifications, and product tariff code in the top 1 entry, as well as the calculated similarity value.
12. The method of claim 11, wherein, if the product name is empty, the hybrid retrieval module returns an empty value and the tariff code hybrid retrieval result is empty; and wherein, if the product name is not empty and the product specifications are empty, K = 1.
13. The method of claim 12, wherein inputting the initial recommended tariff code and the tariff code hybrid retrieval result into the decision fusion module to obtain the final recommended tariff code for the product comprises: if the tariff code hybrid retrieval result is not empty and the similarity value is greater than a first threshold, outputting, by the decision fusion module, the product tariff code as the final recommended tariff code for the product; and if the tariff code hybrid retrieval result is empty or the similarity value is less than or equal to the first threshold, outputting, by the decision fusion module, the initial recommended tariff code as the final recommended tariff code for the product.
14. The method of claim 1, wherein the tariff code recommendation large language model is constructed by: selecting a general-purpose large language model as a base model; training the general-purpose large language model for a first time using a tariff code pre-training dataset, wherein the training method is full parameter fine-tuning; monitoring the difference curve of the loss function during the first training, and stopping the first training when the difference curve converges; training the general-purpose large language model for a second time using a tariff code fine-tuning dataset, wherein the training method is partial parameter fine-tuning; monitoring the difference curve of the loss function during the second training, and stopping the second training when the difference curve converges; and training the general-purpose large language model for a third time using a tariff code preference optimization dataset to obtain the tariff code recommendation large language model.
15. The method of claim 14, wherein the tariff code pre-training dataset includes the Customs Import and Export Tariff of the People's Republic of China, historical declaration data of commodity tariff codes, and internet data; the tariff code fine-tuning dataset includes a predefined tariff code multi-turn question-and-answer dataset; and the tariff code preference optimization dataset includes a predefined tariff code preference question-and-answer dataset.
16. The method of claim 1, wherein the query rewriting large language model is constructed by: dividing a query rewriting dataset into a training set and a test set; selecting multiple general-purpose large language models as base models; training the multiple general-purpose large language models separately using the training set, wherein the training method is partial parameter fine-tuning; for each general-purpose large language model, monitoring the difference curve of the loss function during training, and stopping the training when the difference curve converges; and testing each trained general-purpose large language model using the test set, and selecting the trained general-purpose large language model with the highest test score as the query rewriting large language model.
17. The method of claim 16, wherein the query rewriting dataset includes a predefined multi-turn question-and-answer dataset for tariff code query rewriting.
18. The method of claim 11, wherein the tariff code vector database is constructed by: cleaning historical declaration data of commodity tariff codes to obtain a database including product names, cleaned product specifications, and product tariff codes; and vectorizing the product names and the cleaned product specifications respectively to obtain the tariff code vector database, which includes the product names, the cleaned product specifications, the vectorized product names, the vectorized cleaned product specifications, and the product tariff codes.
19. An apparatus for recommending tariff codes for goods based on a large language model, comprising: a processor; and a memory storing instructions that, when executed by the processor, cause the processor to perform the method according to any one of claims 1-18.
20. A computer-readable storage medium having instructions stored thereon, which, when executed by a computer, cause the computer to perform the method according to any one of claims 1-18.