A tourism and geographical text classification method and system based on a large language model, a storage medium and a product
By using multi-model label fusion and a lightweight large language model structure, the instability of large language models in text classification under specific domains and small sample conditions is solved, achieving efficient and stable tourism and geography text classification and reducing inference costs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SOUTHWEST JIAOTONG UNIV
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies struggle to effectively utilize large language models for text classification in specific domains or under small sample conditions, resulting in unstable and uncontrollable classification results, high inference costs, and difficulty in achieving lightweight and local deployment.
By constructing a training dataset containing labels, multiple large language models are used to generate preliminary predicted labels and calculate consistency scores. The model is then trained and classified by combining a lightweight basic large language model and a classification head structure. Only the supervised fine-tuned model is retained for classification during the inference stage.
It achieves efficient and stable classification results in tourism and geographic text classification, reduces computational costs, and improves classification accuracy and consistency, making it suitable for large-scale data scenarios.
Figure CN122220522A_ABST
Description
Technical Field
[0001] Disclosed are a method, system, storage medium, and product for classifying tourism and geographical texts based on a large language model. The method belongs to the field of natural language processing and artificial intelligence technology. It is suitable for semantic recognition and multi-category intelligent annotation of tourism-related texts, geographical knowledge texts, and their mixed corpora, and can be applied to tourism information mining, intelligent recommendation systems, and related data analysis scenarios.
Background Technology
[0002] Text classification, a core task of Natural Language Processing (NLP), aims to automatically identify the type of text data and categorize it into predefined categories based on its content. This technology plays a crucial role in sentiment analysis, information retrieval, intelligent customer service, and industry-specific applications (such as tourism, healthcare, and education). With the rapid development of the internet and social media, massive amounts of unstructured text data are constantly being generated. How to efficiently and accurately perform automated text classification has become a fundamental issue of long-term concern in the field of artificial intelligence.
[0003] Early text classification methods primarily relied on manually designed feature engineering and traditional machine learning algorithms, such as Naive Bayes, Support Vector Machines (SVM), and Logistic Regression. While these methods performed well in specific domains, they heavily depended on manually constructed bag-of-words models and TF-IDF features, making it difficult to capture semantic hierarchical information and contextual dependencies.
[0004] With the development of deep learning, neural network-based models are gradually replacing traditional algorithms. Recurrent Neural Networks (RNNs) and their variant, Long Short-Term Memory (LSTM), can model the sequence characteristics of text to some extent, thereby improving classification performance. However, these models still face problems such as difficulty in modeling long-distance dependencies and low training efficiency.
[0005] In recent years, the introduction of the Transformer architecture has fundamentally changed the technological landscape of natural language processing. Pre-trained language models based on Transformers (such as BERT, RoBERTa, GPT, and the LLaMA series) achieve global modeling and perception of contextual information through self-attention, significantly improving the ability to understand, process, and generate text. In various tasks, especially text generation, Transformer models, with their pre-training-fine-tuning paradigm, have demonstrated far superior performance and transfer learning capabilities compared to traditional models.
[0006] However, with continuous updates to model architectures and weights, the number of available models is growing rapidly. Large-scale commercial language models in particular typically have large parameter counts and high inference costs and are difficult to deploy directly on local servers, which limits their application in large-scale tourism text processing scenarios. Furthermore, existing technologies lack a solution that effectively transfers the inference capabilities of large-scale language models to lightweight classification models and forms a stable and usable classification system. Therefore, a new technical approach is urgently needed that fully leverages the inference and knowledge advantages of large-scale language models while ensuring classification accuracy and semantic consistency, and that achieves efficient, stable, and locally deployable automatic classification of tourism and geographical texts.
[0007] In specific domains or under limited sample conditions, directly using large language models for text classification, as in existing technologies, may lead to the following adverse consequences:
1. Large language models are usually pre-trained on general corpora, and their internal knowledge distribution mainly reflects general language patterns. When applied directly to texts in professional fields such as tourism and geography, the model may be unstable in its judgment and semantic understanding of geography- and tourism-specific concepts, implicit geographical relationships, and industry terms. Under small sample conditions, class boundary drift is prone to occur, resulting in poor consistency of classification results between batches and difficulty in forming stable classification rules.
2. In the absence of sufficient domain-labeled samples, large language models often rely on their inherent generative mechanisms to make classification judgments. Although this mechanism has relatively strong natural-language semantic understanding, its outputs are noticeably random and hard to control, so that texts with the same semantics may be assigned different category labels across multiple calls. This conflicts with the requirements of data management and scientific research for reliable and interpretable results.
Summary of the Invention
[0008] The purpose of this invention is to provide a method, system, storage medium, and product for tourism and geography text classification based on a large language model, which solves the problem that existing technologies cannot efficiently utilize suitable large language models for classification tasks in specific fields or under small sample conditions.
[0009] To achieve the above objective, the technical solution adopted by the present invention is as follows: A method for classifying tourism and geographical texts based on a large language model includes the following steps: Step 1: Construct a classification sample dataset based on the tourism and geography text dataset, and formulate text labels of multiple categories for text classification; Step 2: Based on the classification sample dataset and text labels, construct a training dataset containing labels through label generation and label fusion strategies, where the label generation and label fusion strategies refer to using at least two different large language models (LLMs) and a unified prompt word template to generate corresponding preliminary predicted labels for the classification sample dataset and text classification labels, and calculating consistency scores to construct the training dataset containing labels; Step 3: Train the constructed classification model on the training dataset to obtain the target model for tourism and geography text classification, where the classification model is attached after the last hidden state of a local lightweight basic large language model; Step 4: Use the target model to classify the tourism and geography texts to be classified and output the corresponding category labels.
[0010] Furthermore, the specific steps of step 1 are as follows: Step 1.1: First, obtain the original text dataset in the fields of tourism and geography. The original text data includes, but is not limited to, scenic spot introduction text, travel guide text, tourism policy and planning text, and geographical knowledge description text. Chinese texts were randomly selected from the original text datasets in the fields of tourism and geography as the classification sample dataset, with the number of randomly selected texts being greater than or equal to 10,000. Step 1.2: Analyze the text features of the original text dataset in the fields of tourism and geography. Combine the topic distribution, semantic structure and domain knowledge system of the corpus to formulate text classification labels. The text classification labels include geography and geology, history and culture, environmental protection, architecture and landscape, tourism industry construction and planning, scenic spot introduction, travel guides and itineraries, tourism vocational education and tourism policy.
[0011] Furthermore, the specific steps of step 2 are as follows: Step 2.1: Use at least two different external large language models (LLMs) and a unified prompt word template to generate preliminary predicted labels for the classification sample dataset and text classification labels. Each LLM completes the annotation independently, forming a multi-model annotation set. The LLMs include GPT-4.1, Deepseek-R1, Deepseek-V3, or Moonshot-V1. The unified prompt word template includes a task description, a category constraint, an input text placeholder, and an output format constraint. The task description instructs the LLM to perform a text classification task and specifies that the classification target belongs to the tourism and geography text domain. The category constraint lists or limits the range of optional text classification labels so as to constrain the classification output of the LLM. The input text placeholder carries a text from the classification sample dataset, and the output format constraint specifies how the text classification label is expressed in the model output. Step 2.2: Calculate the Cohen's Kappa coefficient between each pair of LLMs as the consistency score; Step 2.3: Based on the consistency scores, calculate a Kappa-weighted score for each LLM's prediction on each single text in the classification sample dataset, retain only high-confidence labels whose score is above a given threshold, and finally obtain a training dataset containing labels.
[0012] Furthermore, the classification model in step 3 is a classification head structure set after the last hidden state of the local lightweight basic large language model, so as to map the last hidden vector to a preset fixed text category label set and output the corresponding text classification result. The local lightweight basic large language model includes Llama-3.2, Qwen-2.5 and Qwen-3, and the classification model is obtained by adding the classification head structure to Llama-3.2, Qwen-2.5 and Qwen-3.
[0013] Furthermore, the classification head structure includes a selection layer, a fully connected layer, and a category logit projection layer connected in sequence. The selection layer selects the hidden state corresponding to the last token in the last hidden vector sequence of the input sequence as the aggregated representation of the text context information. The fully connected layer maps from the hidden layer size to the hidden layer size and processes the output of the selection layer. The category logit projection layer maps the output of the fully connected layer from the hidden layer size to the number of predefined categories and outputs the corresponding category logits, where a logit is the raw, unnormalized score of a category.
[0014] Furthermore, in step 3, during the training phase, the classification model utilizes an external large language model to generate soft labels or performs label semantic fusion, and learns its probability distribution through supervised fine-tuning methods. During the inference phase, only the supervised fine-tuned target model is retained for independent classification, and the external large language model LLM is no longer called.
[0015] A tourism and geography text classification system based on a large language model includes a memory, a processor, and a computer program stored in the memory. The processor executes the computer program to implement the steps of the tourism and geography text classification method based on a large language model.
[0016] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the tourism and geography text classification method based on a large language model.
[0017] A computer program product includes a computer program that, when executed by a processor, implements the steps of the tourism and geography text classification method based on a large language model.
[0018] Compared with the prior art, the advantages of the present invention are as follows: The invention demonstrates good practicality and promotional value on large-scale real-world datasets. Specifically, when seven models were trained and evaluated using this invention, the accuracy on the evaluation set reached 95.83%, with the lowest average cross-entropy loss of 0.1613. After deployment, approximately 460,000 high-quality Chinese tourism and geography texts were automatically classified, yielding complete and structured label distribution results. The specific advantages are:
First, in terms of label system design, the semantic boundaries of the classification labels are reconstructed and optimized by combining national standards with the features of actual tourism and geography corpora. This design effectively reduces semantic coverage and overlap between classes and improves the usability and stability of labels in automatic classification scenarios. Experimental results show that the label system helps to improve overall classification accuracy and the recognition of few-sample categories.
Second, this invention integrates label acquisition and training dataset construction with consistency constraints across multiple models (i.e., multiple large language models, LLMs). By introducing a consistency evaluation mechanism between the labeling results of multiple models, it effectively suppresses noisy labels. Compared with existing methods that rely on a single model or on manually labeled samples, this invention can stably generate high-quality, uniquely labeled datasets in large-scale data scenarios, providing a reliable training foundation for subsequent supervised fine-tuning.
Third, this invention adopts the structure "local lightweight basic large language model + classification head structure" as the classification model. This effectively combines the semantic modeling capability of a general large language model with the classification task and avoids the reliance of traditional prompt-template-based classification on complex context construction and high token consumption in the inference stage. The structure achieves stable representation and efficient classification of long-text semantics while keeping the model parameter scale small.
Fourth, during inference, only the supervised fine-tuned classification model is called to perform classification. External or commercial large language models no longer participate in classification decisions or result generation. This ensures the consistency and stability of classification results, improves the inference efficiency of tourism and geographical text classification, and reduces computation and calling costs.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a schematic flow diagram of the present invention; Figure 2 shows the classification results of the target model for the texts to be classified in an embodiment of the invention.
Detailed Implementation
[0021] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0022] A method for classifying tourism and geographical texts based on a large language model includes the following steps: Step 1: Construct a classification sample dataset based on the tourism and geography text dataset, and formulate text labels of multiple categories for text classification, i.e., formulate category labels. The specific steps are as follows: Step 1.1: First, obtain the original text dataset in the fields of tourism and geography. The original text data includes, but is not limited to, scenic spot introduction text, travel guide text, tourism policy and planning text, and geographical knowledge description text. High-quality Chinese texts are randomly selected from the original text dataset as the classification sample dataset, with the number of randomly selected texts being greater than or equal to 10,000. Labeled data for long-text classification in the tourism domain is lacking, and manual labeling is costly and susceptible to subjective bias. Therefore, 10,000 texts were randomly extracted from the high-quality Chinese text subset of IndustryCorpus2 (tourism and geography), covering content such as scenic spots, geography, and culture, to construct the training and validation datasets that support training and evaluating the model on long-text classification tasks.
[0023] Step 1.2: Analyze the text features of the original text dataset in the tourism and geography fields. Combining the topic distribution, semantic structure, and domain knowledge system of the corpus, formulate the text classification labels. These labels include geography and geology, history and culture, environmental protection, architecture and landscape, tourism industry construction and planning, scenic spot introduction, travel guides and itineraries, tourism vocational education, tourism policy, and others. The label design is based on the national standard GB/T 18972-2017 and also incorporates the characteristics of tourism texts and expert opinions. The classification themes are identified by analyzing the knowledge and text types related to tourism resource surveys.
[0024] Step 2: Based on the classification sample dataset and text labels, construct a training dataset containing labels through label generation and label fusion strategies. The label generation and label fusion strategies refer to using at least two different large language models (LLM) and a unified prompt word template to generate corresponding preliminary predicted labels for the classification sample dataset and text classification labels, and calculating consistency scores to construct a training dataset containing labels. The specific steps are as follows: Step 2.1: Use at least two different external Large Language Models (LLMs) to generate preliminary predicted labels for the classification sample dataset and text classification labels using a unified prompt word template. Each LLM independently completes the annotation, forming a multi-model annotation set. The LLMs include GPT-4.1, Deepseek-R1, Deepseek-V3, or Moonshot-V1, etc. The unified prompt word template includes a task description, a category constraint, an input text placeholder, and an output format constraint. The task description instructs the LLM to perform a text classification task and specifies that the classification target belongs to the tourism and geography text domain. The category constraint lists or limits the category range of the optional text classification labels to constrain the classification output of the LLM. The input text placeholder carries the classification sample dataset, and the output format constraint specifies the expression form of the text classification labels in the model output.
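The four parts of the unified prompt word template described in Step 2.1 can be illustrated with a minimal Python sketch. The template wording, the English label strings, and the helper names `build_prompt` and `parse_label` are illustrative assumptions and do not come from the patent; the call to an external LLM is deliberately left out.

```python
# Label set rendered in English for illustration; the exact label strings
# used by the inventors are not given in the text (assumption).
LABELS = [
    "geography and geology", "history and culture", "environmental protection",
    "architecture and landscape", "tourism industry construction and planning",
    "scenic spot introduction", "travel guides and itineraries",
    "tourism vocational education", "tourism policy", "others",
]

# Hypothetical template: task description, category constraint,
# input text placeholder, and output format constraint, in that order.
PROMPT_TEMPLATE = (
    "Task: classify the following tourism and geography text.\n"
    "Choose exactly one category from this list:\n{categories}\n"
    "Text:\n{text}\n"
    "Answer with the category name only."
)


def build_prompt(text, labels=LABELS):
    """Fill the shared template so every LLM receives the same instructions."""
    categories = "\n".join(f"- {name}" for name in labels)
    return PROMPT_TEMPLATE.format(categories=categories, text=text)


def parse_label(raw_output, labels=LABELS):
    """Return the matching label, or None if the output breaks the format."""
    cleaned = raw_output.strip().strip(".\"'").lower()
    return cleaned if cleaned in labels else None


prompt = build_prompt("Day 1: Visit the ancient city; Day 2: Take a boat trip.")
label = parse_label(" Travel guides and itineraries. ")
```

Rejecting any answer outside the fixed label list is one simple way to enforce the category constraint on the output side; the patent does not say how malformed outputs are handled.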
[0025] Step 2.2: Calculate the Cohen's Kappa coefficient between each pair of large language models (LLMs) as the consistency score; Step 2.3: Based on the consistency scores, calculate a Kappa-weighted score for each LLM's prediction on each single text in the classification sample dataset, and retain only high-confidence labels whose score is above a given threshold, thus obtaining a training dataset containing labels. After the fused-label samples with low confidence are removed, over 7800 labeled data points remain.
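One possible reading of Steps 2.2 and 2.3 is sketched below in Python. Cohen's Kappa is computed in its standard form, $\kappa = (p_o - p_e)/(1 - p_e)$. The weighting scheme (each model weighted by its mean Kappa with the other models), the normalized vote score, and the threshold value 0.6 are assumptions made for illustration, since the patent does not specify them.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's Kappa between two equally long label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def fuse_labels(annotations, threshold=0.6):
    """Kappa-weighted vote per text; low-confidence texts become None.

    annotations: dict mapping model name -> list of labels, one per text.
    The weighting and threshold are illustrative assumptions.
    """
    models = sorted(annotations)
    if len(models) < 2:
        raise ValueError("at least two annotating models are required")
    pair_kappa = {
        frozenset(pair): cohen_kappa(annotations[pair[0]], annotations[pair[1]])
        for pair in combinations(models, 2)
    }
    # Weight of a model = its mean Kappa with all other models, floored at 0.
    weights = {}
    for m in models:
        kappas = [pair_kappa[frozenset((m, o))] for o in models if o != m]
        weights[m] = max(sum(kappas) / len(kappas), 0.0)
    total = sum(weights.values())
    fused = []
    for i in range(len(annotations[models[0]])):
        scores = defaultdict(float)
        for m in models:
            scores[annotations[m][i]] += weights[m]
        label, score = max(scores.items(), key=lambda kv: kv[1])
        confidence = score / total if total > 0 else 0.0
        fused.append((label, confidence) if confidence >= threshold else None)
    return fused


votes = {"model_a": ["x", "y", "x", "y"], "model_b": ["x", "y", "x", "x"]}
fused = fuse_labels(votes)  # the fourth text is dropped: the two models disagree
```

With only two annotators, any disagreement yields a confidence of 0.5 under this scheme, so such texts are discarded at the assumed threshold; with three or more models, a text survives when the agreeing models carry enough combined weight.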
[0026] Step 3: Train the constructed classification model on the training dataset to obtain the target model for tourism and geography text classification. The classification model is attached after the last hidden state of the local lightweight basic large language model. In text classification based on few-shot prompts, small-scale models (such as Qwen-2.5-1.5B) struggle to achieve high-precision classification because of the limited size of their weight matrices, and they often fail to adhere to the template requirements when generating output. Therefore, a classification head structure is added after the last hidden layer of the model to map the final hidden-layer representation to a fixed label set, so that the output can be produced and optimized for the specific classification task.
[0027] The classification model is created by adding a classification head structure after the last hidden state of a local lightweight basic large language model. This structure maps the last hidden vector to a pre-defined set of fixed text category labels and outputs the corresponding text classification results. The local lightweight basic large language model includes Llama-3.2, Qwen-2.5, and Qwen-3, etc. The classification model is obtained by adding a classification head structure to Llama-3.2, Qwen-2.5, and Qwen-3, etc.
[0028] The classification head structure includes a selection layer, a fully connected layer, and a category logit projection layer connected in sequence. The selection layer selects the hidden state corresponding to the last token in the last hidden vector sequence of the input sequence as the aggregated representation of the text context information. The fully connected layer maps from the hidden layer size to the hidden layer size and processes the output of the selection layer. The category logit projection layer maps the output of the fully connected layer from the hidden layer size to the number of predefined categories and outputs the corresponding category logits, where a logit is the raw, unnormalized score of a category.
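The data flow through the three layers can be made concrete with a small, dependency-free Python sketch. It only illustrates shapes and order of operations; a practical implementation would use a deep learning framework on top of the base model's hidden states. The class name `ClassificationHead`, the random initialization, the use of an attention mask to locate the last real token, and the tanh activation between the two linear layers are all assumptions, since the patent does not specify them.

```python
import math
import random


def last_token_state(hidden_states, attention_mask):
    """Selection layer: hidden state of the last non-padding token."""
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last_index]


def linear(vector, weight, bias):
    """Compute W x + b, with the weight stored as a list of output rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def init_linear(rng, in_dim, out_dim):
    """Small random weights and zero biases (illustrative initialization)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class ClassificationHead:
    def __init__(self, hidden_size, num_classes, seed=0):
        rng = random.Random(seed)
        self.dense = init_linear(rng, hidden_size, hidden_size)      # hidden -> hidden
        self.out_proj = init_linear(rng, hidden_size, num_classes)   # hidden -> classes

    def __call__(self, hidden_states, attention_mask):
        pooled = last_token_state(hidden_states, attention_mask)
        # tanh is an assumed activation; the source does not name one.
        hidden = [math.tanh(v) for v in linear(pooled, *self.dense)]
        return linear(hidden, *self.out_proj)  # one logit per category


head = ClassificationHead(hidden_size=4, num_classes=10)
states = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [9.0, 9.0, 9.0, 9.0]]
logits = head(states, attention_mask=[1, 1, 0])  # third position is padding
predicted = max(range(len(logits)), key=logits.__getitem__)
```

Selecting the last non-padding token matters for decoder-only base models, where only the final position has attended to the whole input; padded positions must not be picked up by the selection layer.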
[0029] During the training phase, the classification model uses an external large language model to generate soft labels or perform label semantic fusion, and learns the resulting probability distribution through supervised fine-tuning. Supervised fine-tuning is necessary because the classification head structure is initially untrained: its weights must be updated so that the hidden-layer states map correctly to the target categories while output normalization is maintained and classification is reliable. Fine-tuning was performed using labeled data generated by commercial models and Chain-of-Thought (CoT) text; specifically, a fine-tuning dataset of over 7800 labeled data points was created and subjected to 50-fold validation. During the inference phase, only the supervised fine-tuned target model is used independently for classification, without invoking an external large language model (LLM).
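Learning a probability distribution from soft labels is commonly done by minimizing the cross-entropy between the soft target and the softmax of the model's logits. The Python sketch below shows that loss for a single sample. Treating weighted LLM votes as the soft label, and the helper names used here, are assumptions for illustration; the patent states only that soft labels are generated and that cross-entropy loss is reported.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def soft_cross_entropy(logits, target_probs):
    """Cross-entropy between a soft target distribution and model logits."""
    log_probs = log_softmax(logits)
    return -sum(p * lp for p, lp in zip(target_probs, log_probs))


def soft_label(votes, num_classes):
    """Turn weighted votes {class index: weight} into a distribution."""
    total = sum(votes.values())
    return [votes.get(c, 0.0) / total for c in range(num_classes)]


target = soft_label({0: 3.0, 2: 1.0}, num_classes=4)   # [0.75, 0.0, 0.25, 0.0]
untrained = soft_cross_entropy([0.0, 0.0, 0.0, 0.0], target)
better = soft_cross_entropy([2.0, 0.0, 1.0, 0.0], target)
```

An untrained head that outputs equal logits gives a loss of ln(4) for four classes regardless of the target; fine-tuning lowers the loss by moving probability mass toward the classes favored by the soft label.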
[0030] Step 4: Use the target model to classify the tourism and geography texts to be classified and output the corresponding category labels.
[0031] Taking tourism texts as an example, this invention develops classification standards and few-shot learning templates for tourism texts through expert consultation. By combining few-shot learning with commercial large models, a label fusion algorithm based on multi-model pseudo-labels and Cohen's Kappa values, secondary development of open-source large language models, and parameter-efficient fine-tuning, it establishes a text classification method that is easy to deploy and reproduce, supports customizable categories, achieves good classification results, and requires no pre-labeling.
[0032] In the example, the following texts from the tourism and geography domains are selected as the texts to be classified:
1. "The Palace Museum announced that starting next month, it will implement a reservation and visitor limit measure, restricting the number of visitors to no more than 30,000 per day."
2. "The best three-day itinerary for Dali, Yunnan: Day 1: Visit the ancient city; Day 2: Take a boat trip on Erhai Lake; Day 3: Head to Cangshan Mountain."
3. Abstract of "A Study on the Impact of Ecotourism on Vegetation Cover of Nature Reserves": This paper uses remote sensing monitoring...
4. "A Winter Trip to Jiuzhaigou: Jiuzhaigou in its silver-clad state has a unique charm; Changhai Lake is completely frozen..."
The above texts are input into the classification process of this invention: first, the classification task is constrained to the categories of the pre-set text label system; then, a training dataset is constructed through the few-shot learning and label fusion strategy of the large language models (LLMs); after the classification model is trained, the target model is obtained and used to classify the above texts and output the results.
[0033] In this usage example, the classification results of the target model for the texts to be classified are shown in Figure 2. In practical applications, the training dataset is typically constructed by sampling from the text dataset to be classified, and the number of tourism and geography texts to be classified is often significantly larger than the training dataset. In one embodiment, 10,000 text samples are sampled from the text dataset to be classified. After label generation and fusion processing, 7,800 samples with fused labels are retained. These are then randomly split at a 4:1 ratio to construct the training and validation datasets. During the model application phase, approximately 460,000 tourism and geography texts are classified. Based on the validation set obtained from the above sampling, the posterior accuracy is approximately 95.83%, which is 4.17% higher than that of the RoBERTa baseline model. Finally, the optimal model is used to classify all 459,751 high-quality Chinese texts. The results verify the effectiveness of the label fusion strategy and the few-shot generation method in improving dataset reliability and classification performance. The classification results can be used for tasks such as intelligent tourism recommendation, geographical knowledge construction, and tourism map generation.
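The 4:1 split in this embodiment can be sketched in a few lines of Python. The function name `split_train_validation` and the fixed random seed are illustrative assumptions; integer indices stand in for the 7,800 retained samples.

```python
import random


def split_train_validation(samples, train_parts=4, validation_parts=1, seed=42):
    """Shuffle the samples and split them at the given ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    cut = len(shuffled) * train_parts // (train_parts + validation_parts)
    return shuffled[:cut], shuffled[cut:]


# 7,800 fused-label samples, represented here by their indices.
train, validation = split_train_validation(range(7800))
print(len(train), len(validation))  # 6240 1560
```

Under this ratio the validation set holds 1,560 samples, which is the set on which the reported accuracy would be measured if the split was made exactly as described.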
Claims
1. A method for classifying tourism and geographical texts based on a large language model, characterized in that it includes the following steps: Step 1: Construct a classification sample dataset based on the tourism and geography text dataset, and formulate text labels of multiple categories for text classification; Step 2: Based on the classification sample dataset and text labels, construct a training dataset containing labels through label generation and label fusion strategies, where the label generation and label fusion strategies refer to using at least two different large language models (LLMs) and a unified prompt word template to generate corresponding preliminary predicted labels for the classification sample dataset and text classification labels, and calculating consistency scores to construct the training dataset containing labels; Step 3: Train the constructed classification model on the training dataset to obtain the target model for tourism and geography text classification, where the classification model is attached after the last hidden state of a local lightweight basic large language model; Step 4: Use the target model to classify the tourism and geography texts to be classified and output the corresponding category labels.
2. The tourism and geography text classification method based on a large language model according to claim 1, characterized in that, The specific steps of step 1 are as follows: Step 1.1: First, obtain the original text dataset in the fields of tourism and geography. The original text data includes, but is not limited to, scenic spot introduction text, travel guide text, tourism policy and planning text, and geographical knowledge description text. Chinese texts were randomly selected from the original text datasets in the fields of tourism and geography as the classification sample dataset, with the number of randomly selected texts being greater than or equal to 10,000. Step 1.2: Analyze the text features of the original text dataset in the fields of tourism and geography. Combine the topic distribution, semantic structure and domain knowledge system of the corpus to formulate text classification labels. The text classification labels include geography and geology, history and culture, environmental protection, architecture and landscape, tourism industry construction and planning, scenic spot introduction, travel guides and itineraries, tourism vocational education and tourism policy.
3. The tourism and geography text classification method based on a large language model according to claim 2, characterized in that the specific steps of step 2 are as follows: Step 2.1: Use at least two different external large language models (LLMs) and a unified prompt word template to generate preliminary predicted labels for the classification sample dataset and text classification labels. Each LLM completes the annotation independently, forming a multi-model annotation set. The LLMs include GPT-4.1, Deepseek-R1, Deepseek-V3, or Moonshot-V1. The unified prompt word template includes a task description, a category constraint, an input text placeholder, and an output format constraint. The task description instructs the LLM to perform a text classification task and specifies that the classification target belongs to the tourism and geography text domain. The category constraint lists or limits the range of optional text classification labels so as to constrain the classification output of the LLM. The input text placeholder carries a text from the classification sample dataset, and the output format constraint specifies how the text classification label is expressed in the model output. Step 2.2: Calculate the Cohen's Kappa coefficient between each pair of LLMs as the consistency score; Step 2.3: Based on the consistency scores, calculate a Kappa-weighted score for each LLM's prediction on each single text in the classification sample dataset, retain only high-confidence labels whose score is above a given threshold, and finally obtain a training dataset containing labels.
4. The method for classifying tourism and geographical texts based on a large language model according to any one of claims 1 to 3, characterized in that the classification model in step 3 is a classification head structure set after the last hidden state of the local lightweight basic large language model, so as to map the last hidden vector to a preset fixed text category label set and output the corresponding text classification result. The local lightweight basic large language model includes Llama-3.2, Qwen-2.5 and Qwen-3, and the classification model is obtained by adding the classification head structure to Llama-3.2, Qwen-2.5 and Qwen-3.
5. The tourism and geography text classification method based on a large language model according to claim 4, characterized in that: The classification head structure includes a selection layer, a fully connected layer, and a category logit projection layer connected in sequence. The selection layer selects the hidden state corresponding to the last token in the last hidden vector sequence of the input sequence as the aggregated representation of the text context information. The fully connected layer maps from the hidden layer size to the hidden layer size and processes the output of the selection layer. The category logit projection layer maps the output of the fully connected layer from the hidden layer size to the number of predefined categories and outputs the corresponding category logits, where a logit is the raw, unnormalized score of a category.
6. The tourism and geography text classification method based on a large language model according to claim 5, characterized in that, In step 3, during the training phase, the classification model uses an external large language model (LLM) to generate soft labels or perform label semantic fusion, and learns its probability distribution through supervised fine-tuning methods. During the inference phase, only the supervised fine-tuned target model is retained for independent classification, and the external large language model LLM is no longer called.
7. A tourism and geographic text classification system based on a large language model, comprising a memory, a processor, and a computer program stored in the memory, characterized in that: The processor executes the computer program to implement the steps of the method according to any one of claims 1-6.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that: When executed by a processor, the computer program implements the steps of the method according to any one of claims 1-6.
9. A computer program product, comprising a computer program, characterized in that: When executed by a processor, the computer program implements the steps of the method according to any one of claims 1-6.