Railway survey and design standard specification retrieval method based on large model

By constructing a knowledge graph and a multi-path recall mechanism for railway survey and design standards and specifications, the method addresses the low efficiency of existing retrieval techniques and makes specification retrieval more comprehensive and accurate, improving both retrieval efficiency and the relevance of results.

WO2026123845A1 · PCT designated stage · Publication Date: 2026-06-18 · CHINA RAILWAY DESIGN GRP CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
CHINA RAILWAY DESIGN GRP CO LTD
Filing Date
2025-09-10
Publication Date
2026-06-18

Abstract

Disclosed in the present invention is a railway survey and design standard specification retrieval method based on a large model, comprising: S1, extracting standard specification provisions from a specification document; S2, using a large model to construct a knowledge graph for railway survey and design standard specification retrieval; S3, using a vector database to vectorize and store provision information; S4, using the large model to extract key information in a user question; S5, screening the standard specification provisions by means of a multi-channel recall mechanism; and S6, using the large model to perform secondary ranking and outputting a result. The method realizes accurate retrieval of railway survey and design standard specifications, and provides effective technical guarantee for improving the quality of railway survey and design results; the knowledge graph is constructed for railway survey and design standard specification retrieval; and the comprehensiveness, accuracy and stability of standard provision retrieval are improved by means of performing multi-channel recalling and performing secondary ranking by the large model.

Description

A method for retrieving railway survey and design standards and specifications based on a large model

Technical Field

[0001] This invention relates to the field of rail transit information technology, and in particular to a method for retrieving railway survey and design standards and specifications based on a large model.

Background Technology

[0002] Improving the quality of railway survey and design deliverables is a crucial objective of railway engineering. Quality control and feedback during the survey and design process, as well as upon delivery, play a vital role in ensuring the smooth completion and safe operation of railway projects. Currently, quality control of railway survey and design deliverables largely relies on manual, item-by-item review: reviewers consult numerous relevant standards to verify whether the design deliverables meet the requirements and then produce a review report. This review method is inefficient and has many shortcomings. For example, manually checking each clause of the standard documents consumes significant human resources and time and is prone to errors when comparing standard provisions. The low review efficiency also delays the delivery of design deliverables, which to some extent affects the construction schedule and costs and can even jeopardize project quality and safety.

[0003] Railway survey and design specifications contain a wealth of engineers' experience and researchers' experimental knowledge, serving as crucial standards for controlling the quality of design outcomes. Therefore, accurately and comprehensively retrieving the required survey and design specifications based on user needs has become a necessity. Existing retrieval methods primarily rely on Natural Language Processing (NLP), using metrics such as term frequency and vector similarity to find the specification clauses most similar to the user's question. However, these methods neglect two main issues: firstly, they overlook the citation relationships between clauses, resulting in incomplete search results; secondly, when evaluating relevance to the user's question, they employ a simplistic evaluation approach and lack effective methods for assessing semantic similarity.

Summary of the Invention

[0004] To address the problems in the background art, this invention provides a comprehensive, accurate, reliable, and efficient method for retrieving railway survey and design standards and specifications based on a large model.

[0005] Therefore, the present invention adopts the following technical solution:

[0006] A method for retrieving railway survey and design standards and specifications based on large models includes the following steps:

[0007] S1, Extract standard specification clauses from the specification document:

[0008] First, the docx library in Python is used to process the entire document of railway survey and design specifications in doc or docx format, extracting the content of each paragraph one by one;

[0009] Then, use regular expressions to determine whether the article number appears at the beginning of the paragraph text;

[0010] If it does, a new clause is recorded and becomes the current standard specification clause;

[0011] If it does not, and a current standard specification clause exists, the paragraph text is appended to the content of the current standard specification clause; otherwise, no processing is performed.

[0012] Finally, a list of clauses in the specification document is obtained, which includes all standard specification clauses marked with clause numbers;

[0013] S2, using a large model to construct a knowledge graph for retrieving railway survey and design standards and specifications, includes the following steps:

[0014] S21, use large-model prompt engineering to extract keywords from each standard specification clause in the clause list obtained in S1;

[0015] S22, use large-model prompt engineering to obtain an overview of each standard specification clause;

[0016] S23, use large-model prompt engineering, with prompts written for the purpose, to extract the reference relationships of each standard specification clause;

[0017] S24. To improve the accuracy of the reference relationships, the reference relationships of some standard specification clauses are first labeled manually. Combined with the prompts in S23, they form a large-model fine-tuning dataset of question-and-answer pairs. The large model is then fine-tuned on this dataset using the QLoRA fine-tuning method to obtain a reference-relationship large model, which is used to process the remaining standard specification clauses and obtain their reference relationships.

[0018] S25, store the standard specification clauses and the citation relationships obtained in S24 into the graph database to obtain a knowledge graph for standard specification retrieval;

[0019] S3, using a vector database to store the text information in a vectorized manner:

[0020] The keywords and summaries in S2 are converted into vectors. Using a vector database, the vectorized keywords and summaries are stored separately to obtain a keyword vector database and a summary vector database.

[0021] S4, using a large model to extract key information from user questions:

[0022] Using large model prompt word engineering, key information is extracted from the text input by the user. The key information includes: engineering object and object attributes;

[0023] S5, using a multi-channel recall mechanism to filter standard specification clauses:

[0024] For the engineering objects and object attributes extracted in step S4, the top k approximate keywords are obtained from the keyword vector database obtained in step S3 based on the vector distance. Then, a graph database query statement is constructed for the approximate keywords. The standard and specification articles containing these keywords and their citation relationships are selected from the knowledge graph obtained in S2. After overall deduplication, they are used as the initial screening results of the articles.

[0025] The standard and normative clauses in the initial screening results are sorted using the BM25 text matching algorithm. The top n standard and normative clauses are selected and used as the results of graph query recall.

[0026] Then, the large model is used to directly vectorize the text input by the user in S4. The top n standard and normative clauses most relevant to that text are queried from the summary vector database obtained in step S3. The queried content is used as the result of vector similarity-based recall.

[0027] Finally, the results of the graph-based query recall and the results of the vector similarity-based recall are deduplicated, and the remaining standard and normative clauses are used as the results of the standard and normative clause selection.

[0028] S6, perform secondary sorting using a large model and output the results:

[0029] The large model and corresponding prompt words are used to calculate the relevance between the normative clauses selected in step S5 and the text input by the user, and a similarity value is calculated for each normative clause after multi-path recall; then, all normative clauses after multi-path recall are sorted according to the similarity value, and the top n clauses are returned as the final query results.

[0030] The regular expression in S1 is “'^[0-9]+\.[0-9]+\.[0-9a-zA-Z]+'”; the article number includes numbers, decimal points, and uppercase and lowercase letters.

[0031] The reference relationship mentioned in step S2 includes: the name of the referenced external specification document, the clause number of the referenced external standard specification text, and the other clause numbers in the referenced current specification document; the external specification document refers to other specification documents besides the current specification document.

[0032] The similarity value in step S6 is between 0 and 100.

[0033] In step S25, using the citation relationships obtained in step S24, the keywords in step S21, and the overview of the standard and specification clauses in step S22, four types of triplet relationships are constructed for each standard and specification clause: <clause (overview), includes, keyword>, <clause (overview), citation, clause (overview)>, <clause (overview), reference, specification document>, and <specification document, containing, clause (overview)>, thereby forming a knowledge graph for standard and specification retrieval.

[0034] Preferably, the prompts are written using Markdown syntax.

[0035] Preferably, in step S4, if the output of the large model contains text other than the key information, the regular expression '(?<={).*?(?=})' is used to filter out the irrelevant text so that only the key information is kept.

[0036] Compared with the prior art, the present invention has the following beneficial effects:

[0037] 1. The method of this invention focuses on the citation relationships between standard and specification clauses, and uses large model prompts and fine-tuning techniques to extract the citation relationships between standard document clauses, as well as the keywords and overview information contained in the clauses, to construct a knowledge graph for standard and specification retrieval, thereby improving the comprehensiveness and accuracy of standard and specification retrieval results and providing a reliable information source for subsequent retrieval processes.

[0038] 2. The method of the present invention comprehensively utilizes multiple retrieval and ranking methods such as knowledge graph query, BM25 ranking, and vector similarity retrieval to perform multi-way recall retrieval and ranking of standard and specification provisions, ensuring the stability, comprehensiveness, and accuracy of standard and specification retrieval results.

[0039] 3. The method of this invention utilizes the natural language understanding capabilities of large models and uses prompt engineering to score the relevance of the initial screening results of standard and normative clauses to achieve secondary ranking. From the perspective of semantic understanding, it ensures that standard and normative clauses that are more relevant to user questions are displayed first.

Attached Figure Description

[0040] Figure 1 is a flowchart of the standard specification retrieval method of the present invention;

[0041] Figure 2 shows the overall framework of the standard specification retrieval method of this invention;

[0042] Figure 3 shows an example of the standard specification knowledge graph constructed in this invention.

Detailed Implementation

[0043] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.

[0044] As shown in Figures 1 and 2, the railway survey and design standard and specification retrieval method based on a large model of the present invention includes the following steps:

[0045] S1, Extract standard specification clauses from the specification document:

[0046] First, the docx library in Python is used to process railway survey and design specification documents in doc or docx format, extracting the content of each paragraph one by one.

[0047] Then, regular expressions are used for filtering. The regular expression used is: "'^[0-9]+\.[0-9]+\.[0-9a-zA-Z]+'". The meaning of this expression is: to filter whether the text of the paragraph begins with a clause number. The clause number includes numbers, decimal points, and uppercase and lowercase letters.

[0048] If the paragraph text begins with a clause number, a new clause is recorded and becomes the current clause.

[0049] If the paragraph text does not begin with a clause number but a current clause exists, the text is appended to the content of the current clause; otherwise, no processing is performed. This process is repeated until all paragraphs in the specification document have been read.

[0050] Finally, a list of clauses in the specification document is obtained, which includes all standard specification clauses marked with clause numbers.
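The paragraph-grouping logic of S1 can be sketched roughly as follows. This is a minimal illustration rather than the applicant's implementation: the helper names `read_paragraphs` and `split_clauses` are hypothetical, blank paragraphs are skipped as an assumption, and the third-party python-docx package is only imported when a file is actually read. python-docx opens .docx files, so a legacy .doc file would presumably need converting first.

```python
import re

# Pattern quoted in the text: a clause number such as "3.4.5" at the start of a paragraph.
CLAUSE_NUMBER = re.compile(r"^[0-9]+\.[0-9]+\.[0-9a-zA-Z]+")


def read_paragraphs(path):
    """Read paragraph texts from a .docx file (needs the python-docx package)."""
    from docx import Document  # lazy import so the rest of the sketch runs without it
    return [p.text for p in Document(path).paragraphs]


def split_clauses(paragraphs):
    """Group paragraph texts into clauses keyed by their leading clause number."""
    clauses = []
    current = None
    for text in paragraphs:
        text = text.strip()
        if not text:
            continue  # assumption: blank paragraphs carry no content
        match = CLAUSE_NUMBER.match(text)
        if match:
            # A new clause starts here and becomes the current clause.
            current = {"number": match.group(0), "text": text}
            clauses.append(current)
        elif current is not None:
            # Continuation paragraph: append it to the current clause.
            current["text"] += "\n" + text
        # Text before the first numbered clause is ignored.
    return clauses


sample = [
    "Chapter 3 General provisions",
    "3.4.5 The fire separation distance shall comply with this code.",
    "1. Located on the ground floor.",
    "3.4.6 Smoke exhaust facilities shall be provided.",
]
clauses = split_clauses(sample)
print([c["number"] for c in clauses])
```

Note that a list item such as "1. Located on the ground floor." does not match the pattern, so it is attached to clause 3.4.5 instead of starting a new clause.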

[0051] S2, using a large model to construct a knowledge graph for retrieving railway survey and design standards and specifications, includes the following steps:

[0052] S21. Use large-model prompt engineering to extract keywords from the standard specification clauses obtained in step S1. The keyword extraction principles and the output format are defined in the prompt, and a few examples are provided to guide the large model in extracting keywords from the standard specification clauses. The specific prompt content is as follows:

[0053] -------------

[0054] Now you are a word segmentation expert, and the requirements are:

[0055] 1. Your answer only contains word segmentation results;

[0056] 2. All extracted words are either project objects or object attributes;

[0057] 3. The word segmentation results are output in the format [word1; word2; word3]. Below are some examples of word segmentation; please refer to these examples for your own word segmentation:

[0058] (1) Example 1:

[0059] Input: 3.4.5 The fire separation distance between civil buildings with a building height greater than 100m and adjacent buildings shall comply with Articles 3.4.5, 3.5.3 and 4.2.1 of this code.

[0060] Output: [Building height; Civil building; Adjacent buildings; Fire separation distance].

[0061] (2) Example 2:

[0062] Input: Storage tanks of the same or similar fire hazard class should preferably be arranged within each fire dike. Boiling-over oil storage tanks should not be located within the same fire dike as non-boiling-over oil storage tanks. Above-ground and semi-underground storage tanks should not be located within the same fire dike as underground storage tanks.

[0063] Outputs: [fire dikes; fire hazards; storage tanks; boiling overflow oil storage tanks; non-boiling overflow oil storage tanks; above-ground tank farms; semi-underground tank farms; underground tanks].

[0064] (3) Example 3:

[0065] Input: 4.3.7 The fire separation distance between liquid hydrogen and liquid ammonia storage tanks and buildings, storage tanks, storage yards, etc. can be determined by reducing the fire separation distance of the corresponding volume of liquefied petroleum gas storage tanks by 25% according to Article 4.4.1 of this code.

[0066] Outputs: [Liquid hydrogen; liquid ammonia storage tanks; buildings; storage tanks; storage yards; fire separation distances; liquefied petroleum gas storage tanks].

[0067] What should the output be when the input is {input}?

[0068] -------------

[0069] When using the above prompt for keyword extraction, simply replace "{input}" with the standard specification clause from which keywords are to be extracted, and send the entire prompt to the large model to obtain keywords in the specified format.
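As a rough sketch of how the "{input}" substitution and the "[word1; word2; word3]" output format might be handled in code, the following uses a shortened stand-in for the prompt. The function names and the template constant are hypothetical, and the call to the large model itself is left out.

```python
import re

# Shortened stand-in for the keyword prompt; the full text is the one quoted above.
KEYWORD_PROMPT = (
    "Now you are a word segmentation expert...\n"
    "What should the output be when the input is {input}?"
)


def build_keyword_prompt(template, clause_text):
    """Fill the clause text into the prompt template."""
    # str.replace is used instead of str.format so other braces stay untouched.
    return template.replace("{input}", clause_text)


def parse_keywords(model_output):
    """Parse '[word1; word2; word3]' into a list of unique keywords, keeping order."""
    match = re.search(r"\[(.*?)\]", model_output, flags=re.DOTALL)
    if match is None:
        return []
    keywords = []
    for word in match.group(1).split(";"):
        word = word.strip()
        if word and word not in keywords:
            keywords.append(word)
    return keywords


prompt = build_keyword_prompt(KEYWORD_PROMPT, "3.4.5 The fire separation distance ...")
reply = "Output: [Building height; Civil building; Adjacent buildings; Fire separation distance]."
print(parse_keywords(reply))
```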

[0070] S22. Using large-model prompt engineering, briefly summarize the main content of each standard specification clause. The summary should not exceed 200 words and should include the engineering objects and object attributes addressed in the original clause, thus giving an overview of the standard specification clause.

[0071] S23. Prompts are written using large-model prompt engineering techniques, and the citation relationships of each standard specification clause are extracted. These relationships include: the name of the referenced external specification document, the clause number of the referenced external standard specification text, and the clause numbers of other referenced clauses in the current specification document. External specification documents are specification documents other than the current one. The techniques involved include writing the prompt in Markdown syntax. The specific prompt content is as follows:

-------------

[0072] #background

[0073] You are an information extraction expert. Please extract the referenced standard documents and related standard clauses from known standard and specification clauses, and output the following format:

[0074] {"Standard Document":["Document 1":{"Relevant Clauses":["Clause 1 of Document 1","Clause 2 of Document 1"]},"Document 2":{"Relevant Clauses":["Clause 1 of Document 2","Clause 2 of Document 2"]}],"Relevant Clauses":["Clause 1 of this document","Clause 2 of this document"]}.

[0075] #Require

[0076] The following requirements must be met when extracting information:

[0077] ##1. Each selected specification document and standard specification clause must be the original text of known information;

[0078] ##2. If the known information does not include the standard specification clause number, specification document, or relevant clause, then output: {"Specification document": [],"Relevant clause": []}.

[0079] ##3. Please output the results in JSON format.

[0080] ##4. The relevant clauses must appear in the text in the form of "Article xxx" or "Article xxx, yyy", where "xxx" and "yyy" are the relevant clauses to be extracted. If such a description does not exist, the "Relevant Clauses" field in the output will be empty. Furthermore, if the relevant clauses are repeated, only one should be retained.

[0081] ##5. Standard documents must begin with a book title mark, including the book title mark, its contents, and the code that follows. If the input standard document does not include a book title mark, the "Standard Document" in the output will be empty.

[0082] ##6. If the input standard specification clauses mention referenced standard documents, such as specific clauses of "Document 1", then these clauses will be added to the list of relevant clauses of "Document 1" as "Clause xxx of Document 1".

[0083] ##7. If multiple specification documents with the same name are extracted, only one will be kept.

[0084] ##8. When the referenced standard or specification clauses are expressed in the range of "Article xxx to Article yyy", please calculate and list all the standard or specification clauses within the range as the result of the relevant clauses.

[0085] ##9. The output should strictly conform to the JSON format.

[0086] #example

[0087] When the known information is: "6.1.2 When the waiting area and passenger hall of a railway passenger station meet the following conditions, the building area of each fire compartment shall not exceed 10,000 m²:

[0088] 1. Located on the ground floor, a single-story elevated floor, or on the second floor with half of the direct external evacuation exits and an enclosed indoor stairwell.

[0089] 2. It is equipped with an automatic sprinkler system, smoke exhaust facilities and an automatic fire alarm system, as required by Article 3.4.1 of this standard.

[0090] 3. The interior decoration design complies with the relevant provisions of Articles 1.2.5 and 1.5.6 of the "Code for Fire Protection Design of Interior Decoration of Buildings" (GB50222)." Your output should then be:

[0091] {"Standard Document": ["Code for Fire Protection Design of Interior Decoration of Buildings GB50222": {"Relevant Clauses": ["1.2.5", "1.5.6"]}], "Relevant Clauses": ["3.4.1"]}.

[0092] Based on the above, what should your output be when the input is "{input}"? Please think about it step by step and only output the result.

[0093] -------------

[0094] When using the above prompts to extract reference relationships, "{input}" needs to be replaced with the text of the standard specification clause to be processed. In addition, the above prompts use descriptions that conform to Markdown syntax to represent the hierarchical relationship of the specification documents in the prompts, enabling the large model to understand the user's intent more accurately.

[0095] As can be seen, with the above prompt the large model can extract, for each standard specification clause, its references to other clauses of the same specification and to the names and clause numbers of other specification documents.
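Requirement 8 of the prompt asks the model to list every clause inside a range such as "Article xxx to Article yyy". A deterministic helper along the following lines could be used to check or post-process that part of the model output. It is only a sketch: `expand_clause_range` is a hypothetical name, and it assumes both endpoints share the same prefix and differ only in a purely numeric last component.

```python
def expand_clause_range(first, last):
    """Expand a clause range such as 3.4.1 to 3.4.4 into every clause number in it."""
    first_parts = first.split(".")
    last_parts = last.split(".")
    # Assumption: only the final, numeric component changes across the range.
    if first_parts[:-1] != last_parts[:-1]:
        raise ValueError("range endpoints must share the same prefix")
    start, end = int(first_parts[-1]), int(last_parts[-1])
    if start > end:
        raise ValueError("range start must not exceed range end")
    prefix = ".".join(first_parts[:-1])
    head = prefix + "." if prefix else ""
    return [f"{head}{i}" for i in range(start, end + 1)]


print(expand_clause_range("3.4.1", "3.4.4"))
```

Clause numbers with a letter suffix, which the clause-number pattern in S1 allows, would raise a `ValueError` here and need separate handling.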

[0096] S24. To further improve the accuracy of extracting citation relationships between standard and specification clauses, a fine-tuning dataset is first constructed, using the prompt from step S23 as the instruction and manually prepared answers as the outputs. This dataset consists of a series of question-and-answer pairs stored in JSON format, with the specific format of each pair as follows:

[0097] ------------

[0098] {

[0099] "instruction": "You are an information extraction expert. Please extract the referenced standard documents and related clauses from the known standard specification clauses and form the output. The output format is: {"Standard Document": ["Document 1": {"Related Clauses": ["Clause 1 of Document 1", "Clause 2 of Document 1"]},"Document 2": {"Related Clauses": ["Clause 1 of Document 2", "Clause 2 of Document 2"]}],"Related Clauses": ["Clause 1 of this document", "Clause 2 of this document"]}.

[0100] #The following requirements must be met when extracting information:

[0101] ##1. Each selected specification document and standard specification clause must be the original text of known information;

[0102] ##2. If the known information does not include the clause number, standard document, or relevant clauses, then output: {"Standard document": [],"Relevant clauses": []}.

[0103] ##3. Please output the results in JSON format.

[0104] ##4. The relevant clauses must appear in the text in the form of "Article xxx" or "Article xxx, yyy", where "xxx" and "yyy" are the relevant clauses to be extracted. If such a description does not exist, the "Relevant Clauses" field in the output will be empty. Furthermore, if the relevant clauses are repeated, only one should be retained.

[0105] ##5. Standard documents must begin with a book title mark, including the book title mark, its contents, and the code that follows. If the input standard document does not include a book title mark, the "Standard Document" in the output will be empty.

[0106] ##6. If the input standard specification clauses mention referenced standard documents, such as specific clauses of "Document 1", then these clauses will be added as "Clause x of Document 1" to the list of relevant clauses of "Document 1".

[0107] ##7. If multiple specification documents with the same name are extracted, only one will be kept.

[0108] ##8. When the referenced standard or specification clauses are expressed in the range of "Article xxx to Article yyy", please calculate and list all the standard or specification clauses within the range as the result of the relevant clauses.

[0109] ##9. The output should strictly conform to the JSON format.

[0110] "input": "Considering that some buildings have high decoration standards and require the use of combustible materials, and referring to Section 6.5.2 of this specification, in order to meet practical needs without compromising overall safety performance, it is stipulated that fire protection facilities be installed to compensate for the insufficient flammability rating of decoration materials. According to Clauses 5.6.1 and 5.2.3 of the American Standard NFPA 101, if automatic fire suppression systems are adopted, the flammability rating of the decoration materials used can be reduced by one level. This clause is formulated with reference to the above provisions."

[0111] "output": "{"Standard Document": ["Personal Safety Standard NFPA 101": {"Relevant Clauses": ["5.6.1", "5.2.3"]}],"Relevant Clauses": ["6.5.2"]}"

[0112] },

[0113] ------------

[0114] LoRA fine-tuning is performed on the Qwen2.5 large model at the 7B scale, using a large-model fine-tuning dataset containing a number of such question-answer pairs, to obtain a reference-relationship large model for extracting reference relationships between standard specification clauses. The reference-relationship large model is then used to extract the reference relationships of the remaining standard specification clauses.
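One way the instruction/input/output records could be assembled is sketched below. The helper `make_record` is hypothetical, and the answer layout is an assumption: each referenced document is wrapped in its own object so that the output string is strictly valid JSON, which differs slightly from the layout quoted in the prompt. The fine-tuning run itself is not shown.

```python
import json


def make_record(instruction, clause_text, external_refs, internal_refs):
    """Build one fine-tuning sample as an instruction / input / output triple."""
    answer = {
        # Assumed layout: one single-key object per referenced external document.
        "Standard Document": [
            {name: {"Relevant Clauses": numbers}}
            for name, numbers in external_refs.items()
        ],
        "Relevant Clauses": internal_refs,
    }
    return {
        "instruction": instruction,
        "input": clause_text,
        # The expected answer is stored as a JSON string, as in the sample above.
        "output": json.dumps(answer, ensure_ascii=False),
    }


record = make_record(
    instruction="You are an information extraction expert. ...",
    clause_text="... referring to Section 6.5.2 of this specification ... "
                "Clauses 5.6.1 and 5.2.3 of the American Standard NFPA 101 ...",
    external_refs={"NFPA 101": ["5.6.1", "5.2.3"]},
    internal_refs=["6.5.2"],
)
dataset_json = json.dumps([record], ensure_ascii=False, indent=2)
print(dataset_json)
```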

[0115] S25, save as a knowledge graph:

[0116] Using the citation relationships of standard and specification provisions obtained in step S24, the keywords in step S21, and the overview of standard and specification provisions in step S22, four types of triplet relationships are constructed for each standard and specification provision: <provision (overview), includes, keyword>, <provision (overview), citation, provision (overview)>, <provision (overview), reference, specification document>, and <specification document, containing, provision (overview)>. This forms a knowledge graph for standard and specification retrieval, as shown in Figure 3.
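To make the triple construction concrete, the following sketch turns one clause record into (head, relation, tail) tuples. The record fields and the relation labels ("contain", "include", "cite", "reference") are illustrative placeholders rather than the labels used in the actual graph database, and writing the triples to a graph database is omitted.

```python
def build_triples(document_name, clause):
    """Turn one clause record into (head, relation, tail) triples."""
    head = clause["number"]
    triples = [(document_name, "contain", head)]
    # <clause, includes, keyword>
    triples += [(head, "include", kw) for kw in clause["keywords"]]
    # <clause, citation, clause> for references inside the same document
    triples += [(head, "cite", number) for number in clause["internal_refs"]]
    # <clause, reference, specification document> for external documents
    triples += [(head, "reference", doc) for doc in clause["external_docs"]]
    return triples


clause = {
    "number": "6.1.2",
    "keywords": ["waiting area", "fire compartment"],
    "internal_refs": ["3.4.1"],
    "external_docs": ["GB50222"],
}
triples = build_triples("Example Code", clause)
for triple in triples:
    print(triple)
```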

[0117] S3, using a vector database to store the text information in a vectorized manner:

[0118] All keywords and summaries of standard specification provisions are read from the standard specification knowledge graph obtained in step S2. A large model is used to compute text embeddings for the extracted keywords and summaries, transforming them from natural language into vectors. A vector database is then used to store the vectorized keywords and summaries separately, resulting in a keyword vector database and a summary vector database.
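The idea of storing embeddings and querying the nearest entries can be illustrated with a small in-memory stand-in. Everything here is an assumption made for illustration: `embed` is a toy character-bigram counter standing in for the large model's embedding, and `VectorStore` is a hypothetical class, not the API of any particular vector database.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: character-bigram counts."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """Minimal in-memory store: keeps (text, vector, payload) and ranks by cosine similarity."""

    def __init__(self):
        self.items = []

    def add(self, text, payload=None):
        self.items.append((text, embed(text), payload))

    def top_k(self, query, k):
        query_vec = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
        return [(text, payload) for text, _, payload in ranked[:k]]


keyword_store = VectorStore()
for keyword in ["fire compartment", "fire separation distance", "storage tank"]:
    keyword_store.add(keyword)

print(keyword_store.top_k("fire compartment area", 1))
```

A second store built the same way over clause summaries, with the clause number as payload, would play the role of the summary vector database.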

[0119] S4, using a large model to extract key information from user questions:

[0120] Using large model prompts, key information is extracted from the user's input text. This key information includes project objects and object attributes. The prompts used are as follows:

[0121] -------------

[0122] #Background: You are a sentence component analysis expert. Please extract the main engineering objects and their attributes described in the input sentence.

[0123] #Require

[0124] ##1. Please output the extracted results in JSON format.

[0125] ##2. The extracted "project object" must be a project entity. If the input does not contain the attributes of the project object, the attribute in the output should be "".

[0126] ##3. The extracted "project objects" and "attributes" must be the original text from the input statement.

[0127] #Sample

[0128] ##Example 1 When the input is "How should the fire compartments of a railway station be designed?", your output should be

[0129] {"Project Object": "Railway Station Building", "Attribute": "Fire Compartment"}

[0130] ##Example 2 When the input is "How should railway station buildings be designed", your output should be {"Project Object": "Railway Station Building", "Attribute": ""}.

[0131] ##Example 3 When the input is "In the design process of railway station buildings, if the fire compartment is designed on the first floor, how should the fire compartment area be calculated?", your output should be {"Project Object": "Railway Station Building", "Attribute": "Fire Compartment Area"}.

[0132] Now, please think about what your output should be when my input is "{input}".

[0133] -------------

[0134] When extracting project objects and object attributes using the above prompt, "{input}" is first replaced with the user's question, and the resulting prompt is sent to the large model. Because the prompt uses the chain-of-thought technique (the phrase "...please think step by step..."), the output may contain text other than the target JSON text, so a regular expression is needed to filter out the extra text. The regular expression used is: '(?<={).*?(?=})'. It matches the content between the opening and closing braces, which allows the target JSON text to be recovered from the output of the large model.
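A minimal sketch of this filtering step is shown below. Because the lookbehind and lookahead exclude the braces themselves, the sketch wraps the match in braces again before parsing. The function name `extract_key_info` is hypothetical, and the non-greedy match stops at the first closing brace, so a nested JSON object would not be handled.

```python
import json
import re

# Same pattern as in the text, with the braces escaped for readability.
INNER_JSON = re.compile(r"(?<=\{).*?(?=\})", flags=re.DOTALL)


def extract_key_info(model_output):
    """Pull the first flat JSON object out of a model reply, or return None."""
    match = INNER_JSON.search(model_output)
    if match is None:
        return None
    try:
        # The lookarounds drop the braces, so they are restored before parsing.
        return json.loads("{" + match.group(0) + "}")
    except json.JSONDecodeError:
        return None


reply = (
    "Let me think step by step. The sentence is about a station building. "
    '{"Project Object": "Railway Station Building", "Attribute": "Fire Compartment Area"} '
    "That is my answer."
)
print(extract_key_info(reply))
```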

[0135] S5, using a multi-channel recall mechanism to screen standard specification clauses, includes the following steps:

[0136] S51, Obtain the results of graph-based query recall:

[0137] First, the engineering objects and object attributes extracted in step S4 are vectorized using a large model. Using these vectors, the top k most similar keywords are queried from the keyword vector database obtained in step S3. For each retrieved keyword, a graph query statement is generated, and all standard specification clauses containing that keyword are retrieved from the knowledge graph constructed in step S2. Assuming a keyword is "keyword 1" and taking Cypher as an example, the generated graph query statement is: MATCH (a:Rule)-[:`include`]->(b:Keyword {text: 'keyword 1'}) RETURN a. Here, "(a:Rule)" represents a node in the knowledge graph whose category is "standard specification clause", and "(b:Keyword {text: 'keyword 1'})" represents the keyword node whose text is "keyword 1". Next, graph query statements are used to retrieve the clauses referenced by the clauses found above. Again taking Cypher as an example, the statement that retrieves the clauses referenced by a standard specification clause is: MATCH (a:Rule)-[:`reference`]->(b:Rule) RETURN b. Here, b is a clause referenced by standard specification clause a. Finally, all retrieved standard specification clauses are deduplicated as a whole, giving an initial screening of the clauses that may be relevant to the user's question.
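The query generation can be sketched as plain string and parameter construction. The sketch passes the keyword as a Cypher parameter (`$text`) instead of splicing it into the statement, which is a design choice made here, not something the text specifies. The `number` property used to pin the reference query to one clause is a hypothetical property name, and no database driver is involved.

```python
# Cypher statements modeled on the ones in the text; values are passed as parameters.
KEYWORD_QUERY = "MATCH (a:Rule)-[:`include`]->(b:Keyword {text: $text}) RETURN a"
# Assumption: each Rule node carries a 'number' property identifying the clause.
REFERENCE_QUERY = "MATCH (a:Rule {number: $number})-[:`reference`]->(b:Rule) RETURN b"


def build_keyword_queries(keywords):
    """One (statement, parameters) pair per approximate keyword."""
    return [(KEYWORD_QUERY, {"text": keyword}) for keyword in keywords]


def build_reference_queries(clause_numbers):
    """One (statement, parameters) pair per clause whose references are wanted."""
    return [(REFERENCE_QUERY, {"number": number}) for number in clause_numbers]


for statement, params in build_keyword_queries(["fire compartment", "waiting area"]):
    print(statement, params)
```

Keeping the statement fixed and varying only the parameters avoids quoting problems when a keyword itself contains a quote character.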

[0138] Then, the BM25 algorithm is used to sort the initial screening results of the clauses in S51, and the top n standard and normative clauses are selected as the results of graph query recall.
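For reference, a compact version of BM25 ranking is sketched below. It uses the common Okapi BM25 formulation with default parameters k1 = 1.5 and b = 0.75 and whitespace tokenization; these choices, and the function names, are assumptions for illustration, and Chinese text would need a proper word segmenter instead of `split()`.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumption: lowercase whitespace tokens; real clauses would need a segmenter."""
    return text.lower().split()


def bm25_rank(query, documents, n, k1=1.5, b=0.75):
    """Return the top n documents ranked by Okapi BM25 score against the query."""
    if not documents:
        return []
    tokenized = [tokenize(doc) for doc in documents]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    total = len(tokenized)
    scored = []
    for index, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((total - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scored.append((score, index))
    # Highest score first; ties keep the original order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [documents[index] for _, index in scored[:n]]


docs = [
    "fire compartment area of the waiting hall",
    "track gauge tolerance",
    "smoke exhaust in the fire compartment",
]
print(bm25_rank("fire compartment area", docs, 2))
```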

[0139] S52, Obtain the results of vector similarity-based recall:

[0140] The large model is used to directly vectorize the text input by the user. The top n standard specification clauses most relevant to the user's question are then queried from the summary vector database obtained in step S3, and the queried content is used as the result of vector similarity-based recall.
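
A minimal sketch of this lookup is given below, with a plain dictionary standing in for the summary vector database and cosine similarity as the relevance measure. Both are assumptions: the specification does not name the similarity metric, and the three-dimensional vectors and clause numbers are invented for the example.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def vector_recall(query_vec: list[float],
                  summary_vecs: dict[str, list[float]],
                  n: int) -> list[str]:
    """Return the ids of the n summaries most similar to the query vector."""
    ranked = sorted(summary_vecs,
                    key=lambda cid: cosine(query_vec, summary_vecs[cid]),
                    reverse=True)
    return ranked[:n]


# Hypothetical summary embeddings keyed by clause number.
summary_vecs = {
    "5.2.1": [0.9, 0.1, 0.0],
    "6.3.2": [0.0, 1.0, 0.0],
    "7.1.4": [0.7, 0.7, 0.0],
}
hits = vector_recall([1.0, 0.0, 0.0], summary_vecs, 2)
```

A production vector database would perform the same ranking with an approximate nearest-neighbor index rather than a full scan.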

[0141] S53, deduplicate the 2n standard specification clauses obtained from S51 and S52 to obtain the standard specification clauses after multi-channel recall.
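
The merge in S53 can be sketched in a few lines. Keeping the first-seen order is a choice made for this example, and the clause numbers are hypothetical.

```python
def merge_recall(graph_hits: list[str], vector_hits: list[str]) -> list[str]:
    """Union of both recall channels, duplicates removed, first-seen order kept."""
    return list(dict.fromkeys(graph_hits + vector_hits))


merged = merge_recall(["5.2.1", "7.1.4", "3.1.2"], ["5.2.1", "6.3.2", "7.1.4"])
```

Here n = 3 per channel yields four distinct clauses, which shows that the merged set contains at most 2n clauses.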

[0142] S6, using a large model for secondary sorting:

[0143] The relevance score between each standard specification clause retrieved in step S5 and the user-input text is calculated using prompt engineering. The prompt used for scoring is as follows:

[0144] -------------

[0145] Please assign a relevance score from 0 to 100 to each of the following entries, based on their relevance to "{query}", with 100 being the most relevant and 0 being completely irrelevant.

[0146] #Requirements

[0147] 1. The output results are sorted according to their relevance, and the original sequence number of the data is retained;

[0148] 2. The output results are organized in JSON format, for example: {[{Item number: "Original item number", Relevance score: "Relevance score of the item"}, {Item number: "Original item number", Relevance score: "Relevance score of the item"}]}

[0149] #Content Entries

[0150] ##1:{rule1}

[0151] ##2:{rule2}

[0152] ##3:{rule3}

[0153] ##4:{rule4}

[0154] ........

[0155] ##m:{rulem}

[0156] -------------

[0157] When the above prompt is used to score the relevance of the standard specification clauses after multi-channel recall, "{query}" is replaced with the text entered by the user, and "{rule1}" through "{rulem}" are replaced with the original text or the corresponding summary of the m different standard specification clauses after multi-channel recall. The value of m should be set so that the total length of the entire prompt does not exceed the context length accepted by the large model used.
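
One way to choose m is to add clauses until a length budget is reached, as in the sketch below. It uses a character count as a stand-in for the model's context length, which is an assumption; a real system would count tokens with the model's own tokenizer. The header is an abridged copy of the scoring prompt with the requirements section omitted, and the name build_prompt is illustrative.

```python
# Abridged scoring prompt; the requirements section is omitted for brevity.
HEADER = (
    "Please assign a relevance score from 0 to 100 to each of the following "
    'entries, based on their relevance to "{query}", with 100 being the most '
    "relevant and 0 being completely irrelevant.\n"
    "#Content Entries\n"
)


def build_prompt(query: str, rules: list[str], max_chars: int) -> tuple[str, int]:
    """Fill the prompt and append clauses while the budget allows.

    Returns the prompt and m, the number of clauses that fit.
    """
    # str.replace is used instead of str.format so that braces in the
    # query or in clause text cannot break the substitution.
    prompt = HEADER.replace("{query}", query)
    m = 0
    for index, rule in enumerate(rules, start=1):
        entry = f"##{index}:{rule}\n"
        if len(prompt) + len(entry) > max_chars:
            break
        prompt += entry
        m = index
    return prompt, m


# Hypothetical clause texts; the budget leaves room for two entries only.
rules = ["aaa", "bbb", "ccc"]
budget = len(HEADER.replace("{query}", "pier height")) + 16
prompt, m = build_prompt("pier height", rules, budget)
```

Clauses that do not fit could be scored in a further prompt, although the specification does not describe such batching.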

[0158] After the relevance scores of all standard specification clauses retrieved through multi-channel recall with respect to the user's question are obtained, the clauses are sorted by relevance score, and the n standard specification clauses most relevant to the user's question are output as the final query results.
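
The final sorting step might be sketched as follows. The reply is assumed to be a valid JSON array with the hypothetical keys "item" and "score"; the actual key names depend on the prompt wording, and entries with an unknown item number or a score outside 0 to 100 are simply skipped here.

```python
import json


def rank_clauses(llm_json: str, clauses: list[str], n: int) -> list[str]:
    """Return the n clauses with the highest relevance scores.

    clauses[i - 1] is the clause that was listed as item i in the prompt.
    """
    scored = []
    for entry in json.loads(llm_json):
        index = int(entry["item"])
        score = float(entry["score"])
        if 1 <= index <= len(clauses) and 0 <= score <= 100:
            scored.append((score, index))
    # Highest score first; ties fall back to the original item order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [clauses[index - 1] for _, index in scored[:n]]


# Hypothetical model reply and clause texts.
reply = ('[{"item": "2", "score": "35"}, {"item": "1", "score": "90"}, '
         '{"item": "3", "score": "70"}]')
clauses = ["5.2.1 pier height", "6.3.2 tunnel lining", "7.1.4 bridge foundation"]
final = rank_clauses(reply, clauses, 2)
```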

Claims

1. A method for retrieving railway survey and design standards and specifications based on a large model, characterized in that it includes the following steps:

S1, extracting standard specification clauses from the specification document: first, the docx library in Python is used to process the entire railway survey and design specification document in doc or docx format, extracting the content of each paragraph one by one; then, a regular expression is used to determine whether a clause number appears at the beginning of the paragraph text; if it does, a new clause is recorded and becomes the current standard specification clause; if it does not and a current standard specification clause exists, the paragraph text is appended to the content of the current standard specification clause; otherwise, no processing is performed; finally, a clause list of the specification document is obtained, which includes all standard specification clauses marked with clause numbers;

S2, using a large model to construct a knowledge graph for retrieving railway survey and design standards and specifications, including the following steps: S21, using large model prompt engineering to extract keywords from each standard specification clause in the clause list of S1; S22, using large model prompt engineering to obtain an overview of each standard specification clause; S23, using large model prompt engineering, with correspondingly written prompts, to extract the reference relationships of each standard specification clause; S24, to improve the accuracy of the reference relationships, first manually processing the reference relationships of some standard specification clauses and, combined with the prompts in S23, constructing a large model fine-tuning dataset in the form of question-and-answer pairs; then fine-tuning the large model with this dataset and the QLoRA fine-tuning method to obtain a reference relationship large model, which is then applied to the remaining standard specification clauses to obtain their reference relationships; S25, storing the standard specification clauses and the reference relationships obtained in S24 into a graph database to obtain a knowledge graph for standard specification retrieval;

S3, using a vector database to vectorize and store the text information: the keywords and summaries from S2 are converted into vectors and stored separately in a vector database, giving a keyword vector database and a summary vector database;

S4, using the large model to extract key information from the user's question: using large model prompt engineering, key information is extracted from the text input by the user; the key information includes the engineering object and object attributes;

S5, screening the standard specification clauses using a multi-channel recall mechanism: for the engineering objects and object attributes extracted in step S4, the top k approximate keywords are obtained from the keyword vector database obtained in step S3 according to vector distance; graph database query statements are then constructed for the approximate keywords, and the standard specification clauses containing these keywords and their reference relationships are selected from the knowledge graph obtained in S2 and, after overall deduplication, used as the initial screening results; the standard specification clauses in the initial screening results are ranked using the BM25 text matching algorithm, and the top n are selected as the results of graph query recall; then, the large model is used to directly vectorize the text input by the user in S4, the top n standard specification clauses most relevant to that text are queried from the summary vector database obtained in step S3, and the queried content is used as the result of vector similarity-based recall; finally, the results of graph query recall and of vector similarity-based recall are deduplicated, and the remaining standard specification clauses are used as the screening results;

S6, performing secondary sorting using the large model and outputting the results: the large model and corresponding prompts are used to calculate the relevance between the standard specification clauses screened in step S5 and the text input by the user, giving a similarity value for each standard specification clause after multi-channel recall; all standard specification clauses after multi-channel recall are then sorted by similarity value, and the top n clauses are returned as the final query results.

2. The method for retrieving railway survey and design standards and specifications based on a large model according to claim 1, characterized in that: the regular expression in S1 is '^[0-9]+\.[0-9]+\.[0-9a-zA-Z]+'; the clause number consists of digits, decimal points, and uppercase and lowercase letters.

3. The method for retrieving railway survey and design standards and specifications based on a large model according to claim 1, characterized in that: the reference relationship in step S2 includes the name of each referenced external specification document, the clause numbers referenced in that external specification document, and the numbers of other clauses referenced within the current specification document; an external specification document is any specification document other than the current one.

4. The method for retrieving railway survey and design standards and specifications based on a large model according to claim 1, characterized in that: The similarity value in step S6 is between 0 and 100.

5. The method for retrieving railway survey and design standards and specifications based on a large model according to claim 1, characterized in that: in step S25, using the reference relationships obtained in step S24, the keywords from step S21, and the overviews of the standard specification clauses from step S22, four types of triplet relationships are constructed for each standard specification clause: <clause (overview), include, keyword>, <clause (overview), citation, clause (overview)>, <clause (overview), reference, specification document>, and <specification document, containing, clause (overview)>, thereby forming a knowledge graph for standard specification retrieval.

6. The method for retrieving railway survey and design standards and specifications based on a large model according to claim 1, characterized in that: The prompts are written using Markdown syntax.

7. The method for retrieving railway survey and design standards and specifications based on a large model according to claim 1, characterized in that: If the information extracted by the large model in step S4 includes text other than the key information, the regular expression '(?<={).*? (?=})' is used to filter out irrelevant information to ensure that the large model outputs accurate key information.