Research Note

Dual-Track Decoupled Large Model Industry Adaptation Framework

Focusing on the adaptation challenges of large models in highly regulated industries, we propose a "Dual-Track Decoupled Industry Adaptation" framework.

Dual-Track Decoupling Framework Optimization

Abstract

In recent years, large models have been moving from general domains into highly regulated application scenarios such as finance, healthcare, law, and energy, yet the engineering and scientific challenges have not diminished. Against this background, we propose the overarching concept of "dual-track decoupled industry adaptation": by fully separating general capability enhancement from industry knowledge governance at runtime, we aim to minimize parameter pollution and regression while ensuring knowledge freshness and traceability. Specifically, the parameter track improves language and reasoning capabilities, while the non-parameter track carries updatable industry knowledge; the two form a closed loop at runtime through referee feedback. Compared with previous works such as KBLaM, Self-RAG, RETRO, and kNN-Adapter, we place greater emphasis on systematic divide-and-conquer and on governance mechanisms.

However, when planning this research it would be unwise to commit to a single solution and lock in the technical route, as the field is still evolving rapidly. Recent work shows that external memory and retrieval methods are iterating quickly. For example, the neuroscience-inspired HippoRAG uses knowledge graphs and Personalized PageRank to achieve multi-hop knowledge integration in a single retrieval step, outperforming existing RAG methods at lower cost[1]; MemoRAG introduces a lightweight long-range model that builds a global memory of the database and guides retrieval with coarse answer clues, significantly outperforming traditional RAG on complex long-text tasks[2]; and KBLaM maps knowledge triples into continuous key-value vectors and injects them into the model via rectangular attention, with complexity growing linearly in knowledge scale and support for dynamic updates[3]. On the parameter-efficient fine-tuning side, LoRA freezes the backbone weights and injects low-rank matrices, cutting trainable parameters by up to ten thousand times while maintaining model performance[4], while DoRA decomposes weights into magnitude and direction and uses LoRA to update the directional component, approaching the learning capacity of full-parameter fine-tuning with unchanged inference overhead[5]. In addition, research on model memory governance calls for unified evaluation standards and warns of biases in LLM-as-judge[6]. On the security side, the vector databases introduced by RAG may expose private data and create risks such as reverse reconstruction, over-sharing, and data poisoning[7].

Given these advancements, we have adjusted our plan to "dual-track framework + multi-route exploration," maintaining the decoupling principle while systematically comparing various mechanisms to find the optimal combination and make academic contributions.

1 Research Objectives and Overview

We continue to divide the system into a Parameter Track (Param-Track) and a Non-parameter Track (Nonparam-Track):

  • Parameter Track: Responsible for language understanding, reasoning capabilities, and style control, without carrying specific facts. For parameter-efficient adaptation, we plan to experiment with different PEFT schemes:
    • Low-rank or directional updates (LoRA, DoRA/O-LoRA): LoRA freezes the backbone and updates it through low-rank matrices[4]; DoRA decomposes weights into magnitude and direction and uses LoRA to update the direction, enhancing learning capacity while maintaining inference efficiency[5] (a minimal sketch follows this list);
    • Hybrid or mixture-of-experts combinations: Explore orthogonalization of directional bases, dynamic routing, or sparse gating to keep multiple skill units mutually non-interfering;
    • Incremental transfer and cross-generation alignment: Attempt weight space alignment based on Procrustes or CCA to reduce upgrade costs.
  • Non-parameter Track: Responsible for carrying industry knowledge and time-sensitive information. We will experiment with various knowledge governance mechanisms:
    • Knowledge token / rectangular attention (KBLaM/KB-Adapter): Convert knowledge triples into fixed-length key-value vectors and inject them into the model, eliminating retrieval latency and supporting dynamic updates[3];
    • Knowledge graph + single-step multi-hop retrieval (HippoRAG): Construct a schema-free knowledge graph, utilize Personalized PageRank to complete cross-document reasoning in a single step, already surpassing methods like IRCoT in multi-hop QA[1][8];
    • Global memory + clue-driven retrieval (MemoRAG): Use a lightweight model to build a global memory and produce clues; a heavyweight model then retrieves against these clues and generates the final answer, suitable for implicit requirements and structured queries[2];
    • Hierarchical tree retrieval (RAPTOR/IRCoT): Build summary trees through recursive clustering, retrieving information at different levels to achieve long document aggregation[9];
    • External vector retrieval + kNN memory: Use vector databases and hierarchical retrieval trees for time-sensitive knowledge, and explore kNN-LM or kNN-Adapter under security controls;
    • Episodic memory and RL adjustment (Memento): Explore combining online memory with reinforcement learning to improve agent adaptability through episodic memory.
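To make the parameter-track idea concrete, the following is a minimal sketch, assuming PyTorch, of a frozen backbone linear layer augmented with a trainable LoRA-style low-rank update. The class name `LoRALinear`, the rank `r`, and the scaling factor `alpha` are illustrative assumptions, not the project's actual implementation.

```python
# Minimal LoRA-style layer: frozen backbone weight plus trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen backbone weight: carries general capability, never updated.
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank factors: carry the adapted "skill unit".
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + scaling * x (B A)^T ; only A and B receive gradients.
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(768, 768)
print(layer(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```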

Through an adaptive router (sketched below), we will dynamically select and combine the non-parameter paths at runtime based on question difficulty, industry triggers, model uncertainty, and latency budget. We will follow the "three-part method" proposed in unified evaluation work to trade off combination cost against quality gain, preventing the model from stuffing all evidence into the context and causing truncation and delay.
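A minimal sketch of such a router is shown below; the route names, thresholds, and input signals are illustrative assumptions, and in practice the policy would be learned or tuned per deployment rather than hard-coded.

```python
# Toy adaptive router: picks a non-parameter path from estimated difficulty,
# model uncertainty, industry triggers, and the remaining latency budget.
from dataclasses import dataclass

@dataclass
class RoutingSignals:
    multi_hop_score: float      # 0..1, estimated need for cross-document reasoning
    model_uncertainty: float    # 0..1, e.g. normalized entropy of a draft answer
    industry_trigger: bool      # regulation / terminology keywords detected
    latency_budget_ms: float    # remaining time budget for retrieval

def route(sig: RoutingSignals) -> str:
    if not sig.industry_trigger and sig.model_uncertainty < 0.2:
        return "no_retrieval"              # parameter track alone is enough
    if sig.latency_budget_ms < 200:
        return "knowledge_tokens"          # KBLaM-style injection, no retrieval chain
    if sig.multi_hop_score > 0.6:
        return "graph_ppr"                 # HippoRAG-style single-step multi-hop
    if sig.model_uncertainty > 0.7:
        return "global_memory_clues"       # MemoRAG-style clue-driven retrieval
    return "vector_topk"                   # default dense retrieval

print(route(RoutingSignals(0.8, 0.5, True, 800)))  # graph_ppr
```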

2 Theory and Method Exploration

2.1 Decoupling Motivation and Theoretical Support

The motivation for physically decoupling "general capabilities" from "industry knowledge" at runtime stems from two aspects. First, knowledge governance: writing facts into parameters causes forgetting and regression, and upgrades then require costly re-fine-tuning. Second, the problems of the retrieval-augmentation chain: external retrieval in real environments is susceptible to noise, latency, and data competition. Recent research shows that single-step multi-hop retrieval can integrate scattered evidence in one pass through knowledge graphs and Personalized PageRank[1]; that memory token injection can scale linearly and be updated dynamically[3]; and that dual-system memory can bridge information under implicit requirements[2]. The decoupling framework therefore not only alleviates memory interference but also provides a container for these mechanisms.

2.2 Parameter Track Method Research

The parameter track should ensure that general capabilities do not regress, or regress only negligibly. We plan to conduct the following explorations:

  1. LoRA vs. DoRA comparison: LoRA uses low-rank matrices to reduce trainable parameters while maintaining performance[4], but may suffer from subspace interference in multi-task/multi-tenant scenarios. DoRA decomposes weights into magnitude and direction and uses LoRA to update the direction, improving learning capacity while keeping inference overhead unchanged, with multiple experiments showing DoRA surpassing LoRA on multimodal tasks[5] (a decomposition sketch follows this list).
  2. Orthogonal or low-overlap subspaces: Use orthogonal regularization or mixture of experts to map different skill units to nearly orthogonal directional bases to reduce merging conflicts.
  3. Dynamic mixing and incremental merging: Explore dynamic gating mechanisms like MoE or MoLa to load different skills on demand; use minimal rotation alignment during upgrades to reduce degradation.
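As referenced in item 1, the following is a minimal sketch, assuming PyTorch, of the DoRA-style magnitude/direction decomposition; the variable names, the random stand-in for a pretrained weight, and the per-row norm convention are simplifications for illustration, not the method's reference implementation.

```python
# Toy DoRA-style layer: frozen weight split into magnitude and direction,
# with a LoRA update applied only to the directional component.
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8):
        super().__init__()
        base = torch.randn(out_features, in_features) * 0.02  # stands in for a pretrained weight
        self.register_buffer("w_base", base)                   # frozen
        # Trainable magnitude initialized from the base weight's row norms.
        self.magnitude = nn.Parameter(base.norm(dim=1, keepdim=True))
        # LoRA factors that perturb the direction.
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.w_base + self.lora_b @ self.lora_a            # directional update
        direction = w / w.norm(dim=1, keepdim=True)            # unit-norm rows
        return x @ (self.magnitude * direction).T              # rescale by learned magnitude

print(DoRALinear(768, 768)(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```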

2.3 Non-parameter Track Method Research

Non-parameter track exploration will focus on different memory and retrieval strategies:

  1. Rectangular attention and knowledge tokens (KBLaM): By mapping knowledge triples into fixed key-value vectors and injecting them into the model, KBLaM avoids external retrieval chains, with complexity increasing linearly with knowledge scale, capable of injecting over 10,000 knowledge entries into an 8B model, and supporting dynamic addition and deletion[3].
  2. Knowledge graph indexing and Personalized PageRank (HippoRAG): Use LLMs to convert corpora into schema-free knowledge graphs, then run PPR seeded on the query's core concepts to complete multi-hop retrieval in a single step[1][8]; this mechanism outperforms IRCoT on multi-hop QA benchmarks at lower cost (a toy sketch follows this list).
  3. Global memory and clue-driven retrieval (MemoRAG): Construct a global memory of the database through a long-context lightweight model, generate coarse answers as clues to guide a heavyweight model for retrieval and final answer generation, suitable for tasks with implicit information needs or ambiguous queries[2].
  4. Hierarchical retrieval and tree structures (RAPTOR/IRCoT): Build retrieval trees through recursive clustering and summarization, performing top-down coarse and fine localization on long documents, effectively solving long-text aggregation[9].
  5. Other non-parameter extensions: Including kNN-LM (using nearest neighbor embeddings as external memory), paragraph-level memory blocks (like RETRO), knowledge editing and local patching (ROME/MEMIT), and the Memento framework combining episodic memory with reinforcement learning.
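As referenced in item 2, the following toy sketch, assuming the networkx library, illustrates how query concepts can seed a Personalized PageRank pass over a small entity graph; the graph, node names, and seeding scheme are invented for illustration and do not reflect the actual extraction pipeline.

```python
# Toy single-step multi-hop retrieval: query entities seed Personalized PageRank.
import networkx as nx

# Toy entity graph extracted from a corpus (schema-free triples collapsed to edges).
G = nx.Graph()
G.add_edges_from([
    ("Basel III", "capital requirement"),
    ("capital requirement", "tier 1 capital"),
    ("tier 1 capital", "Bank A annual report"),
    ("Basel III", "liquidity coverage ratio"),
])

# Entities recognized in the query seed the personalization distribution.
query_entities = ["Basel III", "tier 1 capital"]
personalization = {n: (1.0 if n in query_entities else 0.0) for n in G.nodes}

scores = nx.pagerank(G, alpha=0.85, personalization=personalization)
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)  # highest-scoring nodes bridge the two query entities in one pass
```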

2.4 Referee Mechanism and Decoding Guidance

To ensure factuality, logical consistency, and terminology standardization, we will build a discriminative referee module that evaluates generated answers on evidence consistency, logical correctness, and normative expression. Considering that LLM-as-judge suffers from position, order, and self-bias issues[6], we will employ multi-model cross-validation, annotated data correction, and randomized output order to reduce bias. The scores output by the referee will serve as reward signals to guide the generation model in adjusting probability distributions during decoding, achieving lightweight online optimization. We will also explore strategies like RLHF/RLAIF and minimal edit rewriting to improve quality during the decoding phase in a closed loop, and record evidence paths for auditing.
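A minimal sketch of this referee-in-the-loop idea is shown below: candidate answers are scored on weighted referee dimensions and the best one is selected (or sent for rewriting). The scoring functions here are placeholders standing in for trained referee heads, and all names and weights are illustrative assumptions.

```python
# Toy referee reranking: weighted scores over candidate answers.
from typing import Callable, Dict, List

Scorer = Callable[[str, List[str]], float]  # (answer, evidence) -> score in [0, 1]

def referee_rerank(candidates: List[str], evidence: List[str],
                   scorers: Dict[str, Scorer], weights: Dict[str, float]) -> str:
    def total(ans: str) -> float:
        return sum(weights[name] * fn(ans, evidence) for name, fn in scorers.items())
    return max(candidates, key=total)

# Placeholder scorers standing in for trained referee heads.
scorers = {
    "evidence":    lambda a, ev: float(any(e.lower() in a.lower() for e in ev)),
    "terminology": lambda a, ev: 1.0 if "tier 1 capital" in a.lower() else 0.5,
}
weights = {"evidence": 0.7, "terminology": 0.3}

best = referee_rerank(
    ["Tier 1 capital must exceed 6% of risk-weighted assets.", "It depends."],
    ["tier 1 capital"], scorers, weights)
print(best)  # the evidence-grounded candidate wins
```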

3 Plan and Timeline

To ensure project feasibility and quantify stage goals, we propose the following 12-month timeline:

| Stage | Time Range | Key Tasks | Expected Outputs |
| --- | --- | --- | --- |
| Requirement Analysis & Baseline Construction | Months 1–3 | Analyze key requirements of high-compliance industries; collect/clean domain corpora; reproduce and evaluate standard RAG, Self-RAG, baseline PEFT (LoRA, DoRA) and knowledge injection schemes (KBLaM) on industry tasks | Form datasets and evaluation metrics; baseline performance report |
| Multi-mechanism Exploration (Non-parameter Track) | Months 4–7 | Implement representative retrieval/memory methods like HippoRAG, MemoRAG, RAPTOR; develop knowledge graph construction pipelines and triple extraction methods; compare different methods on multi-hop QA and regulation-dependent tasks for performance and latency | Non-parameter scheme comparison report; preliminary analysis of advantages and bottlenecks |
| Parameter Track Extension & Hybridization | Months 5–8 | In-depth study of PEFT like LoRA, DoRA, O-LoRA, AdapterFusion; explore directional base orthogonalization and dynamic routing; analyze general capability regression and merging stability combined with non-parameter track experiments | Parameter track optimization schemes; weight packages for specific skills |
| Referee Model & Closed-loop Optimization | Months 7–9 | Build referee evaluation dataset, distill strong models and correct with human annotation; design reward-based decoding guidance; conduct segment rewriting experiments | Referee model and its evaluation report; decoding closed-loop performance comparison |
| Integration & Comparative Experiments | Months 9–11 | Combine parameter track with different non-parameter tracks to build a unified system; use adaptive router to tune between performance and efficiency; conduct end-to-end testing on finance, healthcare, law, energy datasets | Comprehensive performance report; recommended optimal scheme |
| Summary & Release | Months 11–12 | Write paper and technical report; release open-source implementation; develop industry deployment guidelines | Research paper, code, and deployment guidelines |

4 Risk Management and Ethical Compliance

4.1 Technical Risks

  • Model overfitting and regression risk: Parameter track adaptation may cause general capability decline. Using DoRA/LoRA and imposing orthogonal constraints on skill units can mitigate this[5]. Cross-generation alignment will also be used to reduce performance regression during upgrades.
  • Memory capacity and retrieval efficiency: Large-scale knowledge graphs or global memory may lead to insufficient VRAM or high latency. We will compare different methods and set circuit breakers, using the adaptive router to select appropriate retrieval depth within the latency budget.
  • Discriminator bias and evaluation error: LLM-as-judge has biases[6]. Risk will be reduced by using multi-model voting, manual sampling correction, order randomization, and long-tail sample coverage.

4.2 Ethics and Compliance

  • Ensure all training data complies with local laws and regulations (e.g., HIPAA, GDPR), with sensitive data being authorized and desensitized before use.
  • Refuse to perform high-impact decisions based on sensitive features to avoid algorithmic discrimination.
  • Conduct security audits before releasing models or services, providing explainability and traceability to meet industry compliance review requirements.

5 Expected Contributions and Innovations

Expected contributions of this research include:

  1. Framework Contribution: Achieving "general capability–knowledge governance" division in high-compliance domains through runtime decoupling, theoretically proving this division reduces interference and facilitates upgrade migration;
  2. Method Contribution: Comparing and synthesizing various external memory and retrieval mechanisms (HippoRAG, MemoRAG, KBLaM, RAPTOR, etc.), exploring new combinations suitable for high-compliance scenarios, and optimizing parameter adaptation with PEFT techniques like DoRA;
  3. Evaluation Contribution: Designing unified metrics based on the latest LLM memory governance framework, focusing on fact verifiability, citation consistency, terminology compliance, efficiency cost, and transferability;
  4. Security and Compliance Practice: Proposing security strategies applicable to vector databases and knowledge updates, reducing data leakage, model poisoning, and evaluation bias;
  5. Deployment Contribution: Setting curricula and SOPs according to industry needs, providing upgrade–steady-state evaluation protocols, enabling enterprises to quickly restore industry performance when upgrading base models while retaining old knowledge.

We believe that with a clear timeline, open exploration attitude, and strict risk management, this research can provide valuable practical solutions for large model adaptation in high-compliance industries, offering new research pivots for both academia and industry.

6 Experimental Plan and Evaluation Data

To ensure the framework scheme has sufficient empirical support, we plan a set of reproducible experiments and use publicly available datasets for evaluation.

6.1 Experimental Plan

  1. Base and Method Implementation: At the base model level, we will select 8B and 70B pre-trained models as baselines, and implement parameter adaptation methods like LoRA, DoRA, O-LoRA; implement various memory/retrieval mechanisms like KBLaM, HippoRAG, MemoRAG, RAPTOR in the non-parameter track, forming multiple combinations.
  2. Task Division and Comparison: Experiments will be divided into general capability and industry capability dimensions. In the general dimension, we focus on language understanding, reasoning, and retrieval efficiency; in the industry dimension, we emphasize factuality, terminology standardization, clause compliance, and complex numerical reasoning capabilities.
  3. Multi-round Evaluation and Ablation: For each combination of parameter and non-parameter tracks, we will conduct ablation experiments to analyze each module's independent contribution. Evaluation metrics include EM/F1, Citation@k, Evidence Consistency, logical consistency rate, terminology standardization rate, end-to-end latency, and VRAM usage (a sketch of the EM/F1 and latency-percentile computations follows this list).
  4. Large-scale Stress Testing: During the integration phase, simulate concurrent requests under realistic enterprise loads to evaluate system throughput, P50/P95/P99 latency, and stability.
  5. Cross-generation Migration Experiments: After base model upgrade, measure Upgrade Time@Δ, knowledge retention rate, and general capability regression ε by migrating old KMUs and directional bases to validate the effectiveness of cross-generation alignment.
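As referenced in item 3, the following minimal sketch shows how two of the listed metrics, Exact Match / token-level F1 and P50/P95/P99 latency, could be computed following the common SQuAD-style definitions; the normalization here (lowercasing and whitespace splitting only) is a simplification of what the full evaluation harness would do.

```python
# Minimal EM / token-level F1 and latency-percentile computations.
from collections import Counter
import statistics

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"P50": qs[49], "P95": qs[94], "P99": qs[98]}

print(exact_match("6%", "6%"))                                   # 1.0
print(token_f1("tier 1 capital ratio", "the tier 1 capital ratio"))  # ~0.889
print(latency_percentiles([120, 135, 150, 300, 800] * 20))
```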

6.2 Evaluation Data and Public Sets

We select the following public datasets so that results are reproducible and easily comparable with other work (a loading sketch follows the list):

  • HotpotQA: A multi-hop question answering dataset containing 113k Wikipedia-based QA pairs, requiring models to reason across multiple paragraphs and provide supporting facts[10].
  • Natural Questions: An open-domain QA benchmark provided by Google, with questions from real users, requiring models to read entire Wikipedia entries to answer[11].
  • QuALITY: A long-document reading comprehension dataset providing contexts averaging about 5,000 tokens with multiple-choice questions, designed to examine model understanding of long documents.
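The following is a minimal loading sketch, assuming the Hugging Face `datasets` library; the dataset identifiers and configuration names shown are the commonly used ones but should be verified against the hub before use (Natural Questions in particular is very large to download).

```python
# Sketch of loading the public evaluation sets via Hugging Face `datasets`.
from datasets import load_dataset

hotpot = load_dataset("hotpot_qa", "distractor", split="validation")
nq = load_dataset("natural_questions", split="validation")  # large download

example = hotpot[0]
print(example["question"])
print(example["supporting_facts"])  # gold evidence, usable for Citation@k-style checks
```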