| Multimodal continual learning with Mixture-of-Experts (MoE) typically relies on task identifiers to route inputs, limiting deployment when task boundaries are unknown. We propose ComplementarityMoE, which replaces task-ID routing with a Barlow Twins loss adapted for complementarity: instead of enforcing view-invariance (τ = 1.0), we target partial correlation (τ < 1.0) between router projections, driving experts toward orthogonal feature subspaces without requiring task labels. The architecture freezes a PerceiverIO encoder after the first task and trains only LoRA-based experts and the router on subsequent tasks, enabling positive backward transfer. We scale this approach from 2 to 10 sequential tasks on CMU-MOSEI and uncover a key finding: the optimal complementarity level reverses with task count. At 2 tasks, τ = 0.5 is optimal (51.93% accuracy); at 10 tasks, τ = 0.2 dominates (46.31% vs. 37.82% for τ = 0.5), outperforming D-MoLE, CL-MoE, ProgLoRA, EWC, and naive fine-tuning with statistical significance (p < 0.05, 10 seeds). Routing analysis reveals that the router converges to near-perfect expert specialization (> 99% primary weight) at 10 tasks, contrasting with distributed routing at 2 tasks. These results establish a practical design principle: as the task-to-expert ratio approaches 1, complementarity constraints should favor specialization over collaboration. |
*** Title, author list and abstract as submitted during Camera-Ready version delivery. Small changes that may have occurred during processing by Springer may not appear in this window.
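
The complementarity objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so this assumes the standard Barlow Twins form with the diagonal target of the cross-correlation matrix changed from 1 to τ. The function names, the pure-Python list representation, and the off-diagonal weight `lam` are hypothetical choices made for illustration and are not taken from the paper.

```python
import math

def standardize(z, eps=1e-8):
    """Zero-mean, unit-variance scaling of each feature (column) over the batch."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in z) / n) + eps
        for j in range(d)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in z]

def complementarity_loss(z_a, z_b, tau=0.5, lam=0.005):
    """Barlow-Twins-style loss with diagonal target tau (assumed form).

    z_a, z_b: two batches of projections, each a list of N rows with D features.
    tau = 1.0 gives the usual invariance target; tau < 1.0 asks for only
    partial correlation between matching features.
    lam weights the off-diagonal (redundancy) penalty; its value is an assumption.
    """
    n, d = len(z_a), len(z_a[0])
    za, zb = standardize(z_a), standardize(z_b)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Entry (i, j) of the batch cross-correlation matrix.
            c_ij = sum(za[b][i] * zb[b][j] for b in range(n)) / n
            if i == j:
                loss += (c_ij - tau) ** 2   # pull the diagonal toward tau
            else:
                loss += lam * c_ij ** 2     # push off-diagonal entries toward 0
    return loss

# Two identical views with uncorrelated features.
z = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_invariance = complementarity_loss(z, z, tau=1.0)
loss_partial = complementarity_loss(z, z, tau=0.5)
```

In this toy case the two views are identical, so the diagonal correlations are close to 1: the loss is near zero for τ = 1.0 and about 0.5 for τ = 0.5, which shows how a target below 1 penalizes fully redundant projections.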