The UTC Graduate School is pleased to announce that Amin Amiri will present doctoral research titled “Personalized and Adaptive Therapeutic Music Generation from Biosignals Using Knowledge-Guided Multimodal Large Language Models” on 02/09/2026 at 10 AM in MDRB Room 218. Everyone is invited to attend.
Computer Science
Chair: Dr. Yu Liang
Co-Chair: Dr. Dalei Wu
Abstract:
AI-enabled music therapy has the potential to expand access to individualized, non-pharmacologic interventions, but clinically viable therapeutic generation requires more than high-fidelity synthesis: systems must (i) personalize content to a patient’s physiological and contextual state, (ii) remain stable over long horizons, and (iii) satisfy safety- and protocol-driven constraints. This dissertation introduces Qmusic-MLLM, a knowledge-guided multimodal framework that operates in discrete token spaces to enable scalable, controllable, and extensible therapeutic generation across heterogeneous modalities.

Methodologically, therapeutic personalization is driven by a large-language-model (LLM) controller that infers individualized therapeutic needs (such as relaxation, cognitive engagement, or affect regulation) from multimodal biometric context and translates them into auditable constraints and long-horizon plans. To support real-time, high-fidelity multimodal fusion, a universal tokenization interface, the Harmonizer ecosystem, converts continuous signals into transformer-compatible discrete tokens. Empirically, the base Harmonizer achieves state-of-the-art reconstruction quality against modern neural codecs and tokenizers on speech and music benchmarks, indicating a strong foundation for reliable downstream control. Building on this token interface, EEG-Harmonizer compresses multichannel EEG into compact neural tokens and, together with an Electrode-Aware Importance classifier, enables robust cross-subject discrimination of cognitive and affective states for neuro-conditioned generation. For future immersive extensions, Video-Harmonizer provides high-resolution video tokenization (4K/8K at 30/60 fps) and substantially outperforms recent video tokenizers, including Microsoft VidTok and NVIDIA Cosmos tokenizer baselines, while significantly reducing training cost under limited GPU resources.

To incorporate therapeutic domain knowledge and reduce unsupported content, evidence-grounded reasoning is integrated through Chain-of-Thought (CoT) planning and Retrieval-Augmented Chain-of-Thought (RAG-CoT). In RAG-CoT, curated external corpora are indexed and retrieved to supply non-parametric knowledge that conditions plan generation and document-conditioned responses (demonstrated with medical references, including Schwartz’s Principles of Surgery and Kaplan & Sadock’s Synopsis of Psychiatry). The overall architecture is designed to remain backend-agnostic across LLM families. Closed-loop adaptation is formulated as an EEG-grounded contextual bandit that optimizes a prompt-selection policy using immediate alignment- and safety-weighted rewards. Finally, Biomedical-Harmonizer extends the tokenization ecosystem to cardiac electrophysiology by converting multi-lead ECG into high-fidelity discrete token streams for multimodal patient-state modeling.

Across these components, a subject-disjoint validation protocol is emphasized to assess generalization without subject-specific leakage. Together, the proposed framework establishes a scalable foundation for personalized, knowledge-grounded, and physiologically adaptive therapeutic music generation, with clear paths toward graph-based retrieval, richer multimodal conditioning, and prospective clinical evaluation.
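To make the discrete-token idea concrete, the following minimal sketch shows vector quantization, in which continuous latent frames are mapped to integer token IDs by nearest-codebook lookup. The codebook size, latent dimension, and random data are hypothetical placeholders; the Harmonizer’s actual encoder, codebook design, and training procedure are not shown.

    # Minimal vector-quantization sketch: continuous frames -> discrete tokens.
    # Illustrative only; codebook and data are hypothetical stand-ins.
    import numpy as np

    rng = np.random.default_rng(2)
    codebook = rng.normal(size=(256, 32))   # 256 codes, 32-dim latents (assumed)

    def tokenize(frames: np.ndarray) -> np.ndarray:
        # frames: (T, 32) continuous latent frames -> (T,) integer token IDs
        d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    def detokenize(tokens: np.ndarray) -> np.ndarray:
        # Reconstruct by codebook lookup; a learned decoder would refine this.
        return codebook[tokens]

    frames = rng.normal(size=(100, 32))     # stand-in for encoded audio/EEG frames
    tokens = tokenize(frames)               # transformer-compatible discrete IDs
    recon = detokenize(tokens)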
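The closed-loop adaptation can be pictured with a minimal epsilon-greedy contextual-bandit loop. Everything below (context features, prompt count, the linear value estimates, and the simulated alignment/safety reward) is an illustrative stand-in under stated assumptions, not the dissertation’s actual formulation.

    # Epsilon-greedy contextual-bandit sketch for prompt selection.
    # Illustrative only; rewards and contexts are simulated, not real EEG.
    import numpy as np

    rng = np.random.default_rng(1)
    n_prompts, ctx_dim, eps = 4, 8, 0.1
    weights = np.zeros((n_prompts, ctx_dim))  # one linear estimate per prompt
    counts = np.ones(n_prompts)

    def reward(alignment: float, safety: float, lam: float = 0.5) -> float:
        # Immediate reward trading off physiological alignment against safety.
        return (1 - lam) * alignment + lam * safety

    for step in range(500):
        ctx = rng.normal(size=ctx_dim)             # e.g., EEG-derived features
        if rng.random() < eps:
            arm = int(rng.integers(n_prompts))     # explore
        else:
            arm = int(np.argmax(weights @ ctx))    # exploit current estimates
        r = reward(alignment=rng.random(), safety=rng.random())  # simulated
        counts[arm] += 1
        # Simple incremental update toward the observed reward.
        weights[arm] += (r - weights[arm] @ ctx) * ctx / counts[arm]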
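Subject-disjoint validation simply requires that no subject contributes samples to both the training and test folds. A minimal sketch using scikit-learn’s GroupKFold on synthetic data (all shapes, labels, and subject IDs are hypothetical):

    # Subject-disjoint cross-validation sketch; synthetic data, illustrative only.
    import numpy as np
    from sklearn.model_selection import GroupKFold
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 16))            # 200 samples, 16 features
    y = rng.integers(0, 2, size=200)          # binary state labels
    subjects = rng.integers(0, 10, size=200)  # 10 hypothetical subjects

    for tr, te in GroupKFold(n_splits=5).split(X, y, groups=subjects):
        # No subject appears in both folds, so scores are leakage-free.
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        print(f"held-out accuracy: {clf.score(X[te], y[te]):.2f}")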