The Dawn of the ''Era of Composing Through Language''

According to multiple US technology media reports, OpenAI is developing a new AI tool receiving both text and audio prompts to generate music. Unlike existing AI music services (Suno, Udio, Mubert) that primarily use keyword tag matching to statistically predict musical compositions, OpenAI's approach combines language models with audio models in an integrated multimodal architecture — interpreting linguistic context and emotion to design music's structural flow (rhythm, harmony, texture) while simultaneously recognizing text, sound, and context. The technical significance: when GPT-4o understands voice and recognizes emotion, and Sora generates video narrative context, adding a music generation model creates a system that integrally interprets text, video, and audio to design the emotional tone of entire stories. This represents movement toward a new form of creative AI — reconstructing language, emotion, and narrative within a single structure. Industry implications: (1) For game/film scoring — AI music generation could enable real-time adaptive soundtracks that respond to narrative events and player/viewer emotional states; (2) For music industry — the existing AI music copyright unresolved issues (unresolved litigation involving Universal Music, ASCAP) become more pressing as generation quality improves; (3) For creative roles — the distinction between "composing music" and "directing AI to generate music" will require new frameworks for creative attribution; (4) For accessibility — democratizing music production could enable creators who lack technical music skills to produce emotionally resonant audio content, potentially transforming the economics of content creation across platforms.