This article reviews notable AI research papers published in Week 3 of 2025 (25W03), covering multimodal learning, video understanding, reasoning, robotics, and model optimization.
Multimodal/Vision-Language: SCRIT (Self-Calibrated Referential Instruction Tuning) improves fine-grained visual grounding by dynamically calibrating instruction-following with contrastive learning, achieving state-of-the-art results on referential comprehension benchmarks. VideoRAG introduces retrieval-augmented generation for video understanding, letting models retrieve relevant video segments and reason over them for long-form video QA. LlamaV-o1 adapts chain-of-thought reasoning to vision-language models through structured, step-by-step visual reasoning, outperforming models 10x larger on multimodal benchmarks.
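To make the retrieval step behind video RAG more concrete, here is a minimal sketch of ranking video segments against a query by cosine similarity. It is an illustration only, not VideoRAG's actual pipeline: the segment records, the three-dimensional embeddings, and the `retrieve_segments` helper are all hypothetical, and a real system would obtain embeddings from a video or text encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_segments(query_vec, segments, top_k=2):
    """Return the top_k segments most similar to the query (hypothetical helper)."""
    scored = [(cosine(query_vec, seg["embedding"]), seg) for seg in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in scored[:top_k]]


# Toy, hand-written embeddings; assumed to come from some encoder in practice.
segments = [
    {"id": "seg-0", "start_s": 0, "end_s": 30, "embedding": [0.9, 0.1, 0.0]},
    {"id": "seg-1", "start_s": 30, "end_s": 60, "embedding": [0.1, 0.9, 0.1]},
    {"id": "seg-2", "start_s": 60, "end_s": 90, "embedding": [0.8, 0.2, 0.1]},
]
query_vec = [1.0, 0.0, 0.0]

top = retrieve_segments(query_vec, segments, top_k=2)
top_ids = [seg["id"] for seg in top]
print(top_ids)
```

The retrieved segments would then be passed to the model as context for answering the question, which is the "augmented generation" half of the approach.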
Robotics/Embodied AI: OmniManip proposes a unified manipulation framework enabling robots to handle diverse objects and tasks through 3D semantic understanding and contact-point prediction, demonstrating robust generalization across unseen object categories. PR (Predictive Reasoning) introduces forward-model-based planning for robot manipulation, predicting action outcomes before execution to reduce trial-and-error.
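The idea of predicting action outcomes before execution can be sketched in a few lines. This is a toy illustration under strong assumptions, not the method from the paper: the one-dimensional state, the additive `forward_model`, and the candidate action set are all made up for the example.

```python
def forward_model(state, action):
    """Hypothetical forward model: predicts the next gripper position."""
    return state + action


def plan_action(state, goal, candidate_actions, model):
    """Pick the action whose predicted outcome lands closest to the goal."""
    return min(candidate_actions, key=lambda a: abs(goal - model(state, a)))


state, goal = 0.0, 0.7
candidates = [-0.5, 0.0, 0.5, 1.0]

# Every candidate is evaluated in the model first; only the winner is executed.
best = plan_action(state, goal, candidates, forward_model)
print(best)
```

Because candidates are scored in the model rather than on the robot, poor actions are discarded without physical trial and error. In practice the forward model would be learned and the state far higher-dimensional.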
Model Optimization/Efficiency: Multiple papers address inference efficiency, including speculative decoding improvements, attention approximation for long contexts, and quantization techniques that maintain quality at 4-bit precision. Training efficiency advances include gradient checkpointing variants and optimizer improvements for large-scale pretraining. Benchmark papers introduce new evaluation frameworks for long-context reasoning, multilingual capabilities, and instruction-following robustness across diverse task types.
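For readers unfamiliar with 4-bit quantization, the following is a minimal sketch of symmetric per-tensor quantization to signed 4-bit integers (range -8 to 7). It is a generic textbook scheme chosen for illustration, not the technique of any specific paper covered here; the function names and sample weights are assumptions.

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit codes in [-8, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.33, 0.10, -0.70, 0.02]  # toy values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

Each weight is stored as a 4-bit code plus a shared scale, so the rounding error is at most half a step (`scale / 2`). Methods that preserve quality at this precision typically refine this basic scheme, for example with per-group scales.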
![[25W03] Latest AI Paper Tech Trends (SCRIT, VideoRAG, LlamaV-o1, OmniManip, PRM, TP)](https://metax-images-bucket.s3.ap-southeast-2.amazonaws.com/articles/25w03-ai-scrit-videorag-llamav-o1-omnimanip-prm-tpa-biomedica-transformer-2-mini-1065604170172756/img-1.webp)