DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling Paper • 2512.03000 • Published 9 days ago • 34
4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer Paper • 2512.05060 • Published 7 days ago • 18
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos Paper • 2512.01707 • Published 10 days ago • 7
Accelerating Streaming Video Large Language Models via Hierarchical Token Compression Paper • 2512.00891 • Published 11 days ago • 14
CaptionQA: Is Your Caption as Useful as the Image Itself? Paper • 2511.21025 • Published 16 days ago • 25
VisPlay: Self-Evolving Vision-Language Models from Images Paper • 2511.15661 • Published 22 days ago • 42
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning Paper • 2512.02425 • Published 9 days ago • 22
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling Paper • 2511.20785 • Published 16 days ago • 150
Hugging Face to sell open-source robots thanks to Pollen Robotics acquisition 🤖 Article • +1 • Apr 14 • 48
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions Paper • 2409.18042 • Published Sep 26, 2024 • 40