Since the announcement of Sora, applications of multimodal large models to video generation have been emerging at a breakneck pace. Video generation models such as LUMA and Gen-3 Alpha have demonstrated superb artistic style and finely crafted scene detail, and the frontier of text-to-video and image-to-video keeps expanding, delighting the community and raising expectations.
Recently, researchers from the University of Science and Technology of China, Peking University, Shanghai AI Lab, and other institutions released the eye-catching ShareGPT4Video series, which aims to improve both video understanding and video generation capabilities.
Over the past half year, following the release of ShareGPT4V's high-quality image-caption dataset, the image-language multimodal community has gradually come to recognize how crucial detailed, accurate image-caption data is for aligning the image and language modalities. Since its release, the ShareGPT4V dataset has, in the VQA category on the HuggingFace platform, …
Building on the high-quality ShareGPT4V dataset, the image understanding and image generation communities have also achieved breakthroughs, for example the work on InternVL-Chat-V1.5 and PixArt-Σ.
Encouraged by the success of the ShareGPT4V dataset in the image-text multimodal domain, the original team turned its attention back to the video multimodal domain, where closed-source commercial models have long held an overwhelming lead. On one hand, OpenAI's and Google's two recent back-to-back launch events pushed AI video reasoning to new heights; on the other hand, OpenAI's Sora took text-to-video generation to an entirely new level.
The researchers argue that the huge lead of closed-source models in video understanding and video generation likewise rests on detailed, high-quality video-caption data. The team therefore set out once again to obtain large quantities of detailed and accurate captions for videos, in order to improve the video understanding ability of large video-language models and the generation ability of text-to-video models.
The work currently tops HuggingFace's Daily Papers for June 7, and quickly gathered 500+ stars after the code release, drawing attention both at home and abroad.
The researchers identify three challenges in generating high-quality video captions with existing closed-source models:
To address these, the researchers carefully designed a Differential Sliding-Window Captioning (DiffSW) strategy that can stably and efficiently generate high-quality captions for videos of arbitrary resolution, aspect ratio, and length.
Figure 1: Differential sliding-window caption generation
Concretely, each input to GPT-4V consists of the current keyframe, the previous keyframe, and the differential caption produced for the previous keyframe. The goal is for GPT-4V to observe the temporal and spatial changes between the two frames and summarize the key spatial and temporal changes of the current frame relative to the previous one, i.e., that frame's differential caption. Finally, all differential captions, together with their timestamps, are fed into GPT-4, which summarizes them into the final high-quality caption for the entire video.
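The loop above can be sketched as follows. Note that `query_gpt4v` and `query_gpt4` are hypothetical stand-ins for the actual GPT-4V/GPT-4 API calls (stubbed here so the control flow runs end to end); the paper's real prompts and API usage are not reproduced.

```python
# Sketch of the DiffSW (Differential Sliding-Window Captioning) pipeline.

def query_gpt4v(prev_frame, curr_frame, prev_diff_caption):
    # Hypothetical stub: the real call sends both keyframes plus the previous
    # differential caption and asks GPT-4V for the inter-frame changes.
    return f"changes from {prev_frame} to {curr_frame}"

def query_gpt4(timed_diff_captions):
    # Hypothetical stub: the real call asks GPT-4 to merge the timestamped
    # differential captions into one video-level caption.
    return " | ".join(f"[{t:.1f}s] {c}" for t, c in timed_diff_captions)

def diffsw_caption(keyframes, timestamps):
    """Slide over keyframes, producing one differential caption per step,
    then summarize all of them into the final video caption."""
    diff_captions = []
    prev_frame, prev_diff = None, ""
    for frame, ts in zip(keyframes, timestamps):
        diff = query_gpt4v(prev_frame, frame, prev_diff)
        diff_captions.append((ts, diff))
        prev_frame, prev_diff = frame, diff
    return query_gpt4(diff_captions)

caption = diffsw_caption(["f0", "f1", "f2"], [0.0, 2.0, 4.0])
print(caption)
```

Carrying the previous differential caption forward gives the model short-term memory of what has already been described, which is what keeps the per-frame descriptions consistent across the whole video.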
The team shows a few examples:
The video segment documented a significant event in Kochi, Kerala, where 2 buildings razed in Kochi. The broadcast began with a split-screen presentation: on one side, thick clouds of dust were seen billowing into the sky, marking the onset of the demolition process, while on the other side, reporter Gopikrishnan provided live coverage, indicated by "BREAKING NEWS" captions and a consistent timestamp of "11:10 AM." The news ticker at the bottom of the screen simultaneously ran other global events, maintaining a flow of information. As the video progresses, the split-screen footage of the razed house turns into a close-up. A notable change in the headline to "KOCHI FLATS RAZED" signaled the demolition's culmination. A brief interlude offered a visual contradiction by showcasing the flats presumably before their demolition, providing a stark before and after comparison. As the video progressed, the left building's collapse initiated a dramatic alteration in the skyline, marked by significant dust plumes. Subsequently, another building was shown partially collapsing amid debris, fully obscured by dust in seconds, with surrounding greenery remaining untouched. This transitioned into a graphic interlude featuring the "India Today" logo, briefly pausing the live footage. Resuming to the aftermath, split imagery displayed the rubble and ongoing smoke. Then, the imagery continued to juxtapose the scenes of destruction against intact high-rise buildings nearby. The narrative was augmented by the revelation that the Supreme Court directed the demolition within a broader national news context. Throughout, the report maintained a real-time approach, threading continuity and urgency across the unfolding event's documentation.
The video begins with an individual seated on a gray couch in a cozy domestic setting, about to unbox a product from a red CCM-branded box placed on a white table in front of them. Initially, the person is seen drinking from a blue can, indicating a casual atmosphere. Soon after, the individual shifts attention from the can to the red box, signifying the start of the unboxing process. The red box, initially closed, gradually becomes the focal point as the person prepares to open it, conveying a build-up of anticipation. As the video progresses, the box is flipped over and then opened, revealing its content still hidden under white tissue paper adorned with prints, adding to the suspense. The individual’s engagement with the box evolves, from initially preparing to open it, to actively delving into its contents. A momentary pause in activity is captured before the anticipation culminates with the individual lifting an object from the box. This object, identifiable by a yellow label, is then examined closely by the person, indicating a thorough inspection or perusal of the product or its packaging. Throughout the video, the surrounding environment remains consistent and undisturbed, with household items like a potted plant and a wall clock maintaining the setting's homely ambiance. The camera’s perspective remains fixed, focusing on the unfolding unboxing event without any movement, thus allowing the viewer to observe the narrative closely. Another partially open brown box is visible beside the main red box, though its role or contents are not elaborated upon. The video encapsulates the anticipation, action, and reveal inherent to unboxing experiences in a home setting.
With this method, the researchers built a large video-caption dataset, ShareGPT4Video, containing 40K videos (291 hours in total) annotated by GPT-4V. The data spans a wide range of categories, and the generated captions contain rich world knowledge, object attributes, camera motion, and detailed, accurate temporal descriptions of events.
Figure 2: (a) The dataset covers a broad range of content, including wildlife, cooking, sports, scenery, egocentric human activity, autonomous-driving scenes, and more. (c) Caption lengths mostly fall between 200 and  words, providing rich temporal information that serves video understanding and generation tasks well.
On top of the ShareGPT4Video dataset, in order to further scale the data and make it easy for the open-source community to apply to its own data, the researchers designed and developed ShareCaptioner-Video, a multi-functional multimodal large model that can efficiently generate high-quality captions for arbitrary videos.
Figure 3: ShareCaptioner-Video is a four-in-one video captioning model with the following capabilities: sliding-window caption generation, fast caption generation, clip-level caption integration, and prompt-driven detailed captioning.
Concretely, the sliding-window captioning mode can take over all the roles GPT-4V played when the annotated data was collected, producing differential captions via a sliding window and aggregating them into the final caption. The fast captioning mode concatenates all keyframes vertically into one long image and produces the final caption in a single pass, greatly increasing annotation speed at a slight cost in quality. The clip summarization mode, after one sliding-window pass over the full video, can directly produce a caption for any clip within it without re-running the sliding-window process.
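The fast-captioning preprocessing step (stacking keyframes vertically into one long image) can be sketched with Pillow as below. The frame size and the way frames are produced here are assumptions for illustration, not the authors' exact code.

```python
# Minimal sketch: stack keyframes top-to-bottom into one long image so the
# captioner can see the whole video in a single forward pass.
from PIL import Image

def stack_keyframes(frames):
    """Concatenate keyframes vertically into a single image."""
    width = max(f.width for f in frames)
    height = sum(f.height for f in frames)
    canvas = Image.new("RGB", (width, height))
    y = 0
    for f in frames:
        canvas.paste(f, (0, y))
        y += f.height
    return canvas

# Three dummy 336x336 keyframes standing in for sampled video frames.
frames = [Image.new("RGB", (336, 336), color) for color in ("red", "green", "blue")]
long_image = stack_keyframes(frames)
print(long_image.size)  # (336, 1008)
```

The trade-off mentioned above follows directly from this design: one model call replaces N sliding-window calls, but fine inter-frame changes are harder to recover from a single composite image.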
With this strong captioning model in hand, the researchers used it to annotate a further 4.8M videos totaling 3000 hours. These videos have high aesthetic scores and few scene transitions, and are intended to serve video generation tasks.
Table 1: Composition of the 4.8M videos annotated by ShareCaptioner-Video
Experiments
For video understanding, the researchers first verified the effectiveness of the ShareGPT4Video dataset on several LVLM architectures through a simple equal-quantity substitution experiment: among the 100K video training samples of the VideoChatGPT dataset, they replaced the 28K samples related to detailed captions with an equal-sized subset of ShareGPT4Video. As the table below shows, this simple data swap, i.e., improving caption quality alone, consistently brings significant performance gains to video-understanding multimodal LLMs of different architectures and scales.
Table 2: The ShareGPT4Video dataset yields performance gains across model architectures
The researchers then collected 153K video VQA samples on their own and, combining them with the 28K high-quality captions from ShareGPT4Video relevant to video understanding, proposed a new LVLM, ShareGPT4Video-8B. With only 8 GPUs and 5 hours of training, it achieves strong results on multiple benchmarks.
Table 3: Performance comparison on TempCompass
Table 4: Performance comparison on VideoBench
Table 5: Performance comparison on MVBench
Even on several recently introduced video-understanding benchmarks, ShareGPT4Video-8B consistently shows competitive performance at the 7B parameter scale.
Table 6: Performance comparison on LongVideoBench
Table 7: Performance comparison on the Video-MME benchmark
Table 8: Performance comparison on the MMBench-Video benchmark
For video generation, the researchers used the Open-Sora-Plan project to directly verify how much detailed captions help text-to-video models. In the figure below, the first row shows results from a text-to-video model trained on short captions, while the second row shows results from a model trained on the high-quality captions produced by ShareCaptioner-Video. Detailed caption data clearly gives the text-to-video model strong control over camera movement and semantic content.
Original article link: