3D Multimedia

Analytics, Search and Generation

In Conjunction with ICME 2023

July, Brisbane, Australia

News!

  • Jan 30, 2023:   The website is coming. Call for papers.

  • May 1, 2023:   Six papers are accepted. Congratulations to the authors.

  • June 5, 2023:   We are honored to invite Prof. Weizhi Nie to give a keynote.

  • June 12, 2023:   We are honored to invite Dr. Ting Yao to give a keynote.

  • June 20, 2023:   We are honored to invite Prof. Zhaopeng Cui to give a keynote.


   Today, ubiquitous multimedia sensors and large-scale computing infrastructures are rapidly producing 3D multi-modality data, such as 3D point clouds acquired with LIDAR sensors, RGB-D videos recorded by Kinect cameras, meshes of varying topology, and volumetric data. 3D multimedia combines content forms such as text, audio, images, and video with 3D information, which allows the world to be perceived more faithfully, since the real world is three-dimensional rather than two-dimensional. For example, a robot can manipulate an object successfully by recognizing it in RGB frames and estimating its size from a point cloud. Researchers have strived to push the limits of 3D multimedia search and generation in various applications, such as autonomous driving, robotic visual navigation, smart industrial manufacturing, logistics distribution, and logistics picking. In logistics picking systems, for instance, 3D multimedia (e.g., videos and point clouds) can help agents grasp, move, and place packages automatically. 3D multimedia analytics is therefore one of the fundamental problems in multimedia understanding. Unlike 3D vision, 3D multimedia analytics concentrates mainly on fusing 3D content with other media. It is a very challenging problem that involves multiple tasks, such as human 3D mesh recovery and analysis, 3D shape and scene generation from real-world data, 3D virtual talking heads, 3D multimedia classification and retrieval, 3D semantic segmentation, 3D object detection and tracking, and 3D multimedia scene understanding.
Therefore, the purpose of this workshop is to: 1) bring together state-of-the-art research on 3D multimedia analysis; 2) call for a coordinated effort to understand the opportunities and challenges emerging in 3D multimedia analysis; 3) identify key tasks and evaluate state-of-the-art methods; 4) showcase innovative methodologies and ideas; 5) introduce interesting real-world 3D multimedia analysis systems or applications; and 6) propose new real-world or simulated datasets and discuss future directions. We solicit original contributions in all fields of 3D multimedia analysis that explore multi-modality data to generate strong 3D data representations. We believe this workshop will offer a timely collection of research updates to benefit researchers and practitioners in the broad multimedia communities.

Call for papers

   We invite submissions to the ICME 2023 Workshop on 3D Multimedia Analytics, Search and Generation (3DMM2023), which brings researchers together to discuss robust, interpretable, and responsible technologies for 3D multimedia analysis. We solicit original research and survey papers, which must be no longer than 6 pages (including all text, figures, and references). Each submitted paper will be peer-reviewed by at least three reviewers. All accepted papers will be presented as either oral or poster presentations, and a best paper award will be given. Papers that violate anonymity or do not use the ICME submission template will be rejected without review. By submitting a manuscript to this workshop, the authors acknowledge that no paper substantially similar in content has been submitted to another workshop or conference during the review period. Authors should prepare their manuscript according to the Guide for Authors of ICME. The paper submission link is here. For detailed instructions, see here.
  The scope of this workshop includes, but is not limited to, the following topics:

  • Generative Models for 3D Multimedia and 3D Multimedia Synthesis
  • Generating 3D Multimedia from Real-world Data
  • 3D Multimodal Analysis and Description
  • Multimedia Virtual/Augmented Reality
  • 3D Multimedia Systems
  • 3D Multimedia Search and Recommendation
  • Mobile 3D Multimedia
  • 3D Shape Estimation and Reconstruction
  • 3D Scene and Object Understanding
  • High-level Representation of 3D Multimedia Data
  • 3D Multimedia Application in Industry

  Fast Review for Rejected Regular Submissions of ICME 2023
  We have set up a Fast Review mechanism for regular submissions rejected by the ICME main conference, and we strongly encourage authors of rejected papers to submit them to this workshop. To submit through Fast Review, authors must write a cover letter (1 page) explaining the revisions made to the paper and attach all previous reviews. All papers submitted through Fast Review will be reviewed directly by meta-reviewers, who will make the decisions.

Important Dates

Description Date
Paper Submission Deadline April 1, 2023
Notification of Acceptance April 23, 2023
Camera-Ready Due Date May 1, 2023
Workshop Date July 14, 2023

Workshop Agenda

Time Description
8:30-8:40 Opening
8:40-9:20 Keynote 1: 3D Generation via Memory Knowledge
9:20-10:00 Keynote 2: Cross-Modal Vision-and-Language Intelligence: Methodologies and Applications
10:00-10:30 Tea Break
10:30-11:00 Keynote 3: 3D Perception and Understanding with Compositional Neural Radiance Fields
11:00-12:00 6 Oral Presentations (~10 min each)
12:00-12:10 Best Paper Award Announcement, Discussion, and Closing

Invited speakers

 Weizhi Nie
Tianjin University, China
Title: 3D Generation via Memory Knowledge
Abstract: The rapid growth of AIGC technology has recently led to the release of numerous applications on various websites. These applications primarily involve the generation of textual and visual content. The field of 3D generation, by contrast, poses significant challenges, as there are currently no well-established tools available. Compared to text and image content generation models, 3D models contain richer structural and depth information, presenting greater challenges in terms of accuracy and practical application of the generated content. In this talk, I will explore the significant contributions of prior knowledge and memory networks in model generation. Furthermore, I will discuss the improvements achieved through this approach in three key aspects: 3D reconstruction, model generation, and point cloud completion. Finally, I would like to engage in a discussion with you regarding the utilization of large-scale models in 3D generation tasks.
Biography: Weizhi Nie received the Ph.D. degree from Tianjin University, Tianjin, China, in 2015. From 2012 to 2013, he visited the National University of Singapore as a joint Ph.D. student. He is currently an associate professor with the School of Electrical and Information Engineering, Tianjin University. His current research interests include 3D model retrieval, 3D generation, and multimedia information processing and analysis. He is currently an Associate Editor of Multimedia Tools and Applications. He regularly serves as a PC member and an invited reviewer for top-tier conferences and prestigious journals in multimedia and artificial intelligence, such as ACM Multimedia, IJCAI, AAAI, CVPR, and ICCV.

 Ting Yao
HiDream.ai, China
Title: Cross-Modal Vision-and-Language Intelligence: Methodologies and Applications
Abstract: Vision and language are two fundamental systems of human representation. Integrating the two in one intelligent system has long been an ambition in the multimedia and vision fields, supporting uniquely cross-modal vision-and-language intelligence. Within this area, vision to language describes what the intelligent system sees, and language to vision creates visual content according to language inputs. In this talk, we first look into the problem of vision to language, following its development from independency (enhancing the visual encoder), to interaction (boosting the encoder-decoder with interaction), to symbiosis (learning a universal encoder-decoder as a foundation model) between different modalities. Moreover, we present how to efficiently utilize a cross-modal foundation model to strengthen language to vision tasks. Finally, we will discuss the practical applications of vision-and-language intelligence in real-world scenarios.
Biography: Ting Yao is currently the Co-Founder and CTO of HiDream.ai, a high-tech startup company focusing on generative intelligence for creativity. Previously, he was a Principal Researcher with JD AI Research in Beijing, China and a Researcher with Microsoft Research Asia in Beijing, China. Dr. Yao has co-authored more than 100 peer-reviewed papers in top-notch conferences/journals, with 12,000+ citations. He has developed one standard 3D Convolutional Neural Network, i.e., Pseudo-3D Residual Net, for video understanding, and his video-to-text dataset, MSR-VTT, has been used by 400+ institutes worldwide. He serves as an associate editor of IEEE Transactions on Multimedia, Pattern Recognition Letters, and Multimedia Systems. His work has led to many awards, including the 2015 ACM-SIGMM Outstanding Ph.D. Thesis Award, the 2019 ACM-SIGMM Rising Star Award, the 2019 IEEE-TCMC Rising Star Award, the 2022 IEEE ICME Multimedia Star Innovator Award, and 10+ championships in worldwide competitions.

 Zhaopeng Cui
Zhejiang University, China
Title: 3D Perception and Understanding with Compositional Neural Radiance Fields
Abstract: Neural-based 3D modeling and rendering methods, represented by Neural Radiance Fields (NeRF), have recently received extensive attention in fields such as computer vision and graphics. In comparison to traditional explicit point cloud or mesh-based representations, neural representations have shown increased success in detail expression, model compactness, and realistic rendering. However, these implicit representations cannot be directly applied to other 3D tasks. In this talk, we will first introduce how to learn compositional neural radiance fields that can enable tasks such as 3D editing and photo extrapolation. Next, we will present a new paradigm of amodal 3D scene understanding using compositional neural radiance fields that enables reliable 3D understanding from a panoramic image of a closed environment. Finally, we will discuss the application of neural radiance fields for online 3D perception.
Biography: Zhaopeng Cui received his Ph.D. degree from Simon Fraser University in 2017 and worked as a senior researcher at ETH Zurich from 2017 to 2020. He is currently a research professor in the College of Computer Science, Zhejiang University. His research interests include 3D mapping and localization, 3D scene understanding, visual navigation, and image and video editing. He is currently an associate editor of The Visual Computer. He has served as an associate editor for IROS and a senior PC member for AAAI and IJCAI.


Shan An
JD Health, China
An-An Liu
Tianjin University, China
Kun Liu
JD Logistics, China
Na Zhao
Singapore University of Technology and Design, Singapore
Guoxin Wang
Zhejiang University, China
Wu Liu
Explore Academy of JD.com, China
Antonios Gasteratos
Democritus University of Thrace, Greece

Accepted Papers

Paper ID Paper Title
26 PointHGN: Point Heterogeneous Graph Neural Network for Point Cloud Learning
31 Expressive Speech-driven Facial Animation with Controllable Emotions
32 A Simple Masked Autoencoder Paradigm for Point Cloud
100 An Algorithm of Three-dimensional Shape Dissection with Mesh Reconstruction
109 Video Background Music Recommendation Based on Multi-level Fusion Features
117 PCaSM: Text-Guided Composed Image Retrieval with Parallel Content and Style Modules

Previous Workshop on 3DMM: 3DMM-ICME2022

If you have any questions, feel free to contact < anshan [DOT] tju [AT] gmail [DOT] com >