3Description: An Intuitive Human-AI Collaborative 3D Modeling Approach



Category: ML/AI Application; 3D; Real-Time Communications; Design; Development
Duration: January 2023 - May 2023 @ New York University
Tool / Material: JavaScript; Arduino (C++); Python; Figma

This project is my graduate thesis in the Interactive Telecommunications Program (ITP) at New York University's Tisch School of the Arts. Part of the experimental work has been written up as a paper. Special thanks to Sharon De La Cruz as thesis advisor, to the ITP community, and to Yuan Li for technical discussions.



Thesis Presentation Slides


Thesis Presentation Speech Content

Page | Speech Content
1 Hi everyone! My name is Zoe and my thesis is called 3Description.
2 Background
3 Nowadays, 3D is playing a more and more important role in our daily lives. And I see a lot of you are interested in VR! Cool! So shall we create the 3D assets that will shape our tomorrow?
4 Unfortunately, not everyone is ready to jump on board! I am not. Because of wrist tendonitis, it's really painful for me to build 3D models in the traditional way, like using a mouse. And I'm not alone: around 800 million people suffer from hand and wrist pain. So, what about us? My friends are not ready either. Without an art or design background, it's hard for them to draw a decent 2D sketch, let alone build a 3D model. And mainstream 3D modeling software has only around 10 million downloads in total. So, what about everyone else? Should everyone learn 3D modeling all of a sudden? Or should we keep buying models created by others? Or should we rely entirely on AI-generated content (AIGC) and give up our own creativity?
5 No, thank you!
6 Statement
7 So the statement is: from imagination to 3D, is there an accessible and painless way?
8 Research
9 To this end, I conducted qualitative research with behavioral observation. The subjects have no art or design background. The task was to imagine a 3D object and describe its appearance to me. To scale down the scope, let's say it's a flower.
10 Experiments
The following experiments are based on the research.
11 If we feed the verbal description directly to AI models, we get results like these. They are abstract, and they cannot be adjusted.
12 During the interviews, I found there were several things the subjects loved to describe: shape, position, and color. For shape, there are two main ways of describing, verbal and gestural. In verbal descriptions, analogy is the most frequently used approach, and it's very vague. That's why my first-round experiment failed: I used fixed descriptions to change the geometries and soon found that this doesn't work for analogies. So I switched to a pipeline: run speech recognition with the Whisper API, send the transcript to the GPT-3.5 API, get the response from GPT-3.5, use a regex to extract the code part we need, replace the relevant code in our JavaScript file, and re-render the WebGL canvas (a rough sketch of this pipeline follows the transcript). The other way is gesture: here I use hand pose recognition to change the shape, because I found the subjects would use both of their hands to show a shape, like this.
13 Position is a lot like shape: there are also two ways of describing it. For the gesture part, I first tried an IoT accelerometer to change the openness, but I replaced that solution because it took the subjects some time to understand how to use the device. So I turned to hand pose recognition here as well. Now we can change a distance, for example between two vertices, or an angle, for example between two model parts (see the hand-pose mapping sketch after the transcript).
14 Color is similar to shape and position, so let's watch the recording directly!
15 More
So, what more could there be?
16 More models! Here are some point clouds I got from Point-E. But if we want to apply the same logic as before, we might want to convert the point cloud to a mesh. If we convert it directly, we get a single entity and can no longer change, for example, the positional relationship between two model parts.
17 So before converting, we need to separate the parts by recognizing their color channels, like this (a segmentation sketch follows the transcript).
18 Here's the result. The grey one is before the segmentation and the other two are after.
19 What's more? More collaboration! If we use this in a livestream experience, like a 3D collaboration meeting, we can discuss around a concrete result with real-time visual feedback. This would help improve meeting efficiency, because our ideas are no longer abstract; they are tangible (a sketch of syncing the model state in real time follows the transcript).
20 What's more? More ways of communicating! I got this behavioral data in the user test with the accelerometer. I found that when the subjects sent me signals by saying "that's it", "here", or something similar, they meant to confirm the current angle. This could be another way for us to communicate with the interface (a confirmation-phrase sketch follows the transcript). Maybe in the future, just with a frown, the solution would change automatically.
21 Conclusion
We can see there are a lot of things ahead that we can do, and I would really love to discuss them further. But for now, let's draw a conclusion. The whole point is: we describe with words and gestures, and "BOOM", we get this!
Action: Here, like a magic show, I suddenly took out a red physical 3D-printed flower model made through human-machine collaboration, like the one shown in the earlier recording. What I didn't mention is that the prompt for the petals was "my heart" and the prompt for the sepals was "my soul"!
22 Thank you to everyone who inspired me, directed me, and supported me! This flower is for you!
Action: Here I handed the flowers out to the audience!
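
The sketches below expand on the techniques mentioned in the talk; they are rough approximations under stated assumptions, not the project's actual code. First, the page-12 speech-to-geometry pipeline: Whisper transcription, a GPT-3.5 request, regex extraction of the returned code, and a swap into the running WebGL scene. The endpoint paths are the public OpenAI REST API; the prompt wording, buildPetalGeometry, and applyGeneratedCode are hypothetical placeholders.

```javascript
// Sketch of the page-12 pipeline: record speech -> Whisper -> GPT-3.5 ->
// regex-extract the code block -> swap it into the running WebGL scene.
// Endpoint paths are the public OpenAI REST API; everything else
// (prompt text, buildPetalGeometry, applyGeneratedCode) is a placeholder.

const OPENAI_KEY = "YOUR_OPENAI_API_KEY";

async function transcribe(audioBlob) {
  // Whisper transcription endpoint.
  const form = new FormData();
  form.append("file", audioBlob, "speech.webm");
  form.append("model", "whisper-1");
  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${OPENAI_KEY}` },
    body: form,
  });
  return (await res.json()).text;
}

async function describeToCode(description) {
  // Ask GPT-3.5 to answer with a fenced JavaScript snippet that rebuilds the geometry.
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${OPENAI_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo",
      messages: [
        {
          role: "system",
          content:
            "Reply with one ```javascript code block that redefines the function buildPetalGeometry().",
        },
        { role: "user", content: description },
      ],
    }),
  });
  const reply = (await res.json()).choices[0].message.content;
  // Regex-extract just the code part of the reply.
  const match = reply.match(/```(?:javascript|js)?\n([\s\S]*?)```/);
  return match ? match[1] : null;
}

async function onUtterance(audioBlob) {
  const text = await transcribe(audioBlob);
  const code = await describeToCode(text);
  // Project-specific step: replace the relevant code and re-render the canvas.
  if (code) applyGeneratedCode(code);
}
```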
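
Next, a minimal sketch of the page-13 hand-pose mapping, assuming a hand tracker (such as MediaPipe Hands) that reports an array of detected hands, each as normalized {x, y, z} landmarks; setPetalSpread and setSepalAngle are hypothetical model hooks.

```javascript
// Sketch of page 13: map the gap between the user's two hands to a model
// parameter (a vertex distance or the angle between two parts).
// Assumes a hand tracker that reports an array of hands, each an array of
// normalized {x, y, z} landmarks; index 0 is the wrist.

function handGap(hands) {
  if (hands.length < 2) return null;
  const [a, b] = [hands[0][0], hands[1][0]]; // wrist landmark of each hand
  return Math.hypot(a.x - b.x, a.y - b.y);   // normalized screen-space distance
}

function updateModelFromHands(hands, model) {
  const gap = handGap(hands);
  if (gap === null) return;
  // Hypothetical hooks on the model:
  model.setPetalSpread(gap * 2.0);    // e.g. distance between two vertices
  model.setSepalAngle(gap * Math.PI); // e.g. angle between two model parts
}
```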
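
A sketch of the color-channel segmentation from pages 16-18: the colored point cloud is split by each point's dominant RGB channel so every part can be meshed and repositioned on its own. The per-channel grouping is an assumption for illustration; the project may map colors to parts differently.

```javascript
// Sketch of pages 16-18: split a colored point cloud into parts by the
// dominant color channel of each point, so each part can be converted to
// a mesh and repositioned independently.
// Points are objects {x, y, z, r, g, b} with 0-255 color values.

function segmentByColorChannel(points) {
  const parts = { red: [], green: [], blue: [] };
  for (const p of points) {
    if (p.r >= p.g && p.r >= p.b) parts.red.push(p);
    else if (p.g >= p.b) parts.green.push(p);
    else parts.blue.push(p);
  }
  return parts; // each part can now be meshed on its own
}
```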
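
A sketch of the page-19 collaboration idea, assuming model-parameter updates are shared over a plain WebSocket so every participant sees the same concrete result; the server URL, message shape, and applyParameter are assumptions.

```javascript
// Sketch of page 19: share parameter changes over a WebSocket so all
// participants in a 3D collaboration meeting see the same model state.
// The server URL and message format are hypothetical.

const socket = new WebSocket("wss://example.com/3description");

function broadcastChange(parameter, value) {
  socket.send(JSON.stringify({ type: "param", parameter, value }));
}

socket.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "param") applyParameter(msg.parameter, msg.value); // re-render locally
});
```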
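
Finally, a small sketch of the page-20 observation: treat spoken phrases like "that's it" or "here" as a confirmation signal and lock the value currently being adjusted. The phrase list and the lockCurrentAngle hook are illustrative.

```javascript
// Sketch of page 20: treat certain spoken phrases as a "confirm" signal
// and lock the value currently being adjusted by gesture.

const CONFIRM_PHRASES = ["that's it", "here", "stop"]; // assumed list

function isConfirmation(transcript) {
  const t = transcript.toLowerCase();
  return CONFIRM_PHRASES.some((phrase) => t.includes(phrase));
}

function onTranscript(transcript, controller) {
  if (isConfirmation(transcript)) controller.lockCurrentAngle(); // hypothetical hook
}
```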