
Roblox’s Cube Model: Creating Interactive 4D Worlds | VP of AI Explains

Interview with Anupam Singh, VP of AI at Roblox

How does Roblox use AI to power a 3D platform for hundreds of millions of users? VP of AI, Anupam Singh dives into their new Cube AI model, integrations with LLMs for complex 3D/4D world-building, and the future of vibe coding and in-experience creation on Roblox.

Watch this episode on YouTube, X/Twitter, or Spotify.

Topics Covered:

  • Roblox is the "YouTube of 3D"

  • The Roblox Vision: 3D Creation & Consumption at Scale

  • The AI Backbone: Safety, Moderation & Infrastructure at Roblox

  • Cube AI: Towards a Foundational Model for 3D

  • Using Cube AI with 3P Large Language Models (LLMs)

  • The Journey to 4D: Crafting Truly Interactive Worlds

  • The Rise of "Vibe Coding" at Roblox Scale

  • Empowering Players: In-Experience 3D Creation

  • AI's Impact on the Creator Economy

  • Powering Discovery: Roblox's AI Recommendation Engine

  • The Evolving Landscape: The Future of UGC and AAA 3D Content

  • Anupam's Advice for Aspiring Creators & Developers

Links to Roblox Releases:

  • Roblox Cube AI: https://corp.roblox.com/newsroom/2025/03/introducing-roblox-cube

  • Cube GitHub Repo: https://github.com/Roblox/cube

  • Voice Classifier: https://github.com/Roblox/voice-safety-classifier

Get in touch:

  • Join My Newsletter: https://spatialintelligence.ai

  • Connect with me on X/Twitter here: https://x.com/bilawalsidhu

  • Everywhere else here: https://bilawal.ai

  • Business inquiries: team@metaversity.us

Interview Transcript:

Bilawal Sidhu: Ever wonder how the metaverse gets built? Not just the idea, but the worlds and experiences millions dive into daily.

I'm not talking hypotheticals. Roblox is a colossal platform. In late 2024 alone, 85 million people jumped into Roblox daily, spending an average of two and a half hours in user-generated worlds. Roblox isn't just a game, I call it the YouTube of 3D experiences. And the opportunity is massive.

Last year, creators earned nearly a billion dollars on Roblox. But here's the catch: making 3D content is still really hard. Imagine what'll happen when you slash those barriers to entry, just like we've seen in video creation. The phone in your pocket is practically a visual effects studio.

Today I got something special for y'all. We're sitting down with Anupam Singh, VP of Engineering at Roblox, the man leading the charge on AI and ML at this amazing company that is building the literal instantiation of the metaverse.

Roblox recently announced their Cube model, as well as their plans for building a 3D foundational model for the specific purpose of creating interactive 3D worlds. So stick around to hear why they went down this autoregressive approach to tokenizing 3D, allowing them not just to predict words, but to predict shapes. And how they're building towards a future where one AI model can understand geometry, textures, full-body rigging, and interactivity, enabling true 4D creation.

We'll also dive into how they're using their own 3D models with the reasoning power of any large language model to build richer, more complex worlds faster than ever before, allowing you to literally speak 3D worlds into existence. And lastly, scale. This isn't just a toy project. This is Roblox building something for their hundreds of millions of users. So if you want to understand the future of 3D and 4D creation, you're not going to want to miss this conversation. Let's get into it.

Anupam Singh: The team is very excited that I'm talking to you. They almost wanted to re-brief me on our technical work, thinking, you're talking to Bilawal, you need to be briefed. I'm like, I've been there since day one when we started this effort.

Bilawal Sidhu: (Laughs)

Anupam Singh: My name is Anupam Singh. I'm VP of Engineering here at Roblox with responsibility for infrastructure, the AI platform, discovery, ads engineering, and many other things. I've been at Roblox for three and a half years, and was a two-time entrepreneur before that. To summarize my career, it's been about reading some great paper that is super geeky at the time, like MapReduce or the Transformer, and then spending 10 years trying to make it production-worthy and getting it to billions of dollars in revenue and billions of users.

Bilawal Sidhu: Yeah, don't tell the researchers that. They think it's the zero-to-one innovation, but how do you get that thing out to market at scale?

Anupam Singh: We have those on our team. You know, we have the person who wrote the ControlNet paper as an advisor on the Cube team, and I always joke with him that it'll take 10 years for us to even understand all the implications of ControlNet, for example. But for the researchers, it's always very obvious. The future is clear and obvious, and then it falls to engineers like us to make it actually happen.

Bilawal Sidhu: So speaking of that, Roblox is such an interesting application and ecosystem. I've been calling Roblox the YouTube for 3D experiences because it has that, like, closed loop between creation and consumption. But unlike video, 3D has historically been super challenging and very high barrier to entry. But that's changing. Tell me about what Roblox is doing to shatter these barriers.

Anupam Singh: I think it goes back to almost our founding principle. We have this principle called Long View, and since its founding, Dave Baszucki, our founder, has always tried to make it easier and easier. Let's say Bilawal wants to create a 3D game today. Of course, the core coding is hard, and the core imagination loop is yours, but then you don't know how to get traffic, you know? But if you publish it on Roblox, the discovery system will start seeding it with some players and see if it's getting engagement, and then the flywheel starts happening. So the proudest thing for us is when somebody creates a game and within 30 days, they've found their audience. So distribution and infrastructure are the big things. Now, the third leg of the stool is, of course, AI, and that's what we're going to talk about.

Bilawal Sidhu: Yeah, I mean, I love that, right? Creating a 3D experience is one thing; distributing it at scale and having a huge audience of folks that can experience it from a plurality of devices is, I think, equally key. You know, a lot of people talked about the metaverse and kind of equate it with AR/VR headsets, but I've always been a fan of the definition that the metaverse needs to be AR/VR optional. So why not include that low-end Android device as much as a kitted-out PC that somebody may have, or a headset in the future?

Anupam Singh: And the technology to enable that, right? If you have a 2-gigabyte phone or a network connection that is not strong, and you still want to play one of our games, downsampling it, upsampling it, all of that is infrastructure that we want to be invisible to both our players and creators.

Bilawal Sidhu: Cool.

Anupam Singh: Yeah, I've been on calls with some of our top creators, and they sometimes are curious about what happens after they hit publish. And I want to tell them, that's where our challenge starts, because some of the creators are able to get two or three million people into their events. And imagine two or three million people pressing play at the same time. You have to distribute this new update to 40,000 servers worldwide across data centers, match you with your friends, and get you inside the game, because your patience will last no more than three seconds after you press the play button. So much to your point earlier, Bilawal, it is much more complicated than video, because video is one-way, whereas if you and I are playing Roblox, I have to make sure that we are synchronized and having a great experience irrespective of whether I am on a PC and you are on an Android device.

Bilawal Sidhu: Absolutely. But let's be honest, the metaverse would be a rather empty place without interesting content. So what is Roblox doing to make it easier to populate these virtual worlds with amazing 3D content?

Anupam Singh: The first one is invisible infrastructure so that people don't have to worry about where the bits go. The second one is matching you to your audience. So it starts with matchmaking, which is the ability, after you press play, to put you into the right instance. But a lot of our machine learning and AI work is related to discovery and recommendations, whether you are in the marketplace to buy the latest avatar or on the homepage trying to figure out what game you want to play next. But one of the core values for us, and that's why I'm so proud about working at Roblox, is safety. Most people, when they think about ML and AI, think about recommendations and monetization. But our heavy investment is in safety.

Bilawal Sidhu: Is that moderation?

Anupam Singh: Yeah, it could be. Let's take the basic stuff. You and I are chatting on the platform. Every one of the words that you type in goes through a text filter.

Bilawal Sidhu: Wow.

Anupam Singh: And that's, you know, the last public information we've published is maybe 4 billion calls a day, more than 30,000 requests per second. And we might be one of the few platforms on the planet where if our moderation, if our text filter goes down, we actually take our chat down. We don't do unfiltered chat just because filtering is too expensive or too hard to build. So a lot of our investment has been in safety, and then we open source it. So we've open sourced our voice safety model, where literally while you and I are talking, the best demo that our founder does with our head of safety engineering is they get on a pretend one-on-one in the town hall, and he uses an inappropriate word, and it gives him a warning saying, "Hey, you've just used an inappropriate word and we are giving you your first nudge," if you will, right? But doing it in real time, where we take voice, feed it to a machine learning model, and get an answer in real time, has been very challenging, and then we decided to open source it because we think the internet should be a safer place. So why not open source it? We're starting to see more than 10,000 downloads of the open source voice safety model.
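
For listeners who want to try the open-sourced voice safety model linked above, here is a minimal sketch of running it locally. The Hugging Face checkpoint id, the 16 kHz mono input format, and the multilabel (sigmoid) readout are assumptions based on typical audio classifiers, not details confirmed in this conversation.

```python
# A minimal sketch of running the open-sourced voice safety classifier locally.
# Assumptions (not confirmed in the transcript): the checkpoint is published on
# Hugging Face as "Roblox/voice-safety-classifier", it is an audio sequence
# classifier, and it expects 16 kHz mono audio with multilabel outputs.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

MODEL_ID = "Roblox/voice-safety-classifier"  # assumed Hugging Face model id

extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = AutoModelForAudioClassification.from_pretrained(MODEL_ID)
model.eval()

def classify_clip(path: str) -> dict[str, float]:
    """Return per-label safety scores for one short voice clip."""
    waveform, sample_rate = torchaudio.load(path)
    waveform = waveform.mean(dim=0)  # downmix to mono
    if sample_rate != 16_000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    # Assuming a multilabel head, so sigmoid rather than softmax.
    scores = torch.sigmoid(logits)
    return {model.config.id2label[i]: float(s) for i, s in enumerate(scores)}

print(classify_clip("chat_snippet.wav"))
```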

Bilawal Sidhu: I love that. And it's kind of wild to even think about this notion of moderating voice chat at scale, right? Like, in real time. That's actually a staggering engineering and infrastructure problem, and it's amazing that y'all are able to pull that off. And these massive spaces are only going to grow. The notion of these third spaces that kids spend a lot of their time in is only going to grow. There is no possible way for human moderators to cover all of that. So this is happening under the hood, but it's still nudging you as needed.

Anupam Singh: Yes, yes, yes. And what has happened over the last 24 months, which got us excited, is we went from safety, personalization, economy, and user experience to changing creation itself. And I think that we are still at day zero on how do you create 3D objects? How do you create 3D worlds? How do you make them 4D, which is to make them functional? And that, I think, would have been extremely difficult to pull off without a transformer architecture.

Bilawal Sidhu: So let's dig into that. Obviously, y'all published a paper on Cube, and you're open-sourcing some models associated with that too. Basically, you can type in a text prompt and get out a 3D object. What was fascinating to me is that y'all, as you mentioned, decided to go with this autoregressive transformer approach, whereas the text- and image-to-3D models we've been seeing from the rest of the industry are more diffusion models with some sort of neural radiance field or Gaussian splat optimization step. Why go about trying to tokenize 3D?

Anupam Singh: Firstly, amazing question. I know you are as much of an expert, or even more of an expert, in this field than I am. Full disclosure: we had a lot of debate about whether we should just use a diffusion transformer. That paper, the diffusion transformer paper, which led to ideas and products like Sora, seemed so seminal that we were all fans of it. We're still fans of it. But video is about predicting the next pixel. We wanted to build something that enables four-dimensional interaction. So we don't just want to build the car, we want to be able to open the door of the car and get inside it, right? Now, at least to our research team and our engineers, that will sound like marketing speak. They're going to laugh at me for saying that. But in reality, it is about tokenizing 3D, right? You take a 3D object and you tokenize it such that it can be cross-attentioned with, and I know that's not a real English word, but you can cross-attend it with tokens from other modalities. So that's what we set out to do. And honestly, initially it still seemed like maybe the diffusion transformers were a better idea, because we could see people generating video after video, and it seemed they were making much more progress than tokenizing 3D. But with video, you might feel that you can play the game, but you're not really inside the game. You're not inside the car. It is giving you a feeling that you're inside the car.
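
To make the pattern concrete, here is a toy sketch of an autoregressive decoder over discrete shape tokens that cross-attends to text embeddings. The vocabulary size, dimensions, and layer counts are invented for illustration; this is not the actual Cube architecture, just the shape-token-prediction idea the conversation describes.

```python
# Toy sketch: predict the next *shape* token while cross-attending to text.
# All sizes are illustrative stand-ins, not Cube's real configuration.
import torch
import torch.nn as nn

SHAPE_VOCAB = 16_384  # assumed size of a shape-token codebook
D_MODEL = 512         # shared width for shape tokens and text memory

class ToyShapeDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(SHAPE_VOCAB, D_MODEL)
        self.pos = nn.Embedding(2048, D_MODEL)
        layer = nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
        # TransformerDecoder gives self-attention over shape tokens plus
        # cross-attention into the "memory" (here: encoded prompt text).
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.head = nn.Linear(D_MODEL, SHAPE_VOCAB)

    def forward(self, shape_tokens, text_embeddings):
        B, T = shape_tokens.shape
        x = self.tok(shape_tokens) + self.pos(torch.arange(T, device=shape_tokens.device))
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(shape_tokens.device)
        h = self.decoder(x, memory=text_embeddings, tgt_mask=causal)
        return self.head(h)  # logits over the next shape token at each position

# Usage: sample shape tokens one at a time, conditioned on the prompt; a real
# system would then detokenize the sequence back into a mesh.
model = ToyShapeDecoder()
text = torch.randn(1, 12, D_MODEL)          # stand-in for encoded prompt text
seq = torch.zeros(1, 1, dtype=torch.long)   # start token (illustrative)
for _ in range(8):
    nxt = model(seq, text)[:, -1].argmax(-1, keepdim=True)
    seq = torch.cat([seq, nxt], dim=1)
print(seq)
```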

Bilawal Sidhu: And so, just to double-click on that, it sounds fascinating, because you're saying you want future iterations of this model to reason about not only the surface geometry, but also the texture atlasing detail, the full-body pose rigging information, and even the scripts. So you can kind of have a model that can do it all? Is that where this is headed? One model to rule them all?

Anupam Singh: Yeah. So we wanted to solve the hardest problem first, so that if we create a car, if we create a geometry, we should be able to model the interior details, right? So if it's a house, it has rooms. If it's a car, it has a seat inside. And we've always believed that if you are native 3D, your objects have interior details that you've already, in quotes, modeled. And that's what we set out to do. Again, pretty difficult initially, but now we can say that our objects have meshes, they have parts, they can be part of layouts, they can be textured, they can have scripts, they can have rigging, as you said, right? Because we have tokenized 3D, they have all this intelligence built into them.

Bilawal Sidhu: I love that. And I loved how the paper talked about the latent space, the sort of hidden layers of meaning this model builds, being semantically meaningful, where things that have similar shapes are clustered together in that latent space. And that's kind of wild, right? Because you're essentially teaching a model to have a vocabulary and grammar for the structure of the world.
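
One practical consequence of a semantically meaningful latent space is cheap similarity search over shapes. A tiny illustration, with random vectors standing in for real shape latents:

```python
# Nearest-neighbor lookup in a shape latent space. The embeddings here are
# random stand-ins, not real Cube latents; only the pattern matters.
import numpy as np

def nearest_shapes(query: np.ndarray, bank: np.ndarray, k: int = 3) -> np.ndarray:
    """Cosine-similarity top-k over a bank of shape embeddings."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return np.argsort(-(b @ q))[:k]

rng = np.random.default_rng(0)
bank = rng.normal(size=(1000, 256))             # 1,000 shapes, 256-dim latents
query = bank[42] + 0.05 * rng.normal(size=256)  # a slightly perturbed "chair"
print(nearest_shapes(query, bank))              # index 42 should rank first
```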

Anupam Singh: Yeah, yeah. And also, one interesting thing, and again, thanks for reading the paper. We always love to geek out with folks. That's why we made it open source. We wanted to be extremely open about our techniques. One other topic of debate, and I can't believe it's already been 18 months or more since we were having these debates: one, of course, you talked about, was diffusion transformers. But the second one was, should we train our own text model? You know, there's this temptation that we have a lot of CPUs and GPUs, so why don't we train our own text model? We do have a bunch of text that is unique to our platform. But then we decided, no, we want to tokenize 3D such that it can work with another large language model, one trained on the text modality. And so combining that intelligence, right, saying that this is a red car, positions us so that if somebody says, "Build a red racing car with rocket boosters on it," we don't do the semantics of the actual text. That's coming from a large language model, because that has world knowledge, that has reasoning. We just have to intersperse that with 3D tokens. And voila, now you have something that understands both the geometry and the reasoning behind why a car is a car.

Bilawal Sidhu: This is such a crucial point, because I think it's almost the genius part of the approach y'all are taking. Build the models, and of course open source the stuff that y'all are good at. You have access to amazing 3D data given your ecosystem, plus a bunch of amazing open source datasets. But these LLMs, there's a bunch of companies in this pseudo arms race to build the biggest, baddest model of them all. I mean, just shortly before our conversation today, Google dropped Gemini 2.5, their reasoning model, which is apparently better than DeepSeek and Claude 3.7. And your approach can just take advantage of that, because you can plug in a new LLM. So can we break down a little bit for listeners and viewers how that works exactly? Because in the paper, y'all talk about text-to-shape, which makes a ton of sense: I give you some text and you predict the shape. But you can also be given a model and end up with a text description. And it sounded like from that, y'all create, just to use geek parlance, a scene graph: a JSON-format description of what was in that model. But now it's just text, meaning GPT-4o or some other large language model can manipulate that text, and then you can go back into 3D. If you say, "Hey, I want to build a kitchen or a garage," you can, like you said, lean on the world knowledge of these LLMs: "Well, what kind of objects are typically found in a garage?" "Well, these types of objects." And then you use your text-to-shape model to generate 3D renditions of those objects and very quickly start populating a scene that looks cohesive and put together. And then, since you always maintain this text representation of what's in the scene, adding stuff becomes more conversational. And I suspect that'll get exciting too, because maybe right now I'm just prompting in text, "Oh, add another Porsche 911," or, "Add a bike rack over there." I could start providing images and maybe other modalities in the future.
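
A hedged sketch of the loop just described: keep the scene as a JSON scene graph, let any chat LLM edit that text, then hand each node to a text-to-shape model. The schema and the stub functions are hypothetical stand-ins, not Roblox's actual format or API.

```python
# Sketch of the text <-> 3D round trip: scene graph as JSON, edited by an LLM,
# then realized by a text-to-shape generator. All stubs are hypothetical.
import json

scene = {
    "objects": [
        {"id": "car_1", "prompt": "red Porsche 911", "position": [0, 0, 0]},
        {"id": "bench_1", "prompt": "steel workbench", "position": [3, 0, 1]},
    ]
}

def llm_edit(scene_json: str, instruction: str) -> str:
    """Stub for a chat-LLM call. A real version would send scene_json plus the
    instruction and get edited JSON back; here we hardcode the expected edit."""
    edited = json.loads(scene_json)
    edited["objects"].append(
        {"id": "rack_1", "prompt": "metal bike rack", "position": [5, 0, 2]}
    )
    return json.dumps(edited)

def text_to_shape(prompt: str) -> str:
    """Stub for a text-to-shape generator; returns a mesh handle stand-in."""
    return f"<mesh:{prompt}>"

def realize(scene: dict) -> None:
    """Place each generated shape; a real version would call into the engine."""
    for obj in scene["objects"]:
        print(obj["id"], text_to_shape(obj["prompt"]), "at", obj["position"])

edited = json.loads(llm_edit(json.dumps(scene), "add a bike rack by the door"))
realize(edited)
```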

Anupam Singh: Yeah, and that took us more time, honestly, coming up with that architecture. There is a temptation to say, you know, maybe Bilawal has trained an image transformer, but I'm going to do it better. That arms race that you talked about. But real technology advancement happens by actually respecting what others have done. Personally, I'm a big fan of open source and of building on open source. Operating systems have been built like that, databases have been built like that, networking stacks have been built like that. So why not in AI? And so we had this very important decision point. Kiran Bhat, who you're familiar with and are going to talk to, is somebody who really thought hard about it. And we worked with professors at Stanford; we had a very extensive academic team that we worked with. And we said, "You know what? Any other modality we are going to interconnect with, we are going to cross-attention." Now, to your point, that gives us the ability to handle things like, I don't really know what a cricket match looks like as a model. We have not trained our model to understand that. To your point, we start with, let's say, a cricket stadium, then we add players to it, then we add countries to it. And the large models are super good at reasoning. So they tell our model what to build. We are very good at building 3D objects, but they tell us what the layout should be, how big the pitch is, how many wickets cricket has. So that's how we think about interspersing with LLMs to build entire scenes rather than just objects.

Bilawal Sidhu: I love that. And so obviously the next step from scenes, right? Like we talked about objects, you've got scenes. Now we got to infuse them with interactivity. This notion of 4D creation that you talked about. You know, how people or objects in the scene behave and respond. How do we get to that next level of creativity, you know, that is going to be unlocked by something like this? What's on the roadmap for y'all to achieve 4D creation?

Anupam Singh: So with 3D tokenization, there are things like parts and rigging and scripts that get enabled. And you mentioned earlier that, "Oh, you know, this could be possible." We want to go from "this could be possible" to making it possible this year, where as soon as you build a racing car: how many doors does it have? How fast does it go? What sound does it make? A lot of that intelligence will come from large models that are being developed by the industry. We are not building a model which knows that the car door opens, but we've spent years and years, maybe 15 or more, building a physics engine that knows how a car door opens. And Cube AI sits in between the great understanding and reasoning that large models have given us, and our amazing engine that runs worldwide and lets you interact with objects. It's essentially the intelligence layer that sits between: take the reasoning and make the object function. And so this year, we are planning to add more and more interactivity to these objects that you can generate with Cube. The interesting part is, you as a creator don't have to worry about building that interactivity in. You build your car and leave the rest to us. We will start building interactivity into it. But you can progressively edit the object: if you don't like the door, the way we opened it, you can change it.
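
As an illustration of that "intelligence layer" pattern, here is a small sketch that takes affordances an LLM might reason out for an object and wires them onto generated parts as engine-side behaviors. The affordance table, part names, and behavior strings are all invented for illustration.

```python
# Sketch: map LLM-reasoned affordances onto generated parts as behaviors.
# Schema and behavior names are hypothetical, not Roblox's engine API.
from dataclasses import dataclass, field

@dataclass
class Part:
    name: str
    behaviors: list[str] = field(default_factory=list)

@dataclass
class GeneratedObject:
    name: str
    parts: list[Part]

AFFORDANCES = {  # what an LLM might report for "racing car"
    "door": ["opens_on_interact", "plays_sound:door_thunk"],
    "wheel": ["spins_with_velocity"],
}

def wire_interactivity(obj: GeneratedObject) -> None:
    """Attach behaviors to any part whose name matches a known affordance."""
    for part in obj.parts:
        for key, behaviors in AFFORDANCES.items():
            if key in part.name:
                part.behaviors.extend(behaviors)

car = GeneratedObject("racing car", [Part("left_door"), Part("front_wheel"), Part("hood")])
wire_interactivity(car)
for p in car.parts:
    print(p.name, p.behaviors)  # hood gets nothing until an affordance exists
```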

Bilawal Sidhu: I love that. And this, I think, very nicely distinguishes the direction y'all are taking from the more diffusion- and NeRF-based approaches we're seeing for 3D object creation. Like, yeah, that model looks awesome, but it's not interactive, and it's not broken into a bunch of different parts. It sounds like another area y'all want to focus on next is these primitives you use inside Roblox games: you want the model to be able to say, "Oh yeah, I'll create a very nice mesh for this, but let me use this iconic style of Roblox primitives for the rest of the stuff." And multiple parts to an object obviously opens up the ability to infuse it with interactivity as well.

Anupam Singh: Yeah, if I go back to last winter, the biggest "oohs" and "aahs" in our internal meetings were when we took a mesh, let's say a semi-truck. I think a lot of video models can generate semi-trucks. It's a well-understood concept. And then for our internal demos, we showed that we can break it down into 140 parts or 250 parts. Now suddenly, you can change the wheels on that semi-truck. Then we put a bounding box around it, and I can change the dimensionality of it. It could be a very snug 18-wheeler, or it can be a very long 18-wheeler. Given that you're now giving it geometry and parts, you can start rigging it, and you can start giving it behavior. So that was our magic moment. That is when we decided we had to release this Cube model. We want to make it open source so that the community can do interesting things with it. But the fact that we can give it parts, the fact that we can give it behavior, is a massive difference in how we are approaching 3D AI compared to generating video.

Bilawal Sidhu: And I think what's funny about that is y'all are obviously building this for the Roblox ecosystem, which is this interactive gaming platform. But a lot of the other 3D and VFX creators I talk to want exactly this ability too, because, "Oh, great, I hit the slot machine and I got the perfect-looking semi-truck." But now I just want to change one or two other things, and doing that in video fashion is very, very hard. You can still make it work with in-painting and imagery, but the 3D equivalent of that doesn't exist. To go back to the point you mentioned, Cube is this middle layer between the intelligence of world models and LLMs, and Roblox itself. How do you see that evolving? What's the model that's orchestrating all these other things? That's the LLM, it seems like, right? Do you imagine a future where you maybe want to create a fine-tuned version of an LLM that's the perfect Roblox conductor model, one that conducts Cube and all the various third-party models and does that for you? Or how do you think about that?

Anupam Singh: So, two-part answer to that. Very good question. Excellent question. We think about this a lot. Number one, zooming out, the biggest orchestrator is you as a creator and your imagination.

Bilawal Sidhu: Love that. Yes.

Anupam Singh: So, I'll tell you this. I've been playing with the model personally. My aesthetic and creativity are not what I'm known for; I'm a database infrastructure person. And we kept playing with the model. And then one fine day, we gave access to our designers. And suddenly we saw designers building these amazing, cohesive worlds, versus I would create an object, it would be a car, a red car, and then a green tree, and if Bilawal looked at it, it's like, "Hey Anupam, what story are you trying to tell?" Right?

Bilawal Sidhu: It's your 3D version of the house with the tree next to it and the sky in the background.

Anupam Singh: Exactly. Exactly. Just because you give me something that can reproduce a beautiful painting doesn't mean I'm going to become a great painter. So the creator is still the centerpiece of our plans, of our thinking, okay? That's the orchestrator as far as creativity and imagination go. Now comes the second level, which is the great part of your question. I think the industry is still trying to figure this out. You know, every weekend I go home with a reading list. I know you might mention this later: vibe coding. Honestly, Bilawal, at this point, on March 25th, 2025, I still don't know what vibe coding is, right? But the MCP stuff is exciting, because we need a protocol such that these LLMs can talk to each other. These large models, they're not necessarily just language models, but these large models can talk.

Bilawal Sidhu: Sure. We really do need a better term there, because it's not quite visual language models either; a lot of these reason about audio too. So it's like, multimodal large model?

Anupam Singh: Multimodal large models, right? And we need a protocol for how they talk to each other, right? So, number one, the creator is the most important in all of this. They are the real orchestrator. Number two, orchestrating across APIs is most likely MCP. And then number three, the thing that is my favorite: we run more than 250 models in production today.
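
For context, MCP (Model Context Protocol) is built on JSON-RPC 2.0, so a tool invocation between model runtimes is just a structured message. The tool name and arguments below are invented for illustration:

```python
# An illustrative MCP-style tool-call request. MCP uses JSON-RPC 2.0 and a
# "tools/call" method; the tool itself ("generate_3d_object") is hypothetical.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "generate_3d_object",  # hypothetical tool exposed by a server
        "arguments": {"prompt": "red racing car with rocket boosters"},
    },
}
print(json.dumps(request, indent=2))
```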

Bilawal Sidhu: Wow.

Anupam Singh: As a company. Whether you are chatting with somebody or uploading something, it's all-pervasive. And so it's another level of question in AI: what keeps production AI going? That orchestrator will be different. So you have the creator, you have an orchestrator to talk between models, and then, as you had in your question, a fine-tuned or distilled version of a model running in production, because you can't really run a massive model in production where Bilawal needs 40 GPUs just to run his creation, right? It's unaffordable. So I think all three of these are going to happen in the next six months, which is why I'm super excited.

Bilawal Sidhu: I love it. I mean, since you brought up vibe coding, let's dive into it, because it totally has, you know, shout out to Andrej Karpathy for coining the term. It's this idea that you're mostly using voice chat to just talk to your coding editor and ask it to do stuff, and you kind of pretend the code doesn't even exist; it's sort of in the background. And you're just asking it again and again, iterating with it to produce something. Now, for somebody who thinks about productionization, shipping at scale, and performance criteria, that might make you well up with a little bit of anxiety. But to your point, it's such a great way to prototype; time to prototype has gone down so drastically. And I think that's really the role vibe coding plays today. One of the demos that's amazing, I'm going to put it up on screen here, is these Claude MCP integrations with a server that can talk to Blender. You basically provide an image reference to Claude, and it says, "Oh, here's what's in this image you provided and here's the rough angle. Now I'm going to use some diffusion text-to-3D model generator to make all those objects for you. And then I'm going to recursively take screenshots and try to figure out how to place these models in a scene." And a lot of people are having this aha moment of, "Holy crap, it isn't just about generating video." We can use these models to control and create stuff in the tools we all know and love. And control is the keyword, because that's exactly what creators want. So when you see stuff like this happening, is it exciting to you? Because it feels like, since Roblox is so vertically integrated, you own the authoring tool, you own the rendering engine, and you're obviously doing the distribution and the serving too. There seem to be some very interesting opportunities coming. So when you see vibe coding taking hold, do you imagine that's also going to happen in the Roblox ecosystem? Like, a lot of people who might have been scared to open up the Roblox editor might want to pop into it?
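
A schematic sketch of the screenshot-and-adjust loop that demo illustrates. Every function here is a stub standing in for a real component (a vision-capable LLM, a text-to-3D generator, a DCC tool like Blender reachable over MCP); the flow, not the implementations, is the point.

```python
# Sketch of the recursive "describe -> generate -> place -> compare" loop.
# All functions are stubs; a real version would call an LLM and a 3D tool.
def describe_reference(image_path: str) -> list[str]:
    """Stub: a vision LLM lists objects seen in the reference image."""
    return ["wooden desk", "table lamp", "office chair"]

def generate_and_place(obj: str) -> None:
    """Stub: generate a mesh for `obj` and drop it into the scene."""
    print(f"placed {obj!r} at a default position")

def scene_matches_reference(image_path: str) -> bool:
    """Stub: the LLM compares a fresh screenshot against the reference."""
    return True  # a real loop would iterate until the layouts agree

def layout_from_image(image_path: str, max_rounds: int = 5) -> None:
    for obj in describe_reference(image_path):
        generate_and_place(obj)
    for _ in range(max_rounds):
        if scene_matches_reference(image_path):
            break
        # a real implementation would ask the LLM for move/rotate corrections

layout_from_image("reference.jpg")
```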

Anupam Singh: 100%. I think the thing that we've been doing since the founding of Roblox is to make 3D, and I would even go one step further, make programming, approachable. We have an amazing experience called Dress to Impress. I mean, you go in there and the interaction is so amazing. And they've never had to worry about infrastructure. They've never had to worry about engine graphics and cross-device support. So in that way, vibe debugging, coding, and that was not a slip of the tongue, because that's what…

Bilawal Sidhu: The debugging part is the hardest part.

Anupam Singh: I'm seeing so much code getting generated, and the thing that keeps going on in my mind is, how are we going to debug this code? Roblox has a culture where even the senior executives in the company are on a PagerDuty on-call rotation.

Bilawal Sidhu: Oh wow, cool.

Anupam Singh: So if an incident happens, I get paged as much as the frontline engineer or the frontline incident commander.

Bilawal Sidhu: How many sev 3 incidents are you getting all the time?

Anupam Singh: Exactly. And so now, it's a quality-of-life issue for me as an engineer, as an executive at Roblox. It's like, oh my god, how many times am I going to get paged because Bilawal just talked to his phone, created a 3D game, hit publish, and it's now on our platform? But more importantly, what are the implications for Bilawal, the creator who created the game and got their first 10,000 players? You're successful, and now you're trying to update it. Updating a game might be more creative and more complex than actually getting your initial success.

Bilawal Sidhu: Totally, yeah.

Anupam Singh: And we see this all the time. Our creators are continuously updating their creations. They're adding more to the world, they're changing the interaction patterns. I am as much a student of this field as you are right now. It's like, what happens to vibe debugging and vibe updating? Vibe refactoring? So it's still an open question. Maybe we'll talk next year and we'll say, "Ah, this is how you do vibe debugging," but I do not understand it yet.

Bilawal Sidhu: Well, I'm glad you brought that up, because it almost feels like it's easier to get to that initial prototype, but it's not necessarily a robust foundation for shipping subsequent iterations. This seems to be a bit of a vibe coding maxim right now: once your project reaches a certain place, just start from scratch. But of course, if you're trying to build a world that you iteratively expand upon, an experience that has existing audiences, that's easier said than done.

Anupam Singh: It's harder. And I think we as an industry are pivoting too hard on the code being generated. The way we like to think about it is, Bilawal makes an amazing experience, and I want to add to it. I'm one of your players, right? I join the game. Can I do in-experience creation? Let's say, you know, I live in San Francisco, and I'm a big fan of the city that I live in. So let's say Bilawal created a San Francisco experience. But now I want to add fireworks to it, right? And I'm one of your players, and can I just say, "Add fireworks to it"? And can I propose to you, can we make San Francisco futuristic? Suddenly, it's not really San Francisco anymore. You and I are effectively collaborators, whereas you are the creator and I am the player. Now, imagine, we of course have games that have millions of players. Imagine millions of people adding to your creation, and now suddenly we are all collaborating. But keeping that cohesive, right? Where you can either accept my creation, or within my instance you can let the creation happen, but maybe not for the other players who like their San Francisco more 1800s, while I like it more 2010s. So how do you manage the entire lifecycle? So for us, as much as I think about vibe coding, I think about in-experience creation. And we are seeing this, by the way, since the launch of Cube. I can't say the exact numbers, but I can say tens of thousands of creations have already happened on our platform.

Bilawal Sidhu: Very cool. And this thing just dropped. In-game creation is really fascinating as a concept, and I saw some of the amazing examples y'all showcased at GDC. It's the most accessible form of vibe coding, in a sense. You're in a world already, in an experience, and you're like, "Oh, I want to change this about the environment or the setting, or I want to spawn a new type of vehicle that I want to drive around, or create a new type of prop, or change my outfit for the vibe of the experience that's going on right now." And being able to do that on the fly is pretty freaking magical, right?

Anupam Singh: Yeah, within 30 hours of the model dropping, we had an avatar being created using the Cube, put in our marketplace, and then we were very proud as engineers because Dave Baszucki, our founder, changed his avatar to that Cube-created avatar.

Bilawal Sidhu: That's cool. That's awesome.

Anupam Singh: It's seldom in your career, in the industry, that within 30 hours somebody has created something and somebody else has adopted it. And if that happens to be your founder CEO, that is even more fun.

Bilawal Sidhu: I love that. Yeah, and these loops between creation and consumption are just going to get tighter and tighter.

Anupam Singh: Amazing. Yeah.

Bilawal Sidhu: Building on this creation conversation, both authoring experiences and in-game creation and collaboration, I want to zoom out a little bit and talk about economics, value creation, and the creator power law, if you will. Right now, the creator economy often resembles a very steep power law, as it's been said: very few creators capture an enormous amount of the value. Do you think Roblox's new generative tools could meaningfully shift that balance, where more creators could sustainably thrive, similar to how short-form video broadened the creator base?

Anupam Singh: I think so. You know, zooming completely out, we have publicly talked about our goal of being 10% of the entire gaming market. Okay? How does that happen? It happens because we are able to, number one, give you more and more unique experiences if you are a player, right? And so what does that mean for creators? We've had a 61% increase in games that appear in the top 150 within 90 days of creation. So, to your questions about the power law, you're 100% right: UGC platforms tend to get concentrated in the top 10 or top 150 or top 10,000. Two years ago, we started the journey of indexing every creation made on the platform, so that we are able to see everything that is getting created. As long as it is safe, as long as it has passed moderation, we are able to bring it some impressions to see whether it is getting traction. I know we always talk about the modeling side of AI, but I am equally passionate about the production side of AI, which means that we should be able to rerun this model for you while you're playing a game. So Bilawal pressed play, you're playing something, and I am seeing what you're interested in. By the time you exit, I should be able to give you a different 3D experience. You might have played volleyball and I might recommend you anime, but there must be something that I've seen, or my model has seen, to recommend you that, right? So, to us, democratizing distribution is a huge part, but the other part is what you mentioned at the start of our conversation: the ability to create fast. You know, this is what all of us are excited about: that you have to open a coding studio and then you have to build the game, et cetera, et cetera. That is going to fundamentally change, which will get more creators on our platform. So both distribution and creation are going to undergo a lot of change, and we are very well positioned because we were UGC from the get-go. It's just that AI 100Xes the speed at which you can create 3D UGC content.
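
The "seed every new creation with some impressions and watch for traction" idea is a classic explore/exploit problem. Here's a tiny epsilon-greedy sketch of the pattern; the traffic split, reward signal, and engagement rates are invented numbers, not Roblox's actual system.

```python
# Epsilon-greedy impression seeding: mostly show the best-performing game,
# but reserve a slice of traffic for new or unproven creations.
import random

CATALOG = {"new_game_a": [], "new_game_b": [], "hit_game": []}  # click history

def pick_game(epsilon: float = 0.1) -> str:
    """With probability epsilon, show a random (possibly brand-new) game;
    otherwise show the game with the best observed engagement rate."""
    if random.random() < epsilon or all(not h for h in CATALOG.values()):
        return random.choice(list(CATALOG))
    return max(CATALOG, key=lambda g: sum(CATALOG[g]) / max(len(CATALOG[g]), 1))

for _ in range(1000):
    game = pick_game()
    engaged = random.random() < (0.3 if game == "hit_game" else 0.1)  # made-up rates
    CATALOG[game].append(1 if engaged else 0)

for game, hist in CATALOG.items():
    rate = sum(hist) / max(len(hist), 1)
    print(game, len(hist), "impressions,", f"{rate:.2f} engagement rate")
```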

Bilawal Sidhu: I love that. Yeah, the volume of content is going to go up, but then you have to connect it with audiences and people on the other end to experience it. And I was reading a stat that something like 90% of Roblox's engagement originates from algorithmic homepage recommendations. (Yes.) So do you imagine these models getting more and more sophisticated about understanding... obviously you've got models today that can turn a 3D object or a scene into a very detailed text description, the scene graph representation even. Where do we go from here? Are you going to start looking at other signals of how people are engaging? If you had to give, I don't know, a CliffsNotes summary of how recommendation works today and how you imagine it working in the future, what would that look like?

Anupam Singh: Oh, beautiful. I would give you two words that are buried deep in our technical report: scene understanding. Understanding a 3D scene is much harder than it sounds, because our creations are very unique. You have Dress to Impress, you have Pet Simulator; there are lots of amazing creations on our platform, and each one of them is unique. So scene understanding gives us the foundational layer, the Cube layer. That's why we named our model Cube: it tells you much more about a 3D world. Now, moving up the stack, when you think about recommendations and personalization, or you think about safety, this scene understanding changes how we think about personalization in a way many other platforms cannot. So, recommendations, CliffsNotes-wise: you played a game for 10 minutes, and I say, "Oh, Bilawal had great engagement for 10 minutes." Then you go to another game and play it for 30 minutes. Today's models will say the 30-minute game has higher playtime than the 10-minute one, so you obviously had more fun. But there's a chance that you had more joy and fun in the 10 minutes than in the 30, right? And being able to tell that is only possible if I understand the scene in which you placed yourself. So I'm very excited about recommendation, discovery, marketplaces, and safety changing because of this fundamental advance in scene understanding. In a few months, we are going to be able to change a lot of our algorithms to understand scenes.
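
A toy sketch of that point: rank candidates not by raw playtime but by how close their scene embeddings sit to scenes the player actually enjoyed. Random vectors stand in for real scene latents, and the "enjoyed" signal is assumed to come from some richer engagement model.

```python
# Scene-aware ranking sketch: score candidate games by embedding similarity
# to scenes the player enjoyed, instead of by playtime alone.
import numpy as np

rng = np.random.default_rng(1)
candidate_scenes = rng.normal(size=(50, 64))  # 50 games, 64-dim scene latents
enjoyed_scenes = rng.normal(size=(3, 64))     # scenes where the player had fun

def scene_scores(candidates: np.ndarray, enjoyed: np.ndarray) -> np.ndarray:
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    e = enjoyed / np.linalg.norm(enjoyed, axis=1, keepdims=True)
    # Best cosine match against anything the player liked.
    return (c @ e.T).max(axis=1)

print(np.argsort(-scene_scores(candidate_scenes, enjoyed_scenes))[:5])
```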

Bilawal Sidhu: That's super powerful. Yeah, and I can only imagine where you'll go from there. If you have a good understanding of the scene that someone's been navigating, then you can start looking at the interactions of what people are doing within that scene, and on and on it goes. It's the thing that I think makes YouTube unique compared to X. Obviously I love X and Twitter; it's where a bunch of the AI community is. But on X, there are a couple of days where your content gets attention, and then it falls off a cliff. How do you create a search and discovery experience where a lot of interesting content on the platform can continue to get attention? So I'm super excited about that.

Anupam Singh: Even for this conversation, I found your videos to be almost more interesting and engaging than the tweets. The tweets tell me what you are thinking about, but the videos give me deeper context, which is a loaded term in AI right now, but I think that's where we are going. We are such a unique experience for everybody who comes onto our platform. Each one of our creations is unique. We need to reward that uniqueness. We need to recognize that. For that, we need a deeper understanding of the 3D world, which goes back to: if you can tokenize a 3D world, then you can understand it better.

Bilawal Sidhu: Now, you brought me back to a question I was meaning to ask you earlier, so I'll ask it now. You talked about this autoregressive transformer approach, and you just mentioned context windows. Obviously, you need a massive context window to reason over all these different types of data: not just the triangle soup, but the textures, the scripting, the rigging, all that stuff we talked about. I didn't see a public number on this. Are you able to share how big the context window is, and what it will be in the future? How big does it need to be to get these totally unbounded scenes, which seems to be where y'all want to take things next?

Anupam Singh: Yeah, stay tuned for that as we talk more and more about scene generation. Generally I'm very comfortable sharing, but sometimes, you know, I have to think about my responsibility as an exec at the company. But your question is very, very valid.

Bilawal Sidhu: I think it does get exciting. With Google and their one- or two-million-token window, I think a lot of people don't know what to do with that much context. In the domain of text, okay, how many Harry Potter books am I going to pop in? But when you come into our world and we're talking about interactive 3D, we could put that context window to use very, very quickly.

Anupam Singh: 100%. Not just the context window. You know, as the tokens get more and more numerous, tokens are not intelligent, but tokens give us intelligence. I think the scenes can get bigger and bigger and bigger, right? And we'll be talking more about that as we work on the next versions of our model. We are very excited about this question. So, in quotes, hold that thought, and you will hear more from us.

Bilawal Sidhu: Okay, so to your point about distribution, we talked about the YouTube and Twitter analogy. Now I want to make the YouTube and Netflix analogy. In other words, UGC 3D versus AAA 3D, okay? When you look at the video landscape right now, two interesting things are happening. On one hand, you've got top creators on YouTube, like MrBeast, spending a million dollars on every single YouTube video. So budgets are going up, and there are more people like MrBeast; there are almost too many creators, right? On the other hand, the Netflixes of the world, the OTT platforms, are experiencing tremendous downward pressure, where they're like, "Well, we can't spend 10 million on this one movie. Let's green-light 10 movies for a million each instead so they can hit a broader market." When you fast-forward things a little bit, what does this future look like? And really, let's go into speculation territory now. What's going to happen? Are there going to be AAA games that we all play? Is it all going to be UGC? Am I going to come back at the end of a long day and prompt a game into existence that I play with all of my friends? What does that future look like a decade from now?

Anupam Singh: Well, you know, this might be ducking the question, but I would say all of the above. And let me explain that. I should be able to create a game just to play with Bilawal, my friend, located in, let's say, another country. It's a game for only the two of us; it's just fun for the two of us. You know, if you hang out with your college friends, there are certain jokes that will work only between your college friends. Everybody else in your family will be like, "What are they talking about?" Right? That's the minimum. And then the maximum: I know in video we make these distinctions between extraordinarily high-budget content and the content that somebody makes while walking the streets and eating street food. Our thinking is that we are going to improve our engine and our infrastructure so much that so-called high-end content and so-called low-end UGC content will be almost indistinguishable. If the engine is the thing that streams the bits to you on your phone or your device, and infrastructure is the servers that make sure this runs fast, what I am excited about is the AI platform sitting in between, which can enable the engine to look and feel like an amazing, your words, not mine, AAA experience, even when it comes from the two people who built a chess game, right? And for that, infrastructure, AI platform, and engine, all three of them have to work in concert. And that's what we are investing in.

Bilawal Sidhu: I love that you brought up that example too. I think a lot of people may not know that you can make very photorealistic experiences in Roblox. I saw this first-person shooter example recently, and I had to do a triple take. I was like, "What? You can actually do this in Roblox?" It's kind of wild.

Anupam Singh: Yep. And for that, you need these things to work in concert, because we are trying to build a high-quality experience without you downloading the actual game. We truly believe that's possible for 3D content. On an Android phone with 2 gigabytes of memory and a weak network: if we have the right AI, the right investment in our physics engine, and the right infrastructure, these three things will work in concert, and you as a creator or a player do not have to worry about it.

Bilawal Sidhu: Building on that point, it sounds like there's a very unique advantage here, since you own the full stack, so to speak, when it comes to Roblox. You've got the creation tools, the content delivery, the social network, the moderation, and now you've got your own generative AI core and a plurality of other AI models, the AI platform that's doing a bunch of functions. I'm kind of curious: are there certain types of creative risks or experiments, or product experiences, that only Roblox can build? You described one of them, which is ubiquitous distribution of 3D experiences. But is there something super audacious you haven't attempted yet, but you believe this full-stack Roblox approach really positions you well to try?

Anupam Singh: If I think deeply about it, bigger and more complicated worlds can only be enabled by a streaming 3D platform. So let's take the example of Cube. We talked about how our objects have intricate details inside them, right? So what does that mean? Let's say there are 10,000 cars in your view; I created a world where there are 10,000 cars. As you approach a car, because I know that you might enter that car, I might actually quickly fill in the parts. But the other 9,900-some cars are empty, though I know that they are cars. And so, depending on your proximity to an object, I can ascribe more and more intelligence to it and make it more and more functional. Right? And so, I am just humbled by our creators and what they will build with that. We are constantly surprised, by the way. Even when we released Cube, we were just hanging back and seeing what people like you would create on the platform. So that's how I think about it: very complicated, complex worlds can be created. And my personal dream is to build a cricket game with 100,000 people, reproduce some of the games, and then place myself in it as the leading batsman and not somebody else. So that's my personal goal.
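
A minimal sketch of that proximity-based detail idea: every distant car is a cheap shell, and interior parts and behaviors only get attached as the player closes in. The distance thresholds and level names are illustrative, not Roblox's actual streaming logic.

```python
# Proximity-based detail streaming: ascribe more structure to an object the
# closer the player gets. Thresholds are invented for illustration.
import math

def detail_level(player_pos, object_pos) -> str:
    d = math.dist(player_pos, object_pos)
    if d > 100:
        return "shell"            # just "this is a car"
    if d > 10:
        return "exterior_parts"   # doors and wheels exist as separate parts
    return "full"                 # interior modeled, behaviors wired up

cars = [(i * 15.0, 0.0, 0.0) for i in range(10)]  # a row of parked cars
player = (0.0, 0.0, 0.0)
for pos in cars:
    print(pos, "->", detail_level(player, pos))
```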

Bilawal Sidhu: I don't know if you saw this, but there was this tennis open where an Australian network didn't have the streaming rights to show the match, so they turned the live stream into this Wii-tennis-style version. You might have seen it, yeah, so you did see that. I thought that was really, really cool. And I keep imagining, I know Apple and Pixar did something like this with an NFL game, where they basically brought an NFL game into the bedroom from Toy Story, with all the toys coming to life. I could totally imagine that being a thing for cricket. You should do it. And hey, Roblox is probably one of the only platforms that could actually distribute it at scale right now.

Anupam Singh: And I know you've talked about this before in your other talks, but bringing the real world and the 3D world together would be amazing. Everything I say about cricket, or you talk about tennis, is about bringing the real world and the sort of fantastical world together. And that intersection is going to be amazing. Video platforms did that very well: you're walking around talking about street food in Brazil, and somehow I feel like I've been to Brazil with you if I see your video. In a 3D world, I can actually walk with you, which is very different from, "Oh, Bilawal went to this country and he did this." Instead, I'm walking with you, I'm chatting with you, even though you might have gone to that country maybe a year ago. So there's a lot of fantastic stuff that can be enabled, but the tech has to work, right? Because it's very expensive to run 3D. And as somebody who's responsible for infrastructure cost, that keeps coming back.

Bilawal Sidhu: I think that makes total sense. And I share your vision. I used to work on 3D maps at Google, and one of my dreams was that eventually, with Google making these immaculate 3D replicas of reality, you could explore San Francisco and get a guided walking tour, in avatar fashion, exploring maybe this Roblox-ified version of San Francisco, you know?

Anupam Singh: And from your idea, to distributing it, to somebody playing it, that should happen in seconds, not in minutes, not in hours.

Bilawal Sidhu: I love that. I think that's another place where more people in the vibe coding community probably need to go try vibe coding on Roblox. If you've seen some of the most viral vibe coding demos, you've probably seen the flight simulator. Most of the tweets coming out are like, "Holy crap, how do I deal with updating the real-time position of every player?" And Peter Levels is not a game dev; his approach is, crudely, "Oh, every one second I update the XYZ coordinate of every single person." There are so many other optimizations that y'all are taking care of that it's one hell of a canvas to go create an idea and then have it, again, not just work on a high-end MacBook Pro with WebGPU, but on low-end Android devices too. That's amazing.

Anupam Singh: And we will take care of all of that for our creators.

Bilawal Sidhu: I want to talk to you a little bit about your forward-looking stuff and personal reflections. Um, one of the themes I often explore is this idea that technologists and creators, such as yourself, have their own sort of frontier, like the bleeding edge for you and the territories that you're personally excited or maybe even slightly intimidated to venture into next. So as somebody who's like guiding Roblox's AI growth and platform evolution, what's your own personal frontier right now? Stuff that feels almost a little mysterious or especially exciting to you?

Anupam Singh: Oh, thank you for asking that question. I'll tell you what I'm excited about these days. Four years ago, a friend of mine got me into an autonomous vehicle in San Francisco. To me, more and more systems will become autonomous. If you can drive through San Francisco, what else can you do? Can you do debugging on your own? And I know it's getting a little bit polluted because everybody's talking about agents, but when that physical, 5,000-pound object moves through Union Square, I'm thinking it's just day zero. There's a lot that's going to happen. That's what I'm excited about.

Bilawal Sidhu: I'm also curious: what's your advice for the young creator, the independent developer, or even somebody outside traditional game dev hearing our conversation? What advice would you have for them from your own unique vantage point at the edge of AI and, dare I say, the metaverse? How can they best navigate, but also meaningfully contribute to, this frontier as it unfolds and we build the future together?

Anupam Singh: The first one would be: use AI. Just get hands-on experience. And you're a great example. I'm almost intimidated by your ability to take new technology, understand it, and then explain it to people. But I think all of us should try that. It looks and sounds intimidating, right? Understanding NeRFs, understanding Cube, understanding vibe coding. It seems very overwhelming. And yet my advice would be: engage with it. Just engage with the technology. Don't be intimidated by it, and don't believe all the negative stuff around it. Part two of it is: read the papers, not the tweets. Okay? So…

Bilawal Sidhu: Go to the source, the primary source.

Anupam Singh: The primary source, right? As I was preparing for our conversation, I loved watching your long-form videos because they go deep into stuff, rather than, you know, a tweet. A tweet is just informative enough to jump off into something deeper. So I would advise builders who are just getting into technology to go deep into one topic. I spent this week reading our own paper, and I've seen every version of it, and yet I still found something more interesting about scene understanding and how we should think about 3D tokens.

Bilawal Sidhu: I love that. You're totally right. And it was the same experience here, to be honest. When I saw the Twitter coverage of Cube, it was all the usual: "Oh, 3D modelers are cooked. Something, something, Roblox has a new thing," and then shiny visuals. And then I went into it and realized, "Oh, now I get why they're doing autoregressive. Oh my god, they can do text-to-shape and shape-to-text, and that enables scenes. And oh my god, that could enable 4D." You have a lot of these revelations that you don't have unless you go to the source material. So that's very well said.

Anupam Singh: Just go engage with the open source community for Cube. That's what I would ask all your listeners to do: download the model and play with it. We have a Hugging Face application, play with that, or go into Roblox Studio and use our generative AI features, which of course use Cube.

Bilawal Sidhu: I love that. Y'all heard it here: vibe coding is cool, but try vibe coding in Roblox. You've got a lot of the primitives at your disposal to make something interactive, not just cool to screen-record and share on Twitter, but something potentially hundreds of thousands, if not millions, of people will play.

Anupam Singh: Yes. Thank you very much, Bilawal.

Bilawal Sidhu: That is a wrap on our conversation with Anupam Singh. Look, I absolutely enjoyed this, because we journeyed from Roblox's massive infrastructure challenges to the very frontier of AI-driven 3D creation. I find it very clever how they're able to use their own foundational models that tokenize 3D alongside the plurality of large language models out there, which aren't really just large language models anymore; they're more like multimodal large models, pulling from their world knowledge to make it easier for you to create 3D worlds. And where I would have thought, from some of our Blender MCP videos and other things we've talked about, that these models don't fully understand how to create a scene, it turns out that if you just give them a couple of examples of scene graphs, that's enough. That kind of blows my mind. It tells me that few-shot, in-context learning is probably more than enough to give us a true 3D creation workflow. And of course, their vision isn't just static objects; it's fully interactive 4D experiences. And they've got a very interesting dataset to train on. So it is very clear to me: with a platform like Roblox that has ubiquitous distribution already, where the vast majority of users are playing user-generated content, if you can unlock creation, we might see the rise of 3D in a similar fashion to the rise of short-form video today. As that barrier to entry got commoditized, instead of being a YouTuber with a big camera, a microphone, and a bunch of editing, you just use the phone in your pocket. And we're increasingly headed towards a future where you can literally describe the kind of world and interactivity you want and iterate with these systems to generate it automatically. Anyway, I hope you enjoyed this conversation. It's a little bit different from the kind of content I do on this channel. If you did like this interview, let me know who you'd like to see me interview next. I want to carve out time and space to talk to the people who are actually building the future we talk about every single week. With that, Bilawal signing out, and I'll see y'all in the next one. Cheers.