Microsoft's TRELLIS: A high-quality 3D asset generation model

December 15, 2024

Microsoft has recently proposed TRELLIS, a generative method for creating high-quality 3D assets. It is built on a unified Structured LATent (SLAT) representation and Rectified Flow Transformers, which together make 3D generation flexible and efficient.

Core of the paper

  1. Unified Structured LATent Representation (SLAT)

    • SLAT combines a sparse 3D grid with dense multi-view visual features extracted from a vision foundation model.
    • It captures both geometric structure and texture information, and it can be decoded into several formats, including Radiance Fields, 3D Gaussians, and Meshes.
    • The output format can therefore be chosen to suit the downstream use.

  2. Powerful generative model architecture

    • Uses Rectified Flow Transformers tailored to SLAT as the core model.
    • Trained on a large-scale dataset of about 500,000 diverse 3D objects, with models of up to 2 billion parameters.

  3. Flexible generation and editing capabilities

    • Generates high-quality 3D assets from text or image inputs, with results the paper reports as significantly better than existing methods.
    • Offers a choice of output format and local 3D editing, which the paper says earlier models did not provide.

  4. Innovative application scenarios

    • Generated 3D assets can be used for complex artistic designs, asset variant generation, and precise manipulation of local areas.
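The rectified flow objective behind the generative model can be illustrated with a small sketch. It assumes the common linear path x_t = (1 - t) * x0 + t * eps between a data sample x0 and Gaussian noise eps, with the model trained to predict the constant velocity eps - x0. The function names and the toy 3-dimensional "latent" are hypothetical, and the real model operates on SLAT tensors with a transformer rather than plain lists.

```python
import random


def rectified_flow_pair(x0, t, rng):
    """Return (x_t, target_velocity) for one sample on a linear noise path.

    Assumed convention: t = 0 is clean data, t = 1 is pure Gaussian noise.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]
    velocity = [e - a for a, e in zip(x0, eps)]  # d(x_t)/dt, constant in t
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]            # toy stand-in for one latent vector
x_t, v = rectified_flow_pair(x0, 0.3, rng)
zero_guess = [0.0] * len(x0)     # what an untrained model might output
print("loss for a zero prediction:", mse(zero_guess, v))
```

Because the path is a straight line, stepping back from x_t along the target velocity by t recovers x0, which is what makes sampling with few integration steps practical.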

    Key features and demonstrations

    Text-to-3D asset generation

    Image-to-3D asset generation

    Asset variant generation

    Local area manipulation

    Method overview: SLAT and TRELLIS

    Structured LATent Representation (SLAT)

    SLAT combines sparse structures with visual representations:

    • Local latent variables are defined on the active voxels, that is, the voxels that intersect the object's surface.
    • Each local latent encodes features from densely rendered multi-view images, extracted by a powerful pre-trained vision encoder.
    • The active voxels provide the coarse geometry, while the visual features capture fine geometry and texture details.
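As a rough illustration of the data structure described above, the sketch below stores a structured latent as a mapping from active voxel coordinates to a feature vector. It assumes surface points in the unit cube [0, 1)^3 and uses per-point feature vectors as a stand-in for the multi-view image features; averaging per voxel is also an assumption made here for simplicity. The names `voxel_index` and `build_structured_latent` are hypothetical.

```python
from collections import defaultdict


def voxel_index(point, resolution):
    """Map a point in [0, 1)^3 to integer voxel coordinates on a cubic grid."""
    return tuple(min(int(c * resolution), resolution - 1) for c in point)


def build_structured_latent(points, features, resolution):
    """Group per-point features by voxel and average them.

    Only voxels that contain at least one surface point appear in the
    result, so the dictionary keys are exactly the "active" voxels.
    """
    buckets = defaultdict(list)
    for point, feature in zip(points, features):
        buckets[voxel_index(point, resolution)].append(feature)
    return {
        idx: [sum(column) / len(group) for column in zip(*group)]
        for idx, group in buckets.items()
    }


surface_points = [(0.10, 0.10, 0.10), (0.12, 0.10, 0.10), (0.90, 0.50, 0.50)]
point_features = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
slat = build_structured_latent(surface_points, point_features, resolution=4)
for coords, latent in sorted(slat.items()):
    print(coords, latent)
```

A 4 x 4 x 4 grid has 64 cells, but only two are stored here. That sparsity is the point: the set of keys carries the coarse shape, and the values carry the appearance detail.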

    TRELLIS model architecture

    1. Two-stage generation pipeline

      • Stage one generates the sparse structure of SLAT.
      • Stage two generates the latent variables for the non-empty cells.

    2. Rectified Flow Transformer

      • Serves as the backbone model and is adapted to handle the sparsity of SLAT.

    3. Multi-format output and editing

      • Separate decoders map SLAT into high-quality 3D representations, such as Radiance Fields, 3D Gaussians, or Meshes, to meet different requirements.
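The two-stage pipeline can be sketched as a toy program. This is not the TRELLIS implementation: the trained transformers are replaced by a hypothetical `toy_velocity` that already knows its target, the text or image conditioning is omitted, and the "surface" is simply the plane i == 0 of a tiny grid. What the sketch does show is the control flow: integrate a flow from noise to get occupancy, keep the non-empty cells, then integrate a second flow to get one latent per active cell.

```python
import random


def euler_sample(velocity, x, steps):
    """Integrate dx/dt = velocity(x, t) from t = 1 (noise) down toward t = 0."""
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target):
    """Stand-in for a trained model: exact straight-line velocity to `target`."""
    return lambda x, t: [(xi - ti) / t for xi, ti in zip(x, target)]


def generate(resolution=4, channels=2, steps=10, seed=0):
    rng = random.Random(seed)
    cells = [(i, j, k) for i in range(resolution)
             for j in range(resolution) for k in range(resolution)]

    # Stage 1: sample an occupancy value per cell, then threshold it.
    target_occ = [1.0 if c[0] == 0 else -1.0 for c in cells]
    noise = [rng.gauss(0.0, 1.0) for _ in cells]
    occupancy = euler_sample(toy_velocity(target_occ), noise, steps)
    active = [c for c, o in zip(cells, occupancy) if o > 0.0]

    # Stage 2: sample a latent vector only for the non-empty cells.
    slat = {}
    for c in active:
        target_latent = [(c[1] + ch) / resolution for ch in range(channels)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(channels)]
        slat[c] = euler_sample(toy_velocity(target_latent), noise, steps)
    return slat


slat = generate()
print(len(slat), "active voxels out of", 4 ** 3)
```

Running stage two only on active cells is what keeps the cost tied to the surface area of the object rather than to the full volume of the grid.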

    Applications

    I tried it on Hugging Face, and the results are decent. For commercial use, however, the controllability still falls short.

    ABOUT THE AUTHOR

    Renee's Entrepreneurial Journey, Essay Editor

    This is my little corner of the internet where I share thoughts, ideas, and interesting stuff I come across in the world of AI. Things in this field move fast, and I use this space to slow down a bit—to reflect, explore, and hopefully spark some good conversations.
