Single View to 3D
16-825 Learning for 3D: Assignment 2
1. Exploring loss functions
Run `python fit_data.py --type all` to fit a voxel grid, a point cloud, and a mesh to the target data, and to create GIF visualizations of the optimized source data and the target data.
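Conceptually, each fitting run treats the source representation itself as the set of parameters and optimizes it directly against the fixed target. Below is a rough sketch of that loop for the voxel case; the actual structure of fit_data.py may differ, and `voxel_tgt` stands in for the loaded target grid (the `voxel_loss` it calls is sketched in 1.1 below).

```python
import torch

# Rough sketch (assumption): the source voxel grid itself is the parameter
# being optimized against a fixed target occupancy grid `voxel_tgt` (1, 32, 32, 32).
voxel_src = torch.rand(1, 32, 32, 32, requires_grad=True)
optimizer = torch.optim.Adam([voxel_src], lr=1e-2)

for step in range(2000):
    loss = voxel_loss(voxel_src, voxel_tgt)  # see the loss sketch in 1.1 below
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```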
1.1. Fitting a voxel grid (5 points)
| Source (42) | Source (712) | Source (749) |
|---|---|---|
| ![]() | ![]() | ![]() |
| Target (42) | Target (712) | Target (749) |
| ![]() | ![]() | ![]() |
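For the voxel fit, the natural objective is a per-cell binary cross-entropy between the optimized source grid and the target occupancy. Below is a minimal sketch, assuming the source holds raw logits and the target is a {0, 1} float grid; the exact implementation in losses.py may differ slightly.

```python
import torch.nn.functional as F

def voxel_loss(voxel_src, voxel_tgt):
    # voxel_src: (B, D, H, W) raw occupancy logits being optimized
    # voxel_tgt: (B, D, H, W) binary target occupancy as floats
    return F.binary_cross_entropy_with_logits(voxel_src, voxel_tgt)
```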
1.2. Fitting a point cloud (5 points)
| Source (42) | Source (712) | Source (749) |
|---|---|---|
| ![]() | ![]() | ![]() |
| Target (42) | Target (712) | Target (749) |
| ![]() | ![]() | ![]() |
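For the point cloud fit, the standard objective is the (squared) chamfer distance between the two point sets. Below is a minimal, naive O(N·M) sketch using `torch.cdist`; an efficient implementation would typically use `pytorch3d.ops.knn_points`, and the exact version used here may differ.

```python
import torch

def chamfer_loss(point_src, point_tgt):
    # point_src: (B, N, 3), point_tgt: (B, M, 3)
    d = torch.cdist(point_src, point_tgt)        # (B, N, M) pairwise distances
    src_to_tgt = d.min(dim=2).values.pow(2)      # nearest target for each source point
    tgt_to_src = d.min(dim=1).values.pow(2)      # nearest source for each target point
    return src_to_tgt.mean() + tgt_to_src.mean()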
1.3. Fitting a mesh (5 points)
| Source (42) | Source (712) | Source (749) |
|---|---|---|
| ![]() | ![]() | ![]() |
| Target (42) | Target (712) | Target (749) |
| ![]() | ![]() | ![]() |
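For the mesh fit, a common objective combines the chamfer loss above (computed on points sampled from the current mesh surface) with a Laplacian smoothness regularizer on the vertices. Here is a sketch using PyTorch3D with illustrative weights; the exact weighting used for these results may differ.

```python
from pytorch3d.loss import mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_fitting_loss(mesh_src, point_tgt, w_chamfer=1.0, w_smooth=0.1):
    # Compare points sampled from the deforming mesh to the target cloud,
    # and penalize a rough vertex Laplacian; weights are illustrative.
    points_src = sample_points_from_meshes(mesh_src, num_samples=point_tgt.shape[1])
    chamfer = chamfer_loss(points_src, point_tgt)             # from the 1.2 sketch above
    smooth = mesh_laplacian_smoothing(mesh_src, method="uniform")
    return w_chamfer * chamfer + w_smooth * smooth
```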
2. Reconstructing 3D from single view
2.1. Image to voxel grid (20 points)
Training & Evaluation command:
```bash
python train_model.py --type 'vox' --lr 4e-5 --max_iter 50000 --num_workers 12 --save_freq 500
python eval_model.py --type 'vox' --load_checkpoint
```

My decoder architecture
```python
import torch.nn as nn

class VoxelDecoder(nn.Module):
    def __init__(self, in_dim=512, out_dim=32):
        super(VoxelDecoder, self).__init__()
        # First decode to 256 * 4 * 4 * 4
        self.fc = nn.Linear(in_dim, 256 * 4 * 4 * 4)

        def up_block(in_channels, out_channels):
            return nn.Sequential(
                nn.ConvTranspose3d(in_channels, out_channels, kernel_size=4, stride=2, padding=1),  # (B, out_channels, 2D, 2H, 2W)
                nn.BatchNorm3d(out_channels),
                nn.ReLU(inplace=True),
            )

        self.conv_blocks = nn.Sequential(
            up_block(256, 128),  # (B, 128, 8, 8, 8)
            up_block(128, 64),   # (B, 64, 16, 16, 16)
            up_block(64, 32),    # (B, 32, 32, 32, 32)
        )
        self.head = nn.Conv3d(32, 1, kernel_size=3, padding=1)  # (B, 1, 32, 32, 32)

    def forward(self, x):
        x = self.fc(x)                 # (B, 256*4*4*4)
        x = x.view(-1, 256, 4, 4, 4)   # (B, 256, 4, 4, 4)
        x = self.conv_blocks(x)        # (B, 32, 32, 32, 32)
        x = self.head(x).squeeze(1)    # (B, 32, 32, 32)
        return x
```

Visualizations
Below are some visualizations of the input RGB image, predicted voxel grid and ground truth voxel grid.
| Input RGB | Predicted Voxel Grid | Ground Truth Voxel Grid |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.2. Image to point cloud (20 points)
Training & Evaluation command:
```bash
python train_model.py --type 'point' --lr 4e-5 --max_iter 50000 --num_workers 12 --save_freq 500 --n_points 4096
python eval_model.py --type 'point' --load_checkpoint
```

My decoder architecture
```python
class PointDecoder(nn.Module):
    def __init__(self, in_dim=512, n_point=1024):
        super(PointDecoder, self).__init__()
        self.n_point = n_point
        self.fc_layers = nn.Sequential(
            nn.Linear(in_dim, 1024),
            nn.LeakyReLU(inplace=True),
            nn.Linear(1024, 2048),
            nn.LeakyReLU(inplace=True),
            nn.Linear(2048, self.n_point * 3),
            nn.Tanh(),  # output in range [-1, 1]
        )

    def forward(self, x):
        x = self.fc_layers(x)            # (B, n_point*3)
        x = x.view(-1, self.n_point, 3)  # (B, n_point, 3)
        return x
```

Visualizations
Below are some visualizations of the input RGB image, predicted point cloud and ground truth point cloud.
| Input RGB | Predicted Point Cloud | Ground Truth Point Cloud |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.3. Image to mesh (20 points)
Training & Evaluation command:
```bash
python train_model.py --type 'mesh' --lr 4e-5 --max_iter 50000 --num_workers 12 --save_freq 500 --n_points 4096
python eval_model.py --type 'mesh' --load_checkpoint
```

My decoder architecture
My mesh decoder uses the same architecture as the point decoder above. The only difference is that the number of output points is set to the number of vertices in the initial mesh, `len(mesh_pred.verts_list()[0])`, where `mesh_pred` is created by the mesh initialization function `ico_sphere(4, self.device)`.
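As a rough illustration of how such a decoder output can be wired to the initial sphere, here is a minimal sketch. It assumes the predictions are applied as per-vertex offsets via `offset_verts`; `image_features` and `device` are hypothetical placeholders, and the actual model code may be organized differently.

```python
import torch
from pytorch3d.utils import ico_sphere

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumption: reuse the PointDecoder above with one output point per vertex,
# and treat its output as per-vertex offsets that deform the initial sphere.
mesh_init = ico_sphere(4, device)                     # 2562 vertices, 5120 faces
n_verts = len(mesh_init.verts_list()[0])
decoder = PointDecoder(in_dim=512, n_point=n_verts).to(device)

image_features = torch.randn(2, 512, device=device)   # hypothetical encoder output (B, 512)
deform = decoder(image_features)                       # (B, n_verts, 3)
mesh_pred = mesh_init.extend(deform.shape[0]).offset_verts(deform.reshape(-1, 3))
```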
Visualizations
Below are some visualizations of the input RGB image, predicted mesh and ground truth mesh.
| Input RGB | Predicted Mesh | Ground Truth Mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.4. Quantitative comparisons (10 points)
F1 Score Comparison
| Model Type | Voxel Grid | Point Cloud | Mesh |
|---|---|---|---|
| Avg F1@0.05 | 69.095 | 83.637 | 76.656 |
F1 Score Curve

Intuitive Explanation
From the score and curve above, we can see that the point cloud model achieves the best F1 score, followed by the mesh model, and finally the voxel grid model. This is because the point cloud representation lets the model place each point freely. In contrast, the mesh model learns an offset for each vertex, but the vertices are connected by faces, so the model has to implicitly learn surface smoothness through the loss functions, which is more challenging. The voxel grid model is the most constrained, as it has to predict occupancy for each cell of a fixed grid, which makes it difficult to capture fine details and complex geometries.
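For reference, here is a minimal sketch of how an F1 score at a distance threshold (here 0.05) can be computed between sampled predicted and ground-truth points, following the thresholded precision/recall formulation; the exact evaluation code may differ.

```python
import torch

def f1_at_threshold(pred_pts, gt_pts, tau=0.05):
    # pred_pts: (N, 3), gt_pts: (M, 3) sampled point sets
    d = torch.cdist(pred_pts, gt_pts)                        # (N, M) pairwise distances
    precision = (d.min(dim=1).values < tau).float().mean()   # predicted points near some GT point
    recall = (d.min(dim=0).values < tau).float().mean()      # GT points near some predicted point
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)
```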
2.5. Analyse effects of hyperparameter variations (10 points)
Different types of initial mesh
I implemented a cube and a cylinder as alternative initial meshes for the mesh decoder. Below are visualizations of each initial mesh, followed by a comparison of the input RGB image, the predicted mesh, and the ground truth mesh for the different initializations.
Training & Evaluation commands:

```bash
python train_model.py --type 'mesh' --init_mesh 'cube' --lr 1e-4 --max_iter 50000 --num_workers 12 --save_freq 5000
python eval_model.py --type 'mesh' --init_mesh 'cube' --load_checkpoint
python train_model.py --type 'mesh' --init_mesh 'cylinder' --lr 1e-4 --max_iter 50000 --num_workers 12 --save_freq 5000
python eval_model.py --type 'mesh' --init_mesh 'cylinder' --load_checkpoint
```

Initial Mesh Visualizations
The cube is tested with 4 levels of subdivision, and the cylinder with 3 levels. The number of vertices and faces for each initial mesh is as follows:
| ico_sphere(4) | cube | cylinder |
|---|---|---|
| ![]() | ![]() | ![]() |
| 2562 vertices | 1538 vertices | 4098 vertices |
| 5120 faces | 3072 faces | 8192 faces |
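For context, here is a hedged sketch of how such a subdivided initial mesh could be constructed with PyTorch3D; the actual `--init_mesh` implementation may differ. With 4 subdivision levels, a 12-triangle cube reaches 12 · 4⁴ = 3072 faces and 1538 vertices, matching the counts above. `make_cube_mesh` is an illustrative helper name.

```python
import torch
from pytorch3d.structures import Meshes
from pytorch3d.ops import SubdivideMeshes

def make_cube_mesh(levels=4, device="cpu"):
    # A unit cube as 12 triangles, subdivided `levels` times
    # (12 * 4**4 = 3072 faces and 1538 vertices for levels=4).
    verts = torch.tensor(
        [[-1, -1, -1], [ 1, -1, -1], [ 1,  1, -1], [-1,  1, -1],
         [-1, -1,  1], [ 1, -1,  1], [ 1,  1,  1], [-1,  1,  1]],
        dtype=torch.float32, device=device)
    faces = torch.tensor(
        [[0, 2, 1], [0, 3, 2],   # bottom
         [4, 5, 6], [4, 6, 7],   # top
         [0, 1, 5], [0, 5, 4],   # front
         [2, 3, 7], [2, 7, 6],   # back
         [1, 2, 6], [1, 6, 5],   # right
         [3, 0, 4], [3, 4, 7]],  # left
        dtype=torch.int64, device=device)
    mesh = Meshes(verts=[verts], faces=[faces])
    for _ in range(levels):
        mesh = SubdivideMeshes()(mesh)  # each pass quadruples the face count
    return mesh
```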
Visualizations using different initial meshes
From the visualizations below, we can see that the meshes predicted from different initial meshes (ico_sphere, cube, cylinder) are quite similar and close to the ground truth mesh. This indicates that the model is robust to the choice of initial mesh as long as the number of vertices is sufficient, and can effectively learn to deform different initial shapes to match the target shape.
| Input RGB | Predicted Mesh (ico_sphere) | Predicted Mesh (Cube) | Predicted Mesh (Cylinder) | Ground Truth Mesh |
|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
F1 Score Comparison
From the table below, we can see a trend: under the same training settings, the more vertices the initial mesh has, the better the F1 score.
| Initial Mesh | ico_sphere(4) | Cube | Cylinder |
|---|---|---|---|
| Avg F1@0.05 | 76.656 | 75.893 | 78.595 |
Visualizations using different levels (iterations of mesh face subdivision)
Initially I set the level to 1 (which means no subdivision), so the cube mesh has only 8 vertices; I then realized that this is not enough to learn a complex mesh structure. Below are some visualizations of the input RGB image and the predicted mesh at different levels. The level-1 predicted mesh is very coarse since no additional vertices are added.
| Input RGB | Predicted Mesh (Cube level=1) | Predicted Mesh (Cube level=4) |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.6. Interpret your model (15 points)
```bash
python analyze_voxel_decoder.py
```

I visualized the feature maps from the second and third convolutional blocks of the voxel decoder, as well as the logits right before the output layer. For a selected sample, I visualized both the mean feature map and the most strongly activated channel from each layer, and rendered 2D slices of these feature volumes along the depth, height, and width axes. As we go deeper into the network, the silhouette of the object becomes clearer, which indicates that the model is learning to focus on the shape of the object. The mean feature maps, however, do not show the object's shape very clearly, which suggests that the model relies on specific channels to capture the features that matter for reconstruction.



3. Exploring other architectures / datasets. (Choose at least one! More than one is extra credit)
3.3 Extended dataset for training (10 points)
In the following experiments, I trained the point cloud prediction model on the extended dataset.
F1 Score Comparison
The table below shows the F1 score comparison across different training and evaluation settings using the point cloud decoder.
| Train -> Eval | 1-Class -> Same Class | 3-Class -> 1 Class | 3-Class -> 3 Classes | 1-Class -> 3 Classes |
|---|---|---|---|---|
| Avg F1@0.05 | 83.637 | 84.720 | 91.496 | 76.734 |
Comparing the first two settings (1-Class -> Same Class vs. 3-Class -> 1 Class), we see a slight improvement in F1 score when the model is trained on multiple classes, even when it is evaluated on a single class. This suggests that multi-class training enhances the model's generalization ability. However, I noticed that the chair class consistently yields a lower F1 score than the other two classes, which lowers the overall average; this implies that the chair class has more structural diversity and is harder for the model to learn accurately. As for the last two settings (3-Class -> 3 Classes vs. 1-Class -> 3 Classes), the one-class model performs poorly when evaluated on multiple classes, which is expected since it tends to predict all samples as chairs; consequently, the precision and recall for the other categories are significantly reduced.
Visualization Comparison
The qualitative results below illustrate the same trend. The 1-class model tends to reconstruct all objects as chairs, showing poor category discrimination. In contrast, the 3-class model can better capture object-specific geometry and produce more accurate and distinct reconstructions across categories.
| Input RGB | Ground Truth Point Cloud | Predicted Point Cloud (1-Class) | Predicted Point Cloud (3-Class) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |