Single View to 3D
16-825 Learning for 3D: Assignment 2
1. Exploring loss functions
Run `python fit_data.py --type all` to fit a voxel grid, a point cloud, and a mesh to the target data, and to create GIF visualizations of the optimized source data and the target data.
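Conceptually, each fitting run treats the source representation itself as the set of parameters and optimizes it directly against the fixed target. Below is a rough sketch of that loop for the voxel case; the actual structure of fit_data.py may differ, and `voxel_tgt` stands in for the loaded target grid (the `voxel_loss` it calls is sketched in 1.1 below).

```python
import torch

# Rough sketch (assumption): the source voxel grid itself is the parameter
# being optimized against a fixed target occupancy grid `voxel_tgt` (1, 32, 32, 32).
voxel_src = torch.rand(1, 32, 32, 32, requires_grad=True)
optimizer = torch.optim.Adam([voxel_src], lr=1e-2)

for step in range(2000):
    loss = voxel_loss(voxel_src, voxel_tgt)  # see the loss sketch in 1.1 below
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```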
1.1. Fitting a voxel grid (5 points)
| Source (42) | Source (712) | Source (749) |
|---|---|---|
| ![]() | ![]() | ![]() |
| Target (42) | Target (712) | Target (749) |
| ![]() | ![]() | ![]() |
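For the voxel fit, the natural objective is a per-cell binary cross-entropy between the optimized source grid and the target occupancy. Below is a minimal sketch, assuming the source holds raw logits and the target is a {0, 1} float grid; the exact implementation in losses.py may differ slightly.

```python
import torch.nn.functional as F

def voxel_loss(voxel_src, voxel_tgt):
    # voxel_src: (B, D, H, W) raw occupancy logits being optimized
    # voxel_tgt: (B, D, H, W) binary target occupancy as floats
    return F.binary_cross_entropy_with_logits(voxel_src, voxel_tgt)
```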
1.2. Fitting a point cloud (5 points)
| Source (42) | Source (712) | Source (749) |
|---|---|---|
| ![]() | ![]() | ![]() |
| Target (42) | Target (712) | Target (749) |
| ![]() | ![]() | ![]() |
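For the point cloud fit, the standard objective is the (squared) chamfer distance between the two point sets. Below is a minimal, naive O(N·M) sketch using `torch.cdist`; an efficient implementation would typically use `pytorch3d.ops.knn_points`, and the exact version used here may differ.

```python
import torch

def chamfer_loss(point_src, point_tgt):
    # point_src: (B, N, 3), point_tgt: (B, M, 3)
    d = torch.cdist(point_src, point_tgt)        # (B, N, M) pairwise distances
    src_to_tgt = d.min(dim=2).values.pow(2)      # nearest target for each source point
    tgt_to_src = d.min(dim=1).values.pow(2)      # nearest source for each target point
    return src_to_tgt.mean() + tgt_to_src.mean()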
1.3. Fitting a mesh (5 points)
| Source (42) | Source (712) | Source (749) |
|---|---|---|
| ![]() | ![]() | ![]() |
| Target (42) | Target (712) | Target (749) |
| ![]() | ![]() | ![]() |
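For the mesh fit, a common objective combines the chamfer loss above (computed on points sampled from the current mesh surface) with a Laplacian smoothness regularizer on the vertices. Here is a sketch using PyTorch3D with illustrative weights; the exact weighting used for these results may differ.

```python
from pytorch3d.loss import mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_fitting_loss(mesh_src, point_tgt, w_chamfer=1.0, w_smooth=0.1):
    # Compare points sampled from the deforming mesh to the target cloud,
    # and penalize a rough vertex Laplacian; weights are illustrative.
    points_src = sample_points_from_meshes(mesh_src, num_samples=point_tgt.shape[1])
    chamfer = chamfer_loss(points_src, point_tgt)             # from the 1.2 sketch above
    smooth = mesh_laplacian_smoothing(mesh_src, method="uniform")
    return w_chamfer * chamfer + w_smooth * smooth
```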
2. Reconstructing 3D from single view
2.1. Image to voxel grid (20 points)
Training & Evaluation command:
```bash
python train_model.py --type 'vox' --lr 4e-5 --max_iter 50000 --num_workers 12 --save_freq 500
python eval_model.py --type 'vox' --load_checkpoint
```

My decoder architecture
```python
import torch.nn as nn

class VoxelDecoder(nn.Module):
    def __init__(self, in_dim=512, out_dim=32):
        super(VoxelDecoder, self).__init__()
        # First decode to 256 * 4 * 4 * 4
        self.fc = nn.Linear(in_dim, 256 * 4 * 4 * 4)

        def up_block(in_channels, out_channels):
            return nn.Sequential(
                nn.ConvTranspose3d(in_channels, out_channels, kernel_size=4, stride=2, padding=1),  # (B, out_channels, 2D, 2H, 2W)
                nn.BatchNorm3d(out_channels),
                nn.ReLU(inplace=True),
            )

        self.conv_blocks = nn.Sequential(
            up_block(256, 128),  # (B, 128, 8, 8, 8)
            up_block(128, 64),   # (B, 64, 16, 16, 16)
            up_block(64, 32),    # (B, 32, 32, 32, 32)
        )
        self.head = nn.Conv3d(32, 1, kernel_size=3, padding=1)  # (B, 1, 32, 32, 32)

    def forward(self, x):
        x = self.fc(x)                 # (B, 256*4*4*4)
        x = x.view(-1, 256, 4, 4, 4)   # (B, 256, 4, 4, 4)
        x = self.conv_blocks(x)        # (B, 32, 32, 32, 32)
        x = self.head(x).squeeze(1)    # (B, 32, 32, 32)
        return x
```

Visualizations
Below are some visualizations of the input RGB image, predicted voxel grid and ground truth voxel grid.
| Input RGB | Predicted Voxel Grid | Ground Truth Voxel Grid |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.2. Image to point cloud (20 points)
Training & Evaluation command:
```bash
python train_model.py --type 'point' --lr 4e-5 --max_iter 50000 --num_workers 12 --save_freq 500 --n_points 4096
python eval_model.py --type 'point' --load_checkpoint
```

My decoder architecture
```python
class PointDecoder(nn.Module):
    def __init__(self, in_dim=512, n_point=1024):
        super(PointDecoder, self).__init__()
        self.n_point = n_point
        self.fc_layers = nn.Sequential(
            nn.Linear(in_dim, 1024),
            nn.LeakyReLU(inplace=True),
            nn.Linear(1024, 2048),
            nn.LeakyReLU(inplace=True),
            nn.Linear(2048, self.n_point * 3),
            nn.Tanh(),  # output in range [-1, 1]
        )

    def forward(self, x):
        x = self.fc_layers(x)            # (B, n_point*3)
        x = x.view(-1, self.n_point, 3)  # (B, n_point, 3)
        return x
```

Visualizations
Below are some visualizations of the input RGB image, predicted point cloud and ground truth point cloud.
| Input RGB | Predicted Point Cloud | Ground Truth Point Cloud |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.3. Image to mesh (20 points)
Training & Evaluation command:
```bash
python train_model.py --type 'mesh' --lr 4e-5 --max_iter 50000 --num_workers 12 --save_freq 500 --n_points 4096
python eval_model.py --type 'mesh' --load_checkpoint
```

My decoder architecture
My mesh decoder uses the same architecture as the point decoder above. The only difference is that the number of output points is set to the number of vertices in the initial mesh, `len(mesh_pred.verts_list()[0])`, where `mesh_pred` is created by the mesh initialization function `ico_sphere(4, self.device)`.
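As a rough illustration of how such a decoder output can be wired to the initial sphere, here is a minimal sketch. It assumes the predictions are applied as per-vertex offsets via `offset_verts`; `image_features` and `device` are hypothetical placeholders, and the actual model code may be organized differently.

```python
import torch
from pytorch3d.utils import ico_sphere

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumption: reuse the PointDecoder above with one output point per vertex,
# and treat its output as per-vertex offsets that deform the initial sphere.
mesh_init = ico_sphere(4, device)                     # 2562 vertices, 5120 faces
n_verts = len(mesh_init.verts_list()[0])
decoder = PointDecoder(in_dim=512, n_point=n_verts).to(device)

image_features = torch.randn(2, 512, device=device)   # hypothetical encoder output (B, 512)
deform = decoder(image_features)                       # (B, n_verts, 3)
mesh_pred = mesh_init.extend(deform.shape[0]).offset_verts(deform.reshape(-1, 3))
```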
Visualizations
Below are some visualizations of the input RGB image, predicted mesh and ground truth mesh.
| Input RGB | Predicted Mesh | Ground Truth Mesh |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.4. Quantitative comparisons (10 points)
F1 Score Comparison
| Model Type | Voxel Grid | Point Cloud | Mesh |
|---|---|---|---|
| Avg F1@0.05 | 69.095 | 83.637 | 76.656 |
F1 Score Curve

Intuitive Explanation
From the score and curve above, we can see that the point cloud model achieves the best F1 score, followed by the mesh model, and finally the voxel grid model. This is because the point cloud representation lets the model place each point freely. In contrast, the mesh model learns an offset for each vertex, but the vertices are connected by faces, so the model has to implicitly learn surface smoothness through the loss functions, which is more challenging. The voxel grid model is the most constrained, as it has to predict occupancy for each cell of a fixed grid, which makes it difficult to capture fine details and complex geometries.
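For reference, here is a minimal sketch of how an F1 score at a distance threshold (here 0.05) can be computed between sampled predicted and ground-truth points, following the thresholded precision/recall formulation; the exact evaluation code may differ.

```python
import torch

def f1_at_threshold(pred_pts, gt_pts, tau=0.05):
    # pred_pts: (N, 3), gt_pts: (M, 3) sampled point sets
    d = torch.cdist(pred_pts, gt_pts)                        # (N, M) pairwise distances
    precision = (d.min(dim=1).values < tau).float().mean()   # predicted points near some GT point
    recall = (d.min(dim=0).values < tau).float().mean()      # GT points near some predicted point
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)
```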
2.5. Analyse effects of hyperparameter variations (10 points)
Different types of initial mesh
I implemented a cube and a cylinder as alternative initial meshes for the mesh decoder. Below are visualizations of each initial mesh, followed by a comparison of the input RGB image, the predicted mesh, and the ground truth mesh for the different initializations.
Training & Evaluation commands:

```bash
python train_model.py --type 'mesh' --init_mesh 'cube' --lr 1e-4 --max_iter 50000 --num_workers 12 --save_freq 5000
python eval_model.py --type 'mesh' --init_mesh 'cube' --load_checkpoint
python train_model.py --type 'mesh' --init_mesh 'cylinder' --lr 1e-4 --max_iter 50000 --num_workers 12 --save_freq 5000
python eval_model.py --type 'mesh' --init_mesh 'cylinder' --load_checkpoint
```

Initial Mesh Visualizations
The cube is tested with 4 levels of subdivision, and the cylinder with 3 levels. The number of vertices and faces for each initial mesh is as follows:
| ico_sphere(4) | cube | cylinder |
|---|---|---|
| ![]() | ![]() | ![]() |
| 2562 vertices | 1538 vertices | 4098 vertices |
| 5120 faces | 3072 faces | 8192 faces |
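For context, here is a hedged sketch of how such a subdivided initial mesh could be constructed with PyTorch3D; the actual `--init_mesh` implementation may differ. With 4 subdivision levels, a 12-triangle cube reaches 12 · 4⁴ = 3072 faces and 1538 vertices, matching the counts above. `make_cube_mesh` is an illustrative helper name.

```python
import torch
from pytorch3d.structures import Meshes
from pytorch3d.ops import SubdivideMeshes

def make_cube_mesh(levels=4, device="cpu"):
    # A unit cube as 12 triangles, subdivided `levels` times
    # (12 * 4**4 = 3072 faces and 1538 vertices for levels=4).
    verts = torch.tensor(
        [[-1, -1, -1], [ 1, -1, -1], [ 1,  1, -1], [-1,  1, -1],
         [-1, -1,  1], [ 1, -1,  1], [ 1,  1,  1], [-1,  1,  1]],
        dtype=torch.float32, device=device)
    faces = torch.tensor(
        [[0, 2, 1], [0, 3, 2],   # bottom
         [4, 5, 6], [4, 6, 7],   # top
         [0, 1, 5], [0, 5, 4],   # front
         [2, 3, 7], [2, 7, 6],   # back
         [1, 2, 6], [1, 6, 5],   # right
         [3, 0, 4], [3, 4, 7]],  # left
        dtype=torch.int64, device=device)
    mesh = Meshes(verts=[verts], faces=[faces])
    for _ in range(levels):
        mesh = SubdivideMeshes()(mesh)  # each pass quadruples the face count
    return mesh
```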
Visualizations using different initial meshes
From the visualizations below, we can see that the meshes predicted from different initial meshes (ico_sphere, cube, cylinder) are quite similar and close to the ground truth mesh. This indicates that the model is robust to the choice of initial mesh as long as the number of vertices is sufficient, and can effectively learn to deform different initial shapes to match the target shape.
| Input RGB | Predicted Mesh (ico_sphere) | Predicted Mesh (Cube) | Predicted Mesh (Cylinder) | Ground Truth Mesh |
|---|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
F1 Score Comparison
From the table below, we can see a trend: under the same training settings, the more vertices the initial mesh has, the better the F1 score.
| Initial Mesh | ico_sphere(4) | Cube | Cylinder |
|---|---|---|---|
| Avg F1@0.05 | 76.656 | 75.893 | 78.595 |
Visualizations using different levels (iterations of mesh face subdivision)
Initially I set the level to 1 (which means no subdivision), so the cube mesh has only 8 vertices; I then realized that this is not enough to learn a complex mesh structure. Below are some visualizations of the input RGB image and the predicted mesh at different levels. The level-1 predicted mesh is very coarse since no additional vertices are added.
| Input RGB | Predicted Mesh (Cube level=1) | Predicted Mesh (Cube level=4) |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
2.6. Interpret your model (15 points)
```bash
python analyze_voxel_decoder.py
```

I visualized the feature maps from the second and third convolutional blocks of the voxel decoder, as well as the logits right before the output layer. For a selected sample, I visualized both the mean feature map and the most strongly activated channel from each layer, and rendered 2D slices of these feature volumes along the depth, height, and width axes. As we go deeper into the network, the silhouette of the object becomes clearer, which indicates that the model is learning to focus on the shape of the object. The mean feature maps, however, do not show the object's shape very clearly, which suggests that the model relies on specific channels to capture the features that matter for reconstruction.



3. Exploring other architectures / datasets. (Choose at least one! More than one is extra credit)
3.3 Extended dataset for training (10 points)
In the following experiments, I trained the point cloud prediction model on the extended dataset.
F1 Score Comparison
The table below shows the F1 score comparison across different training and evaluation settings using the point cloud decoder.
| Train -> Eval | 1-Class -> Same Class | 3-Class -> 1 Class | 3-Class -> 3 Classes | 1-Class -> 3 Classes |
|---|---|---|---|---|
| Avg F1@0.05 | 83.637 | 84.720 | 91.496 | 76.734 |
Comparing the first two settings (1-Class -> Same Class vs. 3-Class -> 1 Class), we see a slight improvement in F1 score when the model is trained on multiple classes, even when it is evaluated on a single class. This suggests that multi-class training enhances the model's generalization ability. However, I noticed that the chair class consistently yields a lower F1 score than the other two classes, which lowers the overall average; this implies that the chair class has more structural diversity and is harder for the model to learn accurately. As for the last two settings (3-Class -> 3 Classes vs. 1-Class -> 3 Classes), the one-class model performs poorly when evaluated on multiple classes, which is expected since it tends to predict all samples as chairs; consequently, the precision and recall for the other categories are significantly reduced.
Visualization Comparison
The qualitative results below illustrate the same trend. The 1-class model tends to reconstruct all objects as chairs, showing poor category discrimination. In contrast, the 3-class model can better capture object-specific geometry and produce more accurate and distinct reconstructions across categories.
| Input RGB | Ground Truth Point Cloud | Predicted Point Cloud (1-Class) | Predicted Point Cloud (3-Class) |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() |