Computational Visual Media
https://doi.org/10.1007/s41095-023-0337-5
Vol. 9, No. 4, December 2023, 687–697
Research Article
Neural 3D reconstruction from sparse views using geometric
priors
Tai-Jiang Mu1, Hao-Xiang Chen1, Jun-Xiong Cai1, and Ning Guo2 (✉)
© The Author(s) 2023.
1 BNRist, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China. E-mail: T.-J. Mu, taiijiang@tsinghua.edu.cn; H.-X. Chen, chx20@mails.tsinghua.edu.cn; J.-X. Cai, caijunxiong000@163.com.
2 Academy of Military Sciences, Beijing 100091, China. E-mail: guoning10@nudt.edu.cn (✉).
Manuscript received: 2022-11-25; accepted: 2023-01-31
Abstract Sparse view 3D reconstruction has attracted increasing attention with the development of neural implicit 3D representation. Existing methods usually only make use of 2D views, requiring a dense set of input views for accurate 3D reconstruction. In this paper, we show that accurate 3D reconstruction can be achieved by incorporating geometric priors into neural implicit 3D reconstruction. Our method adopts the signed distance function as the 3D representation, and learns a generalizable 3D surface reconstruction model from sparse views. Specifically, we build a more effective and sparse feature volume from the input views by using corresponding depth maps, which can be provided by depth sensors or directly predicted from the input views. We recover better geometric details by imposing both depth and surface normal constraints in addition to the color loss when training the neural implicit 3D representation. Experiments demonstrate that our method both outperforms state-of-the-art approaches, and achieves good generalizability.
Keywords sparse views; 3D reconstruction; volume rendering; geometric priors; neural implicit 3D representation
1 Introduction
Reconstructing 3D surfaces from sparse 2D views is a classic computer vision problem with various applications in virtual reality, augmented reality, animation, etc. Recently, much progress has been made with the development of neural implicit 3D representation. Unlike traditional methods which directly triangulate explicit 3D surface points by feature matching, neural implicit methods use multi-layer perceptrons (MLPs) to parameterize the underlying scene to be reconstructed using occupancy or signed distance fields. Such an implicit representation is usually learned by imposing photometric consistency via RGB image reconstruction loss through a volume rendering technique.
Solely applying photometric consistency obviously leads to an underconstrained surface reconstruction problem, since many geometries may exist that can reproduce the same colors upon volume rendering. Furthermore, the photometric consistency constraint is less effective in noisy or weakly textured regions. Thus, geometry directly recovered from a photometric consistency based neural implicit representation, such as NeRF [1], usually suffers from over-smoothing and noise. Reconstructing the underlying geometry with better, finer detail requires a dense set of 2D views.
In this paper, we show that a more accurate and detailed geometry can be achieved by incorporating monocular geometric cues when training the neural 3D implicit reconstruction method. Depths and normals are the two most common kinds of monocular geometric cues provided by the local geometry of the underlying surface. These cues can be estimated from monocular images with reasonable quality using deep learning approaches, such as Omnidata [2] and MVSNet [3], or the data can be provided directly by depth sensors. Photometric cues and geometric cues are complementary: depths and normals can help to infer the geometry in textureless regions, while photometric cues can help to enrich details. Thus, in addition to RGB image reconstruction loss, we also impose depth and normal reconstruction loss terms to the volume rendering results.
The geometric cues can also serve as an initial
guess for the underlying geometry, which helps
to build a reliable and sparse feature volume for
the neural implicit 3D representation, making it
easier to train. To obtain accurate reconstruction,
previous methods [4] usually adopted a hierarchical
implicit representation, which first reconstructs coarse
geometry and then recovers details with a finer
resolution. Though such a hierarchical architecture
can be efficient, the final results are still affected
by uncertainty and noise in the coarse geometry.
If depths are available, we can avoid the use of
a hierarchical architecture by directly determining
an initial feature volume from the point cloud
reprojected from the depth maps. This initial
geometry is sparse and has comparable accuracy to
the learned geometry [4], and can also be further
refined. Including geometry priors during training of
the neural implicit representation and determining
the initial geometry directly from the geometric cues
(i) improves the quality of the reconstructed geometry,
(ii) improves the generalizability of the method, and
(iii) leads to faster convergence, giving an overall
better approach for sparse view 3D reconstruction.
In summary, our method makes the following
contributions:
• a general approach exploiting geometric priors
to improve 3D reconstruction quality and
generalization for neural sparse view implicit 3D
reconstruction,
• initial geometric reasoning for neural implicit
surface models, which is more effective and easier
to train, and
• extensive experiments which demonstrate that
our method achieves state-of-the-art sparse view
3D reconstruction.
2 Related work
2.1 Multi-view 3D reconstruction using an explicit representation
Typical multi-view 3D reconstruction is conducted
by first extracting features from 2D views, then
reasoning about the underlying geometry producing
each view, and finally fusing the view geometry to
obtain the final 3D geometry for the whole object
or scene. The fusion process differs depending on
the 3D representation used, such as voxels [5–8], a
point cloud [9, 10], or depths [3, 11–14]. Compared
to traditional methods [15–17] based on hand-crafted
geometric features, current CNN-based methods
are more capable of extracting robust features
and have produced promising results. In particular,
using readily available depth estimation neural
networks [2, 3, 11, 18], depth-based methods [19],
together with dedicated fusion [20–22], can provide
high-quality reconstruction from densely captured
images. However, these methods have shortcomings
when facing image noise or weak textures, and can
fail to recover a complete surface given insufficient
images or sparse views.
2.2 Neural implicit 3D reconstruction
With the success of novel view synthesis using
neural radiance fields (NeRF) [1, 23–27], neural 3D
implicit representations were quickly applied to multiview 3D reconstruction [4, 28–36]. Such methods
usually extract the geometry from the predicted
voxel density, occupancy field, or signed distance
function (SDF). However, the geometry can suffer
from image noise, and can be unreliable given weak
textures or incomplete data, since these methods usually only
impose photometric consistency when learning the
implicit representation. To learn a better surface for
incompletely visible objects, SNeS [37] explored the
use of an object symmetry prior to help recover the
invisible parts. Some recent reconstruction methods
consider depths [38] or normals [39, 40] as geometric
priors to help reconstruction; however, to obtain finely
detailed geometry, these methods usually require a
large number of input views for per-scene optimization, which makes it difficult to generalize
to new scenes.
To increase the generalizability of the networks,
efforts have been made to enable the network
to memorize the input views or geometric cues.
MVSNeRF [41] encodes the input views into a feature
volume for better view synthesis. StereoNeRF [42]
learns stereo correspondences from the input sparse
views. PixelNeRF [43] and IBRNet [44] encode
pixel colors into the network in addition to 3D
coordinates and viewing directions. Depth [45, 46]
and normal [47] priors have also been explored in
neural rendering to better constrain the underlying
geometry. Although these methods can synthesize
plausible images, the extracted geometries can still
suffer from noise, since the whole 3D space is considered in
their representations. Our method, in contrast, can
determine a more accurate, sparse initial geometry
with the help of geometric cues, especially using depth
estimation from the input views.
3 Neural sparse view 3D reconstruction with geometric priors
3.1 Overview
The pipeline of our method is illustrated in Fig. 1.
It uses a volume rendering scheme for surface
reconstruction [4]. Given sparse input views (typically
3–7 views), our method first obtains a depth map
for each view. Then the sparse geometry reasoning
module builds a sparse feature volume from these
depth maps by projecting them back to the world
space to assemble a sparse point cloud of the
underlying geometry. The resulting sparse feature
volume is then fed into the geometry-guided surface
reconstruction module to reconstruct an accurate
surface with fine details, doing so by imposing
geometric constraints based on losses from rendered
depths and normals, in addition to the photometric loss.
3.2 Sparse geometry reasoning with depths
We parameterize the underlying geometry to be
reconstructed using the signed distance function
(SDF), following previous methods [1, 4, 47]. To
efficiently reconstruct the underlying geometry,
previous methods usually exploit a coarse-to-fine
process, first estimating the coarse geometry for
the whole scene, and then only refining details in
occupied (non-empty) regions as determined by the
coarse geometry. While this approach is efficient,
it is not robust to noise and unreliable photometric
consistency, since it usually only uses RGB images to
build the underlying geometry.
Given that depth maps of monocular images can
be reliably and efficiently estimated by deep learning
techniques [2, 3], or easily acquired by depth sensors,
we make use of these depth maps for monocular RGB
images to build a sparse and reliable coarse geometry
for our neural implicit model. Specifically, given $N$
sparse images $I_i$, $i = 0, \ldots, N-1$, with poses $(R_i, t_i)$,
we first determine or obtain corresponding depth
maps $D_i$. These depth maps are then reprojected to
the world coordinate system to produce a composite
point cloud $P = \bigcup_{i=0}^{N-1} \pi_i^{-1}(D_i)$ using the camera
intrinsic parameters and poses of all images $I_i$, where
$\pi_i^{-1}(D_i)$ denotes the reprojected 3D locations of
pixels in depth map $D_i$.
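The reprojection step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it relies on conventions the paper does not state: a pinhole intrinsic matrix `K`, poses mapping world to camera coordinates as x_cam = R x_world + t, depth stored as the z-coordinate in the camera frame, and a depth of zero marking invalid pixels. The function names `backproject` and `fuse_point_cloud` are hypothetical.

```python
import numpy as np

def backproject(depth, K, R, t):
    """Lift one depth map D_i to world-space points (assumed conventions).

    depth: (H, W) z-depth per pixel, 0 where invalid (assumption)
    K:     (3, 3) pinhole intrinsics
    R, t:  world-to-camera pose, x_cam = R @ x_world + t (assumption)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))          # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    pix = pix.reshape(-1, 3).astype(np.float64)              # homogeneous pixels
    z = depth.reshape(-1)
    valid = z > 0
    cam = (np.linalg.inv(K) @ pix[valid].T) * z[valid]       # (3, n) camera frame
    world = R.T @ (cam - t.reshape(3, 1))                    # invert the pose
    return world.T                                           # (n, 3)

def fuse_point_cloud(depths, Ks, Rs, ts):
    """Union of the reprojected points of all views, i.e. the point cloud P."""
    parts = [backproject(d, K, R, t) for d, K, R, t in zip(depths, Ks, Rs, ts)]
    return np.concatenate(parts, axis=0)
```

With real data, the intrinsics and poses would come from the dataset's calibration files, and the pose convention should be checked against them before use.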
To construct a feature volume $G$ for geometry
reasoning, we follow a prior method [3] for multiview
depth estimation to build a cost volume $C$. We first
extract a 2D feature map $F_i$ from each input image
using a 2D feature extraction network, and voxelize
the fused point cloud $P$ into regular voxels with voxel
size $d$. We then project all points $P(v)$ contained in
each voxel $v$ to feature map $F_i$ and determine the
voxel's feature $F_i(v)$ as the average feature:
$$F_i(v) = \operatorname{Avg}_{p \in P(v)} F_i(p) \tag{1}$$
The cost feature volume of a voxel $v$ is thus obtained
by computing the variance of all the projected features
of the voxel to all input views:
$$C(v) = \operatorname{Var}\left(\{F_i(v)\}_{i=0}^{N-1}\right) \tag{2}$$
Fig. 1 Pipeline. Our method takes sparse views as input and reconstructs the underlying 3D surface by (i) reliably determining a coarse
geometry from the depth maps, and (ii) in addition to photometric consistency Lc , imposing both depth consistency Ld and normal consistency
Ln , leading to a more general and accurate framework for the neural implicit model.
Finally, the geometric feature volume $G$ is obtained
by applying a sparse 3D CNN to $C$:
$$G = \mathrm{CNN}_{\mathrm{sparse3d}}(C) \tag{3}$$
Note that this feature volume is inherently sparse
because of the sparsity of the point cloud. To account
for possible errors in depth maps, we also dilate each
voxel by a distance $\delta_d$.
In this way, our method directly reconstructs the
underlying geometry using only one level of implicit
field, i.e., the finest level of previous methods, while
still being as efficient as possible, and achieving more
accurate results.
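As an illustration of Eqs. (1) and (2) and of the dilation step, the following NumPy sketch builds a sparse cost volume from a fused point cloud. It is a simplified reading of the text, not the authors' code: per-point features are assumed to have been sampled from each view's feature map already, the variance is taken as the population variance across views, and dilation is approximated by a cubic neighbourhood measured in voxels. The names `sparse_cost_volume` and `dilate_voxels` are hypothetical, and the sparse 3D CNN of Eq. (3) is omitted.

```python
import numpy as np

def sparse_cost_volume(points, point_feats, voxel_size):
    """Per-voxel variance of view-wise average features.

    points:      (P, 3) fused point cloud
    point_feats: (N, P, C) feature F_i(p) of every point in each of N views
                 (assumed to be sampled from the 2D feature maps beforehand)
    Returns occupied voxel indices (V, 3) and cost features (V, C).
    """
    idx = np.floor(points / voxel_size).astype(np.int64)
    voxels, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                      # point -> voxel id
    num_voxels = len(voxels)
    counts = np.bincount(inverse, minlength=num_voxels).astype(np.float64)
    n_views, _, channels = point_feats.shape
    sums = np.zeros((n_views, num_voxels, channels))
    for i in range(n_views):
        np.add.at(sums[i], inverse, point_feats[i])    # accumulate per voxel
    means = sums / counts[None, :, None]               # Eq. (1): F_i(v)
    cost = means.var(axis=0)                           # Eq. (2): variance over views
    return voxels, cost

def dilate_voxels(voxels, radius):
    """Grow the occupied set by `radius` voxels in a cubic neighbourhood."""
    r = np.arange(-radius, radius + 1)
    offsets = np.stack(np.meshgrid(r, r, r, indexing="ij"), axis=-1).reshape(-1, 3)
    grown = (voxels[:, None, :] + offsets[None, :, :]).reshape(-1, 3)
    return np.unique(grown, axis=0)
```

Only occupied (and dilated) voxels are stored, which is what keeps the feature volume sparse; the memory cost of `dilate_voxels` grows cubically with the radius, so a production version would more likely use a sparse convolution library.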
3.3 Geometry guided surface reconstruction
Following Ref. [4], given a query 3D position $q$, our implicit 3D representation directly predicts its signed distance $\mathrm{SDF}(q)$. Specifically, an MLP network is applied to predict the surface from the interpolated geometric features of $G$, concatenated with $q$'s positional encoding PE:
$$\mathrm{SDF}(q) = \mathrm{MLP}_{\mathrm{sdf}}(\mathrm{PE}(q), G(q)) \tag{4}$$
NeuS [28] used dedicated volume rendering for multiview 3D reconstruction using a neural implicit surface and later was extended to sparse view 3D reconstruction [4]. However, the neural implicit surface is optimized only using photometric supervision, which may suffer from noise and unreliable photometric consistency in textureless regions.
In addition to photometric consistency, we try to exploit complementary geometric priors to reconstruct more accurate and detailed geometries from the sparse input views. Specifically, to render the depth $d(r)$ and normal $n(r)$ as well as the color $c(r)$ of a ray $r$ going through the underlying scene, we first query the depths $d_i$, normals $n_i$, colors $c_i$, and SDF values $s_i$ for all $M$ sampled points $\{p(t_i)\}$ along the ray; then we convert $s_i$ to densities $\sigma_i$ using NeuS [28]:
$$\sigma_i = \max\left(-\frac{\Phi_s'(s_i)}{\Phi_s(s_i)}, 0\right) \tag{5}$$
where $\Phi_s(x) = (1 + e^{-sx})^{-1}$ and $s$ is a learnable parameter. Finally, we combine depths, normals, and colors with the densities to obtain the rendered values:
$$d(r) = \sum_{i=0}^{M-1} T_i \left(1 - \exp(-\delta_i \sigma_i)\right) d_i \tag{6}$$
$$n(r) = \sum_{i=0}^{M-1} T_i \left(1 - \exp(-\delta_i \sigma_i)\right) n_i \tag{7}$$
$$c(r) = \sum_{i=0}^{M-1} T_i \left(1 - \exp(-\delta_i \sigma_i)\right) c_i \tag{8}$$
where $\delta_i = \|p(t_{i+1}) - p(t_i)\|_2$ is the distance between two consecutive sample points and $T_i = \exp\left(-\sum_{j=0}^{i-1} \delta_j \sigma_j\right)$ is the accumulated transmittance. Note that we follow Ref. [4] to calculate the color at each sampled point, which blends the colors of pixels or patches from the input views. $d_i$ is the ray distance from the sampled point to the ray origin and $n_i$ is the spatial gradient of the predicted SDF at the sample point.
We now can train a more accurate neural implicit surface by imposing consistency on depth and normal as well as color with the total loss in Eq. (9):
$$\begin{aligned}
L_{\mathrm{total}} &= L_c + w_n L_n + w_d L_d \\
L_c &= \sum_{r \in R} \|c(r) - \tilde{c}(r)\|_2^2 \\
L_n &= \sum_{r \in R} \|n(r) - \tilde{n}(r)\|_2^2 \\
L_d &= \sum_{r \in R} \|d(r) - \tilde{d}(r)\|_2^2
\end{aligned} \tag{9}$$
where $L_c$, $L_n$, and $L_d$ are the photometric loss, normal loss, and depth loss, respectively, $R$ is the set of all sample rays, $\tilde{c}(r)$, $\tilde{n}(r)$, and $\tilde{d}(r)$ are the ground-truth color, normal, and depth of the sample ray $r$, respectively, and $w_n$ and $w_d$ weight these losses.
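The rendering of Eqs. (5) to (8) and the loss of Eq. (9) can be sketched for a single ray as follows. This is an illustrative NumPy version under stated assumptions, not the authors' Jittor code: the derivative in Eq. (5) is read as the derivative of $\Phi_s$ along the ray (as in NeuS) and approximated by finite differences, the per-sample depth $d_i$ is taken to be the ray parameter $t_i$ (unit ray direction), and the last sample only closes the final interval. The function names and the default sharpness `s=40.0` are hypothetical.

```python
import numpy as np

def render_ray(sdf, t, normals, colors, s=40.0):
    """Render depth, normal and color along one ray.

    sdf:     (M+1,) predicted SDF values s_i at the samples p(t_i)
    t:       (M+1,) increasing ray distances t_i, also used as depths d_i
    normals: (M+1, 3) SDF gradients n_i at the samples
    colors:  (M+1, 3) colors c_i at the samples
    s:       sharpness of the logistic function Phi_s (learnable in the paper)
    """
    phi = 1.0 / (1.0 + np.exp(-s * sdf))             # Phi_s(s_i)
    delta = np.diff(t)                                # delta_i
    dphi_dt = np.diff(phi) / delta                    # derivative along the ray (assumption)
    sigma = np.maximum(-dphi_dt / phi[:-1], 0.0)      # Eq. (5)
    alpha = 1.0 - np.exp(-delta * sigma)              # 1 - exp(-delta_i sigma_i)
    # T_i = exp(-sum_{j<i} delta_j sigma_j) = prod_{j<i} (1 - alpha_j)
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    depth = np.sum(weights * t[:-1])                              # Eq. (6)
    normal = np.sum(weights[:, None] * normals[:-1], axis=0)      # Eq. (7)
    color = np.sum(weights[:, None] * colors[:-1], axis=0)        # Eq. (8)
    return depth, normal, color

def total_loss(pred, gt, w_n=0.9, w_d=0.1):
    """Eq. (9). pred and gt are dicts with 'color' (R, 3), 'normal' (R, 3), 'depth' (R,)."""
    l_c = np.sum((pred["color"] - gt["color"]) ** 2)
    l_n = np.sum((pred["normal"] - gt["normal"]) ** 2)
    l_d = np.sum((pred["depth"] - gt["depth"]) ** 2)
    return l_c + w_n * l_n + w_d * l_d
```

For a ray that meets a plane head-on, for example `sdf = 2.0 - t` on a fine grid of `t`, the weights concentrate around the zero crossing, so the rendered depth lands close to 2.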
4 Experiments and results
4.1 Dataset
We trained our generalizable sparse view 3D
reconstruction model on the DTU multiview stereo
dataset [48], which contains 75 scenes for training
and a further 15 non-overlapping scenes for testing.
We centrally cropped the images to a resolution of
512 × 640 for both training and testing. To train our
network, ground-truth normals were estimated from
the ground-truth point cloud for the underlying scene
provided with the dataset. While depth maps could be
estimated using current learning-based methods [2, 3]
for each view, to ensure depth consistency between
views, we used the ground-truth depth to determine
the initial geometry required by our network for both
training and testing. To account for sparsity and
errors in the depth map, we set the dilation range δd
to 7 voxels. Our network was trained with 6 views,
including 1 reference view and 5 source views.
4.2 Implementation details
We adopted a feature pyramid network [49] as the
multi-scale 2D image feature extraction network and
used a U-net like sparse 3D convolution network [50].
The 3D voxel resolution was set to 192 × 192 × 192 and
the weights in the loss function were set as $w_d = 0.1$
and $w_n = 0.9$. We trained our model for 20k iterations
with an initial learning rate of $2 \times 10^{-4}$, adjusted using
a cosine decay schedule, with a factor of 0.5 at 10k
and 15k steps. Our model was trained using the
Jittor deep learning framework [51] on a Titan RTX
GPU using a batch size of 512 rays.
4.3 Metrics
To evaluate the accuracy of 3D reconstruction, we adopt the commonly used chamfer distance, which measures point distances between the predicted and ground-truth geometries. Following Ref. [4], we make use of the foreground object masks provided in IDR [34] to remove the background from the reconstructed results when computing metrics.
4.4 Quantitative and qualitative results
We first compared our method to a baseline neural surface reconstruction method SparseNeuS [4], which ignores both depth and normal cues when learning the 3D implicit field. For a fair comparison, we followed the setting of SparseNeuS by performing evaluation on two sets of three views for each test scene. The final metrics average these pairs of results.
We also compared our method to other leading general sparse view 3D reconstruction methods, including (i) generic neural rendering methods such as PixelNeRF [43], IBRNet [44], and MVSNeRF [41], where the reconstructed mesh is extracted from the learned implicit field, and (ii) the widely used classic MVS method COLMAP [19], where the reconstructed mesh is extracted from the reconstructed point cloud. Note that, to test the generalizability of all methods, they were not fine-tuned to suit each scene. Per-scene chamfer distances and mean chamfer distance on the DTU test set are reported in Table 1. All values except those for our method are directly drawn from Ref. [4]. We also present a visual comparison of sample output from SparseNeuS and our method in Fig. 2; we also show cosine similarity of the predicted geometry's normals to the ground-truth values.
Table 1 Chamfer distance assessment of reconstruction errors using the DTU test dataset, for various methods
Method / Scan   24    37    40    55    63    65    69    83    97    105   106   110   114   118   122   Mean
PixelNeRF       5.13  8.07  5.85  4.40  7.11  4.64  5.68  6.76  9.05  6.11  3.95  5.92  6.26  6.89  6.93  6.28
IBRNet          2.29  3.70  2.66  1.83  3.02  2.83  1.77  2.28  2.73  1.96  1.87  2.13  1.58  2.05  2.09  2.32
MVSNeRF         1.96  3.27  2.54  1.93  2.57  2.71  1.82  1.72  2.29  1.75  1.72  1.47  1.29  2.09  2.26  2.09
SparseNeuS      1.68  3.06  2.25  1.10  2.37  2.18  1.28  1.47  1.80  1.23  1.19  1.17  0.75  1.56  1.55  1.64
COLMAP          0.90  2.89  1.63  1.08  2.18  1.94  1.61  1.30  2.34  1.28  1.10  1.42  0.76  1.17  1.14  1.52
Ours            1.17  2.35  1.94  1.06  1.11  1.42  0.95  1.64  1.18  1.02  0.96  0.81  0.67  1.15  1.31  1.25
Fig. 2 Comparison of results from our method and SparseNeuS [4] for scans 24, 63, and 97 of the DTU [48] test set. Left to right: input
reference view, reconstructed SparseNeuS mesh, normal error map for SparseNeuS, reconstructed mesh from our method, and normal error map
for our method. Brighter red indicates larger normal error. Green and blue indicate regions omitted when calculating the normal error.
Our method achieves the most accurate reconstruction as assessed in terms of chamfer distance. Generic
neural rendering methods, such as PixelNeRF,
IBRNet, and MVSNeRF struggle to recover fine
geometric details using only the input images.
Compared to SparseNeuS, with the help of depth
guided sparse geometry reasoning and the constraints
from both depth and normal priors, our method can
reconstruct more accurate and detailed geometry.
Furthermore, unlike SparseNeuS, our method does
not need to train two (coarse and fine) networks,
making training easier. Note that our method can
outperform the COLMAP classical MVS method
without the need for per-scene fine-tuning, which is
required by SparseNeuS. Indeed, our results are even a
little better than the fine-tuned results of SparseNeuS,
which have a mean chamfer distance of 1.27 on the DTU
test dataset.
The reconstructed results for other scenes from the
DTU test set are shown in Fig. 3.
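For reference, the chamfer distance reported in Table 1 can be sketched as a symmetric nearest-neighbour distance between two point sets. The version below is a generic brute-force definition in NumPy and is only an assumption about the metric's form: the DTU evaluation protocol (masking, point sampling, and any distance thresholds) is not reproduced, and `chamfer_distance` is a hypothetical name.

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric chamfer distance between point sets pred (A, 3) and gt (B, 3).

    Averages the mean nearest-neighbour distance from pred to gt (accuracy)
    and from gt to pred (completeness). Brute force, O(A * B) memory.
    """
    diff = pred[:, None, :] - gt[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)        # (A, B) pairwise distances
    accuracy = dist.min(axis=1).mean()          # pred -> gt
    completeness = dist.min(axis=0).mean()      # gt -> pred
    return 0.5 * (accuracy + completeness)
```

For point sets of realistic size, a spatial index such as a k-d tree would replace the dense pairwise distance matrix.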
4.5 Ablation study
To demonstrate the effectiveness of our approach, we conducted experiments by ablating the core modules of our method providing sparse geometry reasoning, depth constraints, and normal constraints. In the following evaluation, results were reconstructed using 6 views for consistency with the training process. These views were selected according to view pairs provided by the DTU dataset [48], where the first three views are one set of three views used in SparseNeuS.
The study was performed as follows:
• Without sparse geometry reasoning (SGR). We replaced our sparse geometry reasoning module with the two-stage geometry reasoning module from SparseNeuS [4] and used the same settings of resolutions for coarse and fine voxels. Depth and normal constraints were imposed using the same weights as in our unaltered method.
• Without normal prior. We set the weight for the normal loss to zero in our full method.
• Without depth prior. We set the weight for the depth loss to zero in our full method.
A quantitative analysis of 3D reconstruction results is given in Table 2 and a qualitative assessment is presented in Fig. 4. As we can see, eliminating the SGR module leads to over-smoothed surfaces and less accurate geometry; both depth and normal priors contribute to the geometric details. We also present novel view synthesis results in Fig. 5 by rendering an unseen view for several DTU test scenes. The results show that, though originally designed for 3D reconstruction, our geometric priors, especially the sparse geometry reasoning, are also beneficial when synthesizing novel views.
Table 2 Ablation study. Mean chamfer distance with and without the sparse geometry reasoning (SGR) module, the normal constraint and the depth constraint
SGR   Depth   Normal   CD
×     ✓       ✓        1.737
✓     ×       ✓        1.304
✓     ✓       ×        1.834
✓     ✓       ✓        1.281
Fig. 3 Further reconstructed results for DTU test scenes using 6 views. Left: reference view. Right: mesh reconstructed by our method.
Fig. 4 Ablation study. Left to right: (a) reference image, (b) our method without sparse geometry reasoning (SGR), (c) our method without depth priors, (d) our method without normal priors, and (e) our full model. The SGR module helps to recover more accurate and detailed geometry; both depth and normal cues improve local details of the underlying geometry.
4.6 Parameter study
Our method is affected by four main parameters, including the weight for depth loss $w_d$, the weight for normal loss $w_n$, the voxel size $d$ (inversely proportional to the number of voxels in each direction), and the depth dilation range $\delta_d$. We varied both $w_d$ and $w_n$ from 0.0 to 0.9 with a step size of 0.1 and tested two settings of voxel resolution, i.e., 96 or 192 voxels in each dimension. The mean chamfer distances for different parameter configurations, using the DTU dataset test scenes are listed in Table 3.
Table 3 Parameter study, considering mean chamfer distance w.r.t. normal loss weight, depth loss weight, number of voxels in each direction, and depth dilation range
w_d    w_n   Voxels   δ_d   CD
0.01   0.2   192      7     1.639
0.01   0.2   192      10    1.535
0.1    0.2   96       7     1.993
0.1    0.2   192      7     1.473
0.1    0.3   192      7     1.381
0.1    0.4   192      7     1.372
0.1    0.5   192      7     1.369
0.1    0.6   192      7     1.293
0.1    0.7   192      7     1.357
0.1    0.8   192      7     1.359
0.1    0.9   192      7     1.281
0.2    0.2   192      7     1.510
0.3    0.2   192      7     1.707
0.4    0.2   192      7     1.675
0.5    0.2   192      7     1.882
0.6    0.2   192      7     1.927
0.7    0.2   192      7     2.853
0.8    0.2   192      7     3.028
0.9    0.2   192      7     3.230
We can observe that
(1) $d$ is responsible for geometric reconstruction accuracy: the smaller $d$ is, the more accurate the geometry, but at a cost of increased computation.
(2) As the depth weight starts to increase (from 0.01 to 0.1), the geometry becomes more and more accurate; however, when it continues to increase, the accuracy drops. This may be somewhat affected by the sparse geometry reasoning module, which uses the depth information to construct an initial geometry for the underlying scene.
(3) A larger normal loss weight reconstructs more geometric details.
(4) A larger depth dilation range can cover more true surface regions and thus be more robust to noise in the depth maps, achieving more accurate reconstruction results, but at a cost of a greater computational burden.
Fig. 5 Novel view synthesis. Left to right: (a) our method without sparse geometry reasoning (SGR), (b) our method without depth prior, (c) our method without normal prior, (d) our full model, and (e) the ground truth. The differences are highlighted in red box.
To balance the accuracy of the recovered geometry
and computational effort, we suggest setting wd = 0.1,
wn = 0.9, and using higher voxel resolution and a
smaller depth dilation range.
5 Conclusions and future work
This paper has presented a general framework for
neural implicit 3D reconstruction from sparse views.
By leveraging geometric priors, our method can
determine a sparse and reliable coarse implicit geometry
for optimization. This is done by imposing both
depth consistency and normal consistency, as well as
photometric consistency, on the training loss function.
This makes the framework more general and accurate.
Currently, we set a fixed dilation range for
the depth when constructing the initial geometry.
This could be further improved in practice if the
uncertainty of the depth map is known. Our model
can also be per-scene fine-tuned given more views of
a specific scene, using only the color loss.
In future, we would also like to apply our method
to outdoor large scene 3D reconstruction from remote
sensing images or aerial images, for which accurate
depths and normals are even harder to obtain, by
leveraging accurate mapping data.
Acknowledgements
We thank the anonymous reviewers for their valuable
comments on this paper. This work was supported
by the National Natural Science Foundation of China
(Grant No. 61902210).
Declaration of competing interest
The authors have no competing interests to declare
that are relevant to the content of this article.
References
[1] Mildenhall, B.; Srinivasan, P. P.; Tancik, M.; Barron, J.
T.; Ramamoorthi, R.; Ng, R. NeRF: Representing
scenes as neural radiance fields for view synthesis.
Communications of the ACM Vol. 65, No. 1, 99–106,
2022.
[2] Eftekhar, A.; Sax, A.; Malik, J.; Zamir, A. Omnidata:
A scalable pipeline for making multi-task mid-level
vision datasets from 3D scans. In: Proceedings of
the IEEE/CVF International Conference on Computer
Vision, 10766–10776, 2021.
[3] Yao, Y.; Luo, Z.; Li, S.; Fang, T.; Quan, L. MVSNet:
Depth inference for unstructured multi-view stereo.
In: Computer Vision – ECCV 2018. Lecture Notes
in Computer Science, Vol. 11212. Ferrari, V.; Hebert,
M.; Sminchisescu, C.; Weiss, Y. Eds. Springer Cham,
785–801, 2018.
[4] Long, X.; Lin, C.; Wang, P.; Komura, T.; Wang,
W. SparseNeuS: Fast generalizable neural surface
reconstruction from sparse views. In: Computer Vision
– ECCV 2022. Lecture Notes in Computer Science, Vol.
13692. Avidan, S.; Brostow, G.; Cissé, M.; Farinella, G.
M.; Hassner, T. Eds. Springer Cham, 210–227, 2022.
[5] Seitz, S. M.; Dyer, C. R. Photorealistic scene
reconstruction by voxel coloring. International Journal
of Computer Vision Vol. 35, No. 2, 151–173, 1999.
[6] Kar, A.; Häne, C.; Malik, J. Learning a multi-view stereo machine. In: Proceedings of the 31st
International Conference on Neural Information
Processing Systems, 364–375, 2017.
[7] Sun, J. M.; Xie, Y. M.; Chen, L. H.; Zhou, X.
W.; Bao, H. J. NeuralRecon: Real-time coherent 3D
reconstruction from monocular video. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 15593–15602, 2021.
[8] Ji, M. Q.; Zhang, J. Z.; Dai, Q. H.; Fang, L. SurfaceNet:
An end-to-end 3D neural network for very sparse multi-view stereopsis. IEEE Transactions on Pattern Analysis
and Machine Intelligence Vol. 43, No. 11, 4078–4093,
2021.
[9] Lhuillier, M.; Quan, L. A quasi-dense approach
to surface reconstruction from uncalibrated images.
IEEE Transactions on Pattern Analysis and Machine
Intelligence Vol. 27, No. 3, 418–433, 2005.
[10] Furukawa, Y.; Ponce, J. Accurate, dense, and robust
multiview stereopsis. IEEE Transactions on Pattern
Analysis and Machine Intelligence Vol. 32, No. 8, 1362–
1376, 2010.
[11] Gu, X. D.; Fan, Z. W.; Zhu, S. Y.; Dai, Z. Z.; Tan,
F. T.; Tan, P. Cascade cost volume for high-resolution
multi-view stereo and stereo matching. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2492–2501, 2020.
[12] Long, X.; Liu, L.; Theobalt, C.; Wang, W. Occlusion-aware depth estimation with adaptive normal
constraints. In: Computer Vision – ECCV 2020.
Lecture Notes in Computer Science, Vol. 12354.
Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J. M. Eds.
Springer Cham, 640–657, 2020.
[13] Long, X. X.; Lin, C.; Liu, L. J.; Li, W.; Theobalt,
C.; Yang, R. G.; Wang, W. Adaptive surface normal
constraint for depth estimation. In: Proceedings of
the IEEE/CVF International Conference on Computer
Vision, 12829–12838, 2021.
[14] Long, X. X.; Liu, L. J.; Li, W.; Theobalt, C.;
Wang, W. P. Multi-view depth estimation using
epipolar spatio-temporal networks. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 8254–8263, 2021.
[15] Fuentes-Pacheco, J.; Ruiz-Ascencio, J.; Rendón-Mancha, J. M. Visual simultaneous localization and
mapping: A survey. Artificial Intelligence Review Vol.
43, No. 1, 55–81, 2015.
[16] Xu, Z. W.; Rong, Z.; Wu, Y. H. A survey: Which
features are required for dynamic visual simultaneous
localization and mapping? Visual Computing for
Industry, Biomedicine, and Art Vol. 4, No. 1, Article
No. 20, 2021.
[17] Özyeşil, O.; Voroninski, V.; Basri, R.; Singer, A. A
survey of structure from motion. Acta Numerica Vol.
26, 305–364, 2017.
[18] Li, Y. Z.; Luo, F.; Xiao, C. X. Self-supervised coarse-to-fine monocular depth estimation using a lightweight
attention module. Computational Visual Media Vol. 8,
No. 4, 631–647, 2022.
[19] Schönberger, J. L.; Frahm, J. M. Structure-from-motion
revisited. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 4104–4113,
2016.
[20] Choy, C. B.; Xu, D.; Gwak, J.; Chen, K.; Savarese,
S. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In: Computer Vision –
ECCV 2016. Lecture Notes in Computer Science, Vol.
9912. Leibe, B.; Matas, J.; Sebe, N.; Welling, M. Eds.
Springer Cham, 628–644, 2016.
[21] Huang, P. H.; Matzen, K.; Kopf, J.; Ahuja, N.;
Huang, J. B. DeepMVS: Learning multi-view stereopsis.
In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2821–2830,
2018.
[22] Wang, D.; Cui, X. R.; Chen, X.; Zou, Z. X.; Shi, T. Y.;
Salcudean, S.; Wang, Z. J.; Ward, R. Multi-view 3D
reconstruction with transformers. In: Proceedings of
the IEEE/CVF International Conference on Computer
Vision, 5702–5711, 2021.
[23] Liu, L.; Gu, J.; Lin, K. Z.; Chua, T.; Theobalt, C.
Neural sparse voxel fields. In: Proceedings of the
34th International Conference on Neural Information
Processing Systems, Article No. 1313, 15651–15663,
2020.
[24] Trevithick, A.; Yang, B. GRF: Learning a general
radiance field for 3D representation and rendering.
In: Proceedings of the IEEE/CVF International
Conference on Computer Vision, 15162–15172, 2021.
[25] Barron, J. T.; Mildenhall, B.; Tancik, M.; Hedman,
P.; Martin-Brualla, R.; Srinivasan, P. P. Mip-NeRF:
A multiscale representation for anti-aliasing neural
radiance fields. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision, 5835–
5844, 2021.
[26] Verbin, D.; Hedman, P.; Mildenhall, B.; Zickler, T.;
Barron, J. T.; Srinivasan, P. P. Ref-NeRF: Structured
view-dependent appearance for neural radiance fields.
In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 5481–5490,
2022.
[27] Guo, Y. C.; Kang, D.; Bao, L. C.; He, Y.; Zhang, S.
H. NeRFReN: Neural radiance fields with reflections.
In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 18388–
18397, 2022.
[28] Wang, P.; Liu, L.; Liu, Y.; Theobalt, C.; Komura,
T.; Wang, W. NeuS: Learning neural implicit surfaces
by volume rendering for multi-view reconstruction.
In: Proceedings of the 35th Conference on Neural
Information Processing Systems, 27171–27183, 2021.
[29] Zhang, J. Y.; Yao, Y.; Quan, L. Learning signed
distance field for multi-view surface reconstruction.
In: Proceedings of the IEEE/CVF International
Conference on Computer Vision, 6505–6514, 2021.
[30] Niemeyer, M.; Mescheder, L.; Oechsle, M.; Geiger,
A. Differentiable volumetric rendering: Learning
implicit 3D representations without 3D supervision.
In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 3501–3512,
2020.
[31] Oechsle, M.; Peng, S. Y.; Geiger, A. UNISURF:
Unifying neural implicit surfaces and radiance fields
for multi-view reconstruction. In: Proceedings of the
IEEE/CVF International Conference on Computer
Vision, 5569–5579, 2021.
[32] Darmon, F.; Bascle, B.; Devaux, J. C.; Monasse,
P.; Aubry, M. Improving neural implicit surfaces
geometry with patch warping. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 6250–6259, 2022.
[33] Yariv, L.; Gu, J.; Kasten, Y.; Lipman, Y. Volume
rendering of neural implicit surfaces. In: Proceedings of
the 35th Conference on Neural Information Processing
Systems, 4805–4815, 2021.
[34] Yariv, L.; Kasten, Y.; Moran, D.; Galun, M.;
Atzmon, M.; Basri, R.; Lipman, Y. Multiview neural
surface reconstruction by disentangling geometry and
appearance. In: Proceedings of the 34th International
Conference on Neural Information Processing Systems,
Article No. 210, 2492–2502, 2020.
[35] Liu, S. H.; Zhang, Y. D.; Peng, S. Y.; Shi, B. X.;
Pollefeys, M.; Cui, Z. P. DIST: Rendering deep implicit
signed distance function with differentiable sphere
tracing. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 2016–
2025, 2020.
[36] Kellnhofer, P.; Jebe, L. C.; Jones, A.; Spicer, R.;
Pulli, K.; Wetzstein, G. Neural lumigraph rendering.
In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 4285–4295,
2021.
[37] Insafutdinov, E.; Campbell, D.; Henriques, J. F.;
Vedaldi, A. SNeS: Learning probably symmetric neural
surfaces from incomplete data. In: Computer Vision –
ECCV 2022. Lecture Notes in Computer Science, Vol.
13692. Avidan, S.; Brostow, G.; Cissé, M.; Farinella, G.
M.; Hassner, T. Eds. Springer Cham, 367–383, 2022.
[38] Azinović, D.; Martin-Brualla, R.; Goldman, D.
B.; Nießner, M.; Thies, J. Neural RGB-D surface
reconstruction. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern
Recognition, 6280–6291, 2022.
[39] Guo, H. Y.; Peng, S. D.; Lin, H. T.; Wang, Q. Q.;
Zhang, G. F.; Bao, H. J.; Zhou, X. Neural 3D scene
reconstruction with the Manhattan-world assumption.
In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 5501–5510,
2022.
[40] Wang, J. P.; Wang, P.; Long, X. X.; Theobalt, C.;
Komura, T.; Liu, L. J.; Wang, W. NeuRIS: Neural
reconstruction of indoor scenes using normal priors.
In: Computer Vision – ECCV 2022. Lecture Notes in
Computer Science, Vol. 13692. Avidan, S.; Brostow, G.;
Cissé, M.; Farinella, G. M.; Hassner, T. Eds. Springer
Cham, 139–155, 2022.
[41] Chen, A. P.; Xu, Z. X.; Zhao, F. Q.; Zhang, X.
S.; Xiang, F. B.; Yu, J. Y.; Su, H. MVSNeRF:
Fast generalizable radiance field reconstruction from
multi-view stereo. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision, 14104–
14113, 2021.
[42] Chibane, J.; Bansal, A.; Lazova, V.; Pons-Moll, G.
Stereo radiance fields (SRF): Learning view synthesis
for sparse views of novel scenes. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 7907–7916, 2021.
[43] Yu, A.; Ye, V.; Tancik, M.; Kanazawa, A. pixelNeRF:
Neural radiance fields from one or few images. In:
Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 4576–4585, 2021.
[44] Wang, Q. Q.; Wang, Z. C.; Genova, K.; Srinivasan,
P.; Zhou, H.; Barron, J. T.; Martin-Brualla, R.;
Snavely, N.; Funkhouser, T. A. IBRNet: Learning
multi-view image-based rendering. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 4688–4697, 2021.
[45] Roessle, B.; Barron, J. T.; Mildenhall, B.; Srinivasan, P.
P.; Nießner, M. Dense depth priors for neural radiance
fields from sparse input views. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 12882–12891, 2022.
[46] Johari, M. M.; Lepoittevin, Y.; Fleuret, F.
GeoNeRF: Generalizing NeRF with geometry priors.
In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 18344–
18347, 2022.
[47] Yu, Z.; Peng, S.; Niemeyer, M.; Sattler, T.; Geiger,
A. MonoSDF: Exploring monocular geometric cues for
neural implicit surface reconstruction. In: Proceedings
of the 36th Conference on Neural Information
Processing Systems, 2022.
[48] Jensen, R.; Dahl, A.; Vogiatzis, G.; Tola, E.; Aanæs,
H. Large scale multi-view stereopsis evaluation. In:
Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 406–413, 2014.
[49] Lin, T. Y.; Dollár, P.; Girshick, R.; He, K.
M.; Hariharan, B.; Belongie, S. Feature pyramid
networks for object detection. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition, 936–944, 2017.
[50] Tang, H. T.; Liu, Z. J.; Zhao, S. Y.; Lin, Y. J.;
Lin, J.; Wang, H. R.; Han, S. Searching efficient
3D architectures with sparse point-voxel convolution.
In: Computer Vision – ECCV 2020. Lecture Notes in
Computer Science, Vol. 12373. Vedaldi, A.; Bischof, H.;
Brox, T.; Frahm, J. M. Eds. Springer Cham, 685–702,
2020.
[51] Hu, S. M.; Liang, D.; Yang, G. Y.; Yang, G. W.; Zhou,
W. Y. Jittor: A novel deep learning framework with
meta-operators and unified graph execution. Science
China Information Sciences Vol. 63, No. 12, Article No.
222103, 2020.
Tai-Jiang Mu is an assistant researcher
in the Department of Computer Science
and Technology at Tsinghua University.
He received his bachelor's degree and
Ph.D. degree in computer science and
technology from Tsinghua University
in 2011 and 2016, respectively. His
research interests include computer
graphics, visual media learning, 3D reconstruction, and 3D
understanding.
Hao-Xiang Chen received his bachelor's
degree in computer science from Jilin
University in 2020. He is currently
a Ph.D. candidate in the Department
of Computer Science and Technology,
Tsinghua University. His research interests include 3D reconstruction and 3D
computer vision.
Jun-Xiong Cai is currently a postdoctoral researcher at Tsinghua University, where he received his Ph.D.
degree in computer science and technology in 2020. His research interests
include computer graphics, computer
vision, and 3D geometry processing.
Ning Guo is an assistant researcher
at the Academy of Military Sciences.
He received his bachelor's degree, master's
degree, and Ph.D. degree in information
and communication engineering from
the National University of Defense
Technology in 2014, 2016, and 2020,
respectively. His research interests
include digital earth, 3D GIS, 3D reconstruction, and spatial
databases.
Open Access This article is licensed under a Creative
Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link
to the Creative Commons licence, and indicate if changes
were made.
The images or other third party material in this article are
included in the article’s Creative Commons licence, unless
indicated otherwise in a credit line to the material. If material
is not included in the article’s Creative Commons licence and
your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission
directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Other papers from this open access journal are available
free of charge from http://www.springer.com/journal/41095.
To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.