We conduct comprehensive experiments using PN-Relighting for a number of tasks. First, we provide a detailed description of the dataset we use and provide our training details. Next, we evaluate our method qualitatively and quantitatively from three aspects: portrait appearance, portrait relighting, and novel view synthesis. We compare our method with competitive state-of-the-art methods as well as perform ablation studies to evaluate separate modules in PN-Relighting.
We train PN-Relighting on a Linux cluster with two AMD EPYC 7742 CPUs and an NVIDIA A6000 GPU with 48 GB of memory. We set the weights of our total loss function separately for the SMOLAT and in-the-wild datasets. Note that there is no ground truth to guide the training of the Normal and Material networks on the in-the-wild dataset. We therefore fix the parameters of these two subnets during training by setting the weights of the corresponding loss terms to zero. We use the Adam optimizer. Training our network on SMOLAT takes around 24 hours (about 1 day), compared with 7 days for TotalRelighting (Pandey et al., 2021).
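For concreteness, a minimal PyTorch-style sketch of this freezing step; the module names `normal_net` and `material_net` and the loss-weight table are hypothetical stand-ins, not the actual implementation:

```python
import torch

def freeze_subnets(normal_net: torch.nn.Module, material_net: torch.nn.Module) -> None:
    """Freeze the Normal and Material subnets for in-the-wild training.

    Zeroing their loss weights removes the gradient signal; disabling
    requires_grad additionally guarantees the optimizer never updates them.
    """
    for net in (normal_net, material_net):
        net.eval()  # also freezes batch-norm running statistics, if any
        for p in net.parameters():
            p.requires_grad = False

# Per-dataset loss weights: the normal/albedo supervision terms are zeroed
# on in-the-wild data, where no ground truth exists (values are placeholders).
LOSS_WEIGHTS = {
    "smolat":      {"normal": 1.0, "albedo": 1.0, "relight": 1.0},
    "in_the_wild": {"normal": 0.0, "albedo": 0.0, "relight": 1.0},
}
```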
Once training on SMOLAT is done, we add the Pseudo-Albedo and FFHQ datasets to the training, with occurrence probabilities of 0.05 and 0.1, respectively. This stage takes around another 24 hours (about 1 day) to reach convergence. In total, training our network takes about 48 hours.
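A small sketch of the per-batch dataset mixing under these occurrence probabilities; the dataset labels are only illustrative:

```python
import random
from collections import Counter

# Occurrence probabilities from the text: Pseudo-Albedo 0.05, FFHQ 0.1,
# with the remainder of batches drawn from SMOLAT.
MIX = [("pseudo_albedo", 0.05), ("ffhq", 0.10), ("smolat", 0.85)]

def sample_source() -> str:
    """Pick which dataset the next training batch is drawn from."""
    r, acc = random.random(), 0.0
    for name, p in MIX:
        acc += p
        if r < acc:
            return name
    return MIX[-1][0]

# Rough empirical check of the mixing ratio.
print(Counter(sample_source() for _ in range(10_000)))
# e.g. roughly {'smolat': 8500, 'ffhq': 1000, 'pseudo_albedo': 500}
```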
For data augmentation, we apply standard strategies to the input, including color adjustment, image shifting, and image re-scaling.
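One possible composition of these augmentations in torchvision; the parameter magnitudes below are illustrative assumptions, not the exact values used in training:

```python
import torchvision.transforms as T

# Illustrative magnitudes only. Geometric transforms (shift, re-scale) must
# be applied consistently to the input and its supervision maps
# (normal, albedo) in an image-to-image setting.
augment = T.Compose([
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    T.RandomAffine(degrees=0, translate=(0.05, 0.05), scale=(0.9, 1.1)),
])
```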
We train PN-Relighting on two datasets: SMOLAT (Zhang et al., 2021) and a Pseudo-Albedo dataset built from FFHQ (Karras et al., 2019).
Zhang et al. (2021) captured this OLAT dataset for video-based portrait relighting. They used an ultra-high-speed camera to capture OLAT images of 36 subjects, together with 2810 HDR environment lighting maps.
Each frame of OLAT data contains 114 light positions, corresponding to 114 images. The dataset contains a total of 603,288 frames of OLAT data. We split the dataset into training and test sets by subject identity: the training data contains 30 individuals, and in this section we only show results on the remaining 6 subjects (unseen in training). For each frame of OLAT data, we obtain its normal via photometric stereo (Woodham, 1980) and use it as ground truth for training. Unlike TotalRelighting, which uses a fully lit image as albedo, we generate ground-truth albedo by photometric stereo to better filter out specular effects, thus providing higher appearance fidelity for portrait relighting. We use all the HDR environment illuminations provided by SMOLAT to generate ground-truth training pairs of the Phong prior and image-based renderings under different illuminations for our Neural Relighting networks. Figure 6 shows our relit results on SMOLAT.
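For reference, a minimal NumPy sketch of classical Lambertian photometric stereo (Woodham, 1980), recovering per-pixel normal and albedo from OLAT images with known light directions; shadow/saturation masking and per-channel (RGB) albedo handling are omitted:

```python
import numpy as np

def photometric_stereo(images: np.ndarray, lights: np.ndarray):
    """Lambertian photometric stereo.

    images: (k, h, w) grayscale OLAT intensities under k point lights.
    lights: (k, 3) unit light directions for the same k frames.
    Returns per-pixel unit normals (h, w, 3) and albedo (h, w).
    Assumes Lambertian reflectance; in practice, shadowed and saturated
    pixels should be masked out before solving.
    """
    k, h, w = images.shape
    I = images.reshape(k, -1)                        # (k, h*w)
    G, *_ = np.linalg.lstsq(lights, I, rcond=None)   # (3, h*w), G = albedo * n
    rho = np.linalg.norm(G, axis=0)                  # albedo magnitude
    n = G / np.maximum(rho, 1e-8)                    # unit normals
    return n.T.reshape(h, w, 3), rho.reshape(h, w)
```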
We described the construction of the Pseudo-Albedo dataset in Sect. 5. It contains 300 images in total, each with a pseudo-albedo map from Eq. 3 as training guidance. We use 250 images as the training set and evaluate our method on the remaining 50 images. In Sect. 6.4 we show that this dataset improves the performance of our albedo network on in-the-wild data.
In addition to the PA dataset, we collect a sub-FFHQ dataset to train the Neural Relighting Network so that our method generalizes to "in-the-wild" images. Specifically, we collected about 50k images for training and randomly picked about 1k images for testing. None of the images in sub-FFHQ overlap with the PA dataset. Our ablation study in Sect. 6.4 shows that adding the sub-FFHQ dataset to the training procedure helps preserve appearance realism and image sharpness.
We evaluate PN-Relighting on three tasks: portrait appearance reconstruction, portrait relighting, and novel view synthesis under various illuminations. For each task, we compare with state-of-the-art methods both qualitatively and quantitatively.
We compare our estimated surface normal and albedo with two state-of-the-art appearance estimation approaches: DECA (Feng et al., 2021) and SfSNet (Sengupta et al., 2018). DECA is a parametric face model that infers normal and albedo as intermediate results. SfSNet is closer to our method in that it also handles illumination changes. Specifically, we measure the reconstructed albedo accuracy using the PSNR, SSIM, and RMSE metrics. For the normal, we use the mean error and the percentage of correct pixels at various thresholds.
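For clarity, a small sketch of these metrics, assuming the normal error is measured as an angular error in degrees (the usual convention); the threshold values are common choices rather than ones fixed by the tables:

```python
import numpy as np

def normal_errors(pred, gt, thresholds=(11.25, 22.5, 30.0)):
    """Mean angular error (degrees) between (h, w, 3) unit-normal maps,
    plus the percentage of pixels below each angular threshold."""
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))
    return ang.mean(), {t: float((ang < t).mean() * 100.0) for t in thresholds}

def rmse(pred, gt):
    """Root-mean-square error, as used for the albedo comparison."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def psnr(pred, gt, peak=1.0):
    """PSNR in dB for images normalized to [0, peak]."""
    return float(20.0 * np.log10(peak / max(rmse(pred, gt), 1e-12)))
```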
On the SMOLAT dataset, we conduct quantitative comparisons of the estimated normal and albedo in Tables 2 and 3. For a fair comparison, we apply background removal and color calibration to all the other methods and only evaluate the reconstructed appearance from the original viewpoint of the input image. Figure 7 shows a visual comparison with the state of the art. Compared with the other methods, the normal map produced by PN-Relighting contains more details and is the closest to the ground truth. Similarly, our albedo is of higher resolution and presents fewer artifacts. Figure 11 shows the average normal and albedo errors as the lighting changes. Our predicted intrinsic appearance parameters are stable across different illuminations, which is in line with the fact that surface normal and albedo are independent of lighting.
On the in-the-wild dataset, there is no ground-truth albedo or normal for quantitative measurement. We therefore only show qualitative comparisons of the estimated normal and albedo in Fig. 8.
Overall, PN-Relighting produces more convincing results on in-the-wild data. Specifically, compared to SfSNet, our method reconstructs an albedo map with fewer specular residues and artifacts, and our normal is sharper, preserving more high-frequency details. By introducing Phong diffuse and specular shading as a prior, our method also more faithfully recovers the specular reflectance of the portrait in the reconstructed image. Table 4 shows a quantitative comparison of the reconstructed appearance on the FFHQ test set (Fig. 9).
On the SMOLAT dataset, we show the qualitative comparison in Fig. 10 and the quantitative comparisons in Tables 5 and 6. Our method achieves the best accuracy and stability under different metrics. When measuring the average normal and albedo errors as the lighting changes, as shown in Fig. 11, our method is more robust in both geometry and albedo reconstruction, and consequently achieves better relighting than SfSNet. This result shows that our predicted normal and albedo remain stable when the illumination changes, demonstrating that our network decomposes albedo and illumination well (Table 7).
We compare our relit results with SfSNet, NVPR (Zhang et al., 2021), MTP (Shu et al., 2017), TotalRelighting (TR) (Pandey et al., 2021), SIPR1 (Wang et al., 2020), and SIPR2 (Sun et al., 2019) using the SSIM, PSNR, and RMSE measurements. As we do not have access to the code and training dataset of TR, we obtained the results on our test dataset from the authors.
As shown in Fig. 10, NVPR presents high color stability. However, its performance deteriorates under high-contrast environment illumination, such as specular highlights, due to the lack of a portrait geometry prior. To compare with MTP, we choose a reference portrait image as its input. Since MTP relighting is primarily based on image color, its results exhibit artifacts when the input portrait differs from the reference in color and texture. Similar to NVPR, SIPR1 and SIPR2 do not contain a geometric prior and therefore show noticeable artifacts of unnatural highlights and shadows under high-contrast environment illumination. In contrast, by training on a large OLAT dataset, TR produces high-fidelity relighting results. Ours achieves comparable performance to TR while using a much smaller training set. Moreover, due to its particular post-processing, the global hue of TR's relit results differs significantly from the other baseline methods. Thus, for a fairer visual comparison, we manually align the exposure curve of our results to TR so that the results have a similar hue, as presented in the column Ours (Adjusted) of Figs. 9 and 10. After the adjustment, our results are of similar, if not better, quality and preserve the same amount of detail as TR.
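The exposure alignment above is performed manually; as one automatic approximation (a sketch, not the procedure actually used), per-channel mean/std matching yields a similar global hue alignment:

```python
import numpy as np

def match_color_stats(src: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Align the global hue/exposure of `src` to `ref` by matching
    per-channel mean and standard deviation. Both images are float
    arrays in [0, 1] with shape (h, w, 3)."""
    out = src.copy()
    for c in range(3):
        s, r = src[..., c], ref[..., c]
        out[..., c] = (s - s.mean()) / (s.std() + 1e-8) * r.std() + r.mean()
    return np.clip(out, 0.0, 1.0)
```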
On the in-the-wild dataset, we further conduct a qualitative comparison in Fig. 9. Compared to the other methods, PN-Relighting produces more photo-realistic results. Moreover, we demonstrate the effect of editing the implicit material latent in Fig. 14 to change the material of the portrait. In this example, we gradually reduce the implicit material latent extracted from the original image to zero during relighting, so that the portrait material gradually approaches diffuse.
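A sketch of this latent editing, assuming a relighting network that accepts the material latent as an input; the function name and call signature are hypothetical:

```python
import torch

def edit_material(relight_net, inputs, z_material: torch.Tensor, alpha: float):
    """Relight with a scaled implicit material latent. alpha=1 keeps the
    material extracted from the original image; alpha=0 drives the latent
    to zero so the material approaches diffuse, as in the editing example."""
    return relight_net(inputs, material_latent=alpha * z_material)

# Sweeping alpha from 1 to 0 reproduces the gradual diffuse transition.
```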
For multi-view rendering, we further compare with SfSNet, StyleFlow (Abdal et al., 2021), and StyleNerf (Gu et al., 2022). Both StyleNerf and StyleFlow construct 3D face models based on GAN architectures and exhibit inconsistency when the viewpoint changes. Since the two methods are not designed for relighting tasks, for a fair comparison we apply NVPR to add lighting effects to their reconstructed 3D faces. SfSNet can only generate surface normal and albedo from a portrait image and is not designed for 3D face generation; we therefore form 3D faces from its predicted normal and albedo maps in the same way as described in Sect. 4.2. In Figs. 12 and 13, we show qualitative results on the SMOLAT and in-the-wild datasets, respectively. We observe that the GAN-based modeling approaches still present poor consistency in both shape and lighting under varying viewing angles, due to their inaccurate geometry estimation. For SfSNet, the reconstructed appearance is affected by lighting changes as discussed above, and therefore fails to handle specular highlights, as shown in Fig. 12. Our method, in contrast, demonstrates the most consistent rendering under both viewpoint and illumination changes.
It is important to note that no multi-view OLAT dataset is available that would enable a direct quantitative measurement of identity and relighting consistency. For a better evaluation, we generate reference face images under different viewpoints and illuminations using the CG rendering pipeline of Zhang et al. (2022). Specifically, we take one view with an environment illumination to construct the input image, and take two new views with a new environment illumination as the condition. As shown in Table 6, our results achieve the best consistency w.r.t. identity and relighting (Fig. 14).
To validate that our pseudo-albedo rendering pipeline effectively improves the generalization ability of PN-Relighting trained on SMOLAT, we create variations of our network: (1) w/o PAD, trained only on SMOLAT; (2) w/o AP, trained on SMOLAT and the Pseudo-Albedo dataset without manually removing highlights on the pseudo-albedo maps; (3) w/ PAD, denoting the full pipeline. Qualitative comparisons are shown in Fig. 15 and quantitative comparisons in Table 8. We observe that our pseudo-albedo generation pipeline enables PN-Relighting to preserve fine details in the relighting results, including makeup, skin texture, eyebrows, pupils, hair color, etc. The w/o PAD variant, in contrast, still exhibits a number of specular highlights in its predictions that should ideally be removed.
We also verify the importance of training the PN-Relighting network on the in-the-wild dataset by creating two variations: (1) w/o Wild, the network trained using only the OLAT dataset; (2) w/ Wild, our full pipeline. We show a qualitative comparison in Fig. 16. Compared to w/o Wild, the network trained with FFHQ achieves superior generalization, faithfully reproducing the original portrait in both appearance realism and image sharpness.
We compare our relighting network with the material encoder against a relighting network with a plain U-Net structure: w/o IML denotes the variant without the implicit material latent, and w/ IML denotes our complete pipeline. For the w/o IML variant, we adopt the same network structure as w/ IML (the full pipeline) but remove the loss terms related to the material latent vector in the training phase, i.e., the losses in Eqs. 8 and 9.
Qualitative comparisons are shown in Fig. 17 and quantitative comparisons in Table 9. The implicit material latent allows the network to produce more realistic specular highlights that are more consistent with real-life cases.
To verify the advantage of our in-the-wild training strategy, we conduct an ablation study on different training-data partitions between OLAT and in-the-wild data. Specifically, we use 25%-OLAT, 50%-OLAT, and 100%-OLAT to denote variants trained with 25%, 50%, and 100% of the OLAT data, and use w/o Wild and w/ Wild to denote variants trained without and with the in-the-wild data, respectively. The quantitative comparison is presented in Table 10. The 25%-OLAT + w/ Wild model outperforms the model using 100% of the OLAT data but no in-the-wild training data. This further demonstrates the effectiveness of employing PA & FFHQ in network training to boost the efficiency of OLAT data usage.
Even though the parametric model can provide normal information, we find that these parametric normals cannot model the pixel-aligned geometric details in facial regions near the eyes and wrinkles. Such normal artifacts further lead to inaccurate albedo modeling and degrade the subsequent relighting module. We therefore rely on our OLAT dataset to provide pixel-aligned facial normal estimation and only adopt the parametric model to enable more stable free-view relighting. To verify this, we create a variation, w/ mesh-prior, that uses the mesh normal from the parametric model as a prior. In detail, for w/ mesh-prior, we first project the mesh normal onto the input's perspective view and attach the transformed normal to the original input of the Normal Network. The rest of the training procedure is the same as for w/o mesh-prior (our full pipeline). The quantitative comparison is shown in Table 11. The parametric normal degrades the normal estimation accuracy; for this reason, we do not use the parametric normal as a prior in our optimization pipeline.
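A sketch of how the w/ mesh-prior input could be assembled, assuming the parametric mesh normal has already been rasterized into the input camera view upstream (the function name is hypothetical):

```python
import torch
import torch.nn.functional as F

def with_mesh_prior(image: torch.Tensor, mesh_normal: torch.Tensor) -> torch.Tensor:
    """Build the input of the `w/ mesh-prior` variant: the mesh normal map,
    rendered from the parametric model into the input view, is resized and
    concatenated to the RGB image as three extra channels.
    Both tensors have shape (b, 3, h, w)."""
    mesh_normal = F.interpolate(mesh_normal, size=image.shape[-2:],
                                mode="bilinear", align_corners=False)
    return torch.cat([image, mesh_normal], dim=1)  # (b, 6, h, w)
```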