多尺度相似性迭代查找的可靠双目视差估计 (Reliable binocular disparity estimation based on multi-scale similarity recursive search)

Translated title of the contribution: Reliable binocular disparity estimation based on multi-scale similarity recursive search

Min Yan, Junzheng Wang, Jing Li*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: Depth information is key sensing information for autonomous platforms. Among common depth sensors, the binocular camera can compensate for the sparsity of LiDAR point clouds and for depth cameras' unsuitability for outdoor scenes. Compared with light detection and ranging (LiDAR) and depth cameras, therefore, improving the accuracy and speed of binocular disparity estimation algorithms is of great importance. Deep-learning-based disparity estimation algorithms each have their own strengths, and disparity estimation and optical flow estimation methods can learn from each other and facilitate the generation of new algorithms. Inspired by the efficient optical flow estimation algorithm recurrent all-pairs field transforms (RAFT), a unilateral and bilateral multi-scale similarity recursive search method is proposed to achieve high-precision binocular disparity estimation. To resolve the inconsistent estimation accuracy and confidence across different regions, a left-right disparity consistency detection method is proposed to extract reliably estimated regions.

Method: The pyramid pooling module (PPM), skip-layer connections, and residual structures are used in the feature network to extract representation vectors with strong representational capability. The inner product of representation vectors measures the similarity between pixels, and multi-scale similarity is obtained by average pooling. Three kinds of information are integrated: the updated (or initial) disparity; a window of similarities with a large field of view, retrieved from the multi-scale similarity according to the current disparity (the 0th updating iteration searches in one direction, to the left, and all subsequent iterations search in both directions); and context information. The integrated information is fed to the convolutional recurrent neural network (ConvRNN) of the 0th updating step, or to the ConvRNN shared by all subsequent updating steps, to produce a disparity update, and the final disparity is obtained through multiple updating iterations. The disparity of the right image is estimated by swapping and horizontally flipping the input left and right images, and the confidence of a disparity value is determined by comparing the absolute disparity difference between matched points of the left and right images against a given threshold. The output of each updating iteration is supervised with increasing weight so that the error decreases gradually, and the network is trained with this supervised loss; during training, the learning rate is reduced in segments and the root mean square prop (RMSProp) optimization algorithm is used. To improve inference efficiency, the feature network reduces the resolution by a factor of 8, so a learned upsampling method generates the disparity map at the original image resolution: the disparities of the 8 × 8 neighborhood of a pixel in the original-resolution image are computed by weighting the disparities of the 3 × 3 neighborhood of the corresponding pixel in the reduced-resolution image, with weights obtained by convolving the hidden state of the ConvRNN.
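
To make the multi-scale similarity construction and windowed lookup above concrete, here is a minimal NumPy sketch. The function names, pyramid depth, window radius, and nearest-neighbour sampling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def similarity_pyramid(feat_l, feat_r, num_scales=4):
    """Inner products of representation vectors along each epipolar line,
    then 1D average pooling over the right-image x axis to build a
    multi-scale similarity pyramid. feat_l, feat_r: [H, W, C]."""
    corr = np.einsum('hwc,hvc->hwv', feat_l, feat_r)  # [H, W_left, W_right]
    pyramid = [corr]
    for _ in range(num_scales - 1):
        h, w, v = pyramid[-1].shape
        v2 = v // 2
        # average pooling over the search (right-x) dimension only
        pyramid.append(pyramid[-1][:, :, :v2 * 2].reshape(h, w, v2, 2).mean(-1))
    return pyramid

def lookup(pyramid, disp, radius=4):
    """Gather a (2*radius+1)-wide window of similarities around the position
    implied by the current disparity at every scale; coarser scales give the
    update operator a larger field of view."""
    H, W = disp.shape
    ys, xs = np.mgrid[0:H, 0:W]
    gathered = []
    for k, corr in enumerate(pyramid):
        x_match = (xs - disp) / 2 ** k       # matched right-image column
        for dx in range(-radius, radius + 1):
            # nearest-neighbour sampling for brevity; a real implementation
            # would interpolate linearly between neighbouring columns
            idx = np.clip(np.round(x_match + dx).astype(int), 0, corr.shape[2] - 1)
            gathered.append(corr[ys, xs, idx])
    return np.stack(gathered, axis=-1)       # [H, W, num_scales*(2*radius+1)]
```

For the 0th iteration's unilateral search, the offsets `dx` would be restricted to one side of the matched column; subsequent bilateral iterations use the full window.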
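
The left-right consistency detection that drives reliable region extraction reduces, per pixel, to comparing the left-image disparity with the right-image disparity sampled at the matched column. A sketch follows, with the threshold value assumed for illustration.

```python
import numpy as np

def reliable_mask(disp_l, disp_r, thresh=1.0):
    """Keep a left-image pixel only if the right image's disparity at its
    matched position agrees within `thresh` pixels (value assumed here)."""
    H, W = disp_l.shape
    xs = np.tile(np.arange(W), (H, 1))
    # left pixel x matches right pixel x - d_l(x)
    x_r = np.clip(np.round(xs - disp_l).astype(int), 0, W - 1)
    d_r = np.take_along_axis(disp_r, x_r, axis=1)
    return np.abs(disp_l - d_r) <= thresh

# disp_r comes from the same network by swapping and horizontally flipping
# the inputs, then flipping the prediction back, e.g.:
#   disp_r = flip(net(flip(img_r), flip(img_l)))
```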
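
The learned upsampling step (3 × 3 low-resolution disparities combined into each 8 × 8 full-resolution block) could look like the following sketch, assuming the weights have already been predicted from the ConvRNN hidden state and softmax-normalised over the neighbourhood axis.

```python
import numpy as np

def learned_upsample(disp_lr, weights):
    """Each pixel of an 8x8 block in the full-resolution map is a convex
    combination of the 3x3 low-resolution neighbourhood.
    disp_lr:  [H, W] disparity at 1/8 resolution.
    weights:  [H, W, 8, 8, 9], assumed softmax-normalised over the last axis."""
    H, W = disp_lr.shape
    padded = np.pad(disp_lr, 1, mode='edge')
    # 3x3 neighbourhood of every low-resolution pixel -> [H, W, 9]
    neigh = np.stack([padded[dy:dy + H, dx:dx + W]
                      for dy in range(3) for dx in range(3)], axis=-1)
    up = np.einsum('hwuvn,hwn->hwuv', weights, neigh)   # [H, W, 8, 8]
    up = up.transpose(0, 2, 1, 3).reshape(H * 8, W * 8)
    return up * 8.0  # disparities are measured in pixels, so scale with resolution
```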
To reduce the high cost of collecting real-scene disparity or depth data, the SceneFlow dataset generated by the 3D creation suite Blender is used to train and test the network, and real-scene KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) data is used to verify the generalization capability of the proposed method. First, on the FlyingThings3D subset of SceneFlow, 21 818 pairs of 540 × 960 pixel training images are randomly cropped to 256 × 512 pixels and used to train the network for 440 000 iterations with a batch size of 4; the trained network is tested on 4 248 pairs of test images. To verify the rationality of adding the unilateral search process, ablation experiments on SceneFlow compare networks with and without it. Next, the network trained on SceneFlow is tested directly on the KITTI training data to verify generalization from simulated to real-scene data. Then, the SceneFlow-trained network is fine-tuned on the KITTI2012 and KITTI2015 training sets separately (5 500 training iterations each) and cross-tested on the KITTI2015 and KITTI2012 training sets for qualitative analysis. Finally, the network is fine-tuned on the KITTI2012 and KITTI2015 training sets together (11 000 iterations) and evaluated on the KITTI2012 and KITTI2015 test sets to verify its performance further. The code is implemented with the TensorFlow framework.

Result: Before the reliable region extraction step, the accuracy of the proposed method is comparable to that of state-of-the-art methods on the SceneFlow dataset: the average error is only 0.84 pixels. The error decreases as the number of updating iterations increases while the inference time grows, so the trade-off between speed and accuracy can be adjusted by choosing the number of updating iterations. After reliable region extraction, the error on the SceneFlow dataset is further reduced to a historical best of 0.21 pixels. On the KITTI benchmark, the method can rank first when only the reliably estimated regions are evaluated. The colorized disparity maps and point clouds show that reliable region extraction removes almost all occluded regions and a large share of the areas with large errors.

Conclusion: The proposed method is effective for binocular disparity estimation, and the reliable region extraction method efficiently extracts high-precision estimation regions, greatly improving the disparity reliability of the retained regions.
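
For illustration, the training ingredients named above (a segmented learning-rate schedule, RMSProp, and per-iteration supervision with increasing weights) could be wired up in TensorFlow roughly as follows. The boundary steps, learning rates, and the exponential weight gamma = 0.8 (as in RAFT) are assumptions, not values reported in the paper.

```python
import tensorflow as tf

# Segmented learning-rate schedule with RMSProp, as described in the Method;
# the boundary steps and rates here are illustrative assumptions.
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[200_000, 350_000], values=[1e-4, 5e-5, 2.5e-5])
optimizer = tf.keras.optimizers.RMSprop(learning_rate=schedule)

def sequence_loss(disp_preds, disp_gt, gamma=0.8):
    """Supervise the output of every updating iteration, weighting later
    (more refined) iterations more heavily; the exponential weighting with
    gamma = 0.8 is an assumed choice."""
    n = len(disp_preds)
    loss = 0.0
    for i, pred in enumerate(disp_preds):
        loss += gamma ** (n - i - 1) * tf.reduce_mean(tf.abs(pred - disp_gt))
    return loss
```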

Translated title of the contribution: Reliable binocular disparity estimation based on multi-scale similarity recursive search
Original language: Chinese (Simplified)
Pages (from-to): 447-460
Number of pages: 14
Journal: Journal of Image and Graphics
Volume: 27
Issue number: 2
DOIs
Publication status: Published - 16 Feb 2022
