
At the risk of stating the obvious, stereovision in practice has a few interesting challenges. Yes, the main formula is deceptively simple: d = b*f / D (d - depth, D - disparity, b - baseline, f - focal length), but in practice, all 3 terms on the right require some thinking.

The most difficult is D, the disparity: it usually comes from some sort of feature matching algorithm, whether traditional or ML-based. Such algorithms usually require textured surfaces to work properly, so if a surface does not have "enough" texture (say, a uniform gray truck in front of the cameras), the feature matching will work poorly.

In CV research, other simplifying assumptions are made so that epipolar constraints keep the task tractable: coplanar image planes, epipolar lines parallel to the line connecting the focal points, and so on. In practice, these assumptions are usually wrong, so you need, for example, to rectify the images, which is an interesting task by itself. Additionally, the baseline b can drift due to temperature changes and mechanical vibrations, and so can the focal length f, so automatic camera calibration is required (not trivial).
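
To make the pipeline concrete, here is a minimal sketch using OpenCV's SGBM matcher on an already-rectified pair, then applying d = b*f / D. The file names and the baseline/focal values are placeholder assumptions, not from a real rig:

    import numpy as np
    import cv2

    # Assumed rectified grayscale pair; in reality these come out of
    # calibration + cv2.stereoRectify + cv2.remap.
    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    baseline_m = 0.54  # b, assumed KITTI-like baseline
    focal_px = 720.0   # f, assumed focal length in pixels

    # Semi-global matching; numDisparities must be a multiple of 16.
    matcher = cv2.StereoSGBM_create(minDisparity=0,
                                    numDisparities=128,
                                    blockSize=5)

    # SGBM returns fixed-point disparity scaled by 16.
    D = matcher.compute(left, right).astype(np.float32) / 16.0

    # d = b * f / D, masking out invalid (non-positive) disparities.
    valid = D > 0
    depth_m = np.full(D.shape, np.inf, dtype=np.float32)
    depth_m[valid] = baseline_m * focal_px / D[valid]

On low-texture regions the matches come out unreliable or invalid, which is exactly the gray-truck failure mode above.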

Don't forget interesting scenarios like dust or mud on one of the cameras (or on the windshield, if the cameras sit behind it), or rain beading and distorting the image, breaking the feature matcher and the resulting disparity estimates.

Next, to "see" further, a stereo rig needs a decent baseline. For example, in the classic KITTI dataset the baseline is approximately 0.54m, much larger than the ~0.065m spacing of human eyes. That baseline of 54cm, together with the focal length (about 720px for the KITTI vehicle cameras, if I remember correctly), gives about 388m in the ideal case of being able to detect a 1-pixel disparity. But detecting 1px of D is very difficult in practice - don't forget you will be running your algorithm on a car with limited compute resources. Say the smallest disparity you can reliably detect is around 5px; that means a max depth of around 77m - comparable to older Velodyne LiDARs.
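
For reference, the arithmetic behind those numbers, using the ~0.54m baseline and ~720px focal length quoted above:

    baseline_m = 0.54  # b, KITTI baseline
    focal_px = 720.0   # f, approximate KITTI focal length in pixels

    for D_px in (1.0, 5.0):
        depth_m = baseline_m * focal_px / D_px
        print(f"D = {D_px:.0f}px -> max depth ~ {depth_m:.1f}m")

    # D = 1px -> max depth ~ 388.8m
    # D = 5px -> max depth ~ 77.8m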

Some of the issues I mentioned are not specific to stereovision (e.g. monocular cameras need calibration as well); I just wanted to point out that stereovision does not magically enable depth perception. A practical solution would likely combine monocular and stereo cameras, together with SfM (Structure from Motion) and depth-from-stereo algorithms.


