In the past I have posed myself the following question: If each of a number of photos is subdivided into a grid of rectangles, how can a signature be derived from each rectangle, leading to some sort of identifier, such that these identifiers will either match or not match between photos, even though there are inherent mismatches in the photos? The purpose would be to decide whether a rectangle in one photo corresponds to the same subject-feature as a different rectangle in another photo.
Eventually, one would want this information in order to compute a 3D scene-description – a 3D Mesh, with a level of detail equal to how finely the photos were subdivided into rectangles.
Since exact pixels will not be equal, I have thought of somewhat improbable schemes in the past for computing such a signature. These schemes once went so far as first to compute a 2D Fourier Transform of each rectangle at 1 coefficient per octave, to quantize the coefficients into 1s and 0s, to ignore the F=(0,0) bit, and then to hash the results.
But just recently I have come to the conclusion that a much simpler method should work.
At full resolution, the photos can be analyzed as though they formed a single image, in the ways already established for computing an 8-bit color palette, i.e. a 256-color palette, like the palettes once used in GIF Images, and for other images that only had 8-bit colors.
The index number of a color within this palette can be used as the identifier.
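A minimal sketch of the palette step follows, in Python. It assumes median cut as the quantization method, which is only one of the established ways of deriving a 256-color palette and is not named in the text above. The function names `pool_pixels` and `median_cut_palette`, and the representation of pixels as (R, G, B) tuples, are hypothetical choices made for illustration.

```python
def pool_pixels(photos):
    """Treat all photos (each a list of (R, G, B) tuples) as one image."""
    return [pixel for photo in photos for pixel in photo]


def median_cut_palette(pixels, n_colors=256):
    """Derive up to n_colors palette entries by repeated median cuts."""
    boxes = [list(pixels)]
    while len(boxes) < n_colors:
        # Find the box and channel with the widest spread of values.
        best = None  # (spread, box index, channel)
        for i, box in enumerate(boxes):
            if len(box) < 2:
                continue
            for ch in range(3):
                values = [p[ch] for p in box]
                spread = max(values) - min(values)
                if best is None or spread > best[0]:
                    best = (spread, i, ch)
        if best is None or best[0] == 0:
            break  # nothing left that can usefully be split
        _, i, ch = best
        # Split that box at the median of its widest channel.
        box = sorted(boxes.pop(i), key=lambda p: p[ch])
        mid = len(box) // 2
        boxes.append(box[:mid])
        boxes.append(box[mid:])
    # Each palette entry is the mean color of one non-empty box.
    return [tuple(sum(p[c] for p in box) // len(box) for c in range(3))
            for box in boxes if box]
```

The position of a color in the returned list would then serve as the identifier. This naive version rescans every box on each pass, so it is only meant to show the idea, not to process full-resolution photos quickly.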
After the palette has been established, each rectangle of each photo can be assigned an index number, depending on which color of the palette it best matches. It would be important that this assignment not take place, as though we were just averaging the colors of each rectangle. Instead, the strongest basis of this assignment would need to be, how many pixels in the rectangle match one color in the palette. (*)
After that, each rectangle will be associated with this identifier, and for each one the most important result will be the distance from its camera-position at which the greatest number of other cameras confirm its 3D position, according to matching identifiers.
*) The way I would suggest scoring, how well a set of pixels matches one color, would be as the sum of reciprocals, of absolute (R, G, B) differences converted into floating-point numbers, with a softening factor of maybe (2) added to the differences first. That way, 2 exact matches will be as good as 3 near misses…
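The scoring rule in this footnote can be sketched as follows. The sketch assumes that the "absolute (R, G, B) difference" of a pixel is the sum of the absolute differences of its three channels; the footnote could also be read per channel. The names `match_score` and `best_palette_index` are hypothetical.

```python
SOFTENING = 2.0  # added to each difference before taking the reciprocal


def match_score(pixels, color, softening=SOFTENING):
    """Sum of reciprocals of the softened color differences to one color."""
    total = 0.0
    for r, g, b in pixels:
        # Assumption: the difference is summed over the three channels.
        diff = abs(r - color[0]) + abs(g - color[1]) + abs(b - color[2])
        total += 1.0 / (float(diff) + softening)
    return total


def best_palette_index(pixels, palette):
    """Index of the palette color that the rectangle's pixels match best."""
    scores = [match_score(pixels, color) for color in palette]
    return scores.index(max(scores))
```

With a softening factor of 2, an exact match contributes 1/2 and a pixel that is off by 1 contributes 1/3, so that 2 exact matches score the same as 3 near misses. A rectangle of 6 pure-red and 4 pure-blue pixels is assigned to red rather than to a purple that would resemble their average, because the score is dominated by the number of closely matching pixels.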
Ultimately, I would want to take advantage of the fact that the diameter of a circle, as seen from any point along the circle, will always subtend an angle of 90° (Thales' theorem). Thus, if the question is asked next, what the smallest unit of distance from any one camera is at which matching can be distinguished from mismatching, the answer is that a depth-difference according to one camera is a difference in lateral displacement according to at least one other camera.
This should approximately be the reciprocal of the number of squares that each photo is subdivided into, across and top-to-bottom, if the maximum depth is again the diameter of a notional circle.
So, the camera arc should be ±45°.
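The two geometric claims here can be checked numerically with a small sketch. It is only one reading of the argument: it assumes a flat, top-down view, and it ignores perspective when converting a depth-difference into a lateral displacement. The function names are hypothetical.

```python
import math


def subtended_angle_deg(point, a, b):
    """Angle at `point` between the directions to a and b, in degrees (2D)."""
    ax, ay = a[0] - point[0], a[1] - point[1]
    bx, by = b[0] - point[0], b[1] - point[1]
    cos_angle = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def lateral_shift(depth_difference, angle_between_cameras_deg):
    """Sideways displacement that a second camera sees for a depth-difference
    along the first camera's axis (perspective is ignored, as an assumption)."""
    return depth_difference * math.sin(math.radians(angle_between_cameras_deg))


# A diameter of the unit circle, seen from several points on the circle.
for theta_deg in (30, 75, 140, 250):
    theta = math.radians(theta_deg)
    viewpoint = (math.cos(theta), math.sin(theta))
    print(theta_deg, round(subtended_angle_deg(viewpoint, (-1, 0), (1, 0)), 6))
```

If two cameras at the ends of a ±45° arc are 90° apart, `lateral_shift(d, 90)` returns `d` itself, meaning that a depth step of 1/n of the maximum depth appears to the other camera as a sideways step of the same size, which is about one square.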
To be conservative, I would suggest turning the space in front of all the cameras into a volumetric cube, of (n * n * n) elements if each photo is subdivided into (n * n) squares. And I would suggest sampling the maximum depth at intervals of (0.5 / n) to be safe, resulting in (2 * n) probes, the scores of which can temporarily be stored in a scalar.
(Edit 01/07/2017 : And then, I would suggest ‘painting’ one voxel of this cube, the first time the score in a scalar, resulting from one camera-square-depth combination, equals the maximum score for the same scalar, with all the positions computed in 3D, in floating-point numbers, correctly according to trigonometry. )
Thus, a cube should emerge, the voxels of which have either remained untouched, with a density of (0), or been touched one or more times, with a density of (1).
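The probing and painting steps can be sketched as follows, under several assumptions that are not in the text above: the cube is taken to be the unit cube, each camera-square is represented by a ray with an origin and a unit direction inside that cube, and the list of scores (one per probe) has already been computed elsewhere, for example by counting the cameras whose identifiers match. All names are hypothetical.

```python
import math


def depth_probes(n, max_depth=1.0):
    """Return the 2*n sample depths, spaced 0.5/n of the maximum depth apart."""
    step = 0.5 * max_depth / n
    return [step * (k + 1) for k in range(2 * n)]


def voxel_of(point, n):
    """Map a 3D point to voxel indices in an n*n*n unit cube, or None."""
    index = tuple(int(math.floor(c * n)) for c in point)
    if all(0 <= i < n for i in index):
        return index
    return None


def paint_best_probe(painted, n, origin, direction, scores, max_depth=1.0):
    """Paint the voxel at the first probe whose score equals the maximum."""
    depths = depth_probes(n, max_depth)
    best = scores.index(max(scores))  # first probe that reaches the maximum
    # The 3D position is computed in floating-point numbers along the ray.
    point = tuple(o + depths[best] * d for o, d in zip(origin, direction))
    voxel = voxel_of(point, n)
    if voxel is not None:
        painted.add(voxel)  # density 1; voxels never added stay at density 0
    return voxel


painted = set()
scores = [0, 1, 1, 3, 3, 2, 0, 0]  # one score per probe, 2*n of them
print(paint_best_probe(painted, 4, (0.5, 0.5, 0.0), (0.0, 0.0, 1.0), scores))
```

Storing the painted voxels in a set means that a voxel touched several times still ends up with a density of 1, as described.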
And then I would suggest using an Iso-Surface.
When a point on the Iso-Surface is being rendered for any purpose, it needs to be facing and unobstructed to at least two cameras, in order to have non-zero Alpha – i.e. really to be visible. In practice, the complete solution therefore also requires, that each virtual camera-position render a height-map of the scene, which can later be shadow-tested against.
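The visibility rule can be sketched as follows. The sketch assumes that each camera is described by a position and by a hypothetical callable `depth_at`, which returns the depth stored in that camera's rendered map along the ray toward a given point; how that map is rendered and sampled is left out. The small `bias` is also an assumption, added to avoid a surface shadowing itself.

```python
import math


def facing(normal, point, camera_position):
    """True if the surface normal at `point` leans toward the camera."""
    to_camera = [c - p for c, p in zip(camera_position, point)]
    return sum(n * t for n, t in zip(normal, to_camera)) > 0.0


def alpha(point, normal, cameras, bias=1e-3):
    """1.0 if at least two cameras face the point unobstructed, else 0.0."""
    seen = 0
    for camera in cameras:
        position = camera["position"]
        distance = math.dist(point, position)
        # Shadow test: the point is visible if nothing nearer was rendered.
        unobstructed = distance <= camera["depth_at"](point) + bias
        if unobstructed and facing(normal, point, position):
            seen += 1
    return 1.0 if seen >= 2 else 0.0
```

This is the same comparison that shadow-mapping performs for a light source, applied once per camera instead.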
(Edit 01/07/2017 : )
I think I can see the main point at which this suggested system of mine fails: the inaccuracy of real-world camera-angles.
It is already obvious that when private people use their cell-phones to take pictures, they will not be holding the phone parallel to the ground, even if this happens to be the arbitrary assignment of what the camera-position for one shot was supposed to be. But what everybody assumes is that with cell-phones, the internal gyroscope and accelerometer will report the real camera-angle accurately, where the user did not hold it accurately. Real-world positioning is probably beyond the ability of the sensors in the phone to pinpoint.
When I myself circled the subjects of which I created captures, I took the liberty of getting closer to them or farther away from one shot to the next, and of angling the phone downward slightly, for most of the shots that were supposed to be horizontal.
There would be a subconscious tendency to think, that if the user centered the subject in each horizontal shot, while in real 3D the angle was slightly downward, all this might do is raise the subject to be as if centered. But this would be in error.
What this interpretation would imply, is that more-distant points of the subject, which are mapped to higher regions in all the shots, will be higher than the midpoint of the subject, as seen from every angle over a 360° viewing-circle. So, rather than obtaining, for example, an inclined plane, we would obtain a roughly conical apparition, which is wider near the bottom and still fills each perspective to the top of its photo.
When the makers of the public app “123D Catch” decided that it should make one of its photos the main photo, according to which the 3D mesh is most correct, they not only decided against a volumetric analysis, but also decided that if a flat, rectangular object is being centered according to that one perspective, it will be an inclined plane according to that one angle.
Their resulting model can still be rotated any which way, so that the viewing angle I have shown below was not the perspective of this main view.
The main camera-position the app chose, as I recall, would have been 90° to the left of the rendered view of the above model. But because one 3D model resulted, which is textured, we are able to reorient that, before we commit it to the database, and then also to generate a viewing angle ‘from where the ceiling should be’.
But according to the synthetic perspective I have shown, the rectangle was most elevated on one side, and lowest on the other. I simply rotated the model in synthetic 3D, to get rid of that. The tabletop above should be rolled to one side, according to the raw capture. Because I had the tendency to point all my shots downward slightly, the ~rectangle~ generated and correlated in synthetic 3D coordinates, should also be most-elevated at the points most-distant to every camera position.
Clearly then, small errors in camera-angles can lead to huge errors and finally contradictions, in synthesized 3D coordinates. And the makers of 123D Catch have gained stability in comparison, by making exactly one of the real photos the main photo (according to which the rectangle was most-elevated, ‘on the right side of the generated view’ shown above).
( … )
Also, the 3D model which I have downloaded onto one of my PCs does not just possess 1 U,V-mapped texture-image, but 3, separately-U,V-mapped texture images, and this folded, extended part of the mesh derives its colors, not from the first texture-image.
For readers who might not know: Your smart-phone can compute the Tilt (Up / Down) with which a photo is being shot, using its very-accurate accelerometers, since this correlates with the angle of gravity. But, the angle of Pan (Left / Right) with which a photo is being shot, may require the gyroscopes. And gyroscopes ultimately suffer from a phenomenon known as “Gyro-Drift”.
When we use the app I am comparing my ideas to, a widget already shows the user in real-time, which camera-perspective he is currently considering, provided that the app does not detect any Roll, and then the corresponding element of the widget lights up. There is a bottom circle of horizontal perspectives, and a top circle of perspectives that are angled down by 45°. The way this happens allows the user to see the gyro-drift, because once he knows he has walked a full circle around the subject, the widget no longer lights up, at the correct, original position… The app thinks that the user Panned either more or less than exactly 360°.
I think that regardless of which widget-element is being selected by the app for one shot, each photo should better store the angle of Tilt with it, as measured, so that the real angle of tilt can later be used to derive the 3D geometry.
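How Tilt and Roll might be derived from the accelerometer, to be stored with each photo, can be sketched as follows. The sketch assumes an Android-style device frame (x to the right of the screen, y toward the top of the screen, z out of the screen), a rear camera looking along the negative z axis, and a phone held still, so that the accelerometer reading points away from the ground. The function name and the sign conventions are hypothetical.

```python
import math


def tilt_and_roll_deg(ax, ay, az):
    """Estimate camera Tilt and Roll from one accelerometer reading at rest.

    Tilt is the elevation of the rear camera's optical axis above the
    horizon (negative when the camera points downward). Roll is the
    rotation of the phone about that optical axis.
    """
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    ux, uy, uz = ax / norm, ay / norm, az / norm  # unit 'up' in device frame
    # Assumption: the optical axis is (0, 0, -1), so its component along
    # 'up' is -uz.
    tilt = math.degrees(math.asin(max(-1.0, min(1.0, -uz))))
    roll = math.degrees(math.atan2(ux, uy))
    return tilt, roll


# Phone upright but leaning back, so that the rear camera looks down by 45 degrees.
print(tilt_and_roll_deg(0.0, 6.94, 6.94))
```

Pan cannot be recovered this way, because rotating the phone about the vertical does not change the direction of gravity in the device frame.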
And I think that the allowable amount of error in Tilt and Roll corresponds to the width of one rectangle, into which the photos are hypothetically subdivided, according to my own thinking. I.e., features of the subject which were photographed as belonging to one specific row of rectangles on one side of the photo should not drift into another row on the other side, and the same rule should work for the depth-range, and therefore for the angle of Tilt.
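As a rough sketch, this tolerance can be turned into numbers. The Roll limit below follows from requiring that a row drift by less than one rectangle across a photo that is n rectangles wide. The Tilt limit assumes that one row spans an equal share of the vertical field of view, which is only an approximation, and the field of view itself is a hypothetical parameter that the text above does not give.

```python
import math


def roll_tolerance_deg(n):
    """Largest Roll error for which a row drifts by under one rectangle
    across a photo that is n rectangles wide."""
    return math.degrees(math.atan(1.0 / n))


def tilt_tolerance_deg(vertical_fov_deg, n):
    """Approximate angular height of one row of rectangles (assumption:
    the rows divide the vertical field of view evenly)."""
    return vertical_fov_deg / n


for n in (8, 16, 32):
    print(n, round(roll_tolerance_deg(n), 2), round(tilt_tolerance_deg(60.0, n), 2))
```

Both limits shrink roughly in proportion to 1/n, so a finer subdivision of the photos demands proportionally more accurate angles.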
(Edit 01/09/2017 : ) The thought occurs to me, that a smart-phone has three additional sources of information:
- The GPS coordinates.
- The focal length which the camera has adjusted its auto-focus to.
- The magnetic field of the Earth.
I am sure that all this information could be integrated, to form a more-accurate positioning / orientation set.
A single rectangle-tag from one camera can be compared with vertical groups of 3 rectangles from the other cameras, and the scores added, in order to determine the best horizontal parallax.
A single rectangle can be compared with horizontal groups of rectangles, in order to determine vertical parallax.
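The search for horizontal parallax can be sketched as follows. The sketch assumes that each camera's identifiers are held in a 2D list indexed as [row][column], that a match between two identifiers scores 1 and a mismatch scores 0, and that the candidate shifts are whole rectangles. The function name and the `max_shift` parameter are hypothetical.

```python
def best_horizontal_parallax(tag, row, col, other_tags, max_shift):
    """Find the column shift at which a vertical group of 3 rectangles in
    another camera's grid best matches one rectangle's identifier."""
    n_rows, n_cols = len(other_tags), len(other_tags[0])
    best_shift, best_score = None, -1
    for shift in range(-max_shift, max_shift + 1):
        c = col + shift
        if not 0 <= c < n_cols:
            continue
        # Vertical group of 3: the row above, the same row, the row below.
        score = sum(1 for r in (row - 1, row, row + 1)
                    if 0 <= r < n_rows and other_tags[r][c] == tag)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift, best_score


other = [[1, 2, 7, 4],
         [5, 6, 7, 8],
         [9, 1, 7, 3]]
print(best_horizontal_parallax(7, 1, 0, other, 3))  # -> (2, 3)
```

Comparing against a vertical group makes the horizontal search tolerant of a small vertical misalignment. The vertical parallax would be found the same way, with the roles of rows and columns exchanged, so that a horizontal group is tested at each candidate row shift.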
If there is more than one distinct set of GPS coordinates, a model with sub-meshes can be derived. In that case however, the real size of the mesh produced in 3D by one camera-group needs to be known, in relation to the distances between accurate positions. Otherwise, a set of photographs would only be defined by their angles of Pan, and by a notional circle which forms from that, around one subject.