A Method for Obtaining Signatures for Photogrammetry

In the past I have asked myself the following question. Suppose that a number of photos are each subdivided into a grid of rectangles. How can a signature be derived from each rectangle, leading to some sort of identifier, such that these identifiers will either match or not match between photos, even though there are inherent mismatches in the photos? The purpose would be to decide whether a rectangle in one photo corresponds to the same subject-feature as a different rectangle in another photo.

Eventually, one would want this information in order to compute a 3D scene-description – a 3D Mesh, with a level of detail equal to how finely the photos were subdivided into rectangles.

Since exact pixel values will not be equal, I have thought of somewhat improbable schemes in the past for computing such a signature. One of these schemes went so far as first to compute a 2D Fourier Transform of each rectangle at 1 coefficient per octave, then to quantize the coefficients into 1s and 0s, to ignore the F=(0,0) bit (the constant term), and finally to hash the results.
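To make that earlier scheme concrete, here is a minimal sketch in Python with NumPy. Several details are my own assumptions and not part of the original idea: the rectangle is taken to be grayscale, "one coefficient per octave" is read as sampling the spectrum at frequency indices 1, 2, 4, ... along each axis, the quantization threshold is the median magnitude, and SHA-1 stands in for whatever hash one prefers. The function names are hypothetical.

```python
import hashlib
import numpy as np

def octave_frequencies(n):
    """Return the frequency indices [0, 1, 2, 4, ...] up to the Nyquist index n // 2."""
    freqs, f = [0], 1
    while f <= n // 2:
        freqs.append(f)
        f *= 2
    return freqs

def signature_bits(rect):
    """Quantize one spectral magnitude per octave pair into a bit, dropping F=(0,0)."""
    arr = np.asarray(rect, dtype=np.float64)
    spectrum = np.abs(np.fft.fft2(arr))
    fy = octave_frequencies(arr.shape[0])
    fx = octave_frequencies(arr.shape[1])
    # Element [0] of the flattened selection is the F=(0,0) term, which is ignored.
    coeffs = spectrum[np.ix_(fy, fx)].ravel()[1:]
    # Assumption: threshold at the median, so roughly half the bits are set.
    return coeffs > np.median(coeffs)

def signature_hash(rect):
    """Hash the packed bits into a hexadecimal identifier."""
    bits = signature_bits(rect)
    return hashlib.sha1(np.packbits(bits).tobytes()).hexdigest()
```

One weakness shows up immediately: a single flipped bit produces a completely different hash, so two rectangles that almost agree will not match at all. That is part of why I call the scheme improbable.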

But just recently I have come to the conclusion that a much simpler method should work.

At full resolution, the photos can be analyzed as though they formed a single image, using the methods already established for computing an 8-bit color palette, i.e. a 256-color palette, like the palettes once used in GIF images and in other image formats that only had 8-bit colors.

An index number into this palette can then be used as an identifier.
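As a rough illustration of building one palette across all the photos, the sketch below uses the "popularity" approach, which is one of the simpler established palette methods: colors are dropped into coarse bins and the most populated bins become the palette. Median-cut or octree quantization would be more typical choices for good palettes, so this is only a stand-in. The function name and the `bits` parameter are hypothetical, and the photos are assumed to be 8-bit RGB arrays.

```python
import numpy as np

def popularity_palette(photos, k=256, bits=4):
    """Build one shared palette of up to k colors from several RGB photos.

    photos: iterable of (H, W, 3) uint8 arrays, treated as one combined image.
    bits:   bits kept per channel when binning (must be between 1 and 7).
    Returns an (m, 3) integer array with m <= k.
    """
    pixels = np.concatenate(
        [np.asarray(p, dtype=np.uint8).reshape(-1, 3) for p in photos])
    shift = 8 - bits
    q = (pixels >> shift).astype(np.int64)            # coarse bin per channel
    keys = (q[:, 0] << (2 * bits)) | (q[:, 1] << bits) | q[:, 2]
    bins, counts = np.unique(keys, return_counts=True)
    top = bins[np.argsort(-counts, kind="stable")[:k]]  # most populated bins first
    mask = (1 << bits) - 1
    rgb = np.stack([(top >> (2 * bits)) & mask,
                    (top >> bits) & mask,
                    top & mask], axis=1)
    return (rgb << shift) + (1 << (shift - 1))        # bin centres as palette colors
```

The row position of a color in the returned array is then the index number that serves as the identifier.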

After the palette has been established, each rectangle of each photo can be assigned an index number, depending on which color of the palette it best matches. It would be important that this assignment not take place as though we were just averaging the colors of each rectangle. Instead, the strongest basis of this assignment would need to be how many pixels in the rectangle match one color in the palette. (*)

After that, each rectangle will be associated with this identifier, and for each one the most important result becomes the distance from its camera-position at which the greatest number of other cameras confirm its 3D position, according to matching identifiers.
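The difference between voting and averaging can be sketched as follows. Each pixel is first mapped to its nearest palette color, and the rectangle then takes the index that the largest number of its pixels chose. The nearest-color test here is plain squared distance in RGB, which is my assumption; the function name is hypothetical, and the palette is taken as given.

```python
import numpy as np

def rectangle_signatures(image, palette, rect_h, rect_w):
    """Assign each rect_h x rect_w rectangle the palette index most of its pixels match.

    image:   (H, W, 3) array; partial rectangles at the edges are dropped.
    palette: (K, 3) array of palette colors.
    Returns an integer array of shape (H // rect_h, W // rect_w).
    """
    image = np.asarray(image, dtype=np.float64)
    palette = np.asarray(palette, dtype=np.float64)
    rows, cols = image.shape[0] // rect_h, image.shape[1] // rect_w
    pixels = image[:rows * rect_h, :cols * rect_w]
    # Squared RGB distance from every pixel to every palette color.
    # This builds an (H, W, K) array, which is fine for a sketch but
    # would need chunking for full-resolution photos.
    dist = ((pixels[:, :, None, :] - palette[None, None, :, :]) ** 2).sum(axis=-1)
    nearest = dist.argmin(axis=-1)
    signatures = np.empty((rows, cols), dtype=np.int64)
    for r in range(rows):
        for c in range(cols):
            block = nearest[r * rect_h:(r + 1) * rect_h,
                            c * rect_w:(c + 1) * rect_w]
            # Majority vote: count the pixels per palette index, keep the winner.
            counts = np.bincount(block.ravel(), minlength=len(palette))
            signatures[r, c] = counts.argmax()
    return signatures
```

For example, a rectangle with 10 red pixels and 6 blue ones gets the index of red under this vote, whereas averaging would produce a purple that no pixel in the rectangle actually has.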

Dirk

Photogrammetry Acknowledgement

In this earlier posting, I described a form of “photogrammetry” in which an arbitrary, coarse base-geometry is assumed as a starting point, and from which micropolygons are spawned, in order to approximate a more-detailed final geometry.

I must acknowledge that within this field a different domain also exists, in which the computer tries to guess at an arbitrary geometry without any such starting point. Of course, this is a much more difficult form of the subject, and I do not know much about how it is intended to work.

I do know that aside from the fact that swatches of pixels need to be matched from one 2D photo to the next, one challenge which impedes this is the fact that parts of the (yet-unknown) mesh will occlude each other to some camera-positions but not others, in ways that computers are poor at predicting. To deal with that requires such complex fields as “Constraint Satisfaction Programming”, which is closely related to ‘Logic Programming’, etc.

(Edit 01/05/2017 : Also, if we can assume that a 2D grid of pixel-swatches is being tagged for exact matching, and that only horizontal parallax is to be measured, then entire rows of rectangles that all have the same signature can be cumbersome to code for, because only the end-points of such a run change position from one photo to the next… And then one run of signatures can end, to be replaced by another, after which, on the same row, the first signature can simply resume.

Further, if we knew that this approach was being used, then we could safely infer that the number of mesh-units we derive will correspond to the number of rectangles into which each photo has been subdivided, not to the number of pixels. )
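One way to make such rows less cumbersome, offered only as a sketch, is to collapse each row into runs of equal signatures and then compare the run end-points between photos instead of the individual rectangles. The function names are hypothetical.

```python
from itertools import groupby

def signature_runs(row):
    """Collapse a row of signatures into (signature, start, length) runs."""
    runs, start = [], 0
    for sig, group in groupby(row):
        length = len(list(group))
        runs.append((sig, start, length))
        start += length
    return runs

def run_boundaries(row):
    """Positions at which the signature changes along the row."""
    return [start for _, start, _ in signature_runs(row)][1:]
```

With a left-hand row of `[7, 7, 7, 7, 3, 3, 7, 7]` and a right-hand row of `[7, 7, 7, 3, 3, 7, 7, 7]`, the boundaries are `[4, 6]` and `[3, 5]`, so the run of 3s has shifted by one rectangle, and signature 7 simply resumes after it in both rows.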

If that were to succeed, I suppose it could again form a starting-point for the micropolygon-based approach I was describing.

I do know of at least one consumer-grade product, which uses micropolygons.

The wording ‘Light Values’ can play tricks on people.

What I wrote before, was that between (n) real, 2D photos, 1 light-value can be sampled.

Some people might infer that I meant, always to use the brightness value. But this would actually be wrong. I am assuming that color footage is being used.

And if I wanted to compare pixel-colors, to determine best-fit geometry, I would most want to go by a single hue-value.

If the color being mapped averages to ‘yellow’ – which facial colors do – then hue would be best-defined as ‘the difference between the Red and Green channels’.

But the drawback is that the actual photographic film which was used around 1977 differentiated most poorly between Red and Green, as did any chroma / video signal. And Peter Cushing was being filmed in 1977, so that our reconstruction of him might appear in today’s movies.

So then an alternative might be, ‘Normalize all the pixels to have the same luminance, and then pick whichever primary channel the source was best-able to resolve into minute details, on a physical level.’

Maybe 1977 photographic projector-emulsions differentiated the Red primary channel best?

Further, given that there are 3 primary colors in most forms of graphics digitization, and that I would remove the overall luminance, it would follow that maybe 2 actual remaining color channels could be used, the variance of each computed separately, and the variances added?

In general, it is Mathematically safer to add Variances, than it would be to add Deviations, where Variance corresponds to Deviation squared, and where Variance therefore also corresponds to Energy, if Deviation corresponded to Potential. It is more generally agreed that Energy and its homologues are conserved quantities.
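A minimal sketch of that last idea follows, under assumptions of my own: luminance is approximated by the plain sum of the three channels (a weighted luma would be the more usual definition), and the two channels kept afterwards are Red and Green, which is an arbitrary pick since the third is then determined by the other two. The function name is hypothetical.

```python
import numpy as np

def chroma_variance(samples):
    """Sum of the variances of two luminance-normalized color channels.

    samples: (n, 3) RGB values that n photos recorded for one candidate point.
    A small result means the photos agree on the color, regardless of brightness.
    """
    rgb = np.asarray(samples, dtype=np.float64)
    total = rgb.sum(axis=1, keepdims=True)
    total[total == 0] = 1.0          # leave pure black at zero, avoid dividing by 0
    norm = rgb / total               # each row now sums to 1
    # Variance per channel computed separately, then the variances are added.
    return norm[:, 0].var() + norm[:, 1].var()
```

Samples such as (200, 100, 50) and (100, 50, 25) differ only in brightness and give a result of essentially zero, while (200, 100, 50) against (50, 100, 200) gives a clearly positive one.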

Dirk

Modern Photogrammetry

Modern Photogrammetry makes use of a Geometry Shader – i.e. a Shader which starts with a coarse grid in 3D, and which interpolates a fine grid of micropolygons, again in 3D.

The principle goes, that a first-order, approximate 3D model provides per-vertex “normal vectors” – i.e. vectors that always stand out at right angles from the 3D model’s surface in an exact way, in 3D – and that a Geometry Shader actually renders many interpolated points, to several virtual camera positions. And these virtual camera positions correspond in 3D, to the assumed positions from which real cameras photographed the subject.

The Geometry Shader displaces each of these points, but only along its interpolated normal vector, derived from the coarse grid, until the positions to which the point renders take light-values from the real photos that correlate as closely as possible. I.e. the premise is that at some exact position along the normal vector, a point generated by a Geometry Shader will project to positions in all the real camera-views, at which all the real cameras photographed the same light-value. Finding that point is a 1-dimensional process, because it only takes place along the normal vector, and can thus be achieved with successive approximation.
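The 1-dimensional search can be sketched on the CPU as below. This is not how any particular product does it. The `sample_fns` argument is a hypothetical stand-in for the real work of projecting a 3D point into each photo and reading the light-value there, agreement is measured as the variance of those values, and "successive approximation" is read as a coarse scan that is repeatedly narrowed around the best candidate. The sketch also assumes that the disagreement has a single minimum within the search range.

```python
import numpy as np

def refine_along_normal(base, normal, sample_fns, t_min, t_max, steps=9, rounds=4):
    """Find the offset t at which base + t * normal looks the same to all cameras.

    sample_fns: hypothetical callables, one per camera; each maps a 3D point to
                the light-value that camera's photo holds where the point projects.
    """
    base = np.asarray(base, dtype=np.float64)
    normal = np.asarray(normal, dtype=np.float64)
    normal = normal / np.linalg.norm(normal)

    def disagreement(t):
        point = base + t * normal
        return np.var([fn(point) for fn in sample_fns])

    lo, hi = t_min, t_max
    best = lo
    for _ in range(rounds):
        candidates = np.linspace(lo, hi, steps)
        costs = [disagreement(t) for t in candidates]
        best = candidates[int(np.argmin(costs))]
        # Narrow the interval to one grid spacing on either side of the winner.
        spacing = (hi - lo) / (steps - 1)
        lo, hi = max(t_min, best - spacing), min(t_max, best + spacing)
    return float(best)
```

Each round costs `steps` evaluations per point, so the accuracy improves geometrically with the number of rounds while the work only grows linearly.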

(Edit 01/10/2017 : To make this easier to visualize: If the original geometry was just a rectangle, then all the normal vectors would be parallel. Then, if we subdivided this rectangle finely enough, and projected each micropolygon some variable distance along that vector, there would be no reason to say that there exists some point in the volume in front of the rectangle, which would not eventually be crossed. At a point corresponding to a 3D surface, all the cameras viewing the volume should in principle have observed the same light-value.

Now, if the normal-vectors are not parallel, then these paths will be more dense in some parts of the volume, and less dense in others. But then the assumption becomes, that their density should never actually reach zero, so that finer subdivision of the original geometry can also counteract this to some extent.

But there can exist many 3D surfaces, which would occupy more than one point along the projected path of one micropolygon – such as a simple sphere in front of an initial rectangle. Many paths would enter the sphere at one distance, and exit it again at another. There could exist a whole, complex scene in front of the rectangle. In those cases, starting with a coarse mesh which approximates the real geometry in 3D, is more of a help than a hindrance, because then, optimally, again there is only one distance of projection of each micropolygon, that will correspond to the exact geometry. )

Now one observation which some people might make, is that the initial, coarse grid might be inaccurate to begin with. But surprisingly, this type of error cancels out. This is because each micropolygon-point will have been displaced from the coarse grid enough, that the coarse grid will finally no longer be recognizable from the positions of micropolygons. And the way the micropolygons are displaced is also such, that they never cross paths – since their paths as such are interpolated normal vectors – and so no Mathematical contradictions can result.

This holds to whatever extent geometric occlusion has been explained by the initial, coarse model.

Granted, if the initial model was partially concave, then projecting all the points along their normal vector will eventually cause their paths to cross. But then this also defines the extent, at which the system no longer works.

But, according to what I just wrote, even the lighting needs to be consistent within one set of 2D photos, so that any match between their light-values actually has the same meaning. And really, it’s preferable to have about 6 such photos…

Yet, there are some people who would argue, that superior Statistical Methods could still find the optimal correlations in 1-dimensional light-values, between a higher number of actual photos…

One main limitation to providing photogrammetry in practice, is the fact that the person doing it may have the strongest graphics card available, but eventually needs to export his data to users who do not. So one way in which it works for public consumption, is that the actual photogrammetry gets done on a remote server – perhaps a GPU farm – and simplified data then gets downloaded onto our tablets or phones, which the mere GPU of that tablet or phone is powerful enough to render.

But the GPU of the tablet or phone is itself not powerful enough, to do the actual successive approximation of the micropolygon-points.

I suppose, that Hollywood might not have that latter limitation. As far as they are concerned, their CGI specialists could all have the most powerful GPUs, all the time…

Dirk

P.S. There exists a numerical approach, which simplifies computing Statistical Variance in such a way, that Variance can effectively be computed between ‘an infinite number of sample-points’, at a computational cost which is ‘only proportional to the number of sample-points’. And the equation is not so complicated.
 s² = Mean(X²) - ( Mean(X) )²
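In code, the point of the formula above is that only three running totals need to be kept, so sample-points can keep arriving without any of them being stored. The class below is a minimal sketch with hypothetical names. One caveat I am sure of: when the mean is large compared to the spread, subtracting two nearly equal numbers loses precision, and Welford's algorithm is the usual remedy.

```python
class RunningVariance:
    """Accumulate the count, the sum of X and the sum of X squared, one sample at a time."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def add(self, x):
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def variance(self):
        """Population variance: Mean(X squared) minus the square of Mean(X)."""
        mean = self.sum_x / self.n
        return self.sum_x2 / self.n - mean * mean
```

Feeding in the samples 2, 4, 4, 4, 5, 5, 7, 9 gives a mean of 5 and a variance of 4, at a cost of one addition and one multiply-add per sample.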