Transcript of episode 009
Automatically generated with OpenAI Whisper
Date: 2026-04-20 11:41:09
==================================================
If I hand you a printed photograph of a big, very ornate fountain in a park, your brain instantly performs this incredibly complex feat without you even noticing. Oh, yes, totally, effortlessly. You instantly know how far away the pool of water is, or that the trees are maybe 50 feet behind the whole structure. Exactly. You can even tell roughly where the photographer was standing to capture that angle. But, and here is the wild part, if you hand that exact same digital photograph to a computer, it is completely, dimensionally blind. Yes, it is. It doesn't see a landscape or a fountain or anything at all. To a piece of software, it is just a flat, arbitrary grid of colored values. Right, colored values. Curing that dimensional blindness is, basically, our mission today. So, welcome to the Deep Dive, everybody. Thanks for having me. It is a great topic to unpack. Yes, we are tackling the mechanics of 3D reconstruction today. But we are putting one massive constraint on ourselves, which keeps us focused on what matters. We are concentrating on a very specific challenge, which is capturing large outdoor spaces without relying on drones at all. That makes it much harder, but much more accessible. Exactly. We are going to look at how you can take a standard smartphone or just a consumer DSLR, walk up to a monument or a park, and let open source software build a photorealistic, explorable 3D reality out of your flat 2D files. Yes, and removing the drone from the equation really changes the situation. It changes the mathematical foundation of the entire process. When you use aerial vehicles, you benefit from, well, basically a set of structural cheat codes. Cheat codes, right? Because of the flight paths. Exactly. Drones fly in these predictable grid patterns. They have these super expensive RTK GPS systems that track their position down to the literal millimeter. So the images are highly ordered, and the computer already has a huge head start. Sure, but ground-based capture is just chaotic. It is totally chaotic. You walk on your own two feet. Maybe you hold the camera at chest level and, I don't know, maybe you lift it above your head to keep your own shadow out of the shot. You tilt it, you step around a puddle. It is chaotic. We are talking about the chaos inherent in human capture geometry. Which is exactly our focus for this deep dive. We are looking at a technical breakdown from an Everypoint presentation, and it really focuses on opening up this sort of black box of 3D reconstruction software. Yes. Specifically looking at a central tool called COLMAP, which is just this open source workhorse. And to anchor all these heavy technical concepts, the source uses one very relatable primary example that I want to keep coming back to today. Yes, the fountain at the Oregon State Capitol. One of the researchers took a camera, walked up to a highly textured fountain in front of the state capitol building on a bright, sunny day, and recorded a video while walking a couple of loops around the structure. And then they extracted individual frames from that video to feed into the software. Yes, just human movement, a sunny day and a regular camera. No lasers, no radar, no drones.
But before we see how the software actually handled that video data, we need to understand the origins of COLMAP, right? Yes, definitely. COLMAP was authored by a researcher named Johannes Schoenberger. This was during his graduate work at UNC Chapel Hill, roughly 2010 to 2015. And his academic journey is super interesting because it perfectly mirrors the transition we just talked about, moving from structured aerial capture to unstructured ground capture. Exactly. Before he developed COLMAP, he worked on an earlier pipeline called MAVMAP, which stood for Micro Aerial Vehicle Mapper. OK, so that work was totally dependent on those aerial cheat codes we mentioned. The software expected that nice, predictable, top-down perspective from an aerial vehicle. Yes, but Schoenberger recognized the limitations of that approach. Which is where the new name comes from, right. Oh, yes, because COLMAP stands for Collection Mapper. Exactly, Collection Mapper. A completely random collection. It is the equivalent of taking on a jigsaw puzzle, except the pieces are not all from a single image. Imagine that every jigsaw piece is a photograph taken by a different tourist, on a different day, from a different angle and distance. That is a good way to think about it. And you just dump that box of mismatched pieces onto a table and ask the computer not only to assemble the flat puzzle, but to push it outward, to build the three-dimensional physical shape of the object. And the software has to dynamically deduce the scale, the lighting, the focal length and the physical position of the photographer for every single piece on the table. Which sounds impossible. It does. But the source shares this phenomenal anecdote from Schoenberger's time at UNC Chapel Hill that really proves the robustness of the approach. Oh, yes, the PhD student story. Yes. So, while Schoenberger was actively writing the first iterations of COLMAP, a PhD student at the same university was facing a huge computational wall with his final project. And this student had a dataset of, what, 100 million images? 100 million images. Yes. He could do the initial visual analysis, but he could not reconstruct the final 3D geometry from a collection that large and unstructured. A standard setup simply could not handle the math of that many viewpoints with the tools that existed before. Yes. So Schoenberger offered him his early build of COLMAP. It took on the dataset, and the math held up against the chaos of the collection. It processed the geometry, and that student was able to finish and publish his PhD work. That is incredible. I mean, when you build an engine capable of chewing through 100 million chaotic images, a couple hundred frames of an Oregon fountain seems pretty trivial. Seriously. But the core problem the software is solving in all of these cases is figuring out the camera pose. And pose here means two very specific things, right? Yes, it means the physical coordinates of the camera in physical space, and the rotation, meaning the pitch, the yaw and the roll. Getting that pose is the ultimate goal, but the software obviously cannot just guess where the camera was. It has to build a case based on visual evidence. Exactly.
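To make the idea of a pose concrete before we move on, here is a minimal Python sketch of the two quantities being solved for, a rotation and a position. The layout is purely illustrative and is not COLMAP's internal representation (COLMAP itself stores each image's rotation as a quaternion plus a translation vector):

```python
# A minimal, illustrative sketch of a camera pose: a rotation (pitch, yaw and
# roll folded into one matrix) plus a position in world space. Not COLMAP's
# actual data structures.
from dataclasses import dataclass
import numpy as np

@dataclass
class CameraPose:
    R: np.ndarray  # 3x3 rotation matrix: which way the lens points
    c: np.ndarray  # 3-vector: where the camera sits in the world

    def world_to_camera(self, point: np.ndarray) -> np.ndarray:
        """Express a world-space 3D point in this camera's own frame."""
        return self.R @ (point - self.c)

# A camera two meters up, axes aligned with the world.
pose = CameraPose(R=np.eye(3), c=np.array([0.0, 2.0, 0.0]))
print(pose.world_to_camera(np.array([0.0, 2.0, 5.0])))  # -> [0. 0. 5.]
```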
So when you feed your folder of fountain images into the pipeline, the first phase of the job is teaching the computer to identify structural landmarks. We call that feature extraction. Feature extraction. Because before it can compare two photos to see whether they overlap, it has to find something worth comparing. Exactly. And it uses an algorithm for that. The most common one is called SIFT, which stands for Scale Invariant Feature Transform. SIFT. So what does SIFT actually do when it sees an image? It is effectively hunting for high-contrast anomalies across the pixel grid. The algorithm scans for areas where the color or the brightness changes dramatically and suddenly. Like a sharp edge. Yes. Or a cluster of dark pixels surrounded by light pixels, or vice versa. In the computer vision world, those are referred to as blobs. Blobs. I love how deeply technical terms so often boil down to blobs. But SIFT does a lot more than hunt for dark spots, right? I mean, to be scale invariant, it has to find landmarks a computer can recognize whether the photographer is standing two feet away or fifty feet away. That is the crucial part. And it achieves that by mathematically blurring and shrinking the image over and over again, at many different levels. It applies what are called Gaussian blurs and then compares the blurred versions to find features that actually survive the distortion. Wait, it blurs the image to find features? Yes. The process is known as Difference of Gaussians. By subtracting a slightly more blurred version of the image from a slightly less blurred one, the algorithm isolates the most resilient structural details. OK, I think I get it. The fine granular texture of a stone might register as a feature at full resolution, but the big sweeping curve of the stone's outline registers as a feature at a blurrier resolution. And SIFT catalogs all of those. Exactly, it catalogs features at every one of those scales. The source uses a really clean example to illustrate a perfect feature, something like a dark doorknob mounted on a bright white door. The contrast is stark, the shape is distinct, and the algorithm easily flags it as a structural landmark. Yes, but our focus today is entirely outdoors. We are capturing big outdoor spaces without drones. So if we look at our primary example, the ornate stone fountain in front of the state capitol, we can see why that object is a gold mine for this software. Because a fountain carved out of rough, weathered stone is covered in high-contrast detail. It is not some featureless smooth surface. Right. So you have to frame your shots to maximize texture and keep blank, featureless areas like open sky to a minimum.
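As a rough sketch of what this extraction step looks like in code, here is OpenCV's SIFT implementation standing in for COLMAP's own built-in extractor; the image path is a placeholder:

```python
# Feature extraction sketch using OpenCV's SIFT as a stand-in (COLMAP ships
# its own SIFT implementation; the underlying idea is the same).
import cv2

img = cv2.imread("fountain.jpg", cv2.IMREAD_GRAYSCALE)

# Internally, SIFT builds a stack of progressively blurred images, subtracts
# adjacent levels (Difference of Gaussians), and keeps the blobs that survive
# as extrema across scales.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

print(len(keypoints), "landmarks found")
print(keypoints[0].pt, keypoints[0].size)  # where the blob is, and at what scale
print(descriptors.shape)                   # (num_keypoints, 128): the signatures
```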
But if we are relying on these high-contrast features, I see one very obvious vulnerability with ground-based capture. Oh, yes. What is that? Well, we are in a public space. We are not above the scene in a drone. The environment is dynamic. If I am filming this fountain and a person in a very dark jacket walks into my frame, that jacket creates a huge bar of contrast against the light-colored concrete. Oh, absolutely. So SIFT flags the jacket as a structural landmark, right? Or if a car drives past the capitol building in the background, SIFT tracks the corners of the car's windows. Yes, because the algorithm has no conceptual understanding of what a car or a person is. It only sees contrast and movement. But if the software tracks those moving features across different frames, it assumes the physical geometry of the scene itself is changing. It tries to calculate a single 3D model out of points that are actively changing location. The mathematical foundation of the scene starts to warp. And that is a really valid critique of ground-based capture. If one moving car can break the geometry, why not just go back to the drone? Is saving a few bucks worth data that fragile? Fair question. But an environment that is inherently hostile to the static-geometry assumption does not mean you abandon the methodology. You just apply a filter. The source details the implementation of masks to solve this exact vulnerability. Ah, masks. OK. Like digital masking tape. Exactly like digital masking tape. Before feature extraction even runs, you can feed the software mask files corresponding to your images. You explicitly instruct the algorithm to ignore every pixel inside the masked regions. Oh, so if you know a street has active traffic, you just mask it out. Or if you keep catching your own shadow on the pavement while you walk, you mask that out too. So, in essence, you curate the visual data, forcing the software to turn a blind eye to the dynamic noise, so it only extracts features from the permanent stone and architecture. That keeps the math stable. That is how you control the chaos of ground-level capture. OK, that makes total sense. But identifying the features is really only half the preparation, right? Because SIFT also describes the pixel values around each feature, so it has a unique signature to search for later. Right, it creates a descriptor for every point. And then COLMAP asks you to define the camera model. The camera model. Let's break that down. Sure. The software needs a mathematical translation layer between the physical world and the digital sensor that captured it. Different lenses bend light in different ways. The camera model provides the parameters needed to interpret that bending. And those parameters are usually represented as letters in the software. So you have f, which is the focal length. That tells the computer the inherent field of view, whether the lens was zoomed in tight or pulled way back. OK. Then you have cx and cy, which define the principal point, basically the optical center of the image. OK, that all sounds pretty standard, but there are other variables that matter far more for this use case. Yes, the variables that dictate the success of outdoor reconstructions are k1 and k2. Those are the radial distortion terms. Radial distortion. This is where understanding the optics becomes vital, because think about it: if I am trying to capture a massive monument from the ground, I am often physically constrained. Maybe I am backed up against a wall and still completely unable to fit the whole structure into a standard lens. Right, you cannot just fly farther away like a drone can. Exactly. So what do I do? I pull out an action camera with a super wide fisheye lens. And the wider the lens, the more extreme the radial distortion. The camera bends the light to compress a wider field of view onto the flat digital sensor. Which means the columns of the capitol building behind our fountain appear distinctly bowed and curved in the resulting 2D photograph. Exactly. They look warped, fisheye style. So if I do not give the software those k1 and k2 parameters, the computer will assume it is looking at a distortion-free image. It will literally calculate the 3D positions of the capitol columns as if they were physically bowed and bending in real life. The geometry would be catastrophically wrong. The columns would come out looking like bananas. But if you select a radial or OpenCV camera model in COLMAP, you hand the software the specific algebraic formula it needs to unbend the light. Ah, so it mathematically flattens the curved image back out before it even attempts to calculate depth. Yes, exactly.
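Here is a hedged sketch of what that setup can look like when driving COLMAP's actual command-line interface from Python. The project paths are placeholders; the mask convention (a black pixel means "ignore this pixel") and the flag names follow COLMAP's documentation:

```python
# Sketch: run COLMAP's feature extractor with masks and an explicit camera
# model. Assumes the colmap binary is on PATH; all paths are placeholders.
import subprocess

subprocess.run([
    "colmap", "feature_extractor",
    "--database_path", "project/database.db",
    "--image_path", "project/images",
    # Masks live in a parallel folder; pixels that are black (value 0) in an
    # image's mask are ignored entirely during feature extraction.
    "--ImageReader.mask_path", "project/masks",
    # RADIAL model: focal length f, principal point cx and cy, plus the two
    # radial distortion terms k1 and k2 discussed above.
    "--ImageReader.camera_model", "RADIAL",
    # Every frame came from the same physical camera, so share its parameters.
    "--ImageReader.single_camera", "1",
], check=True)
```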
OK, so let's recap. We have our features extracted, we have our masks in place blocking out the cars and the people, and we have mathematically unbent our lenses. All we have at this point is a folder full of images, each one containing a list of tens of thousands of described points. The software still has absolutely no idea how photo A relates to photo B. Right, so we have to build the relational web. And that moves us into the next phase, correspondence search. That encompasses feature matching and verification. And since we are talking about unstructured captures, the matching strategy we pick can make or break the whole project. Seriously. The software offers several distinct matching strategies. The most basic is exhaustive matching. OK, that is the brute-force method. Right. The computer takes image number one, looks at its 10,000 features and compares them to the 10,000 features in image number two, then compares image one to image three, then image one to image four. It compares every image against every other image in the set, with zero assumptions about where or when they were taken. It is an N-squared math problem. If you only have 100 photos of a statue, exhaustive matching is totally fine. It basically guarantees you never miss a match. OK, but we extracted 1,000 frames. Right. With 1,000 frames, we are not asking the computer to do 1,000 operations. We are asking it to do roughly half a million complex visual comparisons. And the computational cost keeps growing quadratically from there. Brute-forcing that kind of math on a simple walk-around capture of a statue just does not scale. You have to give the software a smarter search logic. Which brings us to sequential matching. Simple assumption: chronological proximity equals spatial proximity. Meaning if they were taken at the same time, they were probably taken in the same place. Exactly. If you were walking around the fountain at a steady pace recording a video, frame number 150 was physically captured a fraction of a second after frame 149 and just before frame 151. Because of that timeline, frame 150 and 151 are basically guaranteed to share a massive percentage of their visual features. They are looking at the exact same patch of concrete from maybe a millimeter apart. Right. So sequential matching tells the software, hey, do not bother comparing frame 150 to frame 800. Just check the five frames immediately before it and the five frames immediately after it. That is brilliant. It dramatically slashes the workload. The massive N squared problem just vanishes. And it's replaced by a highly efficient localized search window that physically follows the path you walk. Extremely fast.
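The difference in workload is easy to put numbers on. A quick back-of-envelope sketch, assuming the 1,000 extracted frames from the example and a sequential window of five frames on either side:

```python
# Counting how many image pairs each matching strategy has to deeply compare.
n = 1000      # extracted frames
window = 5    # sequential matching: five frames before and five after

exhaustive_pairs = n * (n - 1) // 2                        # every image vs. every other
sequential_pairs = sum(min(window, n - 1 - i) for i in range(n))

print(exhaustive_pairs)  # 499500 heavy comparisons
print(sequential_pairs)  # 4985 comparisons, roughly 1% of the brute-force cost
```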
But there is a catch with sequential matching in the real world. I was just thinking about this. Let's say I'm walking around the fountain recording my video. Halfway through the loop, a dog runs past me. I instinctively turn around, point the camera totally away from the fountain, film the dog for 10 seconds, and then turn back to the fountain to finish my loop. Yeah, you have completely shattered the chronological logic of the data set. Right. Frame 300 is the fountain. Frame 301 is a tree on the opposite side of the park where the dog ran. Frame 600 is back to the fountain. And the entire 3D model will snap in half because the chain of connections is broken. Yes. Frame 1 will never connect to frame 1000 because of that interruption. Exhaustive matching is way too slow. And sequential matching is incredibly vulnerable to interruptions or just disorganized photo collections. Which is why COLMAP utilizes a third highly sophisticated strategy. The vocab tree. The vocab tree. The vocabulary tree. I love this one. It sounds like natural language processing but applied to visual data. The mechanism is conceptually very similar to a search engine index like Google. When you select vocab tree matching, the software does a rapid preliminary scan of every image in the folder. It doesn't do deep feature matching yet. Instead, it quantizes visual information. It summarizes the dominant textures and shapes into a compact list of visual words. So it generates something like the index at the back of a textbook. It tags, say, image 50 with visual words like jagged stone, water ripple, and moss. Exactly. And when it comes time to find matches for image 50, the software just queries the index. It asks the vocab tree to return a list of the top 50 other images in the entire dataset that share those specific visual words. Oh wow. So it instantly filters out the 900 irrelevant photos of the sky or the pavement or that dog you filmed. And it only performs the heavy intensive feature matching on the most probable candidates. It is an incredibly elegant way to handle massive, totally disorganized ground collections. It really is.
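Here is a toy sketch of the inverted-index idea behind vocab tree matching. A real vocabulary tree quantizes 128-dimensional SIFT descriptors against a learned tree of cluster centers; the string visual words and image IDs below are purely illustrative:

```python
# Toy inverted index: visual word -> images containing it, then candidate
# retrieval by counting shared words. Strings stand in for quantized
# SIFT descriptors.
from collections import defaultdict

images = {
    50:  {"jagged_stone", "water_ripple", "moss"},
    51:  {"jagged_stone", "water_ripple"},
    400: {"sky", "cloud"},                      # irrelevant sky shot
    800: {"jagged_stone", "moss", "bronze_plaque"},
}

index = defaultdict(set)
for image_id, words in images.items():
    for word in words:
        index[word].add(image_id)

def candidates(query_id):
    """Rank other images by how many visual words they share with the query."""
    votes = defaultdict(int)
    for word in images[query_id]:
        for other in index[word]:
            if other != query_id:
                votes[other] += 1
    return sorted(votes, key=votes.get, reverse=True)

# The two other fountain images surface immediately; the sky shot is never
# even considered for heavy feature matching.
print(candidates(50))
```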
Now, there is actually a fourth matching option in the software called spatial matching, but we need to explicitly rule it out for our specific mission today. Right. Definitely rule it out. Spatial matching relies on the GPS coordinates embedded in the image metadata. It matches photos based on where the camera physically reported being on the globe. And spatial matching is really the domain of the drone operator. An enterprise drone flying in open air with an RTK module records its physical location with millimeter precision. The software can safely rely on that metadata to decide which images overlap. But down on the ground, GPS is a complete mess. It's terrible. The signal bounces off tall buildings, creating multi-path errors. It gets absorbed by thick tree canopies. If you rely on your smartphone's GPS chip to tell COLMAP which photos of the fountain are close to each other, a bad signal bounce might tell the software you were standing a hundred yards away. And then the spatial logic just completely collapses. So for capturing large outdoor spaces from the ground, we must rely on the visual evidence through sequential or vocab tree matching, not the satellite data. One hundred percent. So once your chosen matching strategy is executed, the software draws thousands of tentative connections. These are literal lines linking a SIFT key point in one image to a highly similar key point in another. This raw matching data is notoriously gullible. Gullible is the perfect word for it. The SIFT algorithm is brilliant at finding high contrast, but it has zero common sense. Zero. Let's step away from the fountain for a second and imagine we're capturing a large historic brick building. SIFT extracts thousands of features from the sharp corners of the individual bricks. And the algorithm looks at a brick corner on the far left side of the building, and it looks at a brick corner on the far right side of the building. To SIFT, the texture, the contrast, and the local pixel neighborhood are mathematically identical. So the raw matching algorithm enthusiastically draws a strong connection between the left side of the building and the right side of the building. It declares them the exact same point in physical space. Which is a huge problem, because if COLMAP accepts that match, it is going to attempt to physically fold the 3D model of the building in half just to force those two bricks to occupy the same coordinates. It'll look like a taco. Yeah. Repetitive patterns, brickwork, modern window grids, chain link fences, even the canopy of a highly textured tree will generate thousands of these false positive matches. We must sanitize the data before we calculate depth. And this critical cleanup phase is called geometric verification. The software basically has to test the physics of the matches. Right. It looks at a pair of images and the hundreds of points connecting them. And it uses advanced mathematical models, specifically calculating what are called homographies or essential matrices. While the linear algebra there is dense, the underlying concept is actually highly intuitive. Think about it. If you take a photo of a flat wall, take one step to your right and take a second photo. Every single feature on that wall should shift in your camera frame in a predictable uniform direction based on your physical movement. Exactly. The whole scene translates together. So the geometric verification looks at the collective behavior of the entire web of points. Right. So if 99 points shift uniformly to the left, but one specific point like our misidentified brick corner moves to the right, or jumps to a completely different area of the frame, the math flags it. The software basically determines that for that specific brick to actually be the same physical object, the camera would have to exist in two different locations simultaneously. The movement violates the laws of physics. So the software identifies the outlier, severs the false connection and just deletes the match. It trims the web until only the geometrically consistent, physically possible connections remain.
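That flat-wall intuition maps directly onto a homography fit. A minimal sketch using OpenCV's RANSAC estimator as a stand-in for COLMAP's internal two-view verification, run on synthetic matches where one brick corner is deliberately wrong:

```python
# Geometric verification sketch: every match shifts uniformly (a flat wall
# seen after one step to the right), except one false match from repetitive
# brickwork. RANSAC keeps the consistent majority and flags the outlier.
import cv2
import numpy as np

rng = np.random.default_rng(0)
pts_a = rng.uniform(0, 1000, (100, 2))
pts_b = pts_a + np.array([40.0, 0.0])   # uniform 40-pixel shift
pts_b[7] = [900.0, 50.0]                # the misidentified brick corner

H, inlier_mask = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 3.0)
print(int(inlier_mask.sum()), "of", len(pts_a), "matches survive")  # 99 of 100
```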
And with that, we have now completed feature extraction and correspondence search. We finally have a pristine, verified web of relationships between all our photographs. But, and this is wild, we are still entirely stuck in two dimensions. We haven't calculated a single millimeter of actual depth yet. Nope. But we have finally reached the threshold of 3D. We enter step four of the pipeline, incremental reconstruction. This is where the physical world is actually born from the flat data. And it requires an incredibly delicate process called initialization. Initialization. So the software has to pick a starting point. It has to choose two images from the folder, calculate the distance to the features they share and plot the very first 3D points in the empty void. Right. And I think the natural assumption for anyone doing this is that the software just grabs image number one and image number two and starts doing the math. And that assumption is exactly why many initial attempts at 3D scanning fail. For incremental reconstruction, starting with two chronologically adjacent images is usually a catastrophic mistake. Really? Why? It all comes down to the necessity of parallax and baseline. OK, let's define the baseline in camera terms. The baseline is simply the physical distance between the camera's position in the first photo and its position in the second photo. Exactly. Now think about walking around the fountain shooting a 60 frame per second video. The physical distance you travel between frame one and frame two is perhaps a fraction of an inch. Oh, right. So the baseline is virtually non-existent. And because the cameras are so close together, the perspective shift, the parallax between the two frames, is mathematically non-existent. So the baseline is negligible. Yep. If the software tries to calculate the depth of the fountain using two viewpoints that are only an inch apart, the math is incredibly weak. It's kind of the equivalent of trying to accurately judge the distance of a mountain peak 10 miles away by shifting your head a millimeter to the left. That's a perfect analogy. The resulting 3D points from that tiny baseline will be noisy, wildly inaccurate and structurally unsound. If you build the rest of the model on that weak foundation, the entire reconstruction will fail. And the software knows this. During initialization, COLMAP scans the entire web of verified matches, searching for a pair of images that satisfy two somewhat conflicting requirements. First, they must share a massive number of verified features, so they're strongly linked. Second, they must have a very wide physical baseline, a drastic change in perspective. Which brings us directly back to the specific capture strategy used by the researchers at the Oregon State Capitol. When they filmed the fountain, they didn't just walk in a single circle. No, they didn't. They walked one complete loop holding the camera high above their heads. Then they lowered the camera to chest level and walked a second complete loop. And that vertical shift is the absolute key to successful initialization. Because of that strategy, COLMAP searches the data and totally bypasses the weak connection between frame one and frame two. Instead, it might select frame one from the very beginning of the overhead loop and frame one from the very beginning of the chest-level loop. So, the software has a perfect stereoscopic foundation. It uses that wide perspective shift to triangulate the depth of those shared features with extremely high mathematical confidence. Right. It plots that first cluster of 3D points into the void. The foundation is solidly set.
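You can put rough numbers on the baseline argument with the textbook stereo relation, depth = focal length * baseline / disparity. The focal length, distances and the vertical spacing between the two loops below are illustrative guesses, not figures from the source:

```python
# Back-of-envelope: how much does a half-pixel matching error corrupt the
# estimated depth, for a tiny baseline versus a wide one? All numbers invented.
def depth_error(depth_m, baseline_m, focal_px=1500.0, noise_px=0.5):
    disparity = focal_px * baseline_m / depth_m           # pixels of parallax
    noisy_depth = focal_px * baseline_m / (disparity - noise_px)
    return abs(noisy_depth - depth_m)

# Fountain 10 m away: two adjacent video frames (~2.5 cm apart) versus the
# overhead loop and the chest-level loop (~0.6 m apart vertically).
print(f"{depth_error(10.0, 0.025):.1f} m error")  # tiny baseline: ~1.5 m of noise
print(f"{depth_error(10.0, 0.6):.2f} m error")    # wide baseline: ~0.06 m, stable
```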
Now, the incremental part of the reconstruction truly begins. The incremental loop. How does that work? The software enters this continuous rigorous cycle. It looks at the small cluster of 3D points it just generated. Then it searches the remaining 2D images for a third photo that shares SIFT features with those existing 3D points. Okay. It finds an overlapping image. Now, it has to figure out where that third camera was actually standing in physical space. That's right. It executes a mathematical process called image registration. Yeah. And the underlying math here is known as the perspective-n-point problem, or PnP. The software analyzes how the known 3D points project onto the flat 2D sensor of the new image. And by reversing that projection, it calculates the exact physical XYZ coordinates and the rotation of that third camera in the virtual space. Once it anchors that third camera, it uses that new vantage point to look at other features in the image, calculates their depth, and plots even more 3D points into the void. And that process is called triangulation. The model grows. Then it finds a fourth overlapping image, registers the fourth camera, triangulates more points. Then a fifth, then a tenth, then a hundredth. Camera by camera, point by point, the geometry of the fountain is slowly pulled out of the darkness and assembled.
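Here is a sketch of that registration step using OpenCV's PnP solver as a stand-in for COLMAP's. The scene is synthetic: known 3D points are projected through a made-up camera pose, and the solver then recovers that pose from the 2D to 3D correspondences alone:

```python
# Image registration sketch via the perspective-n-point problem (PnP).
import cv2
import numpy as np

K = np.array([[1500.0, 0.0, 960.0],     # a made-up intrinsic matrix
              [0.0, 1500.0, 540.0],
              [0.0, 0.0, 1.0]])
points_3d = np.random.default_rng(1).uniform(-2, 2, (50, 3)) + [0.0, 0.0, 10.0]

# Ground-truth pose of the "new" third camera: panned 10 degrees, shifted right.
rvec_true = np.array([0.0, np.radians(10.0), 0.0])
tvec_true = np.array([0.5, 0.0, 0.0])
points_2d, _ = cv2.projectPoints(points_3d, rvec_true, tvec_true, K, None)

# From where the known 3D points landed on the new image, recover the camera.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(points_3d, points_2d, K, None)
print(ok, tvec.ravel())  # recovers approximately [0.5, 0.0, 0.0]
```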
It is a phenomenal process, but honestly, it immediately raises a massive red flag for me. The drift. Yes, the drift. If it is building this entire world sequentially, if camera four derives its position from camera three and camera five derives from camera four, we are essentially dealing with a mathematical telephone game. And error accumulation is the single most dangerous threat to an incremental pipeline. Let's say the calculation for camera three is off by just a microscopic fraction of a degree due to sensor noise or a slightly blurry pixel. Camera four bases its entire reality on camera three, so it inherits that tiny error, and inevitably adds its own microscopic miscalculation on top. Exactly. And by the time the software registers camera 1000 on the opposite side of the fountain, those microscopic errors have compounded exponentially. The fountain would physically twist. The start of the loop would never align with the end of the loop. If the software merely added cameras blindly, the result would always be a warped, unusable mess. So to combat this drift, COLMAP constantly employs a rigorous optimization algorithm known as bundle adjustment. Bundle adjustment. The terminology in this field is so dense, but the concept is actually very physical. When I first hear bundle adjustment, I picture someone trying to carry a massive, overflowing armful of firewood. That's a good visual. A log starts slipping, the balance shifts, and the person has to momentarily stop walking, shuffle the logs around, and aggressively tighten their grip on the entire bundle so it doesn't all collapse onto the ground. What is actually being bundled in the software? The metaphor aligns perfectly with the math, honestly. The bundle refers to the dense web of mathematical viewing rays, the invisible lines connecting every calculated camera position to every triangulated 3D point in the space. OK. Bundle adjustment is the mathematical process of simultaneously tweaking the positions of the cameras and the 3D points to minimize the tension and error in that web. So the software is actively tightening its grip on the geometry, and the source details two specific flavors of this optimization. Local bundle adjustment and global bundle adjustment. Local bundle adjustment happens continuously. Every single time COLMAP adds a new image and triangulates new points, it pauses and runs a local adjustment. But crucially, it does not recalculate the entire scene. It only adjusts the firewood that is actively slipping. So if the software just registered camera 800 on the north face of the fountain, adjusting the cameras on the far south face won't help anything. Exactly. The software isolates the immediate neighborhood, the newly added camera and the cameras directly connected to it, and runs a rapid optimization algorithm, usually something like Levenberg-Marquardt. It subtly shifts the local points to minimize the immediate error. It is a fast, efficient way to keep the local geometry tight as the model expands. But local adjustments cannot solve the overarching systematic drift that slowly bows the entire model over hundreds of frames. Eventually, the software must reckon with the entire data set at once. That is when it triggers a global bundle adjustment. The software stops adding new images entirely. It takes the entire bundle, every single camera pose it has calculated, and hundreds of thousands of 3D points and essentially drops them all onto the floor. And then it runs a massive computationally punishing optimization across the entire model simultaneously, forcing every single point into the most mathematically cohesive shape possible. It totally eradicates the telephone game error. But the source highlights a truly brilliant aspect of this global optimization. The software isn't just tweaking the physical locations of the cameras. It actively interrogates the camera model parameters we defined way back in step one. Wait, remember the k1 and k2 radial distortion numbers we fed it for our wide angle lens? Yes, exactly. During a global bundle adjustment, the software looks at the massive amount of 3D evidence it's compiled and realizes, hey, if I adjust my assumption about the physical curvature of the glass lens by just 1%, the geometric error across the entire fountain drops dramatically. That's insane. The software refines its understanding of the physical lens based on the reality of the 3D world it is building. The model continuously teaches the software how to see it better. It is a breathtaking feedback loop. It really is.
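Before the hardware detour that comes next, here is a deliberately tiny, self-contained sketch of the bundle adjustment idea, shrunk to a 2D world with one-dimensional images: SciPy's Levenberg-Marquardt solver jointly nudges a drifted camera position and all the landmark points until every reprojection matches the observed evidence. All numbers are invented for illustration:

```python
# Toy bundle adjustment in a 2D world: cameras on a line, 1-D pinhole images.
import numpy as np
from scipy.optimize import least_squares

f = 500.0                                    # toy focal length, in pixels
cam_x_true = np.array([0.0, 1.0, 2.0])       # three camera positions
pts_true = np.array([[0.5, 8.0], [1.5, 10.0], [2.5, 9.0], [-1.0, 12.0]])

def project(cam_x, pt):                      # 1-D pinhole: u = f * x / depth
    return f * (pt[0] - cam_x) / pt[1]

# Every camera observes every point; these pixels are the collected evidence.
observed = np.array([[project(c, p) for p in pts_true] for c in cam_x_true])

def residuals(params):
    cam_x = np.array([0.0, 1.0, params[0]])  # cameras 0 and 1 anchor the scene
    pts = params[1:].reshape(-1, 2)
    pred = np.array([[project(c, p) for p in pts] for c in cam_x])
    return (pred - observed).ravel()         # the tension in every viewing ray

# Start from a drifted guess, the way an incremental pipeline would.
guess = np.concatenate([[cam_x_true[2] + 0.3], (pts_true + 0.2).ravel()])
result = least_squares(residuals, guess, method="lm")
print(round(result.x[0], 3))                 # drifted camera pulled back to 2.0
```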
But this incredible, meticulous step by step process, registering a camera, triangulating, local adjustment, global adjustment, leads us directly into the most notorious frustration for anyone attempting 3D reconstruction. It is the hardware bottleneck. Yes. It is the moment when the theory hits the reality of the silicon sitting on your desk. I see this constantly online. A learner wants to reconstruct a local monument. They go out and buy a massive, top-tier, incredibly expensive graphics processing unit, a GPU. They load up their photos, hit run, and the feature extraction phase finishes in 90 seconds. The GPU crushes the SIFT algorithm. Flies right through it. But then incremental reconstruction begins. The progress bar slows to a crawl. The software estimates it will take five hours to finish. The user checks their system monitor and their hyper expensive GPU is sitting at maybe 4% utilization, while their standard CPU is maxed out and overheating. Why is the actual birth of the 3D model barely using the graphics card? To understand that bottleneck, you really have to look at the fundamental architectural differences between how a GPU and a CPU solve problems. The source material provides a fantastic framework for this. Think of the computational workload as a police search. A GPU is structured like an army of a million rookie cops. Right. Each one has access to only a tiny amount of memory. VRAM limitations are a massive factor in 3D processing. And their internal architecture, the ALUs, are designed for very simple math. So you line up your million rookies across a massive field and you tell them, everyone take one step forward, stare at the single square inch of grass in front of your boots and shout if you see a clue. The task is massively parallelized. Millions of simple independent operations executing simultaneously. That is exactly why feature extraction is so incredibly fast on a GPU. Finding a high contrast SIFT blob in the top left corner of an image has absolutely zero mathematical dependency on finding a blob in the bottom right corner. The million rookies scan the grid instantly, but incremental reconstruction is not a search. It is an unfolding, highly complex sequence. You cannot parallelize a sequence. You cannot register camera five until the underlying math for camera four is definitively locked in. And camera four depends on camera three. If you ask a million rookie cops to solve a sequential murder mystery, the investigation collapses. Rookie number 500 cannot do his job because he is waiting for rookie 499 to hand him the evidence. You don't need an army of rookies. You need a single master detective. And that is the CPU. A CPU doesn't have thousands of cores. It has eight, sixteen or maybe twenty four. But each of those cores is vastly more sophisticated. They have access to massive amounts of system RAM and they are engineered to solve intense, deeply convoluted linear algebra equations step by step without stalling. So the CPU is the master detective, meticulously solving the localized and global bundle adjustments, unspooling the mathematical knot one camera at a time. The heavy lifting of incremental 3D reconstruction is fundamentally, inherently a CPU bound process. You simply have to wait for the detective to finish the work. That structural reality is deeply satisfying to understand, but it doesn't really change the fact that five hours is a long time to wait for a model. It is a very long time. What if you need to process thousands of ground images and you cannot afford the CPU bottleneck? What if there is a way to entirely skip the slow sequential detective work? Which brings us to the bleeding edge of the source material and the introduction of an alternative pipeline called GLOMAP. GLOMAP, that stands for Global Mapper. And fascinatingly, Johannes Schoenberger, the original architect of the incremental COLMAP, was deeply involved in developing this new approach just recently. Yes. He recognized the unavoidable CPU bottleneck of his own software and helped engineer a fundamentally different mathematical strategy. So how does GLOMAP differ? GLOMAP abandons the sequential foundation entirely. It does not initialize a starting pair. It does not build the model one camera at a time. It attempts the incredibly audacious task of solving the position of every single camera in the entire dataset simultaneously. Wait, how is that mathematically possible? If you throw a thousand unstructured ground images into the void all at once, how can the software possibly organize them without a step by step baseline? It splits the pose estimation into two distinct phases, relying heavily on a concept called rotation averaging. OK, rotation averaging.
Now we still run the initial feature extraction and correspondence search. We still need that verified web of 2D matches connecting the images. Right. So we know what photos share features. We just don't know where the photos belong in physical 3D space. Exactly. But instead of trying to find the XYZ location right away, GLOMAP takes every single pair of matched images and performs a rapid calculation to estimate only the relative rotation between them. It ignores physical distance entirely. Completely ignores distance. It simply asks, how much did the photographer tilt or pan the lens between photo A and photo B? So it only cares about the direction the camera was pointing. It calculates thousands of these isolated rotational guesses. Then it throws all those relative rotations into a massive global optimization algorithm. The software searches for a single unified orientation framework that satisfies as many of those individual guesses as possible. Oh, I see. And because it is only solving for rotation, the math is significantly less complex than a full bundle adjustment. And it can actually leverage much more parallel processing. Exactly. It gets every single camera pointing in the correct direction relative to each other all at the exact same time. But they are still just floating haphazardly in space. The actual geometry doesn't exist yet. Once the global rotation is locked, it initiates the global positioning step. The algorithm slides all those properly oriented cameras along their mathematical viewing rays, shifting them through the void until the thousands of 2D feature matches all converge and intersect on cohesive 3D points in the center. It pulls the entire chaotic web tight in one massive movement. The model basically pops into existence almost instantly.
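Rotation averaging is easiest to see in one dimension. In this toy sketch, each measurement is a noisy pan angle between neighboring frames around a full loop, and a single linear least-squares solve finds global headings that honor every pairwise guess at once, including the measurement that closes the loop back to camera 0. Real systems average full 3D rotations, so this only conveys the flavor of the idea:

```python
# Toy 1-D rotation averaging around a closed loop of 8 cameras.
import numpy as np

n = 8
true = np.arange(n) * (360.0 / n)             # ground-truth headings, degrees

pairs = [(i, (i + 1) % n) for i in range(n)]  # ring; the last pair closes the loop
rel = np.full(n, 45.0)                        # each neighbor-to-neighbor pan
rel += np.random.default_rng(2).normal(0.0, 2.0, n)   # every guess is noisy

# Linear system: heading[j] - heading[i] = measured pan, camera 0 pinned at 0.
A = np.zeros((n, n))
for row, (i, j) in enumerate(pairs):
    A[row, i] -= 1.0
    A[row, j] += 1.0
b = rel.copy()
b[-1] -= 360.0                                # the closing pair wraps past 360
solved = np.concatenate([[0.0], np.linalg.lstsq(A[:, 1:], b, rcond=None)[0]])
print(np.round(solved - true, 1))             # leftover error, spread around the ring
```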
The source notes that GLOMAP operates at speeds that make incremental reconstruction look archaic. But a shortcut that aggressive obviously comes with severe limitations. Oh, absolutely. When we look at our specific mission today, capturing large outdoor spaces from the ground, GLOMAP is a high risk, high reward gamble. Why is it high risk? It requires an incredibly specific type of data collection to succeed. GLOMAP demands a highly interconnected data set. It thrives on loop closures. Loop closures. Let's go back to the fountain example. The videographer walked a complete circle around the structure, returning to the exact patch of sidewalk where they started. That is a loop closure, right? Yes. Because the end of the image sequence physically overlaps with the beginning of the sequence, the 2D matching web is incredibly dense and structurally sound. GLOMAP uses the strength of that closed loop to anchor the global rotation map, pulling the entire fountain into perfect focus. But consider a different outdoor capture. Let's say I want to 3D model the storefronts on a historic main street. I hold my camera and I walk straight down the sidewalk for 10 blocks, capturing the facades as I go, and then I just stop. I never turn around. I never walk back. I never close the loop. Straight line capture creates a brittle mathematical chain. Frame one shares features with frame five. Frame five connects to frame ten. But frame one has absolutely zero mathematical relationship to frame 1000 at the far end of the block. The global web is incredibly weak. If you feed that straight line data set into GLOMAP, the global positioning math just collapses, doesn't it? It cannot resolve the tension. And the source describes the failure state of GLOMAP, and it is fascinatingly chaotic. It's hilarious, honestly. When incremental COLMAP fails, it just slowly drifts and bends like a banana. When GLOMAP fails, it panics. The visual output is bizarre. The source specifically refers to the failures as Borg cubes. Borg cubes. Because the software lacks the loop closures to anchor the structure, the cameras essentially pull each other into a localized gravitational collapse. So instead of a smooth street, the software outputs a dense, glitchy, perfectly square cube of overlapping cameras just hovering in empty space. The geometry completely destroys itself. The model implodes into a singularity. But, you know, from a workflow perspective, even that catastrophic failure is highly valuable. Because it fails fast. You don't wait five hours for a master detective CPU to slowly build a warped, unusable street. GLOMAP hands you a Borg cube in five minutes. Right. You instantly know your data set lacks structural integrity. So you switch your software back to the slow, incremental COLMAP and you let the CPU brute force the straight line. Having both tools allows you to attack different types of geometry with the appropriate mathematical strategy. This is amazing. We have completely demystified the black box. You really have. We started with the goal of capturing a massive outdoor space without relying on the cheat codes of a drone, and we examined how the SIFT algorithm acts as the eyes of the pipeline, utilizing Difference of Gaussians to hunt for high contrast, scale invariant features on the face of the architecture. Sure. We saw how to blind the software to dynamic noise using masks and how to unbend the physical distortion of our lenses by defining the camera model. We explored how sequential and vocab tree matching allow us to connect those features without crushing our hardware with N squared exhaustive math. We verified the physics of those matches using homographies, initialized our models by leveraging the parallax of a wide baseline, and watched the CPU act as a master detective, incrementally bundle-adjusting the geometry into reality. And we looked at the bleeding edge, utilizing GLOMAP to instantly snap a model together, provided we give it the dense loop closures it craves. The theory is profound, but the source material ends with a direct call to action, and it is one we definitely must echo to you, the listener. Absolutely. Do not leave this knowledge in the abstract. Reading about photogrammetry is vastly different from actually executing it. Do not just download a pristine, prepackaged, open source data set of a statue where you know the lighting is perfect and the model is guaranteed to solve. You have the necessary hardware in your pocket right now. Go outside. Find a textured brick building, a local monument or a park fountain. Grab your smartphone. Walk around the structure, consciously thinking about overlap, texture and parallax. Download COLMAP. It is entirely free and open source. Feed your own photos into the pipeline. Watch the feature extraction isolate the details. Watch the point cloud slowly build. When your model inevitably collapses into a Borg cube, interrogate your capture strategy. Did you capture too much blank sky? Did you break the chronological chain by filming a dog? The trial and error fundamentally changes how you perceive physical space. You will never look at a blank white wall or a heavily textured stone carving the same way again.
It really rewires your brain. It does. Now I want to leave you with a final lingering thought that builds on this entire discussion. The core strength of Schoenberger's collection mapper is that it does not need a drone, it does not need a top down grid, and it does not need RTK GPS. It was explicitly designed to handle the messy, random, overlapping reality of human photography. It thrives on unstructured collections. So consider the implications of scaling that capability. What happens if we take this software and instead of feeding it a thousand photos of a fountain taken this afternoon, we feed it millions of random historical tourist photos of a city center from the 1980s. Oh, wow. Millions of physical photos pulled from attics, scanned into digital files, completely unstructured, but overlapping by sheer coincidence because thousands of different tourists stood in roughly the same spots over the course of a decade. I mean, the SIFT algorithm does not care if the photos were taken ten years apart. If the structural contrast of the stone remains static, the software will match the features. Exactly. We are standing on the verge of being able to dynamically 3D reconstruct the past. We could build fully navigable, photorealistic geometric environments of monuments that have long been destroyed or city squares that have been paved over, utilizing absolutely nothing but the accidental collective 2D memories sitting in our photo albums. That is wild to think about. We can cure the dimensional blindness of history. Something to chew on. That is a deep dive. Thank you for joining us on this exploration of 3D reconstruction. And as always, keep diving deep.