GREYSCOPELABS
June 28, 2026

Teaching Deep Learning to See Architectural Vision: Building Production-Grade Intelligent Systems

Greyscope Teams

Transforming ideas, Empowering innovation

Architecture is not merely physical structure; it is a manifestation of complex geometry, historical context, and fine ornamental detail. For computer vision systems, fine-grained image classification (FGIC) of architectural styles presents a formidable challenge. High intra-class variance, driven by shifting camera angles, lighting distortions, and visual elements shared between distinct building types, causes conventional models to fail at isolating essential structural features. At Greyscope Labs, we view this friction as the optimal proving ground for our deep tech foundry capabilities. Through the Fine-Grained Image Classification of World Architecture initiative, we engineered an enterprise-grade machine learning pipeline that relies heavily on transfer learning. The result is an architecture that is not only theoretically sound but empirically validated, achieving 97.77% test accuracy on highly complex spatial data.

Data Engineering: Establishing an Authentic Foundation

The representation quality of any artificial intelligence system is bound to the integrity of its training data. For a high-precision spatial model, relying on noisy public datasets or images contaminated by generative AI synthetics is unacceptable. Our team built a dedicated data extraction pipeline orchestrated via the Pexels API and BrightData proxies. This automated collection process amassed exactly 13,440 architectural images sourced purely from authentic human photographers. The dataset was then perfectly balanced across eight building classes (1,680 images per class), ranging from modern skyscrapers and medieval castles to ancient temples.

Data Leakage Prevention & Resolution Calibration

Accumulating thousands of images introduces a serious risk of duplicate data leaking between splits. We implemented a dual-filtration system that combines exact byte-level matching (SHA-256) with perceptual deduplication (pHash) to guarantee zero crossover between our training, validation, and testing splits. Furthermore, instead of the industry-standard 224x224 input size, we calibrated the image resolution to 320x320 pixels. This higher resolution lets the model capture micro-details, such as wall textures and facade ornaments, that serve as the primary differentiators in architectural design.
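The dual-filtration idea can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes each image already has a 64-bit perceptual hash computed elsewhere by a hashing library, and the names `deduplicate` and `max_distance`, as well as the threshold of 4 bits, are hypothetical.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Exact-duplicate key: identical bytes always give the same digest."""
    return hashlib.sha256(data).hexdigest()


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two integer perceptual hashes."""
    return bin(hash_a ^ hash_b).count("1")


def deduplicate(images, max_distance=4):
    """Keep the first copy of each image; drop exact and near duplicates.

    `images` is a list of (name, raw_bytes, phash) tuples. The `phash`
    values are assumed to be precomputed; `max_distance` is a hypothetical
    threshold, not a value taken from the project.
    """
    seen_digests = set()
    kept_hashes = []
    kept_names = []
    for name, raw_bytes, phash in images:
        digest = sha256_digest(raw_bytes)
        if digest in seen_digests:
            continue  # byte-identical file
        if any(hamming_distance(phash, h) <= max_distance for h in kept_hashes):
            continue  # visually near-identical image
        seen_digests.add(digest)
        kept_hashes.append(phash)
        kept_names.append(name)
    return kept_names
```

Running a filter like this before the dataset is split is what prevents the same photograph from landing in both the training and testing sets.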

Figure 1: A 2D t-SNE visualization showing how the model separates the 8 architectural class clusters in the embedding plane, a direct result of careful data processing.

Two-Phase Fine-Tuning and Simulating Real-World Chaos

The most prominent vulnerability in fine-grained classification is the model's tendency to blindly memorize background noise, a phenomenon known as overfitting. To neutralize this risk, we modified the EfficientNetV2-S backbone by stacking 14 layers of regularization techniques simultaneously. We engineered a data pipeline that intentionally introduces real-world visual chaos. Through aggressive augmentation techniques like Random Erasing, the system blanks out random pixel regions during training to simulate visual occlusion, such as trees or clouds obstructing a building. The model is also forced to learn from blended samples produced by Mixup and CutMix interpolation. Rather than executing a single, volatile training run, the transfer learning process was split into two deliberate phases:

  1. Phase 1 (Head Feature Extraction): The entire EfficientNetV2-S (small) backbone is frozen. Only the custom head is trained, so it learns basic architectural patterns while minimizing computational waste.
  2. Phase 2 (Selective Fine-Tuning): We unfreeze the top convolution blocks (block6 + top_conv) while keeping Batch Normalization frozen so the pre-trained normalization statistics are preserved. A Discriminative Learning Rate is applied via an override step, forcing the unfrozen backbone variables to learn 10x slower than the head, so the model adapts to the architectural domain without destroying its pre-trained ImageNet weights and suffering catastrophic forgetting.
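The freezing and learning-rate rules of the two phases can be written down as a small, framework-free sketch. Everything here is illustrative: the function name `fine_tune_plan`, the `head_` prefix for head layers, the `_bn` suffix for Batch Normalization layers, and the base learning rate of 1e-3 are assumptions, not details confirmed by the project.

```python
def fine_tune_plan(layer_names, phase, head_lr=1e-3, backbone_factor=0.1):
    """Map each layer name to its learning rate, or None if it stays frozen.

    Assumptions (hypothetical): head layers start with "head_", Batch
    Normalization layers end with "_bn", and head_lr is a placeholder.
    """
    plan = {}
    for name in layer_names:
        if name.startswith("head_"):
            plan[name] = head_lr  # the custom head trains in both phases
        elif (
            phase == 2
            and name.startswith(("block6", "top_conv"))
            and not name.endswith("_bn")
        ):
            plan[name] = head_lr * backbone_factor  # 10x slower than the head
        else:
            plan[name] = None  # frozen: lower blocks, BN, or phase 1 backbone
    return plan
```

In a real training loop, a plan like this would be translated into per-layer `trainable` flags and per-variable learning-rate multipliers in the chosen framework.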

Intelligence Optimization Matrix

| Component | System Implementation | Operational Impact |
| --- | --- | --- |
| Spatial Adaptation | Replacing standard Global Average Pooling with Generalized Mean (GeM) Pooling (p=3.0). | Forces the model to aggressively weight its computational attention toward dominant architectural features. |
| Latent Sample Focus | Integrating Focal Loss (gamma=2.0) combined with dynamic Label Smoothing. | Recalibrates weight updates so the system intensely focuses on latent, hard-to-classify edge cases. |
| Inference Stability | Applying Stochastic Weight Averaging (SWA) statically across the final 10 post-training epochs. | Flattens the model's weight profile toward a stable convergence point, yielding extreme generalization on unseen data. |
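For readers who want the arithmetic behind the three components, here is a minimal pure-Python sketch of the standard formulas, using the values quoted in the matrix (p=3.0, gamma=2.0). It operates on plain lists rather than tensors, leaves out Label Smoothing, and uses function names that are illustrative only; it is not the project's implementation.

```python
import math


def gem_pool(activations, p=3.0, eps=1e-6):
    """Generalized Mean pooling over one channel's spatial activations.

    p=1 reduces to average pooling; larger p moves toward max pooling.
    """
    clamped = [max(a, eps) for a in activations]
    return (sum(a ** p for a in clamped) / len(clamped)) ** (1.0 / p)


def focal_loss(prob_true_class, gamma=2.0, eps=1e-7):
    """Focal loss for one sample, given the probability of its true class.

    The (1 - pt)^gamma factor shrinks the loss of easy, confident samples.
    """
    pt = min(max(prob_true_class, eps), 1.0 - eps)
    return -((1.0 - pt) ** gamma) * math.log(pt)


def average_weights(checkpoints):
    """SWA-style equal average of weight vectors saved at successive epochs."""
    count = len(checkpoints)
    return [sum(values) / count for values in zip(*checkpoints)]
```

With p=3.0, `gem_pool([1.0, 1.0, 4.0])` lands between the mean (2.0) and the max (4.0), which is the sense in which GeM pooling leans toward dominant features.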

Figure 2: Grad-CAM analysis validating that the model does not guess randomly; its computational attention is accurately locked onto structural elements like domes, towers, and rooflines.

Production-Scale Metrics and Latent Anomalies

AI infrastructure is only valuable if its performance can be rigorously audited. This engineering pipeline achieved 97.77% test accuracy alongside an ROC-AUC score of 0.9985. When evaluated using Test-Time Augmentation (TTA), the model's accuracy rose to 97.99%. A mature enterprise architecture dissects its anomalies rather than hiding them. In the classification report, the model performed best on the Skyscraper class, with near 100% precision and recall. Conversely, we recorded the highest variance and lowest recall (0.9524) in the Temple class. Certain ancient temple structures show heavy visual crossover with Mosque domes or stadium grandstands, which makes this class the hardest edge case for the classifier.
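Test-Time Augmentation, in its common form, averages the model's class probabilities over several views of the same image. The sketch below shows that idea under stated assumptions: images are plain 2D lists, `predict_fn` stands in for any model that returns class probabilities, and the horizontal flip is only one example view, since the post does not say which augmentations were used.

```python
def horizontal_flip(image):
    """Mirror a 2D image (a list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def tta_predict(predict_fn, image, augmentations):
    """Average class probabilities over the original image and its augmented views.

    `predict_fn` is a hypothetical stand-in for the trained model.
    """
    views = [image] + [augment(image) for augment in augmentations]
    predictions = [predict_fn(view) for view in views]
    num_classes = len(predictions[0])
    return [
        sum(pred[c] for pred in predictions) / len(predictions)
        for c in range(num_classes)
    ]
```

Averaging over views smooths out predictions that depend on incidental framing, which is one plausible reason TTA accuracy ends up slightly above single-view accuracy.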

Production-Ready Architecture: Cross-Platform Deployment

We do not allow Deep Learning models to stagnate as research artifacts; solutions must be deployed. We immediately converted this architecture into three distinct, production-grade formats:

  • TensorFlow SavedModel: Optimized for large-scale containerized server execution via TensorFlow Serving.
  • TensorFlow-Lite (TF-Lite): Engineered specifically for edge computing and mobile deployment. This format slashes inference latency down to 170 milliseconds (2.1x faster than the native Keras model) while retaining a 100% absolute prediction match with the primary weights.
  • TensorFlow.js (TFJS): The model is partitioned into 23 distinct 3MB proxy shards for progressive loading, allowing client-side web browsers to execute the classification pipeline autonomously without requiring a backend server.
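A prediction-match figure like the 100% quoted for TF-Lite can be checked by comparing the top-1 class of the reference model and the converted model on the same samples. The following is a minimal sketch of such a check; the function names are hypothetical, and the outputs are assumed to be lists of per-class scores already collected from both models.

```python
def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def prediction_match_rate(reference_outputs, converted_outputs):
    """Fraction of samples where both models pick the same top-1 class."""
    if len(reference_outputs) != len(converted_outputs):
        raise ValueError("both models must score the same samples")
    if not reference_outputs:
        raise ValueError("need at least one sample")
    matches = sum(
        argmax(ref) == argmax(conv)
        for ref, conv in zip(reference_outputs, converted_outputs)
    )
    return matches / len(reference_outputs)
```

Note that a match rate of 1.0 only means the predicted classes agree; the raw scores can still differ slightly after conversion.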

Figure 3: Proof of deployment integrity; the radical compression of TensorFlow-Lite yields zero degradation in quality, maintaining a 100% match rate when benchmarked against the heavy primary model.


A live demo is available on HuggingFace (Architecture Building Image Classifier). This initiative reinforces our core methodology: we confront complex research friction and engineer it into efficient, scalable, and deployable intelligent software infrastructure for the modern enterprise.

References

  • Model Architecture & Primary Documentation: Saugani. (2026). Fine-Grained Image Classification of World Architecture: An EfficientNetV2-S Transfer Learning Approach with Layered Regularization. Released under MIT License. Full code review available at the GitHub Repository.

  • Curated Dataset Access & Model Weight Distribution: The collection of 13,440 public architectural images (CC BY 4.0 license) and all final model artifacts are freely accessible via the HuggingFace infrastructure (Dataset: 0xgr3y/arch-building-dataset | Model: 0xgr3y/Arch-Building-Image-Classification).

  • Enterprise Infrastructure & Engineering: Greyscope Labs Deep Tech Foundry (2026). Architecture Exploration & Solutions Platform. Available at: https://greyscope.xyz