Lightweight Room Layout Estimation
2022 · ICCAS — lightweight 3D room layout estimation from a single panorama.
22nd International Conference on Control, Automation and Systems (ICCAS), BEXCO, Busan · Nov 2022 · Indexed in IEEE Xplore
Dayoung Kil, Seong-heum Kim · VIP Lab, Soongsil University
Can a phone-sized network rebuild the 3D layout of a room from a single panorama?
We make HorizonNet lightweight: its ResNet backbone is replaced by a searched MnasNet, and its LSTM by a GRU. The resulting model has less than half the parameters (81.6 M → 37.6 M) and loses only about 2 points of 2D IoU.
The problem
Sharing the inside of a home as a single photo or panorama is now commonplace, but a 2D image distorts the real size and proportions of a 3D space. Room layout estimation recovers the true 3D structure (floor, ceiling, walls) from one image, which is useful for architecture, interior design, and AR.
The catch: state-of-the-art models like HorizonNet are heavy, and camera ISPs / embedded platforms have a tight compute budget.
The goal: keep HorizonNet’s layout quality, but make it light enough for on-device, low-power use.
Approach
HorizonNet recovers a layout in three stages — pre-processing (align the panorama, detect vanishing points), feature extraction (predict a 1D layout of ceiling/floor/wall boundaries), and post-processing (lift it to 3D under the Manhattan-world assumption). We leave this pipeline intact and only make the feature-extraction network lightweight.
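The post-processing step can be illustrated with the standard equirectangular geometry that turns a per-column floor/ceiling boundary into metric distances. This is a minimal sketch, not the authors' code: the function names are hypothetical, the image is assumed to be already aligned so that the horizon sits at mid-height, and the 1.6 m camera height is an assumed value.

```python
import math


def row_to_latitude(v: float, height: int) -> float:
    """Latitude (radians) of image row v in an equirectangular panorama.

    Row 0 maps to the zenith (+pi/2) and row `height` to the nadir (-pi/2).
    """
    return (0.5 - v / height) * math.pi


def floor_distance(v_floor: float, height: int, camera_height: float = 1.6) -> float:
    """Horizontal distance from the camera to the wall base seen at row v_floor.

    camera_height is an assumed value in metres, not taken from the paper.
    """
    lat = row_to_latitude(v_floor, height)
    if lat >= 0:
        raise ValueError("the floor boundary must lie below the horizon")
    return camera_height / math.tan(-lat)


def room_height(v_ceil: float, height: int, distance: float,
                camera_height: float = 1.6) -> float:
    """Floor-to-ceiling height, given the ceiling row in the same image column."""
    lat = row_to_latitude(v_ceil, height)
    if lat <= 0:
        raise ValueError("the ceiling boundary must lie above the horizon")
    return camera_height + distance * math.tan(lat)


# One image column of a 512-row panorama: floor boundary at row 384, ceiling at row 128.
d = floor_distance(384, 512)
h = room_height(128, 512, d)
```

Repeating this for every column gives a ring of wall points around the camera, which the Manhattan-world step then snaps to axis-aligned walls.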
The feature extractor in HorizonNet is ResNet-50 + LSTM. We replace both halves with lighter modules and search the configuration instead of hand-fixing it:
| Stage | HorizonNet (baseline) | Ours (lightweight) |
|---|---|---|
| Backbone | ResNet-50 | MnasNet — platform-aware NAS |
| Sequence model | LSTM (2 states, 3 gates) | GRU (1 state, 2 gates) |
| Hyperparameters | fixed | sampling-based search |
Why these swaps
- ResNet-50 → MnasNet. MnasNet decomposes the network into blocks and uses a factorized hierarchical search space — each block can differ, but layers inside a block share structure, so the search space stays small and mobile-friendly.
- LSTM → GRU. A GRU merges the LSTM’s input/forget gates into one update gate and its cell/hidden states into a single state — fewer parameters for the same sequence modeling.
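The effect of the LSTM → GRU swap on parameter count can be estimated by hand. An LSTM layer holds four weight blocks (three gates plus the candidate cell state) and a GRU three (two gates plus the candidate state), so at equal sizes a GRU needs three quarters of the parameters. The sketch below counts them using the PyTorch convention of two bias vectors per block; the layer sizes in the example are assumptions for illustration, not values reported on this page.

```python
def rnn_params(input_size: int, hidden_size: int, weight_blocks: int,
               num_layers: int = 1, bidirectional: bool = False) -> int:
    """Parameter count of a stacked recurrent network.

    weight_blocks is 4 for an LSTM and 3 for a GRU. Each block has an
    input-to-hidden matrix, a hidden-to-hidden matrix and two bias vectors
    (the PyTorch layout).
    """
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        per_direction = weight_blocks * (
            hidden_size * layer_input      # input-to-hidden weights
            + hidden_size * hidden_size    # hidden-to-hidden weights
            + 2 * hidden_size              # two bias vectors
        )
        total += directions * per_direction
        # The next layer sees the concatenated outputs of all directions.
        layer_input = hidden_size * directions
    return total


# Assumed sizes, for illustration only.
lstm = rnn_params(1024, 512, weight_blocks=4, num_layers=2, bidirectional=True)
gru = rnn_params(1024, 512, weight_blocks=3, num_layers=2, bidirectional=True)
print(f"LSTM: {lstm:,}  GRU: {gru:,}  ratio: {gru / lstm:.2f}")
```

The 3:4 ratio holds for any choice of sizes, because every term in the per-layer count scales with the number of weight blocks.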
Searching the backbone
We tune the 6 inverted-residual blocks with sampling-based optimization. Mirroring ResNet’s 256·512·1024·2048 widths blew up the parameter count, so we use 128 · 256 · 512 · 1024 · 36 · 24 out-channels and assign fewer repeats to the wide blocks, which saves FLOPs and parameters at almost no cost in accuracy.
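The reason wide blocks get fewer repeats shows up in a rough parameter estimate, and the search itself can be sketched as random sampling under a parameter budget. Everything below is a simplified illustration under stated assumptions: the expansion ratio of 6, the 3×3 kernel, the 32 input channels, the 30 M budget and the `toy_score` function are all hypothetical, and a real search would score each sample by training and validating it.

```python
import random


def inverted_residual_params(c_in: int, c_out: int, expansion: int = 6,
                             kernel: int = 3) -> int:
    """Convolution weights in one inverted-residual layer (no biases or batch norm)."""
    c_mid = c_in * expansion
    expand = c_in * c_mid                 # 1x1 pointwise expansion
    depthwise = c_mid * kernel * kernel   # kxk depthwise convolution
    project = c_mid * c_out               # 1x1 pointwise projection
    return expand + depthwise + project


def backbone_params(in_channels: int, out_channels: list, repeats: list) -> int:
    """Total weights for a stack of blocks, each repeated `repeats[i]` times."""
    total, c_in = 0, in_channels
    for c_out, n in zip(out_channels, repeats):
        for _ in range(n):
            total += inverted_residual_params(c_in, c_out)
            c_in = c_out
    return total


def sample_search(out_channels, budget, evaluate, trials=200, max_repeats=4, seed=0):
    """Randomly sample repeat counts and keep the best-scoring one within budget."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        repeats = [rng.randint(1, max_repeats) for _ in out_channels]
        cost = backbone_params(32, out_channels, repeats)
        if cost > budget:
            continue  # over the parameter budget: discard the sample
        score = evaluate(repeats)
        if best is None or score > best[0]:
            best = (score, repeats, cost)
    return best


# Out-channels in the order the text lists them.
OUT_CHANNELS = [128, 256, 512, 1024, 36, 24]


def toy_score(repeats):
    """Placeholder objective that simply rewards depth (not a real accuracy)."""
    return sum(repeats)


best = sample_search(OUT_CHANNELS, budget=30_000_000, evaluate=toy_score)
```

Under these assumptions, one extra 1024-channel layer costs about 12.6 M weights, against about 8 k for an extra 24-channel layer, so repeats are far cheaper to spend on the narrow blocks.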
Results
Both models are trained on a Stanford2D3D + PanoContext mix (817 train / 79 val / 166 test, 300 epochs) and additionally checked qualitatively on real RICOH THETA Z1 panoramas.
| Metric | HorizonNet (ResNet-50 + LSTM) | Ours (MnasNet + GRU) |
|---|---|---|
| Parameters | 81.6 M | 37.6 M (−54%) |
| 2D IoU | 87.07 | 85.07 |
| 3D IoU | 84.53 | 81.89 |
| MSE | 0.18 | 0.21 |
- −54% parameters (81.6M → 37.6M)
- −2.0 2D IoU points only
- ≈ same qualitative 3D layout
On real THETA panoramas, the lightweight model’s 3D reconstructions show no significant visual difference from the original HorizonNet.
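For reference, the 2D and 3D IoU metrics compare the predicted and ground-truth layouts as a floor-plan area and as a room volume, respectively. The sketch below restricts both rooms to axis-aligned cuboids, which is a simplifying assumption; the actual evaluation has to handle general Manhattan-world floor plans. The `Cuboid` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cuboid:
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    z_min: float  # floor level
    z_max: float  # ceiling level


def _overlap(a_min: float, a_max: float, b_min: float, b_max: float) -> float:
    """Length of the overlap of two 1D intervals (0 if they are disjoint)."""
    return max(0.0, min(a_max, b_max) - max(a_min, b_min))


def iou_2d(a: Cuboid, b: Cuboid) -> float:
    """IoU of the two floor plans (the x-y footprints)."""
    inter = (_overlap(a.x_min, a.x_max, b.x_min, b.x_max)
             * _overlap(a.y_min, a.y_max, b.y_min, b.y_max))
    area_a = (a.x_max - a.x_min) * (a.y_max - a.y_min)
    area_b = (b.x_max - b.x_min) * (b.y_max - b.y_min)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_3d(a: Cuboid, b: Cuboid) -> float:
    """IoU of the two room volumes."""
    inter = (_overlap(a.x_min, a.x_max, b.x_min, b.x_max)
             * _overlap(a.y_min, a.y_max, b.y_min, b.y_max)
             * _overlap(a.z_min, a.z_max, b.z_min, b.z_max))
    vol_a = (a.x_max - a.x_min) * (a.y_max - a.y_min) * (a.z_max - a.z_min)
    vol_b = (b.x_max - b.x_min) * (b.y_max - b.y_min) * (b.z_max - b.z_min)
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


truth = Cuboid(0.0, 4.0, 0.0, 3.0, 0.0, 2.5)
pred = Cuboid(1.0, 5.0, 0.0, 3.0, 0.0, 2.0)  # shifted by 1 m, ceiling 0.5 m too low
```

In this example the ceiling error lowers the 3D IoU but leaves the 2D IoU untouched, which is consistent with the 3D score being the lower of the two in the table.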
Ablation — where the savings come from
| Model | Parameters | FLOPs | MSE |
|---|---|---|---|
| ResNet-50 + LSTM (baseline) | 81.6 M | 71.83 | 0.18 |
| MnasNet + LSTM | 40.4 M | 59.19 | 0.23 |
| MnasNet + GRU (ours) | 37.6 M | 58.48 | 0.21 |
- MnasNet does the heavy lifting — it alone roughly halves the parameters (81.6M → 40.4M) and cuts FLOPs (71.83 → 59.19).
- GRU trims a bit more (40.4M → 37.6M) and, interestingly, improves MSE (0.23 → 0.21), which suggests that once the model is right-sized, the extra under-trained parameters hurt rather than help.
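The split of the savings between the two swaps follows directly from the ablation table. The short calculation below uses only the parameter counts reported above; the variable names are illustrative.

```python
# Parameter counts in millions, taken from the ablation table.
baseline = 81.6       # ResNet-50 + LSTM
mnasnet_lstm = 40.4   # MnasNet + LSTM
mnasnet_gru = 37.6    # MnasNet + GRU (ours)

total_saved = baseline - mnasnet_gru          # saved by both swaps together
backbone_saved = baseline - mnasnet_lstm      # saved by ResNet-50 -> MnasNet
rnn_saved = mnasnet_lstm - mnasnet_gru        # saved by LSTM -> GRU

total_reduction = total_saved / baseline
backbone_share = backbone_saved / total_saved
rnn_share = rnn_saved / total_saved

print(f"total reduction:   {total_reduction:.1%}")
print(f"from the backbone: {backbone_share:.1%} of the savings")
print(f"from the GRU:      {rnn_share:.1%} of the savings")
```

Roughly 94% of the saved parameters come from the backbone and 6% from the GRU. The attribution follows the order of the ablation, with the backbone swapped first.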
Takeaways
- A searched MobileNet-style backbone + GRU makes panoramic room-layout estimation embedded-friendly at less than half the parameters.
- The accuracy cost is small (2.0 points of 2D IoU and 2.6 points of 3D IoU), and on real panoramas the 3D layouts show no significant visual difference from the full model.
Future work: add the Structured3D dataset, and use reinforcement learning to search the remaining inverted-residual hyperparameters (kernel size, expansion ratio, stride).
Details
HorizonNet · MnasNet (NAS) · GRU · Stanford2D3D · PanoContext · RICOH THETA Z1
Authors: Dayoung Kil, Seong-heum Kim · VIP Lab, Soongsil University.
Supported by the National Research Foundation of Korea (MSIT), Grant NRF-2021R1G1A1009828.