
AI & Proptech

The Construction Site That Sees Itself: Agentic AI and Image Detection in 2026

Illya Nayshevsky, Ph.D. · 9 min read
[Illustration: a steel-frame high-rise under construction at dusk, two tower cranes silhouetted against a pale blue sky, with geometric AI vision-detection overlays and a drone leaving a contrail.]

The April 2026 inflection

For most of the last five years, construction tech’s "AI" story was photo logs you could search. Reality-capture vendors competed on coverage and ergonomics: how easily you could turn a site walk into a navigable 360 record, how often you could re-fly the same drone path. The intelligence sat inside dashboards. People still made the calls.

Three weeks ago, that story changed.

In a single seven-day window, four vendor announcements and two research releases landed:

  • DroneDeploy announced 20 trillion sq ft of visual site data and activated four production agents (Progress AI, Safety AI, Inspection AI, and Embodied AI) trained on 34M end-user annotations across 3M sites. Robotics missions on the platform grew 160% year over year.
  • ConstructConnect shipped Takeoff Boost on Google Cloud’s Gemini Enterprise stack: computer vision that classifies, detects, counts, and measures materials directly from construction plans in seconds, not hours.
  • Sitetracker launched Scout, an agentic platform purpose-built for the critical-infrastructure verticals (telecom, utilities, EV charging); photo intelligence is one of three pillars alongside document processing and risk analysis.
  • Buildots unveiled "construction intelligence" as the new operational standard, citing up to 50% reductions in project delays across customers including Turner, JE Dunn, Digital Realty, and Intel.

Layer on the research front: a new arXiv preprint integrates YOLOv11n object detection with a 4B-parameter VLM (Gemma-3) and lifts safety-hazard detection F1 from a 34.5% baseline to 50.6% at 2.5 ms of per-image overhead. AEC-Bench, the first multimodal benchmark explicitly built for agentic systems in architecture, engineering, and construction, also landed and has quickly begun separating models that can do cross-sheet reasoning from those that can’t.

The shift isn’t subtle. The site is starting to see itself.

What "agentic" means here (and what it doesn’t)

A camera that detects a missing hard hat is not new; pre-trained YOLO models have done that since 2022. The difference in 2026 is the loop.

A reality-capture system used to be: capture, store, query. The human asked the question.

An agentic vision system is: capture, perceive, reason, decide, act, observe, repeat. The platform asks and answers the question, then takes action (opens an RFI, flags a clash, files a safety report, reorders the punch list) without a human in the middle of every step.

OpenSpace describes this as "agents with eyes". DroneDeploy’s terminology is more clinical: each agent owns a domain and exchanges visual evidence with the others. Sitetracker’s Scout calls it "asset-lifecycle automation." The vocabulary differs; the architecture is the same:

  1. Perception: vision-language models running on captured stills, 360s, drone passes, and ground-bot feeds.
  2. Grounding: the perception output is reconciled against BIM, schedule, takeoff, contract, and OSHA policy. Without grounding the agent hallucinates progress; with it, the agent has an evaluator.
  3. Action: write to the system of record (Procore, Autodesk Construction Cloud, the GC’s ERP) or trigger downstream automations (escalation, replan, hold).

Two things are worth naming. First: every serious vendor is keeping a human in the review loop, not the execution loop. The agent files the RFI; the project engineer approves. Second: the value isn’t in any single inference; it’s in closing the gap between when reality diverges from plan and when somebody acts on it. That gap used to be days. It’s now closer to hours, and the leading edge of the literature is targeting minutes.
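The perception, grounding, and action stages described above reduce to a simple control loop. The sketch below is illustrative only: the type names (`Observation`, `Finding`) and functions (`ground_against_plan`, `run_loop`) are invented for this post, not any vendor’s API, and a plain dict stands in for a real schedule of record.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Observation:
    element_id: str       # e.g. a BIM element GUID
    detected_phase: str   # phase the vision model believes is installed

@dataclass
class Finding:
    element_id: str
    expected: str
    observed: str

def ground_against_plan(obs: Observation, plan: dict) -> Optional[Finding]:
    """Reconcile a perception output against the schedule of record.
    Grounding is what keeps the agent from hallucinating progress."""
    expected = plan.get(obs.element_id)
    if expected is not None and expected != obs.detected_phase:
        return Finding(obs.element_id, expected, obs.detected_phase)
    return None  # no divergence, nothing to act on

def run_loop(observations: List[Observation], plan: dict,
             review_queue: List[Finding]) -> List[Finding]:
    """The agent drafts actions; a human approves them. This is the
    review-loop-not-execution-loop pattern the vendors describe."""
    for obs in observations:
        finding = ground_against_plan(obs, plan)
        if finding is not None:
            review_queue.append(finding)  # e.g. a draft RFI awaiting sign-off
    return review_queue

plan = {"col-A3": "poured", "col-B1": "poured"}
observations = [Observation("col-A3", "formed"),
                Observation("col-B1", "poured")]
queue = run_loop(observations, plan, [])
print(queue)  # one divergence: col-A3 observed 'formed', expected 'poured'
```

The design decision worth noticing is in `run_loop`: the agent never writes to the system of record directly; it appends to a review queue, which is exactly where today’s vendors keep the human.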

The four use cases that are actually shipping

1. Progress detection vs. schedule

This is the oldest application and it’s now the most mature. Cameras and drones capture installed work; an agent localizes elements (columns, slab pours, MEP rough-in, partitions, finishes) against the BIM, classifies installation phase, and writes a delta back into the schedule. Doxel customers report 11% faster project delivery and up to 16% cash-flow savings. DroneDeploy’s Progress AI runs the same play with 18M+ aerial captures behind it.

For an underwriter, the consequence is simple: percent-complete is no longer a number a borrower writes on a draw request. It’s a number a third party can verify against pixels.

2. Safety and PPE compliance

Safety is the most public application, and it’s the one where agentic vision has eaten the most ink. Tiliter and a half-dozen peers detect hard-hat compliance, hi-vis vests, scaffolding clearance, and working-at-height proximity on a snapshot cadence rather than continuous surveillance. In April’s arxiv preprint, the headline result is more interesting than any single F1 number: combining a tiny detector (YOLOv11n) with a small VLM (Gemma-3 4B) outperforms a large VLM running alone, at a fraction of the inference cost.

That matters because construction sites remain among the worst-connected workplaces enterprise software has to serve. Edge-deployable small VLMs are the difference between "we’ll know on Monday" and "the foreman gets a Slack ping before the welder drops his hood."
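The detector-plus-small-VLM pattern from the preprint can be sketched in a few lines. Everything here is a stub: `tiny_detector` and `small_vlm` stand in for the real YOLOv11n and Gemma-3 models, and the thresholds are made up for illustration.

```python
def tiny_detector(frame: dict) -> list:
    """Stand-in for a fast object detector (YOLO-class): returns
    candidate hazard regions above a low, recall-oriented threshold."""
    return [r for r in frame["regions"] if r["score"] >= 0.3]

def small_vlm(region: dict, prompt: str) -> dict:
    """Stand-in for a 4B-parameter VLM asked to confirm and describe
    the candidate hazard in natural language."""
    return {"hazard": region["label"], "confirmed": region["score"] >= 0.5}

def detect_hazards(frame: dict) -> list:
    # Stage 1: the cheap detector localizes candidates in milliseconds.
    candidates = tiny_detector(frame)
    # Stage 2: the VLM reasons only over flagged regions, never whole
    # frames, which is what keeps per-image overhead in the ms range.
    reports = [small_vlm(r, "What safety hazard is shown here?")
               for r in candidates]
    return [r for r in reports if r["confirmed"]]

frame = {"regions": [
    {"label": "missing_hard_hat", "score": 0.82},
    {"label": "open_edge", "score": 0.41},   # detector flags it, VLM rejects
    {"label": "shadow", "score": 0.12},      # filtered before the VLM runs
]}
print(detect_hazards(frame))
```

The division of labor is the point: the detector spends its budget on recall, the VLM spends its budget on precision, and neither model needs to be large.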

3. Defect, damage, and rework detection

Defect detection has always been bottlenecked by labels. The 2026 fix is multi-stage pipelines: super-resolution lifts a low-megapixel capture, an object detector localizes regions of interest, and a VLM is asked "what’s wrong here?" in natural language. A March preprint applied exactly this stack to post-disaster satellite imagery and recovered usable severity classifications across orders of magnitude of resolution.

The same architecture is showing up at the site level. Track3D advertises 20% lower rework costs by spotting deviations earlier in the pour-and-finish cycle. The economics are unsubtle: rework is the largest non-recoverable line item in most ground-up budgets, and it’s almost entirely caught at punch-list. Catching it at install would change the shape of the curve.
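The three stages above (super-resolution, then localization, then a natural-language assessment) chain cleanly. The sketch below uses placeholder stages with made-up thresholds; a real system would swap in trained models at each step.

```python
def super_resolve(image: dict) -> dict:
    """Stand-in for a super-resolution model: 2x upsampling
    quadruples the pixel count of a low-megapixel capture."""
    return {"pixels": image["pixels"] * 4}

def localize_defects(image: dict) -> list:
    """Stand-in detector: returns regions of interest. The invented
    threshold mimics a detector that needs enough resolution to fire."""
    if image["pixels"] > 1_000:
        return [{"bbox": (10, 10, 50, 50), "kind": "crack"}]
    return []

def assess(region: dict) -> str:
    """Stand-in for the VLM asked 'what's wrong here?'."""
    return f"Probable {region['kind']} at {region['bbox']}; flag for review."

def defect_pipeline(image: dict) -> list:
    enhanced = super_resolve(image)       # stage 1: lift resolution
    regions = localize_defects(enhanced)  # stage 2: find candidates
    return [assess(r) for r in regions]   # stage 3: describe in language

reports = defect_pipeline({"pixels": 300})  # a low-megapixel capture
for report in reports:
    print(report)
```

Note the ordering: without the super-resolution stage, the same capture would fall below the detector’s operating point and the defect would go unreported until punch-list.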

4. Takeoff and material counting

ConstructConnect’s Takeoff Boost is the bluntest version: feed the agent a set of plans, get back a quantity takeoff in seconds. It’s vision applied not to the site but to the document, and it’s how vision and language are converging: the same multimodal model that reads a wall section also classifies a stud count. AEC-Bench has begun separating frontier models that can do this cross-sheet reasoning from ones that can’t.

For underwriting this is downstream, but it changes the input data quality of every cost estimate that hits a development pro-forma, which is upstream of everything we do at Relm.

What’s still hard

It would be a disservice to read the announcements as solved. Three open problems:

  • False positives at scale. The April small-VLM result is 50.6% F1. That’s a real lift over baseline, but it’s not "ship to ten thousand sites without humans." Every vendor is wrestling with the alert-fatigue tax, and every customer reports the same first six weeks of tuning.
  • Occlusion and perspective. Tower cranes, scaffolding, and stored materials hide more of the work than they show. Aerial captures recover the slab; ground walks recover the elevation. The combo is what’s improving, but it requires re-flying the same paths to keep the registration clean.
  • Connectivity and edge compute. Most jobsites still run on bonded LTE. The rationale for snapshot-based analysis (capture, push to cloud, return seconds later) versus continuous edge inference is largely an artifact of bandwidth. Until 5G or a LEO link is universal, the architecture will keep favoring batch inference.

What this means for institutional real estate underwriting

Construction risk has historically been the largest variance in any ground-up or value-add deal, and one of the most opaque. Lender draws depend on monthly G-702/703 packages and quarterly site visits. By the time a project shows up on a watch list, the schedule slip is already six weeks deep.

Agentic vision changes the underlying assumption of that workflow. Three implications worth thinking about now, before they become table stakes:

  1. Construction draws will move from attestation to verification. Lenders are already piloting DroneDeploy and Buildots feeds as evidence packages on draw requests. If you’re underwriting senior debt on a development deal in the next 12 months, ask the GP: "What monitoring stack are you required to maintain?" Then make it a covenant.
  2. OCIP / wrap underwriting becomes risk-based. Insurers writing owner- or contractor-controlled programs can now price safety at the granularity of individual sites with measured incident rates, not portfolios with class codes. Brokers who can deliver visual-evidence dossiers will quote differently.
  3. Delay-risk pricing on stabilization-period deals will tighten. The variance in lease-up assumptions on a 4% LIHTC or a build-to-rent deal is dominated by completion timing. As schedule prediction gets better, the spread on construction-period coupon premia gets compressed.

If you’re doing acquisition underwriting on operating real estate (buying assets the year after they finish, refinancing the year after that), agentic-vision construction tech doesn’t change your job day-to-day. But it does change the input data quality for the year-zero deliverables you rely on. Lien releases, certificate-of-occupancy timing, punch-list closeout, post-construction insurance binders: all of these are about to get faster and cleaner.

That’s the part that intersects with what we do at Relm Pro. Our platform runs Deep Research and 10-year pro-forma on operating multifamily, office, and retail, not on projects under construction. But the operating-day-one inputs we ingest (rent rolls, occupancy ramps, capex schedules, lien histories) are downstream of these new tools. Cleaner construction data = cleaner stabilized underwriting.

Where this goes next

Three directions are worth watching through the rest of 2026:

  • Agent-to-agent protocols. DroneDeploy’s four-agent stack already exchanges artifacts internally. The next step is across vendors: a Buildots safety alert that triggers a Procore RFI without a human typing. MCP-style tool protocols are a natural fit; expect at least one major announcement before year-end.
  • On-device VLMs. The April small-VLM result is the leading indicator. If 4B-parameter models on edge GPUs can match cloud-only frontier models for safety detection, the architecture flips back from snapshot-cloud to continuous-edge, and the latency floor drops by an order of magnitude.
  • Underwriting integrations. It’s a question of when, not if, lenders begin pulling progress evidence directly into draw-request systems. The data is now there. The contracts haven’t caught up.
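What might the cross-vendor handoff in the first bullet look like on the wire? The payload below is purely hypothetical: no vendor has published such a schema, and the event names and fields are invented to show the shape of the pattern (visual evidence plus a proposed action that still routes through human review).

```python
import json

# Hypothetical agent-to-agent event: a safety finding serialized so a
# second system could draft an RFI. The schema is invented for this post.
safety_alert = {
    "event": "safety.hazard.detected",
    "site_id": "site-042",
    "evidence": {"capture_id": "cap-9981", "hazard": "open_edge"},
    "proposed_action": {"type": "create_rfi", "requires_human_approval": True},
}

def route(event: dict) -> str:
    """Stand-in router: decide which queue a proposed action lands in."""
    if event["proposed_action"]["requires_human_approval"]:
        return "review_queue"   # a human approves before the RFI is filed
    return "auto_execute"

wire = json.dumps(safety_alert)   # what crosses the vendor boundary
print(route(json.loads(wire)))    # -> review_queue
```

Whatever protocol wins, the `requires_human_approval` flag (or its equivalent) is the part to watch: it encodes, in the message itself, where the review loop sits.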

We’ll keep covering this thread on the Relm blog as the next round of vendor announcements lands. If you’re building a deal where construction-period assumptions matter (or you’d just like to see how agentic AI shows up earlier in the underwriting flow than the construction phase), book a 30-minute walkthrough of the platform.

Frequently asked questions

What is agentic AI in construction?

Agentic AI in construction refers to vision systems that don’t just answer questions; they perceive site conditions, reason against plans and policies, and take actions (file an RFI, flag a hazard, update a schedule) with minimal human-in-the-loop. The 2026 wave from DroneDeploy, Buildots, ConstructConnect, and Sitetracker treats vision as one component in a perception→grounding→action loop, rather than an isolated dashboard feature.

How accurate is AI image detection on construction sites today?

It depends on the task. Object detection for PPE compliance is reliable enough to ship at scale; defect detection still benefits from human review. The most-cited 2026 academic result combines a small object detector with a small VLM and reaches 50.6% F1 on a multi-hazard benchmark, a substantial lift over a baseline of 34.5% but well short of "ship without supervision."

Can these systems run at the edge?

Increasingly yes. The shift from large cloud-only VLMs to detection-guided small VLMs in the 4B-parameter range is the technical inflection: 2.5 ms of per-image overhead is fast enough to run on edge GPUs, and the F1 lift is meaningful. Until LTE/5G coverage on jobsites is reliably better than bonded modems, snapshot-and-cloud will remain the default.

How does this affect real estate underwriting?

Construction-period risk has been the largest variance and the most opaque part of ground-up underwriting. Visual ground truth lets lenders verify draw requests against pixels, lets insurers price OCIP / wrap programs at the site rather than the portfolio level, and tightens the spread on construction-period coupon premia. The downstream effect for acquisition and asset-management underwriting is cleaner year-zero inputs.

Where do I learn more about Relm’s underwriting platform?

Relm Pro runs Deep Research and 10-year pro-forma on operating multifamily, office, and retail. See pricing or book a 30-minute walkthrough.


About the author

Illya Nayshevsky, Ph.D., is the founder of Relm and writes on agentic systems, computer vision, and the underwriting workflow.
