📡 IoT & Embedded QA

9 questions · full model answers. Firmware OTA, device-cloud state convergence, offline reconciliation, and fleet-scale telemetry — testing where the physical world is the ground truth and a bad release can be physically unreachable.

// What they weigh

What an IoT & Embedded QA interviewer is actually probing for, beyond generic QA.

  • 01

    Respect for irreversibility

    Candidates who treat firmware like a web deploy fail immediately. Interviewers look for OTA discipline: staged rollout, A/B rollback, signature verify, and post-flash health check before the fleet gets the build. The question they're really asking is: 'do you know you can't just roll back a bricked device from a dashboard?'

  • 02

    Closed-loop physical reasoning

    A command ack is not the same as a physical state change. Strong candidates close the loop — they verify the actuator reached the new state via a sensor reading, not by trusting the cloud acknowledgement. Interviewers screen for candidates who can articulate why 'the app says it's locked' and 'the lock is engaged' are two different assertions.

  • 03

    Fleet-scale and offline-first thinking

    Offline is the default, not the exception. Strong candidates treat reconnect backlog replay and conflict resolution as first-class test cases, reason about clock skew across a heterogeneous device fleet, and know that QoS-redelivery idempotency is the difference between correct telemetry and inflated aggregates. Interviewers distinguish candidates who think about one device from those who think about ten thousand.

// Junior · 2

A user taps 'lock' in a smart-home app and the app shows 'locked' immediately. What is the actual state guarantee at that moment, and how would you test the gap?

Junior

The app shows the command was acknowledged by the cloud — not that the physical lock engaged. The gap is the ack-vs-execute distinction: the actuator could have jammed, timed out, or been offline. Testing requires closed-loop confirmation from a sensor, not just observing the app state.

// What interviewers look for

That the candidate understands there are two distinct events — cloud acknowledgement and physical actuation — and that the app displaying 'locked' conflates them. Strong answers name the closed-loop confirmation mechanism (a sensor or position switch) and articulate what a failure looks like: a false physical state that the user trusts.

Common pitfall

Treating the app UI as ground truth and writing a test that checks the app displays 'locked'. This tests the command was sent, not that the lock engaged. Missing the physical-state divergence entirely — this is the canonical IoT QA failure mode and interviewers filter for candidates who recognise it.

Model answer

At the moment the app shows 'locked', the only guarantee is that the cloud received and acknowledged the lock command. The physical actuator might still be mid-stroke, might have jammed, or — if the device was briefly offline — the command might be queued and not yet delivered. The app displaying 'locked' and the lock being physically engaged are two different facts, and I test them separately. The correct oracle is not the app state but a sensor that confirms the physical transition: a door-position sensor, a bolt-position switch, or a current-draw signal from the motor that indicates it reached its end-stop. My test would: issue the lock command via the API; wait for the device to report physical confirmation (sensor state change), not just a cloud ack; assert the app reflects the sensor-confirmed state, not the ack. Then I cover the failure paths: send the lock command to a device whose actuator is obstructed — assert the app shows a failure or 'unable to confirm', never 'locked'. Send the command to an offline device — assert the app shows a pending or timeout state, not a falsely confident 'locked'. Test a flaky actuator that acks but intermittently fails to complete — assert the system detects the non-confirmation within the defined timeout and surfaces it. Throughout, the test oracle is the sensor, not the cloud ack.
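The closed-loop check described above can be sketched against a toy device double. This is a minimal illustration, not a real smart-lock API: `FakeLock`, `lock_and_confirm`, the status strings, and the timeout values are all hypothetical names and numbers chosen for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class FakeLock:
    """Hypothetical device double: it acks commands, but the bolt only moves if not jammed."""
    jammed: bool = False
    online: bool = True
    bolt_engaged: bool = False

    def send_lock(self) -> bool:
        """Return True when the command is acknowledged, not when the bolt has moved."""
        if not self.online:
            return False
        if not self.jammed:
            self.bolt_engaged = True
        return True

    def read_bolt_sensor(self) -> bool:
        """The test oracle: the physical bolt position."""
        return self.bolt_engaged


def lock_and_confirm(lock: FakeLock, timeout_s: float = 0.2, poll_s: float = 0.01) -> str:
    """Report 'locked' only after sensor confirmation; otherwise 'pending' or 'unconfirmed'."""
    if not lock.send_lock():
        return "pending"  # command never reached the device
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if lock.read_bolt_sensor():
            return "locked"
        time.sleep(poll_s)
    return "unconfirmed"  # acked, but the bolt never reached its end-stop
```

A jammed lock acks the command and still ends up as `unconfirmed`, which is the ack-versus-execute gap expressed as an assertion.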

device-cloud state · ack vs execute · physical state · actuator · sensor · smart home

Walk through how you'd test the device pairing and provisioning flow — including mid-pair power loss, re-pairing an already-claimed device, and confirming the device reaches a ready state.

Junior

Test the full provisioning lifecycle: happy path to ready state, then the failure modes — power loss mid-pair, re-pair of an already-claimed device, factory reset de-provisioning, and confirmation that credentials are invalidated, not just hidden.

// What interviewers look for

Coverage of both the success path and the state-corruption failure modes that are unique to physical-device provisioning. Strong candidates distinguish between 'pair again' (device already registered in the cloud) and 'factory reset then re-pair' and test that the prior ownership is cleanly severed. This is a physical ownership binding, not a web-auth flow.

Common pitfall

Testing only the happy path (scan QR code, device pairs, app shows 'ready') and stopping there. Missing that provisioning has ownership-binding semantics — a device that still holds a previous owner's credentials after an incomplete factory reset is a security failure, not just a UX gap. Missing the mid-pair power-loss case, which leaves the device in a partially-provisioned limbo state.

Model answer

I test provisioning as an ownership-binding operation, not just a connection setup. The happy path covers: factory-fresh device starts broadcasting; user follows the pairing flow (QR scan, BLE handshake, or Wi-Fi credential exchange); device contacts the provisioning endpoint, receives its per-device certificate, and transitions to 'ready' in the app. I verify the device's cloud identity is issued and unique — it should not be sharing a certificate with any other device. Then the failure paths: power loss at each provisioning stage — during BLE handshake, during credential exchange, during the first cloud handshake. After each interruption I assert the device is in a defined fallback state (either cleanly unprovisioned and ready to retry, or in an error state with a clear recovery path) and that the provisioning endpoint does not create a zombie entry for the partial claim. Re-pair of an already-claimed device is critical: a second user attempting to pair a device that belongs to another account should be blocked; the device should not transfer ownership without the prior owner explicitly de-registering it. Factory reset de-provisioning: perform a factory reset, confirm the device's prior-owner cloud access is fully revoked — not just the device's local credentials wiped, but the cloud entry marked inactive so the old owner's app cannot send commands. Finally I verify the 'ready' state is a confirmed, synchronised state: the device has reported in, the cloud knows its current firmware version, and a test command round-trip succeeds before I call the pairing journey complete.
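The ownership-binding assertions above can be sketched against an in-memory model of the provisioning endpoint. Everything here is hypothetical (`ProvisioningRegistry`, `ClaimError`, and the method names are invented for illustration); a real test would drive the actual provisioning API and a real or emulated device.

```python
class ClaimError(Exception):
    """Raised when a claim would violate the ownership binding."""


class ProvisioningRegistry:
    """Toy stand-in for the cloud provisioning endpoint (an assumption, not a real API)."""

    def __init__(self):
        self._owners = {}   # device_id -> owning account
        self._pending = {}  # device_id -> account with a claim in flight

    def begin_claim(self, device_id, account):
        owner = self._owners.get(device_id)
        if owner is not None and owner != account:
            raise ClaimError("device already claimed by another account")
        self._pending[device_id] = account

    def complete_claim(self, device_id, account):
        if self._pending.get(device_id) != account:
            raise ClaimError("no matching claim in flight")
        del self._pending[device_id]
        self._owners[device_id] = account

    def abort_claim(self, device_id):
        """Models mid-pair power loss: the partial claim must leave no trace."""
        self._pending.pop(device_id, None)

    def factory_reset(self, device_id):
        """De-provision: sever the cloud-side binding, not only local credentials."""
        self._pending.pop(device_id, None)
        self._owners.pop(device_id, None)

    def can_command(self, device_id, account):
        return self._owners.get(device_id) == account

    def has_entry(self, device_id):
        return device_id in self._owners or device_id in self._pending
```

The three assertions that matter are: an aborted claim leaves no zombie entry, a second account cannot claim an owned device, and the old owner loses command access after a factory reset.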

provisioning · pairing · BLE · Wi-Fi · ownership · factory reset · consumer IoT

// Mid-level · 4

How would you test a firmware OTA update — covering staged rollout, image signature verification, installation, and post-flash health check?

Mid-level

Test each OTA phase independently: the distribution endpoint gates the right cohort, the device verifies the signature before flashing, the installation completes with a post-flash health check, and a failed health check triggers automatic rollback — never a stuck or silent failure.

// What interviewers look for

Stage-by-stage coverage rather than 'download and run it'. Strong candidates test what happens at each gate (signature rejection, cohort gating, health-check failure) and know the correct failure response: rollback to the previous image, not a half-installed state. They distinguish firmware OTA from a software deployment: here a bad build is physically irreversible without a rollback mechanism.

Common pitfall

Testing only the success path (firmware updates, device works). Missing the rejection paths — an unsigned image, a downgrade attempt, a failed health check — which are the safety gates. Treating OTA as equivalent to a web service rolling deployment and missing that a failed flash without A/B rollback leaves the device unbootable.

Model answer

I test OTA as a staged pipeline with explicit gates at each step, because a mistake at any stage can leave a device permanently offline. Step one is cohort gating at the distribution endpoint: a device whose hardware revision is not eligible for this build requests the OTA, and I assert it is refused — not served the wrong image. A device in the target cohort at the rollout percentage boundary is consistently included or excluded on repeated eligibility checks, not assigned randomly. Step two is download integrity: I tamper with the image binary mid-transfer and assert the device rejects it at the checksum and signature verification step, before any flash operation begins. I send a correctly-signed image whose version number is older than the currently installed build and assert the anti-rollback check rejects it — this is a bootloader-level check, not an application-layer permission, so I verify it fires at the right point. Step three is the flash itself: I verify the device uses an A/B partition scheme (or equivalent rollback mechanism) so the active image is not overwritten until the new image is confirmed healthy. Step four is the post-flash health check: I define what 'healthy' means — device boots, reports firmware version, successfully sends a heartbeat to the cloud within a defined window — and I test what happens when the health check fails: the device must reboot from the previous A/B image automatically and log the rollback reason, not sit in a boot loop or stay on the broken image. I finish by verifying the cloud reflects the correct firmware version after rollback, not the failed target version.
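The cohort-gating step is easiest to assert when eligibility is a pure function of the device and the rollout. The sketch below shows one common way to get that property, hashing the device ID into a stable bucket; it is an assumption about how a distribution endpoint might work, and `rollout_bucket`, `is_eligible`, and the parameter names are hypothetical.

```python
import hashlib


def rollout_bucket(device_id: str, rollout_id: str) -> int:
    """Map a device to a stable bucket in 0..99 for a given rollout."""
    digest = hashlib.sha256(f"{rollout_id}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_eligible(device_id: str, hw_rev: str, rollout_id: str,
                percent: int, allowed_hw_revs: set) -> bool:
    """Refuse ineligible hardware first, then apply the percentage gate."""
    if hw_rev not in allowed_hw_revs:
        return False
    return rollout_bucket(device_id, rollout_id) < percent
```

Because the bucket is derived from a hash rather than a random draw, repeated eligibility checks give the same answer, and a device included at 30% stays included when the rollout widens to 40%.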

firmware · OTA · staged rollout · signature · health check · rollback · embedded

A device was offline for two hours during which it buffered local commands, and the user changed the same setting in the app. When the device reconnects, what do you test?

Mid-level

Test the reconnect backlog replay, the conflict resolution policy (device-last-write vs cloud-last-write vs timestamp-based), deterministic winner selection, and that the losing change is surfaced rather than silently dropped — then test the edge where the buffered backlog has aged past the cloud's retention window.

// What interviewers look for

Offline-first thinking where the device is buffering writes and commands — not a web read-cache serving stale data, but a device that has been independently issuing or queueing state mutations. Strong candidates articulate the conflict model: two independent sources of truth (device and cloud), a defined policy for resolution, and an observable outcome that the user can verify.

Common pitfall

Describing this as 'test that the app refreshes when the device reconnects' — treating offline as a display-cache problem rather than a state-conflict problem. Missing that the device has been accumulating writes, not just failing to read. Or assuming last-write-wins is obvious and not testing the boundary where the policy's implementation picks the wrong winner.

Model answer

The setup is two co-equal sources of truth: the device has been locally buffering a command ('set thermostat to 18°') and the cloud has a later command from the user ('set thermostat to 22°'). These are not a client and a server: both have been independently writing state, and the reconnect is a merge, not a sync. I test three things. First, conflict detection and policy application: on reconnect, the system must identify the conflict (the same setting changed by both sides) and apply the documented conflict policy deterministically. Whatever that policy is (device timestamp wins, cloud timestamp wins, or 'most recent by server-corrected time'), I test it by seeding pairs where the policy should unambiguously pick one, and asserting it picks the right one every run, not randomly. Second, transparency of the losing change: the loser should not be silently discarded. I assert the user is either notified that a conflict was resolved or can inspect the history. 'Silent overwrite of user intent' is a specific failure mode I look for: the user set 22° in the app while the device was offline, the device reconnects and silently reverts the setting to 18°, and nothing tells the user. Third, backlog expiry: I test a device that reconnects after its buffered events have aged past the cloud's retention window. The expected behaviour is that expired events are discarded, not applied stale. I assert this and verify the device reaches a coherent state without the expired commands being applied.
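As a rough illustration of the three checks, here is a toy resolver using one possible policy: most recent by server-corrected time, with ties going to the cloud. The policy, the tie rule, and the names `Change` and `resolve` are assumptions made for the example; the real system's documented policy is what the test should encode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    source: str          # "device" or "cloud"
    value: float         # e.g. thermostat setpoint
    corrected_ts: float  # server-corrected epoch seconds


def resolve(device: Change, cloud: Change, now: float, retention_s: float):
    """Return (winner, loser, reason). The loser is returned so it can be surfaced."""
    # Backlog expiry: a buffered change older than the retention window is never applied.
    if now - device.corrected_ts > retention_s:
        return cloud, device, "device change expired"
    # Assumed policy: the later corrected timestamp wins; ties go to the cloud.
    if device.corrected_ts > cloud.corrected_ts:
        return device, cloud, "device change is newer"
    return cloud, device, "cloud change is newer or tied"
```

Returning the loser alongside the winner gives the test something concrete to assert on when checking that the discarded change is shown to the user rather than dropped silently.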

offline · reconnect · reconciliation · conflict resolution · backlog · device-cloud sync

A firmware OTA is 60% through writing to flash when power is lost. How do you test the device does not end up bricked, and what must the recovery path look like?

Mid-level

Inject power loss at multiple flash percentages and assert the device boots from the previous A/B image every time, never from the partially-written partition. Verify the watchdog or bootloader rollback fires, the rollback is logged, and the update is retried on the next scheduled OTA window.

// What interviewers look for

Concrete fault-injection at specific flash-percentage checkpoints, not just 'test what happens on power loss'. Strong candidates know the correct recovery mechanism (A/B partition, watchdog-triggered rollback) and can describe the test oracle: the device boots the previous image, reports the correct firmware version to the cloud, and retries the update — not a boot loop, not silent failure.

Common pitfall

Describing the test as 'unplug the device during an update and see what happens', without defining the checkpoints, the expected outcome, or what 'bricked' means as a measurable assertion. Missing that the correct recovery requires a specific mechanism — if the device has only one partition and no watchdog rollback, there is no safe recovery path, and the test should fail.

Model answer

I test interrupted OTA as a fault-injection matrix across flash percentages, because the recovery guarantee must hold at every point in the write sequence, not just at an arbitrary point. I pick checkpoints at roughly 10%, 25%, 50%, 75%, and 90% of the flash write and use a scripted power cut (or a hardware relay for physical devices, or a simulated power event in a device emulator) to interrupt at each. At each checkpoint I assert three things: the device boots, the device boots from the previous image (A/B partition's known-good slot), and the firmware version reported to the cloud matches the pre-update version — not a partial version, not zero. The failed-version image must never be the active boot target. I then assert the recovery path: the watchdog or bootloader rollback fires within the defined timeout (not after a boot loop), the rollback reason is logged with a timestamp, and the cloud marks the OTA as failed for this device rather than showing it as pending or successful. Finally, I verify that the device retries the update on the next scheduled OTA window rather than staying permanently on the old version — a failed-then-recovered device should not be silently excluded from future rollouts. The failure mode I'm explicitly screening for is: device boots to a half-written image and crashes repeatedly (boot loop), or boots to nothing at all (bricked). Both are test failures; the A/B mechanism must prevent both.
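The shape of that fault-injection matrix can be sketched against a toy A/B model. This only illustrates how the checkpoints and assertions are organised; `ABDevice` and its methods are hypothetical, and on real hardware the power cut would come from a relay or emulator rather than a function argument.

```python
class ABDevice:
    """Toy A/B-partition model (an assumption for illustration, not real firmware)."""

    def __init__(self, version: str):
        self.slots = {
            "A": {"version": version, "valid": True},
            "B": {"version": None, "valid": False},
        }
        self.active = "A"
        self.log = []

    def flash_inactive(self, new_version: str, cut_at_percent=None) -> bool:
        """Write the inactive slot; optionally lose power at a given percentage."""
        target = "B" if self.active == "A" else "A"
        self.slots[target] = {"version": new_version, "valid": False}
        if cut_at_percent is not None and cut_at_percent <= 100:
            self.log.append(f"power lost at {cut_at_percent}%, slot {target} left invalid")
            return False
        # The slot is marked valid and made active only after a complete write.
        self.slots[target]["valid"] = True
        self.active = target
        return True

    def boot(self) -> str:
        """Boot the active slot; an invalid active slot would mean a bricked device."""
        slot = self.slots[self.active]
        if not slot["valid"]:
            raise RuntimeError("no bootable image")
        return slot["version"]
```

A matrix test then loops over the checkpoints, for example `for pct in (10, 25, 50, 75, 90)`, and asserts that `boot()` returns the pre-update version and that the interruption was logged.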

firmware · OTA · interrupted flash · bricking · A/B partition · rollback · watchdog · embedded

How do you test that sensor data is valid when the data can be dirty by physics — drift, calibration loss, out-of-range spikes, and signal dropouts are expected field conditions?

Mid-level

Test at the ingestion layer using seeded inputs that cover the full validity envelope: nominal, boundary, out-of-range, NaN or dropout, and a slow-drift series. Assert range and rate-of-change guards fire at the correct thresholds — the oracle is the guard's response, not the presence of a single bad reading.

// What interviewers look for

Ingestion-layer testing with a dirty-data mindset — sensor data is not clean API input, it is a physical signal that degrades over time in the field. Strong candidates distinguish range validation (single reading out of bounds) from drift detection (individually-valid readings trending toward a calibration error) and test both. This is distinct from AI/ML dataset quality: the problem here is live signal degradation caught at ingestion, not training data curation.

Common pitfall

Testing only clean, nominal readings and a single obviously-out-of-range spike. Missing drift — the most insidious sensor failure — where each individual reading passes validation but the series trends toward a systematic error. Or conflating this with data-pipeline quality assurance, which is about dataset integrity for model training, not about physical-signal validation at the moment of ingestion.

Model answer

Sensor data validation is a live physical-signal problem, not a data-quality-of-a-dataset problem. The sensor is degrading in the field right now, and my ingestion layer is the first and often only gate. I test by seeding a controlled input stream at the device simulator or ingestion stub with five categories of reading: nominal (in range, stable), boundary (exactly at the upper or lower valid limit), just-outside-range (one unit past the limit — should fail the range guard), NaN or dropout (missing reading — should be flagged as a gap, not silently interpolated or carried forward), and a slow-drift series (individually valid readings that increase by a fixed rate each tick, designed to trigger the rate-of-change guard before any single reading exceeds the range). For each category I assert the correct guard fires. Range guard: an out-of-range reading is flagged and not propagated to aggregates or alerts as valid data. Dropout guard: a gap in readings is surfaced rather than silently carrying the last value forward indefinitely. Rate-of-change guard: the drift series triggers the guard before any individual reading exceeds the threshold — the point is to detect gradual calibration loss early, not only after the sensor is already giving unusable data. I also test calibration offset application: if the sensor has a known-good calibration value, readings must be corrected at ingestion rather than raw values stored and passed to analytics. A de-calibrated sensor returning consistently +4° high is a systematic error, not random noise, and the test oracle is whether the system can detect it via the drift rate before it propagates into alerts and billing.
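The range, dropout, and rate-of-change guards can be sketched as one pass over a seeded stream. This is a simplified illustration: `check_stream`, the flag names, and the windowed-delta definition of drift are assumptions, and a production guard would likely work on timestamps rather than sample counts.

```python
import math


def check_stream(readings, lo, hi, max_delta, window):
    """Return (index, flag) pairs for every reading a guard rejects or warns on.

    Drift is flagged when an accepted reading differs from the accepted reading
    `window` samples earlier by more than `max_delta`.
    """
    flags = []
    accepted = []  # values that passed the dropout and range guards
    for i, r in enumerate(readings):
        if r is None or math.isnan(r):
            flags.append((i, "dropout"))       # a gap, never carried forward
            continue
        if r < lo or r > hi:
            flags.append((i, "out_of_range"))  # limits themselves are valid
            continue
        accepted.append(r)
        if len(accepted) > window and abs(r - accepted[-1 - window]) > max_delta:
            flags.append((i, "drift"))
    return flags
```

Seeding a series such as `[20.0 + 0.5 * i for i in range(10)]` with `hi=30.0`, `max_delta=2.0`, and `window=5` makes the drift guard fire while every individual reading is still inside the valid range, which is the property the answer asks for.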

sensor data · validation · drift · calibration · out-of-range · dropout · telemetry · industrial IoT

// Senior · 3

Design a security test strategy for firmware OTA — covering image signing, anti-rollback protection, per-device identity, and credential revocation.

Senior

Test at the bootloader layer, not the application layer: unsigned images rejected before any flash, downgrade attempts blocked by anti-rollback, each device's unique certificate verifiable and individually revocable, and a compromised device's access cut without affecting the fleet.

// What interviewers look for

Bootloader-level trust verification distinct from application security. Strong candidates know OTA security is not about API authentication or PII protection — it is about binary-trust on a constrained device. The attack surface is the image itself and the device's identity, and the defense must be in the bootloader before any code runs, not in a cloud permission check.

Common pitfall

Describing OTA security as 'test that only authorised users can trigger the update' — this is the cloud-API permission check, not the firmware security guarantee. A cloud permission check can be bypassed by a compromised cloud, a network MitM, or a physical attacker who extracts the device's shared secret and constructs a valid-looking payload. The bootloader signature check is what holds when all of that fails.

Model answer

OTA firmware security sits at the bootloader layer, not the application layer, and I test it as a binary-trust problem rather than an API-permission problem. The distinction matters: a cloud permission check can be bypassed by a network MitM or a compromised update server; a cryptographic signature check in the bootloader holds even then. I test four surfaces. Image signing: I construct an unsigned image, a tampered image (valid structure, bit-flipped content), and a correctly-signed image — I assert the first two are rejected by the bootloader before any flash write begins, and only the third proceeds. The rejection must happen at the bootloader, not at a cloud pre-check; I verify this by injecting the tampered image at the device's download path, bypassing the cloud check. Anti-rollback: I take a correctly-signed image whose version number is lower than the currently installed build and assert the bootloader's monotonic version counter rejects it — this prevents an attacker from flashing a known-vulnerable older version. I also test that the rollback counter increments correctly on a successful update and cannot be decremented. Per-device identity: I verify each device's provisioning certificate is unique — not a shared fleet-wide key. I extract the certificate fingerprint from two different devices and assert they differ. If a device uses symmetric keys instead of asymmetric certificates, I note this as a weaker security model. Revocation: I revoke a specific device's certificate via the provisioning API and assert the device can no longer authenticate to publish telemetry or receive OTA updates — other fleet devices are unaffected. I verify the revocation propagates within the defined window and does not require a device restart or a cloud cache flush to take effect. The cross-cutting concern is that none of these checks are bypassable through debug ports or local APIs, so I also verify UART and JTAG are disabled in production firmware builds.
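The signing and anti-rollback assertions can be sketched against a toy bootloader. One loud caveat: to keep the example standard-library only, it uses an HMAC as a stand-in for the signature check. A real bootloader would normally verify an asymmetric signature with only the public key on the device, so treat the key handling here as an assumption made purely for illustration. `Bootloader`, `sign`, and the result strings are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, version: int, image: bytes) -> bytes:
    """Stand-in signer: HMAC over version + image (real systems use asymmetric signatures)."""
    return hmac.new(key, version.to_bytes(4, "big") + image, hashlib.sha256).digest()


class Bootloader:
    """Toy bootloader gate: verify first, check the monotonic version, then flash."""

    def __init__(self, key: bytes, installed_version: int):
        self._key = key
        self.installed_version = installed_version
        self.flash_writes = 0

    def accept(self, image: bytes, version: int, signature: bytes) -> str:
        expected = sign(self._key, version, image)
        if not hmac.compare_digest(expected, signature):
            return "rejected: bad signature"  # no flash write has happened yet
        if version < self.installed_version:
            return "rejected: rollback"       # correctly signed but older
        self.flash_writes += 1
        self.installed_version = version      # counter only ever moves forward
        return "accepted"
```

The assertion on `flash_writes` is the important one: a rejected image must leave the counter at zero, which models "rejected before any flash write begins".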

OTA security · firmware signing · anti-rollback · per-device identity · certificate · revocation · embedded

How would you test fleet-scale telemetry ingestion for correctness — covering QoS-redelivery idempotency, out-of-order event handling, clock skew, and the reconnection-storm scenario?

Senior

Test each correctness property independently: redelivery idempotency (same event delivered twice must write one record), clock-skew ordering (server-arrival timestamp overrides raw device timestamp for sequencing), and reconnection-storm throughput (N devices reconnecting simultaneously must not lose or duplicate any event or leave devices permanently retrying).

// What interviewers look for

Three distinct correctness properties that require separate test approaches: exactly-once semantics under redelivery, temporal ordering under heterogeneous clock drift, and throughput with no-loss under a sudden burst. Strong candidates know clock skew is a fleet-specific problem — not a general timezone issue — and that the correct solution is to trust server-arrival time for ordering, not raw device timestamps.

Common pitfall

Testing only the success-path throughput ('N devices publishing; all readings arrive') without testing idempotency under redelivery or the ordering correction for skewed clocks. Clock skew in particular is a failure mode candidates miss: events with future timestamps from a device whose RTC drifted forward will sort incorrectly in the time series, corrupting aggregations, without any error being raised.

Model answer

Fleet telemetry ingestion has three correctness properties I test independently because they fail in different ways. First, QoS-redelivery idempotency: MQTT QoS 1 guarantees at-least-once delivery, so the ingestion service must handle duplicates. I inject a deliberate redelivery (simulating the broker redelivering the same PUBLISH packet with the same packet identifier and the DUP flag set) and assert exactly one record is written to the store. The idempotency mechanism (typically a message-ID or event-ID dedup key with a TTL) must hold under concurrent redeliveries, so I fire two simultaneous redeliveries of the same message and assert still exactly one record. Second, clock-skew ordering: I seed a device with a clock drifted 35 minutes fast. It emits events with timestamps in the future. I assert that ingestion corrects the ordering using server-arrival time rather than device-reported time, and that the time series in the store shows events in receipt order. I test the boundary: two devices emit the same physical event at the same real moment, and their raw device timestamps differ by 12 minutes of clock skew. I assert the stored sequence reflects server-arrival order and not device-timestamp order. These events should be queryable together in the correct sequence, not interleaved incorrectly because of the skew. Third, reconnection storm: I simulate a region of devices reconnecting simultaneously after a 30-minute outage. Each device replays its buffered backlog. I assert no reading is lost (every buffered event is stored), no reading is duplicated (the backlog idempotency holds under storm concurrency), ingestion latency stays within SLA for the tail of the storm, and no device is left in a permanent retry loop because the ingestion endpoint rate-limited it and the device never backed off.
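The idempotency and arrival-order assertions can be sketched against an in-memory store. This is a simplified model under stated assumptions: `IngestStore` is hypothetical, the dedup key is an application-level `(device_id, event_id)` pair, and the TTL on dedup keys is omitted for brevity.

```python
import itertools
import threading


class IngestStore:
    """Toy ingestion store: dedups on (device_id, event_id), orders by server arrival."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()
        self._records = []
        self._arrival = itertools.count()

    def ingest(self, device_id, event_id, device_ts, payload) -> bool:
        """Return True if a record was written, False if it was a duplicate."""
        key = (device_id, event_id)
        with self._lock:  # check-and-insert must be atomic under concurrent redelivery
            if key in self._seen:
                return False
            self._seen.add(key)
            self._records.append({
                "arrival": next(self._arrival),  # server-side sequence, not device time
                "device_id": device_id,
                "event_id": event_id,
                "device_ts": device_ts,          # kept for reference, not for ordering
                "payload": payload,
            })
            return True

    def ordered(self):
        """Records in server-arrival order, regardless of device timestamps."""
        return sorted(self._records, key=lambda r: r["arrival"])
```

A device whose clock runs 35 minutes fast can then be seeded with a future `device_ts`, and the test asserts that `ordered()` still returns its event in the position it actually arrived.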

telemetry · fleet scale · QoS · idempotency · clock skew · out-of-order · reconnection storm · industrial IoT

How do you test a firmware build under constrained-resource conditions — power and battery floor, memory and flash envelope, and intermittent connectivity — and how do you define what 'graceful degradation' looks like as a test oracle?

Senior

Test each resource constraint independently with defined limits: the firmware stays within its flash and RAM budget under load, a battery device survives a full OTA burst without dropping below the operating voltage floor, and intermittent connectivity triggers prioritised task suspension rather than a crash or hung process.

// What interviewers look for

Resource-constrained testing as a first-class discipline, not an afterthought. Strong candidates define concrete oracles for each constraint — a specific memory ceiling, a specific battery voltage floor, a specific degradation behaviour — rather than describing vague 'stress tests'. They know that on a constrained device, the correct response to resource pressure is defined, deterministic degradation, not undefined behaviour or crash.

Common pitfall

Treating constrained-resource testing as the same as server-load testing (just run k6 until it falls over). On an embedded device the failure modes are different: flash is finite and non-growable, battery has a hard minimum voltage below which the device must not operate, and a crash under memory pressure can leave the device in a permanently degraded state without physical access for recovery. The oracle for 'success' is not throughput but defined degradation: the device suspends non-critical work, not crashes.

Model answer

I define concrete resource ceilings for each constraint before testing, because 'graceful degradation' is only testable if I specify what it means. Flash and RAM budget: I measure the firmware's static and dynamic footprint at build time and assert it stays under the defined ceiling — typically the target hardware's available flash minus a safety margin, and the peak heap usage under the worst-case feature combination. I test this at the hardware level using a device profiler or linker map analysis, not by inference from the application layer. For flash, I deliberately fill storage close to capacity and assert the firmware handles it without corrupting existing data or crashing — it should refuse new writes with a defined error, not silently overwrite. Power and battery floor: I define the minimum safe operating voltage and test that the device suspends power-hungry tasks — OTA download, active telemetry upload, BLE advertising — when battery voltage approaches that floor. I use a bench power supply with programmable voltage control to hold the supply at threshold levels and assert the correct task-suspension behaviour fires. The oracle: below the floor the device stops non-essential work and enters a low-power state; it does not attempt an OTA flash (which would brick it if power is lost mid-flash). Intermittent connectivity: I use a network fault injector or a TC (traffic control) rule to drop the connection at intervals and assert the firmware's reconnection behaviour is bounded — it backs off exponentially, does not retry at full speed indefinitely, and does not hold thread resources in a perpetual blocked state. The graceful-degradation oracle: under prolonged connectivity loss the device continues to buffer readings locally, does not crash, does not exhaust its buffer (it drops oldest when full per the defined policy), and reconnects correctly when connectivity is restored without requiring a reboot.
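Two of the connectivity oracles above, bounded exponential backoff and a drop-oldest buffer, are simple enough to pin down in a few lines. This is a host-side sketch of the expected behaviour, not firmware code; `backoff_delays`, `ReadingBuffer`, and the specific base, cap, and capacity values are hypothetical, and real firmware would usually add jitter to the delays.

```python
from collections import deque


def backoff_delays(base_s, cap_s, attempts):
    """Exponential backoff schedule, doubling each attempt and capped at cap_s."""
    return [min(cap_s, base_s * 2 ** n) for n in range(attempts)]


class ReadingBuffer:
    """Bounded local buffer that drops the oldest reading when full."""

    def __init__(self, capacity: int):
        self._buf = deque(maxlen=capacity)
        self.dropped = 0  # exposed so a test can assert on the drop policy

    def push(self, reading) -> None:
        if len(self._buf) == self._buf.maxlen:
            self.dropped += 1  # deque with maxlen evicts from the left on append
        self._buf.append(reading)

    def drain(self) -> list:
        """Return buffered readings oldest-first and empty the buffer."""
        items = list(self._buf)
        self._buf.clear()
        return items
```

For example, `backoff_delays(1, 60, 8)` gives `[1, 2, 4, 8, 16, 32, 60, 60]`, which a test can compare against the reconnect intervals observed from the device under a scripted outage.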

constrained resources · power · battery · memory · flash · intermittent connectivity · graceful degradation · embedded

// Go deeper

These questions pair with the in-depth IoT & Embedded QA guide: the risk areas, signature bugs, and test strategies the questions are drawn from.