What Actually Breaks When You Try to Connect a Dialysis Clinic to the Cloud

By Sai Rupesh Kagga, Senior Software Developer, SanQuest Inc.
LinkedIn: Sai Rupesh Kagga

People keep asking when dialysis goes “fully cloud.” It doesn’t. Not soon. What you see across the networks I’ve worked with is partial migration: analytics, the patient portal, ingest from home devices, a few EHR hooks. The clinical core stays on-prem because treatment won’t tolerate a WAN flap, and because Fresenius and Baxter ship software glued to their hardware in ways that make lift-and-shift a non-starter. You live in hybrid. The interesting failures all live in the seam.

Machine data

Hemodialysis machines were never designed as data sources. The telemetry exists because regulators leaned on the manufacturers, not because anyone in product wanted to expose it. Each vendor has its own protocol, and asking the integration team a firmware compatibility question is something you block out a week for.

Some give you a read-only Ethernet port. Others insist on a Windows agent on a box right next to the machine, dumping session logs to a local share every few minutes. One clinic I supported had that agent running on a Dell OptiPlex shoved under a nurse’s desk for thirteen months. It polled a network folder. Nobody touched it because nobody wanted to be the one who broke it. The day someone did finally try to upgrade the OS, treatment scheduling went sideways for a shift and a half.

IEEE 11073 is the nominal standard. Coverage is patchy, version drift is real, and the part nobody plans for is vendor extensions: they quietly break portability between firmware revisions of the same model. What you build in practice is an adapter zoo. One per model, sometimes one per firmware. Everything feeds an edge aggregator that normalizes before anything touches the cloud. We put Kafka at the edge to buffer through WAN drops, which happen far more often than carrier SLAs suggest, especially in standalone outpatient sites. The adapter problem doesn’t get solved. It gets staffed.
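A minimal sketch of what one corner of that adapter zoo can look like, assuming a registry keyed by model and firmware. The model names, firmware strings, and payload fields (`qb`, `bloodFlow`, `serial`) are made up for illustration and do not describe any real vendor protocol.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class NormalizedReading:
    """Common shape the edge aggregator forwards, whatever the source."""
    machine_id: str
    metric: str
    value: float
    unit: str


Adapter = Callable[[dict], NormalizedReading]
_ADAPTERS: Dict[Tuple[str, str], Adapter] = {}


def adapter(model: str, firmware: str):
    """Register a normalizer for one (model, firmware) pair."""
    def register(fn: Adapter) -> Adapter:
        _ADAPTERS[(model, firmware)] = fn
        return fn
    return register


@adapter("model-a", "1.0")
def _model_a_v1(raw: dict) -> NormalizedReading:
    # Hypothetical payload: blood flow in "qb", already in mL/min.
    return NormalizedReading(raw["id"], "blood_flow_rate", float(raw["qb"]), "mL/min")


@adapter("model-a", "2.0")
def _model_a_v2(raw: dict) -> NormalizedReading:
    # Hypothetical firmware change: renamed fields, value now in L/min.
    return NormalizedReading(
        raw["serial"], "blood_flow_rate", float(raw["bloodFlow"]) * 1000.0, "mL/min"
    )


def normalize(model: str, firmware: str, raw: dict) -> NormalizedReading:
    """Fail loudly on an unknown pair instead of guessing at a format."""
    try:
        fn = _ADAPTERS[(model, firmware)]
    except KeyError:
        raise LookupError(f"no adapter for {model} firmware {firmware}") from None
    return fn(raw)
```

Keying on firmware as well as model is the point: an unrecognized revision raises instead of being parsed with the previous revision's assumptions.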

FHIR is a format, not an answer

FHIR R4 forced every dialysis vendor to expose an API. Compliant on paper. Whether what comes out means the same thing is a separate fight.

Most DIMS platforms were built ten or fifteen years back around billing workflows. The data model still smells like it: procedure codes, shift schedules, supply inventory. Map a Kt/V reading to a FHIR Observation and you’re making a dozen quiet decisions. Which LOINC. Pre or post. Whether a missing value is an explicit null or just dropped. Two vendors will choose differently, both validate, and the data won’t reconcile three systems downstream.
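To make those quiet decisions visible, here is a rough sketch of one possible mapping, built as a plain dict. The function name and the LOINC placeholder are my own assumptions: picking the actual code is exactly the per-deployment decision described above, so it is deliberately left unfilled. The one choice the sketch does make is to represent a missing value explicitly with FHIR's `dataAbsentReason` instead of dropping the Observation.

```python
from typing import Optional

# Placeholder only: the real LOINC code is a deployment decision, not filled in here.
KTV_CODING = {"system": "http://loinc.org", "code": "TODO-LOINC", "display": "Kt/V (placeholder)"}

DATA_ABSENT_SYSTEM = "http://terminology.hl7.org/CodeSystem/data-absent-reason"


def ktv_observation(patient_id: str, effective: str, value: Optional[float]) -> dict:
    """Build a FHIR R4 Observation dict for one Kt/V reading.

    `effective` is an ISO 8601 timestamp string with an offset.
    """
    obs = {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [KTV_CODING]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": effective,
    }
    if value is None:
        # Explicit null: say the value is missing instead of silently dropping it.
        obs["dataAbsentReason"] = {
            "coding": [{"system": DATA_ABSENT_SYSTEM, "code": "unknown"}]
        }
    else:
        # Kt/V is dimensionless; UCUM uses "1" for unity.
        obs["valueQuantity"] = {
            "value": value,
            "unit": "1",
            "system": "http://unitsofmeasure.org",
            "code": "1",
        }
    return obs
```

A second vendor could just as validly omit the resource when the value is missing. Both outputs validate, which is the whole problem.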

Timestamps alone can eat a quarter. The DIMS stores local clinic time without a timezone, the EHR expects UTC, the network spans two zones, some records were migrated from a legacy system using yet a third convention, and DST creates rows that aren’t wrong, just genuinely ambiguous. The fix is unglamorous: a location lookup, a manual review flag for DST-boundary records, and clinical staff burning weeks at cutover validating by hand. Procurement, separately, will ask why this wasn’t in the SOW.
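The location lookup and the DST review flag can be sketched with the standard library's `zoneinfo`. The clinic ID and the `CLINIC_TZ` table are hypothetical, and the sketch assumes the host has an IANA timezone database available (some platforms need the `tzdata` package for that).

```python
from datetime import datetime, timezone
from typing import Optional, Tuple
from zoneinfo import ZoneInfo

# Hypothetical location lookup: clinic ID -> IANA timezone name.
CLINIC_TZ = {"clinic-001": "America/Chicago"}


def to_utc(naive_local: datetime, clinic_id: str) -> Tuple[datetime, Optional[str]]:
    """Convert a naive local timestamp to UTC and flag DST-boundary rows.

    Returns (utc_time, review_flag). The flag is None for ordinary rows,
    "ambiguous" when the wall-clock time occurs twice (fall back), and
    "nonexistent" when it falls in the skipped hour (spring forward).
    """
    tz = ZoneInfo(CLINIC_TZ[clinic_id])
    first = naive_local.replace(tzinfo=tz, fold=0)
    second = naive_local.replace(tzinfo=tz, fold=1)
    off_first, off_second = first.utcoffset(), second.utcoffset()

    if off_first == off_second:
        flag = None
    elif off_first > off_second:
        flag = "ambiguous"      # two valid UTC instants for this wall time
    else:
        flag = "nonexistent"    # wall time was skipped by the clock change

    # For flagged rows this is only a provisional value pending manual review.
    return first.astimezone(timezone.utc), flag
```

Flagged rows still get a provisional UTC value so the pipeline keeps moving, but nothing downstream should treat it as settled until someone has looked at it.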

Compliance rewrites your service boundaries

HIPAA gets checkbox treatment from teams that haven’t shipped healthcare. In dialysis it stops being a checkbox. PHI moves machine → edge → ingest → pipeline → EHR → portal. Every hop is audit surface. Zero-trust isn’t a posture choice here. It’s forced on you: mTLS everywhere, secrets in Vault, services scoped tight. That shapes how you decompose. You don’t get to ship one fat Lambda doing everything.

What actually gets you in an audit is access logging, not breach risk. Six-year retention, tamper-evident, queryable across thousands of daily sessions. S3 with object lock works fine and costs almost nothing. It’s the thing that saves you when a regulator asks for cohort-specific access history.
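Object lock handles the "can't be deleted or overwritten" half. As an illustration of what tamper-evident means at the record level, here is a small hash-chain sketch. This is a different, complementary technique and not the S3 setup described above; the field names are assumptions.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # stand-in "previous hash" for the first entry


def _digest(body: dict) -> str:
    # Canonical JSON so the same fields always hash the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: List[dict], actor: str, patient_id: str, action: str, at: str) -> dict:
    """Append an access record that commits to the previous record's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"actor": actor, "patient_id": patient_id, "action": action, "at": at, "prev": prev}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify(log: List[dict]) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

Truncating the tail of the log is the one change a bare chain can't detect, which is why write-once storage like object lock still matters underneath it.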

Home dialysis broke the reliability model

In-clinic data loss is annoying but recoverable. Nurse re-keys it. Home hemo doesn’t work that way. For a patient running nocturnal at 2 AM over residential cable, the machine log is the only record. Drop the session and nobody notices until weekly summary review. By then the intervention window is gone.

Store-and-forward at the gateway is the only thing that holds. Local write first, sync on connectivity, idempotent endpoint. A naive POST implementation we inherited was losing 12% of sessions in rural cohorts. Done properly, the loss rate sits below 0.3%.
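A minimal sketch of that pattern, assuming a SQLite outbox on the gateway and an idempotency key generated at write time. The class and function names are illustrative, not the implementation we run. `send` stands in for whatever transport the gateway uses, and the receiver is an in-memory stand-in for the ingest endpoint.

```python
import sqlite3
import uuid
from typing import Callable, Dict


class Outbox:
    """Local-first queue: every session is durable on disk before any network call."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        with self.db:
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS outbox ("
                "key TEXT PRIMARY KEY, payload TEXT NOT NULL, sent INTEGER NOT NULL DEFAULT 0)"
            )

    def record(self, payload: str) -> str:
        """Write locally first; the key is fixed here so retries reuse it."""
        key = str(uuid.uuid4())
        with self.db:
            self.db.execute("INSERT INTO outbox (key, payload) VALUES (?, ?)", (key, payload))
        return key

    def sync(self, send: Callable[[str, str], bool]) -> int:
        """Push unsent rows in order; stop at the first failure and retry later."""
        rows = self.db.execute(
            "SELECT key, payload FROM outbox WHERE sent = 0 ORDER BY rowid"
        ).fetchall()
        delivered = 0
        for key, payload in rows:
            try:
                ok = send(key, payload)
            except OSError:  # includes ConnectionError and timeouts
                break
            if not ok:
                break
            with self.db:
                self.db.execute("UPDATE outbox SET sent = 1 WHERE key = ?", (key,))
            delivered += 1
        return delivered


def idempotent_receiver(store: Dict[str, str]) -> Callable[[str, str], bool]:
    """Stand-in for the ingest endpoint: a repeated key is acknowledged, not duplicated."""
    def receive(key: str, payload: str) -> bool:
        store.setdefault(key, payload)
        return True
    return receive
```

The case this covers is the lost acknowledgement: the server stored the session, the gateway never heard back, and the retry carries the same key, so the endpoint can acknowledge it without writing a second copy.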

Scalability is correctness, not throughput

A big network might be 30K patients, three sessions a week. Not high volume. The hard problem is longitudinal correctness. A bug in a Kt/V calc doesn’t break one report. It contaminates every downstream metric ever derived from it. This is where event sourcing earns its keep. Immutable raw events, projections rebuilt on top, so a corrected calculation can be replayed over the full history. Storage costs money. Untrustworthy clinical data costs more.
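A toy version of that idea, assuming raw session measurements are stored as immutable events and Kt/V is only ever a projection. The event fields are hypothetical. The formula is the Daugirdas second-generation single-pool estimate, used here only to give the projection something to compute, and the "buggy" variant is an invented example of a calculation error.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class SessionRecorded:
    """Immutable raw event: measurements only, no derived values."""
    session_id: str
    pre_bun: float         # mg/dL
    post_bun: float        # mg/dL
    hours: float           # treatment time
    uf_liters: float       # ultrafiltration volume
    post_weight_kg: float


def sp_ktv(e: SessionRecorded) -> float:
    """Single-pool Kt/V, Daugirdas second-generation formula."""
    r = e.post_bun / e.pre_bun
    return -math.log(r - 0.008 * e.hours) + (4 - 3.5 * r) * e.uf_liters / e.post_weight_kg


def sp_ktv_buggy(e: SessionRecorded) -> float:
    """Invented bug for illustration: the ultrafiltration term is dropped."""
    r = e.post_bun / e.pre_bun
    return -math.log(r - 0.008 * e.hours)


def project(events: Iterable[SessionRecorded],
            calc: Callable[[SessionRecorded], float]) -> Dict[str, float]:
    """Rebuild the derived view from scratch; the events are never modified."""
    return {e.session_id: round(calc(e), 2) for e in events}


events = (
    SessionRecorded("s1", pre_bun=70.0, post_bun=21.0, hours=4.0,
                    uf_liters=3.0, post_weight_kg=70.0),
)
before_fix = project(events, sp_ktv_buggy)
after_fix = project(events, sp_ktv)   # same events, corrected calculation
```

Because nothing derived is ever written back into the event store, fixing the calculation is a replay, not a data migration.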

ESRD patients have no physiological margin. Disconnected systems add delays and gaps that only a patient with margin could absorb. Every working integration is a small structural risk reduction. Reason enough.