    My AI Assistant Died. Here’s How I Got It Back in 2 Hours.

    A real-world disaster recovery story — and the backup routine that saved weeks of work.


    Last Monday at 12:07pm, I told my AI assistant to update itself. Seven hours later, I was still trying to get it back online.

    This is the story of how a routine software update killed my AI setup, what I lost, what I saved, and the simple backup habit that prevented a genuine disaster.

    The Setup

    I run an AI assistant called Saul through OpenClaw — an open-source platform that connects a large language model to your messaging apps, email, calendar, and pretty much anything else you can think of. Saul lives on a VPS in a Docker container and talks to me through WhatsApp.

    Over seven weeks, Saul had become genuinely useful. Not “novelty chatbot” useful — operationally embedded in my daily workflow. He manages my inbox, writes and publishes articles to my blog, generates a daily podcast, monitors my stock portfolio, runs automated prediction market trades, scans for comets in NASA satellite imagery, tracks vehicle tax and MOT dates, and does a dozen other things I’ve forgotten I ever did manually.

    All of that is configuration. Skills, scripts, API keys, cron schedules, memory files, credentials. Seven weeks of iterative building.

    The Update

    OpenClaw version 2026.3.22 was available. The release notes looked impressive: a new skill marketplace, improved plugin architecture, support for the latest AI models. The usual.

    I told Saul to update. He confirmed: “Updated from 2026.3.13 → 2026.3.22. Restarting now — back in a sec.”

    He never came back.

    The Silence

    What followed was seven hours of silence. No WhatsApp messages. No email reviews. No heartbeat checks. Nothing.

    The update had introduced a breaking change that wasn’t in the release notes. WhatsApp — previously a built-in plugin — had been moved to an external marketplace. But the configuration still referenced it as a built-in. The result: a validation error that blocked every command, including the one you’d need to fix it. A perfect deadlock.
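
The shape of that deadlock is easier to see in a toy model. The sketch below is not OpenClaw's code: the plugin names, the `built_in_plugins` key and the `remove-plugin-reference` command are all hypothetical, chosen only to show what happens when validation runs ahead of every command, including the repair command.

```python
# Toy model only: plugin names, config keys and commands are hypothetical,
# not OpenClaw's real schema or CLI.
BUILT_IN_PLUGINS = {"email", "calendar"}  # "whatsapp" no longer ships built in


class ConfigError(Exception):
    """Raised when the config references something the runtime does not know."""


def validate(config: dict) -> None:
    # Reject any plugin listed as built in that this version does not ship.
    unknown = set(config["built_in_plugins"]) - BUILT_IN_PLUGINS
    if unknown:
        raise ConfigError(f"unknown built-in plugins: {sorted(unknown)}")


def run_command(name: str, config: dict) -> str:
    validate(config)  # runs before every command, including the repair one
    if name == "remove-plugin-reference":
        config["built_in_plugins"].remove("whatsapp")
        return "config repaired"
    return f"{name}: ok"
```

With a config that still lists `whatsapp` as built in, every call to `run_command` raises `ConfigError` before it reaches its own logic, so the one command that could remove the stale reference never gets to run.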

    I couldn’t repair it. I couldn’t roll it back through normal channels. I had to rebuild from scratch — tear down the container and start again on the previous version.

    What I Lost

    When I rebuilt the container, I lost everything that wasn’t on persistent storage:

    • The entire OpenClaw configuration (channel settings, heartbeat config, plugin setup)
    • All 33 scheduled cron jobs (email reviews, portfolio checks, blog publishing, news monitoring)
    • The WhatsApp session (had to re-scan a QR code to re-link)
    • The headless browser and its dependencies
    • API key registrations that had to be regenerated

    The configuration file — a single JSON file that orchestrates everything Saul does — was gone.

    What I Saved

    But here’s the thing: the workspace survived.

    Three weeks earlier, I’d set up a simple daily backup. Every night at 3am, Saul tars up his entire workspace directory — memory files, scripts, skills, credentials, notes, everything — and copies it to cloud storage. It’s a shell script. It took ten minutes to write.
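
A minimal sketch of a backup like this, written in Python rather than shell, might look like the following. The function name, the staging folder and the idea that `offsite` is a locally synced cloud folder are assumptions for illustration, not Saul's actual script.

```python
import shutil
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def backup_workspace(workspace: Path, staging: Path, offsite: Path) -> Path:
    """Archive `workspace` as a timestamped .tar.gz and copy it to `offsite`.

    Assumption: `staging` lives outside `workspace`, so the archive never
    tries to include itself.
    """
    staging.mkdir(parents=True, exist_ok=True)
    offsite.mkdir(parents=True, exist_ok=True)

    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = staging / f"workspace-{stamp}.tar.gz"

    # arcname keeps paths inside the archive relative to the workspace folder
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(workspace, arcname=workspace.name)

    # Assumption: `offsite` is a folder that another tool syncs to cloud storage
    return Path(shutil.copy2(archive, offsite))
```

Scheduled with the standard cron expression `0 3 * * *`, a function like this would run at 3am each night. The timestamp in the filename means each run produces a new archive instead of overwriting the last good one.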

    That backup, taken six hours before the failed update, contained:

    • 41 daily memory logs spanning seven weeks
    • 78 custom scripts (trading bots, podcast generators, blog publishers, email tools)
    • 15 installed skills
    • All API credentials and secrets
    • The complete long-term memory file with every decision, preference, and project note

    I downloaded the backup from Dropbox. Extracted it. The workspace was whole.

    The Rebuild

    Getting Saul operational again took about two and a half hours. Not because the backup failed, but because some things can’t be backed up as files.

    The WhatsApp session is a cryptographic handshake between the server and my phone. When the container was rebuilt, that session was invalidated. I had to SSH into the server, generate a new QR code in the terminal, and scan it from my phone. Five minutes, but it needs the phone physically in hand.

    The cron jobs — all 33 of them — existed only in OpenClaw’s runtime database, not in the workspace. I had to recreate them from memory and from my notes. This is where good documentation paid off: Saul’s own TOOLS.md file listed every cron job with its schedule and purpose. Recreating them was tedious but not guesswork.
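
One way to make that kind of documentation machine-checkable is to keep the job list as a small JSON manifest inside the workspace, so it travels with the nightly backup. The sketch below assumes each job is a dict with `name`, `schedule` and `purpose` keys; that structure is hypothetical, and how the live job names are read out of OpenClaw is deliberately left outside the example.

```python
import json
from pathlib import Path


def save_manifest(jobs: list[dict], path: Path) -> None:
    """Write the documented jobs to a JSON file that lives in the workspace."""
    path.write_text(json.dumps(jobs, indent=2, sort_keys=True))


def missing_jobs(path: Path, live_job_names: set[str]) -> list[dict]:
    """Return documented jobs that are absent from the running scheduler."""
    documented = json.loads(path.read_text())
    return [job for job in documented if job["name"] not in live_job_names]
```

After a rebuild, calling `missing_jobs` with an empty set returns the full list to recreate, and calling it again as jobs are added back shows what is still outstanding.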

    API keys for the Polymarket trading system had to be regenerated. The old keys were invalidated when the configuration was wiped. Fortunately, the wallet private key was in the backup, so deriving new API credentials was a single command.

    The headless browser needed its system libraries reinstalled — a Docker-level dependency that doesn’t persist across container rebuilds. One command from the host machine.

    By 9:34pm — two and a half hours after starting the recovery — everything was operational. WhatsApp connected. All cron jobs rebuilt. Browser working. Trading desk active. Email flowing.

    And as a bonus, during the rebuild we added a capability we didn’t have before: voice control of the Sonos speakers in the house. Sometimes a crisis creates space for improvements you wouldn’t have made otherwise.

    The Rules We Wrote Afterwards

    The first thing I did after recovery was write rules to prevent this happening again. Not guidelines — hard rules, embedded in Saul’s operating instructions:

    Rule 1: Always back up before updating. No exceptions. The backup runs automatically the moment an update is requested, before anything is touched. It copies to off-server storage.

    Rule 2: Check the issue tracker. Before applying any update, check GitHub for known bugs in the target version. If WhatsApp or any critical channel has open issues, don’t update.

    Rule 3: Save the configuration separately. The OpenClaw config file now gets backed up independently of the workspace, because it’s the hardest thing to recreate from memory.

    Rule 4: Document everything in the workspace. If it’s not written down in a file that gets backed up, it doesn’t exist. Cron job schedules, API endpoints, SSH details, speaker IP addresses: all of it lives in files now.
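
Rules 1 and 2 amount to an ordering constraint: backup first, issue check second, update last, with any failure stopping the chain. A rough sketch of that ordering is below. The three callables are placeholders for whatever actually performs the backup, queries the issue tracker and applies the update; none of them is an OpenClaw API.

```python
from typing import Callable


class UpdateAborted(Exception):
    """Raised when a pre-update check says the update should not proceed."""


def guarded_update(
    backup: Callable[[], None],
    known_issues: Callable[[str], list[str]],
    apply_update: Callable[[str], None],
    target_version: str,
) -> None:
    # Rule 1: if the backup raises, the exception propagates and nothing below runs
    backup()

    # Rule 2: any open issue reported for the target version blocks the update
    issues = known_issues(target_version)
    if issues:
        raise UpdateAborted(f"{target_version} has open issues: {issues}")

    apply_update(target_version)
```

Passing the steps in as functions keeps the ordering logic separate from the details of each step, which also makes it easy to test with stubs before trusting it with a real update.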

    The Lesson

    The real lesson isn’t “backups are important” — everyone knows that. The lesson is that AI assistants are infrastructure now, and they need the same operational discipline as any other critical system.

    When Saul went dark for seven hours, it wasn’t a toy that stopped working. Real workflows were affected. Emails went unread. Scheduled tasks didn’t fire. Monitoring stopped. The podcast didn’t generate. When a tool that’s supposed to make you more productive suddenly disappears, you end up less productive than if you’d never had it at all.

    If you’re running an AI assistant that’s become embedded in your daily operations — whether it’s OpenClaw, or any other platform — ask yourself:

    1. If it died right now, what would you lose?
    2. How long would it take to rebuild?
    3. Do you have a backup that could survive a complete teardown?

    If you can’t answer those questions confidently, spend ten minutes today setting up a backup. A cron job, a tar file, a cloud sync. It doesn’t matter how — it matters that it exists.
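
Existing is the minimum; it also helps to know the archive can actually be read back. A small check along these lines, assuming a `.tar` or `.tar.gz` archive and a list of paths you expect to find inside it (both are illustrative choices), reads every file in the archive and reports anything missing.

```python
import tarfile
from pathlib import Path


def verify_backup(archive: Path, required: list[str]) -> list[str]:
    """Read every file in `archive` and return the required paths it lacks."""
    names = set()
    with tarfile.open(archive, "r:*") as tar:
        for member in tar:
            names.add(member.name)
            if member.isfile():
                # Reading the contents forces decompression, so a corrupt
                # archive fails here rather than on the day it is needed.
                handle = tar.extractfile(member)
                while handle.read(1 << 20):
                    pass
    return [path for path in required if path not in names]
```

An empty list back means every expected file is present and readable. A corrupt archive raises an exception instead of returning.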

    Because the update that breaks everything isn’t a question of if. It’s when.


    I’m a CFO who builds with AI. I write about the intersection of finance, technology, and getting things done at markhendy.com.