Server Migration

A bit of a hardware post. But one with some software ties.

This past weekend I was trying to decide what to do with a Pi 4 4GB I had sitting around. I had originally reserved it for running some software for my brother, but there were some limitations with the ARM architecture I couldn't get past, which made it less than ideal for that. So I finally decided to use it as the new home for some of my more critical services.

"Move more critical services to a Raspberry Pi over a dedicated x86 based server" you say? Yep. But there are a few good reasons:

  • Boot Time
  • Power Usage
  • Understanding
  • Proving the Recovery Strategy

Boot time is pretty simple. Due to a checkered past of issues with OMV, the boot time on my server is over 5 minutes. This is clearly not ideal. The boot time on the Pi is under a minute. The Pi also automatically boots after losing power by default. I have the main server set to do this as well, but a BIOS update or any other activity that resets the BIOS would undo that setting.

However, the primary reason is that I have some services which I want to start as quickly as possible. And right now, the Pi simply boots faster. Yes, I can fix the issues with my server. But, at this point, the easiest way to achieve that is to start from scratch again. And the easiest way to start from scratch is to migrate it elsewhere first.

Power usage is the next easy one to understand. I only really have 2 "absolutely essential" services: Traefik and Home Assistant. Both of these can run just fine on the Pi, and pretty much everything else could be shut off from time to time. The TDP of my server's processor is 95W. The Pi as a whole draws about 6W under load and as little as 4W when lightly loaded. Now, I don't know what the ACTUAL power draw of my server is. But the 1700X in there is undoubtedly consuming more than 5W even under low loads; I would guess 15-20W on the low end. And then the motherboard, GPU and drives all add up. I would estimate that at "idle" (the system is never truly idle with OMV + 20 Docker containers running) the system draws at least 50W, constantly.

As such, if I can shut the computer down for 6-12 hours a day, I should see pretty significant savings.
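To put rough numbers on that (using my 50W idle estimate, which I haven't actually measured): shutting the server down for 12 hours a day saves about 50W × 12h × 365 ≈ 219 kWh per year, while running the Pi 24/7 at ~5W only costs about 44 kWh per year. Even if my estimate is off by half, the gap is substantial.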

Understanding is the next pillar in this decision, and Traefik is the big motivator there. I still have some compose files using labels rather than toml files simply because I never took the time to figure out how to translate them into toml files. I prefer toml files for a few reasons: I can put all of my Traefik configuration in one place and back it up, I can quickly and easily see how a given service is configured, and the toml files work even when the Docker containers are on other servers.
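For anyone unfamiliar with the difference, here's a minimal sketch of the kind of file-based config I mean (assuming Traefik v2 TOML syntax; the hostname, IP and resolver name are placeholders, not my real setup):

    # dynamic.toml -- one router and one service, defined outside any container
    [http.routers.homeassistant]
      rule = "Host(`ha.example.com`)"
      entryPoints = ["websecure"]
      service = "homeassistant"
      [http.routers.homeassistant.tls]
        certResolver = "letsencrypt"

    [http.services.homeassistant.loadBalancer]
      [[http.services.homeassistant.loadBalancer.servers]]
        # Because this is just a URL, the backend can live on any machine,
        # which is what makes the cross-server setup possible.
        url = "http://192.168.1.50:8123"

The url line is the key: it can point at any reachable host, so the route keeps working after the container behind it moves to the Pi. Labels, by contrast, only describe containers on the Docker host Traefik can see.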

Obviously, that last item is key here. While I have already migrated Home Assistant, I still need to wrap up a few more services before I can start migrating Traefik itself. Understanding how to configure Traefik well enough to make this transition is also one of the reasons for doing it in the first place, as it also helps in defining my recovery strategy.

And with that segue, we are on to proving out the recovery strategy. Pretty much everything is running in Docker for a reason: portability. The containers themselves are ephemeral. If the services restart successfully every time, that means, in theory, that everything else they need is stored in their environment variables and my persistent storage mounts. And that means these services can be restored by simply backing up the contents of those mounts and my compose files.
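In practice, the whole migration can be sketched in a handful of commands. The paths and hostnames below are hypothetical (I'm assuming a layout where each service's compose file and bind mounts live under one directory), but this is the shape of it:

    #!/bin/sh
    # On the old server: stop the stack so nothing writes mid-copy,
    # then archive the service directory (compose file + bind mounts).
    cd /opt/stacks/homeassistant
    docker-compose down
    tar -czf /tmp/homeassistant.tar.gz -C /opt/stacks homeassistant

    # Ship it to the Pi and bring it up there. As long as the images
    # are multi-arch (Home Assistant's are), nothing else is needed.
    scp /tmp/homeassistant.tar.gz pi@raspberrypi.local:/tmp/
    ssh pi@raspberrypi.local '
      tar -xzf /tmp/homeassistant.tar.gz -C /opt/stacks &&
      cd /opt/stacks/homeassistant &&
      docker-compose up -d
    '

If that round trip works, the backup demonstrably contains everything the service needs, which is the whole point of the exercise.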

I HAVE tested this to a degree before. I did reinstall OMV at one point. But that was a special case: I had a failing drive and replicated the old one onto a new one. It certainly proves that it is possible to recover in at least one fashion. But it also requires the drive to be pretty much fully functional.

I have also tested some more minor migrations. I moved some services from the primary OMV drive to a new NVMe drive. It was piecemeal and even required some changes to the mounts in the compose files. But everything still remained on the same server. Not bad, but also not the degree of flexibility I ultimately want to prove.

My current plan is to first move Home Assistant (as I've done), then update my Traefik configs to run everything from toml files (almost done) and then migrate Traefik itself (next step). I'll call that phase 1. Once phase 1 is complete, phase 2 will consist of figuring out where all of the other services keep their data (a starting point for that is sketched below), generating a strategy to back up their persistent storage and compose files and then... recreating the server (without OMV, or perhaps with OMV as a Docker container) and getting everything running again.
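Docker can do most of that discovery itself. A quick sketch (nothing here beyond the standard Docker CLI) that lists every running container's mounts so nothing gets missed when building the backup list:

    #!/bin/sh
    # For each running container, print everywhere it touches the host
    # filesystem. "bind" mounts are the ones to back up; named volumes
    # show up as "volume" and live under /var/lib/docker.
    docker ps --format '{{.Names}}' | while read -r name; do
      echo "== $name"
      docker inspect --format \
        '{{range .Mounts}}{{.Type}}  {{.Source}} -> {{.Destination}}{{println}}{{end}}' \
        "$name"
    done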

Once this is done I'll have proven out 2 very important things:
  1. That I can safely migrate services to any server I want
  2. That I can recreate my primary server from backup data

And the software tie-in?

Having worked at a few software companies, the 2 things I will say are most often lacking are a solid recovery plan and real-world testing of that recovery plan.

Think about my situation here. I'm not a company. I have limited funds and time. But I've built a recovery strategy which allows me to actually verify the ability to recover almost any service at any time. As an example, if I were actually running these services as a business, once or twice a year I would restore them to a secondary environment and verify them. And once every 1-2 years I would schedule a maintenance window where production is taken offline and the services are restored either to new hardware or, if running in the cloud, to a new cluster or even a new provider, just as a test.

Having backups is good. Having a complete strategy is great. But, every time I've seen a company NEED to perform a recovery, it has come with issues beyond the recovery itself. 

My situation is also representative of another reason to practice your recovery strategy: cost savings. In addition to seeing companies that never validate their recovery strategies, I've also seen them conclude, as a result, that migrating service providers or hardware is too risky to attempt. While vendor lock-in is real, the bigger problem most companies have is that they are afraid to even attempt a move. A reproducible and tested recovery strategy is a huge part of eliminating that risk.

Though, I want to be clear: a "proper" recovery strategy is not simply restoring a disk image. A proper recovery strategy (IMHO) involves using only backups of data and being able to rebuild the environment around them. A full disk image is too big and complex to be relied upon and cannot necessarily be migrated. It also means potentially restoring the very conditions which caused the failure in the first place. There is no problem with having disk images as a redundant layer, but those are a last resort in my opinion.

At the end of the day, once I'm done with this, I'll know that I have a proper backup strategy because I will have used it. And I will have used it in my "production configuration" to restore my "production environment".
