Correction Of Error (COE) for ammobin.ca downtime
The site was down for most of Friday.
Cause: the droplet lost its connection to its public IP
- not able to ssh in or ping it
- going through the Digital Ocean console, I could access a terminal, but no ethX interface was visible at all, even after many restarts
- reached out to support at 9am PST and received a response at 5pm PST; it did not fix the issue
Fix: created new droplet
- git clone ammobin-compose
- filled out secrets (.env file)
- docker-compose up -d
- switched the DNS record over to the new droplet, which started getting traffic immediately
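
The recovery steps above could be scripted roughly as follows. This is a minimal sketch, not the exact commands that were run: the repo URL and the path to the secrets file are placeholder arguments, and the `run`, `recover`, and `DRY_RUN` names are made up for illustration. The DNS switch is left out because it was done by hand.

```shell
#!/usr/bin/env bash
set -eu

# Print a command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# recover <repo-url> <env-file>: rebuild the site in ./ammobin-compose.
# Both arguments are placeholders; pass an absolute path for <env-file>.
recover() {
  local repo_url="$1" env_file="$2"
  run git clone "$repo_url" ammobin-compose
  run cd ammobin-compose
  run cp "$env_file" .env      # fill out secrets
  run chmod 600 .env           # keep secrets readable by the owner only
  run docker-compose up -d     # start the stack in the background
  run docker-compose ps        # check that the containers are up
}

# Only act when both arguments are given, so sourcing this file is harmless.
if [ "$#" -eq 2 ]; then
  recover "$1" "$2"
fi
```

Running it with DRY_RUN=1 first prints the commands without touching anything, which is a cheap way to check the arguments before doing it for real.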
Lessons Learned:
- spinning up a fresh machine quickly is easy with docker compose (able to bring the site back with 10 mins of work)
- external + automated backups are a good thing (this outage lost dashboards + stats + cron jobs)
- don't wait around on support people to fix things if you are able to just recreate the machine
- some cloud-init scripts to set up the machine for me (oh-my-zsh, docker, docker-compose) would be nice
- publicly exposed status page at https://status.ammobin.ca
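
For the cloud-init idea, one possible starting point is a helper that writes a first-boot script, which can then be pasted into the "User data" field when creating a droplet. This is a sketch under assumptions, not something that was in place during the outage: it assumes an Ubuntu or Debian image, the apt package names (docker.io, docker-compose) vary by release, and `write_user_data` is a made-up name.

```shell
#!/usr/bin/env bash
set -eu

# write_user_data <path>: write a first-boot script for cloud-init to <path>.
# The quoted 'EOF' keeps the script body from being expanded while writing.
write_user_data() {
  cat > "$1" <<'EOF'
#!/bin/bash
# Runs once as root on first boot (assumes an Ubuntu or Debian image).
set -eux
export DEBIAN_FRONTEND=noninteractive
export HOME=/root   # may be unset at first boot; the oh-my-zsh installer uses it
apt-get update
apt-get install -y git curl zsh docker.io docker-compose
systemctl enable --now docker
# Unattended oh-my-zsh install: does not change the login shell or start zsh.
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" "" --unattended
EOF
}

# Only write the file when a path is given, so sourcing this file is harmless.
if [ "$#" -eq 1 ]; then
  write_user_data "$1"
fi
```

Because user data that starts with #! is run as a script on first boot, the new droplet should come up with docker, docker-compose, and oh-my-zsh already installed, leaving only the clone, secrets, and docker-compose up -d steps.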
Note: COE is an Amazon practice that is well described here: https://blog.mischel.com/2018/01/24/responding-to-errors-at-amazon/