SadServers DevOps Challenges Solutions, Part 1:
As an associate system administrator, I worked on Red Hat Linux servers, handling user management, permissions, services, and performance monitoring. I automated routine administrative tasks with Bash scripts and cron jobs, reducing manual effort by about 30%. I am an AWS Certified SysOps Administrator and a Google Certified Cloud Engineer, and I am determined to move my career toward a cloud architect or cloud support role.
Intro:
SadServers is a LeetCode-style platform where system administrators, DevOps engineers, and SREs solve real troubleshooting problems on free and paid virtual servers. I attempted both the free and the paid challenges as practice for mock interviews, and here is a writeup of my solutions:
1. Minimize disk-filling process: Identify and terminate the process writing to /var/log/bad.log:
lsof /var/log/bad.log
# Output shows PID 621 (badlog.py), kill it:
kill 621
# Or as one-liner:
kill $(lsof -t /var/log/bad.log)
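On a minimal image where lsof is not installed, the same lookup can be sketched by walking /proc. The helper name pids_holding and the temporary demo file are illustrative assumptions, not part of the challenge. The sketch assumes a Linux /proc and only sees processes you are allowed to inspect.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list PIDs that hold a given file open, using /proc only.
pids_holding() {
    local target fd pid
    target=$(readlink -f "$1") || return 1
    for fd in /proc/[0-9]*/fd/*; do
        # Each fd entry is a symlink to the file the process has open.
        if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
            pid=${fd#/proc/}      # strip the leading "/proc/"
            echo "${pid%%/*}"     # keep only the PID component
        fi
    done | sort -un
}

# Demo with a throwaway file (a stand-in for /var/log/bad.log).
demo_log=$(mktemp)
sleep 30 3>>"$demo_log" &         # background process keeping the file open on fd 3
writer_pid=$!
sleep 1                           # give the child time to open the file
found_pids=$(pids_holding "$demo_log")
echo "PIDs holding $demo_log: $found_pids"

# Clean up the demo process and file.
kill "$writer_pid"
wait "$writer_pid" 2>/dev/null || true
rm -f "$demo_log"
```

In the real challenge the argument would be /var/log/bad.log, and the printed PID is the one to pass to kill.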
2. Find secret number hidden in .txt files: Count "Alice" occurrences:
grep Alice -Hnc *.txt | cut -d: -f2 | python3 -c "import sys; print(sum(int(l) for l in sys.stdin))"
Find the file with exactly one "Alice" line and extract the number on the line that follows it:
grep Alice -Hnc *.txt | grep -Ee :1$ | cut -d: -f1 | xargs -I{} sh -c 'grep -A1 Alice {} | grep -oEe "[0-9]+"'
Concatenate results:
PART1=$(...) # the count, as above
PART2=$(...) # the number, as above
echo -n "${PART1}${PART2}" > /home/admin/solution
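To make the flow concrete, here is a minimal end-to-end sketch that runs against made-up sample files in a temporary directory. The file names, their contents, and the variable one_hit_file are assumptions for illustration; the real challenge data differs. It counts matching lines (as grep -c does) and uses printf rather than echo -n.

```shell
#!/usr/bin/env bash
# Build a throwaway directory with made-up sample files (not the challenge data).
workdir=$(mktemp -d)
printf 'Alice went home\nAlice slept\n'         > "$workdir/a.txt"
printf 'Bob saw Alice\nthere were 42 rabbits\n' > "$workdir/b.txt"
printf 'nothing to see here\n'                  > "$workdir/c.txt"

# Part 1: total number of lines containing "Alice" across all .txt files.
PART1=$(cat "$workdir"/*.txt | grep -c Alice)

# Part 2: the file with exactly one matching line, then the number on the line after it.
one_hit_file=$(grep -Hc Alice "$workdir"/*.txt | grep -E ':1$' | cut -d: -f1)
PART2=$(grep -A1 Alice "$one_hit_file" | grep -oE '[0-9]+')

# Concatenate both parts without a trailing newline.
printf '%s%s' "$PART1" "$PART2" > "$workdir/solution"
cat "$workdir/solution"
```

With these sample files the count is 3 and the number is 42, so the solution file contains 342.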
3. Find IP with most requests in access.log:
# -E does not support the (?:...) non-capturing group, so use a plain group:
grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}' access.log | sort | uniq -c | sort -n | tail -1 | awk '{print $2}' > /home/admin/highestip.txt
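An alternative sketch, assuming the log is in common log format with the client IP as the first field, counts requests per IP in a single awk pass. The sample log lines and the variable names sample_log and top_ip are made up for illustration.

```shell
#!/usr/bin/env bash
# Made-up sample log in common log format (client IP is the first field).
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /a HTTP/1.1" 200 128
10.0.0.2 - - [01/Jan/2024:00:00:03 +0000] "GET /b HTTP/1.1" 404 0
10.0.0.3 - - [01/Jan/2024:00:00:04 +0000] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2024:00:00:05 +0000] "GET /c HTTP/1.1" 200 64
EOF

# Count requests per IP, sort by count descending, keep the busiest IP.
top_ip=$(awk '{ hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' "$sample_log" \
    | sort -rn | head -n 1 | awk '{ print $2 }')
echo "$top_ip"
rm -f "$sample_log"
```

Keying on the first field avoids accidentally matching IP-like strings elsewhere in a line, at the cost of assuming the log layout.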
4. Fix web server (Apache) permission issue: Check permissions:
ls -lhA /var/www/html/
# Fix:
sudo chmod a+r /var/www/html/index.html
# Then verify:
curl localhost
If local curl hangs, fix firewall:
sudo iptables -F
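The permission fix works because the web server worker usually runs as an unprivileged user and therefore needs the read bit for "others" on the file. The sketch below reproduces the before and after modes on a temporary stand-in file; the file and the variable names are assumptions, and stat -c is the GNU coreutils form.

```shell
#!/usr/bin/env bash
# Stand-in for /var/www/html/index.html, made owner-only on purpose.
page=$(mktemp)
chmod 600 "$page"
before=$(stat -c '%a' "$page")   # octal mode before the fix

# Same fix as in the challenge: grant read to user, group, and others.
chmod a+r "$page"
after=$(stat -c '%a' "$page")    # octal mode after the fix

echo "mode before: $before, after: $after"
rm -f "$page"
```

On the real server, the directories leading to the file also need the execute (traverse) bit for the web server user.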
5. Fix Postgres disk full issue: Check disk usage:
df -h
# Remove large files:
rm -f /opt/pgdata/*.bk
# Restart PostgreSQL:
sudo systemctl restart postgresql
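When df -h shows a full filesystem, it helps to confirm which files are actually taking the space before deleting anything. This is a minimal sketch on a throwaway directory; the directory, file names, and sizes are made up, and -printf is specific to GNU find.

```shell
#!/usr/bin/env bash
# Throwaway directory standing in for /opt/pgdata, with made-up files.
datadir=$(mktemp -d)
head -c 200000 /dev/zero > "$datadir/dump.bk"   # large leftover backup
head -c 1000   /dev/zero > "$datadir/pg.conf"   # small file that must stay

# List files with their size in bytes and keep the largest one.
largest=$(find "$datadir" -type f -printf '%s %p\n' | sort -rn | head -n 1 | cut -d' ' -f2-)
echo "largest file: $largest"

# Remove the backups and count what is left.
rm -f "$datadir"/*.bk
remaining=$(find "$datadir" -type f | wc -l)
echo "files remaining: $remaining"
rm -rf "$datadir"
```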
6. Fix Nginx starting error (configuration and open files): Remove invalid line:
sudo sed -i -e '1d' /etc/nginx/sites-enabled/default
Fix open files limit:
sudo sed -i -e '/LimitNOFILE/d' /etc/systemd/system/nginx.service
sudo systemctl daemon-reload
sudo systemctl restart nginx
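Because sed -i edits in place, it can be worth previewing which lines a pattern deletes first. The sketch below does this on a made-up unit file; its contents, including the LimitNOFILE value, are assumptions and not the challenge's actual file.

```shell
#!/usr/bin/env bash
# Made-up unit file standing in for /etc/systemd/system/nginx.service.
unit=$(mktemp)
cat > "$unit" <<'EOF'
[Service]
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
LimitNOFILE=10
Restart=on-failure
EOF

# Preview the lines the pattern would delete, with line numbers.
matched=$(grep -n 'LimitNOFILE' "$unit")
echo "will delete: $matched"

# Apply the same edit as in the challenge to the sample file (GNU sed -i).
sed -i -e '/LimitNOFILE/d' "$unit"
lines_left=$(wc -l < "$unit")
echo "lines left: $lines_left"
rm -f "$unit"
```

On the real server, systemctl daemon-reload is still needed afterwards so that systemd picks up the edited unit.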
7. Enable Nginx web page:
chmod a+r /var/www/html/index.html
iptables -F
8. Set up Docker web app on port 8888: Update Dockerfile:
EXPOSE 8888
CMD ["node", "server.js"]
Rebuild:
sudo docker build -t app:latest .
Run container:
sudo docker run -d -p 8888:8888 app:latest
9. Fix Kubernetes deployment for webapp: Update deployment.yml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
  namespace: web
spec:
  selector:
    matchLabels:
      app: webapp
  replicas: 1
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: webapp
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8888
Update service:
apiVersion: v1
kind: Service
metadata:
  name: webapp-service
  namespace: web
spec:
  type: LoadBalancer
  selector:
    app: webapp
  ports:
  - port: 8888
    targetPort: 8888
    nodePort: 30007
Apply:
kubectl apply -f deployment.yml -f nodeport.yml
10. Run program requiring communication: If wtfit needs a service but the network is broken, run it with:
LD_LIBRARY_PATH=/lib/x86_64-linux-gnu ./wtfit
If permissions are broken (chmod itself has lost its execute bit), run chmod through the dynamic loader to restore it:
sudo /lib/x86_64-linux-gnu/ld-2.31.so /usr/bin/chmod +x /usr/bin/chmod
This summarizes the key commands and fixes for each scenario efficiently.
