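Why Automate
Managing 100 servers by hand is a disaster; managing 1,000 servers with automation is routine.
Ansible is the Swiss Army knife of operations automation: it is agentless, runs over SSH, and uses YAML syntax, so the learning curve is gentle while the tool remains powerful.
Core Concepts at a Glance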
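| Concept | Description |
|---|---|
| Inventory | Host list that defines which servers are managed |
| Module | Atomic unit of work (e.g. copy, service, yum) |
| Playbook | Task orchestration script written in YAML |
| Role | Reusable unit for organizing Playbook content |
| Facts | Host information gathered automatically (CPU, memory, OS, etc.) |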
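Quick Install and Configuration
# Install Ansible
pip install ansible
# Verify
ansible --version
# Basic configuration (host_key_checking = False skips SSH host key verification: convenient in a lab, risky in production)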
cat > ~/.ansible.cfg << 'EOF'
[defaults]
host_key_checking = False
inventory = ./hosts
forks = 20
timeout = 10
gathering = smart
EOF
Inventory: the Host List
# hosts: static inventory
[webservers]
web01 ansible_host=10.0.1.11 ansible_user=root
web02 ansible_host=10.0.1.12 ansible_user=root
web03 ansible_host=10.0.1.13 ansible_user=root
[dbservers]
db01 ansible_host=10.0.2.11 ansible_user=root
db02 ansible_host=10.0.2.12 ansible_user=root
[prod:children]
webservers
dbservers
[prod:vars]
ansible_python_interpreter=/usr/bin/python3
monitoring_enabled=true
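Ansible also accepts inventories written in YAML. As a rough sketch, the static inventory above might look like this in that format. The file name `hosts.yml` is an assumption; the hosts, groups, and variables are copied from the INI version.

```yaml
# hosts.yml (hypothetical file name): YAML form of the INI inventory above
all:
  children:
    prod:
      children:
        webservers:
          hosts:
            web01: {ansible_host: 10.0.1.11, ansible_user: root}
            web02: {ansible_host: 10.0.1.12, ansible_user: root}
            web03: {ansible_host: 10.0.1.13, ansible_user: root}
        dbservers:
          hosts:
            db01: {ansible_host: 10.0.2.11, ansible_user: root}
            db02: {ansible_host: 10.0.2.12, ansible_user: root}
      vars:
        ansible_python_interpreter: /usr/bin/python3
        # A real YAML boolean here; in an INI [group:vars] section the value is a string
        monitoring_enabled: true
```

Running `ansible-inventory -i hosts.yml --graph` should print the resulting group tree, which is a quick way to confirm both formats describe the same hosts.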
Dynamic Inventory (Essential in Production)
# inventory/aliyun.yml: fetch hosts dynamically from the Alibaba Cloud API
plugin: alibaba.alicloud.ali
regions:
  - cn-hangzhou
filters:
  vpc_id: vpc-xxx
keyed_groups:
  - key: tags.Role
    prefix: role
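The `keyed_groups` entry builds groups from each instance's `Role` tag. A minimal sketch of a play that targets one such group follows; it assumes some instances carry the tag `Role=web`, so that a group named `role_web` (prefix, default `_` separator, tag value) exists. The playbook file name is hypothetical.

```yaml
# playbooks/ping-web-role.yml (hypothetical file name)
- name: Check connectivity for hosts tagged Role=web
  hosts: role_web          # assumed group: prefix "role" + "_" + tag value "web"
  gather_facts: false
  tasks:
    - name: Verify Ansible can reach each host
      ansible.builtin.ping:
```

`ansible-inventory -i inventory/aliyun.yml --graph` shows which groups the plugin actually generated, which is worth checking before relying on a group name.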
Ad-Hoc Commands (One-Off Operations)
# Run uptime on all web servers
ansible webservers -m command -a "uptime"
# Run in parallel (-f sets the number of forks)
ansible webservers -f 20 -m shell -a "free -m"
# Copy a file to all hosts
ansible prod -m copy -a "src=/etc/nginx/nginx.conf dest=/etc/nginx/nginx.conf backup=yes"
# Restart a service
ansible webservers -m systemd -a "name=nginx state=restarted"
# View the gathered Facts
ansible web01 -m setup | less
# Show only specific facts
ansible web01 -m setup -a "filter=ansible_memory_mb"
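Once an ad-hoc command proves useful, it can be moved into a Playbook so it is repeatable. The sketch below is one possible translation of the `uptime` command above; the variable name `uptime_result` is made up for illustration.

```yaml
- name: Collect uptime from the web servers
  hosts: webservers
  gather_facts: false
  tasks:
    - name: Run uptime
      ansible.builtin.command: uptime
      register: uptime_result      # hypothetical variable name
      changed_when: false          # read-only command, never report "changed"

    - name: Show the result
      ansible.builtin.debug:
        var: uptime_result.stdout
```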
Playbooks in Practice
Basic Structure
---
# playbooks/nginx-setup.yml
- name: Install and configure Nginx
  hosts: webservers
  become: yes
  vars:
    nginx_port: 80
    nginx_worker_processes: auto
  tasks:
    - name: Install Nginx
      yum:
        name: nginx
        state: latest
    - name: Deploy the configuration template
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        backup: yes
      notify: restart nginx
    - name: Ensure the service is running
      systemd:
        name: nginx
        state: started
        enabled: yes
  handlers:
    - name: restart nginx
      systemd:
        name: nginx
        state: restarted
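A broken template would otherwise only surface when the handler restarts Nginx. One way to catch it earlier is the `validate` parameter of the `template` module, sketched here as a drop-in variant of the deploy task. It assumes the `nginx` binary is on the remote host's PATH and that the config does not rely on relative includes, which may not resolve for the temporary file.

```yaml
- name: Deploy the configuration template, validating it before it is installed
  ansible.builtin.template:
    src: templates/nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    backup: yes
    # %s is replaced with the path of the rendered temporary file;
    # the file is only moved into place if the command exits 0
    validate: nginx -t -c %s
  notify: restart nginx
```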
Using Jinja2 Templates
# templates/nginx.conf.j2
user nginx;
worker_processes {{ nginx_worker_processes }};
error_log /var/log/nginx/error.log;
events {
    worker_connections {{ ansible_processor_vcpus * 1024 }};
}
http {
    server {
        listen {{ nginx_port }};
        server_name {{ inventory_hostname }};
        location / {
            root /usr/share/nginx/html;
        }
        # Conditional rendering (| bool because INI [group:vars] values arrive as strings)
        {% if monitoring_enabled | default(false) | bool %}
        location /nginx_status {
            stub_status on;
            allow 127.0.0.1;
            deny all;
        }
        {% endif %}
    }
}
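Before rendering the template on real hosts, it can help to print the values it depends on. The play below is only a sketch for that purpose; it relies on fact gathering being enabled (the default) so that `ansible_processor_vcpus` is available.

```yaml
- name: Preview the values the Nginx template will use
  hosts: webservers
  tasks:
    - name: Show computed worker_connections and the monitoring switch
      ansible.builtin.debug:
        msg:
          worker_connections: "{{ ansible_processor_vcpus * 1024 }}"
          monitoring: "{{ monitoring_enabled | default(false) | bool }}"
```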
Batch Updates: Zero-Downtime Rolling Deployment
---
# playbooks/rolling-update.yml
- name: Nginx rolling update
  hosts: webservers
  serial: 1 # operate on one host at a time
  become: yes
  vars:
    deploy_version: "v2.1.0"
  pre_tasks:
    - name: Remove the host from the load balancer
      haproxy:
        state: disabled
        host: "{{ inventory_hostname }}"
        socket: /var/run/haproxy.sock
        backend: web_backend
      delegate_to: "{{ item }}"
      with_items: "{{ groups['loadbalancers'] }}" # assumes a loadbalancers group in the inventory
  tasks:
    - name: Deploy the new version
      copy:
        src: "/data/builds/{{ deploy_version }}/"
        dest: /usr/share/nginx/html/
      notify: reload nginx
    # Run the notified handler now so the health check sees the reloaded service
    - meta: flush_handlers
    - name: Health check
      uri:
        url: "http://{{ inventory_hostname }}/health"
        status_code: 200
      register: health_result
      retries: 10
      delay: 3
      until: health_result.status == 200
  post_tasks:
    - name: Put the host back into the load balancer
      haproxy:
        state: enabled
        host: "{{ inventory_hostname }}"
        socket: /var/run/haproxy.sock
        backend: web_backend
      delegate_to: "{{ item }}"
      with_items: "{{ groups['loadbalancers'] }}"
  handlers:
    - name: reload nginx
      systemd:
        name: nginx
        state: reloaded
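`serial: 1` is the most cautious setting. For larger fleets, `serial` also accepts a list of batch sizes, and `max_fail_percentage` can stop the rollout when a batch fails. The following is a sketch of those play-level keywords only; the single debug task is a placeholder for the real deployment tasks shown above.

```yaml
- name: Nginx rolling update in growing batches
  hosts: webservers
  become: yes
  serial:
    - 1          # first batch: a single canary host
    - "50%"      # later batches: half of the hosts at a time
  max_fail_percentage: 0   # abort the play as soon as any host in a batch fails
  tasks:
    - name: Placeholder for the deployment tasks
      ansible.builtin.debug:
        msg: "deploying to {{ inventory_hostname }}"
```

Starting with one host limits the damage of a bad build, while the larger later batches keep the total rollout time reasonable.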
Role Directory Structure
roles/
└── common/
    ├── tasks/
    │   └── main.yml          # entry-point tasks
    ├── handlers/
    │   └── main.yml          # handlers
    ├── templates/
    │   └── sysctl.conf.j2    # Jinja2 templates
    ├── files/
    │   └── rpm-gpg-keys/     # static files
    ├── vars/
    │   └── main.yml          # variables (high precedence)
    ├── defaults/
    │   └── main.yml          # default variables (low precedence)
    └── meta/
        └── main.yml          # dependencies and metadata
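A role does nothing until a play applies it. The sketch below shows one way to apply the `common` role to the `prod` group and override one of its defaults. The file name `site.yml` and the variable `sysctl_swappiness` are hypothetical; the variable is assumed to be declared in `roles/common/defaults/main.yml`.

```yaml
# site.yml (hypothetical file name)
- name: Apply the common baseline to production hosts
  hosts: prod
  become: yes
  roles:
    - role: common
      vars:
        # Hypothetical variable; takes precedence over the value
        # in roles/common/defaults/main.yml
        sysctl_swappiness: 10
```

Keeping tunable values in `defaults/` and reserving `vars/` for values that should rarely be overridden makes a role easier to reuse across environments.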
Common Modules Cheat Sheet
# Package management
- yum: name=nginx state=latest # RHEL/CentOS
- apt: name=nginx state=latest # Debian/Ubuntu
- pip: name=ansible state=latest # Python packages
# File operations
- copy: src=/local/file dest=/remote/file backup=yes
- template: src=config.j2 dest=/etc/app.conf
- lineinfile: path=/etc/hosts line="10.0.1.11 db01"
- blockinfile: path=/etc/ssh/sshd_config block="{{ lookup('file', 'sshd_block') }}"
# Command execution
- command: uptime # bypasses the shell, safer
- shell: "ps aux | grep nginx" # runs through the shell, supports pipes
- script: /local/scripts/setup.sh # uploads the script, then runs it
# System administration
- user: name=deploy groups=wheel shell=/bin/bash
- group: name=deploy state=present
- cron: name="backup" minute=0 hour=2 job="/opt/backup.sh"
- systemd: name=nginx state=started enabled=yes
# File attributes
- file: path=/opt/app state=directory owner=deploy mode=0755
- stat: path=/etc/nginx/nginx.conf
  register: nginx_conf
# Conditional execution
- debug: msg="Upgrade needed"
  when: ansible_memory_mb.real.total < 2048
# Loops
- user: name="{{ item }}" state=present
  loop:
    - alice
    - bob
    - charlie
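The cheat sheet registers the `stat` result as `nginx_conf` but does not use it. As a small sketch, a later task could branch on it like this; the follow-up `nginx -t` check is an illustrative choice, not part of the original list.

```yaml
- name: Check whether the Nginx config exists
  ansible.builtin.stat:
    path: /etc/nginx/nginx.conf
  register: nginx_conf

- name: Validate the config only when the file is present
  ansible.builtin.command: nginx -t
  when: nginx_conf.stat.exists
  changed_when: false    # a syntax check does not change the host
```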
Production Best Practices
# 1. Encrypt sensitive data
ansible-vault encrypt secrets.yml
ansible-vault edit secrets.yml
ansible-playbook playbook.yml --ask-vault-pass
# 2. Check before you run (dry run)
ansible-playbook playbook.yml --check --diff
# 3. Limit the scope of a run
ansible-playbook playbook.yml --limit web01
# 4. Skip tags
ansible-playbook playbook.yml --skip-tags "restart,reboot"
# 5. Step through tasks one at a time
ansible-playbook playbook.yml --step
# 6. Easier-to-read output (the yaml callback prints multi-line results as-is)
ANSIBLE_STDOUT_CALLBACK=yaml ansible-playbook playbook.yml
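The vault and tag options above only have an effect if the Playbook is written for them. The sketch below shows roughly what that looks like: `secrets.yml` is loaded through `vars_files`, and the restart task carries the `restart` tag so that `--skip-tags restart` leaves it out. The template path and destination file are hypothetical.

```yaml
- name: Deploy with encrypted secrets and a tagged restart step
  hosts: webservers
  become: yes
  vars_files:
    - secrets.yml                    # encrypted with ansible-vault
  tasks:
    - name: Write the application credentials
      ansible.builtin.template:
        src: templates/app.env.j2    # hypothetical template
        dest: /etc/app.env           # hypothetical destination
        mode: "0600"
      no_log: true                   # keep secret values out of the task output

    - name: Restart Nginx
      ansible.builtin.systemd:
        name: nginx
        state: restarted
      tags:
        - restart
```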
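Automation philosophy: anything you have to do more than twice deserves a Playbook. Don't trust manual operations: people make mistakes, whereas a Playbook runs the same steps every time.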