Make a list of up servers and pass that list to work with my actual playbook?

user5447339 asked:

I have an Ansible playbook, shown below, that works fine most of the time. Recently, though, I have noticed that it gets stuck on some of the servers in the ALL group and just sits there. It doesn't even move on to the other servers in the ALL list.

# This will copy files
---
- hosts: ALL
  serial: "{{ num_serial }}"
  tasks:
      - name: copy files
        shell: "(ssh -o StrictHostKeyChecking=no abc.com 'ls -1 /var/lib/jenkins/workspace/copy/stuff/*' | parallel -j20 'scp -o StrictHostKeyChecking=no abc.com:{} /data/records/')"

      - name: sleep for 5 sec
        pause: seconds=5

When I started debugging, I noticed that the server itself is getting stuck: I can ssh in (log in) fine, but when I run the ps command it just hangs and I never get my prompt back. That means Ansible is also getting stuck while executing the scp command above on that server.

So my question is: even if some servers are in that state, why doesn't Ansible just time out and move on to the next server? Is there anything we can do so that Ansible doesn't pause everything while waiting for that server to respond?

Note that the server is up and running and I can ssh in fine, but the ps command hangs when we run it, and because of that Ansible hangs as well.

Is there any way to run the command ps aux | grep app on all the servers in the ALL group, build a list of the servers that ran it successfully (timing out and moving on to the next server in the ALL list if it hangs on one), and then pass that list to my playbook above? Can we do all this in one playbook?

Update:-

I am getting an error like this:

ERROR! The 'pause' module bypasses the host loop, which is currently not supported in the free strategy and would instead execute for every host in the inventory list.

The error appears to have been in '/var/lib/jenkins/workspace/process/check.yml': line 10, column 9, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:


      - name: sleep for 5 sec
        ^ here
Build step 'Execute shell' marked build as failure

My answer:


You could gather facts.

---
- hosts: all
  gather_facts: True
  tasks:

By gathering facts explicitly, you force Ansible to try to connect to every host (and update your fact cache). If a host is unreachable, it will be skipped for the rest of the playbook. By default the timeout for gathering facts is 10 seconds, so this should reduce the amount of time you have to wait.
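
If fact gathering alone does not catch the hosts where ps hangs, one possible approach is to probe each host with a time-limited task and only continue on the hosts that answered. The following is a minimal sketch, not a tested drop-in: the registered variable probe, the group name responsive, and the 10-second limit are names and values I picked for illustration. It also swaps pause for wait_for, on the assumption that you want a per-host sleep that does not trigger the free-strategy error quoted in the question.

```yaml
---
# Play 1: probe every host. A probe that hangs is abandoned after 10 seconds.
- hosts: ALL
  gather_facts: false
  tasks:
    - name: run ps with a time limit (async makes the task fail if it is not done in 10s)
      shell: "ps aux | grep app"
      async: 10          # assumed limit in seconds, adjust to taste
      poll: 2            # check on the job every 2 seconds
      register: probe    # hypothetical variable name
      ignore_errors: true

    - name: add hosts whose probe succeeded to a dynamic group
      group_by:
        key: responsive  # hypothetical group name
      when: probe is succeeded

# Play 2: the original work, limited to hosts that passed the probe.
- hosts: responsive
  serial: "{{ num_serial }}"
  tasks:
    - name: copy files
      shell: "(ssh -o StrictHostKeyChecking=no abc.com 'ls -1 /var/lib/jenkins/workspace/copy/stuff/*' | parallel -j20 'scp -o StrictHostKeyChecking=no abc.com:{} /data/records/')"

    - name: sleep for 5 sec
      wait_for:
        timeout: 5       # wait_for with only a timeout simply waits on each host
```

Hosts that are unreachable are dropped by Ansible on their own; ignore_errors only covers the case where the host is reachable but the probe fails or times out. Groups created with group_by last for the rest of the playbook run, so both plays can live in the same file.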


View the full question and any other answers on Server Fault.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.