ECS COMPUTER CHECKS

showall

  • In a SOC window, type: showall and make sure all processes are active (except MPURGE).

    df

  • In a SOC window, type: df and check the listing for the available data storage capacity and how full each filesystem currently is.

    Check the visual "SOC USER" window. If at any time these filesystems are at very high levels (95% or above) the files should be manually purged.

    To do this: click on SYSTEM ADMIN. on the top tool-bar. Click on PURGE FILES. A PURGE FILES CONTROL option window will open with the purge set to run automatically. To override this, click on MANUAL to start the purge. When the manual purge is done (look at the messages in the event window), click on AUTOMATIC (this will restart the automatic purge program).
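
    The 95% check can also be done from the command line. The following is only a rough sketch, not part of the ECS software: check_df is a made-up name, and it assumes the df listing has one field per line that looks like "NN%" and that the mount point is the last field on that line.

```shell
#!/bin/sh
# Sketch: flag filesystems at or above a usage threshold.
# check_df is a hypothetical helper, not an ECS tool.
# It reads df-style output on stdin and assumes:
#   - the capacity field is the one that looks like "NN%"
#   - the mount point is the last field on the same line
check_df() {
    threshold="${1:-95}"    # default of 95 comes from the text above
    awk -v limit="$threshold" '
        NR > 1 {                              # skip the header line
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+%$/) {       # found the capacity field
                    pct = $i
                    sub(/%/, "", pct)
                    if (pct + 0 >= limit + 0)
                        print $NF " is at " pct "%"
                }
            }
        }'
}
```

    With the function loaded in the shell, something like df | check_df 95 should print one line per filesystem that needs a manual purge, and nothing if all is well.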

    ps -ef |grep Tlm

  • In a TLM window, type: ps -ef |grep Tlm and look for the six processes: DistTlm, CapTlm, and RefTlm for both mdi and vc01, i.e. the following:

    /ecs/exe/DistTlm -e soc -h soc -x tlm -r tlm -s vc01

    /ecs/exe/DistTlm -e soc -h soc -x tlm -r tlm -s mdi

    /ecs/exe/CapTlm -e soc -r tlm -svc01 -

    /ecs/exe/CapTlm -e soc -r tlm -smdi -

    /ecs/exe/RefTlm -e soc -svc01 -

    /ecs/exe/RefTlm -e soc -smdi -
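
    Checking six lines by eye is easy to get wrong, so a small helper may be useful. This is a sketch only: missing_tlm is a made-up name, and it assumes the process lines look like the ones listed above (program under /ecs/exe, stream given as "-s vc01" or "-svc01").

```shell
#!/bin/sh
# Sketch: report which of the six expected telemetry processes are missing.
# missing_tlm is a hypothetical helper, not an ECS tool.
# It reads `ps -ef` output on stdin and matches only the program name and
# the stream name, since the argument spacing differs between programs
# ("-s vc01" for DistTlm, "-svc01" for CapTlm and RefTlm).
missing_tlm() {
    ps_out=$(cat)
    for prog in DistTlm CapTlm RefTlm; do
        for stream in vc01 mdi; do
            if ! printf '%s\n' "$ps_out" |
                grep "/ecs/exe/$prog " |
                grep -q -- "-s *$stream"; then
                echo "missing: $prog $stream"
            fi
        done
    done
}
```

    Run as ps -ef | missing_tlm; no output means all six processes were found.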

  • If something doesn't look right, you may need to stop and start the ECS software.
  • The best way to check the telemetry connections to iws'es from home seems to be:

    tlm $ netstat | grep iws | sort +4.b [ | wc ]

  • The sort is to make sure you get them in the same order every time - makes it easier to spot changes. The optional pipe through "wc" is for a quick check that the expected number of connections are there - currently the first number given by wc (the number of lines) is 18, and I think that's what it's normally supposed to be (of course, e.g. soho1/soho2 may not always be connected, etc, so the number won't *always* be the same).
  • The second number on each line of the listing itself (not the wc output) can also be of value - it's the number of bytes in the send queue... I haven't looked at this while we have had problems, but I suspect it will be quite large if we're having network problems again (numbers in the 2000-4000 range seem to occur every now and then even under normal conditions).

    NOTE: This can be developed into a script to show deviations from a normal state, using cut & diff
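
    A possible starting point for such a script is sketched below. It is an untested-in-production sketch: the function names are made up, awk is used in place of cut because the netstat columns are padded with a varying number of spaces, and the column layout is assumed to be Proto, Recv-Q, Send-Q, Local Address, Foreign Address, (state).

```shell
#!/bin/sh
# Sketch: reduce netstat output to a stable list of iws connections so it
# can be compared against a saved "normal" snapshot with diff.
# iws_snapshot and compare_iws are hypothetical names.

# iws_snapshot: read netstat output on stdin, keep only the foreign
# address and the state of each iws connection, sorted. The queue
# counters are dropped because they change from moment to moment and
# would make every diff noisy.
iws_snapshot() {
    grep iws | awk '{ print $5, $6 }' | sort
}

# compare_iws BASELINE_FILE: read netstat output on stdin, print the
# diff against the baseline, and return non-zero if anything differs.
compare_iws() {
    iws_snapshot | diff "$1" -
}
```

    The idea would be to save a baseline once while things are known to be good (netstat | iws_snapshot > some file), and later run netstat | compare_iws on that file; any lines printed are connections that have appeared or gone away.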

    To see why/if packets are being dropped in event log:

  • In a TLM window, type: netstat | more and look for lines like:

    Proto Recv-Q Send-Q Local Address Foreign Address (state)

    tcp 0 0 tlm.1980 soho1.iws_vc01 ESTABLISHED

    means:

    Send-Q will fill up if the network or the machine has a problem. soho1.iws_vc01 is the machine and port (the port is either a number or a name like iws_vc01).
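
    To pick out the connections with a backed-up send queue without paging through the whole listing, something like the following sketch might help. high_sendq is a made-up name, Send-Q is assumed to be the third column as in the header above, and the default limit of 4000 is only a guess based on the 2000-4000 range that shows up under normal conditions.

```shell
#!/bin/sh
# Sketch: list TCP connections whose Send-Q exceeds a limit.
# high_sendq is a hypothetical helper, not an ECS tool.
# It reads netstat output on stdin and assumes the layout
#   Proto Recv-Q Send-Q Local-Address Foreign-Address (state)
# The default limit (4000) is an assumption, not a tuned value.
high_sendq() {
    limit="${1:-4000}"
    awk -v limit="$limit" '
        $1 == "tcp" && $3 + 0 > limit + 0 {
            print $5, "Send-Q=" $3
        }'
}
```

    Run as netstat | high_sendq 4000; each line printed names a machine and port whose send queue is above the limit.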

    Load Checking on TLM

  • Rusty created a little script that will check the load on the TLM box. The name of the script is loadchk and it is located in /usr/local/bin. However, if you are logged in as ecs, you should be able to run it by just typing loadchk and that's it. It will output the time/date and the current load on the box. Don't worry about writing it down; it is set up so it will display on your screen and write to a log file. Let Cynthia know if you have any problems with it. A cron job runs it every 10 minutes, but it would be good for anyone to run it periodically when Cynthia is not here and can't monitor the log file. If you see a load that is 10 or above, that would probably warrant a page or call to Cynthia so she can log in and check things out.
  • Location: /u/ecs/soc/info/computer_checks_info.html