KILLED FTPS

Summary:

Here's the longer story:

  • Once upon a time, a DR was written because outgoing ftp of REL or QKL files to an IWS would hang. Someone wrote a script & cron job, /ecs/exe/kill_old_ftps, to kill ftp sessions that were more than 1 hour old. However, in such cases our software does not detect that a problem occurred, and the team whose ftp session hung will not get their REL/QKL file, *nor* an email saying it failed.

  • When investigating what was thought to be the same problem with CELIAS, Stein found that CELIAS' problem was not what we thought: their ftp daemon had been changed, and it generated failure messages that our software did not recognize. Adding an appropriate entry in /home/ecs/ecs/ecs_config/Config.ftp fixed their problem.

  • But in troubleshooting the above, we also removed the redirection of output from the cron job kill_old_ftps. If no processes are killed, there's no output, so nothing is sent to us. But if an ftp has been killed, an email like this will be sent to us:

    Date: ...

    From: daemon@soc.nascom.nasa.gov

    To: ecs@soc.nascom.nasa.gov

    Killing old FTP process: 11994 01:50:29 csh -c ftp -n < /tmp_logs/ftp.LG7J7a >& /tmp_logs/ftpxfer.LG7J7a

    Killing old FTP process: 19932 01:50:29 ftp -n

    Killing old FTP process: 34420 01:45:29 ftp -n

    Killing old FTP process: 40562 01:45:29 csh -c ftp -n < /tmp_logs/ftp.fm7Daa >& /tmp_logs/ftpxfer.fm7Daa

    Cron: The previous message is the standard output and standard error of one of your cron commands.

  • If we get something like this in the future, it (probably) means that an ftp of a REL or QKL file to some IWS has hung, and the cron job killed it. (I say "probably" because a manual ftp left running will trigger the same action.)

  • In this example, *two* hanging ftp jobs were killed.

  • The input files (e.g. /tmp_logs/ftp.LG7J7a and /tmp_logs/ftp.fm7Daa in this case) are not deleted, so you can simply view them with more:

    soc $ more /tmp_logs/ftp.LG7J7a

    open erne

    user cepac 1S0H0_C3pac

    bin

    cd soc_reports

    put /ftp/tlm_files/CEPHK/CEPHK_050410_105731.QKL CEPHK_050410_105731.QKL

    close

    quit

  • That pretty much gives you the machine it was ftp'ing to ("open erne") and the name of the file that didn't make it.

  • Based on that information, you email the team that didn't get the file so they can fetch it on their own.
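The lookup described above (find the input file named in the email, then read the host and file name out of it) can be sketched in Python. This is only an illustration, not part of the existing software: the function names are hypothetical, and the line format is assumed from the sample email and sample input file shown earlier.

```python
import re

# Assumed line format, taken from the sample email:
#   Killing old FTP process: <pid> <elapsed> <command line>
KILL_LINE = re.compile(r"Killing old FTP process:\s+(\d+)\s+(\S+)\s+(.*)")
# The ftp input file is whatever follows "<" in the csh wrapper command.
INPUT_FILE = re.compile(r"<\s*(\S+)")


def input_files_from_email(body):
    """Return the ftp input files named in a kill_old_ftps email body."""
    files = []
    for line in body.splitlines():
        match = KILL_LINE.search(line)
        if not match:
            continue  # header, "Cron:" trailer, "kill: ..." errors, etc.
        redirect = INPUT_FILE.search(match.group(3))
        if redirect and redirect.group(1) not in files:
            files.append(redirect.group(1))
    return files


def summarize_ftp_input(text):
    """Return (host, local files) from the contents of an ftp input file."""
    host = None
    files = []
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "open" and len(words) > 1:
            host = words[1]          # e.g. "open erne"
        elif words[0] == "put" and len(words) > 1:
            files.append(words[1])   # local path of the file that was sent
    return host, files
```

Feeding the email body to input_files_from_email() and the contents of each returned file to summarize_ftp_input() gives the machine and the file(s) to mention when emailing the team. The bare "ftp -n" lines in the email carry no input file and are skipped.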

    The following section can be disregarded if/when Rusty updates the script to be more discriminating:

    However, the kill_old_ftps process will sometimes unsuccessfully attempt to kill incoming ftp sessions as well, resulting in messages like these:

    Date: ...

    From: daemon@soc.nascom.nasa.gov

    To: ecs@soc.nascom.nasa.gov

    kill: 26042: permission denied

    Killing old FTP process: 26042 01:59:58 ftpd: 194.199.161.231: medocadm: RETR ee2005054181.402

    *************************************************

    Cron: The previous message is the standard output and standard error of one of your cron commands.

  • These emails can simply be ignored as long as they don't occur very often. If they do become frequent, they might indicate a network problem or an "attack" by someone trying to get all our telemetry data over and over again, etc.
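For reference, a "more discriminating" check could look something like the Python sketch below. It is not the actual kill_old_ftps script: the function names are hypothetical, and it assumes the second field in the email lines is a ps-style elapsed time ([[dd-]hh:]mm:ss) and that outgoing sessions always show up as "ftp ..." or "csh -c ftp ...", as in the sample emails.

```python
def etime_to_seconds(etime):
    """Convert a ps-style elapsed time ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours with zero
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def is_stale_outgoing_ftp(etime, command, max_age_seconds=3600):
    """True only for outgoing ftp clients older than max_age_seconds."""
    words = command.split()
    if not words:
        return False
    outgoing = words[0] == "ftp" or words[:3] == ["csh", "-c", "ftp"]
    # Incoming sessions ("ftpd: ...") belong to another user and cannot
    # be killed by us anyway, so they are never selected.
    return outgoing and etime_to_seconds(etime) > max_age_seconds
```

With a check like this, the incoming "ftpd: ..." session in the example above would be skipped, so no "permission denied" email would be generated for it.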

Location: /u/ecs/soc/info/killed_ftps_info.html