CHAPTER 5: Troubleshooting Guide
The following is a (surely incomplete) list of points to check when
troubleshooting CPSR2. It represents the problems we have encountered
and the procedures we have developed to fix them. The list will grow
over time. Feel free to submit problems to the authors if you
encounter a scenario that is not described here.
Current Hints -- UPDATE: 9/11/2004
The most common problem with CPSR2 recently has been the GUIManager
(green) freezing. This is currently happening every 5-6 hours.
To fix it:
- Kill the GUIManager.
- On pegasus, open a window, make sure you have an agent running, then run:
pegasus% loopssh cpsr 1 30 "killall -9 dm_daemon;dm_daemon"
- Then, on *any* cpsr machine, type:
GUIManager
That should fix it.
If the GUIMonitor (blue) freezes, it is usually because a machine
or an ekg_daemon is dead.
To try to rescue this:
- Kill the GUIMonitor
- On pegasus:
pegasus% loopssh cpsr 1 30 "killall -9 ekg_daemon;ekg_daemon"
- Restart the GUIMonitor on a cpsr node.
- Master reset after this, then restart the 20cm_cpsr2_das
and ObsMonitor programs again.
If a machine has crashed, the loopssh will hang on that node, which should
be obvious. If this happens, go upstairs and reboot the machine.
Once it has rebooted, go back to step 2 (the loopssh command on pegasus).
NOTE: After a reboot you will also need to restart the dm_daemon
on that node.
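A loop over the cluster that hangs on a crashed machine can be slow to diagnose. The following is a minimal sketch, not part of the CPSR2 tools, of a probe that reports unreachable nodes instead of hanging. It assumes the nodes are named cpsr1 to cpsr30 (inferred from the loopssh arguments above) and that OpenSSH's BatchMode and ConnectTimeout options are available. The function name find_dead_nodes and the PROBE variable are hypothetical. A connection timeout only catches machines that do not answer at all; a node that accepts the connection and then stalls will still block.

```shell
# Sketch only: report nodes that cannot be reached, rather than hanging.
# Assumption: PROBE is a hypothetical hook holding the command used to
# test one host; by default an ssh that never prompts and gives up on
# connecting after 5 seconds.
PROBE=${PROBE:-"ssh -o BatchMode=yes -o ConnectTimeout=5"}

# find_dead_nodes PREFIX FIRST LAST
# Prints the name of every node PREFIX<n> on which the probe fails.
find_dead_nodes() (
    prefix=$1 n=$2 last=$3
    while [ "$n" -le "$last" ]; do
        host="$prefix$n"
        # "true" is a no-op remote command; only the exit status matters.
        if ! $PROBE "$host" true </dev/null >/dev/null 2>&1; then
            echo "$host"
        fi
        n=$((n + 1))
    done
)

# Example, run on pegasus:
#   find_dead_nodes cpsr 1 30
```

Any node printed is a candidate for a reboot before re-running the loopssh step.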
Another "feature" is that tcs sometimes cannot start the
observation. If this occurs, try the following:
- Quit the GUIMonitor
- Kill the 20cm_cpsr2_das processes (Ctrl-C them)
- On pegasus:
pegasus% killall -9 cpsr2d
pegasus% cpsr2d
- Restart the GUIMonitor
- Hit master reset - wait until all slaves are back online
- Restart the das's on cpsr1 and cpsr2
- Hit the "connect" button on the main control panel (GUIMonitor)
- Reset all the parameters on the main control panel
- De-select and re-select CPSR2 in TCS
- Try to start an observation again on tcs
Note that killing cpsr2d will reset all the cpsr2 header parameters
to their defaults, which may include turning off level setting. Once
you have the das programs started, hit the "connect" button on the GUI
front panel and go through all the entry boxes, ensuring that everything
is set correctly by typing in the value and hitting "enter".
Whenever you restart cpsr2d, be sure to de-select and re-select CPSR2
in TCS afterwards. This re-establishes the network control connection.
In fact, you might want to try this step first, to make sure the
problem isn't on the TCS end.
Archived Hints
In normal circumstances, most problems can be fixed using the "Master
Reset" button on the manual control panel. This will kill all the CPSR2
related software on the cluster and re-start it, restoring the instrument
to a functional state. It will also kill the ObsMonitor and das scripts
(but not the monitoring GUIs); these will have to be re-started manually.
If "Master Reset" fails to work, read further...
Frozen GUIMonitor or GUIManager
- The most likely cause of a frozen GUI is a machine crash
or daemon hang. We are working on better ways to detect this
situation, but for now the best thing to do is loopssh around the
cluster (see chapter 2) and kill the dm_daemons if it is the
GUIManager that has frozen, or the ekg_daemons if it is the
GUIMonitor. If the loopssh gets stuck on a particular machine,
chances are that this is the one responsible and it will have
to be re-booted.
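The pairing between each GUI and its daemon is easy to mix up under pressure. As a small illustration, the sketch below encodes that pairing and prints (without running) the matching loopssh line as it appears in the current hints. The function names daemon_for_gui and restart_command are hypothetical, and the node range 1 30 is simply copied from the examples in this chapter.

```shell
# Sketch only: map a frozen GUI to the daemon this chapter says to restart.

# daemon_for_gui GUI: prints the daemon name, or fails for an unknown GUI.
daemon_for_gui() {
    case "$1" in
        GUIManager) echo dm_daemon ;;
        GUIMonitor) echo ekg_daemon ;;
        *) echo "unknown GUI: $1" >&2; return 1 ;;
    esac
}

# restart_command GUI: prints the loopssh line to type on pegasus.
# It deliberately does not execute anything.
restart_command() {
    _daemon=$(daemon_for_gui "$1") || return 1
    printf 'loopssh cpsr 1 30 "killall -9 %s;%s"\n' "$_daemon" "$_daemon"
}

# Example:
#   restart_command GUIMonitor
```

Printing the command rather than executing it keeps the operator in control of when the daemons are actually killed.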
One of the primary nodes fails to initialize
- All the system parameter changes required to run the CPSR2
software are done at boot time, so in theory there should rarely
be any trouble initializing the nodes anymore. The most likely
explanation is that the connection between the GUI and the
ekg_daemon on the machine has broken. Try restarting the
daemon.
One of the secondary nodes refuses to come online
- Same advice as for the primary node failure to initialize.
20cm_cpsr2_das reports an error at runtime
- Provided the primary nodes have been initialized properly, this
should not happen. Even if some of the machines have lost their
gigabit connection, 20cm_cpsr2_das will time out after a few seconds,
mark the node bad and continue with the startup sequence. The best
thing to do in this situation is simply re-initialize the primary
nodes.
20cm_cpsr2_das ignores the "GO" signal
- cpsr2d is probably in a strange state. Restart it by following the
steps in the next section.
- If restarting cpsr2d fails to solve the problem, the FFD might be
stuck. Try the test procedures detailed in the FFD/EDT section below.
If the FFD refuses to generate data, power cycle it.
20cm_cpsr2_das crashes when it receives the "GO" signal
- You have probably encountered the infamous UTC start bug, or one
of its subtle variations. This can be confirmed by reading the error
message in the 20cm_cpsr2_das terminals. If it says something like
"UTC not found in header", follow these steps to fix things:
- Hit the "Abort" button on the GUIMonitor manual control page.
- Quit the GUIMonitor.
- On pegasus:
pegasus% killall -9 cpsr2d
pegasus% cpsr2d &
- Restart the GUIMonitor.
- Hit Master Reset and restart the 20cm_cpsr2_das processes.
- Try to observe. It should work first time.
- If the error message mentions Yamasaki test failures, follow the
procedures outlined below for verifying the integrity of the FFD/EDT
system.
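The choice between the two procedures above depends only on what the error message says, so it can be sketched as a simple text match. This is an illustration, not an existing CPSR2 tool: the function name classify_das_error and the labels it prints are hypothetical, and the guide only says the message is "something like" the UTC text, so the patterns may need adjusting to the real output.

```shell
# Sketch only: decide which procedure applies from 20cm_cpsr2_das output.
# Reads the terminal text on standard input. Matching is case-sensitive
# and the exact wording of real messages is an assumption.
classify_das_error() {
    _msg=$(cat)
    case "$_msg" in
        *"UTC not found in header"*) echo "restart-cpsr2d" ;;  # UTC start bug
        *Yamasaki*)                  echo "check-ffd-edt" ;;   # FFD/EDT checks
        *)                           echo "unknown" ;;
    esac
}

# Example, with the terminal output saved to a file of your choosing:
#   classify_das_error < das_output.txt
```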
Memory buffers overflow on a primary node
- This is usually symptomatic of another problem somewhere else
in the system. Here are the steps to follow:
- Hit the "Abort" button on the GUIMonitor manual control page.
- Find out which secondary node the primary was sending to at
the time of the overflow. This information can be found on the
GUIMonitor primary node panel.
- Go to the GUIMonitor closeup page for this secondary node.
- If one or both of the cpsr2_recv and cpsr2_dbdisk processes have
crashed, hit the green "Bring Online" button.
- If everything on the secondary node looks good, you might just
be experiencing a slow network. Try again and see how things go.
- Now go back to the primary node page and initialise both the
primary nodes. You will have to restart the 20cm_cpsr2_das processes
as well.
- Be aware that a buffer overflow and subsequent abort can leave
cpsr2d in a bad state. If you have trouble restarting the observation,
see the previous three troubleshooting sections.
The folded profiles look strange
- Make sure all the cables are plugged into the "engineering
test rack" upstairs in the correct order. If the bands and/or
polarisations have been swapped, the signal will be de-dispersed
to the wrong frequency. This should not be a problem unless the
cables have been disconnected and reconnected for some reason.
The pulse arrival times look strange
- Check that the primary nodes' internal clocks
are reporting the correct time. If they are not, follow the
directions in chapter 2.
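A quick first check for a grossly wrong clock can be sketched as below. It compares a node's UTC time in whole seconds with the local machine's, so it only catches large errors and is no substitute for the procedure in chapter 2. The names get_epoch_ssh, GET_EPOCH, clock_skew and check_clock are hypothetical, the "date +%s" format is a common extension rather than strict POSIX, and treating cpsr1 and cpsr2 as the primary nodes is an assumption based on where the das programs are started.

```shell
# Sketch only: flag a gross clock error on a node (one-second resolution).

# Default way to read a node's UTC epoch seconds: run date over ssh.
get_epoch_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" date -u +%s </dev/null
}
# GET_EPOCH is a hypothetical hook so the reader can swap in another method.
GET_EPOCH=${GET_EPOCH:-get_epoch_ssh}

# clock_skew A B: prints the absolute difference |A - B| in seconds.
clock_skew() {
    _skew=$(( $1 - $2 ))
    if [ "$_skew" -lt 0 ]; then _skew=$(( 0 - _skew )); fi
    echo "$_skew"
}

# check_clock HOST MAX: prints the skew; succeeds if it is at most MAX seconds.
check_clock() {
    _remote=$($GET_EPOCH "$1") || return 2   # 2 = could not read the clock
    _local=$(date -u +%s)
    _s=$(clock_skew "$_remote" "$_local")
    echo "$1 skew ${_s}s"
    [ "$_s" -le "$2" ]
}

# Example:
#   for h in cpsr1 cpsr2; do check_clock "$h" 2 || echo "check $h"; done
```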
You suspect the FFD/EDT system is not working properly
- To check the integrity of data leaving the EDT cards, run
/psr/cvshome/sba/yama_read and use /psr/cvshome/sba/ffd2_serial to
start and stop the digitiser. Before attempting this, hit the primary
node panel reset button for cpsr2 to make sure nothing else is trying
to use the serial port.
- If the EDT cards get into a strange state, you can reload their
configuration settings with /opt/EDTpcd/pcdload, which MUST be run
from the /opt/EDTpcd/ directory.
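Because pcdload depends on its working directory, one way to avoid running it from the wrong place is a small wrapper that changes directory in a subshell first. This is a generic sketch; run_in_dir is a hypothetical helper, and the guide does not say whether pcdload takes any arguments, so none are shown.

```shell
# Sketch only: run a command with a given working directory.
# The function body is a subshell, so the caller's directory is unchanged.
# run_in_dir DIR CMD [ARGS...]
run_in_dir() (
    dir=$1
    shift
    cd "$dir" || exit 1   # refuse to run the command from the wrong place
    "$@"
)

# Example:
#   run_in_dir /opt/EDTpcd /opt/EDTpcd/pcdload
```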
The primary nodes refuse to send data to one of the secondaries
- The data acquisition code reads a text file to decide which nodes
it is going to attempt to connect to. This list lives in /home/cpsr/20cmlist
so make sure it is up-to-date! A previous observer may have deleted one of
the machines from this file for some reason.
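Checking the list by eye is error-prone, so a sketch of an automated check follows. It assumes the file holds one hostname per line with nothing else on the line, which the guide does not confirm, and the range of nodes to expect is up to the operator. The function name missing_nodes is hypothetical.

```shell
# Sketch only: list expected nodes that are absent from the node list.
# Assumption: the file has exactly one hostname per line.
# missing_nodes FILE PREFIX FIRST LAST
missing_nodes() (
    file=$1 prefix=$2 n=$3 last=$4
    while [ "$n" -le "$last" ]; do
        # -x whole-line match, -F fixed string, -q no output
        grep -qxF "$prefix$n" "$file" || echo "$prefix$n"
        n=$((n + 1))
    done
)

# Example (the range 1 30 is illustrative):
#   missing_nodes /home/cpsr/20cmlist cpsr 1 30
```

The whole-line match matters here: without it, an entry such as cpsr10 would mask a missing cpsr1.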