CHAPTER 2: Setting up the software
LAST UPDATE: 26/04/2005

The software start-up procedure has been radically overhauled recently. Follow these easy steps after cabling the system:
1. Quit TCS.
2. Run the Xrunmb script appropriate to your observing mode:
Xrunmb 50cm     (for 50cm only)
Xrunmb 20cm     (for dual-band 20cm)
Xrunmb 1050cm   (for 10cm on some machines and 50cm on the others)
(You will be prompted for the cpsr password. Then umpteen windows will pop up everywhere, the last of which is the scriptMonitor, a handy ASCII program for keeping track of the current status of the cluster. Among these windows are ones labelled PRIMARY CPSR1 and PRIMARY CPSR2. There are four graphics windows: two for level setting and two for monitoring the pulsars. When, and only when, you see the sign saying NOW START TCS should you do so. This whole procedure usually takes about 90 seconds.)
3. Start TCS on any observatory machine.
Select Pulsar, Expert, then load the appropriate schedule file.
Don't double-click on the filename! Just select it and click ok.
This is a glish feature!
We recommend you start on a CAL, then stop and wait until it appears on the CPSR2 ObsMonitors. This will take a few minutes. It is vital that you have the levels set properly or else you will never get anywhere. Before running TCS it is not a bad idea to run the lorun command with an appropriate argument on something like perseus:
lorun ~/losetup/p456_20cm.cmd (20cm mode)
lorun ~/losetup/swin1050cmcpsr2.cmd (1050 mode)
Most Common Bugs.
Bug Class 1. Cannot take data at all.
1. Due to a recent change of frequency, the levels cannot be set properly by cpsr2_vmon.

You are doomed! You must rerun the lorun script or fiddle with the attenuator settings to get enough power in the IFs. Once you have done this, hitting MASTER RESET on the GUIMonitor and restarting after a 60 second wait seems to work. You should have about 3 units of power in the LOGUI next to the L40 settings.
2. Despite observing happily for hours, all of a sudden you get a primary BUFFER WARNING message followed by a primary crashing.

This is due to an unfortunate logic error in the das scripts. Again, a master reset should fix things after a 60 second delay. If this fails, running Xrunmb 20cm (or whatever) will destroy everything and start from scratch. Useful at 3am!
There is another class of problem in which you take data but, for one reason or another, never see the pulsar.
Bug Class 2. Data taken ok, but no pulsar.
1. The pulsar never appears in the ObsMonitors. This is likely to be because the pulsar ephemeris is not in the $TEMPO/tzpar directory on cpsr1. Copy the format of something like 1909-3744.par and try again. In fact, ensure all your pulsars have an ephemeris before you start. You can check whether the psrdisp jobs are failing by looking at files ending in .processingFAILURE in $MONDIR on cpsr1 (see the first sketch after this list). That will be a big hint!
2. You observe, but despite knowing the pulsar is "real", it just looks like noise!? This may be because your ephemeris is wrong, or your polynomial has insufficient parameters to cope with the tight orbit of a binary. Check that your pulsar appears in /import/cpsr1/archives/CPSR2/psrdisp_args.db (see the second sketch after this list) and add coefficients as appropriate to the command line options for psrdisp. Follow the format exactly!
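First, a hedged check for failed psrdisp jobs (item 1 above), run from pegasus and assuming $MONDIR is set in the login environment on cpsr1:

ssh cpsr1 'ls -lt $MONDIR/*.processingFAILURE'

Recently dated files here mean the corresponding psrdisp jobs are failing.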
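Second, a minimal check for an entry in the psrdisp argument database (item 2 above), using 1909-3744 as the example name and assuming the /import path is mounted on your machine:

grep 1909-3744 /import/cpsr1/archives/CPSR2/psrdisp_args.db

If grep prints nothing, the pulsar has no entry yet.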
Old Manual Starts Here

Pegasus is connected to the racks upstairs by a gigabit ethernet fibre cable. It is crucial to the operation of the entire instrument. It is also the gateway machine for the cluster. None of the cluster nodes are directly connected to the internet (for security reasons). The only machines they can see are pegasus and anything else on the cluster LAN. In fact, since pegasus does no IP forwarding, none of the cluster machines can see the wider internet at all.
The cluster machines all run various binaries that handle the transfer of data. The primary nodes are responsible for shifting data from the FFD to the secondaries via ethernet, the secondary nodes receive this data and store it on their local disks. They also reduce the data using coherent de-dispersion algorithms, but this is covered in chapter 4.
Each primary node only sends to one secondary at a time, so there are only ever two gigabit connections active. In the standard operating mode, the primaries fire a gigabyte of baseband data to each secondary in turn. cpsr1 only sends to the odd numbered secondaries, and cpsr2 sends to the even numbered ones. This means that the last secondaries on the list will receive their first block of data several minutes after the first machines, but this represents only a small startup lag in the data reduction. When observing in 50 cm mode (when the available receiver passband is only 64 MHz wide) only one half of the board and one primary node are used. In this mode, that one primary sends to all the secondary nodes so that the full processing power of the cluster can still be used.
While the concept is fairly simple, there are a lot of things that can go wrong, and a couple of configuration issues to keep in mind. To help keep watch on things it is necessary to run a monitoring program that reports on the status of the cluster and allows you to change its configuration if necessary. This program comes in two variants, one has a graphical interface with buttons and menus and should only be run when sitting in the control room. The other shows a terminal based text-only summary of the vital statistics, but does not allow any control commands to be run. The first program is called "GUIMonitor" and the second is called "TEXTMonitor". Normally, TEXTMonitor is used from a remote location to allow monitoring over a low-bandwidth connection.
The rest of this chapter describes how to start GUIMonitor and use it to prime the cluster.
Section 1: Preliminary Setup

While GUIMonitor is the main interface to the cluster, it requires several other processes that run in the background. These processes are fairly stable and should usually not cause trouble. However, in the event that one or more has crashed, been killed or simply frozen up, you may be required to restart them. There are two classes of support program; the first class are the ones that run on pegasus itself. These are described below:
Utility Software on Pegasus

cpsr2d
This little network daemon is responsible for interprocess communications between the data acquisition code, the GUI and TCS. It is vital to the operation of the system and must always be running on pegasus itself. If you start the GUI and it is unable to connect to the socket provided by cpsr2d, a red warning message will be displayed in the manual control panel. If you see this warning, quit the GUI, open a terminal on pegasus and run the commands:
killall -9 cpsr2d
cpsr2d &
Unfortunately, doing this will revert all the observing parameters to their default values and the daemon will not accept updated values until it has been connected to the two primary nodes. Be sure to run a test observation and reset all the parameters on the main panel to ensure the system is in the state you require.
ntpd

For the purposes of pulsar timing, the start time of each observation must be known very accurately. The FFD does this by listening to the synchronisation pulses generated by the observatory time standards. However these only tell you when the exact beginning of the next second is. The data acquisition software still needs to know the UTC, so that it knows what the time will be when the next sync pulse arrives to start the FFD running. Since the UTC reference only needs to be accurate to within about 300 milliseconds, we use the system clocks in the computers themselves, synchronised with the observatory time standards via Network Time Protocol software. Pegasus runs a job that corrects its clock against the observatory clocks. We have found that this is good enough to keep it within about 10 ms of the actual time, more than adequate for our purposes. Pegasus in turn (being the only external machine the cluster nodes can talk to) runs a Network Time Protocol Daemon, ntpd, which the cluster nodes can connect to to correct their clocks. This keeps all the cluster nodes within about 10 ms of the true time.
You can check the clocks on the machines yourself if you suspect a problem.
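A hedged sketch, using the loopssh helper described below and standard tools (ntpq ships with the NTP distribution and is assumed to be installed on pegasus):

date -u                        # UTC according to pegasus
loopssh cpsr 1 30 "date -u"    # UTC according to each cluster node in turn
ntpq -p                        # on pegasus; a '*' in the first column marks the peer ntpd is synced to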
Note that if ntpd is ever restarted on pegasus, it takes a few minutes before the program decides the system clock is stable enough to be useful. During this time it will refuse external connections. The clocks in the cluster nodes should be stable enough to cope with the temporary loss of ntpd on pegasus, though you should always try to get it running again as soon as possible.
festival_server

This program is responsible for converting warning messages from the GUIMonitor into speech. It is not in any way essential to the running of the instrument; however, it is very helpful. The audio warnings mean you can be less concerned with watching all the numbers yourself: if anything goes outside of normal operating parameters, pegasus will tell you about it. To run this program, enter the following command on pegasus (the invocation is an assumption here, inferred from the matching kill command below):
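festival_server &    # assumed invocation; the & keeps your terminal free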
Note that there is nothing to warn you if this program is not running. If it is running, you should hear the phrase "System Initialised" when you start the GUIMonitor. Sometimes it will crash or hang and you will have to run
killall -9 festival_server
before trying again.
ssh-agent

The cluster nodes do not use any sort of network-based authentication scheme; they all have local password files. To log onto them, you will need to find a staff member who can tell you the password. To make life easier, pegasus is set up to run an ssh-agent at boot time. This enables you to log into any cluster node without a password, from any local Xterm, provided you have authorised the agent using the cpsr passphrase. Again, staff can tell you what it is. All you have to do to authorise the agent is run the command below (ssh-add is the standard OpenSSH way to load a key into a running agent, and is assumed here):
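ssh-add    # assumed: standard OpenSSH command to load your key into the agent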
and supply the passphrase when requested. This will enable password-free logins to all the cluster nodes.
loopssh

This command only works after you have enabled the ssh-agent. It is a simple script that allows you to run the same command on all cluster nodes (or a sequential subset thereof). It is especially useful if you need to restart all of the ekg_daemons or dm_daemons (see below). The syntax is as follows:
loopssh cpsr 1 30 "command sequence"
where "cpsr" is the prefix used to name the machines, "1 30" is the
range of numbers that will be sequentially appended to the prefix
and "command sequence" is any command that you would normally run
on the local machine, "ekg_daemon" or "killall -9 ekg_daemon" for
example. If the command you intend to run does not return, you can
make the ssh connection fork into the background by appending a -f
argument to the above line, after the command sequence. Beware of
doing this unless absolutely necessary, because leaving 30 ssh
connections hanging is not always the best of ideas...
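For example, to restart every ekg_daemon on the cluster (see the next section for what these daemons do):

loopssh cpsr 1 30 "killall -9 ekg_daemon"
loopssh cpsr 1 30 "ekg_daemon"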
Utility Software on the Cluster Nodes

ekg_daemon
These little programs run in the background on all the cluster nodes and are responsible for reporting back to the central monitoring system. They decide whether a cluster node is ready to receive data or not. They also run the setup scripts that prepare the nodes for their duty. These daemons should be running at all times. They do not, however, start automatically when a machine is rebooted, so if for any reason you have to power-cycle or soft-reboot a machine, be sure to run the following command from pegasus:
ssh cpsrXX ekg_daemon
Where XX is replaced by the node number, between 1 and 30. If you authorised the ssh-agent on pegasus, no password is required.
dm_daemon

These are very similar to the ekg_daemons in that they sit in the background and should always be running. The dm_daemons talk to the data manager GUI and facilitate the processing and archiving of raw data. They must also be started manually if a machine is rebooted. Simply run this command from pegasus:
ssh cpsrXX dm_daemon
Section 2: The CPSR II Monitor

To fire up GUIMonitor and check the health of the cluster you will have to log on to pegasus in the lower control room. Talk to one of the staff members to obtain a password. Start up a terminal and secure shell to cpsr30. Note that the GUI can be run from any of the cluster nodes, but it is best not to use the primaries for anything other than data acquisition. We usually run it from cpsr30 just to be consistent. To start the GUI, simply enter the command below (an assumption here; the binary is taken to share the program's name):
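GUIMonitor &    # assumed invocation; run it on cpsr30 as described above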
You will be presented with the manual control panel, which has text entry boxes for configuring the observing parameters and buttons for starting and stopping data acquisition. At the top of the window you will see a set of tabs that allow you to select a variety of different displays. Below is a description of the information available under each tab.
Manual Control Panel

The Manual Control Panel is used when you wish to observe without the help of TCS. You can enter all the relevant parameters like frequency, bandwidth, sky coordinates and so on. Note that none of these entry boxes have any effect on the observatory hardware. The values they contain are simply passed to the header of the baseband data files so they can be processed correctly. It is up to the observer to ensure the values in these fields match what the telescope and down-conversion chain are doing.
The secondary node settings should normally be left at the default values unless you are sure you know what you are doing. These parameters control the size of the RAM buffers and the amount of data sent to each machine every round.
The buttons in the lower section are used to send commands to the data acquisition system. They work as follows:
Target List Page

For use in future versions...
Cluster Summary Page

This page allows you to monitor the health of the cluster in real time. It shows you the load, free disk space, status of the data acquisition system and the health of the gigabit ethernet interface on each machine, as well as an indication of the network transfer rates. Status information is gathered by a small daemon process running on each of the cluster machines (the ekg_daemons). The GUI talks to these daemons via a network socket. In the unlikely event that one of these daemons crashes, you can restart it by secure shelling to the machine and running the command:
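ekg_daemon

exactly as described under Utility Software on the Cluster Nodes above.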
There are several columns of data; each row in the table represents one cluster machine, whose name is given in the first, left-most column.
Load statistics are displayed in the second column and are mostly used for diagnosing slow machines and for making sure that data is being processed. Load on the secondaries should be around 2.0 when data is being reduced. If it goes beyond 5 or 6, you probably have rogue processes on the machine.
Disk space is shown in column three and refers to the scratch partition used for storing baseband data before it is processed. If the CPUs can keep up with the data rate, these numbers should not change much. However if you are observing a high DM pulsar the processing will lag behind the recording and the disks will start to fill.
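If you want to cross-check these numbers from a shell, a hedged sketch using the loopssh helper and the standard df tool (the scratch partition's mount point is not given here, so this simply lists all filesystems on the secondaries, cpsr3 through cpsr30):

loopssh cpsr 3 30 "df -h"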
The status message in the fourth column relates to the overall state of the data acquisition system on each machine. It can be any one of the following:
Note that some aspects of the GUI only update around 5 times every minute, so there is a bit of lag between reality and what the diagnostic display reports.
The last column displays the time (in seconds) it took to transfer the last chunk of data to each machine. The number you get depends on the mode you are running in, but on average you should expect a 1 GB transfer to take somewhere between 16 and 17 seconds. If you start to see times higher than this, you are in trouble. Keep a close eye on the primary buffer displays and be prepared to stop the observation (see chapter 3) if the transfers are taking too long.
The Primary Node Panel

The primary node panel is split vertically in two, one half for each primary node.
The primary nodes run unique processes that handle the distribution of data to the secondary nodes. This page lets you monitor the characteristics of the system that are specific to the two primaries. Of most importance is the shared memory status display. The primary nodes have a large chunk of memory set aside in a group of buffers. These buffers are filled with data from the EDT cards and drained by gigabit ethernet transfer.
The default settings for number and size of buffers should be sufficient for most situations. Leave them be unless you are sure you need different settings.
The rate of network transfer should be sufficient to keep pace with the incoming data, in which case you will never use more than one or two of the buffer slots. However if something goes wrong and the network cannot keep up, the buffers will start to fill. There are enough buffers to give you about 16 seconds of grace, which means the system can recover from the occasional slow link. Keep an eye on the amount of blue in the progress bar. If it ever reaches the right hand side, the data acquisition code will abort. If you see the number of free slots steadily dropping, stop the observation and figure out what the problem is. If the buffers ever get more than 2/3 full, the GUI will take over and try to stop the observation without your prompting. If this happens, it will start saying things like "Buffer Overflow" or "Emergency Override", provided the festival_server is running.
The only other things you should need to use on this page are the reset and initialise buttons. The reset button kills all the data acquisition related processes and destroys the shared memory used by the system. The initialise button also does this, but it then rebuilds everything afterwards. Thus if you overflow the buffers you will need to use the initialise button on the relevant primary node (or the Master Reset button on the main panel).
The Secondary Node Closeup Pages

The final 28 pages in the GUIMonitor all look very similar. There is one tab for each secondary node (labelled with the node number). These pages display information about the data acquisition system and memory space on the secondary nodes. Usually this information is only useful in diagnosing the cause of a crash, or if you want to individually enable or disable machines.
Whilst the primary nodes run the code responsible for getting data from the FFD to the network, the secondaries run code responsible for getting data from the network and storing it on disk. For this task they use memory buffers identical in structure (and size) to the primary node buffers, but fewer of them. The secondaries all have fast SCSI drives and their disk I/O rates easily keep pace with the data flow, so it is rare that the secondary buffers ever fill. Each secondary node page has a "Memory Buffers Free" display. This should always be within one or two of the "Number of Buffers". If a primary node overflows its ring buffer, it might be worth checking the closeup page of the machine it was sending to. If the disk writing process crashes, the system will not hop to another node; it will just sit and fill all the buffers until everything crashes. If you see that the secondary node has no free buffers, or if either of the crucial recording applications (listed at the bottom of the panel) is down, you will need to initialise it as well as the primary (again, the main Master Reset button takes care of this for you).
The two lines below the "Node Status" display report on the two processes responsible for data acquisition on a secondary node. These are called "cpsr2_recv" and "cpsr2_dbdisk". Both of these processes should be running at all times. If you see that one of them is listed as "DOWN" the node will have to be reset. Note that there is nothing to stop you resetting a node during an observation, as long as it is not the one currently receiving data!
The two buttons at the bottom of the page perform the same tasks as the reset and initialize buttons on the primaries. Hit the "Bring Online" button to do a full reset of the data acquisition system on a secondary node. The "Take Offline" button just kills the system, it does not attempt to restart it. This might be useful if you want to prevent the primaries from sending data to specific machines for some reason.
Section 3: Priming the Cluster

Once you have the GUIMonitor running, hit the "Master Reset" button on the manual control panel. Watch the summary panel and make sure that all the nodes report "online" within about 15 seconds. If some do not, go to their closeup page and see what the problem is. If simply hitting their "Bring Online" button does not fix things, see the troubleshooting information in chapter 5.
Once the secondaries are ready, flip to the primary information panel and make sure nodes 1 and 2 are ready to go. The buffer capacity monitors should be empty. If you have the festival_server running, you will probably get warnings about buffer overflows as the shared memory is deleted and re-constructed. This is normal. The warnings should go away and you should have two clear buffer monitors within about 15 seconds. If the memory buffers fail to clear, try another initialisation. If this fails, see the troubleshooting section.
Once the primary and secondary nodes are all online, you must run the data acquisition code on the primaries. This must be done manually, so you can see the progress information in the controlling terminals. Fire up another two terminals on pegasus and secure shell to cpsr1 in one and cpsr2 in the other. Then run the following command in both terminals:

20cm_cpsr2_das
If you intend to observe at 50 cm, with only one half of the cluster, run 50cm_cpsr2_das in the window corresponding to the node that has the telescope IF connected, and monitor in the other.
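Putting the pieces together, a typical 20 cm start-up looks like this, using only commands named above:

# terminal 1, from pegasus
ssh cpsr1
20cm_cpsr2_das

# terminal 2, from pegasus
ssh cpsr2
20cm_cpsr2_das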
These das scripts just run the main data acquisition program with different target lists.
You should see some messages fly past and the program will eventually reach a point where it prints "Waiting on condition variable..." which means it is ready to start recording data. At this point, everything is poised and set to go.
If the program fails to print this message and instead complains about connection or shared memory errors, see the troubleshooting section in chapter 5.