CHAPTER 1: Setting up the hardware

LAST UPDATE: 25/02/2004

This chapter explains the configuration of the hardware in the two CPSR2 racks in the upstairs control room at the Parkes radio telescope. You only need to consult this first chapter if the cluster has been shut down and de-powered, or the machines suffer from a crash that cannot be recovered using software alone. If all the machines are turned on and ready to go, skip straight to the software setup instructions in chapter two.



Introduction: Hardware Overview

The CPSR2 instrument consists of a fast, flexible digitiser (FFD) board, connected on one end to the Parkes down-conversion chain and on the other end to a cluster of 30 rackmount computers that handle data acquisition and processing. A gigabit fibre ethernet cable runs from the cluster racks (in the upstairs control room) to a Linux workstation (pegasus) in the observer's control room. Pegasus is used to control the system when observing.

The cluster machines are all Dell 2650 servers with dual 2.2 GHz Pentium 4 processors. They are named cpsr1 through cpsr30. Do not confuse the computer called cpsr1 with the DLT-based precursor instrument, CPSR1.

The machines at the top of each rack (cpsr1 and cpsr2) have a different hardware configuration to the rest and are called the "primary nodes". The other 28 machines are all identical to each other and are called "secondary nodes".
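
The naming scheme and the primary/secondary split can be written down in a few lines of Python. This is only an illustrative sketch: the helper names (cluster_hosts, role) are made up for this example and are not part of any CPSR2 software.

```python
# Hypothetical helpers: enumerate the cluster hostnames and their roles.
PRIMARY_NODES = ("cpsr1", "cpsr2")  # the machine at the top of each rack


def cluster_hosts(count=30):
    """Return the hostnames cpsr1 .. cpsr<count> in numerical order."""
    return [f"cpsr{n}" for n in range(1, count + 1)]


def role(host):
    """Classify a hostname as a 'primary' or 'secondary' node."""
    return "primary" if host in PRIMARY_NODES else "secondary"


hosts = cluster_hosts()
secondaries = [h for h in hosts if role(h) == "secondary"]
print(len(hosts), "machines,", len(secondaries), "secondary nodes")
```

A list like this is handy as the input to any loop that has to visit every machine in turn.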

The machines are arranged in a fairly intuitive fashion: the odd-numbered machines are stacked in the leftmost rack, in ascending order from top to bottom. Unfortunately the VME crate housing the FFD takes up so much space that not all 15 of the odd-numbered machines fit in this rack, so the remainder spill over into the rightmost rack, where they are inserted in numerical order between the even-numbered machines.


The whole cluster is serviced by two network layers. The first is standard 100 Mbit ethernet, used for control and monitoring. The second is a high-speed gigabit ethernet system, used for shifting baseband data between machines.

The 100 Mbit layer uses a single dedicated Cisco switch for all 30 machines and is connected via a gigabit fibre uplink module to a desktop machine in the lower control room called pegasus. This means that the highest bandwidth between pegasus and any individual cluster machine is only 100 Mbit/s, but in theory the line can sustain 10 such simultaneous connections before saturating.

The gigabit ethernet layer uses two 24-port switches, one in each rack. They are connected by two crossover lines that should enable 2 Gbit/s transfer rates between the two halves of the cluster, twice as much as should be necessary to support the standard observing modes. The gigabit switch in the left-hand rack is connected by another fibre line to a machine in the tower correlator room that hosts an Apple XRaid unit, providing up to 2 TB of additional storage space that can be accessed at data rates of up to 1 Gbit/s.
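
The bandwidth figures quoted above follow from simple arithmetic, sketched here in Python. The constants are the nominal link speeds from the text. The variable and function names are invented for the example, and protocol overhead is ignored, so treat the results as upper limits.

```python
# Nominal link speeds quoted in the text, in Mbit/s.
NODE_LINK = 100        # each machine's port on the control switch
FIBRE_UPLINK = 1000    # gigabit fibre uplink from the control switch to pegasus
CROSSOVER_LINE = 1000  # one crossover line between the two gigabit switches
CROSSOVER_LINES = 2    # number of crossover lines


def max_full_rate_connections(uplink, per_node):
    """How many per-node links can run flat out before the uplink saturates."""
    return uplink // per_node


inter_rack = CROSSOVER_LINE * CROSSOVER_LINES

print(max_full_rate_connections(FIBRE_UPLINK, NODE_LINK))  # 10
print(inter_rack)                                          # 2000
```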

The FFD accepts four independent IF channels, each 64 MHz wide. In normal operating mode, these IFs come from the down-conversion rack in the upper control room. 64 MHz filters are used to select the appropriate section of the bandwidth offered by whatever receiver is mounted in the focus cabin. The RF signal is mixed down to baseband (actually, to the band-limited section ranging from 64 MHz to 128 MHz) before it enters the FFD.

The board deals with the IFs in two pairs, so in the standard 2 bit recording mode, each half of the FFD receives orthogonal polarisations from one centre frequency. In this way it is possible to observe with full polarimetry at two different sky frequencies, limited only by the passband of the receiver and the down-conversion chain mixer settings. For example, using the coaxial 10/50 cm receiver it is possible to observe simultaneously in the 10 cm and 50 cm bands, though only across 64 MHz wide sections.

Each FFD band is sampled and the digitised bits are packed into a data stream that is fed to the two primary nodes by means of EDT 60 Direct Memory Access cards installed in the PCI slots. The thick, black cables with large, multi-pin connectors carry this data. One half of the board goes to cpsr1, the other half goes to cpsr2. In the usual observing mode, the cluster is split in two, with one set of 14 secondary nodes processing each of the two sky frequencies. There is also a small serial cable running from cpsr2 to the FFD board. This is used for sending commands like "stop", "go" and gain trim settings to the sampler.

The primary nodes both have four SCSI disks, for a total of over 250 GB of storage space. They also have 3 GB of RAM, which is used to buffer the data before it is sent out to the secondary nodes for processing.
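
To get a feel for the data rates involved, the sketch below estimates the stream arriving at each primary node and how long the 3 GB RAM buffer could hold it. It rests on assumptions that the text does not state: that each 64 MHz band is sampled as a real signal at the Nyquist rate (two samples per hertz of bandwidth), that the packed stream has no header overhead, and that "3 GB" means decimal gigabytes, all of it available for buffering. The names are invented for the example.

```python
BANDWIDTH_HZ = 64_000_000     # width of each IF (from the text)
BITS_PER_SAMPLE = 2           # standard 2 bit recording mode (from the text)
IFS_PER_PRIMARY = 2           # one pair of polarisations per primary node
BUFFER_BYTES = 3_000_000_000  # 3 GB of RAM, taken as decimal gigabytes

# Assumption: real-valued Nyquist sampling, two samples per hertz of bandwidth.
SAMPLES_PER_SECOND = 2 * BANDWIDTH_HZ


def primary_rate_bits():
    """Estimated bit rate arriving at one primary node, in bit/s."""
    return SAMPLES_PER_SECOND * BITS_PER_SAMPLE * IFS_PER_PRIMARY


def buffer_seconds():
    """Upper limit on how long the RAM buffer can absorb the stream."""
    return BUFFER_BYTES / (primary_rate_bits() / 8)


print(primary_rate_bits() // 1_000_000, "Mbit/s per primary node")
print(buffer_seconds(), "seconds of buffer at most")
```

Under these assumptions each primary node receives 512 Mbit/s (64 MB/s), and the buffer covers a little under 47 seconds of data. The real margin will be smaller, since the operating system and acquisition software also need memory.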


Each secondary node has two 72 GB SCSI hard disks and 1 GB of RAM. The secondary nodes are fed baseband data via the gigabit ethernet layer and run software (PSRDISP) that coherently de-disperses the raw data into folded archives. Data management software sends folded profiles back to cpsr1 for temporary storage.



Section 1: Cold Start Procedures

This section deals with how to get everything going from scratch, if power to the racks has been turned off. These procedures have become much simpler since the Giganet network layer was removed and replaced with standard gigabit ethernet. However, it is still not an easy procedure and you should allow at least two hours to get everything up and running before you intend to start observing. Now that the system is inside its shielded MFB racks there should be no reason to de-power except in the case of equipment failure or maintenance. Thus in nearly every case observers will encounter the system in a ready state and should only have to worry about software issues.

What follows is intended for experienced observers only. If you are not comfortable with what these instructions ask you to do, then by all means seek out the help of an experienced staff member. If you have to deal with any of the startup scenarios in this section then the system is already in a non-standard state and we recommend people with detailed knowledge of the procedures handle the situation.

At this point, you should take note of some important facts about the Dell 2650 computers:

  • Their power management is a little unusual. The backplane of the machine (including the PCI bus) is powered up from the moment you stick a live IEC cable in the power socket. There is no hard "off" switch and the only way to fully power cycle a machine is to unplug it from the outlet, wait about 10 seconds and plug it back in again. The power button on the front of the machine only controls the power hungry internal systems like disks and the CPU. Shutting the machine down and turning it off from the front panel is not enough to fully reset the network cards, or EDT cards in the case of the primary nodes. This seems to be a design "feature" that would probably be valuable were we actually using the machines as servers. Unfortunately for us it is just a bit of a pain.
  • To turn a machine on once the backplane has power, just hit the little round button with the green light in the centre on the front panel. They're stiffly sprung, so you'll need to use a bit of pressure. One click is enough to start the power-up cycle.
  • To shut down a machine, first run the "shutdown -h now" command (see the software section for more info on this) as root, and wait for the machine to reach runlevel zero. Once there, press and hold the power button on the front panel until you see some of the lights go out and hear the fans spin down. This should be almost instant.

Let's assume that everything has been cleanly shut down and the power to the racks has been disabled. In general the network switches should be powered up first. However these days the racks are all fed from the same few breakers so it is difficult to do anything but shut down whole sections at a time. Simply turn on the main breakers above the racks and ensure that the individual phase breakers (follow the orange cables) inside the racks are on.

This should set off a cascade of noise and lights inside the racks as the computers go through their power-up cycle. Note that the machines should switch themselves into the standby state once they have gone through the initial power-on cycle. They will not actually boot until you tell them to do so. The network switches will power up and go into fully operational mode immediately. The FFD crate may or may not have its 110 V power supply turned on. If the orange switch on the front of the VME crate is not lit, turn it on.

Simply power up all the Dell machines in the rack by holding in their power buttons until they come to life. These days, the machines usually boot and find their network without any trouble, but there used to be a few issues associated with the ethernet layers. Most of the rest of this section deals with how to trouble-shoot things should a machine fail to find the network. Be aware that the Ethernet layers, while pretty stable in general, have occasional glitches. The most annoying of these is the famous "flashing light" problem. We think this problem was fixed by updating the drivers that control the on-board ethernet controllers, but this information is included for historical reasons.

When a machine is powered up it sometimes fails to correctly initialise its ethernet ports. This is characterised by the activity light on the port rapidly flashing on and off at a steady rate. Sometimes "/etc/init.d/network restart" fixes this problem, but usually you will have to reboot or power cycle the machine.

Use one of the cluster monitor interfaces (see chapter 2) to check the status of the networks once all the machines have come to life. You will have to manually sort out any that do not want to talk to the rest of the system. This can take some time.
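
If the monitor interfaces are not available, a quick reachability sweep can be improvised. The Python sketch below is not the cluster monitor from chapter 2 and is not part of the CPSR2 software. It assumes a Linux "ping" that accepts -c (count) and -W (timeout in seconds), and that the names cpsr1 to cpsr30 resolve from the machine it runs on. It only tests whichever network layer those names resolve to, so the gigabit layer would need its own check. The function names are made up for the example.

```python
import subprocess


def ping_once(host, timeout_s=2):
    """Return True if `host` answers a single ICMP echo request.

    Assumes a Linux ping that accepts -c (count) and -W (timeout, seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def silent_nodes(hosts, probe=ping_once):
    """Return the hosts for which `probe` reports no reply, in input order."""
    return [host for host in hosts if not probe(host)]


# Demonstration with a stand-in probe, so nothing is actually pinged here.
all_hosts = [f"cpsr{n}" for n in range(1, 31)]
pretend_down = {"cpsr7", "cpsr22"}  # made-up example, not real fault data
print(silent_nodes(all_hosts, probe=lambda h: h not in pretend_down))
```

On a machine that can see the cluster, calling silent_nodes(all_hosts) with the default probe performs the real sweep. Any name it returns is a candidate for the manual attention described above.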

Once you reach the point where all the machines can talk to the control system and all the gigabit ethernet status reports are good, everything should be ready to go. See chapter two for instructions on how to set up the machines in the lower control room to run the observing software.

The only other reason to open the racks is if a machine crashes so badly that it has to be reset with the power switch. If one of the nodes locks up completely (see the trouble-shooting section for information about how to detect this), figure out which one it is and reset it by holding the power button in until it shuts down. Wait a few seconds and press the power button again to bring it back to life. The machines all have little LCD panels on the front that display their name and number, so it should be obvious which is which.

Sometimes, the machines in the bottom of the right-hand rack will flash a little orange warning about temperature, but this is just because they are next to the air-conditioning vent and are thus running slightly too cold. This is not considered to be a problem. Overheating IS a problem, but the racks are fitted with exhaust temperature sensors that report to the observatory environmental monitoring system, so an overheated rack will sound a showtel alarm on the monitors in both control rooms. If this happens when no staff are present, go to the following URL:

http://www.parkes.atnf.csiro.au/cgi-bin/monitoring/equip_mon.cgi

This page displays the temperature readings for the two CPSR2 racks. If the temperature drifts above about 40 degrees, go upstairs and open the rack doors to allow ambient air in. This will only ever happen if the A/C fails, so staff should be notified as soon as possible.
