
CHAPTER 1: Setting up the hardware
LAST UPDATE: 25/02/2004
This chapter explains the configuration of the hardware in the two CPSR2 racks in the upstairs control room at the Parkes radio telescope. You only need to consult this first chapter if the cluster has been shut down and de-powered, or if the machines have suffered a crash that cannot be recovered using software alone. If all the machines are turned on and ready to go, skip straight to the software setup instructions in chapter two.
Introduction: Hardware Overview
The CPSR2 instrument consists of a fast, flexible digitiser (FFD) board, connected on one end to the Parkes down-conversion chain and on the other end to a cluster of 30 rackmount computers that handle data acquisition and processing. A gigabit fibre ethernet cable runs from the cluster racks (in the upstairs control room) to a Linux workstation (Pegasus) in the observer's control room. Pegasus is used to control the system when observing.
The cluster machines are all Dell 2650 servers with dual 2.2 GHz Pentium 4 processors. They are named cpsr1 through cpsr30. (Do not confuse the computer called cpsr1 with the name of the DLT-based precursor instrument, CPSR1.) The machines at the top of each rack (cpsr1 and cpsr2) have a different hardware configuration to the rest and are called the "primary nodes". The other 28 machines are all identical to each other and are called "secondary nodes". The machines are arranged
in a fairly intuitive fashion: the odd numbers are stacked in the
leftmost rack, in ascending order from top to bottom. Unfortunately,
the VME crate housing the FFD takes up so much space that not all 15
of the odd machines fit in this rack, so they spill over into the
rightmost rack, where they are inserted in numerical order between the
even numbered machines.
The whole cluster is serviced by two network layers. The first is standard 100 Mbit ethernet, used for control and monitoring. The second is a high-speed gigabit ethernet system, used for shifting baseband data between machines. The 100 Mbit layer uses a single dedicated Cisco switch for all 30 machines and is connected via a gigabit fibre uplink module to a desktop machine in the lower control room called Pegasus. This means that the highest bandwidth between Pegasus and any individual cluster machine is only 100 Mbit/s, but in theory the line can sustain 10 such simultaneous connections before saturating. The gigabit ethernet layer uses two 24-port switches, one in each rack. They are connected by two crossover lines that should enable 2 Gbit/s transfer rates between both halves of the cluster, twice as much as should be necessary to support the standard observing modes. The gigabit switch in the left hand rack is connected by another fibre line to a machine in the tower correlator room that hosts an Apple XRaid unit, providing up to 2 TB of additional storage space that can be accessed with data rates up to 1 Gbit/s.
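The bandwidth figures quoted above follow from simple arithmetic on the link speeds. The sketch below reproduces them; the function and constant names are illustrative only, and it assumes ideal links (no protocol overhead, and the two crossover lines sharing traffic perfectly), which a real network will not quite achieve.

```python
# Link speeds quoted in the text, in Mbit/s.
NODE_CONTROL_LINK = 100   # each node's port on the 100 Mbit control layer
PEGASUS_UPLINK = 1000     # gigabit fibre uplink from the Cisco switch to Pegasus
CROSSOVER_LINK = 1000     # one gigabit crossover line between the two rack switches
CROSSOVER_COUNT = 2       # number of crossover lines


def full_rate_connections(uplink_mbit, per_node_mbit):
    """How many nodes can talk to Pegasus at full speed before the uplink saturates."""
    return uplink_mbit // per_node_mbit


def inter_rack_capacity(link_mbit, link_count):
    """Ideal combined capacity of the crossover lines (assumes perfect load sharing)."""
    return link_mbit * link_count


connections = full_rate_connections(PEGASUS_UPLINK, NODE_CONTROL_LINK)
capacity = inter_rack_capacity(CROSSOVER_LINK, CROSSOVER_COUNT)

# The text says this capacity is twice what the standard observing modes need,
# which implies a requirement of half the capacity.
implied_requirement = capacity // 2

print(f"Full-rate connections to Pegasus: {connections}")
print(f"Inter-rack capacity: {capacity} Mbit/s")
print(f"Implied observing requirement: {implied_requirement} Mbit/s")
```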
The FFD accepts four independent IF channels, each 64 MHz wide. In normal
operating mode, these IFs come from the down-conversion rack in the
upper control room. 64 MHz filters are used to select the appropriate
section of bandwidth offered by whatever receiver is mounted in the focus
cabin. The RF signal is mixed down to baseband (actually, to the band
limited section ranging from 64 MHz to 128 MHz) before it enters the FFD.
The board deals with the IFs in two pairs of two, so in the standard 2-bit
recording mode, each half of the FFD receives orthogonal polarisations from
one centre frequency. In this way it is possible to observe with full
polarimetry at two different sky frequencies, limited only by the passband
of the receiver and the down-conversion chain mixer settings. For example,
using the coaxial 10/50 cm receiver it is possible to observe simultaneously
in the 10 cm and 50 cm bands, though only across 64 MHz wide sections.
Each FFD band is sampled and the digitised bits are packed into a
data stream that is fed to the two primary nodes by means of
EDT 60 Direct Memory Access cards installed in the PCI slots. The thick,
black cables with large, multi-pin connectors are responsible for getting
the data from one place to the other. One half of the board goes to cpsr1,
the other half goes to cpsr2. In the usual observing mode, the cluster
is split in two, with two sets of 14 secondary nodes processing each of the
two sky frequencies. There is also a small serial cable running from cpsr2
to the FFD board. This is used for sending commands like "stop", "go" and
gain trim settings to the sampler.
The primary nodes both have four SCSI disks, for a total of over 250 GB of
storage space. They also have 3 GB of RAM, which is used to buffer the data
before it is sent out to the secondary nodes for processing.
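To get a feel for the load on each primary node, the data rate can be estimated from the figures above. This is only a rough sketch under stated assumptions: the text does not give the sample rate, so real-valued Nyquist sampling (two samples per Hz of bandwidth) is assumed; the whole 3 GB of RAM is treated as buffer, which is an upper bound; and 1 GB is taken as 10^9 bytes. All names are illustrative.

```python
BANDWIDTH_HZ = 64_000_000       # each IF is 64 MHz wide (from the text)
BITS_PER_SAMPLE = 2             # standard 2-bit recording mode (from the text)
IFS_PER_PRIMARY = 2             # each half of the FFD carries two polarisations
PRIMARY_RAM_BYTES = 3 * 10**9   # 3 GB of RAM, assuming 1 GB = 10^9 bytes


def if_bit_rate(bandwidth_hz, bits_per_sample):
    """Bit rate of one IF, ASSUMING real-valued Nyquist sampling (2 samples per Hz)."""
    samples_per_second = 2 * bandwidth_hz
    return samples_per_second * bits_per_sample


per_if_bits = if_bit_rate(BANDWIDTH_HZ, BITS_PER_SAMPLE)
per_primary_bits = per_if_bits * IFS_PER_PRIMARY
per_primary_bytes = per_primary_bits // 8

# Upper bound on how long a primary node could hold data in RAM
# if every byte of RAM were available as buffer.
buffer_seconds = PRIMARY_RAM_BYTES / per_primary_bytes

print(f"Per IF: {per_if_bits / 1e6:.0f} Mbit/s")
print(f"Per primary node: {per_primary_bits / 1e6:.0f} Mbit/s "
      f"({per_primary_bytes / 1e6:.0f} MB/s)")
print(f"RAM buffer covers at most {buffer_seconds:.1f} s of data")
```

Under these assumptions each primary node takes in about half a gigabit per second, so the RAM buffer holds well under a minute of data. The secondary nodes therefore need to keep up with the stream more or less continuously.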
Each secondary node has two 72 GB SCSI hard disks and 1 GB of RAM. They are fed baseband data via the gigabit ethernet layer and run software (PSRDISP) that coherently de-disperses the raw data into folded archives. Data management software sends folded profiles back to cpsr1 for temporary storage.
Section 1: Cold Start Procedures
This section deals with how to get everything going from scratch, if power to the racks has been turned off. These procedures have become much simpler since the Giganet network layer was removed and replaced with standard gigabit ethernet. However, it is still not an easy procedure and you should allow at least two hours to get everything up and running before you intend to start observing. Now that the system is inside its shielded MFB racks there should be no reason to de-power except in the case of equipment failure or maintenance. Thus in nearly every case observers will encounter the system in a ready state and should only have to worry about software issues.
What follows is intended for experienced observers only. If you are not
comfortable with what these instructions ask you to do, then by all means
seek out the help of an experienced staff member. If you have to deal with
any of the startup scenarios in this section then the system is already in
a non-standard state and we recommend people with detailed knowledge of the
procedures handle the situation.
This should set off a cascade of noise and lights inside the racks as the computers go through their power-up cycle. Note that the machines should switch themselves into the standby state once they have gone through the initial power-on cycle. They will not actually boot until you tell them to do so. The network switches will power up and go into fully operational mode immediately. The FFD crate may or may not have its 110V power supply turned on. If the orange switch on the front of the VME crate is not lit, turn it on. Simply power up all the Dell machines in the rack by holding in their power buttons until they come to life.
These days, the machines usually boot and find their network without any trouble, but there used to be a few issues associated with the ethernet layers. Most of the rest of this section deals with how to trouble-shoot things should a machine fail to find the network.
Be aware that the ethernet layers, while pretty stable in general, have occasional glitches. The most annoying of these is the famous "flashing light" problem. We think this problem was fixed by updating the drivers that control the on-board ethernet controllers, but this information is included for historical reasons. When a machine is powered up it sometimes fails to correctly initialise its ethernet ports. This is characterised by the activity light on the port rapidly flashing on and off at a steady rate. Sometimes "/etc/init.d/network restart" fixes this problem, but usually you will have to reboot or power cycle the machine.
Use one of the cluster monitor interfaces (see chapter 2) to check the
status of the networks once all the machines have come to life. You will
have to manually sort out any that do not want to talk to the rest of the
system. This can take some time.
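If the cluster monitor interfaces are not available, a quick reachability sweep can narrow down which machines need attention. The following is a hypothetical sketch, not part of the CPSR2 software: it assumes the names cpsr1 through cpsr30 resolve on the machine where it is run, and it uses the Linux iputils "ping" flags ("-c" for count, "-W" for reply timeout in seconds), which may differ on other systems.

```python
import subprocess


def node_names(count=30):
    """Names of the cluster machines, cpsr1 through cpsr30 by default."""
    return [f"cpsr{i}" for i in range(1, count + 1)]


def ping_once(host, timeout_s=2):
    """Return True if the host answers a single ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_nodes(hosts, is_up=ping_once):
    """Return the hosts that fail the reachability check, in the order given."""
    return [host for host in hosts if not is_up(host)]
```

Calling unreachable_nodes(node_names()) from a machine on the control network returns the names that did not reply. The check function is passed in as a parameter so that a different test (for example, one aimed at the gigabit layer) can be substituted without changing the sweep itself.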
The only other reason to open the racks is if a machine crashes so badly that it has to be reset with the power switch. If one of the nodes locks up completely (see the trouble-shooting section for information about how to detect this), figure out which one it is and reset it by holding the power button in until it shuts down. Wait a few seconds and press the power button again to bring it back to life. The machines all have little LCD panels on the front that display their name and number, so it should be obvious which is which.
Sometimes, the machines in the bottom of the right hand rack will flash a little orange warning about temperature, but this is just because they are next to the air-conditioning vent and are thus running slightly too cold. This is not considered to be a problem. Overheating IS a problem, but the racks are fitted with exhaust temperature sensors that report to the observatory environmental monitoring system, so an overheated rack will sound a showtel alarm on the monitors in both control rooms. If this happens when no staff are present, go to the following URL: http://www.parkes.atnf.csiro.au/cgi-bin/monitoring/equip_mon.cgi
This page displays the temperature readings for the two CPSR2 racks. If the
temperature drifts above about 40 degrees, go upstairs and open the rack
doors to allow ambient air in. This will only ever happen if the A/C fails,
so staff should be notified as soon as possible.
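The temperature rule above amounts to a simple threshold check. The sketch below illustrates it with made-up readings; the rack labels and values are hypothetical, the format of the monitoring page is not described here so no attempt is made to parse it, and the threshold is the "about 40 degrees" quoted above in whatever units the page reports.

```python
OVERHEAT_THRESHOLD = 40.0  # "about 40 degrees" from the text; units as shown on the page


def racks_over_threshold(readings, threshold=OVERHEAT_THRESHOLD):
    """Return the names of racks whose exhaust temperature exceeds the threshold."""
    return sorted(name for name, temp in readings.items() if temp > threshold)


# Hypothetical readings, typed in by hand from the monitoring page.
example_readings = {"left rack": 31.5, "right rack": 42.0}

hot_racks = racks_over_threshold(example_readings)
for rack in hot_racks:
    print(f"{rack}: open the rack doors and notify staff")
```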