Mantis currently runs on the Thinking Machines Corp. CM-5 and is built using a Tcl/Tk graphical user interface linked to a modified version of the Free Software Foundation's gdb debugger. Mantis made its debut at U. C. Berkeley during the Spring 1994 semester and was used heavily by the parallel computation course.
The remainder of the report reviews the Split-C programming model, walks through a sample debugging session with Mantis, describes the interface in detail, and discusses the implementation and our conclusions; a commented copy of the example program appears in the appendix.
In our experience, Split-C programs break into two layers. The top layer consists of a set of bulk synchronous blocks. The nodes enter each block more or less simultaneously and are synchronized at the end of each block before entering the next. The second layer occurs within the blocks. At this layer, the programmer thinks of each node as being distinct---each operates on a different set of data or even different code, but the model itself is sequential.
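This structure can be sketched as follows (a minimal illustration only; phase_one and phase_two are hypothetical placeholders, while splitc_main and barrier are the standard Split-C entry point and synchronization primitive):

    /* Bulk-synchronous layer: all nodes enter each phase together. */
    splitc_main ()
    {
      phase_one ();   /* sequential code within the phase; each node    */
      barrier ();     /* may touch different data or run different code */
      phase_two ();
      barrier ();     /* synchronize before leaving the block */
    }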
For the example, we draw on code that simulates the world of WaTor, introduced by A. K. Dewdney in 1984 and documented further by Fox et al., which has become a valuable tool for teaching parallel programming at Berkeley (in the CS267 course). In the original WaTor, sharks and fish share a world of water and interact through a small set of rules. Here we use a version in which fish alone populate an infinite plane and attract one another according to an inverse-square law. This version introduces programmers to basic issues in data distribution and access, as well as to methods typical of gravitational and electromagnetic problems.
We present a conceptual solution which illustrates the Split-C programming model discussed above; the commented code appears in the appendix. At the top level, all processors will simulate the world in discrete time steps of length determined by the velocity and acceleration of the fish. Each time step breaks into synchronous phases for computation of forces, movement of fish, and collection of statistics. These phases compose the bulk-synchronous layer of the program. Within each phase, the code is sequential, although it operates on the global address space since each processor looks at each fish. Although this program is small, it exemplifies the programming model used in many much larger programs.
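Distilled from the appendix, the main loop makes the phase structure explicit (all names below appear in the appendix code):

    while (t < T_FINAL) {
      t += delta_t;
      /* Phase 1: compute forces, then synchronize. */
      all_compute_force (num_fish, fish, x_force, y_force);
      barrier ();
      /* Phase 2: move fish; this work is purely local. */
      all_move_fish (num_fish, local_fish, delta_t, x_force, y_force,
                     &max_acc, &max_speed, &sum_speed_sq);
      /* Phase 3: global reductions collect statistics and
         synchronize the processors. */
    }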
Main window. Symbols for the scdtest program have just been loaded.
The first step towards finding the bug is to locate the bus error, so we run the program by pressing the Run Program button and then open the status window with another mouse click. Since starting a program on the CM-5 takes a noticeable amount of time, Mantis indicates this waiting period to the user through the status area and inverted pendulum icon. The pendulum rocks back and forth until Mantis is ready for another command. The user interface retains full functionality during this period, but any actions which require the attention of Mantis itself are stacked for later execution.
Almost immediately after the example program starts, the bus error occurs. We detect the errors via the status window. The green squares (the darker squares in the black and white version) represent nodes which are running (perhaps waiting for a reply from another node), while the yellow squares represent nodes in error. If any of the nodes had stopped at a breakpoint or been halted by the programmer, they would have been displayed in blue.
Status window. Many nodes are still running (green/dark) and many others have errors (yellow/light).
Picking one of the problematic nodes, we click on its square in the status window to create a node window. The signal section confirms our belief that this is the same problem encountered from the command line, and the stack section shows that the node has stopped in all_compute_force. Furthermore, Mantis has focused the source display section on the top stack frame and highlighted the current line in yellow. In addition to these features, the node window controls program execution on a per-node basis, supporting the common single-node viewpoint inside of synchronous blocks. The node window also gives access to the various data display windows.
Node window. Node number 46 has had a bus error in all_compute_force at the line highlighted in yellow.
After looking briefly at the source line at which the error occurred, we decide to examine the variables it uses. We select the expression "local_fish" by pointing and clicking the left mouse button, then evaluate it by pressing the right button. The value returns in the evaluation window. This window provides the main interface for examining state and verifying invariants by allowing the programmer to evaluate arbitrary expressions in the context of a stopped program. Its capabilities include support for Split-C extensions such as the global address space and spread arrays. The section at the bottom of the window allows the user to change state by entering a new value for a given variable.
Evaluation window. Examining the expression "local_fish" reveals the cause of the problem.
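To illustrate the range of expressions the window accepts, a few examples in the context of this program (hypothetical inputs; the identifiers come from the appendix code):

    local_fish[0].x_pos    /* field of a local structure              */
    fish[5]                /* access through the global address space */
    *tolocal (fish)        /* first fish in this node's local block   */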
From the window, we learn that the value of "local_fish" has not been initialized (see lines 66-67 in the appendix). We add the initialization and recompile the program, hoping that we'll have no further problems.
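A plausible form of the fix, mirroring the initialization already used in splitc_main (line 155 of the appendix), is a single assignment at the top of all_compute_force:

    /* Point local_fish at this processor's portion of the spread array. */
    local_fish = tolocal (fish);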
Output window. The partially debugged program hangs before completion.
Using the Node button in the main window, we create a second node window. We then stop the processor by clicking on the Halt button which appears in the node window. It stops inside of a barrier, but we see that the stack frame just below the barrier is splitc_main, so we click on that frame and arrive at the node window depicted in the figure. Notice that the source file section has automatically displayed the relevant section of code and highlighted the barrier call. Using the node number entry box in the upper left corner of the node window to move between nodes, we halt and examine a few other nodes and find that all have stopped in the same barrier.
Node window. Node 0 has hung in a barrier. The programmer has moved up the stack to splitc_main, and the call is highlighted in yellow.
At this point we begin to hypothesize about the problem: unless at least one node has failed to reach the barrier, the chance of catching all, or even most, of the nodes waiting inside it is slim. By letting the processors run for a bit and stopping them again, we can make that chance arbitrarily small. The problem then becomes to figure out why some processors are not passing through the barrier. To get a better picture of the state on each node, we open the local variables window by pressing the Locals button. The window shows the values of all variables local to the frame selected in the corresponding node window. In this case, we see that the time t agrees well with the last value shown in the output window, as we expect, since node 0 performs the print statements. When we shift to another node, however, the time no longer agrees; somehow the processors have broken the programmer's model of bulk-synchronous, equal-length time steps. We suspect that some node has reached T_FINAL and believes itself to be done, causing the rest of the nodes to wait indefinitely at the barrier.
Local variables window. The variables shown correspond to the splitc_main stack frame chosen in the node window.
To verify this hypothesis, we set breakpoints on two nodes at the point in splitc_main where t is updated (line 170). For each node, we open a display window with the expression "delta_t," as shown in the display window figures. The display window is similar to the evaluation window, except that it re-evaluates the expression automatically whenever the processor stops or a new frame is selected. This is ideal for our purposes, because we would like to compare the values of delta_t between processors for the same bulk-synchronous step, pressing only the Continue button of each node window between comparisons. By the second step, the times have already diverged, as shown in the figures.
Display window. The expression "delta_t" is evaluated on node 0 each time the processor stops or the programmer changes the frame.
Another display window. The expression "delta_t" is evaluated on node 1.
We have verified our hypothesis that the nodes become unsynchronized, but we have yet to understand how this occurs. To do so, we step back for a moment and consider the bulk-synchronous model. We would like to stop all of the processors after they have completed the first few stages; in particular, we would like to stop them all just before they calculate the value of delta_t for the next time step. We click on the Global button in the main window to get a global window.
Global window. The programmer has just set a breakpoint at line 199 (in splitc_main).
The global window supports the bulk synchronous view of the Split-C program. It allows the user to toggle breakpoints for all of the nodes simultaneously, and to start and stop all of the nodes. We set the desired breakpoint, as shown by the solid dot at line 199 in the figure, and then restart the program. Once the processors have all stopped, we can pick two and examine the quantities used to calculate delta_t. Finding that both max_speed and max_acc differ, we look up a few lines and realize that our calls to all_reduce_to_one_dmax have returned different values, and we find the bug: we wanted to use all_reduce_to_all_dmax, which returns the reduced value to all of the processors (see lines 183-186 in the appendix). We make the changes and recompile the program, after which it runs as we expect.
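Concretely, the three reductions at lines 191-193 of the appendix become their all-to-all counterparts (we assume here that the all_reduce_to_all_d* functions mirror the signatures of the one-sided versions):

    max_acc      = all_reduce_to_all_dmax (max_acc);
    max_speed    = all_reduce_to_all_dmax (max_speed);
    sum_speed_sq = all_reduce_to_all_dadd (sum_speed_sq);

With identical maxima on every node, all processors now compute the same delta_t and remain synchronized.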
Mantis also attempts to make the process of finding source and executable files as simple as possible. Source file information is taken from the executable's symbol tables and compiled into a menu of source files. Files in the Split-C library are currently stripped out of this menu; as a side effect, user files that share a name with a library file will not appear in the menu either. We would like to be able to give a list of header files as well, but this information is not available from the executable. File selection entry boxes also have a control button immediately to the right, always marked with an arrowhead pointing to the right, which allows the user to make choices via a special dialog. The user can traverse directories and select files in this dialog by simply double-clicking with the mouse.
The upper two entry boxes in the main window allow selection of the executable and command line arguments, while the third provides a mechanism for locating source files if insufficient information appears in the symbol table.
The right edge of the main window holds a pair of buttons which provide information on software copying and warranties according to the Free Software Foundation policies.
Along the bottom of the main window is a row of buttons divided into two groups. The right group contains the Quit button and a button marked Clear which discards all current information and closes all windows.
The left group of buttons manages the creation of other windows. Each of the three buttons generates a new window: the status window, a node window, or the global window. The global and node windows are described in later sections; the status window is discussed here.
The status window gives a graphical display of the current state of all processors. The processors are displayed as a two dimensional mesh of squares in a pattern which maximizes the size of each processor within the window size chosen by the user. The shape may therefore change when the window is resized. The table below summarizes the colors and patterns used for each possible state (patterns are used only when color is unavailable). Clicking on any of the processor squares brings up a node window corresponding to that processor. Thus if a certain node has caused an error, the user may detect it visually and select it via the mouse for examination.
Mantis Status Window Colors and Patterns

    Color (monochrome pattern)    Processor state
    green (dark)                  running, perhaps waiting for another node
    blue                          stopped at a breakpoint or halted by the user
    yellow (light)                stopped by an error
Halting and continuing the processors can be performed via either a node window or the global window. In either case, a pair of buttons appears to the left, just below the source display area. In the case of the global window, these buttons will stop or start all processors, while for the node windows, the buttons will stop or start only the processor being examined with that node window. The node window also has buttons for stepping a single processor through a line of code. The Step button will continue program execution until it reaches the next line of code or enters a new procedure. The Next button will wait for any procedure called to complete and for the processor to reach the next line of code.
Because of the methods employed for single-stepping on SPARC processors, the debugger process must watch each step. Since completing a step may also require synchronization with another processor which is currently halted, stepping can cause deadlock. To allow the user to escape this situation, the pendulum pops up into button form while a step is in progress (instructions also appear in the status area); if the button is pressed, the processor is halted immediately whether or not it has finished the step.
Output from the process being debugged appears in a special Mantis window. Both stdout and stderr are channeled into this window, which appears automatically. Two buttons at the bottom allow the user to discard the data currently in the window and to dismiss the window completely. In addition to the output from the process, notification that the process being debugged has exited or terminated is also given here.
Access to source files and functions is provided by a line of controls just above the source display section. On the right is a menu of selection methods which displays the method currently in use. In the figure, this button appears as Source. The selection method determines the meaning of the entry box and the button to the left as follows:
Directly below the status section of the node window is a region which displays signal and stack information when the processor is halted. If a signal (e.g., a segmentation fault) caused the process to halt, that information will be given in the area marked "Signal." The stack section displays stack frames, which the user can select by pointing and clicking or can traverse up or down, one at a time, using a pair of buttons to the right. When no information is available, the stack motion buttons are disabled (they turn grey). When the user moves to a stack frame, the source file display area changes to show the source code corresponding to the stack frame, complete with a highlighted line where the processor is stopped, as can be seen in the first and second node window figures.
The evaluation window provides a fairly standard interface, familiar from the PC world, for evaluating expressions and changing variable values. The expression to be evaluated is entered in the top entry box and the value is returned in the middle box. A menu button to the right of the entry box gives a list of recently evaluated expressions, so the user can cycle through a small set of expressions by simply selecting each from the menu. The bottom entry allows the user to change the value of an expression. Errors in evaluation are shown in both the node window and the evaluation window, allowing the user to focus on either.
In addition to this common method for checking values, Mantis provides several others. Just above the source display area is a row of buttons which manage data display. The Evaluate button resides in the middle and is equivalent to pressing the right mouse button in the source display region. The Display button to the right creates a display window, which is like the evaluation window except that the expression is re-evaluated automatically when the processor halts or the stack frame is changed. Since the user may want to maintain several displayed expressions, Mantis makes the display window as small as possible by removing the section used to change values; values must be changed via the evaluation window instead. Finally, the Locals button to the left creates a window which displays variables local to the selected stack frame. Since this information changes with stack frames, it is updated automatically whenever the frame is changed. Note that all data display windows are marked across the top with the color and processor number of the node window to which they correspond. The windows are also grouped with the node window so that iconifying or deiconifying the node window does the same with the subwindows.
Because Split-C currently lacks a standard output format for global and spread pointer values, it was necessary to create one for Mantis. Naturally, Mantis accepts this format on input as well as standard Split-C constructs such as toglobal and tolocal. Any global or spread pointer appears as follows:
Creation of global and spread pointers mirrors the language. The address of a spread array section, for example, will be a spread pointer if the section consists of one or more whole blocks on each processor, and will be a global pointer if the section is part of a block. Given, say, a spread array declared as int a[PROCS]::[2] (a declaration assumed here for concreteness, chosen to match the types below):

    &a[1] has type int (*spread)[2], while
    &a[1][1] has type int *global.
It is important to keep in mind that local pointers are dereferenced on the processor indicated in the window, regardless of where the pointer was obtained. If, for example, the user has a global pointer to a local pointer to an integer, dereferencing that value twice will return the value at the address of the local pointer in the address space of the processor corresponding to the window, just as it would in Split-C.
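For instance, with declarations such as the following (hypothetical, and assuming Split-C's postfix global qualifier):

    int *p;             /* a local pointer; one instance per processor */
    int * *global gpp;  /* a global pointer to such a local pointer    */

evaluating **gpp in a node window first fetches the pointer value from the processor that gpp names, then dereferences that address in the address space of the window's processor, exactly as Split-C itself would.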
Finally, since dynamically allocated arrays appear often in Split-C, it is worth describing the notation used to cast a pointer to an array and to set the values of a dynamic array. To print a fixed number N of elements beginning at a pointer p to type my_type, use the form:
The user interface is written using the Tool Command Language (Tcl) and X11 toolkit (Tk) developed by Ousterhout, which greatly simplified the task. It would be hard to overstate the value of Tcl's interpreted, script-based nature for rapidly creating attractive, functional, and consistent user interfaces. The interface required about seven thousand lines of code, divided roughly equally between script code and C code.
The Mantis debugger is based on the Free Software Foundation's gdb debugger, which consists of roughly two hundred thousand lines of code and provides a portable sequential debugging environment. We felt that modifying gdb to add Split-C language support and to handle parallel processes would take far less time than writing a high-quality debugger from scratch. Furthermore, by building on gdb, we can simply incorporate our changes into future releases and thereby ease portability problems. This approach has worked to great advantage with Split-C and the gcc compiler, and we expect the same results with Mantis and gdb.
Unlike Split-C, Mantis currently runs only on the CM-5. Under the CMOST operating system, process information must be obtained through the Time-Sharing Daemon which runs on the CM-5 host processor. To support this model, we use a single debugger process on the host which communicates with the TS-Daemon to gather debugging information. Unfortunately, this means that our debugger process must directly interact with a large number of child processes instead of the usual one-to-one relationship between the debugger and the child. Adding this capability to the gdb debugger was the most difficult modification required, and also the least portable since it permeates a great deal of the gdb source. Since most other platforms, in particular networks of workstations, will not require these changes, we have begun to build a more portable version for other platforms and currently plan to finish the new version some time in October 1994. We also hope to make Mantis publicly available by December 1994.
We also claimed that a debugger must provide efficient means of performing common tasks such as execution control (e.g., breakpoints) and verification of invariants. In Mantis, execution can be managed for each node individually or for all nodes as a group. State and invariants can be checked with a variety of methods, each capable of understanding the global address space, spread arrays, and other abstractions which define Split-C.
Although Mantis has met the goals fairly well, several issues remain. In general, the problem of condensing information into a format which can be scanned quickly and which directs the user to a bug remains largely unaddressed. Comparisons of stack frames and node-scanning buttons are two ideas which in the future might reduce the amount of information which Mantis users must examine when looking for a bug. Also, it would be useful and fairly straightforward to integrate a data visualization system with the Mantis debugger, but we have yet to do so. Finally, we do not address the ideas of tracing and deterministic replay which currently appear in the literature, primarily because we do not feel that these concepts will be particularly useful in debugging production codes.
While the first implementation of Mantis was being developed, Split-C was ported to several other platforms by Luna, including the Intel Paragon and networks of Hewlett-Packard workstations. Work is also currently underway to provide Split-C for arbitrary networks of homogeneous workstations as well as the Meiko CS-2. For those interested in examples of Split-C applications, we suggest Culler et al., which explains the language by example.
[2] D. E. Culler, A. C. Dusseau, K. E. Schauser, R. P. Martin, "Fast Parallel Sorting under LogP: from Theory to Practice," in "Portability and Performance for Parallel Processing," A. J. G. Hey and J. Ferrante, eds., pp. 71-98, John Wiley & Sons, Ltd., 1994.
[3] A. K. Dewdney, "Computer Recreations: Sharks and fish wage an ecological war on the toroidal planet Wa-Tor," Scientific American, December 1984.
[4] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, D. Walker, "Solving Problems on Concurrent Processors," Vol. I, Ch. 17, pp. 307-325, Prentice Hall, Englewood Cliffs, New Jersey, 1988.
[5] S. Luna, "Implementing an Efficient Portable Global Memory Layer on Distributed Memory Multiprocessors," U. C. Berkeley Technical Report #CSD-94-810, May 1994.
[6] J. K. Ousterhout, "Tcl: An Embeddable Command Language," Proc. USENIX Winter Conference, pp. 133-146, 1990.
[7] J. K. Ousterhout, "An X11 Toolkit Based on the Tcl Language," Proc. USENIX Winter Conference, pp. 105-115, 1991.
[8] T. von Eicken, D. E. Culler, S. C. Goldstein, K. E. Schauser, "Active Messages: a Mechanism for Integrated Communication and Computation," Proc. International Symposium on Computer Architecture, 1992.
  1 #include <stdio.h>
  2 #include <stdlib.h>
  3 #include <math.h>
  4 #include <split-c/split-c.h>
  5 #include <split-c/com.h>
  6
  7
  8 #define NFISH 100           /* number of fish */
  9 #define T_FINAL 10.0        /* simulation end time */
 10 #define GRAV_CONSTANT 1.0   /* proportionality constant of
 11                                gravitational interaction */
 12
 13 /*
 14   This structure holds information for a single fish, including
 15   position, velocity, and mass.
 16 */
 17
 18 typedef struct {
 19   double x_pos, y_pos;
 20   double x_vel, y_vel;
 21   double mass;
 22 } fish_t;
 23
 24
 25 /*
 26   Place fish in their initial positions.
 27 */
 28
 29 void all_init_fish (int num_fish, fish_t *local_fish)
 30 {
 31   int i, n;
 32   double total_fish = PROCS * num_fish;
 33
 34   for (i = 0, n = MYPROC * num_fish; i < num_fish; i++, n++) {
 35     local_fish[i].x_pos = (n * 2.0) / total_fish - 1.0;
 36     local_fish[i].y_pos = 0.0;
 37     local_fish[i].x_vel = 0.0;
 38     local_fish[i].y_vel = local_fish[i].x_pos;
 39     local_fish[i].mass = 1.0 + n / total_fish;
 40   }
 41 }
 42
 43
 44 /*
 45   Compute the force on all local fish according to the 2-dimensional
 46   gravity rule,
 47     F = d * (GMm/d^2),
 48   and add it to the force vector (fx, fy). Note that both fish and
 49   both components of the force vector are local.
 50 */
 51
 52 void all_compute_force (int num_fish, fish_t *spread fish,
 53                         double *x_force, double *y_force)
 54 {
 55   int i, j;
 56   fish_t *local_fish, remote_fish;
 57   double delta_x, delta_y, dist_sq, grav_base;
 58
 59   /* Clear forces on local fish. */
 60   for (i = 0; i < num_fish; i++) {
 61     x_force[i] = 0.0;
 62     y_force[i] = 0.0;
 63   }
 64
 65   /*
 66     BUG: The programmer forgot to initialize local_fish before
 67     using it below.
 68   */
 69
 70   /* Move through the global fish list and
 71      accumulate forces on local fish. */
 72   for (j = 0; j < NFISH; j++) {
 73
 74     /* Read remote fish data. */
 75     remote_fish = fish[j];
 76
 77     /* Calculate force between remote fish and all local
 78        fish and accumulate force on local fish. */
 79     for (i = 0; i < num_fish; i++) {
 80       delta_x = remote_fish.x_pos - local_fish[i].x_pos;
 81       delta_y = remote_fish.y_pos - local_fish[i].y_pos;
 82       dist_sq = MAX ((delta_x * delta_x) + (delta_y * delta_y), 0.01);
 83       grav_base =
 84         GRAV_CONSTANT * (local_fish[i].mass) * (remote_fish.mass) /
 85         dist_sq;
 86
 87       x_force[i] += grav_base * delta_x;
 88       y_force[i] += grav_base * delta_y;
 89     }
 90   }
 91 }
 92
 93
 94 /*
 95   Move fish one time step, updating positions, velocity, and
 96   acceleration. Return local computations of maximum acceleration,
 97   maximum speed, and sum of speeds squared.
 98 */
 99
100 void all_move_fish (int num_fish, fish_t *local_fish, double delta_t,
101                     double *x_force, double *y_force,
102                     double *max_acc_ptr, double *max_speed_ptr,
103                     double *sum_speed_sq_ptr)
104 {
105   int i;
106   double x_acc, y_acc, acc, speed, speed_sq;
107   double max_acc = 0.0, max_speed = 0.0, sum_speed_sq = 0.0;
108
109   /* Move fish one at a time and calculate statistics. */
110   for (i = 0; i < num_fish; i++) {
111
112     /* Update fish position, calculate acceleration, and update
113        velocity. */
114     local_fish[i].x_pos += (local_fish[i].x_vel) * delta_t;
115     local_fish[i].y_pos += (local_fish[i].y_vel) * delta_t;
116     x_acc = x_force[i] / local_fish[i].mass;
117     y_acc = y_force[i] / local_fish[i].mass;
118     local_fish[i].x_vel += x_acc * delta_t;
119     local_fish[i].y_vel += y_acc * delta_t;
120
121     /* Accumulate local max speed, accel and contribution to
122        mean square velocity. */
123     acc = sqrt (x_acc * x_acc + y_acc * y_acc);
124     max_acc = MAX (max_acc, acc);
125     speed_sq = (local_fish[i].x_vel) * (local_fish[i].x_vel) +
126                (local_fish[i].y_vel) * (local_fish[i].y_vel);
127     sum_speed_sq += speed_sq;
128     speed = sqrt (speed_sq);
129     max_speed = MAX (max_speed, speed);
130   }
131
132   /* Return local computation results. */
133   *max_acc_ptr = max_acc;
134   *max_speed_ptr = max_speed;
135   *sum_speed_sq_ptr = sum_speed_sq;
136 }
137
138
139 /*
140   Simulate the movement of NFISH fish under gravitational attraction.
141 */
142
143 splitc_main ()
144 {
145   double t = 0.0, delta_t = 0.01;
146   double max_acc, max_speed, sum_speed_sq, mnsqvel;
147   fish_t *spread fish, *local_fish;
148   int num_fish;
149   double *x_force, *y_force;
150
151   /* Allocate a global spread array for the fish data set and
152      obtain a pointer to the local portion of the array. Then
153      find the number of fish owned by this processor. */
154   fish = all_spread_malloc (NFISH, sizeof (fish_t));
155   local_fish = tolocal (fish);
156   num_fish = my_elements (NFISH);
157
158   /* Allocate force accumulation arrays. */
159   x_force = (double *) malloc (num_fish * sizeof (double));
160   y_force = (double *) malloc (num_fish * sizeof (double));
161
162   /* Initialize the fish structures, then synchronize
163      to ensure completion. */
164   all_init_fish (num_fish, local_fish);
165   barrier ();
166
167   while (t < T_FINAL) {
168
169     /* Update time. */
170     t += delta_t;
171
172     /* Compute forces on fish due to other fish, then synchronize to
173        ensure that no fish are moved before calculations finish. */
174     all_compute_force (num_fish, fish, x_force, y_force);
175     barrier ();
176
177     /* Move fish according to forces and compute rms velocity.
178        Note that this function is completely local. */
179     all_move_fish (num_fish, local_fish, delta_t, x_force, y_force,
180                    &max_acc, &max_speed, &sum_speed_sq);
181
182     /*
183       BUG: The programmer used all_reduce_to_one_d* instead of
184       all_reduce_to_all_d*, resulting in different time steps on each
185       processor which eventually caused synchronization errors and
186       hung the program.
187     */
188
189     /* Compute maxima across all processors and sum for rms speed
190        (this will synchronize the processors). */
191     max_acc = all_reduce_to_one_dmax (max_acc);
192     max_speed = all_reduce_to_one_dmax (max_speed);
193     sum_speed_sq = all_reduce_to_one_dadd (sum_speed_sq);
194     mnsqvel = sqrt (sum_speed_sq / NFISH);
195
196     /* Adjust delta_t based on maximum speed and acceleration--this
197        simple rule tries to ensure that no velocity will change
198        by more than 10%. */
199     delta_t = 0.1 * max_speed / max_acc;
200
201     /* Print out time and rms velocity for this step. */
202     on_one {printf ("%15.6lf %15.6lf\n", t, mnsqvel);}
203   }
204 }