My brain is currently busy with these topics:
See progs.htm#gts
In C : [url]
//
// Let's try to show a Message Box ...
//
// Compile using :
//
// Microsoft - CL UNO.C /GD USER32.LIB
//
// Borland - BCC32 -tW UNO.C
//
#include <windows.h>

int WINAPI WinMain ( HINSTANCE hInstance,
                     HINSTANCE hPrevInstance,
                     PSTR      lpCmdLine,
                     int       nCmdShow )
{
    int on_button ;

    on_button = MessageBox ( NULL,
                             TEXT("Do you like Nekomimi?"),
                             TEXT("Question title"),
                             MB_YESNO | MB_ICONQUESTION ) ;
    if ( on_button == IDYES )
        MessageBox ( NULL,
                     TEXT("You are nice!"),
                     TEXT("Good Answer title"), MB_OK ) ;
    else
        MessageBox ( NULL,
                     TEXT("Be Killed!"),
                     TEXT("Bad Answer title"),
                     MB_OK ) ;
    return 0 ;
}
Compile with BCC32 -tW nom.c
Link error: the required LIBs are not found.
The file c0x32.obj is the startup code for console mode programs, the type of program that begins at the function main. A GUI program, the type that starts at the function WinMain, uses a different startup code file.
Solution: remove the "-tW" compiler flag, which creates a Windows application.
In Delphi :
MessageBeep : URL
The idea is to investigate what the WTM and/or PROCEXP (SysInternals) see when one thread launches another and then disappears.
Interesting points:
URL (4/17) :
By creating and managing processes, applications can have multiple, concurrent tasks processing files, performing computations, or communicating with other networked systems. It is even possible to exploit multiple processors to speed processing.
This chapter explains the basics of process management and also introduces the basic synchronization operations that will be used throughout the rest of the book.
Windows Processes and Threads
Every process contains one or more threads, and the Windows thread is the basic executable unit. Threads are scheduled on the basis of the usual factors: availability of resources such as CPUs and physical memory, priority, fairness, and so on. Windows has supported symmetric multiprocessing (SMP) since NT4, so threads can be allocated to separate processors within a system.
From the programmer's perspective, each Windows process includes resources such as the following components:
Each thread in a process shares code, global variables, environment strings, and resources. Each thread is independently scheduled, and a thread has the following elements:
Process Creation
The fundamental Windows process management function is CreateProcess, which creates a process with a single thread. It is necessary to specify the name of an executable program file as part of the CreateProcess call.
It is common to speak of parent and child processes, but these relationships are not actually maintained by Windows. It is simply convenient to refer to the process that creates a child process as the parent.
CreateProcess has ten parameters to support its flexibility and power. Initially, it is simple to use default values. Just as with CreateFile, it is appropriate to explain all the CreateProcess parameters. Related functions then become easier to understand.
Note first that the function does not return a HANDLE; rather, two separate handles, one each for the process and the thread, are returned in a structure specified in the call. CreateProcess creates a new process with a primary thread. The example programs are always very careful to close both of these handles when they are no longer needed in order to avoid resource leaks; a common defect is to neglect to close the thread handle. Closing a thread handle, for instance, does not terminate the thread; the CloseHandle function only deletes the reference to the thread within the process that called CreateProcess.
Return: TRUE only if the process and thread are successfully created.
Parameters
Some parameters require extensive explanations in the following sections, and many are illustrated in the program examples.
lpApplicationName and lpCommandLine (this is an LPTSTR and not an LPCTSTR) together specify the executable program and the command line arguments, as explained in the next section.
lpsaProcess and lpsaThread point to the process and thread security attribute structures. NULL values imply default security and will be used until Chapter 15, which covers Windows security.
bInheritHandles indicates whether the new process inherits copies of the calling process's inheritable open handles (files, mappings, and so on). Inherited handles have the same attributes as the originals and are discussed in detail in a later section.
dwCreationFlags combines several flags, including the following.
Several of the flags control the priority of the new process's threads. The possible values are explained in more detail at the end of Chapter 7. For now, just use the parent's priority (specify nothing) or NORMAL_PRIORITY_CLASS.
lpEnvironment points to an environment block for the new process. If NULL, the process uses the parent's environment. The environment block contains name and value strings, such as the search path.
lpCurDir specifies the drive and directory for the new process. If NULL, the parent's working directory is used.
lpStartupInfo specifies the main window appearance and standard device handles for the new process. Use the parent's information, which is obtained from GetStartupInfo. Alternatively, zero out the associated STARTUPINFO structure before calling CreateProcess. To specify the standard input, output, and error handles, set the standard handler fields (hStdInput, hStdOutput, and hStdError) in the STARTUPINFO structure. For this to be effective, also set another STARTUPINFO member, dwFlags, to STARTF_USESTDHANDLES, and set all the handles that the child process will require. Be certain that the handles are inheritable and that the CreateProcess bInheritHandles flag is set. The Inheritable Handles subsection gives more information and an example.
lpProcInfo specifies the structure for containing the returned process, thread handles, and identification. The PROCESS_INFORMATION structure is as follows:
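The structure, as declared in windows.h, is:

```c
typedef struct _PROCESS_INFORMATION {
    HANDLE hProcess;     /* handle to the new process       */
    HANDLE hThread;      /* handle to its primary thread    */
    DWORD  dwProcessId;  /* global process identifier       */
    DWORD  dwThreadId;   /* global thread identifier        */
} PROCESS_INFORMATION;
```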
Why do processes and threads need handles in addition to IDs? The ID is unique to the object for its entire lifetime and in all processes, whereas a given process may have several handles, each having distinct attributes, such as security access. For this reason, some process management functions require IDs, and others require handles. Furthermore, process handles are required for the general-purpose, handle-based functions. Examples include the wait functions discussed later in this chapter, which allow waiting on handles for several different object types, including processes. Just as with file handles, process and thread handles should be closed when no longer required.
Note: The new process obtains environment, working directory, and other information from the CreateProcess call. Once this call is complete, any changes in the parent will not be reflected in the child process. For example, the parent might change its working directory after the CreateProcess call, but the child process working directory will not be affected, unless the child changes its own working directory. The two processes are entirely independent.
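To make the preceding concrete, a minimal launch sequence might look like the following sketch (Windows-only; the command line shown is an arbitrary example). Note that the command line is passed in a writable buffer, and that both returned handles are closed:

```c
STARTUPINFO si;
PROCESS_INFORMATION pi;
TCHAR cmdLine[] = TEXT("notepad.exe");   /* must be writable, not a string literal */

ZeroMemory(&si, sizeof(si));
si.cb = sizeof(si);
if (CreateProcess(NULL, cmdLine, NULL, NULL,
                  FALSE,         /* bInheritHandles                       */
                  0,             /* dwCreationFlags: parent's priority    */
                  NULL, NULL,    /* parent's environment and directory    */
                  &si, &pi)) {
    /* ... use pi.hProcess / pi.hThread ... */
    CloseHandle(pi.hThread);     /* close BOTH handles when done;         */
    CloseHandle(pi.hProcess);    /* this does not terminate the child     */
}
```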
The UNIX and Windows process models are considerably different. First, Windows has no equivalent to the UNIX fork function, which makes a copy of the parent, including the parent's data space, heap, and stack. fork is difficult to emulate exactly in Windows, and, while this may seem to be a limitation, fork is also difficult to use in a multithreaded UNIX system because there are numerous problems with creating an exact replica of a multithreaded system with exact copies of all threads and synchronization objects, especially on an SMP system. Therefore, fork, by itself, is not really appropriate in any multithreaded system.
CreateProcess is, however, similar to the common UNIX sequence of successive calls to fork and execl (or one of five other exec functions). In contrast to Windows, the search directories in UNIX are determined entirely by the PATH environment variable.
As previously mentioned, Windows does not maintain parent-child relationships among processes. Thus, a child process will continue to run after the creating parent process terminates. Furthermore, there are no process groups in Windows. There is, however, a limited form of process group that specifies all the processes to receive a console control event.
Windows processes are identified both by handles and by process IDs, whereas UNIX has no process handles.
Specifying the Executable Image and the Command Line
Either lpApplicationName or lpCommandLine specifies the executable image name. The rules are as follows.
The new process can obtain the command line using the usual argv mechanism, or it can invoke GetCommandLine to obtain the command line as a single string.
Notice that the command line is not a constant string. This is consistent with the fact that the argv parameters to the main program are not constant. A program could modify its arguments, although it is advisable to make any changes in a copy of the argument string.
The new process is not required to be built with the same UNICODE definition as that of the parent process. All combinations are possible. Using _tmain as discussed in Chapter 2 is helpful in developing code for either UNICODE or ASCII operation.
Inheritable Handles
Frequently, a child process requires access to an object referenced by a handle in the parent; if this handle is inheritable, the child can receive a copy of the parent's open handle. The standard input and output handles are frequently shared with the child in this way. To make a handle inheritable so that a child receives and can use a copy requires several steps.
The bInheritHandles flag on the CreateProcess call determines whether the child process will inherit copies of the inheritable handles of open files, processes, and so on. The flag can be regarded as a master switch applying to all handles.
It is also necessary to make an individual handle inheritable; it is not done by default. To create an inheritable handle, use a SECURITY_ATTRIBUTES structure at creation time or duplicate an existing handle.
The SECURITY_ATTRIBUTES structure has a flag, bInheritHandle, that should be set to TRUE. Also, recall that nLength should be set to sizeof (SECURITY_ATTRIBUTES).
The following code segment shows how an inheritable file or other handle is typically created. In this example, the security descriptor within the security attributes structure is NULL; Chapter 15 shows how to include a security descriptor.
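The original code segment is missing here; a typical reconstruction (the file name is illustrative) would be:

```c
SECURITY_ATTRIBUTES sa;
HANDLE hFile;

sa.nLength = sizeof(SECURITY_ATTRIBUTES);
sa.lpSecurityDescriptor = NULL;   /* default security; see Chapter 15 */
sa.bInheritHandle = TRUE;         /* make the handle inheritable      */

hFile = CreateFile(TEXT("tmp.dat"), GENERIC_READ | GENERIC_WRITE,
                   0, &sa, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
```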
A child process still needs to know the value of an inheritable handle, so the parent needs to communicate handle values to the child using an interprocess communication (IPC) mechanism or by assigning the handle to standard I/O in the STARTUPINFO structure, as is done in the first example of this chapter (Program 6-1) and in several additional examples throughout the book. This is generally the preferred technique because it allows I/O redirection in a standard way and no changes are needed in the child program.
Alternatively, nonfile handles and handles that are not used to redirect standard I/O can be converted to text and placed in a command line or in an environment variable. This approach is valid if the handle is inheritable because both parent and child processes identify the handle with the same handle value. Exercise 6-2 suggests how to demonstrate this, and a solution is presented on the book's Web site.
The inherited handles are distinct copies. Therefore, a parent and child might be accessing the same file using different file pointers. Furthermore, each of the two processes can and should close its own handle.
Figure 6-2 shows how two processes can have distinct handle tables with two distinct handles associated with the same file or other object. Process 1 is the parent, and Process 2 is the child. The handles will have identical values in both processes if the child's handle has been inherited, as is the case with Handles 1 and 3.
06fig02.gif = Figure 6-2 Process Handle Tables
On the other hand, the handle values may be distinct. For example, there are two handles for File D, where Process 2 obtained a handle by calling CreateFile rather than by inheritance. Finally, as is the case with Files B and E, one process may have a handle to an object while the other does not. This would be the case when the child process creates the handle or when a handle is duplicated from one process to another, as described in the upcoming Duplicating Handles section.
Process Handle Counts
A common programming error is to neglect to close handles when they are no longer needed; this can result in resource leakage, which in turn can degrade performance, cause program failures, and even impact other processes. NT 5.1 added a new function that allows you to determine how many handles any process has open. In this way, you can monitor your own process or other processes.
Here is the function definition, which is self-explanatory:
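```c
BOOL GetProcessHandleCount(
    HANDLE hProcess,         /* handle to the process to query      */
    PDWORD pdwHandleCount ); /* receives the number of open handles */
```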
Process Identities
A process can obtain the identity and handle of a new child process from the PROCESS_INFORMATION structure. Closing the child handle does not, of course, destroy the child process; it destroys only the parent's access to the child. A pair of functions is used to obtain current process identification.
GetCurrentProcess actually returns a pseudohandle, which is not inheritable. This value can be used whenever a process needs its own handle. You create a real process handle from a process ID, including the one returned by GetCurrentProcessId, by using the OpenProcess function. As is the case with all sharable objects, the open call will fail if you do not have sufficient security rights.
Return: A process handle, or NULL on failure.
Parameters
dwDesiredAccess determines the handle's access to the process. Some of the values are as follows.
bInheritHandle specifies whether the new handle is inheritable. dwProcessId is the identifier of the process requiring a handle.
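For example, to obtain a queryable, waitable handle from a process ID (dwPid is assumed to hold a valid identifier):

```c
HANDLE hProc = OpenProcess(
    PROCESS_QUERY_INFORMATION | SYNCHRONIZE,  /* dwDesiredAccess */
    FALSE,                                    /* not inheritable */
    dwPid);                                   /* dwProcessId     */
if (hProc == NULL) {
    /* insufficient rights or invalid ID */
}
```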
Finally, a running process can determine the full pathname of the executable used to run it with GetModuleFileName or GetModuleFileNameEx, using a NULL value for the hModule parameter. A call from within a DLL will return the DLL's file name, not that of the .EXE file that uses the DLL.
Client / Server
Good examples
From WEBCO.C :
From TSRV.C : [*****]
Any (multithreaded) program can include Web Console Thread code that implements a basic Web server listening on port 80, able to generate a few HTML pages that display some internal variables/events in any browser that knows only the program's IP address.
A program that has a nice ("internal", not "web") console is Azureus.
The General TCP Server implements the following commands - see \\cpp\Server\GeneralServer\echoserv.c
| Command | Description | Status |
|---|---|---|
| B desti@servidor.mail <texte> | send an e-mail to a fixed (or given) destination. | - |
| C | Client selection : normal / complete. | - |
| D | directory management: C = display current, G = go, R = remove (empty), L = list current - <text> ... <text> <"EOM"> | done, v 2.6 |
| E | create (own) Event Log. | - |
| F | file management: C = create, D = delete, R = rename. | done, v 2.8 |
| H | Help : return the list of accepted commands and their syntax. | done, v 2.3 |
| I | return IP data : IP address, Hostname, etc. | - |
| K "Z" / "N" / ... | kill Zone Alarm / Norton / Symantec / Panda / McAfee (McShield) ... | - |
| L | pop up a ListBox with some text. See the "owncombo.c" sample in MSDN. | - |
| M | mirror the following text. | done, v 2.7 |
| O | return OPSYS version. See f:\cpp\DiskInfo\DI.C + f:\cpp\Registry\GetSystemVersion | - |
| P | return PROCESS list - <text> ... <text> <"EOM"> | - |
| R | return REGISTRY branch - <text> ... <text> <"EOM"> | - |
| S <envir var name> | return System Envir value | - |
| T | return TIME. | done, v 2.5 |
| V | return Server VERSION. | done, v 2.2 |
| W or log | write into "own" Event Log, if it exists. Use the "E" command to create it first. | - |
| other | return Not Implemented Yet. Use "H" for Help. The offending text is returned as well. | done, v 2.4 |
To create a Windows NT user-defined service, perform the following steps:
Make sure that the service's ImagePath registry value is set to point to SRVANY.EXE. If this is not set correctly, the service will stop shortly after it starts, logging Event ID 7000 "The service name failed to start."
where <path>\<application.ext> is the drive and full path to the application executable, including the extension (e.g., C:\WinNT\Notepad.exe)
So the installer (POSA.BAT) will need to do:
And the prerequisites are:
This code takes EVERYTHING from the DLL - URL :
void ( * sconn ) ( char *, MQHCONN *, MQLONG *, MQLONG * ) ;
void ( * sopen ) ( MQHCONN, MQOD *, MQLONG, MQHOBJ *, MQLONG *, MQLONG * ) ;
HINSTANCE hinstLib ;

hinstLib = LoadLibrary ( "MQIC32" ) ;
sconn = ( void (*) ( char *, MQHCONN *, MQLONG *, MQLONG * ) )
        GetProcAddress ( hinstLib, "MQCONN" ) ;
sopen = ( void (*) ( MQHCONN, MQOD *, MQLONG, MQHOBJ *, MQLONG *, MQLONG * ) )
        GetProcAddress ( hinstLib, "MQOPEN" ) ;
sconn ( QMName, & Hcon, & CompCode, & CReason ) ;
sopen ( Hcon, & od, O_options, & Hobj, & OpenCode, & Reason ) ;
From Performance Tuning for Linux Servers
What is Interprocess Communication?
Interprocess communication allows processes to synchronize with each other and exchange data. In general, System V (SysV) IPC facilities provide three types of resources: semaphores, shared memory segments, and message queues.
In addition to these resources, IPC pipes and FIFOs are among the most commonly used IPC facilities in UNIX-based systems.
Almost all Linux distributions include the ipcs command, which provides information about the IPC resources that are currently loaded on the system. ipcs lets you determine the current IPC limits that the system allows and also lets you check the status of the three IPC resources that are currently in use on the system. For example, if your application fails to start, you can check the IPC usage on your system to determine if an IPC limit has been exceeded. To determine the status of the system's IPC resources, as root, issue the ipcs command with the -u option:
# ipcs -u
To determine the limits on the IPC resources reported by the ipcs -u command, use the ipcs -l command:
# ipcs -l
Beginning with the 2.4 kernel, Linux supports the dynamic modification of most of the IPC parameters through either the /proc file system interface or with the sysctl facility.
# echo 1024 > /proc/sys/kernel/msgmni
# sysctl -w kernel.msgmni=1024
The basic profiling tools in Linux are the -p (profile) and -pg (profile for gprof) options in gcc, and the prof and gprof utilities. Compiling using -p or -pg causes gcc to insert instructions necessary to obtain profiling information into the object code.
Running the prof command with the application allows you to then obtain the following:
Running the gprof command with the application gathers (among other information) the following:
Because gprof includes the descendents of a procedure in its timings, it is more useful for procedures calling library routines.
All the compiler optimizations and hand-coded tuning methods in a programmer's toolbox do not matter as much in a program's performance as the proper choice of solution for a given problem. The rest of this section examines a sample network-based program and some of the design choices that can lead to a well-performing application.
To understand some of the choices, we will describe a sample program that will demonstrate some of the issues concerning programming and optimal performance. We will look at a typical problem to understand some performance issues.
Programmers working on network-based solutions must deal with a number of issues, some of which we will demonstrate here by focusing on a simplified multithreaded file I/O Internet server.
We intend to demonstrate the use of threading and sockets, show reasonably high-performance programming techniques, and measure the performance of the resulting Internet page server. Note that we are spelling internet with a lowercase i here because we are not processing HTTP.
Our program will consist of two parts: a client part and a server part. The server will initialize itself and await a command from a client. The server will have the following usage:
server [-t nnn] [-pooling] [ip_address]:portnum
where -t indicates the number of threads to use (the default is 1). The -pooling option instructs the server to use a pool of threads rather than create and destroy a thread for each incoming work item.
The client has the following usage:
c [-t nnn] -filesize ppp -blksize mmm -qsize jjj [ip_address]:portnum
Here, we can instruct the client to use nnn threads and create files of size ppp using a block size of mmm. Finally, we instruct the client to do this jjj times. Both client and server use the TCP/IP address [ip_address]:portnum. The meaning here is that if the ip_address is unspecified, the program (server or c) uses the local machine address.
Thus, the following command starts the server program, allows the use of up to eight worker threads, and listens on port 4001:
server -t 8 :4001
The command
c -t 4 -qsize 1000 -filesize 2m -blksize 8192 :4001
connects to the server located at port 4001 on the local machine to perform 1,000 operations, each consisting of a file create, writing 2 million (2*1024*1024) bytes, reading 2 million bytes (8192 bytes at a time), sending 2 million bytes (8K at a time) over the TCP/IP connection, and, finally, closing the file and deleting it. By specifying a complete TCP address as follows:
c -t 4 -qsize 1000 -filesize 2m -blksize 8192 10.0.0.4:4001
the client program can be run from any computer that can communicate via TCP/IP with the server program.
The server and client could be written in most of the languages listed earlier and others not listed. However, rather than digress into the reasons why one language is or is not appropriate, let me just say that the C and C++ languages give access to the system entry APIs in Linux (the system calls). These "system call" APIs provide operating system services used by high-performance programs. We will direct our attention to some of the system call primitives documented in section 2 of the manual.
Our program's responsibility will be to accept the command-line parameterization, set up to accept socket connections and distribute the resulting requests across the available threads. Further, we will design the server portion and the client portion to simply be parts of the same source code. When compiled, if the resulting executable is named server, it performs the server actions; otherwise, it performs client actions.
The server portion requires several threads. The first thread initializes all other threads and then listens for incoming requests and queues them. The second thread schedules work found in the queue. The third thread cleans up threads that terminate.
Pseudocode for the first server thread would look like the code shown here:
Server Main
Parse options
Create listening socket
Start cleanup thread
Start threadscheduler thread
Forever {
Accept new socket
If quit break
Queue new socket
}
exit
Here, the actual socket file descriptor is the object that is queued.
Two other threads must be described.
The cleanup thread has pseudocode, as shown here:
Thread Cleanup
While threadactive {
Locate active thread
"wait" for thread // cleans up
}
exit
The thread scheduler has pseudocode, as shown here:
Thread scheduler
While qsize > 0 {
Locate inactive thread
Start inactive thread with work item
}
exit
The thread scheduler sets a thread to work on the queued socket.
Surrounding the basic work to be done is the mild protocol,
which accepts a text command from the client (over the socket),
parses the command, executes the actual work,
and finally sends a "DONE" message back to the client.
Code that wraps around the actual worker thread is shown here:
Thread wrapper
Accept work assignment from socket
Parse text command
Mark myself as "active"
Do real work
Mark myself as "inactive"
Reply to socket with an "DONE" message
We have described the server-side pseudocode that supports an almost arbitrary
threaded work item.
Before going into the details of what is involved in using Linux
and gcc to create a program,
let me also describe the client that issues the "commands" to this server
and produces the timed results.
The client must be able to cause all this work to happen and report the timing of the results. If the server is multithreaded, on a multiprocessor, for a given number of threads, the results should improve as we add processors. Our server program will demonstrate that by scaling as we add processors. The client pseudocode has the following command-line behavior:
client -t 8 -qsize 1000 -filesize 2m -blksize 8192 :4001
The command-line options have the same meaning as for the server. When the client receives the last "DONE" message from a server thread, it prints a summary timing of the entire operation. The client pseudocode is shown here:
Client
Parse command line options
Put all work into a queue
Start each thread; each will pull something off the queue until the queue is empty.
We will demonstrate some of the performance choices we must make to write a program on Linux. From this brief description, we will go into more detail on the programming aspects and how to measure the resulting performance. The client thread has the pseudocode shown here:
Client Thread
While there is work to be done
Get work item off the work queue
Send text string describing work to server
Read and check data from server until done
Wait for "DONE"
Send 'Q'
Close socket
The client performs all the timing and reports the results.
The choice of C or C++ rather than other languages is due to the personal experiences of the author.
Others who have experience in other languages might use FORTRAN or Python. However, more than the individual experiences of one person should be considered. Most programs are part of a product and require multiple people to write. Programming language selection should be based on the experiences of the participants in the programming team, the supportability of the language, and the debugging tools available. C and C++ are safe choices. On Linux, the C compiler that comes with Linux is gcc, the GNU Compiler Collection.
We will go through the various issues one by one and include snapshots of code. The complete program is included in the appendix.
Our program is designed as a single source file. When invoked, it asks whether it is named server. If it is, it performs the server functions. If it is not named server, it behaves like a client. The point of including the entire source in a single program is to simplify the building of the program and copying it to other systems. It is not necessarily the best choice. The question is asked as shown here:
char *p;
if (strchr(av[0], '/'))
    p = strrchr(av[0], '/') + 1;  /* basename: skip past the last '/' */
else
    p = av[0];
if (equal(p, "server"))
    ServerMain();
else
    ClientMain();
This coding style is a convenience.
The example brings us to another simplification
that really represents personalization of code.
Many programmers have idiosyncratic mechanisms
that are included in every program they write.
There are also reasons to define shortcuts of common names
for things that are different on different platforms.
In multiple-person developments, the shortcuts can interfere
with rapid understanding of the code if they become too plentiful.
This shortcut list is reasonably small,
but to have a group of people embrace it would require a discussion and consensus.
Following are the shortcuts we used:
# define SLASHC '/'
# define SLASHSTR "/"
# define SOCKTYPE int
# define SLEEP(x) sleep(x)
# define Errno errno
# define BADSOCK -1
# define LCK pthread_mutex_t
# define SEMA_T sem_t // (man sem_init)
# define YIELD Yield()
# define SOCKET int
# define SOCKERR -1
# define EXITTHREAD() return
# define INT64 long long
# define UINT64 unsigned long long
typedef pthread_t THREAD_T;
# define TVAL struct timeval
# define equal !strcmp
# define equaln !strncmp
It can be noted that converting these shortcuts to other platforms
that support C/C++ is trivial.
The tstart(), tend(), and tval() routines provide a mechanism for recording time in the microsecond range. These routines are implemented by using the gettimeofday() routine on Linux. The documentation for gettimeofday() is accessed with the following command: man gettimeofday
Our version here is reentrant. tstart() and tend() record their values in a location specified by an input pointer. The design supports reentrancy, which is important when designing a multiple-thread application.
One issue with timing routines is the possibility that during a measurement session, someone or some program changes the system time. If that happens, you could get results that are not repeatable. For that reason, we make two recommendations:
With these two guidelines, it is unlikely that you will get into a lot of trouble.
The timing routines work like a stopwatch. To time something, the following sequence is used:
TVAL ts, te ;
double t ;
tstart ( & ts ) ;
do_something() ;
tend ( & te ) ;
t = tval ( & ts, & te ) ;
printf ( "do_something() took [%8.5f] seconds.\n", t ) ;
The code for the timing routines looks like this:
void tstart(struct timeval *t)
{
gettimeofday(t, NULL);
}
void tend(struct timeval *t)
{
gettimeofday(t,NULL);
}
double tval(struct timeval *tv1, struct timeval *tv2)
{
double t1, t2;
t1 = (double)tv1->tv_sec +
(double)tv1->tv_usec/(1000*1000);
t2 = (double)tv2->tv_sec +
(double)tv2->tv_usec/(1000*1000);
return t2-t1;
}
Using the gettimeofday() system call
allows resolution to one millionth of a second on an Intel-based Linux machine.
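The routines above can be exercised end to end with a small self-contained demo (our own sketch; usleep stands in for do_something()):

```c
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

void tstart(struct timeval *t) { gettimeofday(t, NULL); }
void tend(struct timeval *t)   { gettimeofday(t, NULL); }

double tval(struct timeval *tv1, struct timeval *tv2)
{
    double t1 = (double)tv1->tv_sec + (double)tv1->tv_usec / (1000 * 1000);
    double t2 = (double)tv2->tv_sec + (double)tv2->tv_usec / (1000 * 1000);
    return t2 - t1;
}

/* Time a ~100 ms sleep and return the measured interval in seconds. */
double time_a_sleep(void)
{
    struct timeval ts, te;
    tstart(&ts);
    usleep(100 * 1000);          /* stand-in for do_something() */
    tend(&te);
    return tval(&ts, &te);
}
```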
Linux supports Berkeley sockets. Our program takes an input command-line parameter and converts it to an IP address and port number, suitable for use with the sockets library. For connection (client) or listening (server), both an IP address and a port number are required. The TcpParse(), ipaddress(), and portnum() routines convert an ASCII string to an IP address or port number. All three require reasonably precise input. If any encounters an error, it prints an error message and causes the program to exit.
Our discussion won't be a tutorial on sockets. We assume you are familiar with the basics of programming sockets using the AF_INET family of protocols. We also assume our interests here are in stream-oriented sockets as opposed to datagrams (TCP versus UDP).
There are two ways to use sockets. The first is to listen for incoming connections. The second is to initiate an outgoing connection. On Linux, sockets can be used between two threads on the same machine or can be directed to a program on another machine; the distinction is only at the command-line level, where the IP address and port number of the server software are located. For our testing and demonstration purposes, we confine ourselves to a single machine. When the program is fully described, we make test runs between two different machines.
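Both roles can be seen together in one small self-contained sketch (ours, not the book's code): the process listens on an ephemeral loopback port, forks a child that connects and sends a message, and the parent accepts the connection and reads it back.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a "ping" message makes the loopback round trip. */
int loopback_roundtrip(void)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    char buf[8] = {0};

    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    if (lsock < 0) return 0;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                      /* let the kernel pick a port */
    if (bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) < 0) return 0;
    if (listen(lsock, 5) < 0) return 0;
    if (getsockname(lsock, (struct sockaddr *)&addr, &alen) < 0) return 0;

    pid_t pid = fork();
    if (pid == 0) {                         /* child: the "client" side */
        int csock = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(csock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            _exit(1);
        write(csock, "ping", 4);
        close(csock);
        _exit(0);
    }
    int conn = accept(lsock, NULL, NULL);   /* parent: the "server" side */
    if (conn < 0) return 0;
    ssize_t n = read(conn, buf, sizeof(buf) - 1);
    close(conn);
    close(lsock);
    waitpid(pid, NULL, 0);
    return n == 4 && strcmp(buf, "ping") == 0;
}
```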
The server portion of our program creates a socket and listens for connections. It does this in a listen-and-accept thread, whose only responsibility is to start new thread work. Server code creates a socket, performs a listen() on it, and then accepts new connections with accept(). The accept action creates a new socket that is handed off to a thread to perform work. The newly created socket is a bidirectional communication channel (a full socket). The following shows the code for the listen and accept routine:
void listen_and_accept(SOCKTYPE *sock)
{
    int rc;
    SOCKTYPE sock3;
    static SOCKTYPE sock1;
    static struct sockaddr_in addr1;
    struct sockaddr_in addr2;
    int addr2len;
    static int first = 1;

    if(first) {
        addr1.sin_family = AF_INET;
        addr1.sin_addr.s_addr = naddr;  // global
        addr1.sin_port = port;          // global
        sock1 = socket(AF_INET, SOCK_STREAM, 0);
        if(sock1 == BADSOCK) {
            printf("socket FAILED: err=[%d].\n", Errno);
            exit(1);
        }
        rc = bind(sock1, (const struct sockaddr *)&addr1,
                  sizeof(addr1));
        if(rc == SOCKERR) {
            printf("bind FAILED: err=[%d].\n", Errno);
            exit(1);
        }
        rc = listen(sock1, 5);
        if(rc) {
            printf("Listen FAILED: err=[%d].\n", Errno);
            exit(1);
        }
        first = 0;
    }
    addr2len = sizeof(addr2);
    sock3 = accept(sock1, (struct sockaddr *)&addr2,
                   (socklen_t *)&addr2len);
    if(sock3 == BADSOCK) {
        printf("Accept FAILED: err=[%d].\n", Errno);
        exit(1);
    }
    *sock = sock3;
}
In this code segment, we have decided that all unexpected results should cause program termination. Without a detailed analysis of why each error might occur, this method makes for a predictable program, notwithstanding errors. Continuing in the presence of any of the preceding errors is difficult; passing a failure return value back to the calling routine simply compounds the problem by making the caller attempt to discover what went wrong. By printing a message and exiting the program immediately, discovering an unexpected result focuses programming efforts precisely where the problem occurred. We have found over the years that correctness and debuggability are more important than performance, and our code tries to reflect this point of view.
Servers listen and accept while clients connect. The following shows another wrapper program to perform the details of establishing a connection:
extern int econnrefuseretries;

SOCKTYPE clientconnect(int *per_client_refusecnt)
{
    SOCKTYPE sock2;
    struct sockaddr_in addr1;
    int refusedcount;

    addr1.sin_family = AF_INET;
    addr1.sin_addr.s_addr = naddr;
    addr1.sin_port = port;
    sock2 = socket(AF_INET, SOCK_STREAM, 0);
    if(sock2 == BADSOCK) {
        printf("socket FAILED: err=[%d].\n", Errno);
        exit(1);
    }
    refusedcount = 0;
    while(connect(sock2, (struct sockaddr *)&addr1,
                  sizeof(addr1))) {
        int err;
        err = Errno;
        if(err != ECONNREFUSED) {
            printf("connect FAILED: err=[%d].\n", Errno);
            exit(1);
        }
        SLEEP(2); // Be polite
        (*per_client_refusecnt)++;  // increment the count, not the pointer
        if(refusedcount++ >= econnrefuseretries) {
            printf("connect FAILED: ");
            printf("after %d ECONNREFUSED attempts\n", econnrefuseretries);
            exit(1);
        }
    }
    return sock2;
}
These two code segments bear some discussion. The listen_and_accept() routine is written to be used as simply as possible; it either returns a usable socket or prints an error and exits. The clientconnect() routine likewise either returns a usable socket or prints an error and exits. It processes the ECONNREFUSED error return in an attempt to deal with a server that is too busy to accept the socket request. That happens to any listen-and-accept loop when the queue of requests in the operating system grows longer than the default of 5. For instance, if a server issues a listen_and_accept(), receives an incoming socket, and then simply goes to sleep, succeeding incoming requests are queued by the operating system until there are five of them. The sixth connection request is refused, and the ECONNREFUSED error is reported by the client issuing a connect request (clientconnect() in our case). Our code simply sleeps for 2 seconds and tries again. We have allowed, by default, four retries. The number can be changed by editing the program and recompiling. Alternatively, it would be trivial to add a new option that allows the value to be set on the command line.
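Such an option could be wired in with a few lines of argument scanning. A minimal sketch, assuming the book's global econnrefuseretries and a hypothetical -r flag (the flag name and helper are invented here, not part of the original program):

```cpp
#include <cstring>
#include <cstdlib>

int econnrefuseretries = 4;   // the book's default of four retries

// Hypothetical "-r N" command-line option overriding the retry count.
// Flags it does not recognize are simply ignored.
void parse_retry_option(int argc, char **argv)
{
    for (int i = 1; i + 1 < argc; i++) {
        if (strcmp(argv[i], "-r") == 0)
            econnrefuseretries = atoi(argv[i + 1]);
    }
}
```

Invoked as `client -r 9 ...`, this would allow nine retries instead of four without recompiling.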
We have described the actions required to create socket connections. The details of socket connections are seldom the source of performance problems. It is what happens after the socket is created that becomes interesting.
Because our program uses threads and demonstrates thread pooling, we must describe how threads work and our usage of them. We don't claim that any of the following code is the best-performing code. We do claim that the code demonstrates how threads can be used in both a pooled and nonpooled manner, and that the code is reasonably good. No doubt, improvements are possible.
Recall our earlier description of the server main program, in which a scheduler thread and a cleanup thread were created; we now show these two modules. There is a third module we haven't yet mentioned: the thread that listens for new work to queue. The work is the newly created socket, and it is the socket that is queued. Our worker program reads a text string from the socket containing the file size and block size to use in doing the work. Other experiments with this code could replace our worker thread with one written for almost any purpose whatsoever. To summarize, the following takes place:
Surrounding the worker thread are the fixtures to report back to the client the completion of the task.
The thread listener is coded as shown here:
void ServerMain()
{
    SOCKTYPE sock;

    initq(&workq);
    newThread((void (*)(void *))threadScheduler, 0, &schedulerT);
    newThread((void (*)(void *))threadCleanup, 0, &cleanupT);
    newThread((void (*)(void *))threadDbg, 0, &dbgT);
    //
    // Server waits in a listen/accept sequence and hands off
    // request to a thread. Threads are either dynamically
    // created or there is a pool.
    //
    // listen_and_accept creates sockets
    //
    for(;;) {
        listen_and_accept(&sock);
        enqueue(&workq, sock);
        vsema(&workq.sema);
    }
}
The thread scheduler reads information from a queue. The queue is protected by synchronization primitives, which we discuss later in the chapter. The protections allow multiple threads to update the queue without making the queue metadata inconsistent. The thread scheduler is coded as shown here:
void threadScheduler()
{
    SOCKTYPE sock;
    int rc;
    //
    // threadScheduler decrements availableThreads,
    // threadCleanup increments availableThreads.
    //
    for(;;) {
        schedulerstate = 1;
        psema(&workq.sema);
        schedulerstate = 2;
        psema(&availableThreads);
        schedulerstate = 3;
        if(dequeue(&workq, (int *)&sock)) {
            schedulerstate = 4;
            rc = threadStart(sock); // starting a thread
            if(rc == -1) {
                printf("threadStart FAILED: maxthreads=%d\n", maxthreads);
                fflush(stdout);
                exit(1);
            }
        }
        else {
            printf("Workq Sema count wrong\n");
            exit(1);
        }
    }
}
Thread cleanup is coded as shown here:
void threadCleanup()
{
    int j = 0;
    //
    // For pooled threads, this loop will last forever.
    //
    for(;;) {
        cleanupstate = 1;
        psema(&ActiveT);
        cleanupstate = 2;
        for(j = 0; j < maxthreads; j++) {
            if(threads[j].exists) {
                threadWait(&threads[j].thrd);
                threads[j].exists = 0;
                threads[j].active = 0;
                nexists--;
                vsema(&availableThreads);
            }
        }
    }
}
These three modules require some background discussion. The listener module simply accepts new sockets and queues them. It is simple, and the only thing to think about is that it must queue the work consistently; therefore, it uses the enqueue() routine. The thread scheduler must create threads or allocate existing (pooled) threads. Threads are created with a pthread_create() call. Because our program wants to start a thread that either already exists (a pooled thread) or must be freshly created, we have chosen to abstract the thread-starting process with the threadStart() routine. threadStart() uses the pooling global variable to determine whether to use a pooled thread or to simply start a new thread. The pooling global defaults to no pooling and is a command-line option.
The command line tells the server and client how many threads to use. The default is one. The server runs no more than the number of threads specified on the command line at one time. This works in the same way for the client. Each server thread performs its work and ends. A pooled thread ends by marking itself as inactive and blocking on a lock. An unpooled thread simply returns, thus destroying the thread. For nonpooled threads, the cleanup routine is essential if the system is not to run out of resources. It is the cleanup routine that returns the resources used by nonpooled threads to the system.
We add one further observation about our thread code. Writing threaded code that schedules work using pooled threads is not exactly trivial, and determining why things are not working properly can be time-consuming. We demonstrate here one method of making the debugging task simpler. Our thread management routines (scheduler and cleanup) each have a global state variable. The state variables represent the current state of each routine and are accessible from all threads in the program. They are changed just prior to any function call that could block. By printing these two state variables, we can determine exactly where in the code the scheduler and cleanup threads are currently blocked. We wrote a trivial debug thread that prints the values of these state variables, and with it the debugging task was significantly accelerated.
The scheduler uses synchronization primitives (semaphores) to block when there is nothing to do. Semaphores are used as a synchronized counting mechanism where, if there are things to do, the number of things to do is reflected in the value of the semaphore. If the count is greater than zero, the psema() routine decrements the count by one and returns to the calling program. If the value of the semaphore is zero, the psema() routine blocks. The vsema() operation increments the semaphore and never blocks. Our psema() and vsema() routines are based on the semaphore interfaces defined in section 3 of the manual (man 3 sem_init). Our interfaces take a name that enables us to debug them with more clarity: we can print the name of the semaphore we are examining by inserting appropriate print statements. We use one semaphore to count the number of available threads (either pooled or nonpooled), one to count the number of tasks (jobs, work items, or whatever we want to call them), and one to count the number of active threads.
The semaphore interfaces are defined as shown here:
void initsema(PVSEMA *s, int initvalue, int hi, char *name)
{
    if(sem_init(&s->pv, 0, initvalue) == -1) {
        printf("sem_init() FAILED: sema=[%s] err=[%d].\n", name, Errno);
        exit(1);
    }
    s->name = name;
}

void psema(PVSEMA *s) // decrements, or blocks while s == 0
{
    sem_wait(&s->pv);
}

void vsema(PVSEMA *s) // increments; never blocks
{
    int rc;
    rc = sem_post(&s->pv);
    if(rc == -1) {
        printf("sem_post FAILED: err=[%d].\n", Errno);
        exit(1);
    }
}
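The counting behavior described above can be verified with a short, single-threaded sketch built directly on the POSIX calls the wrappers use (sem_init, sem_wait, sem_post). sem_trywait stands in for the blocking case so the demo stays deterministic; this is an illustration of the semantics, not code from the book:

```cpp
#include <semaphore.h>
#include <cerrno>

// Returns 1 if the counting semantics behave as described in the text:
// a semaphore initialized to 2 admits two waits, then refuses a third
// (nonblocking) wait, and admits one again after a post.
int sema_counting_demo(void)
{
    sem_t s;
    if (sem_init(&s, 0, 2) == -1)        // count starts at 2
        return 0;
    if (sem_wait(&s) != 0) return 0;     // 2 -> 1
    if (sem_wait(&s) != 0) return 0;     // 1 -> 0
    if (sem_trywait(&s) != -1 || errno != EAGAIN)
        return 0;                        // count is 0: psema() would block here
    if (sem_post(&s) != 0) return 0;     // vsema(): 0 -> 1, never blocks
    if (sem_trywait(&s) != 0) return 0;  // 1 -> 0, succeeds again
    sem_destroy(&s);
    return 1;
}
```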
Finally, there are the queuing operations. Their job is to protect the data structures describing the queue of jobs coming in. The queue is operated with enqueue() and dequeue() operations. To support the queuing and dequeuing of data, lck and unlck implement locks on specified objects (the work queue in this case). We will show the enqueue routine and the associated dequeue routine.
You might ask why there are two counting mechanisms. The semaphore counting mechanism is specifically for counting. Because the numbers of threads (active and available) are strictly numbers, the only mechanism we need is counting. The queuing routines were written for more general queuing where objects could be queued. In this case, they require more than a simple count. For our particular demonstration here, they have been detuned to queue only integers (socket file descriptors); thus, they are similar in function to the semaphore operations.
That said, the following shows the queue initialization and the enqueuing routines:
void initq(queue_t *q)
{
    initsema(&q->sema, 0, workqsize, qsema);
    q->val = (int *)Malloc(workqsize * sizeof(int));
    memset(q->val, '\0', workqsize * sizeof(int));
    q->qmax = workqsize;
    q->qcnt = 0; // enqueue() increments this; start it at zero
    initlck(&q->lck, workqname);
    q->head = q->tail = 0;
}

//
// This version stops queuing when it bumps
// into its own tail.
//
int enqueue(queue_t *q, int val)
{
    int ret = 1;
    int h;

    lck(&q->lck);
    h = q->head + 1;
    if(h == q->qmax)
        h = 0;
    if(h != q->tail) {
        q->val[h] = val;
        q->head = h;
        q->qcnt++;
    }
    else
        ret = 0;
    unlck(&q->lck);
    return ret;
}
The dequeue routine is similar.
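Since the dequeue counterpart is only described as similar, here is a self-contained sketch of the pair (struct and lock inlined so the fragment compiles on its own). The dequeue body is our reconstruction, matching the enqueue convention above: the mutex is held while head or tail advances, and head == tail means empty:

```cpp
#include <pthread.h>

#define QSIZE 8

typedef struct {
    pthread_mutex_t lck;
    int val[QSIZE];
    int head, tail;      // head == tail means the queue is empty
    int qmax;
} miniq_t;

void mq_init(miniq_t *q)
{
    pthread_mutex_init(&q->lck, NULL);
    q->head = q->tail = 0;
    q->qmax = QSIZE;
}

int mq_enqueue(miniq_t *q, int v)   // mirrors the book's enqueue()
{
    int ret = 1, h;
    pthread_mutex_lock(&q->lck);
    h = q->head + 1;
    if (h == q->qmax) h = 0;
    if (h != q->tail) { q->val[h] = v; q->head = h; }
    else ret = 0;                   // full: bumped into our own tail
    pthread_mutex_unlock(&q->lck);
    return ret;
}

int mq_dequeue(miniq_t *q, int *v)  // reconstructed counterpart
{
    int ret = 1, t;
    pthread_mutex_lock(&q->lck);
    if (q->tail != q->head) {
        t = q->tail + 1;
        if (t == q->qmax) t = 0;
        *v = q->val[t];
        q->tail = t;
    } else
        ret = 0;                    // empty: nothing to hand out
    pthread_mutex_unlock(&q->lck);
    return ret;
}
```

Items come out in FIFO order, and dequeuing from an empty queue returns 0, matching the scheduler's "Workq Sema count wrong" check.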
The locking (lck and unlck) routines are built from Posix thread mutexes. Mutexes are efficient thread synchronization primitives defined in the Posix thread library. They support mutual exclusion between threads, but not between processes. The Posix thread library supports three kinds of mutex:
The FAST variant does not allow reentry into the mutex by the calling thread. Thus, if thread A locks using a pthread_mutex_lock and it attempts to lock the same mutex a second time, it blocks. Depending on the program's design, this might produce a deadlock. The RECURSIVE kind allows multiple pthread_mutex_lock() calls within a thread. However, for each pthread_mutex_lock() call, there must be a corresponding pthread_mutex_unlock() call. Finally, the ERRORCHECK kind returns an error if thread A attempts to lock a mutex more than once.
Our lock and unlock calls use the FAST kind because our design is such that if a thread calls a mutex lock more than once, it is an error in our logic. Debugging such a design is more easily accomplished if the ERRORCHECK kind is substituted during development. The following shows the lock and unlock primitives:
void initlck(LCK *l, char *name)
{
    //
    // The Linux default is a "fast" mutex. A "fast" mutex
    // deadlocks when the same thread locks it twice.
    //
    pthread_mutex_init(l, NULL);
}

void lck(LCK *l)
{
    int err;
    err = pthread_mutex_lock(l);
    if(err != 0) {
        printf("pthread_mutex_lck FAILED: err=[%d].\n", err);
        exit(1);
    }
}

void unlck(LCK *l)
{
    pthread_mutex_unlock(l);
}

int islocked(LCK *l)
{
    //
    // trylock returns EBUSY if the mutex is held. If it
    // succeeds, it has acquired the lock as a side effect,
    // so release it before reporting "not locked".
    //
    if(pthread_mutex_trylock(l) == EBUSY)
        return 1;
    pthread_mutex_unlock(l);
    return 0;
}
Of these primitives, only the mutexes from the thread library are based on the mutual-exclusion instructions of the native processor. They should be significantly faster than interprocess synchronization primitives, because frequently no system call is required to execute them. Contrast that with a semaphore operation, which must maintain a counter that is visible to all processes: to either increment or decrement the counter, the interface must issue a system call, necessitating a transition into the operating system proper. For this reason, mutexes are generally recognized as delivering high performance; or, to put it another way, as taking less time to execute.
Using the timing primitives described previously, it is trivial to make a program that executes millions of calls to the interface and prints the time it takes to do it. One of the authors has done this and published the results for Red Hat 7.0. The results are reproduced here:
| Interface | Linux 2.4.2 (microseconds per call) |
|---|---|
| SVR5 Semaphores | 1.828 |
| Posix Semaphores | 0.487 |
| Pthread_mutex | 0.262 |
This measurement was made on a ThinkPad 600X (650 MHz, 512 MB memory). (SVR5 semaphores are a second variety of semaphores, older in design and, as the table shows, a bit slower.) Documentation for SVR5 semaphores can be seen with the man semop command. Documentation for the Posix semaphores can be found with man sem_init, and documentation for the Posix thread mutexes can be seen using the man pthread_mutex_init command.
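A sketch of the kind of microbenchmark behind the table, assuming gettimeofday() as the clock (as in the timing routine shown earlier); it returns microseconds per uncontended pthread_mutex lock/unlock pair. This is our illustration, not the authors' benchmark, and absolute numbers on modern hardware will differ wildly from the 2001-era figures above:

```cpp
#include <pthread.h>
#include <sys/time.h>

// Times 'iters' lock/unlock pairs on an uncontended mutex and
// returns the average cost in microseconds per pair.
double mutex_cost_usec(long iters)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    struct timeval tv1, tv2;

    gettimeofday(&tv1, NULL);
    for (long i = 0; i < iters; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    gettimeofday(&tv2, NULL);

    double t1 = (double)tv1.tv_sec + (double)tv1.tv_usec / 1e6;
    double t2 = (double)tv2.tv_sec + (double)tv2.tv_usec / 1e6;
    return (t2 - t1) * 1e6 / (double)iters;
}
```

Calling mutex_cost_usec(1000000) and printing the result reproduces the flavor of the table's Pthread_mutex row.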
Our file I/O is quite simple. We want to create a file of arbitrary length, write data into it, read it back, and check the data. As we read the data from the file, it is sent over the internet to the client who requested it. We invent data on the fly. Basically, we write blocks of data into a file until it reaches the requisite size. We also write the page number or the block number into each block. Thus, each block is self-identifying to some extent. The following shows the code within the worker thread that creates, writes, reads, closes, and removes the file:
int fd;

fd = open(namebuf, O_RDWR | O_CREAT, S_IRWXU);
if(fd == -1) {
    printf("open [%s] FAILED: err=[%d].\n", namebuf, Errno);
    exit(1);
}
// writeFile();
pageno = 0;
bytesleft = fsz;
while(bytesleft > 0) {
    cnt = (bytesleft < bsz) ? bytesleft : bsz;
    if(cnt > 4)
        memcpy(buf, &pageno, sizeof(pageno));
    if(write(fd, buf, cnt) != cnt) {
        printf("write %d bytes FAILED: err=%d\n", cnt, Errno);
        exit(1);
    }
    pageno++;
    bytesleft -= cnt;
}
// readFile();
lseek(fd, 0L, SEEK_SET);
bytesleft = fsz;
rpageno = 0;
while(bytesleft > 0) {
    cnt = (bytesleft < bsz) ? bytesleft : bsz;
    if(read(fd, buf, cnt) != cnt) {
        printf("read %d bytes FAILED: err=%d\n", cnt, Errno);
        exit(1);
    }
    if(cnt > 4) {
        if(0 != memcmp(buf, &rpageno, sizeof(rpageno))) {
            printf("Read Compare ERROR: rpageno=%d", rpageno);
            printf(" buf[0] = %x %x %x %x\n",
                   buf[0] & 0xFF,
                   buf[1] & 0xFF,
                   buf[2] & 0xFF,
                   buf[3] & 0xFF);
            exit(1);
        }
    }
    rc = Send(sock, buf, cnt, 0);
    if(rc != cnt) {
        printf("th[%d]: SERVER: Send FAILED: rc=%d err=%d\n",
               th, rc, Errno);
        exit(1);
    }
    rpageno++;
    bytesleft -= cnt;
}
// closeFile();
if(close(fd) == -1) {
    printf("close FAILED: err=%d\n", Errno);
    exit(1);
}
// deleteFile();
if(unlink(namebuf) == -1) {
    printf("unlink <%s> FAILED: err=%d\n", namebuf, Errno);
    exit(1);
}
As you can see from this example, the code is straightforward. After writing all the data to the file, the program uses lseek() to return to the beginning of the file, where it begins reading the file. Each block of the file is checked for correctness (a trivial check, admittedly) and then is transmitted to the client program (the client program also checks the data). There is nothing complex about this code, other than the fact that the file size and the block size can be parameterized.
The client portion of the code uses some of the facilities previously discussed. Whether or not the server configures itself to use pooled threads, the client simply starts the number of threads specified on the command line and waits until all the work is completed.
Each client thread runs in a loop: it takes a single item off the work queue, sends it to the server, receives all the data the server transmits back, checks the data as it arrives, and finally closes the socket. The client thread then takes another item off the work queue, starting the same process all over again. Each client thread continues these operations until the work queue is empty. As each client thread finds the queue empty, it exits.
This particular design methodology represents an asynchronous approach to problem solving. A thread is dedicated to each work item. If any particular work item takes a longer amount of time, the remaining threads continue emptying the queue. Using this methodology, computing fractal pictures could easily be optimized where some pixels take millions of iterations to complete and some take less than 100. The difficult pixels would occupy a thread, while many trivial pixels (fractal points with a small number of iterations) would be completed by the remaining threads.
The client code loops look like this:
while(workqcnt > 0) {
    lck(&workqcntL);
    if(workqcnt == 0) {
        unlck(&workqcntL);
        break;
    }
    workqcnt--;
    unlck(&workqcntL);
    sock = clientconnect(&tp->refusecnt);
    rpageno = 0;
    // Send command to server.
    sprintf(tp->cbuf, "F,%d,%d", filesize, fileblksz);
    rc = Send(sock, (char *)tp->cbuf, CBUFSIZE, 0);
    if(rc != CBUFSIZE) {
        printf("\tCLIENT[%d]: CBUF Send FAILED: err=%d\n", th, Errno);
        exit(1);
    }
    tp->sndbytes += CBUFSIZE;
    // Set buffer size using thread-safe reMalloc().
    if(fileblksz > threads[th].bufmax) {
        threads[th].buf = (char *)reMalloc(threads[th].buf, fileblksz);
        threads[th].bufmax = fileblksz;
    }
    threads[th].bufsiz = fileblksz;
    buf = threads[th].buf;
    // Receive filesize bytes from server and check contents.
    bytesleft = filesize;
    while(bytesleft > 0) {
        cnt = (bytesleft < fileblksz) ? bytesleft : fileblksz;
        rc = Recv(sock, buf, cnt, 0);
        if(rc == SOCKERR) {
            printf("CLIENT: th[%d]: Recv failed: rc=%d err=%d\n", th, rc, Errno);
            exit(1);
        }
        else if(rc == 0)
            break;
        if(rc > 4) {
            if(0 != memcmp(&rpageno, buf, 4)) {
                printf("CLIENT: compare error on pageno %d", rpageno);
                printf(" buf[0] = %x %x %x %x\n",
                       buf[0] & 0xFF,
                       buf[1] & 0xFF,
                       buf[2] & 0xFF,
                       buf[3] & 0xFF);
                exit(1);
            }
        }
        tp->rcvbytes += rc;
        rpageno++;
        bytesleft -= rc;
    }
    // Wait for "DONE" from server.
    rc = Recv(sock, (char *)tp->cbuf, 4, 0);
    if(rc != 4 || !equaln(tp->cbuf, "DONE", 4)) {
        printf("\tCLIENT[%d]: Recv 'DONE' FAILED: rc=%d err=%d\n", th, rc, Errno);
        fflush(stdout);
        exit(1);
    }
    tp->rcvbytes += 4;
    // Send 'Q'.
    tp->cbuf[0] = 'Q';
    rc = Send(sock, (char *)tp->cbuf, 1, 0);
    if(rc != 1) {
        printf("\tCLIENT[%d]: Send 'Q' FAILED: rc=%d err=%d\n", th, rc, Errno);
        fflush(stdout);
        exit(1);
    }
    tp->sndbytes += 1;
    rc = CLOSESOCK(sock);
    if(rc != 0) {
        printf("\tCLIENT[%d]: close socket %d FAILED: Errno=%d\n", th, sock, Errno);
        exit(1);
    }
}
A number of different synchronization mechanisms have been used here. Counting semaphores are used to simply count available resources. The counter blocks when none is available. Both the thread scheduler and cleanup thread use a semaphore to know when something needs to be done. An earlier cleanup design simply looked for active threads every 2 seconds. Although the performance difference is probably negligible, the resulting design using semaphores leaves the system completely idle when there is nothing to do. (psema, vsema, and initsema are based on Posix semaphores.)
Locking primitives are used to count the number of work items in the client. They are based on a memory variable and a critical-section lock. This design evolved from the queuing primitives described in the next paragraph. The desire to queue millions of items suggested that each should take no memory; therefore, this interface was derived to support decrementing a counter as the mechanism for dequeuing an object. (initlck, lck, and unlck are based on Posix thread mutexes.)
Finally, the third version of synchronization used is to queue objects. A Posix pthread mutex is used to guard an actual memory queue, each element of which can contain a single integer. The integer in our case is a socket file descriptor received from the listen_and_accept() routine in the server's main thread. (initq, enqueue, and dequeue are based on the previously described locking primitives, which in turn are based on Posix pthread mutexes.)
Our design is asynchronous. The server consumes no CPU cycles if there is nothing to do. The client either has something to do or it exits. The client's responsibility is to pass all the work to be done to the server, wait until the server completes all the work, and finally print timing and performance results.
After we have written our program, we want to compile it and run it. Our program is called srv3.cpp. To compile it, the following command line is used:
g++ -O2 -Wall srv3.cpp -lpthread -o server && cp server c

The g++ command is used to assure we are compiling using the strong typing of C++. This particular program uses almost no C++ features, but the strong typing of C++ is used. In the command line above, the -Wall option instructs the compiler to print all warnings. Demanding the strictest possible conformance to excellent programming standards is guaranteed to produce code that requires less debugging and less support. One of the authors has seen software projects remove unknown bugs from programs simply by changing the compilation options to emit more warnings and changing the code to remove the warnings. If a program of any size compiles and executes properly but has never endured the removal of all warnings, we challenge you to go through the effort once. If after the effort you aren't convinced that bugs were removed, we would be quite surprised.
The command line produces two executables. The first is a program called server, and the second is a program called c. The first is our server program, and the other is our client program. (We didn't name it client for fear of colliding with an existing program that might be called client.)
A useful option to the GNU C compiler is the -v option, which causes g++ to print all the intermediate steps it takes to produce the executables. When using the verbose option to g++, g++ does not instruct the linker to also produce verbose output. To do that, the following addition is required:
-Xlinker -verbose

Thus, the following produces the most output (into a file called xx). From it you can see what the compiler and linker are doing:
g++ -O2 -Wall -v -Xlinker -verbose srv3.cpp -lpthread -o server 2>xx
We compiled our program using the Posix thread dynamic link library by specifying -lpthread on the command line. We could have used statically linked libraries. As installed, we could not compile our program using static libraries. That said, why would we want to?
Static versus dynamic libraries is a question whose answer is surprising.
Dynamic link libraries have the following benefits:
The reasons seem compelling, but the other side of the picture leaves the issue open to design. Here are some counterpoints:
The general recommendation is to use dynamic link libraries where possible, using static linking for libraries that may not be found on the destination platform and that cannot be distributed with your application.
Next, we'll show the steps a web server takes in response to a client request. For this purpose, we provide the following pseudocode:
s = socket();                                    /* allocate listen socket */
bind(s, 80);                                     /* bind to TCP port 80 */
listen(s);                                       /* indicate willingness to accept */
while (1) {
    newconn = accept(s);                         /* accept new connection */
    remoteIP = getpeername(newconn);             /* get remote IP addr */
    remoteHost = gethostbyname(remoteIP);        /* get remote IP DNS name */
    gettimeofday(currentTime);                   /* determine time of day */
    read(newconn, reqBuffer, sizeof(reqBuffer)); /* read client request */
    reqInfo = serverParse(reqBuffer);            /* parse client request */
    fileName = parseOutFileName(reqBuffer);      /* determine file name */
    fileAttr = stat(fileName);                   /* get file attributes */
    serverCheckFileStuff(fileName, fileAttr);    /* check permissions */
    open(fileName);                              /* open file */
    read(fileName, fileBuffer);                  /* read file into buffer */
    headerBuffer = serverFigureHeaders(fileName, reqInfo); /* determine headers */
    write(newconn, headerBuffer);                /* write headers to socket */
    write(newconn, fileBuffer);                  /* write file to socket */
    close(newconn);                              /* close socket */
    close(fileName);                             /* close file */
    write(logFile, reqInfo);                     /* write log info to disk */
}
This pseudocode is a relatively simple implementation of a server that does not employ any possible optimizations and can handle only one request at a time. This example only hints at the more complex functionality that is required of the server, such as how to parse the HTTP request and determine whether the client has the appropriate permissions to view the file. In addition, it has no error handling; for example, it does nothing sensible if the client requests a file that does not exist. However, this example gives a good idea of what steps are required of a server.
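The parseOutFileName() step above can be made concrete. A minimal sketch, assuming a well-formed "METHOD SP PATH SP VERSION" request line such as "GET /index.html HTTP/1.0" (the function name mirrors the pseudocode; the implementation is ours):

```cpp
#include <string>

// Extracts the requested path from an HTTP request line such as
// "GET /index.html HTTP/1.0". Returns "" if the line does not
// have the expected METHOD SP PATH SP VERSION shape.
std::string parse_out_file_name(const std::string &reqLine)
{
    std::string::size_type sp1 = reqLine.find(' ');
    if (sp1 == std::string::npos)
        return "";
    std::string::size_type sp2 = reqLine.find(' ', sp1 + 1);
    if (sp2 == std::string::npos)
        return "";
    return reqLine.substr(sp1 + 1, sp2 - sp1 - 1);
}
```

A real server would additionally validate the path (no "..", URL decoding, and so on) before touching the filesystem.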
Many tools are available for evaluating the performance of web servers, also known as workload generators. These are programs that run on client machines, emulating a client's behavior, constructing HTTP requests, and sending them to the server. The workload generator can typically vary the volume of requests it generates, called load, and measures how the server behaves in response to that load. Performance metrics include items such as request latency (how long it took an individual response to come back from the server) and throughput (how many responses a server can generate per second).
Perhaps the most commonly used tool is SPECWeb99. This tool is distributed by the Standard Performance Evaluation Corporation (SPEC) nonprofit organization, whose web site is www.spec.org. This tool is probably the most-cited benchmark, and it is used for marketing purposes by server vendors such as IBM, Sun, and Microsoft. Unfortunately, the tool costs money, although it is available freely to member institutions such as IBM. The benchmark is intended to capture the main performance characteristics that have been observed in web servers, such as the size distribution and popularity of files requested. The tool is considered a macro-benchmark in that it is meant to measure whole system performance.
Another frequently used tool is httperf from HP Labs, which is available freely under an open-source license. This tool is highly configurable, allowing you to stress isolated components of a web server; for example, how well a server handles many idle connections. Thus, it is used more as a micro-benchmark.
Many other tools exist for evaluating web server performance, including SURGE, WebBench, and WaspClient. However, describing them all is outside the scope of this chapter. Nevertheless, many options are available for stressing, testing, and measuring servers, and many of these are freely available.
Also useful to web site operators are log analysis tools. These tools look through the logs generated by the server and report information such as how many visits a site received over a period of time and where the visitors came from. Performance can be optimized when the operator understands how visitors are using a site. Logs are typically kept in a standard format called the Apache Common Log format. Many commercial tools are available; however, two freely available open-source tools are analog and webalizer.
Place the following code in DELAY.H and use #include "delay.h".
Now you need TASM32.exe, don't you?
TASM 2.0 and TLINK 3.01 here
#include <stdio.h>

int main(int argc, char **argv)
{
    int j, iCnt;

    iCnt = 0;
    while(iCnt < argc) {
        printf("Argument (%d) = (%s).\n", iCnt + 1, argv[iCnt]);
        iCnt++;
    } /* endwhile */

    for(j = 0; j < argc; j++)
        printf("\t argv[%d] : (%s).\n", j, argv[j]);

    return 0;
}
if(argc > 1) {
    if((strcmp(argv[1], "?") == 0)
       || (strcmp(argv[1], "/?") == 0)) {
        printf("Hlp - Display some text.\n");
        return FALSE;
    } else {
        n = atoi(argv[1]);
    }
}
Used in \\MQ\Eines\Default_QmgrName\codiX.c
Sample in "delphi"
Pending: convert to GlobalMemoryStatusEx (for machines with more than 4 GB of RAM).
Good article: Customized version of Performance Monitor
3 ways to do it :
Input : IP ; output : MAC.
Requirements : minimum supported client = Windows 2000 Professional ; minimum supported server = Windows 2000 Server.
Header = Iphlpapi.h ;
Library = Iphlpapi.lib ;
DLL = Iphlpapi.dll ;
Our code :
MSDN using IPHLPAPI.H and WINSOCK2.H
Win32::Process::List and Win32::Process::Info, then use Win32::TaskScheduler to kill it.
See BCB5TOOL.HLP in the Help directory for complete instructions.
Configure :
Get the Win32 API reference, \\Fonts\Borland [5.699.752]
API error RC's
There's a difference between interface and implementation. A .h file defines the interface for a group of related functions. That is, it tells you (and the compiler) what functions exist, how to call them, what their arguments and return types are, and so on.

The implementation of a function is the code that defines its execution. This can be in a .c file or a .cpp file. The implementation can also be contained in a .lib file (library) containing already-compiled code; we may provide you with .lib files instead of .cpp files in cases where the implementation source is unimportant or needs to remain a secret.

Either way, when you #include a .h file, you've only told the compiler what functions and variables you later intend to define or implement. You also have to supply these implementations; this is done by adding files (.cpp or .lib) containing the implementations to your project. The compiler compiles your source files one by one, leaving "external" references to things that were mentioned in .h files but never actually defined. Then the linker "links" (combines) the results of the compilation and any libraries in the project, patching up the "external" references.
Hungarian notation: semantics vs. the variable's purpose. Errors.
In general, I have to admit that I'm a little bit scared of language features that hide things. When you see the code

i = j * 5;

in C you know, at least, that j is being multiplied by five and the result stored in i.

But if you see that same snippet of code in C++, you don't know anything. Nothing. The only way to know what's really happening in C++ is to find out what types i and j are, something which might be declared somewhere else altogether. That's because j might be of a type that has operator* overloaded, and it does something terribly witty when you try to multiply it. And i might be of a type that has operator= overloaded, and the types might not be compatible, so an automatic type coercion function might end up being called. And the only way to find out is not only to check the type of the variables, but to find the code that implements that type, and God help you if there's inheritance somewhere, because now you have to traipse all the way up the class hierarchy all by yourself trying to find where that code really is; and if there's polymorphism somewhere, you're really in trouble because it's not enough to know what type i and j are declared as, you have to know what type they are right now, which might involve inspecting an arbitrary amount of code and you can never really be sure if you've looked everywhere thanks to the halting problem (phew!).

When you see i=j*5 in C++ you are really on your own, bubby, and that, in my mind, reduces the ability to detect possible problems just by looking at code.
Installing Dev-C++ : url.
When we call it without passing any explicit address:
These are references, a C++ concept: an alias (an alternate name) for an object. References are frequently used for pass-by-reference:
Important note: Even though a reference is often implemented using an address in the underlying assembly language, please do not think of a reference as a funny looking pointer to an object. A reference is the object. It is not a pointer to the object, nor a copy of the object. It is the object.
References as variables do not take space: url
References as parameters use pointers to achieve the end effect:
Get the complete list of system calls, or see url.
Thanks, Bellido !
Run PHP code from command line :
Updated 20130827 (a)