=head1 NAME

Performance Tuning

=head1 Description

An exhaustive list of various techniques you might want to use to get
the most performance possible out of your mod_perl server:
configuration, coding, memory use, and more.

=head1 The Big Picture

To make the user's Web browsing experience as painless as possible,
every effort must be made to wring the last drop of performance from
the server. There are many factors which affect Web site usability,
but speed is one of the most important. This applies to any webserver,
not just Apache, so it is very important that you understand it.

How do we measure the speed of a server? Since the user (and not the
computer) is the one that interacts with the Web site, one good speed
measurement is the time elapsed between the moment when she clicks on
a link or presses a I button and the moment when the resulting page is
fully rendered.

The requests and replies are broken into packets. A request may be
made up of several packets, and a reply may consist of many thousands.
Each packet has to make its own way from one machine to another,
perhaps passing through many interconnection nodes. We must measure
the time starting from when the first packet of the request leaves our
user's machine to when the last packet of the reply arrives back
there.

A webserver is only one of the entities the packets see along their
way. If we follow them from browser to server and back again, they
may travel by different routes through many different entities.
Before they are processed by your server the packets might have to go
through proxy (accelerator) servers. If the request contains more
than one packet, the packets might arrive at the server by different
routes and at different times, so some packets that arrive earlier
may have to wait for the others before they can be reassembled into a
chunk of the request message that the server will then read. Then the
whole process is repeated in reverse.
You could work hard to fine tune your webserver's performance, but a
slow Network Interface Card (NIC) or a slow network connection from
your server might defeat it all. That's why it's important to think
about the Big Picture and to be aware of possible bottlenecks between
the server and the Web.

Of course there is little that you can do if the user has a slow
connection. You might tune your scripts and webserver to process
incoming requests ultra quickly, so you will need only a small number
of working servers, but you might find that the server processes are
all busy waiting for slow clients to accept their responses.

But there are techniques to cope with this. For example you can
compress the response before delivering it. If you are delivering a
pure text response, gzip compression will sometimes reduce its size by
a factor of 10.

You should analyze all the involved components when you try to create
the best service for your users, and not just the web server or the
code that the web server executes. A Web service is like a car: if
one of the parts or mechanisms is broken the car may not run smoothly,
and it can even stop dead if pushed too far without first fixing it.

And let me stress it again: if you want to succeed in the web service
business you should start worrying about the client's browsing
experience and B how good your code benchmarks are.

=head1 System Analysis

Before we try to solve a problem we need to identify it. In our case
we want to get the best performance we can with as little monetary and
time investment as possible.

=head2 Software Requirements

Covered in the section "L".

=head2 Hardware Requirements

(META: Only partial analysis. Please submit more points. Many points
are scattered around the document and should be gathered here, to
represent the whole picture. It also should be merged with the above
item!)

You need to analyze all of the problem's dimensions.
There are several things that need to be considered: =over =item * How long does it take to process each request? =item * How many requests can you process simultaneously? =item * How many simultaneous requests are you planning to get? =item * At what rate are you expecting to receive requests? =back The first one is probably the easiest to optimize. Following the performance optimization tips in this and other documents allows a perl (mod_perl) programmer to exercise their code and improve it. The second one is a function of RAM. How much RAM is in each box, how many boxes do you have, and how much RAM does each mod_perl process use? Multiply the first two and divide by the third. Ask yourself whether it is better to switch to another, possibly just as inefficient language or whether that will actually cost more than throwing another powerful machine into the rack. Also ask yourself whether switching to another language will even help. In some applications, for example to link Oracle runtime libraries, a huge chunk of memory is needed so you would save nothing even if you switched from Perl to C. The last two are important. You need a realistic estimate. Are you really expecting 8 million hits per day? What is the expected peak load, and what kind of response time do you need to guarantee? Remember that these numbers might change drastically when you apply code changes and your site becomes popular. Remember that when you get a very high hit rate, the resource requirements don't grow linearly but exponentially! More coverage is provided in the section "L". =head1 Essential Tools In order to improve performance we need measurement tools. The main tool categories are benchmarking and code profiling. 
It's important to understand that in most of the benchmarking tests
that we will execute we will not look at the absolute numbers but at
the relation between two or more result sets, since in most cases we
are trying to show which coding approach is preferable. You shouldn't
try to compare the absolute results shown here with the ones collected
while running the same benchmarks on your machine, since you won't
have the exact same hardware and software setup anyway, so this kind
of comparison would be misleading. Compare the relative results from
the tests running on your machine; don't compare your absolute results
with those in this Guide.

=head2 Benchmarking Applications

How much faster is mod_perl than mod_cgi (aka plain perl/CGI)? There
are many ways to benchmark the two. I'll present a few examples and
numbers below. Check out the C directory of the mod_perl distribution
for more examples.

If you are going to write your own benchmarking utility, use the C
module for heavy scripts and the C module for very fast scripts
(faster than 1 sec) where you will need better time precision.

There is no need to write a special benchmark though. If you want to
impress your boss or colleagues, just take some heavy CGI script you
have (e.g. a script that crunches some data and prints the results to
STDOUT), open 2 xterms and call the same script in mod_perl mode in
one xterm and in mod_cgi mode in the other. You can use C from the C
package to emulate the browser. The C directory of the mod_perl
distribution includes such an example.

See also two tools for benchmarking: L and L

=head3 Benchmarking Perl Code

If you are going to write your own benchmarking utility, use the C
module and the C module where you need better time precision (less
than 10msec).
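To illustrate the point about relative results, here is a minimal sketch that compares two ways of building the same string with C<cmpthese()> from the C<Benchmark> module. The two subroutines, C<build_with_join()> and C<build_with_concat()>, are hypothetical stand-ins for whatever pair of coding approaches you want to compare; they are not taken from this Guide.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Two hypothetical ways to build the same string; replace them with
# the two approaches you actually want to compare.
sub build_with_join   { return join '', map { "item$_" } 1 .. 100 }
sub build_with_concat { my $s = ''; $s .= "item$_" for 1 .. 100; return $s }

# A negative count asks Benchmark to run each sub for at least that
# many CPU seconds. cmpthese() prints a chart of rates and percentage
# differences, and returns the chart rows as an array reference.
my $rows = cmpthese(-1, {
    join   => \&build_with_join,
    concat => \&build_with_concat,
});
```

The rates in the chart will be different on every machine. Only the percentage difference between the two rows is worth comparing.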
An example of the C module usage:

  benchmark.pl
  ------------
  use Benchmark;

  timethis (1_000,
   sub {
    my $x = 100;
    my $y = log ($x ** 100)  for (0..10000);
  });

  % perl benchmark.pl
  timethis 1000: 25 wallclock secs (24.93 usr +  0.00 sys = 24.93 CPU)

If you want to get the benchmark results in microseconds you will
have to use the C module; its usage is similar to C's.

  use Time::HiRes qw(gettimeofday tv_interval);
  my $start_time = [ gettimeofday ];
  sub_that_takes_a_teeny_bit_of_time();
  my $end_time = [ gettimeofday ];
  my $elapsed = tv_interval($start_time,$end_time);
  print "The sub took $elapsed seconds.\n";

See also the L.

=head3 Benchmarking a Graphic Hits Counter with Persistent DB Connections

Here are the numbers from Michael Parker's mod_perl presentation at
the Perl Conference (Aug, 98). (Sorry, there used to be links here to
the source, but they went dead one day, so I removed them). The script
is a standard hits counter, but it logs the counts into a mysql
relational database:

  Benchmark: timing 100 iterations of cgi, perl...  [rate 1:28]

  cgi:  56 secs ( 0.33 usr 0.28 sys = 0.61 cpu)
  perl:  2 secs ( 0.31 usr 0.27 sys = 0.58 cpu)

  Benchmark: timing 1000 iterations of cgi,perl...  [rate 1:21]

  cgi: 567 secs ( 3.27 usr 2.83 sys = 6.10 cpu)
  perl: 26 secs ( 3.11 usr 2.53 sys = 5.64 cpu)

  Benchmark: timing 10000 iterations of cgi, perl   [rate 1:21]

  cgi: 6494 secs (34.87 usr 26.68 sys = 61.55 cpu)
  perl: 299 secs (32.51 usr 23.98 sys = 56.49 cpu)

We don't know what server configurations were used for these tests,
but I guess the numbers speak for themselves.

The source code of the script was available at
http://www.realtime.net/~parkerm/perl/conf98/sld006.htm. It's now a
dead link. If you know its new location, please let me know.

=head3 Benchmarking Response Times

In the next sections we will talk about tools that allow us to
benchmark response times.

=head4 ApacheBench

ApacheBench (C) is a tool for benchmarking your Apache HTTP server.
It is designed to give you an idea of the performance that your
current Apache installation can give. In particular, it shows you how
many requests per second your Apache server is capable of serving.
The C tool comes bundled with the Apache source distribution.

Let's try it. We will simulate 10 users concurrently requesting a
very light script at C. Each simulated user makes 10 requests.

  % ./ab -n 100 -c 10 www.example.com/perl/test.pl

The results are:

  Document Path:          /perl/test.pl
  Document Length:        319 bytes

  Concurrency Level:      10
  Time taken for tests:   0.715 seconds
  Complete requests:      100
  Failed requests:        0
  Total transferred:      60700 bytes
  HTML transferred:       31900 bytes
  Requests per second:    139.86
  Transfer rate:          84.90 kb/s received

  Connection Times (ms)
                min   avg   max
  Connect:        0     0     3
  Processing:    13    67    71
  Total:         13    67    74

We can see that under a load of ten concurrent users our server is
capable of processing 140 requests per second. Of course this result
holds only for the script under test. We can also learn about the
average processing time, which in this case was 67 milliseconds.
Other numbers reported by C may or may not be of interest to you.

For example if we believe that the script I is not efficient we will
try to improve it and run the benchmark again, to see whether there is
any improvement in performance.

C, available from CPAN, provides a Perl interface for C.

=head4 httperf

httperf is a utility written by David Mosberger. Just like
ApacheBench, it measures the performance of the webserver.

A sample command line is shown below:

  httperf --server hostname --port 80 --uri /test.html \
   --rate 150 --num-conn 27000 --num-call 1 --timeout 5

This command causes httperf to use the web server on the host with IP
name hostname, running at port 80. The web page being retrieved is I
and, in this simple test, the same page is retrieved repeatedly. The
rate at which requests are issued is 150 per second.
The test involves initiating a total of 27,000 TCP connections and on
each connection one HTTP call is performed. A call consists of
sending a request and receiving a reply.

The timeout option defines the number of seconds that the client is
willing to wait to hear back from the server. If this timeout
expires, the tool considers the corresponding call to have failed.
Note that with a total of 27,000 connections and a rate of 150 per
second, the total test duration will be approximately 180 seconds
(27,000/150), independently of what load the server can actually
sustain. Here is a result that one might get:

  Total: connections 27000 requests 26701 replies 26701 test-duration 179.996 s

  Connection rate: 150.0 conn/s (6.7 ms/conn, <=47 concurrent connections)
  Connection time [ms]: min 1.1 avg 5.0 max 315.0 median 2.5 stddev 13.0
  Connection time [ms]: connect 0.3

  Request rate: 148.3 req/s (6.7 ms/req)
  Request size [B]: 72.0

  Reply rate [replies/s]: min 139.8 avg 148.3 max 150.3 stddev 2.7 (36 samples)
  Reply time [ms]: response 4.6 transfer 0.0
  Reply size [B]: header 222.0 content 1024.0 footer 0.0 (total 1246.0)
  Reply status: 1xx=0 2xx=26701 3xx=0 4xx=0 5xx=0

  CPU time [s]: user 55.31 system 124.41 (user 30.7% system 69.1% total 99.8%)
  Net I/O: 190.9 KB/s (1.6*10^6 bps)

  Errors: total 299 client-timo 299 socket-timo 0 connrefused 0 connreset 0
  Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0

L

=head4 http_load

C is yet another utility that does webserver load testing. It can
simulate a 33.6kbps modem connection (I<-throttle>) and allows you to
provide a file with a list of URLs, which will be fetched randomly.
You can specify how many parallel connections to run using the
I<-parallel N> option, or you can specify the number of requests to
generate per second with the I<-rate N> option. Finally you can tell
the utility when to stop by specifying either the test time length
(I<-seconds N>) or the total number of fetches (I<-fetches N>).
A sample run with the file I including: http://www.example.com/foo/ http://www.example.com/bar/ We ask to generate three requests per second and run for only two seconds. Here is the generated output: % ./http_load -rate 3 -seconds 2 urls http://www.example.com/foo/: check-connect SUCCEEDED, ignoring http://www.example.com/bar/: check-connect SUCCEEDED, ignoring http://www.example.com/bar/: check-connect SUCCEEDED, ignoring http://www.example.com/bar/: check-connect SUCCEEDED, ignoring http://www.example.com/foo/: check-connect SUCCEEDED, ignoring 5 fetches, 3 max parallel, 96870 bytes, in 2.00258 seconds 19374 mean bytes/connection 2.49678 fetches/sec, 48372.7 bytes/sec msecs/connect: 1.805 mean, 5.24 max, 0.79 min msecs/first-response: 291.289 mean, 560.338 max, 34.349 min So you can see that it has reported 2.5 requests per second. Of course for the real test you will want to load the server heavily and run the test for a longer time to get more reliable results. Note that when you provide a file with a list of URLs make sure that you don't have empty lines in it. If you do -- the utility won't work complaining: ./http_load: unknown protocol - L =head4 the crashme Script This is another crashme suite originally written by Michael Schilli (and was located at http://www.linux-magazin.de site, but now the link has gone). I made a few modifications, mostly adding my () operators. I also allowed it to accept more than one url to test, since sometimes you want to test more than one script. The tool provides the same results as B above but it also allows you to set the timeout value, so requests will fail if not served within the time out period. You also get values for B (seconds per request) and B (requests per second). It can do a complete simulation of your favorite Netscape browser :) and give you a better picture. I have noticed while running these two benchmarking suites, that B gave me results from two and a half to three times better. 
Both suites were run on the same machine, with the same load and the same parameters, but the implementations were different. Sample output: URL(s): http://www.example.com/perl/access/access.cgi Total Requests: 100 Parallel Agents: 10 Succeeded: 100 (100.00%) Errors: NONE Total Time: 9.39 secs Throughput: 10.65 Requests/sec Latency: 0.85 secs/Request And the code: The LWP::Parallel::UserAgent benchmark: F =head3 Benchmarking PerlHandlers The C module does C Benchmarking. With the help of this module you can log the time taken to process the request, just like you'd use the C module to benchmark a regular Perl script. Of course you can extend this module to perform more advanced processing like putting the results into a database for a later processing. But all it takes is adding this configuration directive inside I: PerlFixupHandler Apache::Timeit Since scripts running under C are running inside the PerlHandler these are benchmarked as well. An example of the lines which show up in the I file: timing request for /perl/setupenvoff.pl: 0 wallclock secs ( 0.04 usr + 0.01 sys = 0.05 CPU) timing request for /perl/setupenvoff.pl: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) The C package is a part of the I files collection available from CPAN. =head3 Other Benchmarking Tools Other tools you may want to take a look at: =over =item * C C module runs tests on remote URLs or local web files containing Perl/JSP/HTML/JavaScript/etc. and generates a detailed test report. It's available from CPAN. =item * C C is a test-harness application to test the integrity of a user's path through a web site. It's available from CPAN. =item * C and C C is a mod_perl handler that records an HTTP session and stores it on the web server's file system. C reads the recorded session from the file system, and formats it for playback using C or C. This is useful when writing acceptance and regression tests. It's available from CPAN. 
=item * C

This tool is somewhat complex to set up, but once you get it running
it gives you stats that you could only duplicate with ab or http_load
if you did quite a bit of extra scripting around them. It also allows
multiple client machines to be used for providing heavy loads.

This tool is useful if you need to know things like at what point
people start finding your site slow, as opposed to at what point the
server becomes unresponsive.

L

=item * C

Flood is a load-tester being developed through the Apache Software
Foundation. From the Flood FAQ: "Flood is a profile-driven HTTP load
tester. In layman's terms, it means that flood is capable of
generating large amounts of web traffic. Flood's flexibility and power
arises in its configuration syntax. It is able to work well with
dynamic content."

L

=back

=head2 Code Profiling Techniques

The profiling process helps you to determine which subroutines or just
snippets of code take the longest time to execute and which
subroutines are called most often. Probably you will want to optimize
those.

When do you need to profile your code? You do that when you suspect
that some part of your code is called very often and that optimizing
it might significantly improve the overall performance.

For example, you may have used the C pragma, which extends the terse
diagnostics normally emitted by both the Perl compiler and the Perl
interpreter, augmenting them with the more verbose and endearing
descriptions found in the C manpage. You may know that it can slow
your code down tremendously, so let's first verify that this is true.

We will run a benchmark, once with diagnostics enabled and once
disabled, on a subroutine called I.

The code inside the subroutine does a numeric comparison of two
strings. It assigns one string to another if the condition tests true
but the condition always tests false. To demonstrate the C overhead
the comparison operator is intentionally I.
It should be a string comparison, not a numeric one.

  use Benchmark;
  use diagnostics;
  use strict;

  my $count = 50000;

  disable diagnostics;
  my $t1 = timeit($count,\&test_code);

  enable  diagnostics;
  my $t2 = timeit($count,\&test_code);

  print "Off: ",timestr($t1),"\n";
  print "On : ",timestr($t2),"\n";

  sub test_code{
    my ($a,$b) = qw(foo bar);
    my $c;
    if ($a == $b) {
      $c = $a;
    }
  }

For only a few lines of code we get:

  Off:  1 wallclock secs ( 0.81 usr +  0.00 sys =  0.81 CPU)
  On : 13 wallclock secs (12.54 usr +  0.01 sys = 12.55 CPU)

With C enabled, the subroutine test_code() is 16 times slower than
with C disabled!

Now let's fix the comparison the way it should be, by replacing C<==>
with C, so we get:

    my ($a,$b) = qw(foo bar);
    my $c;
    if ($a eq $b) {
      $c = $a;
    }

and run the same benchmark again:

  Off:  1 wallclock secs ( 0.57 usr +  0.00 sys =  0.57 CPU)
  On :  1 wallclock secs ( 0.56 usr +  0.00 sys =  0.56 CPU)

Now there is no overhead at all. The C pragma slows things down only
when warnings are generated.

Now that we have verified that using the C pragma might add a big
overhead to the execution time, let's use code profiling to
understand why this happens. We are going to use C to profile the
code.
Let's use this code:

  diagnostics.pl
  --------------
  use diagnostics;
  print "Content-type:text/html\n\n";
  test_code();
  sub test_code{
    my ($a,$b) = qw(foo bar);
    my $c;
    if ($a == $b) {
      $c = $a;
    }
  }

Run it with the profiler enabled, and then create the profiling
statistics with the help of dprofpp:

  % perl -d:DProf diagnostics.pl
  % dprofpp

  Total Elapsed Time = 0.342236 Seconds
    User+System Time = 0.335420 Seconds
  Exclusive Times
  %Time ExclSec CumulS #Calls sec/call Csec/c  Name
   92.1   0.309  0.358      1   0.3089 0.3578  main::BEGIN
   14.9   0.050  0.039   3161   0.0000 0.0000  diagnostics::unescape
   2.98   0.010  0.010      2   0.0050 0.0050  diagnostics::BEGIN
   0.00   0.000 -0.000      2   0.0000      -  Exporter::import
   0.00   0.000 -0.000      2   0.0000      -  Exporter::export
   0.00   0.000 -0.000      1   0.0000      -  Config::BEGIN
   0.00   0.000 -0.000      1   0.0000      -  Config::TIEHASH
   0.00   0.000 -0.000      2   0.0000      -  Config::FETCH
   0.00   0.000 -0.000      1   0.0000      -  diagnostics::import
   0.00   0.000 -0.000      1   0.0000      -  main::test_code
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::warn_trap
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::splainthis
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::transmo
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::shorten
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::autodescribe

It's not easy to see what is responsible for this enormous overhead,
even if C seems to be running most of the time. To get the full
picture we must see the call tree, which shows us who calls whom, so
we run:

  % dprofpp -T

and the output is:

  main::BEGIN diagnostics::BEGIN Exporter::import Exporter::export diagnostics::BEGIN Config::BEGIN Config::TIEHASH Exporter::import Exporter::export Config::FETCH Config::FETCH diagnostics::unescape ..................... 3159 times [diagnostics::unescape] snipped .....................
diagnostics::unescape diagnostics::import diagnostics::warn_trap diagnostics::splainthis diagnostics::transmo diagnostics::shorten diagnostics::autodescribe main::test_code diagnostics::warn_trap diagnostics::splainthis diagnostics::transmo diagnostics::shorten diagnostics::autodescribe diagnostics::warn_trap diagnostics::splainthis diagnostics::transmo diagnostics::shorten diagnostics::autodescribe So we see that two executions of C and 3161 of C are responsible for most of the running overhead. If we comment out the C module, we get: Total Elapsed Time = 0.079974 Seconds User+System Time = 0.059974 Seconds Exclusive Times %Time ExclSec CumulS #Calls sec/call Csec/c Name 0.00 0.000 -0.000 1 0.0000 - main::test_code It is possible to profile code running under mod_perl with the C module, available on CPAN. However, you must have apache version 1.3b3 or higher and the C enabled during the httpd build process. When the server is started, C installs an C block to write the I file. This block will be called at server shutdown. Here is how to start and stop a server with the profiler enabled: % setenv PERL5OPT -d:DProf % httpd -X -d `pwd` & ... make some requests to the server here ... % kill `cat logs/httpd.pid` % unsetenv PERL5OPT % dprofpp The C package is a Perl code profiler. It will collect information on the execution time of a Perl script and of the subs in that script (remember that C and C are just like any other subroutines you write, but they come bundled with Perl!) Another approach is to use C, which hooks C into mod_perl. The C module will run a C profiler inside each child server and write the I file in the directory C<$ServerRoot/logs/dprof/$$> when the child is shutdown (where C<$$> is the number of the child process). All it takes is to add to I: PerlModule Apache::DProf Remember that any PerlHandler that was pulled in before C in the I or I, will not have its code debugging information inserted. 
To run C, chdir to C<$ServerRoot/logs/dprof/$$> and run:

  % dprofpp

(Look up the C directive's value in I to find out what your
C<$ServerRoot> is.)

=head2 Measuring the Memory of the Process

A very important aspect of performance tuning is to make sure that
your applications don't use too much memory, since if they do you
cannot run many servers, and therefore in most cases the overall
performance degrades under a heavy load.

In addition the code may not be clean and may leak memory, which is
even worse: if the same process serves many requests and uses more
memory after each request, after a while all the RAM will be used and
the machine will start swapping (using the swap partition). This is
very undesirable, since it may lead to a machine crash.

The simplest way to figure out how big the processes are and see
whether they grow is to watch the output of the top(1) or ps(1)
utilities.

For example the output of top(1):

    8:51am  up 66 days,  1:44,  1 user,  load average: 1.09, 2.27, 2.61
  95 processes: 92 sleeping, 3 running, 0 zombie, 0 stopped
  CPU states: 54.0% user,  9.4% system,  1.7% nice, 34.7% idle
  Mem:  387664K av, 309692K used,  77972K free, 111092K shrd,  70944K buff
  Swap: 128484K av,  11176K used, 117308K free                170824K cached

     PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
   29225 nobody     0   0  9760 9760  7132 S       0 12.5  2.5   0:00 httpd_perl
   29220 nobody     0   0  9540 9540  7136 S       0  9.0  2.4   0:00 httpd_perl
   29215 nobody     1   0  9672 9672  6884 S       0  4.6  2.4   0:01 httpd_perl
   29255 root       7   0  1036 1036   824 R       0  3.2  0.2   0:01 top
     376 squid      0   0 15920  14M   556 S       0  1.1  3.8 209:12 squid
   29227 mysql      5   5  1892 1892   956 S N     0  1.1  0.4   0:00 mysqld
   29223 mysql      5   5  1892 1892   956 S N     0  0.9  0.4   0:00 mysqld
   29234 mysql      5   5  1892 1892   956 S N     0  0.9  0.4   0:00 mysqld

The output starts with overall information about the system and then
displays the most active processes at the given moment. So for
example if we look at the C processes we can see the size of the
resident (C) and shared (C) memory segments.
This sample was taken on the production server running linux. But of course we want to see all the apache/mod_perl processes, and that's where ps(1) comes to help. The options of this utility vary from one Unix flavor to another, and some flavors provide their own tools. Let's check the information about mod_perl processes: % ps -o pid,user,rss,vsize,%cpu,%mem,ucomm -C httpd_perl PID USER RSS VSZ %CPU %MEM COMMAND 29213 root 8584 10264 0.0 2.2 httpd_perl 29215 nobody 9740 11316 1.0 2.5 httpd_perl 29216 nobody 9668 11252 0.7 2.4 httpd_perl 29217 nobody 9824 11408 0.6 2.5 httpd_perl 29218 nobody 9712 11292 0.6 2.5 httpd_perl 29219 nobody 8860 10528 0.0 2.2 httpd_perl 29220 nobody 9616 11200 0.5 2.4 httpd_perl 29221 nobody 8860 10528 0.0 2.2 httpd_perl 29222 nobody 8860 10528 0.0 2.2 httpd_perl 29224 nobody 8860 10528 0.0 2.2 httpd_perl 29225 nobody 9760 11340 0.7 2.5 httpd_perl 29235 nobody 9524 11104 0.4 2.4 httpd_perl Now you can see the resident (C) and virtual (C) memory segments (and shared memory segment if you ask for it) of all mod_perl processes. Please refer to the top(1) and ps(1) man pages for more information. You probably agree that using top(1) and ps(1) is cumbersome if we want to use memory size sampling during the benchmark test. We want to have a way to print memory sizes during the program execution at desired places. If you have C modules installed, which is a perl glue to the C library, it's exactly what we need. Note: C requires the C library but is not available for all platforms. See the docs in the source at ftp://ftp.gnome.org/pub/GNOME/stable/sources/gtop/ to check whether your platform/flavor is supported. C provides an API for retrieval of information about processes and the whole system. We are interested only in memory sampling API methods. 
To print all the process related memory information we can execute
the following code:

  use GTop;
  my $gtop = GTop->new;
  my $proc_mem = $gtop->proc_mem($$);
  for (qw(size vsize share rss)) {
      printf "   %s => %d\n", $_, $proc_mem->$_();
  }

When executed we see the following output (in bytes):

      size => 1900544
     vsize => 3108864
     share => 1392640
       rss => 1900544

So if we want to print the process resident memory size before and
after some event, we just do it. For example, if we want to see how
much extra memory was allocated after a variable creation we can
write the following code:

  use GTop;
  my $gtop = GTop->new;
  my $before = $gtop->proc_mem($$)->rss;
  my $x = 'a' x 10000;
  my $after  = $gtop->proc_mem($$)->rss;
  print "diff: ",$after-$before, " bytes\n";

and the output

  diff: 20480 bytes

So we can see that Perl has allocated an extra 20480 bytes to create
C<$x> (of course the creation of C needed a few bytes as well, but
that is insignificant compared to the size of C<$x>).

The C module, with the help of the C module, allows you to watch all
your system information using your favorite browser from anywhere in
the world without a need to telnet to your machine. If you want to
know what information you can retrieve with C, you should look at C
as it uses a big part of the API that C provides.

If you are running a true BSD system, you may use C instead of C.
For example:

  print "used memory = ".(BSD::Resource::getrusage)[2]."\n"

For more information refer to the C manpage.

=head2 Measuring the Memory Usage of Subroutines

With the help of C you can find out the size of each and every
subroutine.

=over

=item 1

Build and install mod_perl as you always do, make sure it's version
1.22 or higher.

=item 1

Configure /perl-status if you haven't already:

  <Location /perl-status>
    SetHandler perl-script
    PerlHandler Apache::Status
    order deny,allow
    #deny from all
    #allow from ...
  </Location>
=item 1

Add to httpd.conf

  PerlSetVar StatusOptionsAll On
  PerlSetVar StatusTerse On
  PerlSetVar StatusTerseSize On
  PerlSetVar StatusTerseSizeMainSummary On

  PerlModule B::TerseSize

=item 1

Start the server (best in httpd -X mode)

=item 1

From your favorite browser fetch http://localhost/perl-status

=item 1

Click on 'Loaded Modules' or 'Compiled Registry Scripts'

=item 1

Click on the module or script of your choice (you might need to run
some script/handler before you will see it here unless it was
preloaded)

=item 1

Click on 'Memory Usage' at the bottom

=item 1

You should see all the subroutines and their respective sizes.

=back

Now you can start to optimize your code, or test which of several
implementations is the smallest.

For example let's compare C's OO vs. procedural interfaces:

As you will see below the first OO script uses about 2k bytes while
the second script (procedural interface) uses about 5k.

Here are the code examples and the numbers:

=over

=item 1

  cgi_oo.pl
  ---------
  use CGI ();
  my $q = CGI->new;
  print $q->header;
  print $q->b("Hello");

=item 2

  cgi_mtd.pl
  ---------
  use CGI qw(header b);
  print header();
  print b("Hello");

=back

After executing each script in single server mode (-X) the results
are:

=over

=item 1

  Totals: 1966 bytes | 27 OPs

  handler 1514 bytes | 27 OPs
  exit     116 bytes |  0 OPs

=item 2

  Totals: 4710 bytes | 19 OPs

  handler  1117 bytes | 19 OPs
  basefont  120 bytes |  0 OPs
  frameset  120 bytes |  0 OPs
  caption   119 bytes |  0 OPs
  applet    118 bytes |  0 OPs
  script    118 bytes |  0 OPs
  ilayer    118 bytes |  0 OPs
  header    118 bytes |  0 OPs
  strike    118 bytes |  0 OPs
  layer     117 bytes |  0 OPs
  table     117 bytes |  0 OPs
  frame     117 bytes |  0 OPs
  style     117 bytes |  0 OPs
  Param     117 bytes |  0 OPs
  small     117 bytes |  0 OPs
  embed     117 bytes |  0 OPs
  font      116 bytes |  0 OPs
  span      116 bytes |  0 OPs
  exit      116 bytes |  0 OPs
  big       115 bytes |  0 OPs
  div       115 bytes |  0 OPs
  sup       115 bytes |  0 OPs
  Sub       115 bytes |  0 OPs
  TR        114 bytes |  0 OPs
  td        114 bytes |  0 OPs
  Tr        114 bytes |  0 OPs
  th        114 bytes |  0 OPs
  b         113 bytes |  0 OPs

=back

Note that the above is correct only if you didn't precompile all of
C's methods at server startup. If you did, the procedural interface
in the second test will take up to 18k and not 5k as we saw. That's
because the whole of C's namespace is inherited and it already has all
its methods compiled, so it doesn't really matter whether you attempt
to import only the symbols that you need. So if you have:

  use CGI  qw(-compile :all);

in the server startup script, then having:

  use CGI qw(header);

or

  use CGI qw(:all);

is essentially the same. All the symbols precompiled at startup will
be imported even if you ask for only one symbol. It seems to me like
a bug, but probably that's how C works.

BTW, you can check the number of opcodes in the code by a simple
command line run. For example comparing S<'my %hash'> vs. S<'my %hash
= ()'>.

  % perl -MO=Terse -e 'my %hash' | wc -l
  -e syntax OK
  4

  % perl -MO=Terse -e 'my %hash = ()' | wc -l
  -e syntax OK
  10

The first one has fewer opcodes.

Note that you shouldn't use the C module on a production server as it
adds quite a bit of overhead for each request.

=head1 Know Your Operating System

In order to get the best performance it helps to get intimately
familiar with the Operating System (OS) the web server is running on.
There are many OS specific things that you may be able to optimize
which will improve your web server's speed, reliability and security.

The following sections will reveal some of the most important details
you should know about your OS.

=head2 Sharing Memory

The sharing of memory is one very important factor. If your OS
supports it (and most sane systems do), you might save memory by
sharing it between child processes. This is only possible when you
preload code at server startup. However, during a child process' life
its memory pages tend to become unshared.
There is no way we can make Perl allocate memory so that (dynamic) variables land on different memory pages from constants, so the B effect (we will explain this in a moment) will hit you almost at random. If you are pre-loading many modules you might be able to trade off the memory that stays shared against the time for an occasional fork by tuning C. Each time a child reaches this upper limit and dies it should release its unshared pages. The new child which replaces it will share its fresh pages until it scribbles on them. The ideal is a point where your processes usually restart before too much memory becomes unshared. You should take some measurements to see if it makes a real difference, and to find the range of reasonable values. If you have success with this tuning the value of C will probably be peculiar to your situation and may change with changing circumstances. It is very important to understand that your goal is not to have C to be 10000. Having a child serving 300 requests on precompiled code is already a huge overall speedup, so if it is 100 or 10000 it probably does not really matter if you can save RAM by using a lower value. Do not forget that if you preload most of your code at server startup, the newly forked child gets ready very fast, because it inherits most of the preloaded code and the perl interpreter from the parent process. During the life of the child its memory pages (which aren't really its own to start with, it uses the parent's pages) gradually get `dirty' - variables which were originally inherited and shared are updated or modified -- and the I happens. This reduces the number of shared memory pages, thus increasing the memory requirement. Killing the child and spawning a new one allows the new child to get back to the pristine shared memory of the parent process. The recommendation is that C should not be too large, otherwise you lose some of the benefit of sharing memory. See L for more about tuning the C parameter. 
=head3 How Shared Is My Memory? You've probably noticed that the word shared is repeated many times in relation to mod_perl. Indeed, shared memory might save you a lot of money, since with sharing in place you can run many more servers than without it. See L. How much shared memory do you have? You can see it by either using the memory utility that comes with your system or you can deploy the C module: use GTop (); print "Shared memory of the current process: ", GTop->new->proc_mem($$)->share,"\n"; print "Total shared memory: ", GTop->new->mem->share,"\n"; When you watch the output of the C utility, don't confuse the C (or C) columns with the C column. C is RESident memory, which is the size of pages currently swapped in. =head3 Calculating Real Memory Usage I have shown how to measure the size of the process' shared memory, but we still want to know what the real memory usage is. Obviously this cannot be calculated simply by adding up the memory size of each process because that wouldn't account for the shared memory. On the other hand we cannot just subtract the shared memory size from the total size to get the real memory usage numbers, because in reality each process has a different history of processed requests, therefore the shared memory is not the same for all processes. So how do we measure the real memory size used by the server we run? It's probably too difficult to give the exact number, but I've found a way to get a fair approximation which was verified in the following way. I have calculated the real memory used, by the technique you will see in the moment, and then have stopped the Apache server and saw that the memory usage report indicated that the total used memory went down by almost the same number I've calculated. Note that some OSs do smart memory pages caching so you may not see the memory usage decrease as soon as it actually happens when you quit the application. 
This is the technique I've used:

=over

=item 1

For each process, sum up the difference between its total size and its shared memory. To calculate the difference for a single process use:

  use GTop;
  my $proc_mem = GTop->new->proc_mem($$);
  my $diff     = $proc_mem->size - $proc_mem->share;
  print "Difference is $diff bytes\n";

=item 2

Now if we add the shared memory size of the process with the maximum shared memory, we get all the memory that is actually being used by all httpd processes, except for the parent process.

=item 3

Finally, add the size of the parent process.

=back

Please note that this might be incorrect for your system, so use this number at your own risk.

I've used this technique to display real memory usage in the module L, so instead of trying to manually calculate this number you can use this module to do it automatically. In fact, the calculations used in this module make no distinction between the parent and child processes; they are all counted alike using the following code:

  use GTop ();
  my $gtop = GTop->new;
  my $total_real = 0;
  my $max_shared = 0;
  # @mod_perl_pids is initialized by Apache::Scoreboard, irrelevant here
  my @mod_perl_pids = some_code();
  for my $pid (@mod_perl_pids) {
      my $proc_mem = $gtop->proc_mem($pid);
      my $size     = $proc_mem->size;
      my $share    = $proc_mem->share;
      $total_real += $size - $share;
      $max_shared  = $share if $max_shared < $share;
  }
  $total_real += $max_shared;

So as you can see, we accumulate the difference between the reported size and the shared memory:

  $total_real += $size - $share;

and at the end add the biggest shared size:

  $total_real += $max_shared;

So now C<$total_real> contains approximately the memory that is really used.

=head3 Are My Variables Shared?

How do you find out if the code you write is shared between the processes or not? The code should be shared, except where it is on a memory page with variables that change. Some variables are read-only in usage and never change.
For example, if you have some variables that use a lot of memory and you want them to be read-only. As you know the variable becomes unshared when the process modifies its value. So imagine that you have this 10Mb in-memory database that resides in a single variable, you perform various operations on it and want to make sure that the variable is still shared. For example if you do some matching regular expression (regex) processing on this variable and want to use the pos() function, will it make the variable unshared or not? The C module comes to rescue. Let's write a module called I which we preload at server startup, so all the variables of this module are initially shared by all children. MyShared.pm --------- package MyShared; use Apache::Peek; my $readonly = "Chris"; sub match { $readonly =~ /\w/g; } sub print_pos{ print "pos: ",pos($readonly),"\n";} sub dump { Dump($readonly); } 1; This module declares the package C, loads the C module and defines the lexically scoped C<$readonly> variable which is supposed to be a variable of large size (think about a huge hash data structure), but we will use a small one to simplify this example. The module also defines three subroutines: match() that does a simple character matching, print_pos() that prints the current position of the matching engine inside the string that was last matched and finally the dump() subroutine that calls the C module's Dump() function to dump a raw Perl data-type of the C<$readonly> variable. Now we write the script that prints the process ID (PID) and calls all three functions. The goal is to check whether pos() makes the variable I and therefore unshared. share_test.pl ------------- use MyShared; print "Content-type: text/plain\r\n\r\n"; print "PID: $$\n"; MyShared::match(); MyShared::print_pos(); MyShared::dump(); Before you restart the server, in I set: MaxClients 2 for easier tracking. You need at least two servers to compare the print outs of the test program. 
Having more than two can make the comparison process harder.

Now open two browser windows and issue the request for this script several times in both windows, so that you get different process PIDs reported in the two windows and each process has processed a different number of requests to the I script.

In the first window you will see something like this:

  PID: 27040
  pos: 1
  SV = PVMG(0x853db20) at 0x8250e8c
    REFCNT = 3
    FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
    IV = 0
    NV = 0
    PV = 0x8271af0 "Chris"\0
    CUR = 5
    LEN = 6
    MAGIC = 0x853dd80
      MG_VIRTUAL = &vtbl_mglob
      MG_TYPE = 'g'
      MG_LEN = 1

And in the second window:

  PID: 27041
  pos: 2
  SV = PVMG(0x853db20) at 0x8250e8c
    REFCNT = 3
    FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
    IV = 0
    NV = 0
    PV = 0x8271af0 "Chris"\0
    CUR = 5
    LEN = 6
    MAGIC = 0x853dd80
      MG_VIRTUAL = &vtbl_mglob
      MG_TYPE = 'g'
      MG_LEN = 2

We see that all the addresses of the supposedly big structures are the same, 0x8250e8c for SV and 0x8271af0 for PV, therefore the variable data structure is almost completely shared. The only difference is in the C record, which is not shared.

So given that the C<$readonly> variable is a big one, its value is still shared between the processes, while part of the variable data structure is not. That part is almost insignificant because it takes very little memory.

Now if you need to compare more than one variable, doing it by hand can be quite time consuming and error prone. Therefore it's better to change the testing script to dump the Perl data-types into files (e.g I, where C<$$> is the PID of the process) and then use the diff(1) utility to see whether there is some difference.

So changing the dump() function to write the info to a file will do the job. Notice that we use C and not C. The two are almost the same, but C prints its output directly to the opened socket so we cannot intercept and redirect the result to a file.
Since C dumps results to the STDERR stream we can use the old trick of saving away the default STDERR filehandle and opening a new filehandle under the STDERR name. In our example when C now prints to STDERR it actually prints to our file. When we are done, we make sure to restore the original STDERR filehandle.

So this is the resulting code:

  MyShared2.pm
  ---------
  package MyShared2;
  use Devel::Peek;

  my $readonly = "Chris";

  sub match     { $readonly =~ /\w/g; }
  sub print_pos { print "pos: ",pos($readonly),"\n"; }
  sub dump {
      my $dump_file = "/tmp/dump.$$";
      print "Dumping the data into $dump_file\n";
      # save the original STDERR so it can be restored afterwards
      open OLDERR, ">&STDERR" or die "Can't dup STDERR: $!";
      open STDERR, ">".$dump_file or die "Can't open $dump_file: $!";
      Dump($readonly);
      close STDERR;
      open STDERR, ">&OLDERR" or die "Can't restore STDERR: $!";
  }
  1;

Now we modify the script to use the modified module:

  share_test2.pl
  -------------
  use MyShared2;
  print "Content-type: text/plain\r\n\r\n";
  print "PID: $$\n";
  MyShared2::match();
  MyShared2::print_pos();
  MyShared2::dump();

When we run it as before (with S), two dump files will be created in the directory I. In our test these were created as I and I. When we run diff(1):

  % diff /tmp/dump.1224 /tmp/dump.1225
  12c12
  <       MG_LEN = 1
  ---
  >       MG_LEN = 2

We see that the two padlists (of the variable C) are different, as we observed before when we did a manual comparison.

In fact, if we think about these results again, we come to the conclusion that there is no need for two processes to find out whether the variable gets modified (and therefore unshared). It's enough to check the data structure before the script is executed and again afterwards. You can modify the C module to dump the padlists into a different file after each invocation and then run diff(1) on the two files.

If you want to watch whether some lexically scoped (with my ()) variables in your C script inside the same process get changed between invocations you can use the C module instead.
Since it does exactly this: it makes a snapshot of the padlist before and after the code execution and shows the difference between the two. This specific module was written to work with C scripts so it won't work for loaded modules. Use the technique we have described above for any type of variables in modules and scripts. Surely another way of ensuring that a scalar is readonly and therefore sharable is to either use the C pragma or C pragma. But then you won't be able to make calls that alter the variable even a little, like in the example that we just showed, because it will be a true constant variable and you will get compile time error if you try this: MyConstant.pm ------------- package MyConstant; use constant readonly => "Chris"; sub match { readonly =~ /\w/g; } sub print_pos{ print "pos: ",pos(readonly),"\n";} 1; % perl -c MyConstant.pm Can't modify constant item in match position at MyConstant.pm line 5, near "readonly)" MyConstant.pm had compilation errors. However this code is just right: MyConstant1.pm ------------- package MyConstant1; use constant readonly => "Chris"; sub match { readonly =~ /\w/g; } 1; =head3 Preloading Perl Modules at Server Startup You can use the C and C directives to load commonly used modules such as C, C and etc., when the server is started. On most systems, server children will be able to share the code space used by these modules. Just add the following directives into I: PerlModule CGI PerlModule DBI But an even better approach is to create a separate startup file (where you code in plain perl) and put there things like: use DBI (); use Carp (); Don't forget to prevent importing of the symbols exported by default by the module you are going to preload, by placing empty parentheses C<()> after a module's name. Unless you need some of these in the startup file, which is unlikely. This will save you a few more memory bits. 
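The effect of the empty parentheses can be checked in plain Perl, outside of mod_perl. The following is only a minimal sketch: the package names C<WithImport> and C<WithoutImport> and the helper has_sub() are made up for this illustration, and C<Carp> is used simply because it exports croak() by default.

```perl
use strict;
use warnings;

package WithImport;
use Carp;       # default import: croak() and friends are aliased into this package

package WithoutImport;
use Carp ();    # empty list: the module is loaded, but nothing is imported

package main;

# Hypothetical helper: report whether a sub named $name exists in package $pkg.
sub has_sub {
    my ($pkg, $name) = @_;
    no strict 'refs';
    return defined &{"${pkg}::$name"} ? 1 : 0;
}

printf "WithImport::croak    defined: %d\n", has_sub('WithImport',    'croak');
printf "WithoutImport::croak defined: %d\n", has_sub('WithoutImport', 'croak');
```

In both cases the module itself is compiled only once, and a fully qualified call such as Carp::croak() keeps working. Only the extra symbol table entries in the importing package are avoided.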
Then you C this startup file in I with the C directive, placing it before the rest of the mod_perl configuration directives: PerlRequire /path/to/start-up.pl C is a special case. Ordinarily C autoloads most of its functions on an as-needed basis. This speeds up the loading time by deferring the compilation phase. When you use mod_perl, FastCGI or another system that uses a persistent Perl interpreter, you will want to precompile the functions at initialization time. To accomplish this, call the package function compile() like this: use CGI (); CGI->compile(':all'); The arguments to C are a list of method names or sets, and are identical to those accepted by the C and C operators. Note that in most cases you will want to replace C<':all'> with the tag names that you actually use in your code, since generally you only use a subset of them. Let's conduct a memory usage test to prove that preloading, reduces memory requirements. In order to have an easy measurement we will use only one child process, therefore we will use this setting: MinSpareServers 1 MaxSpareServers 1 StartServers 1 MaxClients 1 MaxRequestsPerChild 100 We are going to use the C script I which consists of two parts: the first one preloads a bunch of modules (that most of them aren't going to be used), the second part reports the memory size and the shared memory size used by the single child process that we start. and of course it prints the difference between the two sizes. memuse.pl --------- use strict; use CGI (); use DB_File (); use LWP::UserAgent (); use Storable (); use DBI (); use GTop (); my $r = shift; $r->send_http_header('text/plain'); my $proc_mem = GTop->new->proc_mem($$); my $size = $proc_mem->size; my $share = $proc_mem->share; my $diff = $size - $share; printf "%10s %10s %10s\n", qw(Size Shared Difference); printf "%10d %10d %10d (bytes)\n",$size,$share,$diff; First we restart the server and execute this CGI script when none of the above modules preloaded. 
Here is the result:

        Size     Shared       Diff
     4706304    2134016    2572288 (bytes)

Now we take all the modules:

  use strict;
  use CGI ();
  use DB_File ();
  use LWP::UserAgent ();
  use Storable ();
  use DBI ();
  use GTop ();

and copy them into the startup script, so they will get preloaded. The script remains unchanged. We restart the server and execute it again. We get the following:

        Size     Shared       Diff
     4710400    3997696     712704 (bytes)

Let's put the two results into one table:

  Preloading       Size     Shared       Diff
     Yes        4710400    3997696     712704 (bytes)
      No        4706304    2134016    2572288 (bytes)
  --------------------------------------------
  Difference       4096    1863680   -1859584

You can clearly see that when the modules weren't preloaded the shared memory was about 1864Kb smaller than in the case where the modules were preloaded.

Assuming that you have 256M dedicated to the web server, if you didn't preload the modules, you could have:

  268435456 = X * 2572288 + 2134016

  X = (268435456 - 2134016) / 2572288 = 103

103 servers.

Now let's calculate the same thing with the modules preloaded:

  268435456 = X * 712704 + 3997696

  X = (268435456 - 3997696) / 712704 = 371

You can have about 3.6 times as many servers!

Remember that we mentioned before that memory pages get dirty and the size of the shared memory gets smaller with time? We have presented the ideal case where the shared memory stays intact. Therefore the real numbers will be a little bit different, but not far from the numbers in our example.

It's also quite possible that in your case the process size will be bigger and the shared memory smaller, since you will use different modules and different code, so you won't get this fantastic ratio. Still, this example certainly helps to get a feel for the difference.
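The arithmetic above can be repeated for your own measurements with a few lines of Perl. This is only a sketch of the same calculation: the helper name max_children() is made up for this illustration, and it assumes, as the text does, the ideal case where the shared size stays constant over the life of the children.

```perl
use strict;
use warnings;

# Hypothetical helper: how many child processes fit into $ram bytes,
# given the shared size and the unshared part (size - shared) of one child.
sub max_children {
    my ($ram, $shared, $diff) = @_;
    return int( ($ram - $shared) / $diff );
}

my $ram = 256 * 1024 * 1024;    # 268435456 bytes dedicated to the server

# Numbers taken from the two memuse.pl runs shown above
printf "modules not preloaded: %d servers\n", max_children($ram, 2134016, 2572288);
printf "modules preloaded:     %d servers\n", max_children($ram, 3997696,  712704);
```

Run as a plain script, it should report 103 and 371 servers respectively, the same values as in the manual calculation.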
=head3 Preloading Registry Scripts at Server Startup What happens if you find yourself stuck with Perl CGI scripts and you cannot or don't want to move most of the stuff into modules to benefit from modules preloading, so the code will be shared by the children. Luckily you can preload scripts as well. This time the C modules comes to aid. C compiles C scripts at server startup. For example to preload the script I which is in fact the file I you would do the following: use Apache::RegistryLoader (); Apache::RegistryLoader->new->handler("/perl/test.pl", "/home/httpd/perl/test.pl"); You should put this code either into CPerlE> sections or into a startup script. But what if you have a bunch of scripts located under the same directory and you don't want to list them one by one. Take the benefit of Perl modules and put them to a good use. The C module will do most of the work for you. The following code walks the directory tree under which all C scripts are located. For each encountered file with extension I<.pl>, it calls the C method to preload the script in the parent server, before pre-forking the child processes: use File::Find qw(finddepth); use Apache::RegistryLoader (); { my $scripts_root_dir = "/home/httpd/perl/"; my $rl = Apache::RegistryLoader->new; finddepth ( sub { return unless /\.pl$/; my $url = "$File::Find::dir/$_"; $url =~ s|$scripts_root_dir/?|/|; warn "pre-loading $url\n"; # preload $url my $status = $rl->handler($url); unless($status == 200) { warn "pre-load of `$url' failed, status=$status\n"; } }, $scripts_root_dir); } Note that we didn't use the second argument to C here, as in the first example. To make the loader smarter about the URI to filename translation, you might need to provide a C function to translate the URI to filename. URI to filename translation normally doesn't happen until HTTP request time, so the module is forced to roll its own translation. 
If filename is omitted and a C function was not defined, the loader will try using the URI relative to B. A simple trans() function can be something like that: sub mytrans { my $uri = shift; $uri =~ s|^/perl/|/home/httpd/perl/|; return $uri; } You can easily derive the right translation by looking at the C directive. The above mytrans() function is matching our C: Alias /perl/ /home/httpd/perl/ After defining the URI to filename translation function you should pass it during the creation of the C object: my $rl = Apache::RegistryLoader->new(trans => \&mytrans); I won't show any benchmarks here, since the effect is absolutely the same as with preloading modules. See also L =head3 Modules Initializing at Server Startup We have just learned that it's important to preload the modules and scripts at the server startup. It turns out that it's not enough for some modules and you have to prerun their initialization code to get more memory pages shared. Basically you will find an information about specific modules in their respective manpages. We will present a few examples of widely used modules where the code can be initialized. =head4 Initializing DBI.pm The first example is the C module. As you know C works with many database drivers falling into the C category, e.g. C. It's not enough to preload C, you should initialize C with driver(s) that you are going to use (usually a single driver is used), if you want to minimize memory use after forking the child processes. Note that you want to do this under mod_perl and other environments where the shared memory is very important. Otherwise you shouldn't initialize drivers. You probably know already that under mod_perl you should use the C module to get the connection persistence, unless you open a separate connection for each user--in this case you should not use this module. C automatically loads C and overrides some of its methods, so you should continue coding like there is only a C module. 
Just as with module preloading, our goal is to find the startup environment that leads to the smallest I<"difference"> between the shared and normal memory reported, and therefore to a smaller total memory usage.

And again, in order to have an easy measurement we will use only one child process, therefore we will use this setting in I:

  MinSpareServers 1
  MaxSpareServers 1
  StartServers 1
  MaxClients 1
  MaxRequestsPerChild 100

We always preload these modules:

  use GTop ();
  use Apache::DBI (); # preloads DBI as well

We are going to run memory benchmarks on five different versions of the I file.

=over

=item option 1

Leave the file unmodified.

=item option 2

Install the MySQL driver (we will use the MySQL RDBMS for our test):

  DBI->install_driver("mysql");

It's safe to use this method, since just like with C, if it can't be installed it'll die().

=item option 3

Preload the MySQL driver module:

  use DBD::mysql;

=item option 4

Tell C to connect to the database when the child process starts (C). No driver is preloaded before the child gets spawned!

  Apache::DBI->connect_on_init('DBI:mysql:test::localhost',
                               "",
                               "",
                               {
                                PrintError => 1, # warn() on errors
                                RaiseError => 0, # don't die on error
                                AutoCommit => 1, # commit executes
                                                 # immediately
                               }
                              )
    or die "Cannot connect to database: $DBI::errstr";

=item option 5

Options 2 and 4: using connect_on_init() and install_driver().
=back

Here is the C test script that we have used:

  preload_dbi.pl
  --------------
  use strict;
  use GTop ();
  use DBI ();

  my $dbh = DBI->connect("DBI:mysql:test::localhost",
                         "",
                         "",
                         {
                          PrintError => 1, # warn() on errors
                          RaiseError => 0, # don't die on error
                          AutoCommit => 1, # commit executes
                                           # immediately
                         }
                        )
    or die "Cannot connect to database: $DBI::errstr";

  my $r = shift;
  $r->send_http_header('text/plain');

  my $do_sql = "show tables";
  my $sth = $dbh->prepare($do_sql);
  $sth->execute();
  my @data = ();
  while (my @row = $sth->fetchrow_array){
    push @data, @row;
  }
  print "Data: @data\n";
  $dbh->disconnect(); # NOP under Apache::DBI

  my $proc_mem = GTop->new->proc_mem($$);
  my $size  = $proc_mem->size;
  my $share = $proc_mem->share;
  my $diff  = $size - $share;
  printf "%8s %8s %8s\n", qw(Size Shared Diff);
  printf "%8d %8d %8d (bytes)\n",$size,$share,$diff;

The script opens a connection to the database I<'test'> and issues a query to learn what tables the database has. When the data has been collected and printed, the connection would normally be closed, but C overrides disconnect() with an empty method. After the data is processed, the by now familiar code that prints the memory usage follows.

The server was restarted before each new test.
So here are the results of the five tests that were conducted, sorted by the I column:

=over

=item 1

After the first request:

  Test type                              Size   Shared     Diff
  --------------------------------------------------------------
  install_driver (2)                  3465216  2621440   843776
  install_driver & connect_on_init (5) 3461120  2609152   851968
  preload driver (3)                  3465216  2605056   860160
  nothing added (1)                   3461120  2494464   966656
  connect_on_init (4)                 3461120  2482176   978944

=item 2

After the second request (all the subsequent requests showed the same results):

  Test type                              Size   Shared     Diff
  --------------------------------------------------------------
  install_driver (2)                  3469312  2609152   860160
  install_driver & connect_on_init (5) 3481600  2605056   876544
  preload driver (3)                  3469312  2588672   880640
  nothing added (1)                   3477504  2482176   995328
  connect_on_init (4)                 3481600  2469888  1011712

=back

Now what do we conclude from looking at these numbers? First we see that the final memory footprint for the specific request in question is reached only after the second request (if you pass different arguments the memory usage might and will be different).

But both tables show the same pattern of memory usage. We can clearly see that the real winner is the version of the I file where the MySQL driver was installed (2). Since we want to have a connection ready for the first request made to the freshly spawned child process, we generally use version (5), which uses somewhat more memory but has almost the same number of shared memory pages. Version (3) only preloads the driver, which results in less shared memory. The last two versions are the one with nothing initialized (1) and the one using only the connect_on_init() method (4). The former is a little bit better than the latter, but both are significantly worse than the first two versions.
To remind you why we look for the smallest value in the column I, recall the real memory usage formula:

  RAM_dedicated_to_mod_perl = diff * number_of_processes
                            + largest_shared_size

Notice that the smaller the diff is, the more processes you can have using the same amount of RAM. Therefore every 100K of difference counts when you multiply it by the number of processes. If we take the numbers from version (2) vs. (4) and assume that we have 256M of memory dedicated to mod_perl processes, we get the following numbers using the formula derived from the one above:

               RAM - largest_shared_size
  N_of_Procs = -------------------------
                         Diff

                268435456 - 2609152
  (ver 2)  N =  ------------------- = 309
                      860160

                268435456 - 2469888
  (ver 4)  N =  ------------------- = 262
                      1011712

So you can tell the difference (about 18% more child processes in the first version).

=head4 Initializing CGI.pm

C is a big module that by default postpones the compilation of its methods until they are actually needed, thus making it possible to use it under a slow mod_cgi handler without adding a big overhead. That's not what we want under mod_perl, and if you use C you should precompile the methods that you are going to use at server startup, in addition to preloading the module. Use the compile method for that:

  use CGI;
  CGI->compile(':all');

where you should replace the tag group C<:all> with the real tags and group tags that you are going to use if you want to optimize the memory usage.

We are going to compare the shared memory footprint using a script which is backward compatible with mod_cgi. You will see that you can improve the performance of this kind of script as well, but if you really want fast code think about porting it to use C for the CGI interface and some other module for HTML generation.
So here is the C script that we are going to use to make the comparison:

  preload_cgi_pm.pl
  -----------------
  use strict;
  use CGI ();
  use GTop ();

  my $q = CGI->new;
  print $q->header('text/plain');
  print join "\n", map {"$_ => ".$q->param($_) } $q->param;
  print "\n";

  my $proc_mem = GTop->new->proc_mem($$);
  my $size  = $proc_mem->size;
  my $share = $proc_mem->share;
  my $diff  = $size - $share;
  printf "%8s %8s %8s\n", qw(Size Shared Diff);
  printf "%8d %8d %8d (bytes)\n",$size,$share,$diff;

The script initializes the C object, sends the HTTP header and then prints all the arguments and values that were passed to the script, if any. At the end, as usual, we print the memory usage.

As usual we are going to use a single child process, therefore we will use this setting in I:

  MinSpareServers 1
  MaxSpareServers 1
  StartServers 1
  MaxClients 1
  MaxRequestsPerChild 100

We are going to run memory benchmarks on three different versions of the I file. We always preload this module:

  use GTop ();

=over

=item option 1

Leave the file unmodified.

=item option 2

Preload C:

  use CGI ();

=item option 3

Preload C and pre-compile the methods that we are going to use in the script:

  use CGI ();
  CGI->compile(qw(header param));

=back

The server was restarted before each new test.
So here are the results of the three tests that were conducted, sorted by the I column:

=over

=item 1

After the first request:

  Version     Size   Shared     Diff        Test type
  --------------------------------------------------------------------
        1  3321856  2146304  1175552  not preloaded
        2  3321856  2326528   995328  preloaded
        3  3244032  2465792   778240  preloaded & methods+compiled

=item 2

After the second request (all the subsequent requests showed the same results):

  Version     Size   Shared     Diff        Test type
  --------------------------------------------------------------------
        1  3325952  2134016  1191936  not preloaded
        2  3325952  2314240  1011712  preloaded
        3  3248128  2445312   802816  preloaded & methods+compiled

=back

The first version shows the results of the script execution when C wasn't preloaded. The second version has the module preloaded. In the third it's both preloaded and the methods that are going to be used are precompiled at server startup.

By comparing the first and the second versions in the second table we can conclude that preloading alone adds about 180K of shared size (2314240-2134016 = 180224 bytes). As we mentioned at the beginning of this section, that's how C was implemented--to reduce the load overhead--so preloading the module by itself delivers only part of the possible saving. But if we compare the second and the third versions we will see a further significant difference of about 209K (1011712-802816 = 208896 bytes), and we have used only a few methods (the I
method loads a few more methods transparently for the user). Imagine how much memory we are going to save if we precompile all the methods that we use in other scripts that use C and do a little bit more than the script that we have used in the test.

But even in our very simple case, using the same formula, what do we see? (assuming that we have 256MB dedicated to mod_perl)

               RAM - largest_shared_size
  N_of_Procs = -------------------------
                         Diff

                268435456 - 2134016
  (ver 1)  N =  ------------------- = 223
                      1191936

                268435456 - 2445312
  (ver 3)  N =  ------------------- = 331
                      802816

If we preload C and precompile a few methods that we use in the test script, we can have almost 50% more child processes than when we don't preload and precompile the methods that we are going to use.

META: I've heard that the 3.x generation will be less bloated, so probably I'll have to rerun this using the new version.

=head2 Increasing Shared Memory With mergemem

C is an experimental utility for linux, which looks I interesting for us mod_perl users:

http://www.complang.tuwien.ac.at/ulrich/mergemem/

It looks like it could be run periodically on your server to find and merge duplicate pages. It won't halt your httpds during the merge; this aspect was already taken into consideration during the design of mergemem: Merging is not performed with one big systemcall. Instead most operation is in userspace, making a lot of small systemcalls.

Therefore blocking of the system should not happen. And if it really should turn out to take too much time, you can reduce the priority of the process.

The worst case that can happen is this: C merges two pages and immediately afterwards they are split again. The split costs about the same as the time consumed by merging.

This software comes with a utility called C to tell you how much you might save.

=head2 Forking and Executing Subprocesses from mod_perl

It's desirable to avoid forking under mod_perl.
When you do fork, you are forking the entire Apache server, lock, stock and barrel. Not only are your Perl code and Perl interpreter being duplicated, but so are mod_ssl, mod_rewrite, mod_log, mod_proxy, mod_speling (it's not a typo!) or whatever modules you have used in your server, all the core routines, etc.

Modern Operating Systems come with a very light version of fork which adds little overhead when called, since it was optimized to do the absolute minimum of memory page duplication. The I technique is the one that makes this possible. The gist of this technique is as follows: the parent process' memory pages aren't immediately copied to the child's space on fork(); a page is copied only when the child or the parent modifies the data in it. Once a page is about to be modified it is marked as dirty, and the child has no choice but to get its own copy of that page, since it cannot be shared any more.

If you need to call a Perl program from your mod_perl code, it's better to try to convert the program into a module and call it as a function, without spawning a special process to do that. Of course if you cannot do that, or the program is not written in Perl, you have to call it via system() or its equivalent, which spawns a new process. If the program is written in C, you may try to write Perl glue code with the help of the XS or SWIG architectures, and then the program will be executed as a perl subroutine.

Also, by trying to spawn a sub-process you might be trying to do the I<"wrong thing">. If what you really want is to send information to the browser and then do some post-processing, look into the C directive. The latter allows you to tell the child process what to do after the request has been processed and the user has received the response. This doesn't release the mod_perl process to serve other requests, but it allows you to send the response to the client faster.
If this is the situation and you need to run some cleanup code, you may want to register this code during the request processing via: my $r = shift; $r->register_cleanup(\&do_cleanup); sub do_cleanup{ #some clean-up code here } But when a long term process needs to be spawned, there is not much choice, but to use fork(). We cannot just run this long term process within Apache process, since it'll first keep the Apache process busy, instead of letting it do the job it was designed for. And second, if Apache will be stopped the long term process might be terminated as well, unless coded properly to detach from Apache processes group. In the following sections we are going to discuss how to properly spawn new processes under mod_perl. =head3 Forking a New Process This is a typical way to call fork() under mod_perl: defined (my $kid = fork) or die "Cannot fork: $!\n"; if ($kid) { # Parent runs this block } else { # Child runs this block # some code comes here CORE::exit(0); } # possibly more code here usually run by the parent When using fork(), you should check its return value, since if it returns C it means that the call was unsuccessful and no process was spawned. Something that can happen when the system is running too many processes and cannot spawn new ones. When the process is successfully forked--the parent receives the PID of the newly spawned child as a returned value of the fork() call and the child receives 0. Now the program splits into two. In the above example the code inside the first block after I will be executed by the parent and the code inside the first block after I will be executed by the child process. It's important not to forget to explicitly call exit() at the end of the child code when forking. Since if you don't and there is some code outside the I, the child process will execute it as well. 
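The split described above can be tried outside mod_perl with a minimal standalone sketch. It assumes a Unix-like system where fork() is available; the exit code 7 and the variable names are hypothetical, chosen only for illustration. Because this is a plain script and not mod_perl code, calling exit() directly is fine here.

```perl
use strict;
use warnings;

my $after_block  = 0;   # counts how often the code after if/else runs
my $child_status;       # exit code of the child, filled in by the parent

defined(my $kid = fork) or die "Cannot fork: $!\n";
if ($kid) {
    # Parent: fork() returned the PID of the child.
    waitpid($kid, 0);               # block until the child terminates
    $child_status = $? >> 8;        # high byte of $? is the exit code
}
else {
    # Child: fork() returned 0. Exit explicitly so the child never
    # reaches the code that follows the if/else block.
    exit 7;
}

# Only the parent gets here, because the child has already exited.
$after_block++;
```

If the exit call were removed, the child would fall through and run the trailing code as well, which is exactly the situation described above.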
But under mod_perl there is another nuance--you must use C<CORE::exit()> and not C<exit()>, which would be automatically overridden by C<Apache::exit()> if used in conjunction with C<Apache::Registry> and similar modules. We want the spawned process to quit when its work is done, otherwise it will just stay alive, use resources and do nothing.

The parent process usually completes its execution path and enters the pool of free servers to wait for a new assignment. If the execution path is to be aborted earlier for some reason, one should use Apache::exit() or die(); in the case of C<Apache::Registry> or C<Apache::PerlRun> handlers a simple exit() will do the right thing.

The child shares its memory pages with the parent until it has to modify some of them, which triggers a I<copy-on-write> process which copies these pages to the child's domain before the child is allowed to modify them. But this all happens afterwards. At the moment the fork() call is executed, the only work to be done before the child process goes on its separate way is setting up the page tables for the virtual memory, which imposes almost no delay at all.

=head3 Freeing the Parent Process

In the child code you must also close all the pipes to the connection socket that were opened by the parent process (i.e. C<STDIN> and C<STDOUT>) and inherited by the child, so the parent will be able to complete the request and free itself for serving other requests. If you need the C<STDIN> and/or C<STDOUT> streams you should re-open them. You may need to close or re-open the C<STDERR> filehandle. It's opened to append to the I<error_log> file as inherited from its parent, so chances are that you will want to leave it untouched.

Under mod_perl, the spawned process also inherits the file descriptor that's tied to the socket through which all the communications between the server and the client happen. Therefore we need to free this stream in the forked process. If we don't do that, the server cannot be restarted while the spawned process is still running.
If an attempt is made to restart the server you will get the following error: [Mon Dec 11 19:04:13 2000] [crit] (98)Address already in use: make_sock: could not bind to address 127.0.0.1 port 8000 C comes to help and provides a method cleanup_for_exec() which takes care of closing this file descriptor. So the simplest way is to freeing the parent process is to close all three STD* streams if we don't need them and untie the Apache socket. In addition you may want to change process' current directory to I so the forked process won't keep the mounted partition busy, if this is to be unmounted at a later time. To summarize all this issues, here is an example of the fork that takes care of freeing the parent process. use Apache::SubProcess; defined (my $kid = fork) or die "Cannot fork: $!\n"; if ($kid) { # Parent runs this block } else { # Child runs this block $r->cleanup_for_exec(); # untie the socket chdir '/' or die "Can't chdir to /: $!"; close STDIN; close STDOUT; close STDERR; # some code comes here CORE::exit(0); } # possibly more code here usually run by the parent Of course between the freeing the parent code and child process termination the real code is to be placed. =head3 Detaching the Forked Process Now what happens if the forked process is running and we decided that we need to restart the web-server? This forked process will be aborted, since when parent process will die during the restart it'll kill its child processes as well. In order to avoid this we need to detach the process from its parent session, by opening a new session with help of setsid() system call, provided by the C module: use POSIX 'setsid'; defined (my $kid = fork) or die "Cannot fork: $!\n"; if ($kid) { # Parent runs this block } else { # Child runs this block setsid or die "Can't start a new session: $!"; ... } Now the spawned child process has a life of its own, and it doesn't depend on the parent anymore. =head3 Avoiding Zombie Processes Now let's talk about zombie processes. 
Normally, every process has its parent. Many processes are children of the C<init> process, whose C<PID> is C<1>. When you fork a process you must wait() or waitpid() for it to finish. If you don't wait() for it, it becomes a zombie.

A zombie is a process that has terminated but whose exit status has not yet been collected by its parent. When the child quits, it reports the termination to its parent. If the parent never wait()s to collect the exit status of the child, the child gets I<"confused"> and becomes a ghost process, which can be seen as a process, but not killed. It goes away only when the parent process that spawned it collects its status or is itself stopped!

Generally the ps(1) utility displays these processes with the C<E<lt>defunctE<gt>> tag, and you will see the zombies counter increment when running top(1). These zombie processes take up system resources (at least a slot in the process table) and are generally undesirable.

So the proper way to do a fork is:

  my $r = shift;
  $r->send_http_header('text/plain');

  defined (my $kid = fork) or die "Cannot fork: $!";
  if ($kid) {
      waitpid($kid,0);
      print "Parent has finished\n";
  } else {
      # do something
      CORE::exit(0);
  }

In most cases the only reason you would want to fork is when you need to spawn a process that will take a long time to complete. So if the Apache process that spawns this new child process has to wait for it to finish, you have gained nothing. The waitpid() call above is a blocking call: the process cannot do anything else until the call completes. You can neither wait for the child's completion (because you don't have the time to), nor continue without waiting, because you will get yet another zombie process.

The simplest solution is to ignore your dead children. Just add this line before the fork() call:

  $SIG{CHLD} = 'IGNORE';

When you set the C<CHLD> (C<SIGCHLD> in C) signal handler to C<'IGNORE'>, the system discards the exit status of terminated children automatically, and they are therefore prevented from becoming zombies. This doesn't work everywhere, however. It proved to work at least on Linux.

Note that you cannot localize this setting with C<local()>. If you do, it won't have the desired effect.
[META: Can anyone explain why localization doesn't work?]

So now the code would look like this:

  my $r = shift;
  $r->send_http_header('text/plain');

  $SIG{CHLD} = 'IGNORE';

  defined (my $kid = fork) or die "Cannot fork: $!\n";
  if ($kid) {
      print "Parent has finished\n";
  } else {
      # do something time-consuming
      CORE::exit(0);
  }

Note that the waitpid() call has gone. The S<$SIG{CHLD} = 'IGNORE';> statement protects us from zombies, as explained above.

Another, more portable, but slightly more expensive solution is to use a double fork approach.

  my $r = shift;
  $r->send_http_header('text/plain');

  defined (my $kid = fork) or die "Cannot fork: $!\n";
  if ($kid) {
      waitpid($kid,0);
  } else {
      defined (my $grandkid = fork) or die "Kid cannot fork: $!\n";
      if ($grandkid) {
          CORE::exit(0);
      } else {
          # code here
          # do something long lasting
          CORE::exit(0);
      }
  }

The grandkid becomes a I<"child of init">, i.e. the child of the process whose PID is 1. The intermediate kid exits immediately, so the waitpid() call in the parent returns almost at once.

Note that the previous two solutions do not allow you to learn the exit status of the long-running process, but in our example we didn't care about it.

Another solution is to use a different I<SIGCHLD> handler:

  use POSIX 'WNOHANG';
  $SIG{CHLD} = sub { while( waitpid(-1,WNOHANG)>0 ) {} };

This is useful when you fork() more than one process. The handler could call wait() as well, but for a variety of reasons involving the handling of stopped processes and the rare event in which two children exit at nearly the same moment, the best technique is to call waitpid() in a tight loop with a first argument of C<-1> and a second argument of C<WNOHANG>. Together these arguments tell waitpid() to reap the next child that's available, and prevent the call from blocking if there happens to be no child ready for reaping. The handler will loop until waitpid() returns a negative number or zero, indicating that no more reapable children remain.

While you test and debug your code that uses one of the above examples, you might want to write some debug information to the error_log file so you know what happens.
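The reaping handler can be exercised outside mod_perl with a small standalone sketch. It assumes a Unix-like system; the C<%exit_code> hash, the exit codes 1 to 3 and the polling loop are illustrative choices and not part of any mod_perl API. The warn() call writes to STDERR, which under mod_perl would end up in the error_log file.

```perl
use strict;
use warnings;
use POSIX 'WNOHANG';

my %exit_code;    # PID => exit code of each reaped child

# Reap every child that is ready, without blocking.
$SIG{CHLD} = sub {
    local ($!, $?);    # do not clobber the main program's values
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $exit_code{$pid} = $? >> 8;
        warn "reaped child $pid, exit code $exit_code{$pid}\n";
    }
};

my @kids;
for my $code (1 .. 3) {
    defined(my $kid = fork) or die "Cannot fork: $!\n";
    if ($kid) {
        push @kids, $kid;    # parent: remember the child's PID
    }
    else {
        exit $code;          # child: finish immediately
    }
}

# Parent: poll for up to about 5 seconds until all children are reaped.
my $tries = 50;
while (scalar(keys %exit_code) < scalar(@kids) && $tries-- > 0) {
    select(undef, undef, undef, 0.1);    # sleep 0.1s, or until a signal
}
```

Because the handler loops until waitpid() reports no more children, it copes with several children exiting at nearly the same moment, even if the signals are merged into one delivery.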
Read I manpage for more information about signal handlers. =head3 A Complete Fork Example Now let's put all the bits of code together and show a well written fork code that solves all the problems discussed so far. We will use an C script for this purpose: proper_fork1.pl --------------- use strict; use POSIX 'setsid'; use Apache::SubProcess; my $r = shift; $r->send_http_header("text/plain"); $SIG{CHLD} = 'IGNORE'; defined (my $kid = fork) or die "Cannot fork: $!\n"; if ($kid) { print "Parent $$ has finished, kid's PID: $kid\n"; } else { $r->cleanup_for_exec(); # untie the socket chdir '/' or die "Can't chdir to /: $!"; open STDIN, '/dev/null' or die "Can't read /dev/null: $!"; open STDOUT, '>/dev/null' or die "Can't write to /dev/null: $!"; open STDERR, '>/tmp/log' or die "Can't write to /tmp/log: $!"; setsid or die "Can't start a new session: $!"; my $oldfh = select STDERR; local $| = 1; select $oldfh; warn "started\n"; # do something time-consuming sleep 1, warn "$_\n" for 1..20; warn "completed\n"; CORE::exit(0); # terminate the process } The script starts with the usual declaration of the strict mode, loading the C and C modules and importing of the setsid() symbol from the C package. The HTTP header is sent next, with the I of I. The parent process gets ready to ignore the child, to avoid zombies and the fork is called. The program gets its personality split after fork and the if conditional evaluates to a true value for the parent process, and to a false value for the child process, therefore the first block is executed by the parent and the second by the child. The parent process announces his PID and the PID of the spawned process and finishes its block. If there will be any code outside it will be executed by the parent as well. The child process starts its code by disconnecting from the socket, changing its current directory to C, opening the STDIN and STDOUT streams to I, which in effect closes them both before opening. 
In fact in this example we don't need neither of these, so we could just close() both. The child process completes its disengagement from the parent process by opening the STDERR stream to I, so it could write there, and creating a new session with help of setsid(). Now the child process has nothing to do with the parent process and can do the actual processing that it has to do. In our example it performs a simple series of warnings, which are logged into I: my $oldfh = select STDERR; local $| = 1; select $oldfh; warn "started\n"; # do something time-consuming sleep 1, warn "$_\n" for 1..20; warn "completed\n"; The localized setting of C<$|=1> unbuffers the STDERR stream, so we can immediately see the debug output generated by the program. In fact this setting is not required when the output is generated by warn(). Finally the child process terminates by calling: CORE::exit(0); which make sure that it won't get out of the block and run some code that it's not supposed to run. This code example will allow you to verify that indeed the spawned child process has its own life, and its parent is free as well. Simply issue a request that will run this script, watch that the warnings are started to be written into the I file and issue a complete server stop and start. If everything is correct, the server will successfully restart and the long term process will still be running. You will know that it's still running, if the warnings will still be printed into the I file. You may need to raise the number of warnings to do above 20, to make sure that you don't miss the end of the run. If there are only 5 warnings to be printed, you should see the following output in this file: started 1 2 3 4 5 completed =head3 Starting a Long Running External Program But what happens if we cannot just run a Perl code from the spawned process and we have a compiled utility, i.e. a program written in C. 
Or we have a Perl program which cannot be easily converted into a module, and thus called as a function. Of course in this case we have to use system(), exec(), qx() or C<``> (back ticks) to start it.

When using any of these methods and when the I<taint> mode is enabled, we must at least add the following code to untaint the I<PATH> environment variable and delete a few other insecure environment variables. This information can be found in the I<perlsec> manpage.

  $ENV{'PATH'} = '/bin:/usr/bin';
  delete @ENV{'IFS', 'CDPATH', 'ENV', 'BASH_ENV'};

Now all we have to do is to reuse the code from the previous section.

First we move the core program into the I<external.pl> file, add the shebang first line so the program will be executed by Perl, tell the program to run under I<taint> mode (-T), possibly enable the I<warnings> mode (-w), and make it executable:

  external.pl
  -----------
  #!/usr/bin/perl -Tw

  open STDIN, '/dev/null'  or die "Can't read /dev/null: $!";
  open STDOUT, '>/dev/null'
      or die "Can't write to /dev/null: $!";
  open STDERR, '>/tmp/log' or die "Can't write to /tmp/log: $!";

  my $oldfh = select STDERR;
  local $| = 1;
  select $oldfh;
  warn "started\n";
  # do something time-consuming
  sleep 1, warn "$_\n" for 1..20;
  warn "completed\n";

Now we replace the code that moved into the external program with exec() to call it:

  proper_fork_exec.pl
  -------------------
  use strict;
  use POSIX 'setsid';
  use Apache::SubProcess;

  $ENV{'PATH'} = '/bin:/usr/bin';
  delete @ENV{'IFS', 'CDPATH', 'ENV', 'BASH_ENV'};

  my $r = shift;
  $r->send_http_header("text/html");

  $SIG{CHLD} = 'IGNORE';

  defined (my $kid = fork) or die "Cannot fork: $!\n";
  if ($kid) {
      print "Parent has finished, kid's PID: $kid\n";
  } else {
      $r->cleanup_for_exec(); # untie the socket
      chdir '/'                or die "Can't chdir to /: $!";
      open STDIN, '/dev/null'  or die "Can't read /dev/null: $!";
      open STDOUT, '>/dev/null'
          or die "Can't write to /dev/null: $!";
      open STDERR, '>&STDOUT'  or die "Can't dup stdout: $!";
      setsid or die "Can't start a new session: $!";

      exec "/home/httpd/perl/external.pl" or die "Cannot execute exec: $!";
  }

Notice that exec() never returns unless it fails to start the process. Therefore you shouldn't put any code after exec()--it will not be executed in the case of success. Use system() or back-ticks instead if you want to continue doing other things in the process. But then you probably will want to terminate the process after the program has finished. Since system() returns 0 on success, compare its return value with 0 instead of using a bare C<or die>, which would die exactly when the program succeeds. So you will have to write:

      system("/home/httpd/perl/external.pl") == 0
          or die "Cannot execute system: $?";
      CORE::exit(0);

Another important nuance is that we have to close all STD* streams in the forked process, even if the called program does that.

If the external program is written in Perl you may pass complicated data structures to it using one of the methods to serialize Perl data and then to restore it. The C<Storable> and C<FreezeThaw> modules come in handy. Let's say that we have program I<master.pl> calling program I<slave.pl>:

  master.pl
  ---------
  # we are within the mod_perl code
  use Storable ();
  my @params = (foo => 1, bar => 2);
  my $params = Storable::freeze(\@params);
  exec "./slave.pl", $params or die "Cannot execute exec: $!";

  slave.pl
  --------
  #!/usr/bin/perl -w
  use Storable ();
  my @params = @ARGV ? @{ Storable::thaw(shift)||[] } : ();
  # do something

As you can see, I<master.pl> serializes the C<@params> data structure with C<Storable::freeze> and passes it to I<slave.pl> as a single argument. I<slave.pl> restores it with C<Storable::thaw>, by shifting the first value of the C<@ARGV> array if available. Note that the frozen string is binary data; a command line argument cannot carry NUL bytes, so if the frozen string may contain them it should be encoded first (for example as hex with unpack "H*") and decoded in I<slave.pl>. The C<FreezeThaw> module does a very similar thing.

=head3 Starting a Short Running External Program

Sometimes you need to call an external program and you cannot continue before this program completes its run and optionally returns some result. In this case the fork solution doesn't help. But we have a few ways to execute this program. First using system():

  system "perl -e 'print 5+5'"

We believe that you will never call the perl interpreter for doing this simple calculation, but for the sake of a simple example it's good enough.
The problem with this approach is that we cannot get the results printed to C<STDOUT>, and that's where back-ticks or qx() come to help. If you use either:

  my $result = `perl -e 'print 5+5'`;

or:

  my $result = qx{perl -e 'print 5+5'};

the whole output of the external program will be stored in the C<$result> variable.

Of course you can use other solutions, like opening a pipe (C<|> to the program) if you need to submit many arguments, and more evolved solutions provided by other Perl modules like C<IPC::Open2>, which allows you to open a process for both reading and writing.

=head3 Executing system() or exec() in the Right Way

Perl's exec() and system() functions behave identically in the way they spawn a program. Let's use system() as an example. Consider the following code:

  system("echo","Hi");

Perl will use the first argument as a program to execute, find C<echo> along the search path, invoke it directly and pass the I<Hi> string as an argument.

Perl's system() is B<not> the C<system(3)> call [C-library]. This is how the arguments to system() get interpreted. When there is a single argument to system(), it is first checked for shell metacharacters (like C<*> or C<?>). If there are any, the Perl interpreter invokes a real shell program (S</bin/sh -c> on Unix platforms). If there are none, the argument is split into words and passed directly to the C-level C<execvp()> system call, which is more efficient. If you pass a list of arguments to system(), they are not checked for metacharacters at all and are passed directly to C<execvp()>. That's a I<very> nice optimization. In other words, only if you do:

  system "sh -c 'echo *'"

will the operating system actually exec() a copy of C</bin/sh> to parse your command. But even then, since I<sh> is almost certainly already running somewhere, the system will notice that (via the disk inode reference) and replace your virtual memory page table with one pointing to the existing program code plus your data space, and thus will not create this overhead.

=head2 OS Specific Parameters for Proxying

Most of the mod_perl enabled servers use a proxy front-end server.
This is done in order to avoid serving static objects, and also so that generated output which might be received by slow clients does not cause the heavy but very fast mod_perl servers from idly waiting. There are very important OS parameters that you might want to change in order to improve the server performance. This topic is discussed in the section: L =head1 Performance Tuning by Tweaking Apache Configuration Correct configuration of the C, C, C, C, and C parameters is very important. There are no defaults. If they are too low, you will under-use the system's capabilities. If they are too high, the chances are that the server will bring the machine to its knees. All the above parameters should be specified on the basis of the resources you have. With a plain apache server, it's no big deal if you run many servers since the processes are about 1Mb and don't eat a lot of your RAM. Generally the numbers are even smaller with memory sharing. The situation is different with mod_perl. I have seen mod_perl processes of 20Mb and more. Now if you have C set to 50: 50x20Mb = 1Gb. Do you have 1Gb of RAM? Maybe not. So how do you tune the parameters? Generally by trying different combinations and benchmarking the server. Again mod_perl processes can be of much smaller size with memory sharing. Before you start this task you should be armed with the proper weapon. You need the B utility, which will load your server with the mod_perl scripts you possess. You need it to have the ability to emulate a multiuser environment and to emulate the behavior of multiple clients calling the mod_perl scripts on your server simultaneously. While there are commercial solutions, you can get away with free ones which do the same job. You can use the L B> utility which comes with the Apache distribution, the L which uses C, L or L. 
It is important to make sure that you run the load generator (the client which generates the test requests) on a system that is more powerful than the system being tested. After all we are trying to simulate Internet users, where many users are trying to reach your service at once. Since the number of concurrent users can be quite large, your testing machine must be very powerful and capable of generating a heavy load. Of course you should not run the clients and the server on the same machine. If you do, your test results would be invalid. Clients will eat CPU and memory that should be dedicated to the server, and vice versa. =head2 Configuration Tuning with ApacheBench We are going to use C (C) utility to tune our server's configuration. We will simulate 10 users concurrently requesting a very light script at C. Each simulated user makes 10 requests. % ./ab -n 100 -c 10 http://www.example.com/perl/access/access.cgi The results are: Document Path: /perl/access/access.cgi Document Length: 16 bytes Concurrency Level: 10 Time taken for tests: 1.683 seconds Complete requests: 100 Failed requests: 0 Total transferred: 16100 bytes HTML transferred: 1600 bytes Requests per second: 59.42 Transfer rate: 9.57 kb/s received Connnection Times (ms) min avg max Connect: 0 29 101 Processing: 77 124 1259 Total: 77 153 1360 The only numbers we really care about are: Complete requests: 100 Failed requests: 0 Requests per second: 59.42 Let's raise the request load to 100 x 10 (10 users, each makes 100 requests): % ./ab -n 1000 -c 10 http://www.example.com/perl/access/access.cgi Concurrency Level: 10 Complete requests: 1000 Failed requests: 0 Requests per second: 139.76 As expected, nothing changes -- we have the same 10 concurrent users. 
Now let's raise the number of concurrent users to 50: % ./ab -n 1000 -c 50 http://www.example.com/perl/access/access.cgi Complete requests: 1000 Failed requests: 0 Requests per second: 133.01 We see that the server is capable of serving 50 concurrent users at 133 requests per second! Let's find the upper limit. Using C<-n 10000 -c 1000> failed to get results (Broken Pipe?). Using C<-n 10000 -c 500> resulted in 94.82 requests per second. The server's performance went down with the high load. The above tests were performed with the following configuration: MinSpareServers 6 MaxSpareServers 8 StartServers 10 MaxClients 50 MaxRequestsPerChild 1500 Now let's kill each child after it serves a single request. We will use the following configuration: MinSpareServers 6 MaxSpareServers 8 StartServers 10 MaxClients 100 MaxRequestsPerChild 1 Simulate 50 users each generating a total of 20 requests: % ./ab -n 1000 -c 50 http://www.example.com/perl/access/access.cgi The benchmark timed out with the above configuration.... I watched the output of B> as I ran it, the parent process just wasn't capable of respawning the killed children at that rate. When I raised the C to 10, I got 8.34 requests per second. Very bad - 18 times slower! You can't benchmark the importance of the C, C and C with this kind of test. Now let's reset C to 1500, but reduce C to 10 and run the same test: MinSpareServers 6 MaxSpareServers 8 StartServers 10 MaxClients 10 MaxRequestsPerChild 1500 I got 27.12 requests per second, which is better but still 4-5 times slower. (I got 133 with C set to 50.) B I have tested a few combinations of the server configuration variables (C, C, C, C and C). The results I got are as follows: C, C and C are only important for user response times. Sometimes users will have to wait a bit. The important parameters are C and C. 
C should be not too big, so it will not abuse your machine's memory resources, and not too small, for if it is your users will be forced to wait for the children to become free to serve them. C should be as large as possible, to get the full benefit of mod_perl, but watch your server at the beginning to make sure your scripts are not leaking memory, thereby causing your server (and your service) to die very fast. Also it is important to understand that we didn't test the response times in the tests above, but the ability of the server to respond under a heavy load of requests. If the test script was heavier, the numbers would be different but the conclusions very similar. The benchmarks were run with: HW: RS6000, 1Gb RAM SW: AIX 4.1.5 . mod_perl 1.16, apache 1.3.3 Machine running only mysql, httpd docs and mod_perl servers. Machine was _completely_ unloaded during the benchmarking. After each server restart when I changed the server's configuration, I made sure that the scripts were preloaded by fetching a script at least once for every child. It is important to notice that none of the requests timed out, even if it was kept in the server's queue for more than a minute! That is the way B works, which is OK for testing purposes but will be unacceptable in the real world - users will not wait for more than five to ten seconds for a request to complete, and the client (i.e. the browser) will time out in a few minutes. Now let's take a look at some real code whose execution time is more than a few milliseconds. We will do some real testing and collect the data into tables for easier viewing. 
I will use the following abbreviations: NR = Total Number of Request NC = Concurrency MC = MaxClients MRPC = MaxRequestsPerChild RPS = Requests per second Running a mod_perl script with lots of mysql queries (the script under test is mysqld limited) (http://www.example.com/perl/access/access.cgi?do_sub=query_form), with the configuration: MinSpareServers 8 MaxSpareServers 16 StartServers 10 MaxClients 50 MaxRequestsPerChild 5000 gives us: NR NC RPS comment ------------------------------------------------ 10 10 3.33 # not a reliable figure 100 10 3.94 1000 10 4.62 1000 50 4.09 B Here I wanted to show that when the application is slow (not due to perl loading, code compilation and execution, but limited by some external operation) it almost does not matter what load we place on the server. The RPS (Requests per second) is almost the same. Given that all the requests have been served, you have the ability to queue the clients, but be aware that anything that goes into the queue means a waiting client and a client (browser) that might time out! Now we will benchmark the same script without using the mysql (code limited by perl only): (http://www.example.com/perl/access/access.cgi), it's the same script but it just returns the HTML form, without making SQL queries. MinSpareServers 8 MaxSpareServers 16 StartServers 10 MaxClients 50 MaxRequestsPerChild 5000 NR NC RPS comment ------------------------------------------------ 10 10 26.95 # not a reliable figure 100 10 30.88 1000 10 29.31 1000 50 28.01 1000 100 29.74 10000 200 24.92 100000 400 24.95 B This time the script we executed was pure perl (not limited by I/O or mysql), so we see that the server serves the requests much faster. You can see the number of requests per second is almost the same for any load, but goes lower when the number of concurrent clients goes beyond C. With 25 RPS, the machine simulating a load of 400 concurrent clients will be served in 16 seconds. 
To be more realistic, assuming a maximum of 100 concurrent clients and 30 requests per second, the client will be served in 3.5 seconds. Pretty good for a highly loaded server. Now we will use the server to its full capacity, by keeping all C clients alive all the time and having a big C, so that no child will be killed during the benchmarking. MinSpareServers 50 MaxSpareServers 50 StartServers 50 MaxClients 50 MaxRequestsPerChild 5000 NR NC RPS comment ------------------------------------------------ 100 10 32.05 1000 10 33.14 1000 50 33.17 1000 100 31.72 10000 200 31.60 Conclusion: In this scenario there is no overhead involving the parent server loading new children, all the servers are available, and the only bottleneck is contention for the CPU. Now we will change C and watch the results: Let's reduce C to 10. MinSpareServers 8 MaxSpareServers 10 StartServers 10 MaxClients 10 MaxRequestsPerChild 5000 NR NC RPS comment ------------------------------------------------ 10 10 23.87 # not a reliable figure 100 10 32.64 1000 10 32.82 1000 50 30.43 1000 100 25.68 1000 500 26.95 2000 500 32.53 B Very little difference! Ten servers were able to serve almost with the same throughput as 50 servers. Why? My guess is because of CPU throttling. It seems that 10 servers were serving requests 5 times faster than when we worked with 50 servers. In that case, each child received its CPU time slice five times less frequently. So having a big value for C, doesn't mean that the performance will be better. You have just seen the numbers! Now we will start drastically to reduce C: MinSpareServers 8 MaxSpareServers 16 StartServers 10 MaxClients 50 NR NC MRPC RPS comment ------------------------------------------------ 100 10 10 5.77 100 10 5 3.32 1000 50 20 8.92 1000 50 10 5.47 1000 50 5 2.83 1000 100 10 6.51 B When we drastically reduce C, the performance starts to become closer to plain mod_cgi. 
Here are the numbers of this run with mod_cgi, for comparison: MinSpareServers 8 MaxSpareServers 16 StartServers 10 MaxClients 50 NR NC RPS comment ------------------------------------------------ 100 10 1.12 1000 50 1.14 1000 100 1.13 B: mod_cgi is much slower. :) In the first test, when NR/NC was 100/10, mod_cgi was capable of 1.12 requests per second. In the same circumstances, mod_perl was capable of 32 requests per second, nearly 30 times faster! In the first test each client waited about 100 seconds to be served. In the second and third tests they waited 1000 seconds! =head2 Choosing MaxClients The C directive sets the limit on the number of simultaneous requests that can be supported. No more than this number of child server processes will be created. To configure more than 256 clients, you must edit the C entry in C and recompile. In our case we want this variable to be as small as possible, because in this way we can limit the resources used by the server children. Since we can restrict each child's process size (see L), the calculation of C is pretty straightforward: Total RAM Dedicated to the Webserver MaxClients = ------------------------------------ MAX child's process size So if I have 400Mb left for the webserver to run with, I can set C to be of 40 if I know that each child is limited to 10Mb of memory (e.g. with L|guide::performance/Preventing_Your_Processes_from_Growing>). You will be wondering what will happen to your server if there are more concurrent users than C at any time. This situation is signified by the following warning message in the C: [Sun Jan 24 12:05:32 1999] [error] server reached MaxClients setting, consider raising the MaxClients setting There is no problem -- any connection attempts over the C limit will normally be queued, up to a number based on the C directive. When a child process is freed at the end of a different request, the connection will be served. 
It B because clients are being put in the queue rather than getting served immediately, despite the fact that they do not get an error response. The error can be allowed to persist to balance available system resources and response time, but sooner or later you will need to get more RAM so you can start more child processes. The best approach is to try not to reach this condition at all, and if you reach it often you should start to worry about it.

It's important to understand how much real memory a child occupies. Your children can share memory between them when the OS supports that. You must take action to allow the sharing to happen - see L. If you do this, the chances are that your C can be even higher. But it seems that it's not so simple to calculate the absolute number. If you come up with a solution please let us know!

If the shared memory were of the same size throughout the child's life, we could derive a much better formula:

               Total_RAM + Shared_RAM_per_Child * (MaxClients - 1)
  MaxClients = ---------------------------------------------------
                              Max_Process_Size

Multiplying both sides by Max_Process_Size and collecting the MaxClients terms on one side gives:

                   Total_RAM - Shared_RAM_per_Child
  MaxClients = ---------------------------------------
               Max_Process_Size - Shared_RAM_per_Child

Let's roll some calculations:

  Total_RAM            = 500Mb
  Max_Process_Size     =  10Mb
  Shared_RAM_per_Child =   4Mb

                500 - 4
  MaxClients = --------- = 82
                10 - 4

With no sharing in place:

                  500
  MaxClients = --------- = 50
                  10

With sharing in place you can have 64% more servers without buying more RAM.

If you improve sharing and keep the sharing level, let's say:

  Total_RAM            = 500Mb
  Max_Process_Size     =  10Mb
  Shared_RAM_per_Child =   8Mb

                500 - 8
  MaxClients = --------- = 246
                10 - 8

392% more servers! Now you can feel the importance of having as much shared memory as possible.

=head2 Choosing MaxRequestsPerChild

The C directive sets the limit on the number of requests that an individual child server process will handle. After C requests, the child process will die.
If C is 0, then the process will live forever.

Setting C to a non-zero limit solves some memory leakage problems caused by sloppy programming practices, whereby a child process consumes more memory after each request.

If left unbounded, then after a certain number of requests the children will use up all the available memory and leave the server to die from memory starvation. Note that sometimes standard system libraries leak memory too, especially on OSes with bad memory management (e.g. Solaris 2.5 on x86 arch).

If this is your case you can set C to a small number. This will allow the system to reclaim the memory that a greedy child process consumed, when it exits after C requests.

But beware -- if you set this number too low, you will lose some of the speed bonus you get from mod_perl. Consider using C if this is the case.

Another approach is to use the L modules. By using either of these modules you should be able to discontinue using the C, although for some developers, using both in combination does the job. In addition these modules allow you to kill httpd processes whose shared memory size drops below a specified limit or whose unshared memory size crosses a specified threshold.

See also L and L.

=head2 Choosing MinSpareServers, MaxSpareServers and StartServers

With mod_perl enabled, it might take as much as 20 seconds from the time you start the server until it is ready to serve incoming requests. This delay depends on the OS, the number of preloaded modules and the process load of the machine. It's best to set C and C to high numbers, so that if you get a high load just after the server has been restarted the fresh servers will be ready to serve requests immediately. With mod_perl, it's usually a good idea to raise all 3 variables higher than normal.

In order to maximize the benefits of mod_perl, you don't want to kill servers when they are idle; rather, you want them to stay up and available to handle new requests immediately.
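Since the spare-server settings are chosen relative to MaxClients, it can help to script the MaxClients arithmetic from the "Choosing MaxClients" section before picking them. The following is only a minimal sketch: max_clients() is a hypothetical helper (it is not part of Apache or mod_perl), all sizes are in Mb, and it assumes that the amount of memory shared per child stays constant over the child's life, which is the same simplification the formula itself makes.

```perl
use strict;
use warnings;

# Hypothetical helper, not an Apache or mod_perl API.
# Estimates MaxClients from the RAM dedicated to the webserver, the
# maximum size of a child and the memory each child shares (all in Mb).
sub max_clients {
    my ($total_ram, $max_process_size, $shared_per_child) = @_;
    $shared_per_child = 0 unless defined $shared_per_child;

    my $unshared = $max_process_size - $shared_per_child;
    die "shared size must be smaller than the process size\n"
        if $unshared <= 0;

    # (Total_RAM - Shared_RAM_per_Child)
    #   / (Max_Process_Size - Shared_RAM_per_Child)
    # With no sharing this reduces to Total_RAM / Max_Process_Size.
    return int( ($total_ram - $shared_per_child) / $unshared );
}

# The three cases worked through in the text: 50, 82 and 246 children.
for my $shared (0, 4, 8) {
    printf "shared=%dMb MaxClients=%d\n",
        $shared, max_clients(500, 10, $shared);
}
```

Rounding down with int() is a deliberate choice here: rounding up would allow one more child than the memory budget covers.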
I think an ideal configuration is to set C and C to similar values, maybe even the same. Having the C close to C will completely use all of your resources (if C has been chosen to take full advantage of the resources), but it'll make sure that at any given moment your system will be capable of responding to requests with the maximum speed (assuming that the number of concurrent requests is not higher than C).

Let's try some numbers. For a heavily loaded web site and a dedicated machine I would think of (note 400Mb is just for example):

  Available to webserver RAM:   400Mb
  Child's memory size bounded:  10Mb
  MaxClients:                   400/10 = 40 (larger with mem sharing)
  StartServers:                 20
  MinSpareServers:              20
  MaxSpareServers:              35

However if I want to use the server for many other tasks, but make it capable of handling a high load, I'd think of:

  Available to webserver RAM:   400Mb
  Child's memory size bounded:  10Mb
  MaxClients:                   400/10 = 40
  StartServers:                 5
  MinSpareServers:              5
  MaxSpareServers:              10

These numbers are taken off the top of my head, and shouldn't be used as a rule, but rather as examples to show you some possible scenarios. Use this information with caution!

=head2 Summary of Benchmarking to tune all 5 parameters

OK, we've run various benchmarks -- let's summarize the conclusions:

=over 4

=item * MaxRequestsPerChild

If your scripts are clean and don't leak memory, set this variable to a number as large as possible (10000?). If you use C or C, you can set this parameter to 0 (treated as infinity).

=item * StartServers

If you keep a small number of servers active most of the time, keep this number low. Keep it low especially if C is also low, because when there is no load Apache will kill its children before they have been utilized at all. If your service is heavily loaded, make this number close to C, and keep C equal to C.

=item * MinSpareServers

If your server performs other work besides web serving, make this low so the memory of unused children will be freed when the load is light.
If your server's load varies (you get loads in bursts) and you want fast response for all clients at any time, you will want to make it high, so that new children will be spawned in advance and are waiting to handle bursts of requests.

=item * MaxSpareServers

The logic is the same as for C - low if you need the machine for other tasks, high if it's a dedicated web host and you want a minimal delay between the request and the response.

=item * MaxClients

Not too low, so you don't get into a situation where clients are waiting for the server to start serving them (they might wait, but not for very long). However, do not set it too high. With a high MaxClients, if you get a high load the server will try to serve all requests immediately. Your CPU will have a hard time keeping up, and if the child size * number of running children is larger than the total available RAM your server will start swapping. This will slow down everything, which in turn will make things even slower, until eventually your machine will die. It's important that you take pains to ensure that swapping does not normally happen. Swap space is an emergency pool, not a resource to be used routinely. If you are low on memory and you badly need it, buy it. Memory is cheap.

But based on the test I conducted above, even if you have plenty of memory like I have (1Gb), increasing C sometimes will give you no improvement in performance. The more clients are running, the more CPU time will be required and the fewer CPU time slices each process will receive. The response latency (the time to respond to a request) will grow, so you won't see the expected improvement. The best approach is to find the minimum requirement for your kind of service and the maximum capability of your machine. Then start at the minimum and test like I did, successively raising this parameter until you find the region on the curve of the graph of latency and/or throughput against MaxClients where the improvement starts to diminish.
Stop there and use it. When you make the measurements on a production server you will have the ability to tune them more precisely, since you will see the real numbers.

Don't forget that if you add more scripts, or even just modify the existing ones, the processes will grow in size as you compile in more code. Probably the parameters will need to be recalculated.

=back

=head2 KeepAlive

If your mod_perl server's I includes the following directives:

  KeepAlive On
  MaxKeepAliveRequests 100
  KeepAliveTimeout 15

you have a real performance penalty, since after completing the processing for each request, the process will wait for C seconds before closing the connection and will therefore not be serving other requests during this time. With this configuration you will need many more concurrent processes on a server with high traffic.

If you use some server status reporting tools, you will see the process in I status when it's in C status.

The chances are that you don't want this feature enabled. Set it Off with:

  KeepAlive Off

The other two directives don't matter if C is C.

You might want to consider enabling this option if the client's browser needs to request more than one object from your server for a single HTML page. If this is the situation, then by setting C C you save, for each page, the HTTP connection overhead for all requests but the first one.

For example if you have a page with 10 ad banners, which is not uncommon today, your server will work more effectively if a single process serves them all during a single connection. However, your client will see a slightly slower response, since banners will be brought one at a time and not concurrently as is the case if each C tag opens a separate connection.

Since keepalive connections do not incur the additional three-way TCP handshake, they are kinder to the network.

SSL connections benefit the most from C in case you didn't configure the server to cache session ids.
You have probably followed the advice to send all the requests for static objects to a plain Apache server. Since most pages include more than one unique static image, you should keep the default C setting of the non-mod_perl server, i.e. keep it C. It will probably be a good idea also to reduce the timeout a little.

One option would be for the proxy/accelerator to keep the connection open to the client but make individual connections to the server, read the response, buffer it for sending to the client and close the server connection. Obviously you would make new connections to the server as required by the client's requests.

=head2 PerlSetupEnv Off

C is another optimization you might consider. This directive requires mod_perl 1.25 or later.

When this option is enabled, I fiddles with the environment to make it appear as if the code is called under the mod_cgi handler. For example, the C<$ENV{QUERY_STRING}> environment variable is initialized with the contents of I, and the value returned by I is put into C<$ENV{SERVER_NAME}>.

But C<%ENV> population is expensive. Those who have moved to the Perl Apache API no longer need this extra C<%ENV> population, and can gain by turning it C. Scripts using the C module require C because that module relies on a properly populated CGI environment table.

By default it is turned C.

Note that you can still set environment variables when C is turned C. For example when you use the following configuration:

  PerlSetupEnv Off
  PerlModule Apache::RegistryNG
  PerlSetEnv TEST hi
  SetHandler perl-script
  PerlHandler Apache::RegistryNG
  Options +ExecCGI

and you issue a request for this script:

  setupenvoff.pl
  --------------
  use Data::Dumper;
  my $r = Apache->request();
  $r->send_http_header('text/plain');
  print Dumper(\%ENV);

you should see something like this:

  $VAR1 = {
            'GATEWAY_INTERFACE' => 'CGI-Perl/1.1',
            'MOD_PERL' => 'mod_perl/1.25',
            'PATH' => '/usr/lib/perl5/5.00503:... snipped ...',
            'TEST' => 'hi'
          };

Note that we got the value of the I environment variable we set in I.

=head2 Reducing the Number of stat() Calls Made by Apache

If you watch the system calls that your server makes (using I or I) while processing a request, you will notice that a few stat() calls are made. For example when I fetch http://localhost/perl-status and I have my DocRoot set to I I see:

  [snip]
  stat("/home/httpd/docs/perl-status", 0xbffff8cc) = -1 ENOENT (No such file or directory)
  stat("/home/httpd/docs", {st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0
  [snip]

If you have some dynamic content and your virtual relative URI is something like I (i.e., there is no such directory on the web server, the path components are only used for requesting a specific report), this will generate five(!) stat() calls, before the C is found. You will see something like this:

  stat("/home/httpd/docs/news/perl/mod_perl/summary", 0xbffff744) = -1 ENOENT (No such file or directory)
  stat("/home/httpd/docs/news/perl/mod_perl", 0xbffff744) = -1 ENOENT (No such file or directory)
  stat("/home/httpd/docs/news/perl", 0xbffff744) = -1 ENOENT (No such file or directory)
  stat("/home/httpd/docs/news", 0xbffff744) = -1 ENOENT (No such file or directory)
  stat("/home/httpd/docs", {st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0

How expensive are those calls? Let's use the C module to find out.

  stat_call_sample.pl
  -------------------
  use Time::HiRes qw(gettimeofday tv_interval);
  my $calls = 1_000_000;

  my $start_time = [ gettimeofday ];

  stat "/foo" for 1..$calls;

  my $end_time = [ gettimeofday ];

  my $elapsed = tv_interval($start_time,$end_time) / $calls;

  print "The average execution time: $elapsed seconds\n";

This script takes a time sample at the beginning, then does 1_000_000 C calls to a non-existing file, samples the time at the end and prints the average time it took to make a single C call. I'm sampling 1M stats so that the average is reliable.
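As an aside, the number of stat() calls in the traces above follows directly from the depth of the virtual path: one call per path component, plus one for the document root. The sketch below only models what the traces show; stat_candidates() is a hypothetical helper and not Apache's real implementation, and it assumes that no component of the URI exists below the document root.

```perl
use strict;
use warnings;

# Hypothetical model of the traces above, not Apache's actual code:
# list the paths that get stat()ed when nothing below the document
# root matches the requested URI.
sub stat_candidates {
    my ($docroot, $uri) = @_;

    # Drop the empty leading element produced by the leading slash.
    my @parts = grep { length } split m{/}, $uri;

    my @candidates;
    while (@parts) {
        push @candidates, join '/', $docroot, @parts;
        pop @parts;    # retry with the last component removed
    }
    push @candidates, $docroot;    # the first path that really exists
    return @candidates;
}

# Prints the five paths seen in the second trace, longest first.
print "$_\n"
    for stat_candidates('/home/httpd/docs', '/news/perl/mod_perl/summary');
```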
Before we actually run the script we should distinguish between two different situations. When the server is idle the time between the first and the last system call will be much shorter than the same time measured on a loaded system. That is because on an idle system a process gets the CPU very often, while on a loaded system lots of processes compete for it and each process has to wait longer to get the same amount of CPU time.

So first we run the above code on the unloaded system:

  % perl stat_call_sample.pl
  The average execution time: 4.209645e-06 seconds

So it takes about 4 microseconds to execute a stat() call. Now let's start a CPU intensive process in one console. The following code keeps the CPU busy all the time.

  % perl -e '1**1 while 1'

And now run the I script in the other console.

  % perl stat_call_sample.pl
  The average execution time: 8.777301e-06 seconds

You can see that the average time has doubled (about 8 microseconds). This is to be expected, since there were two processes competing for the CPU. Now if we run 4 occurrences of the above code:

  % perl -e '1**1 while 1' &
  % perl -e '1**1 while 1' &
  % perl -e '1**1 while 1' &
  % perl -e '1**1 while 1' &

and run our script in parallel with these processes, we get:

  % perl stat_call_sample.pl
  2.0853558e-05 seconds

about 20 microseconds. So the average stat() system call takes 5 times longer now. Now if you have 50 mod_perl processes that keep the CPU busy all the time, the stat() call will be 50 times slower and it'll take 0.2 milliseconds to complete a single call. If you have five redundant calls as in the strace example above, they add up to one millisecond. If you have more processes constantly consuming CPU, this time adds up. Now multiply this time by the number of processes that you have and you get a few seconds lost. As usual, for some services this loss is insignificant, while for others it is very significant.

So why does Apache make all these redundant C calls?
You can blame the default installed C for this inefficiency. Of course you could supply your own, which will be smart enough not to look for this virtual path and immediately return C. But in cases where you have a virtual host that serves only dynamically generated documents, you can override the default C with this one:

  PerlModule Apache::Constants
  ...
  PerlTransHandler Apache::Constants::OK
  ...

As you see it affects only this specific virtual host.

This has the effect of short circuiting the normal C processing of trying to find a filesystem component that matches the given URI -- no more 'stat's!

Watching your server under strace/truss can often reveal more performance hits than trying to optimize the code itself!

For example unless configured correctly, Apache might look for the I<.htaccess> file in many places, even if you don't have one, and so add many open() calls.

Let's start with this simple configuration, and try to reduce the number of irrelevant system calls.

  DocumentRoot "/home/httpd/docs"
  SetHandler perl-script
  PerlHandler Apache::Foo

The above configuration allows us to make a request to I and the Perl handler() defined in C will be executed. Notice that in the test setup there is no file to be executed (like in C). There is no I<.htaccess> file either.

This is a typical generated trace.

  stat("/home/httpd/docs/foo/test", 0xbffff8fc) = -1 ENOENT (No such file or directory)
  stat("/home/httpd/docs/foo", 0xbffff8fc) = -1 ENOENT (No such file or directory)
  stat("/home/httpd/docs", {st_mode=S_IFDIR|0755, st_s