multidim is a set of command-line tools to extract, manipulate, and present multidimensional data. These tools are targeted specifically at summarizing the results of simulation experiments. multidim is written in Perl and may be downloaded from SourceForge.
A lot of scientific and engineering work involves generating experimental data. Computer architects in particular often run simulation experiments to test out their ideas. Multiple experiments are often needed to gauge the effect of various parameters on the simulation outcome.
For example, a new prefetching mechanism may be evaluated in multiple configurations, varying cache size from 1MB to 4MB and varying memory latency from 200 to 400 cycles. Both the original architecture and the one with the new mechanism need to be simulated using multiple benchmarks. As the experiments complete, they typically produce statistics files which summarize simulation outcomes. In this case, the number of cache misses and the execution time of the benchmark could be of interest.
Each of these experiments represents a point in a six-dimensional space, with the following dimensions: cache size, memory latency, architecture, benchmark, statistic, and value.
The goal of multidim is to help the architect visualize this data. To this end, multidim contains several tools that can be cascaded using pipes. These tools are extract, normalize, and tabulate. More tools, such as addmean, are hopefully yet to come.
Extracting the data is the first step. Simulation results are often stored in a directory hierarchy and the names of the directories and files include the parameter values of the corresponding experiments. The simulation results from the example above could be stored as follows:
    exp_baseline_1MB_200/gcc/stats
    exp_baseline_1MB_200/gzip/stats
    exp_baseline_1MB_200/raytrace/stats
    exp_baseline_1MB_300/gcc/stats
    ...
    exp_baseline_4MB_400/raytrace/stats
    exp_newpref_1MB_200/gcc/stats
    ...
    exp_newpref_4MB_400/raytrace/stats

The example/data directory of the multidim distribution contains this sample data.
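The naming scheme above can be sketched in a few lines of Python. This is only an illustration of how the directory names encode the parameter values, not part of multidim itself (which is written in Perl); the benchmark names are assumed to be the three that appear in the sample data, and `enumerate_paths` is a hypothetical helper.

```python
from itertools import product
from string import Template

# Parameter values from the example. The benchmark names are an assumption
# based on the sample data (gcc, gzip, raytrace).
dims = {
    "architecture": ["baseline", "newpref"],
    "cache_size": ["1MB", "2MB", "4MB"],
    "mem_latency": ["200", "300", "400"],
    "benchmark": ["gcc", "gzip", "raytrace"],
}

path_template = Template(
    "exp_${architecture}_${cache_size}_${mem_latency}/${benchmark}/stats"
)


def enumerate_paths(dims, template):
    """Return one stats-file path per combination of parameter values."""
    names = list(dims)
    return [
        template.substitute(dict(zip(names, combo)))
        for combo in product(*(dims[name] for name in names))
    ]


paths = enumerate_paths(dims, path_template)
print(len(paths))  # 2 * 3 * 3 * 3 = 54 stats files
print(paths[0])    # exp_baseline_1MB_200/gcc/stats
```

Each stats file then contributes one value per statistic, so two statistics give 108 data points in total.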
The extract command allows the architect to collect the multidimensional data from these disjoint files:
    ./extract -dim 'cache_size=1MB 2MB 4MB' \
        -dim 'mem_latency=200 300 400' \
        -dim 'architecture=baseline newpref' \
        -dim 'benchmark < example/benchmarks.list' \
        -dim 'stat=cache_misses execution_time' \
        -dim value \
        -source 'example/data/exp_${architecture}_${cache_size}_${mem_latency}/${benchmark}/stats :${stat}:\s*${value}'
The output is many lines of data, with the first line listing the dimensions and each remaining line representing a single point in the six-dimensional space:
    cache_size mem_latency architecture benchmark stat value
    1MB 200 baseline gcc cache_misses 1118033
    2MB 200 baseline gcc cache_misses 745355
    4MB 200 baseline gcc cache_misses 512989
    1MB 300 baseline gcc cache_misses 1195228
    2MB 300 baseline gcc cache_misses 766964
    ...
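To make the layout concrete, here is a minimal Python sketch that reads such output into a list of records. It assumes fields are separated by whitespace and contain no embedded spaces; `parse_points` is a hypothetical helper and not part of multidim.

```python
def parse_points(text):
    """Parse header-plus-rows text into a list of dicts keyed by dimension."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


sample = """\
cache_size mem_latency architecture benchmark stat value
1MB 200 baseline gcc cache_misses 1118033
2MB 200 baseline gcc cache_misses 745355
4MB 200 baseline gcc cache_misses 512989
"""

points = parse_points(sample)
print(len(points))         # 3
print(points[0]["value"])  # 1118033
```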
This is a simple format for multidimensional data; its main advantage is that gnuplot understands it. We don't, however, need graphics to summarize the data; a table could do the job. This is what the tabulate command is for.
./tabulate value -y architecture cache_size mem_latency -x benchmark stat
The output is a table summarizing the results of the experiments:
    benchmark                           : gcc                         gzip                        raytrace
    stat                                : cache_misses execution_time cache_misses execution_time cache_misses execution_time
    architecture cache_size mem_latency
    baseline     1MB        200         :      1118033      415737733     33541019     4707429984     44721359     6091880124
    baseline     1MB        300         :      1195228      505712266     35856858     7406665717     47809144     9690861058
    baseline     1MB        400         :      1290994      610931733     38729833    10563249584     51639777    13899639458
    baseline     2MB        200         :       745355      366047333     22360679     3216717984     29814239     4104264124
    baseline     2MB        300         :       766964      420059466     23008949     4837083917     30678599     6264752058
    baseline     2MB        400         :       790569      477485066     23717082     6559849317     31622776     8561772524
    baseline     4MB        200         :       512989      335065200     15389675     2287250784     20519567     2864974524
    baseline     4MB        300         :       519875      370641666     15596257     3354545517     20795009     4288034058
    baseline     4MB        400         :       527046      407212266     15811388     4451664250     21081851     5750859191
    newpref      1MB        200         :       659380      354584000     18973665     2765116117     20254787     2829670524
    newpref      1MB        300         :       674199      401506466     19364916     4108277317     20519567     4232945658
    newpref      1MB        400         :       690065      450684000     19781414     5510337850     20795009     5674367991
    newpref      2MB        200         :       550481      340064133     16035674     2373383984     18070158     2538386658
    newpref      2MB        300         :       559016      378469866     16269784     3489250917     18257418     3780515858
    newpref      2MB        400         :       567961      418122933     16514456     4639149050     18450624     5049198658
    newpref      4MB        200         :       434372      324582933     12792042     1940899717     15227739     2159397458
    newpref      4MB        300         :       438529      354372466     12909944     2817282917     15339299     3196892058
    newpref      4MB        400         :       442807      384748533     13031167     3710271984     15453348     4249925058
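Conceptually, a table like this is a pivot: some dimensions become row labels, others become column labels, and the remaining dimension fills the cells. The following Python sketch shows that grouping on three data points taken from the table. It is an illustration only, not how tabulate is implemented, and `pivot` is a hypothetical helper.

```python
def pivot(points, value, y_dims, x_dims):
    """Group points into {row_key: {column_key: value}}, keyed by tuples."""
    table = {}
    for point in points:
        row = tuple(point[d] for d in y_dims)
        col = tuple(point[d] for d in x_dims)
        table.setdefault(row, {})[col] = point[value]
    return table


# Three data points copied from the table above.
points = [
    {"cache_size": "1MB", "mem_latency": "200", "architecture": "baseline",
     "benchmark": "gcc", "stat": "cache_misses", "value": "1118033"},
    {"cache_size": "1MB", "mem_latency": "200", "architecture": "baseline",
     "benchmark": "gcc", "stat": "execution_time", "value": "415737733"},
    {"cache_size": "1MB", "mem_latency": "200", "architecture": "newpref",
     "benchmark": "gcc", "stat": "cache_misses", "value": "659380"},
]

table = pivot(
    points,
    "value",
    y_dims=["architecture", "cache_size", "mem_latency"],
    x_dims=["benchmark", "stat"],
)
print(table[("baseline", "1MB", "200")][("gcc", "execution_time")])  # 415737733
print(len(table))  # 2 rows: baseline and newpref at 1MB / 200 cycles
```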
While this table contains all of the experimental results the architect needs, it's hard to analyze the data from these raw numbers. Thus, means to manipulate the data into a more accessible form are needed. So far, multidim contains a tool for only one kind of manipulation: normalization.
We can insert the following normalize command between the extract and the tabulate commands to normalize the data to the baseline:
./normalize value architecture=baseline
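The idea behind this kind of normalization can be sketched in Python: every value is divided by the value of the point that has the same coordinates in all other dimensions but lies at architecture=baseline. This is a minimal sketch under that assumption, not multidim's Perl implementation, and `normalize` here is a hypothetical function that merely shares the tool's name.

```python
def normalize(points, value, dim, ref):
    """Divide each point's value by the matching point where dim == ref."""
    def key(point):
        # Coordinates in every dimension except the value and the
        # dimension being normalized over.
        return tuple(sorted(
            (k, v) for k, v in point.items() if k not in (value, dim)
        ))

    reference = {
        key(p): float(p[value]) for p in points if p[dim] == ref
    }
    return [
        {**p, value: float(p[value]) / reference[key(p)]} for p in points
    ]


# Two data points from the example: gcc cache misses at 1MB / 200 cycles.
points = [
    {"cache_size": "1MB", "mem_latency": "200", "architecture": "baseline",
     "benchmark": "gcc", "stat": "cache_misses", "value": "1118033"},
    {"cache_size": "1MB", "mem_latency": "200", "architecture": "newpref",
     "benchmark": "gcc", "stat": "cache_misses", "value": "659380"},
]

normalized = normalize(points, "value", "architecture", "baseline")
for p in normalized:
    print(p["architecture"], "%.2f" % p["value"])
# baseline 1.00
# newpref 0.59
```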
We should also specify a formatting option to tabulate so that the output does not get too wide; here's the full pipeline command:
    ./extract -dim 'cache_size=1MB 2MB 4MB' \
        -dim 'mem_latency=200 300 400' \
        -dim 'architecture=baseline newpref' \
        -dim 'benchmark < example/benchmarks.list' \
        -dim 'stat=cache_misses execution_time' \
        -dim value \
        -source 'example/data/exp_${architecture}_${cache_size}_${mem_latency}/${benchmark}/stats :${stat}:\s*${value}' \
      | ./normalize value architecture=baseline \
      | ./tabulate value -y architecture cache_size mem_latency -x benchmark stat -format "%.2f"
And here's the output:
    benchmark                           : gcc                         gzip                        raytrace
    stat                                : cache_misses execution_time cache_misses execution_time cache_misses execution_time
    architecture cache_size mem_latency
    baseline     1MB        200         :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline     1MB        300         :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline     1MB        400         :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline     2MB        200         :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline     2MB        300         :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline     2MB        400         :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline     4MB        200         :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline     4MB        300         :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline     4MB        400         :         1.00           1.00         1.00           1.00         1.00           1.00
    newpref      1MB        200         :         0.59           0.85         0.57           0.59         0.45           0.46
    newpref      1MB        300         :         0.56           0.79         0.54           0.55         0.43           0.44
    newpref      1MB        400         :         0.53           0.74         0.51           0.52         0.40           0.41
    newpref      2MB        200         :         0.74           0.93         0.72           0.74         0.61           0.62
    newpref      2MB        300         :         0.73           0.90         0.71           0.72         0.60           0.60
    newpref      2MB        400         :         0.72           0.88         0.70           0.71         0.58           0.59
    newpref      4MB        200         :         0.85           0.97         0.83           0.85         0.74           0.75
    newpref      4MB        300         :         0.84           0.96         0.83           0.84         0.74           0.75
    newpref      4MB        400         :         0.84           0.94         0.82           0.83         0.73           0.74
So far, this is the limit of what multidim can do.