multidim

multidim is a set of command-line tools to extract, manipulate, and present multidimensional data. These tools are targeted specifically at summarizing the results of simulation experiments. multidim is written in Perl and may be downloaded from SourceForge.

Example

A lot of scientific and engineering work involves generating experimental data. Computer architects in particular often run simulation experiments to test out their ideas. Multiple experiments are often needed to gauge the effect of various parameters on the simulation outcome.

For example, a new prefetching mechanism may be evaluated in multiple configurations, varying cache size from 1MB to 4MB and varying memory latency from 200 to 400 cycles. Both the original architecture and the one with the new mechanism need to be simulated using multiple benchmarks. As the experiments complete, they typically produce statistics files which summarize simulation outcomes. In this case, the number of cache misses and the execution time of the benchmark could be of interest.
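
To get a feel for how quickly such an experiment set grows, the combinations can be enumerated in a few lines. The sketch below is only an illustration and is not part of multidim; it assumes the parameter values used by the sample data later in this document (three cache sizes, three latencies, two architectures, three benchmarks, two statistics), and the variable names are hypothetical.

```python
from itertools import product

# Parameter values assumed from the sample data used later in this document.
cache_sizes = ["1MB", "2MB", "4MB"]
mem_latencies = [200, 300, 400]
architectures = ["baseline", "newpref"]
benchmarks = ["gcc", "gzip", "raytrace"]
stats = ["cache_misses", "execution_time"]

# One simulation run per (architecture, cache size, latency, benchmark).
runs = list(product(architectures, cache_sizes, mem_latencies, benchmarks))

# Every run reports each statistic, so each run yields len(stats) data points.
points = list(product(runs, stats))

print(len(runs))    # 54
print(len(points))  # 108
```

Even this small study produces 54 statistics files and 108 numbers to compare, which is why collecting and summarizing them by hand quickly becomes impractical.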

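Each measurement produced by these experiments is a point in a six-dimensional space whose dimensions are the cache size, the memory latency, the architecture, the benchmark, the statistic, and its value.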

The goal of multidim is to help the architect visualize this data. To this end, multidim contains several tools that can be cascaded using pipes. These tools are extract, normalize, and tabulate. More tools, such as addmean, are hopefully yet to come.

Tools

extract

Extracting the data is the first step. Simulation results are often stored in a directory hierarchy and the names of the directories and files include the parameter values of the corresponding experiments. The simulation results from the example above could be stored as follows:

exp_baseline_1MB_200/gcc/stats
exp_baseline_1MB_200/gzip/stats
exp_baseline_1MB_200/raytrace/stats
exp_baseline_1MB_300/gcc/stats
    ...
exp_baseline_4MB_400/raytrace/stats
exp_newpref_1MB_200/gcc/stats
    ...
exp_newpref_4MB_400/raytrace/stats

The example/data directory of the multidim distribution contains this sample data.

The extract command allows the architect to collect the multidimensional data from these disjoint files:

./extract -dim 'cache_size=1MB 2MB 4MB' -dim 'mem_latency=200 300 400' -dim 'architecture=baseline newpref' -dim 'benchmark < example/benchmarks.list' -dim 'stat=cache_misses execution_time' -dim value -source 'example/data/exp_${architecture}_${cache_size}_${mem_latency}/${benchmark}/stats :${stat}:\s*${value}'

The output is many lines of data: the first line lists the dimensions, and each remaining line represents a single point in the six-dimensional space:

cache_size      mem_latency    architecture    benchmark       stat    value
1MB     200     baseline        gcc     cache_misses    1118033
2MB     200     baseline        gcc     cache_misses    745355
4MB     200     baseline        gcc     cache_misses    512989
1MB     300     baseline        gcc     cache_misses    1195228
2MB     300     baseline        gcc     cache_misses    766964
...

This is a simple format for multidimensional data; its main advantage is that gnuplot understands it. We don't, however, need graphics to summarize the data; a table could do the job. This is what the tabulate command is for.
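
Because the format is plain whitespace-separated text with a header line, it is also easy to read from a script. The following is a minimal sketch, not part of multidim; the function name parse_points is hypothetical, and it assumes that no field contains embedded whitespace.

```python
def parse_points(text):
    """Parse whitespace-separated text with a header line into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    points = []
    for line in lines[1:]:
        fields = line.split()
        # Assumption: fields never contain spaces, so counts must match.
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} fields, got: {line!r}")
        points.append(dict(zip(header, fields)))
    return points


# The first lines of the extract output shown above.
sample = """\
cache_size      mem_latency    architecture    benchmark       stat    value
1MB     200     baseline        gcc     cache_misses    1118033
2MB     200     baseline        gcc     cache_misses    745355
4MB     200     baseline        gcc     cache_misses    512989
"""

points = parse_points(sample)
print(points[0]["value"])  # 1118033
```

All fields are kept as strings here; a real consumer would convert the value column to a number before doing arithmetic on it.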

tabulate

This command takes the data generated by extract and summarizes it in a table. Let's pipe the output of the above extract command into the following tabulate command:

./tabulate value -y architecture cache_size mem_latency -x benchmark stat

The output is a table summarizing the results of the experiments:

                          benchmark :                         gcc                        gzip                    raytrace
                               stat : cache_misses execution_time cache_misses execution_time cache_misses execution_time
architecture cache_size mem_latency                                                                                      
    baseline        1MB         200 :      1118033      415737733     33541019     4707429984     44721359     6091880124
    baseline        1MB         300 :      1195228      505712266     35856858     7406665717     47809144     9690861058
    baseline        1MB         400 :      1290994      610931733     38729833    10563249584     51639777    13899639458
    baseline        2MB         200 :       745355      366047333     22360679     3216717984     29814239     4104264124
    baseline        2MB         300 :       766964      420059466     23008949     4837083917     30678599     6264752058
    baseline        2MB         400 :       790569      477485066     23717082     6559849317     31622776     8561772524
    baseline        4MB         200 :       512989      335065200     15389675     2287250784     20519567     2864974524
    baseline        4MB         300 :       519875      370641666     15596257     3354545517     20795009     4288034058
    baseline        4MB         400 :       527046      407212266     15811388     4451664250     21081851     5750859191
     newpref        1MB         200 :       659380      354584000     18973665     2765116117     20254787     2829670524
     newpref        1MB         300 :       674199      401506466     19364916     4108277317     20519567     4232945658
     newpref        1MB         400 :       690065      450684000     19781414     5510337850     20795009     5674367991
     newpref        2MB         200 :       550481      340064133     16035674     2373383984     18070158     2538386658
     newpref        2MB         300 :       559016      378469866     16269784     3489250917     18257418     3780515858
     newpref        2MB         400 :       567961      418122933     16514456     4639149050     18450624     5049198658
     newpref        4MB         200 :       434372      324582933     12792042     1940899717     15227739     2159397458
     newpref        4MB         300 :       438529      354372466     12909944     2817282917     15339299     3196892058
     newpref        4MB         400 :       442807      384748533     13031167     3710271984     15453348     4249925058
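
Conceptually, the table above is a pivot: the dimensions given after -y identify a row, the dimensions given after -x identify a column, and the named dimension (value) fills the cell. The sketch below illustrates that idea in Python with four points taken from the table. It is not multidim's Perl implementation, and the function name pivot is hypothetical.

```python
from collections import defaultdict


def pivot(points, value, y_dims, x_dims):
    """Group points into {row_key: {column_key: value}}."""
    table = defaultdict(dict)
    for p in points:
        row = tuple(p[d] for d in y_dims)
        col = tuple(p[d] for d in x_dims)
        table[row][col] = p[value]
    return table


fields = ["architecture", "cache_size", "mem_latency", "benchmark", "stat", "value"]
rows = [
    ("baseline", "1MB", "200", "gcc", "cache_misses", 1118033),
    ("baseline", "1MB", "200", "gcc", "execution_time", 415737733),
    ("newpref", "1MB", "200", "gcc", "cache_misses", 659380),
    ("newpref", "1MB", "200", "gcc", "execution_time", 354584000),
]
points = [dict(zip(fields, r)) for r in rows]

table = pivot(
    points,
    "value",
    y_dims=["architecture", "cache_size", "mem_latency"],
    x_dims=["benchmark", "stat"],
)

# Print one line per row key, with cells in a fixed column order.
columns = sorted({col for cells in table.values() for col in cells})
for row, cells in table.items():
    print(" ".join(row), ":", " ".join(str(cells.get(c, "-")) for c in columns))
```

Moving a dimension from one list to the other changes only how the keys are built, which is why the same data can be laid out in many different ways.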

While this table contains all of the experimental results the architect needs, it's hard to analyze the data from these raw numbers. Thus, means to manipulate the data into a more accessible form are needed. So far, multidim contains a tool for only one kind of manipulation: normalization.

normalize

We can insert the following normalize command between the extract and the tabulate commands to normalize the data to the baseline:

./normalize value architecture=baseline

We should also specify a formatting option to tabulate so that the output does not get too wide; here's the full pipeline command:

./extract -dim 'cache_size=1MB 2MB 4MB' -dim 'mem_latency=200 300 400' -dim 'architecture=baseline newpref' -dim 'benchmark < example/benchmarks.list' -dim 'stat=cache_misses execution_time' -dim value -source 'example/data/exp_${architecture}_${cache_size}_${mem_latency}/${benchmark}/stats :${stat}:\s*${value}' | ./normalize value architecture=baseline | ./tabulate value -y architecture cache_size mem_latency -x benchmark stat -format "%.2f"

And here's the output:

                          benchmark :                         gcc                        gzip                    raytrace
                               stat : cache_misses execution_time cache_misses execution_time cache_misses execution_time
architecture cache_size mem_latency                                                                                      
    baseline        1MB         200 :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline        1MB         300 :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline        1MB         400 :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline        2MB         200 :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline        2MB         300 :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline        2MB         400 :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline        4MB         200 :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline        4MB         300 :         1.00           1.00         1.00           1.00         1.00           1.00
    baseline        4MB         400 :         1.00           1.00         1.00           1.00         1.00           1.00
     newpref        1MB         200 :         0.59           0.85         0.57           0.59         0.45           0.46
     newpref        1MB         300 :         0.56           0.79         0.54           0.55         0.43           0.44
     newpref        1MB         400 :         0.53           0.74         0.51           0.52         0.40           0.41
     newpref        2MB         200 :         0.74           0.93         0.72           0.74         0.61           0.62
     newpref        2MB         300 :         0.73           0.90         0.71           0.72         0.60           0.60
     newpref        2MB         400 :         0.72           0.88         0.70           0.71         0.58           0.59
     newpref        4MB         200 :         0.85           0.97         0.83           0.85         0.74           0.75
     newpref        4MB         300 :         0.84           0.96         0.83           0.84         0.74           0.75
     newpref        4MB         400 :         0.84           0.94         0.82           0.83         0.73           0.74
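
The ratios in this table can be checked by hand. For gcc with a 1MB cache and a latency of 200 cycles, newpref has 659380 cache misses against 1118033 for the baseline, and 659380 / 1118033 is about 0.59. The sketch below reproduces that calculation in Python. It is an illustration of the idea only, not multidim's implementation; the function name and its arguments are hypothetical.

```python
def normalize(points, value, dim, reference):
    """Divide each value by the value of the point that matches in every
    other dimension but has dim == reference."""

    def key(p):
        # Coordinates of the point, ignoring the value and the normalized dimension.
        return tuple(sorted((k, v) for k, v in p.items() if k not in (value, dim)))

    ref = {key(p): p[value] for p in points if p[dim] == reference}
    result = []
    for p in points:
        q = dict(p)
        q[value] = p[value] / ref[key(p)]
        result.append(q)
    return result


fields = ["architecture", "cache_size", "mem_latency", "benchmark", "stat", "value"]
rows = [
    ("baseline", "1MB", "200", "gcc", "cache_misses", 1118033),
    ("baseline", "1MB", "200", "gcc", "execution_time", 415737733),
    ("newpref", "1MB", "200", "gcc", "cache_misses", 659380),
    ("newpref", "1MB", "200", "gcc", "execution_time", 354584000),
]
points = [dict(zip(fields, r)) for r in rows]

normalized = normalize(points, "value", "architecture", "baseline")
for p in normalized:
    print(p["architecture"], p["stat"], "%.2f" % p["value"])
```

The two newpref lines print 0.59 and 0.85, matching the first newpref row of the gcc columns above.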

So far, this is the limit of what multidim can do.