Compilercache

Download here: compilercache-1.0.10.tar.gz
compilercache now also has a SourceForge entry: http://sourceforge.net/projects/compilercache/
There is a FAQ available: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/~checkout~/compilercache/faq/FAQ
Here is the README file included in the distribution:
compilercache Version 1.0.10
----------------------------


Table of Contents
-----------------

1 .... What is compilercache and Why do I want it ?
2 .... Is it dangerous to use compilercache ?
3 .... How do I install compilercache ?
4 .... How do I configure compilercache ?
5 .... Show me an example of compilercache usage !
6 .... How do I clean all those Megabytes of Cache files ?
7 .... Technical insights and problems
8 .... Some performance statistics !


1.  What is compilercache and Why do I want it ?
------------------------------------------------

Compilercache is under the GNU General Public License (GPL).
Compilercache is a wrapperscript around your C and C++ compilers. Each
time you compile something, the wrapperscript puts the result of the
compilation into a cache. And once you compile the same thing again,
the result will be picked from the cache instead of being recompiled.

You might wonder why you need this, since there seems to be another
tool for this purpose, "make". But to get "make" working you need to
create a Makefile. You need to take care of your dependencies
manually. If you make a mistake, wrong code will be generated.

Another drawback with "make" is that if you normally compile your
project with -O2 (optimizations) and now want to debug it, you have
to recompile everything with -g (debugging). With "make" you have to
do a "make clean", change the options, and then recompile everything.
With compilercache you basically do the same, but if your project has
already been compiled with -g in the past, and you are currently
building with -O2 and want to switch back to -g, the old compilation
results will be picked from the cache. In other words, switching
compiler options is fast, unlike plain "make", which forces you into a
complete recompilation.

Since compilercache is just a wrapper around your compiler, you can
still use "make" if you want. compilercache does no harm. All it does
is sometimes speed up the compilation run by getting the result out of
the cache.

There is another interesting advantage: suppose you download
foo-1.0.0.tar.gz from the net, unpack, ./configure, make. Now the
authors release foo-1.0.1.tar.gz. What do you do? Well, first you
delete your old foo-1.0.0, then you unpack foo-1.0.1, ./configure,
make. But with compilercache the compilation goes extremely fast,
because only the changed sources will be recompiled! Now ain't that an
advantage? Just think of Linux kernel recompilation! You can now
always do a "make mrproper" and be sure there will be no dependency
problems since you always recompile from scratch. compilercache will
take care of speeding things up ;)

I got a report from a compilercache user that shows another usage:
> I admit my first thought was "huh, makes no sense!" -- but then, I
> discovered really useful application for it: building RPM packages.
> RPM has an annoying misfeature -- if something goes wrong during
> compilation, it allows you to continue after fixing .spec file with
> --short-circuit option, but it does *not* produce .rpm file in such
> case. In other words, you can debug the .spec file but at the end you
> have to recompile everything from scratch anyway. And if you're
> unlucky enough to compile Mozilla... ;-) That's when compilercache
> comes handy.

And finally as another advantage, if you just fix small typos in your
comments, compilercache will not recompile, even though the sources
have "changed" !


2. Is it dangerous to use compilercache ?
-----------------------------------------

First let's define dangerous. Dangerous means that compilercache
returns a different result than the normal compiler would have
returned. For example, you write a foo.c program, compile it with
compilercache and get a foo.o file that is not equal to what you would
have got if you had called the original compiler directly.

I am quite sure that it is not dangerous to use compilercache. If
someone finds a situation where it is dangerous, then please mail me
the exact circumstances so that I can fix compilercache.

But what about dependencies? Well, if foo.c includes bar.h, and you
change bar.h, and then recompile foo.c with compilercache, it will
recompile foo.c, because compilercache uses the complete preprocessor
output to decide whether it has already compiled this source. The
preprocessor resolves all the #include directives and generates
one large source file. This source file is taken into account when
deciding if recompilation needs to take place. So there is no problem
with dependencies.

But what if I change compiler options like -D_REENTRANT_ or -O2 or -g ?
That's actually an easy point. compilercache also takes the commandline
options into account for the cache entry. If you use other commandline
options, recompilation will take place.


3. How do I install compilercache ?
-----------------------------------

You need the "md5sum" program. If you don't have it, install
GNU textutils from your favourite GNU mirror.

Type this in a bash shell:

DUMMY="compilercache-1.0.10"
CURDIR="$(pwd)"
tar xzvf "$DUMMY".tar.gz
cd "$DUMMY"/src
./compile.sh
cd "$HOME"
echo COMPILERCACHEBINDIR="$CURDIR"/"$DUMMY"/bin > .compilercacherc

The following line must be put into your login scripts or typed in
manually each time you want to use compilercache (it assumes the
tarball was unpacked in $HOME; adjust the path otherwise):

export PATH="$HOME"/compilercache-1.0.10/bin:"$PATH"

By setting the PATH as shown above (compilercache comes before your
other PATH entries, so that it is preferred over the normal
compiler!) you have activated compilercache. Now just continue your
work as usual and watch the speedup.
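
If you are unsure whether the wrapper is really the one being picked
up, you can ask the shell what "gcc" resolves to. The following is only
a sketch of the PATH precedence: a scratch directory with a stub "gcc"
stands in for the compilercache bin directory. The names demo_bin and
resolved_gcc are made up for this illustration and are not part of
compilercache.

```shell
# Illustration only: a stub "gcc" in a scratch directory plays the role of
# the compilercache wrapper. Nothing here touches a real installation.
demo_bin="$(mktemp -d)"
printf '#!/bin/sh\necho "wrapper called"\n' > "$demo_bin/gcc"
chmod +x "$demo_bin/gcc"

# Same idea as the export line above: the wrapper directory goes FIRST.
PATH="$demo_bin:$PATH"
hash -r                      # forget previously remembered command locations

resolved_gcc="$(command -v gcc)"
echo "gcc resolves to: $resolved_gcc"
```

With a real installation, "command -v gcc" should print a path inside
the compilercache bin directory once the PATH is set correctly.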


4. How do I configure compilercache ?
-------------------------------------

If you set the NOCOMPILERCACHE environment variable, the
cache is always bypassed. This is easier than changing the
PATH each time.
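
For a one-off bypass the variable can be set for a single command
only. The sketch below also shows one way a wrapper could test for the
variable. The helper name cache_is_bypassed is hypothetical, and the
assumption that any value (even an empty one) counts as "set" is mine,
not something taken from the compilercache script.

```shell
# Hypothetical helper: succeeds when NOCOMPILERCACHE is set at all.
# Assumption: "set" means "defined", regardless of the value.
cache_is_bypassed() {
    [ -n "${NOCOMPILERCACHE+x}" ]
}

# Bypass for one command only; the variable is gone again afterwards:
#   NOCOMPILERCACHE=1 make
# Bypass for the rest of the shell session:
#   export NOCOMPILERCACHE=1
# Switch the cache back on:
#   unset NOCOMPILERCACHE
```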

The script first initializes its configuration variables to default
values. Then /etc/compilercacherc is sourced, and afterwards
$HOME/.compilercacherc. Both are only sourced if they actually exist.
Remember that compilercache is a bash script, so you can put any kind
of bash commands in the configuration files.
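
The load order described above can be sketched roughly like this. The
function name load_config is hypothetical and only two of the options
are shown; the real script may well be organized differently.

```shell
# Sketch of the load order: defaults first, then each existing rc file.
# load_config is a made-up name, not a function of the real script.
load_config() {
    # 1. built-in defaults (values as listed in the option documentation)
    CACHEDIR="$HOME/.compilercache/cache"
    SHALLDEBUG="no"

    # 2. system-wide file, 3. per-user file; a later file overrides
    #    whatever an earlier one has set.
    for rc in "$@"; do
        if [ -f "$rc" ]; then
            . "$rc"
        fi
    done
}

load_config /etc/compilercacherc "$HOME/.compilercacherc"
echo "cache directory in effect: $CACHEDIR"
```

Because the files are sourced, a per-user file only needs to contain
the variables it wants to change; everything else keeps the systemwide
or default value.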

It is perfectly ok to share the cache directory over NFS between
multiple users, provided they trust each other (see the CACHEDIR
notes below). It is also ok to install compilercache systemwide with
an /etc/compilercacherc. If users don't like the options, they can
override them. For a systemwide installation, create a
/usr/bin/compilercache subdirectory and move the contents of the bin/
subdirectory of compilercache in there. Then adjust the systemwide
login scripts of the users for correct PATH settings, i.e. put
/usr/bin/compilercache before /usr/bin. Finally, create
/etc/compilercacherc from the template below.

The documentation of all options follows as a configuration file.
You may just cut and paste this if you want to.


# CACHEDIR is the directory where the cachefiles will be stored.
# (default is "$HOME/.compilercache/cache")
# You can always erase those files if you want to.
# They will be rebuilt as needed.
# You can also create a global compilercache directory in your company,
# possibly on NFS. This makes compilation results of one user available
# to all the other users!
# Beware! This is an enormous security risk! If multiple users share
# the same cache directory, they MUST trust each other, because one
# user can manipulate the compilation results of other users!
# So it's probably better to leave every user with a private cache
# directory!
CACHEDIR="$HOME/.compilercache/cache"

# TEMPDIR is the directory where temporary files will be placed.
# (default is "$HOME/.compilercache/temp")
# You can erase those files when no compilercache instance is
# currently running. Make sure that only trusted users have access to
# this directory, as they could hand you wrong .o files by playing
# tricks with symbolic links. So it's probably better to leave every
# user with a private temporary directory!
TEMPDIR="$HOME/.compilercache/temp"

# SHALLDEBUG can be "yes" or "no"
# (default is "no")
# here you can choose if you want to see the debug messages
# of the cache. if you say "no" you won't even notice the presence
# of the cache. (apart from the speedup)
# be careful with "yes", because a normal compiler won't write debug
# output, so saying "yes" here might break some scripts like
# ./configure scripts which expect certain output behaviour
SHALLDEBUG="no"

# LINKOUTPUT can be "yes" or "no"
# (default is "no")
# If you say "yes" here, the resulting output files will be symbolic
# links into the CACHEDIR.
# If you say "no" here, the resulting output files will be copied
# from the CACHEDIR.
# Saying "yes" can be an enormous speedup (especially on NFS), and
# it also saves disk space, but you must take care of the following
# things !!!
# Please leave LINKOUTPUT=no unless you are ABSOLUTELY sure that
# the issues presented here do not affect you.
# (That's why the default for LINKOUTPUT is "no".)
# - If your sources are on NFS, but your CACHEDIR is local,
#   then for other hosts the object file links are all invalid.
# - If somebody erases files in the CACHEDIR, links will become
#   invalid! (But recompilation will rebuild them correctly.)
# - A normal compiler creates output files, not symbolic links,
#   so the cache does not behave like a normal compiler.
#   This should be no problem though.
# - Getting a file from the cache will update the modification time of
#   all .o files that link to this cache object! This causes useless
#   linker runs because make thinks that the .o files have changed.
#   It does no harm, it is just wasted work.
LINKOUTPUT="no"

# PATH is a colon separated list of directories
# (default is "/bin:/usr/bin")
# It has the usual meaning. The script will look here for all
# kinds of programs, but NOT for the compiler.
PATH="/bin:/usr/bin"

# COMPILERPATH is a colon separated list of directories
# (default is "/bin:/usr/bin")
# compilercache will search for the compiler ONLY in the
# COMPILERPATH directories and NOT in the PATH directories !
COMPILERPATH="/bin:/usr/bin"

# COMPILERNAMES is a space separated list of compiler binary names
# (default is "c++ g++ cc gcc")
# compilercache will only work together with the listed compilers.
# For example, to also handle c++-3.0, add it to this list and install
# a link named c++-3.0 that points to compilercache in the
# COMPILERCACHEBINDIR directory; compilercache will then also run as
# c++-3.0.
COMPILERNAMES="c++ g++ cc gcc"

# COMPILERCACHEBINDIR is the /bin subdirectory of your
# compilercache installation
# (default is "/usr/bin/compilercache")
# you NEED to set this appropriately otherwise the compilercache
# unifier won't be found
COMPILERCACHEBINDIR="/usr/bin/compilercache"
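
The COMPILERNAMES notes in the template mention adding a further
compiler name via a link. A sketch of that step, assuming the wrapper
inside COMPILERCACHEBINDIR is a file called "compilercache" (as the
template suggests) and using c++-3.0 as in the example. The helper name
add_compiler_name is hypothetical.

```shell
# Hypothetical helper: make the wrapper answer to one more compiler name.
# $1 = COMPILERCACHEBINDIR, $2 = additional compiler name (e.g. c++-3.0)
add_compiler_name() {
    # relative link, so it stays valid if the directory is moved
    ln -s compilercache "$1/$2"
}

# Example (commented out; adjust the directory to your installation):
#   add_compiler_name /usr/bin/compilercache c++-3.0
# and, presumably, extend the list in your compilercacherc as well:
#   COMPILERNAMES="c++ g++ cc gcc c++-3.0"
```

The real c++-3.0 binary still has to be found in one of the
COMPILERPATH directories.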


5. Show me an example of compilercache usage !
----------------------------------------------

~/compiler > ls
a.c  a.h  clean.sh  compile.sh  main.c

~/compiler > cat compile.sh
#!/bin/bash
set -e
set -v
gcc -c a.c -o a.o
gcc -c main.c -o main.o
gcc -o foo main.o a.o

~/compiler > ./compile.sh
gcc -c a.c -o a.o
(compiling into cache)
gcc -c main.c -o main.o
(compiling into cache)
gcc -o foo main.o a.o
(cannot understand this command, cache bypass)

~/compiler > ls
a.c  a.h  a.o  clean.sh  compile.sh  foo  main.c  main.o

~/compiler > rm -f *.o foo

~/compiler > ls
a.c  a.h  clean.sh  compile.sh  main.c

~/compiler > ./compile.sh
gcc -c a.c -o a.o
(getting result from cache)
gcc -c main.c -o main.o
(getting result from cache)
gcc -o foo main.o a.o
(cannot understand this command, cache bypass)

~/compiler > ls
a.c  a.h  a.o  clean.sh  compile.sh  foo  main.c  main.o


6. How do I clean all those Megabytes of Cache files ?
------------------------------------------------------

If you want to, you can just erase them all.
If you want to keep all entries used within the last 10 days, run
find /tmp/mycachedir -type f -mtime +10 -print0 | xargs -0 rm -f

You could also do this in a daily cron job. Please make
sure there are no compilers running while performing this task.
Otherwise there is a small chance that they might fail with an
internal error, but this does no harm; you can just re-run them
afterwards.
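
For a cron job it can be handy to wrap the cleanup in a small
function. This is just a sketch; the name prune_cache and its two
parameters are my own choice and not part of compilercache.

```shell
# Hypothetical cleanup helper.
# $1 = cache directory, $2 = age limit in days.
# Removes regular files whose modification time is older than the limit;
# directories are left alone.
prune_cache() {
    find "$1" -type f -mtime +"$2" -exec rm -f {} +
}

# Example call (commented out):
#   prune_cache "$HOME/.compilercache/cache" 10
```

Restricting the search to regular files keeps find from handing the
cache directory itself to rm.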


7. Technical insights and problems
----------------------------------

First compilercache checks what kind of action it shall perform. Only
if the compiler is called to actually compile a single C sourcefile
does the script continue its work. Otherwise everything is bypassed and
the normal compiler is called instead. This happens for instance if you
call the compiler as a linker.
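
A much simplified sketch of such a check, fed with the two kinds of
calls from the usage example in section 5. This is NOT the logic of the
real script: the function name is_cacheable_call and the list of file
extensions are assumptions made for illustration.

```shell
# Hypothetical check: treat the call as cacheable only if -c is present
# and exactly one C/C++ source file is named on the commandline.
is_cacheable_call() {
    has_c=no
    nsources=0
    for arg in "$@"; do
        case "$arg" in
            -c) has_c=yes ;;
            *.c|*.cc|*.cpp|*.cxx|*.C) nsources=$((nsources + 1)) ;;
        esac
    done
    [ "$has_c" = yes ] && [ "$nsources" -eq 1 ]
}

# compile step: prints "cache"
if is_cacheable_call -c a.c -o a.o; then echo "cache"; else echo "bypass"; fi
# link step: prints "bypass"
if is_cacheable_call -o foo main.o a.o; then echo "cache"; else echo "bypass"; fi
```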

Ok. compilercache shall compile a single sourcefile. First it creates
two sets of commandline arguments from the original given commandline
arguments.

STRIPPEDARGS is the set of commandline arguments without -c and -o and
the filenames.

IDENTARGS is the set of commandline arguments that are necessary to
uniquely identify the corresponding output file, i.e. STRIPPEDARGS
without include paths, macro definitions and library paths. The rule
for the design of the IDENTARGS set is to include as many options as
needed to still produce the correct output files, and as few as
possible beyond that, so that as many cache hits as possible occur.
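
To make the two sets more tangible, here is a rough sketch of how they
could be derived. The function names stripped_args and ident_args are
hypothetical, the option handling is far from complete (for example,
only the attached forms -Idir, -Dname and -Ldir are recognized), and
the real script may do this quite differently.

```shell
# Hypothetical sketch. Prints STRIPPEDARGS, one argument per line:
# everything except -c, -o <file> and the source file names.
stripped_args() {
    skip_next=no
    for arg in "$@"; do
        if [ "$skip_next" = yes ]; then skip_next=no; continue; fi
        case "$arg" in
            -c) ;;                              # dropped
            -o) skip_next=yes ;;                # drop -o and its file name
            -o*) ;;                             # attached form: -ofile
            *.c|*.cc|*.cpp|*.cxx|*.C) ;;        # source file names
            *) printf '%s\n' "$arg" ;;
        esac
    done
}

# Prints IDENTARGS: STRIPPEDARGS minus include paths (-I), macro
# definitions (-D) and library paths (-L).
ident_args() {
    stripped_args "$@" | grep -v -e '^-I' -e '^-D' -e '^-L' || true
}

stripped_args -O2 -Iinclude -DNDEBUG -c foo.c -o foo.o
# prints -O2, -Iinclude and -DNDEBUG on separate lines
ident_args -O2 -Iinclude -DNDEBUG -c foo.c -o foo.o
# prints -O2
```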

You can see the values of these variables by uncommenting the various
debugging blocks inside the compilercache script.

Now the preprocessor is called. This happens with the -E option, the
STRIPPEDARGS and the sourcefilename.

Now the output of the preprocessor, the version of the compiler, the
basename of the sourcefile and the IDENTARGS are put into a file. Then
the md5sum of this file is computed. This md5sum is the filename of
the cache entry. compilercache now checks if such a file is
already inside the cache directory. If yes, then this file is taken as
the output of the compiler run (i.e. the .o file) and compilercache is
done. If not, the normal compiler is run, and if it produced no
warnings and no errors, the result is put into the cache as well as
into the output file, and compilercache has finished its work.
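
The key derivation can be sketched as follows. The function name
cache_key, its parameters and the order in which the four ingredients
are written are assumptions for illustration; only the ingredients
themselves and the use of md5sum come from the description above.

```shell
# Hypothetical sketch of the cache key. The preprocessor output is read
# from stdin.
# $1 = compiler version string, $2 = source file name, $3 = IDENTARGS
cache_key() {
    {
        cat                                   # preprocessor output
        printf '%s\n' "$1"                    # compiler version
        printf '%s\n' "$(basename "$2")"      # basename of the sourcefile
        printf '%s\n' "$3"                    # IDENTARGS
    } | md5sum | cut -d ' ' -f 1
}

# Example with a made-up version string and a one-line "source":
echo 'int x;' | cache_key "some-compiler-version" src/foo.c "-O2"
```

Only the basename goes into the key, so the same file unpacked in a
different directory yields the same key, while a different option set
or different preprocessor output yields a different one.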

That's it. There are three design criteria in compilercache, shown in
descending order of priority:

1) compilercache may NEVER return an output file that is not
bitwise the same as the one the original compiler would have produced.

2) compilercache shall do exactly the same thing as the original
compiler, only the time consumed is sometimes much less.

3) compilercache shall use the files from the cache as often as
possible and not recompile needlessly.

It is absolutely top priority that 1) and 2) are ALWAYS met under ALL
circumstances.

To explain 3) further, consider a source file with some added newlines
at the end of the file. Of course the .o output file will be exactly
the same, even though the preprocessor output is different.

Ok. Now that you know the basic operation, let's discuss the advanced
topic of increasing the cache hit rate. A compiler is a tool that
performs a mapping. You have an infinite set of input sourcefiles
denoted as S=(S1, S2, ...). You also have an infinite set of output
object files denoted as O=(O1, O2, ...). The compiler is nothing more
than a relation that connects elements of S to elements of O. Let's
discuss this with an example:

S1 -> O1

S2 -> O2

S3 --\
S4 --->--> O345
S5 --/

What's shown here is that multiple different (in fact infinitely many)
source files map to the same output file. compilercache shall not only
know that S1 -> O1, S2 -> O2 and S3 -> O345, but it shall also know that
S4 and S5 will produce the same output as S3. This way, if you have ever
compiled S3, the result of the compilation of S4 and S5 will come from
the cache instead of a recompilation.

But is this practically relevant? Well, consider an extremely big
project like the Linux kernel. Maybe there is a single include file
that almost everybody includes. Let's say it defines some very
fundamental basic integer types. Now, if a developer fixes a typo in a
comment ( /* this is myy integer */ --> /* this is my integer */ ), a
complete recompilation of the whole project is needed, even though
absolutely nothing changed in the output files! Often this leads
programmers to not fix typos in comments, which is not a good thing.

Another example, also from the linux kernel, is a central include file
(autoconf.h) containing macros for the kernel configuration. This
single file contains definitions for ALL drivers in the kernel. The
drivers themselves are separate C source files, and each of them
considers only a few of the macros in the central include file. Now if
you change a definition in the central include file, or add and remove
whitespace (like the Linux kernel configuration tools do), theoretically
only a few driver sourcefiles would need to be recompiled. But in
practice everything will be recompiled. The cache should put a stop to
this and use the cached results wherever possible.

So, now that you see that many sourcefiles map to the same output
file, and you are also convinced that it would be very useful for the
cache to detect those situations to produce many more cache hits,
let's discuss the techniques used to reach that goal.

First, it should be clear that the output of the preprocessor
plus the IDENTARGS commandline options plus the compiler version
uniquely specify the corresponding output file. (Now you understand
why IDENTARGS does not contain paths and macro definition commandline
arguments. All this information is no longer needed after the
preprocessor has finished its job.)

Ok. How can we further compact the preprocessor output so that the
corresponding output file is still uniquely specified, but more source
files match ?

Well, first we must make sure that the output file has no direct links
into the source file. What I mean is that you cannot ignore the
formatting of the sourcefile if the output file contains debugging
information that directly refers to line numbers in the sourcefile.
Otherwise, if you added newlines to the sourcefile and recompiled, the
old output would be taken from the cache, and if you then called your
debugger, the line number information would be wrong. So the following
technique is only activated if debugging options are turned off.

The preprocessor output still contains lines starting with a '#'. This
seems to break the design, but it is true anyway. The only # lines
that have an effect on the resulting output file are the ones that
start with #pragma.

The preprocessor output is fed through a program called "unifier".
The output of the unifier is then taken for the md5sum. The unifier
works like a C/C++/ObjC lexer. All it does is write each token on a
single line. It treats # lines (like #pragma) as one big token (see
below for an example). But it ignores # lines that start with a number
(like '# 1 "foo.c" 30').

for example, if the following is the input to the unifier:
--------
# 1 "/home/erik/kernel-source-2.4.2/include/linux/autoconf.h" 1


static inline int spin_trylock(spinlock_t *lock)
#pragma implementation
--------
then the following will be the output:
--------
static
inline
int
spin_trylock
(
spinlock_t
*
lock
)
#pragma implementation
--------
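
To get a feeling for what the unifier does, here is a very crude
approximation in shell. It is NOT the real unifier: the name
unify_sketch is made up, and this version knows nothing about string
literals, comments or multi-character operators, which a proper lexer
has to handle.

```shell
# Crude approximation of the unifier, for illustration only.
# Reads preprocessor output on stdin, writes one token per line.
unify_sketch() {
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            "# "[0-9]*) ;;                      # line markers are ignored
            "#"*) printf '%s\n' "$line" ;;      # other # lines: one big token
            *) printf '%s\n' "$line" |
                   grep -oE '[A-Za-z_][A-Za-z0-9_]*|[0-9]+|[^[:space:]]' || true ;;
        esac
    done
    return 0
}

# Differently formatted input gives identical output, and therefore
# the same md5sum:
printf 'int  x ;\n'    | unify_sketch | md5sum
printf 'int\nx;\n\n\n' | unify_sketch | md5sum
```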

The unifier is the real power of compilercache. That's what make
and all the other tools cannot do. Cache the Linux kernel and the
Mozilla project :-)


8. Some performance statistics !
--------------------------------

*** linux kernel compilation. kernel version 2.4.3

kernel          compilercache           time
------------------------------------------------------
default         no                      5m28.860s
default         yes, but empty          6m56.490s
default         yes, filled             2m51.900s
modified        yes, filled             3m58.730s

in each run the kernel source was completely removed and freshly
unpacked. default means the configuration is unchanged. modified
means the following changes:

parallel port Y
PC style hardware Y
FDDI Y
Digital DEFEA and DEFPA adapter Y
adaptec aha1740 Y
advansys SCSI support Y
Ensoniq audio PCI ES1370 Y
Socket Filtering Y

The time measured was the time needed for "make dep; make". The cache
does not perform so well here, because make dep needs plenty of time
and is not cacheable. In addition, the kernel consists of very
many small files with short compilation runs. But anyway, the cache
gives us an improvement.


*** omniORB_303 compilation

(be careful, you have to manually adjust the makefiles, because they
explicitly refer to /usr/bin/gcc and not to the compilercache)

omniORB has many C++ files with long compilation times. Unfortunately
many of them produce warnings so that they won't be cached. There are
also many IDL compiler runs, which won't be cached. But look at the
results!

empty compilercache:
19 minutes 27 seconds

full compilercache:
3 minutes 11 seconds

ain't that fantastic?


*** qt-x11-2.3.0.tar.gz complete compilation including examples

empty compilercache:
21 minutes 53 seconds

full compilercache:
4 minutes 5 seconds

Here too, most of the time is spent on recompiling warning-producing
C++ files...


--
erikyyy at erikyyy dot de, Erik Thiele