Friday, April 29, 2011

Code available!


The code so far can be found in SVN on Sourceforge.
It does something already, but is not ready for release yet.

Friday, April 22, 2011

Remove superfluous include files


After the introduction to dependency analysis using prolog in a previous post, we can discuss some more advanced applications of the prolog database that is generated by the tool.

Approaches to include file pruning

Before we dive into the code, it's important to understand the limitations of what we are about to do. I'm aware of three techniques for removing #include statements from a C/C++ program:

  1. full parsing and semantic analysis. This would be the best approach, but it is very hard to implement, and I'm not aware of any tool that does this for C++ code. (Edited to add: check out the Clang-based include-what-you-use project!) It would be able to point out things like include files that can be removed because none of their contents are referenced, or because a forward declaration suffices.
  2. systematically comment out #include statements one at a time, and recompile the code. If compilation fails, keep the #include statement; if compilation succeeds, remove it. This is implemented in the open source tool deheader. It is less general than approach 1 in that it cannot say which include files can be replaced with forward declarations, but it does allow removing include files whose contents are not referenced. Recompiling the code after removing each #include can be time-consuming. Note that it is possible to come up with situations in which removing an #include leads to subtle bugs without causing compiler errors (like selecting a different overloaded method).
  3. remove #include statements that are implied by other #include statements. Start from a file F1.h that includes both F2.h and F3.h, and let F2.h itself include F3.h. There's not much point in F1.h also including F3.h directly, because its inclusion is already implied by including F2.h. It is this technique that we can implement using analysis of the include file dependencies (see the small example after this list). It can also serve as a preprocessing step for technique 2, to reduce the number of compilations that have to take place.
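
As a small illustration of technique 3, here are some facts using the representation from the "Dependency analysis using prolog" post (the file names are made up for the example):

:- op(970, xfx, includes).

'F1.h' includes 'F2.h'.
'F2.h' includes 'F3.h'.
'F1.h' includes 'F3.h'.

With the path rule from that post loaded, the query path('F2.h', 'F3.h', P) succeeds with P = ['F2.h', 'F3.h'], so the direct #include of F3.h in F1.h is implied by the #include of F2.h and is a candidate for removal.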

Prolog code

Feel free to point out bugs and suggest improvements :)
The following won't work correctly in the presence of cycles, i.e. include files that, via other files, eventually include themselves again (#ifdef guards then prevent the actual double inclusion). This happens in real code bases, but it is considered bad practice. Detecting include file cycles will probably be the subject of another dependency analysis blog post.
Note also that the tool currently ignores #ifdef and friends.



The starting point is minimize_includes. minimize_includes expects a File as input, and produces three lists: the first list, OriginalIncludes, contains the files that File currently includes directly; the second list, MinimizedIncludes, contains the files it should include; the third list, ToBeRemoved, contains the #include files that can be removed from File. If nothing can be removed, minimize_includes will fail.

minimize_includes delegates the hard work to remove_implied_includes. remove_implied_includes iterates over the include files of File twice: once from front to back, removing each file that is already included by one of the files that follow it in the list. This yields an IntermediateResult. That list is then reversed, and we iterate over it again, once more removing each file that is already included by one of the files that follow it. Note that often multiple solutions are possible, but we stop searching after finding one solution.
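
Here is a sketch of how minimize_includes and remove_implied_includes could look. It only illustrates the idea described above (the actual code in SVN may differ, and for simplicity the helper works on the list of includes rather than on the file itself); it assumes the includes operator/facts and the path rule from the "Dependency analysis using prolog" post are loaded.

% B is already pulled in transitively via A.
implied_by(A, B) :- path(A, B, _).

% Drop every include that is implied by one of the includes that follow it.
remove_implied([], []).
remove_implied([H|T], Result) :-
    member(Other, T),
    implied_by(Other, H),
    !,
    remove_implied(T, Result).
remove_implied([H|T], [H|Result]) :-
    remove_implied(T, Result).

% Two passes: front to back, then once more over the reversed list.
remove_implied_includes(Includes, Minimized) :-
    remove_implied(Includes, IntermediateResult),
    reverse(IntermediateResult, Reversed),
    remove_implied(Reversed, StillReversed),
    reverse(StillReversed, Minimized).

minimize_includes(File, OriginalIncludes, MinimizedIncludes, ToBeRemoved) :-
    findall(I, File includes I, OriginalIncludes),
    remove_implied_includes(OriginalIncludes, MinimizedIncludes),
    subtract(OriginalIncludes, MinimizedIncludes, ToBeRemoved),
    ToBeRemoved \= [].    % fail when nothing can be removed

With the F1.h/F2.h/F3.h facts shown earlier, minimize_includes('F1.h', O, M, R) should give O = ['F2.h', 'F3.h'], M = ['F2.h'] and R = ['F3.h'].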

I should also mention that on code bases with precompiled header files (e.g. stdafx.h on windows systems), the precompiled header files can end up being considered "removable". In practice the compiler expects to find them and they should not be removed from the code base.

Sunday, April 17, 2011

C/C++ include file resolution


Another why-oh-why experience today. I'm trying to find out how the preprocessor resolves include files, i.e. which paths are searched first.

I expected the following behaviour:
  • #include <somefile.h> : only search the paths that were passed to the compiler using the -I switch
  • #include "somefile.h" : first search the folder in which the file being examined lives, only then start searching the folders passed to the compiler using -I.
Instead I found that the behaviour is compiler dependent (hence the why-oh-why?!). According to this article:
  • the #include <somefile.h> variant:

    Step 1: Visual Studio searches the paths passed by /I; GCC searches the paths passed by -I.
    Step 2: Visual Studio searches the paths specified in the INCLUDE environment variable; GCC searches the system directories.

  • the #include "somefile.h" variant:

    Step 1: both Visual Studio and GCC search the same directory as the file that contains the #include statement.
    Step 2: Visual Studio searches the directories of any previously opened include files, in the reverse order in which they were opened; GCC searches the directories passed in by the -iquote option.
    Step 3: Visual Studio searches the paths passed by /I; GCC searches the "inside the quotes" directories.
    Step 4: Visual Studio searches the paths in the INCLUDE environment variable.
To clarify the "inside the quotes" method: for example, if /usr/include/sys/stat.h contains #include "types.h", GCC looks for types.h first in /usr/include/sys, then in its usual search path.

So what now? Do we really need separate, per-compiler resolution algorithms that mimic these approaches? Since I'm not trying to keep the tool's command line options compatible with those of the compiler, I think I can get away with a simpler scheme (counter examples are welcome of course). For now include files will be resolved as follows:
  • Step 1: check the folder of the file being examined (only if the "" style of include was used)
  • Step 2: check the include folders that were passed to the tool using -I like options (in the order in which they were passed in)

Wednesday, April 13, 2011

Cross-platform file name handling



The challenge

On many operating systems file names are case sensitive. The windows OS, however, has case insensitive file names. The tool should preferably be cross-platform, which in my opinion means that we should be able to analyze:

  • code written on a linux system with a tool running under windows
  • code written on a linux system with a tool running under linux
  • code written on a windows system with a tool running under windows
  • code written on a windows system with a tool running under linux

The first option is not possible in the most general case: on linux it is perfectly possible to have two different files in the same folder, includeme.h and IncludeMe.h, while on windows those two would be the same file. Too bad, nothing we can do about it. The second option presents no problems.

The headache starts when trying to implement scenarios 3 and 4. Code written on windows systems tends to be rather sloppy when it comes to specifying include files. The compiler certainly won't enforce using the correct case for file names, and writing, say,

#include "IncludeMe.h"

on windows means exactly the same as writing:

#include "includeme.h"

The approach

To be able to handle all four scenarios transparently, I have opted to add a command-line option to the tool to explicitly inform it about the required case sensitivity mode of operation. File names are wrapped in a FileNode object that always holds the case sensitive file name (including full path).

The FileNode object then exposes two methods: file_name, which returns the case sensitive name, and analysis_name, which in case insensitive mode returns the all-lowercase version of the file name, and in case sensitive mode returns the case sensitive version, possibly truncated to a maximum length (no one wants to write queries involving a file name like /home/username/development/cpp/largeproject/library/sublibrary/longfilename.cpp, right?).



Apparently path.py only recognizes os.sep as the path separator: \ on windows and / on linux. This makes it a bit harder to analyze windows software, where both the forward slash / and the backslash \ are valid path separators. This issue explains the kludgy lines at the top of the method FileNode.__pp(self, for_display), where path separators are replaced.
To make the tool work cross-platform as defined in this post, we pass the allowed path separators into the tool via a command line option.

Sunday, April 10, 2011

Dependency analysis using prolog


Prolog?


According to Wikipedia (emphasis is mine):

Prolog is a general purpose logic programming language associated with artificial intelligence and computational linguistics.

Prolog has its roots in first-order logic, a formal logic, and unlike many other programming languages, Prolog is declarative: The program logic is expressed in terms of relations, represented as facts and rules. A computation is initiated by running a query over these relations.

It sounds ideal for our purpose: we can express dependencies between files as relations, and examine these relations using queries. If you hear the word query and automatically think "relational database and SQL", I will have to disappoint you: prolog is much more powerful than SQL. Contrary to SQL, prolog can be used as a general purpose language, and it especially shines when it comes to recursive queries, something SQL is not exactly famous for. Dependencies between include files are recursive in nature, so prolog will be a natural fit.

Representing facts and rules in prolog


To represent the fact that John is the father of Pete, in prolog one would write:
father(john, pete).

This single fact already forms a database which we can run a few queries on. For example, we could ask the system who is the father of Pete:
father(X, pete).

Note that constants like john and pete are written in lowercase, whereas names starting with a capital, like X, are variables that prolog has to fill in. By asking father(X, pete), we ask prolog: what value should X have so that the relation father(X, pete) is true according to the information in our database?

As we expect, prolog answers:
X = john.

Now suppose our database states that Pete is the father of John, and also that Pete is the father of Mary.
father(pete, john).
father(pete, mary).

We can ask prolog who are the children of pete using the query:
father(pete, X).

This query has two possible answers: according to the database, both john and mary are children of pete. Prolog will therefore return the first possible answer and then ask you what to do.
If you press the ";" key, prolog will go on and search for another answer. If you press ".", prolog stops searching.
X = john ;
X = mary

So far we've only represented and queried facts. This could still just as well be done in SQL.
We now turn to the more interesting question of finding siblings. Our database doesn't directly represent facts about siblings, but we can add knowledge by defining rules. For example, we could add the rule that X and Y are siblings provided that they both have the same father. Our prolog program becomes:
father(pete, john).
father(pete, mary).
sibling(X, Y) :- father(Z, X), father(Z, Y).

You can read the last line as: X is a sibling of Y provided that there's a Z that is the father of X, and Z is also the father of Y. We could use the following query to ask prolog who are the siblings of mary:
sibling(mary, X).

And prolog searches its database to return the following answers:
X = john ;
X = mary

It may come as a surprise that Mary is considered a sibling of Mary, but that's exactly what we taught our system: X and Y are siblings if they have the same father. If you substitute Mary for both X and Y, then X and Y indeed have the same father, so according to our rule they are siblings. We can extend the rule to demand that X and Y are not equal:
father(pete, john).
father(pete, mary).
sibling(X, Y) :- father(Z, X), father(Z, Y), X \= Y.

If we now ask who's the sibling of Mary
sibling(mary, X).

We get only John as an answer
X = john

This is, in a nutshell, how prolog can be taught to reason about facts. Of course this only shows the tip of the iceberg of what prolog can do. To find out more, you should consider reading a book.

From siblings to dependencies


Now that we understand the basics of prolog, we can define some relations that will come in handy for our dependency analysis. We could e.g. represent the fact that a file XXX.cpp includes a file XXX.h as follows:
includes('XXX.cpp', 'XXX.h').

Note that normally words starting with a capital are considered variables that are to be filled in by prolog, but if you want to use capitalized words in facts, you can put them between apostrophes (').

For users who are not familiar with prolog, interpreting a relation like "includes('XXX.cpp', 'XXX.h')" is not very intuitive. Wouldn't it be nice if we could write instead:
'XXX.cpp' includes 'XXX.h'.

Thanks to user defined operators in prolog this becomes possible. Since this post isn't meant to be a prolog tutorial, without further delay here's some code to accomplish this. Hardcore prolog users will not always appreciate user defined operators, since they introduce a domain specific language which may make the code harder to read. For the uninitiated, I think the DSL we introduce here will make prolog look less daunting.
:- op(970, xfx, includes).

After defining the user defined operator "includes" we can now write:
'src/XXX.cpp' includes 'src/XXX.h'.
'src/XXX.h' includes 'lib/YYY.h'.
to express the fact that file 'src/XXX.cpp' includes 'src/XXX.h', and to express that 'src/XXX.h' includes 'lib/YYY.h'. Who said that prolog was difficult and mind-stretching? ;-) (Trust me, it can be, just not yet at this level.)

Knowledge representation for dependency analysis

So we already understand that we can write facts and rules to reason about the facts. Now the question arises: what facts do we need in order to do meaningful include file analysis? We will make a distinction between cpp files and header files. In a typical programming language, you might be tempted to examine the file name to decide whether you are dealing with a cpp file or a header file. In prolog this is not impossible, but awkward. It is better to represent such knowledge directly using facts. We define two user defined operators as follows:
:- op(990, fy, headerfile).
:- op(990, fy, cppfile).
Those operators allow us to represent that 'XXX.cpp' is a cppfile and 'YYY.h' is a headerfile as follows:
cppfile 'src/XXX.cpp'.
headerfile 'src/XXX.h'.
headerfile 'lib/YYY.h'.

You may be afraid that by representing such facts explicitly in your prolog database, the prolog database will grow to monstrous sizes. Don't worry! The prolog engine is incredibly fast, parsing hundreds of thousands of lines of prolog code in a few seconds. (Who said prolog was slow? ;-) Trust me, it can be, if you work with an inadequate knowledge representation, or if you write bad or unoptimized queries.)

Note also that we used the name 'src/XXX.cpp' instead of 'XXX.cpp'. The idea is to use enough path prefix so that there can be no duplicate file names.

What other facts will we use? We need a concept of project, to check for hierarchy violations. With the following definitions
:- op(990, fy, project).
:- op(990, fy, to).
:- op(990, xfy, belongs).
we can write facts like:
'src/XXX.cpp' belongs to project 'src'.
'src/XXX.h' belongs to project 'src'.
'lib/YYY.h' belongs to project 'lib'.

We want to express constraints between projects:
:- op(800, fy, from).
:- op(800, fy, include).
:- op(800, xfy, cannot).
'src' cannot include from 'lib'.

Finally, we can also represent parsing cost (number of statements) for each file:
:- op(800, xfx, costs).
'src/XXX.cpp' costs 123.
'src/XXX.h' costs 12.
'lib/YYY.h' costs 23.

We will define some rules to reason about these facts. An interesting rule is the "path" rule, which calculates a Path from file A to file B. If one or more paths exist, they represent how file A eventually includes file B. So suppose that 'src/XXX.cpp' includes 'src/XXX.h' and 'src/XXX.h' in turn includes 'lib/YYY.h', then path('src/XXX.cpp', 'lib/YYY.h', P) would return P = ['src/XXX.cpp', 'src/XXX.h', 'lib/YYY.h']. Often there will be multiple paths from one file to another file, and prolog will find all of them for you. There's nothing wrong with you if you don't understand how it works without reading a prolog book. You also don't need to understand it in depth to understand what follows.
path(A,B,Path) :-
    travel(A,B,[A],Q),
    reverse(Q,Path).
travel(A,B,P,[B|P]) :-
    A includes B.
travel(A,B,Visited,Path) :-
    A includes C,
    C \== B,
    \+member(C,Visited),
    travel(C,B,[C|Visited],Path).

Using our path rule we can now define less mind-melting queries. E.g. we define a rule that can tell us if file A depends on file B:
:- op(650, fy, on).
:- op(650, xfy, depends).
:- op(670, yfx, via).

X depends on Y :- path(X, Y, _).
Here we say: file X depends on file Y if there's a path _ from X to Y. The underscore denotes that we are not really interested in the actual path.
X depends on Y via P :- path(X, Y, P).
With the above rule we can write a query like:
'src/XXX.cpp' depends on 'lib/YYY.h' via P.
and prolog will answer:
P = ['src/XXX.cpp', 'src/XXX.h', 'lib/YYY.h'].

Some useful queries

You may not realize it, but by now (even though we haven't fully developed everything we will eventually use) we already have a lot of power at our fingertips.

Find all header files

headerfile X.

Find all cpp files

cppfile X.

Find out which files are directly included by file 'src/XXX.h'

'src/XXX.h' includes F.

Find out which files directly include 'lib/YYY.h'

F includes 'lib/YYY.h'.

Find out which files recursively are included by 'src/XXX.cpp'

'src/XXX.cpp' depends on F.

Find out which files eventually include 'lib/YYY.h' (recursively)

F depends on 'lib/YYY.h'.

Find out if someone includes cpp files

cppfile F, F2 includes F.

Find out if there are header files that are included by no one

(note: \+ is the prolog way of writing "not" )
headerfile F, \+(F2 includes F).
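
Find out how files in project 'src' depend on files in project 'lib'

(a sketch I'm adding here rather than one of the original queries: it combines the project facts with the depends ... via rule from above)
X belongs to project 'src', Y belongs to project 'lib', X depends on Y via P.

With the example facts, one of the answers is X = 'src/XXX.cpp', Y = 'lib/YYY.h', P = ['src/XXX.cpp', 'src/XXX.h', 'lib/YYY.h'].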

Some more rules

We add a rule to detect include file violations, taking into account constraints between projects:
violations(Project, X, Y) :-
    X includes Y, 
    X belongs to project Project,
    Y belongs to project Py,
    Project cannot include from Py.

Now we can ask to get all include file violations:
violations(P, X, Y).
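
As an extra illustration (this rule is not part of the original rule set, and the name touch_cost is made up), the costs facts from earlier can be combined with the depends rule to estimate how much code is affected when a file is touched:
touch_cost(File, TotalCost) :-
    setof(D, D depends on File, Dependents),
    findall(C, (member(F, Dependents), F costs C), Costs),
    sum_list(Costs, TotalCost).

With the example facts above, touch_cost('lib/YYY.h', Cost) sums the costs of 'src/XXX.h' and 'src/XXX.cpp', giving Cost = 135 (sum_list/2 is an swi-prolog built-in; the cost of the touched file itself is not counted).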

If you're not convinced about the power of prolog for dependency analysis, you'd better stop reading here.

The next question that arises is: how will we extract all those facts from our actual source code? Are we supposed to enter all that information into a prolog database manually?
The answer, of course, is no. In a later post, we will introduce a python program that extracts all these facts from our source code and dumps them into a prolog database similar to the one we developed here. I used python instead of prolog to extract those facts from c/cpp files because I'm much more familiar with python. Prolog is great for reasoning about facts, but text processing is probably better left to a language like python.

Once everything is ready, I'll make the full code available for download.

Saturday, April 9, 2011

Analyzing C/C++ include file dependencies



Have you ever been looking for a tool that you can query for things like:
  • Which include files does file XXX.cpp include directly ?
  • Which include files does file XXX.cpp include recursively ?
  • Which files include XXX.h ?
  • How does file XXX.cpp include file YYY.h ?
  • How many lines of code have to be recompiled if I touch file YYY.h ?
  • How does project P1 depend on project P2 ?
  • What include statements can be removed from file F because they are implied by other include statements in that same file ?
  • ...and more
Would you also like to 
  • Visualize dependency graphs ?
  • Automatically track violations in case certain projects are not supposed to include from other projects ?
  • Have a tool that is free under GPLv3, open source and cross-platform ?
In a series of blog posts, we will systematically develop exactly such a tool using swi-prolog and python 2.