===============================================================
Tutorial for building tools using LibTooling and LibASTMatchers
===============================================================

This document is intended to show how to build a useful source-to-source
translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is
explicitly aimed at people who are new to Clang, so all you should need
is a working knowledge of C++ and the command line.

In order to work on the compiler, you need some basic knowledge of the
abstract syntax tree (AST). To this end, the reader is encouraged to
skim the :doc:`Introduction to the Clang
AST <IntroductionToTheClangAST>`.

Step 0: Obtaining Clang
=======================

As Clang is part of the LLVM project, you'll need to download LLVM's
source code first. Both Clang and LLVM are maintained as Subversion
repositories, but we'll be accessing them through the git mirror. For
further information, see the `getting started
guide <http://llvm.org/docs/GettingStarted.html>`_.

.. code-block:: console

      mkdir ~/clang-llvm && cd ~/clang-llvm
      git clone http://llvm.org/git/llvm.git
      cd llvm/tools
      git clone http://llvm.org/git/clang.git
      cd clang/tools
      git clone http://llvm.org/git/clang-tools-extra.git extra

Next you need to obtain the CMake build system and Ninja build tool. You
may already have CMake installed, but current binary versions of CMake
aren't built with Ninja support.

.. code-block:: console

      cd ~/clang-llvm
      git clone https://github.com/martine/ninja.git
      cd ninja
      git checkout release
      ./bootstrap.py
      sudo cp ninja /usr/bin/

      cd ~/clang-llvm
      git clone git://cmake.org/stage/cmake.git
      cd cmake
      git checkout next
      ./bootstrap
      make
      sudo make install

Okay. Now we'll build Clang!

.. code-block:: console

      cd ~/clang-llvm
      mkdir build && cd build
      cmake -G Ninja ../llvm -DLLVM_BUILD_TESTS=ON  # Enable tests; default is off.
      ninja
      ninja check       # Test LLVM only.
      ninja clang-test  # Test Clang only.
      ninja install

And we're live.

All of the tests should pass, though there is a (very) small chance that
you can catch LLVM and Clang out of sync. Running ``'git svn rebase'``
in both the llvm and clang directories should fix any problems.

Finally, we want to set Clang as its own compiler.

.. code-block:: console

      cd ~/clang-llvm/build
      ccmake ../llvm

The second command will bring up a GUI for configuring Clang. You need
to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on
advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to
``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to
configure, then ``'g'`` to generate CMake's files.

Finally, run ninja one last time, and you're done.

Step 1: Create a ClangTool
==========================

Now that we have enough background knowledge, it's time to create the
simplest productive ClangTool in existence: a syntax checker. While this
already exists as ``clang-check``, it's important to understand what's
going on.

First, we'll need to create a new directory for our tool and tell CMake
that it exists. As this is not going to be a core clang tool, it will
live in the ``tools/extra`` repository.

.. code-block:: console

      cd ~/clang-llvm/llvm/tools/clang
      mkdir tools/extra/loop-convert
      echo 'add_subdirectory(loop-convert)' >> tools/extra/CMakeLists.txt
      vim tools/extra/loop-convert/CMakeLists.txt

CMakeLists.txt should have the following contents:

::

      set(LLVM_LINK_COMPONENTS support)

      add_clang_executable(loop-convert
        LoopConvert.cpp
        )
      target_link_libraries(loop-convert
        clangTooling
        clangBasic
        clangASTMatchers
        )

With that done, Ninja will be able to compile our tool. Let's give it
something to compile! Put the following into
``tools/extra/loop-convert/LoopConvert.cpp``. A detailed explanation of
why the different parts are needed can be found in the `LibTooling
documentation <LibTooling.html>`_.

.. code-block:: c++

      // Declares clang::SyntaxOnlyAction.
      #include "clang/Frontend/FrontendActions.h"
      #include "clang/Tooling/CommonOptionsParser.h"
      #include "clang/Tooling/Tooling.h"
      // Declares llvm::cl::extrahelp.
      #include "llvm/Support/CommandLine.h"

      using namespace clang::tooling;
      using namespace llvm;

      // Apply a custom category to all command-line options so that they are the
      // only ones displayed.
      static llvm::cl::OptionCategory MyToolCategory("my-tool options");

      // CommonOptionsParser declares HelpMessage with a description of the common
      // command-line options related to the compilation database and input files.
      // It's nice to have this help message in all tools.
      static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage);

      // A help message for this specific tool can be added afterwards.
      static cl::extrahelp MoreHelp("\nMore help text...");

      int main(int argc, const char **argv) {
        CommonOptionsParser OptionsParser(argc, argv, MyToolCategory);
        ClangTool Tool(OptionsParser.getCompilations(),
                       OptionsParser.getSourcePathList());
        return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get());
      }

And that's it! You can compile our new tool by running ninja from the
``build`` directory.

.. code-block:: console

      cd ~/clang-llvm/build
      ninja

You should now be able to run the syntax checker, which is located in
``~/clang-llvm/build/bin``, on any source file. Try it!

.. code-block:: console

      echo "int main() { return 0; }" > test.cpp
      bin/loop-convert test.cpp --

Note the two dashes after we specify the source file. The additional
options for the compiler are passed after the dashes rather than loading
them from a compilation database - there just aren't any options needed
right now.

Intermezzo: Learn AST matcher basics
====================================

Clang recently introduced the :doc:`ASTMatcher
library <LibASTMatchers>` to provide a simple, powerful, and
concise way to describe specific patterns in the AST. Implemented as a
DSL powered by macros and templates (see
`ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're
curious), matchers offer the feel of algebraic data types common to
functional programming languages.

For example, suppose you wanted to examine only binary operators. There
is a matcher to do exactly that, conveniently named ``binaryOperator``.
I'll give you one guess what this matcher does:

.. code-block:: c++

      binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0))))

Shockingly, it will match against addition expressions whose left-hand
side is exactly the literal 0. It will not match against other forms of
0, such as ``'\0'`` or ``NULL``, but it will match against macros that
expand to 0. The matcher will also not match against calls to the
overloaded operator ``'+'``, as there is a separate ``operatorCallExpr``
matcher to handle overloaded operators.
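
To make these cases concrete, here is a small sketch of input code annotated
with how the matcher above is expected to treat each expression, based on the
description just given. The macro ``ZERO``, the ``Wrapper`` type, and all
function names are hypothetical and exist only for illustration.

.. code-block:: c++

      #define ZERO 0  // hypothetical macro that expands to the literal 0

      // Hypothetical type with an overloaded operator+.
      struct Wrapper {
        int Value;
      };

      Wrapper operator+(int LHS, Wrapper RHS) { return Wrapper{LHS + RHS.Value}; }

      // Expected to match: the left-hand side is the integer literal 0.
      int matchedLiteral(int X) { return 0 + X; }

      // Expected to match: ZERO expands to the integer literal 0.
      int matchedMacro(int X) { return ZERO + X; }

      // Not expected to match: the literal 0 is on the right-hand side.
      int wrongSide(int X) { return X + 0; }

      // Not expected to match: '\0' is a character literal.
      int charZero(int X) { return '\0' + X; }

      // Not expected to match: this is a call to an overloaded operator+.
      Wrapper overloaded(Wrapper W) { return 0 + W; }

All five functions compute the same kind of sum at run time; the differences
are visible only in the AST, which is exactly what the matcher inspects.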

There are AST matchers to match all the different nodes of the AST,
narrowing matchers to only match AST nodes fulfilling specific criteria,
and traversal matchers to get from one kind of AST node to another. For
a complete list of AST matchers, take a look at the `AST Matcher
Reference <LibASTMatchersReference.html>`_.

All matchers that are nouns describe entities in the AST and can be
bound, so that they can be referred to whenever a match is found. To do
so, simply call the method ``bind`` on these matchers, e.g.:

.. code-block:: c++

      varDecl(hasType(isInteger())).bind("intvar")

Step 2: Using AST matchers
==========================

Okay, on to using matchers for real. Our goal is a matcher which will
capture all ``for`` statements that define a new variable initialized to
zero. We begin by matching all ``for`` loops:

.. code-block:: c++

      forStmt()

Next, we want to specify that a single variable is declared in the first
portion of the loop, so we can extend the matcher to

.. code-block:: c++

      forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl()))))

Finally, we can add the condition that the variable is initialized to
zero.

.. code-block:: c++

      forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
        hasInitializer(integerLiteral(equals(0))))))))

It is fairly easy to read and understand the matcher definition ("match
loops whose init portion declares a single variable which is initialized
to the integer literal 0"), but deciding that every piece is necessary
is more difficult. Note that this matcher will not match loops whose
variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of
zero besides the integer 0.

The last step is giving the matcher a name and binding the ``ForStmt``
as we will want to do something with it:

.. code-block:: c++

      StatementMatcher LoopMatcher =
        forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
          hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");

Once you have defined your matchers, you will need to add a little more
scaffolding in order to run them. Matchers are paired with a
``MatchCallback`` and registered with a ``MatchFinder`` object, then run
from a ``ClangTool``. More code!

Add the following to ``LoopConvert.cpp``:

.. code-block:: c++

      #include "clang/ASTMatchers/ASTMatchers.h"
      #include "clang/ASTMatchers/ASTMatchFinder.h"

      using namespace clang;
      using namespace clang::ast_matchers;

      StatementMatcher LoopMatcher =
        forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl(
          hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop");

      class LoopPrinter : public MatchFinder::MatchCallback {
      public:
        virtual void run(const MatchFinder::MatchResult &Result) {
          if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop"))
            FS->dump();
        }
      };

And change ``main()`` to:

.. code-block:: c++

      int main(int argc, const char **argv) {
        CommonOptionsParser OptionsParser(argc, argv, MyToolCategory);
        ClangTool Tool(OptionsParser.getCompilations(),
                       OptionsParser.getSourcePathList());

        LoopPrinter Printer;
        MatchFinder Finder;
        Finder.addMatcher(LoopMatcher, &Printer);

        return Tool.run(newFrontendActionFactory(&Finder).get());
      }

Now, you should be able to recompile and run the code to discover ``for``
loops. Create a new file with a few examples, and test out our new
handiwork:

.. code-block:: console

      cd ~/clang-llvm/build/
      ninja loop-convert
      vim ~/test-files/simple-loops.cc
      bin/loop-convert ~/test-files/simple-loops.cc
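
The tutorial does not list the contents of ``simple-loops.cc``, so the
following is only a hypothetical example of what such a file might contain.
The comments state which loops the Step 2 matcher is expected to dump,
following the matcher description above; all names are made up for
illustration.

.. code-block:: c++

      // Hypothetical contents for ~/test-files/simple-loops.cc.
      const int N = 5;
      int Arr[N] = {1, 2, 3, 4, 5};

      int declaresIntZero() {
        int Sum = 0;
        // Expected to be dumped: one variable declared, initialized to 0.
        for (int I = 0; I < N; ++I)
          Sum += Arr[I];
        return Sum;
      }

      int declaresTwoVariables() {
        int Sum = 0;
        // Not expected to be dumped: the init portion declares two variables.
        for (int I = 0, J = N; I < J; ++I)
          Sum += Arr[I];
        return Sum;
      }

      int assignsExistingVariable() {
        int Sum = 0;
        int I;
        // Not expected to be dumped: the init portion is an assignment,
        // not a declaration.
        for (I = 0; I < N; ++I)
          Sum += Arr[I];
        return Sum;
      }

      int startsAtOne() {
        int Sum = 0;
        // Not expected to be dumped: the initializer is 1, not 0.
        for (int I = 1; I < N; ++I)
          Sum += Arr[I];
        return Sum;
      }

The file needs no ``main()`` because the tool only parses it; nothing is
linked or executed.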

Step 3.5: More Complicated Matchers
===================================

Our simple matcher is capable of discovering ``for`` loops, but we would
still need to filter out many more ourselves. We can do a good portion
of the remaining work with some cleverly chosen matchers, but first we
need to decide exactly which properties we want to allow.

How can we characterize ``for`` loops over arrays which would be eligible
for translation to range-based syntax? They are loops over arrays of
size ``N`` that:

- start at index ``0``
- iterate consecutively
- end at index ``N-1``

We already check the first criterion, so all we need to add is a check to
the loop's condition to ensure that the loop's index variable is compared
against ``N`` and another check to ensure that the increment step just
increments this same variable. The matcher for the second criterion sounds
straightforward: require a pre- or post-increment of the same variable
declared in the init portion.

Unfortunately, such a matcher is impossible to write. Matchers contain
no logic for comparing two arbitrary AST nodes and determining whether
or not they are equal, so the best we can do is matching more than we
would like to allow, and punting extra comparisons to the callback.
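
As a concrete picture of the three criteria listed above, the sketch below
shows an index-based loop that satisfies all of them, the range-based form
such a loop could be rewritten to, and a loop that fails the "iterate
consecutively" criterion. The array and function names are hypothetical.

.. code-block:: c++

      const int N = 4;
      int Values[N] = {3, 1, 4, 1};

      // Eligible: starts at index 0, steps by one, and stops after index N-1.
      int sumIndexed() {
        int Sum = 0;
        for (int I = 0; I < N; ++I)
          Sum += Values[I];
        return Sum;
      }

      // The range-based form that the loop above could be translated to.
      int sumRangeBased() {
        int Sum = 0;
        for (int Value : Values)
          Sum += Value;
        return Sum;
      }

      // Not eligible: the index skips every other element, so a range-based
      // loop would visit elements that this loop does not.
      int sumEveryOther() {
        int Sum = 0;
        for (int I = 0; I < N; I += 2)
          Sum += Values[I];
        return Sum;
      }

The first two functions return the same value, which is the property a
source-to-source translation has to preserve.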

In any case, we can start building this sub-matcher. We can require that
the increment step be a unary increment like this:

.. code-block:: c++

      hasIncrement(unaryOperator(hasOperatorName("++")))

Specifying what is incremented introduces another quirk of Clang's AST:
usages of variables are represented as ``DeclRefExpr``'s ("declaration
reference expressions") because they are expressions which refer to
variable declarations. To find a ``unaryOperator`` that refers to a
specific declaration, we can simply add a second condition to it:

.. code-block:: c++

      hasIncrement(unaryOperator(
        hasOperatorName("++"),
        hasUnaryOperand(declRefExpr())))

Furthermore, we can restrict our matcher to only match if the
incremented variable is an integer:

.. code-block:: c++

      hasIncrement(unaryOperator(
        hasOperatorName("++"),
        hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger())))))))

And the last step will be to attach an identifier to this variable, so
that we can retrieve it in the callback:

.. code-block:: c++

      hasIncrement(unaryOperator(
        hasOperatorName("++"),
        hasUnaryOperand(declRefExpr(to(
          varDecl(hasType(isInteger())).bind("incrementVariable"))))))

We can add this code to the definition of ``LoopMatcher`` and make sure
that our program, outfitted with the new matcher, only prints out loops
that declare a single variable initialized to zero and have an increment
step consisting of a unary increment of some variable.

Now, we just need to add a matcher to check if the condition part of the
``for`` loop compares a variable against the size of the array. There is
only one problem - we don't know which array we're iterating over
without looking at the body of the loop! We are again restricted to
approximating the result we want with matchers, filling in the details
in the callback. So we start with:

.. code-block:: c++

      hasCondition(binaryOperator(hasOperatorName("<")))

It makes sense to ensure that the left-hand side is a reference to a
variable, and that the right-hand side has integer type.

.. code-block:: c++

      hasCondition(binaryOperator(
        hasOperatorName("<"),
        hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))),
        hasRHS(expr(hasType(isInteger())))))

Unfortunately, this doesn't work. Of the three loops provided in
``test-files/simple.cpp``, zero of them have a matching condition. A
quick look at the AST dump of the first for loop, produced by the
previous iteration of loop-convert, shows us why:

::

      (ForStmt 0x173b240
        (DeclStmt 0x173afc8
          0x173af50 "int i =
            (IntegerLiteral 0x173afa8 'int' 0)")
        <<<NULL>>>
        (BinaryOperator 0x173b060 '_Bool' '<'
          (ImplicitCastExpr 0x173b030 'int'
            (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int'))
          (ImplicitCastExpr 0x173b048 'int'
            (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int')))
        (UnaryOperator 0x173b0b0 'int' lvalue prefix '++'
          (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int'))
        (CompoundStmt ...

We already know that the declaration and increments both match, or this
loop wouldn't have been dumped. The culprit lies in the implicit cast
applied to the first operand (i.e. the LHS) of the less-than operator,
an L-value to R-value conversion applied to the expression referencing
``i``. Thankfully, the matcher library offers a solution to this problem
in the form of ``ignoringParenImpCasts``, which instructs the matcher to
ignore implicit casts and parentheses before continuing to match.
Adjusting the condition operator will restore the desired match.

.. code-block:: c++

      hasCondition(binaryOperator(
        hasOperatorName("<"),
        hasLHS(ignoringParenImpCasts(declRefExpr(
          to(varDecl(hasType(isInteger())))))),
        hasRHS(expr(hasType(isInteger())))))

After adding binds to the expressions we wished to capture and
extracting the identifier strings into variables, we have array-step-2
completed.

Step 4: Retrieving Matched Nodes
================================

So far, the matcher callback isn't very interesting: it just dumps the
loop's AST. At some point, we will need to make changes to the input
source code. Next, we'll work on using the nodes we bound in the
previous step.

The ``MatchFinder::MatchCallback::run()`` callback takes a
``const MatchFinder::MatchResult &`` as its parameter. We're most
interested in its ``Context`` and ``Nodes`` members. Clang uses the
``ASTContext`` class to represent contextual information about the AST,
as the name implies, though the most functionally important detail is
that several operations require an ``ASTContext*`` parameter. More
immediately useful is the set of matched nodes, and how we retrieve them.

Since we bind three variables (under the names ``"condVarName"``,
``"initVarName"``, and ``"incVarName"``), we can obtain the matched nodes
by using the ``getNodeAs()`` member function.

In ``LoopConvert.cpp`` add

.. code-block:: c++

      #include "clang/AST/ASTContext.h"

Change ``LoopMatcher`` to

.. code-block:: c++

      StatementMatcher LoopMatcher =
          forStmt(hasLoopInit(declStmt(
                      hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0))))
                                        .bind("initVarName")))),
                  hasIncrement(unaryOperator(
                      hasOperatorName("++"),
                      hasUnaryOperand(declRefExpr(
                          to(varDecl(hasType(isInteger())).bind("incVarName")))))),
                  hasCondition(binaryOperator(
                      hasOperatorName("<"),
                      hasLHS(ignoringParenImpCasts(declRefExpr(
                          to(varDecl(hasType(isInteger())).bind("condVarName"))))),
                      hasRHS(expr(hasType(isInteger())))))).bind("forLoop");

And change ``LoopPrinter::run`` to

.. code-block:: c++

      void LoopPrinter::run(const MatchFinder::MatchResult &Result) {
        ASTContext *Context = Result.Context;
        const ForStmt *FS = Result.Nodes.getNodeAs<ForStmt>("forLoop");
        // We do not want to convert header files!
        if (!FS || !Context->getSourceManager().isFromMainFile(FS->getForLoc()))
          return;
        const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName");
        const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName");
        const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName");

        if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar))
          return;
        llvm::outs() << "Potential array-based loop discovered.\n";
      }

Clang associates a ``VarDecl`` with each variable to represent the variable's
declaration. Since the "canonical" form of each declaration is unique by
address, all we need to do is make sure neither ``ValueDecl`` (base class of
``VarDecl``) is ``NULL`` and compare the canonical Decls.

.. code-block:: c++

      static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) {
        return First && Second &&
               First->getCanonicalDecl() == Second->getCanonicalDecl();
      }
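
To see why the comparison goes through the canonical declaration rather than
comparing the two pointers directly, consider the toy model below. It is a
sketch only: ``ToyDecl`` is a hypothetical stand-in and not a Clang class, and
it assumes that the canonical declaration is the first one in a chain of
redeclarations of the same variable.

.. code-block:: c++

      // Toy stand-in for a declaration node; NOT Clang's Decl class.
      struct ToyDecl {
        const ToyDecl *Previous;  // earlier declaration of the same entity, or null

        // Walk back to the first declaration, which plays the role of the
        // canonical declaration in this sketch.
        const ToyDecl *getCanonicalDecl() const {
          const ToyDecl *D = this;
          while (D->Previous)
            D = D->Previous;
          return D;
        }
      };

      // Same shape as areSameVariable, but over the toy type.
      bool areSameToyVariable(const ToyDecl *First, const ToyDecl *Second) {
        return First && Second &&
               First->getCanonicalDecl() == Second->getCanonicalDecl();
      }

      // Two declarations of the same variable compare equal even though the
      // two nodes live at different addresses.
      bool redeclarationsCompareEqual() {
        ToyDecl FirstX{nullptr};
        ToyDecl SecondX{&FirstX};  // e.g. a repeated "extern int X;"
        return areSameToyVariable(&FirstX, &SecondX);
      }

      // Unrelated declarations, or a null pointer, never compare equal.
      bool distinctOrNullCompareEqual() {
        ToyDecl X{nullptr};
        ToyDecl Y{nullptr};
        return areSameToyVariable(&X, &Y) || areSameToyVariable(&X, nullptr);
      }

In this model a plain pointer comparison would wrongly treat ``FirstX`` and
``SecondX`` as different variables, while the canonical comparison does not.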

If execution reaches the end of ``LoopPrinter::run()``, we know that the
loop shell looks like

.. code-block:: c++

      for (int i = 0; i < expr(); ++i) { ... }

For now, we will just print a message explaining that we found a loop.
The next section will deal with recursively traversing the AST to
discover all changes needed.
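
The division of labor between the matcher and the callback can be seen on two
small input loops. This is a hypothetical test input, not a file from the
tutorial; the comments describe the behavior expected from the Step 4
``LoopMatcher`` and ``LoopPrinter::run`` as written above.

.. code-block:: c++

      const int N = 3;
      int Data[N] = {10, 20, 30};

      // The init, condition, and increment all refer to I, so the callback is
      // expected to print its message for this loop.
      int sameVariable() {
        int Sum = 0;
        for (int I = 0; I < N; ++I)
          Sum += Data[I];
        return Sum;
      }

      // The matcher alone accepts this loop: it declares one variable
      // initialized to 0, compares an integer variable with "<", and increments
      // an integer variable. The init declares I, though, while the condition
      // and increment use J, so areSameVariable makes the callback return early.
      int differentVariables() {
        int Sum = 0;
        int J = 0;
        for (int I = 0; J < N; ++J)
          Sum += Data[J] + I;
        return Sum;
      }

Both functions are valid C++ and compute the same sum; only the first has the
loop shell the tool is looking for.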

As a side note, it's not as trivial to test if two expressions are the same,
though Clang has already done the hard work for us by providing a way to
canonicalize expressions:

.. code-block:: c++

      static bool areSameExpr(ASTContext *Context, const Expr *First,
                              const Expr *Second) {
        if (!First || !Second)
          return false;
        llvm::FoldingSetNodeID FirstID, SecondID;
        First->Profile(FirstID, *Context, true);
        Second->Profile(SecondID, *Context, true);
        return FirstID == SecondID;
      }

This code relies on the comparison between two
``llvm::FoldingSetNodeIDs``. As the documentation for
``Stmt::Profile()`` indicates, the ``Profile()`` member function builds
a description of a node in the AST, based on its properties, along with
those of its children. ``FoldingSetNodeID`` then serves as a hash we can
use to compare expressions. We will need ``areSameExpr`` later. Before
you run the new code on the additional loops added to
``test-files/simple.cpp``, try to figure out which ones will be considered
potentially convertible.
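
The idea behind profile-based comparison, building a description from a node's
properties and then from its children, can be illustrated without Clang. The
sketch below is a toy analogy under that assumption: ``ToyExpr``, ``profile``,
``areSameToyExpr``, and the helper functions are all hypothetical, and a plain
vector of strings stands in for ``llvm::FoldingSetNodeID``.

.. code-block:: c++

      #include <memory>
      #include <string>
      #include <utility>
      #include <vector>

      // Toy expression node; NOT Clang's Expr class.
      struct ToyExpr {
        std::string Kind;   // e.g. "DeclRef", "Literal", "Add"
        std::string Value;  // variable name or literal text; empty for operators
        std::vector<std::shared_ptr<ToyExpr>> Children;
      };

      // Record this node's properties, then those of its children, in order.
      void profile(const ToyExpr &E, std::vector<std::string> &ID) {
        ID.push_back(E.Kind);
        ID.push_back(E.Value);
        ID.push_back(std::to_string(E.Children.size()));
        for (const auto &Child : E.Children)
          profile(*Child, ID);
      }

      // Same shape as areSameExpr, but over the toy type.
      bool areSameToyExpr(const ToyExpr *First, const ToyExpr *Second) {
        if (!First || !Second)
          return false;
        std::vector<std::string> FirstID, SecondID;
        profile(*First, FirstID);
        profile(*Second, SecondID);
        return FirstID == SecondID;
      }

      // Helpers for building small trees.
      std::shared_ptr<ToyExpr> leaf(std::string Kind, std::string Value) {
        return std::make_shared<ToyExpr>(
            ToyExpr{std::move(Kind), std::move(Value), {}});
      }

      std::shared_ptr<ToyExpr> add(std::shared_ptr<ToyExpr> LHS,
                                   std::shared_ptr<ToyExpr> RHS) {
        return std::make_shared<ToyExpr>(
            ToyExpr{"Add", "", {std::move(LHS), std::move(RHS)}});
      }

Two separately built trees for ``i + 1`` produce identical descriptions and
compare equal, while ``1 + i`` produces a different one because the children
are recorded in order.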