Static code analysis refers to the technique of approximating the runtime behaviour of a program. In other words, it is the process of predicting the output of a program without actually executing it.
Lately, however, the term “Static Code Analysis” is more commonly used to refer to one of the applications of this technique rather than the technique itself: program comprehension, i.e., understanding the program and detecting issues in it (anything from syntax errors to type mismatches, performance hogs, likely bugs, security loopholes, etc.). This is the usage we’ll be referring to throughout this post.
“The refinement of techniques for the prompt discovery of error serves as well as any other as a hallmark of what we mean by science.”
We cover a lot of ground in this post. The aim is to build an understanding of static code analysis and to equip you with the basic theory, and the right tools so that you can write analyzers on your own.
We start our journey by laying down the essential parts of the pipeline which a compiler follows to understand what a piece of code does. We learn where to tap into this pipeline to plug in our analyzers and extract meaningful information. In the latter half, we get our feet wet and write four such static analyzers, completely from scratch, in Python.
Note that although the ideas here are discussed in light of Python, static code analyzers across all programming languages are carved out along similar lines. We chose Python because of the availability of an easy-to-use `ast` module, and the wide adoption of the language itself.
Before a computer can finally “understand” and execute a piece of code, it goes through a series of complicated transformations:
As you can see in the diagram (go ahead, zoom it!), the static analyzers feed on the output of these stages. To be able to better understand the static analysis techniques, let’s look at each of these steps in some more detail:
The first thing that a compiler does when trying to understand a piece of code is to break it down into smaller chunks, also known as tokens. Tokens are akin to what words are in a language.
A token might consist of either a single character, like `(`, or literals (like integers, strings, e.g., `7`, `Bob`, etc.), or reserved keywords of that language (e.g., `def` in Python). Characters which do not contribute towards the semantics of a program, like trailing whitespace, comments, etc., are often discarded by the scanner.

Python provides the `tokenize` module in its standard library to let you play around with tokens:
import io
import tokenize

code = b"color = input('Enter your favourite color: ')"

for token in tokenize.tokenize(io.BytesIO(code).readline):
    print(token)
TokenInfo(type=62 (ENCODING), string='utf-8')
TokenInfo(type=1 (NAME), string='color')
TokenInfo(type=54 (OP), string='=')
TokenInfo(type=1 (NAME), string='input')
TokenInfo(type=54 (OP), string='(')
TokenInfo(type=3 (STRING), string="'Enter your favourite color: '")
TokenInfo(type=54 (OP), string=')')
TokenInfo(type=4 (NEWLINE), string='')
TokenInfo(type=0 (ENDMARKER), string='')
(Note that for the sake of readability, I’ve omitted a few columns from the result above — metadata like starting index, ending index, a copy of the line on which a token occurs, etc.)
At this stage, we only have the vocabulary of the language, but the tokens by themselves don’t reflect anything about the grammar of the language. This is where the parser comes into play.
A parser takes these tokens, validates that the sequence in which they appear conforms to the grammar, and organizes them in a tree-like structure, representing a high-level structure of the program. It’s aptly called an Abstract Syntax Tree (AST).
“Abstract” because it abstracts away low-level, insignificant details like parentheses, indentation, etc., allowing the user to focus only on the logical structure of the program. That is what makes it the most suitable representation for conducting static analysis on.
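To make this concrete, here’s a tiny sketch (my own illustration, not from the original post): we parse a one-line assignment with `ast.parse` and print the resulting tree with `ast.dump`. Notice that the parentheses around the addition leave no trace in the output.

import ast

# The parentheses below are concrete syntax only; the AST doesn't keep them.
tree = ast.parse("x = (1 + 2)")
print(ast.dump(tree))

Depending on your Python version, this prints something along the lines of a `Module` node containing an `Assign` node, whose value is a `BinOp` with two `Constant` children; no parentheses in sight.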
A syntax tree can get quite vast and complex, thus making it difficult to write code for analyzing it. Thankfully, since this is something that all compilers (or interpreters) do themselves, some tooling to simplify this process generally exists.
Python ships with an `ast` module as a part of its standard library, which we’ll be using heavily while writing the analyzers later. If you don’t have prior experience of working with ASTs, here’s how the `ast` module works:
- Each kind of syntactic construct in Python has a corresponding node type in the `ast` module, e.g., for loops are characterized by the `ast.For` object.
- The AST of a program can be generated using the `ast.parse` function.
- To traverse the tree, the `ast` module offers two walkers:
  - `ast.NodeVisitor` (doesn’t allow modification to the input tree)
  - `ast.NodeTransformer` (allows modification)
- A walker lets you register “visitor” methods for specific types of nodes, e.g., `ast.For` nodes.
- The name of a visitor method should be of the form `visit_` + `<NODE_TYPE>`, e.g., to add a visitor for “for loops”, the method should be named `visit_For`.
- Each walker exposes a `visit` method which recursively visits the input node, i.e. it first visits the node itself, then all of its children nodes, then the children nodes of those children, and so forth.

Just to give you a sense of how this works, let’s write code for visiting all for loops:
import ast

# Demo code to parse
code = """\
sheep = ['Shawn', 'Blanck', 'Truffy']

def get_herd():
    herd = []
    for a_sheep in sheep:
        herd.append(a_sheep)
    return Herd(herd=herd)

class Herd:
    def __init__(self, herd):
        self.herd = herd

    def shave(self, setting='SMOOTH'):
        for sheep in self.herd:
            print(f"Shaving sheep {sheep} on a {setting} setting")
"""

class Example(ast.NodeVisitor):
    def visit_For(self, node):
        print(f"Visiting for loop at line {node.lineno}")

tree = ast.parse(code)
visitor = Example()
visitor.visit(tree)
This outputs:
Visiting for loop at line 5
Visiting for loop at line 14
Here’s what happens when we run this snippet:

- The `visit` method first visits the top-level `ast.Module` node.
- It then visits the node’s children: the `ast.Assign`, `ast.FunctionDef` and `ast.ClassDef` nodes.
- When an `ast.For` loop is finally encountered, the `visit_For` method is called. Notice that a copy of the `node` is also passed onto this method, which contains all the metadata about it: children (if any), line number, column, etc.

Python also has several other third-party modules like `astroid`, `astmonkey`, and `astor` which provide additional abstractions to make our lives easier. But in this post, we’ll confine ourselves to the barebones `ast` module so that we get to see the real, ugly operations behind the scenes.
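Since all the checkers below use `ast.NodeVisitor`, here’s also a minimal sketch of the other walker, `ast.NodeTransformer` (my own illustration, not from the original examples). A transformer’s visitor methods return the node that should take the visited node’s place, so returning a new node rewrites the tree:

import ast

class ListCallRewriter(ast.NodeTransformer):
    def visit_Call(self, node):
        # Rewrite a bare `list()` call into an empty list literal `[]`.
        if isinstance(node.func, ast.Name) and node.func.id == "list" and not node.args:
            return ast.copy_location(ast.List(elts=[], ctx=ast.Load()), node)
        return node

tree = ast.parse("herd = list()")
tree = ListCallRewriter().visit(tree)
print(ast.unparse(tree))  # prints: herd = []

(`ast.unparse` is available from Python 3.9 onwards.)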
Although this blog post is only an introduction to static code analysis, we’ll be writing scripts to detect issues which are highly relevant in real-world scenarios as well (chances are your IDE already warns you about some of them). This shows just how powerful static code analysis is, and what it enables you to do with so little code:
- Detecting usage of single quotes (instead of double quotes)
- Detecting when `list()` is used instead of `[]`
- Detecting too many nested for loops
- Detecting unused imports

Here’s how each of these examples works:

Detecting usage of single quotes
Here, we write a script which would raise a warning whenever it detects that single quotes have been used in the Python files given as input.
This example may be considered rudimentary compared to other modern-day static code analysis techniques, but it is still included here because of its historical significance: this was pretty much how early code analyzers worked[1]. Another reason it makes sense to include this technique here is that it is heavily used by many popular static analysis tools, like Black.
import sys
import tokenize


class DoubleQuotesChecker:
    msg = "single quotes detected, use double quotes instead"

    def __init__(self):
        self.violations = []

    def find_violations(self, filename, tokens):
        for token_type, token, (line, col), _, _ in tokens:
            if (
                token_type == tokenize.STRING
                and (
                    token.startswith("'''")
                    or token.startswith("'")
                )
            ):
                self.violations.append((filename, line, col))

    def check(self, files):
        for filename in files:
            with tokenize.open(filename) as fd:
                tokens = tokenize.generate_tokens(fd.readline)
                self.find_violations(filename, tokens)

    def report(self):
        for violation in self.violations:
            filename, line, col = violation
            print(f"{filename}:{line}:{col}: {self.msg}")


if __name__ == '__main__':
    files = sys.argv[1:]
    checker = DoubleQuotesChecker()
    checker.check(files)
    checker.report()
Here’s a breakdown of what is happening:
- Program execution starts with the `check` method, which generates tokens for each file and passes them onto the `find_violations` method.
- The `find_violations` method iterates through the list of tokens and looks for “string type” tokens whose value starts with either `'''` or `'`. If it finds one, it flags the line by appending it to `self.violations`.
- The `report` method then reads all the issues from `self.violations` and prints them out with a helpful error message.

Running this checker on the following example file:

def simulate_quote_warning():
    '''
    The docstring intentionally uses single quotes.
    '''
    if isinstance(shawn, 'sheep'):
        print('Shawn the sheep!')
example.py:2:4: single quotes detected, use double quotes instead
example.py:5:25: single quotes detected, use double quotes instead
example.py:6:14: single quotes detected, use double quotes instead
Note that for the sake of brevity, error-handling has been omitted entirely from these examples, but needless to say, they are an essential part of any production system.
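As a minimal illustration of what that could look like (my own sketch, not part of the original checker), the `check` method above could guard against unreadable or unparsable files like this:

    def check(self, files):
        for filename in files:
            try:
                with tokenize.open(filename) as fd:
                    tokens = tokenize.generate_tokens(fd.readline)
                    self.find_violations(filename, tokens)
            except (OSError, SyntaxError, tokenize.TokenError) as exc:
                # Report the failure and move on to the next file.
                print(f"{filename}: skipped ({exc})", file=sys.stderr)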
Boilerplate for further examples
The previous example was the only one where we work directly with tokens; for all the others, we’ll limit our interaction to the generated ASTs.

Since a lot of code would be duplicated across these checkers, and this post is already quite long, let’s first get some boilerplate code in place which we can reuse for all the examples. Defining the boilerplate at once also allows me to discuss only the relevant details under each checker, having dealt with all the common plumbing in one go:
import ast
from collections import defaultdict
import sys
import tokenize


def read_file(filename):
    with tokenize.open(filename) as fd:
        return fd.read()


class BaseChecker(ast.NodeVisitor):
    def __init__(self):
        self.violations = []

    def check(self, paths):
        for filepath in paths:
            self.filename = filepath
            tree = ast.parse(read_file(filepath))
            self.visit(tree)

    def report(self):
        for violation in self.violations:
            filename, lineno, msg = violation
            print(f"{filename}:{lineno}: {msg}")


if __name__ == '__main__':
    files = sys.argv[1:]
    checker = <CHECKER_NAME>()
    checker.check(files)
    checker.report()
Most of the code works the same way as we saw in the previous example, except that:
- We use a helper function, `read_file`, to read the contents of the given file.
- The `check` method, instead of tokenizing, reads the contents of all the file paths one by one and parses each one’s AST using the `ast.parse` function. It then uses the `visit` method to visit the top-level node (an `ast.Module`) and, thereby, all of its children nodes recursively. It also sets the value of `self.filename` to the current file being analyzed, so that we can add the filename to the error message when we find a violation later.

You might notice that there are a couple of unused imports; they’ll be used later on. Also, the placeholder `<CHECKER_NAME>` needs to be replaced with the actual name of the checker class when running the code.
For the entire ready-to-run code for all checkers in this post, see this GitHub Gist.
Detecting usage of `list()`

It is advised to use an empty literal `[]` instead of `list()` for an empty list, because the latter tends to be slower: the name `list` must be looked up in the global scope before calling it. Also, it might result in a bug in case the name `list` is rebound to another object.
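You can observe the extra name lookup yourself with the standard `dis` module (a quick illustration I’m adding here, not part of the checker we’re about to write):

import dis

# `list()` compiles to a global name lookup followed by a call...
dis.dis("list()")

# ...whereas `[]` compiles to a single BUILD_LIST instruction.
dis.dis("[]")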
In an AST, a call to `list()` resides as an `ast.Call` node. Thus, we start with defining the `visit_Call` method for our new `ListDefinitionChecker` class:
class ListDefinitionChecker(BaseChecker):
    msg = "usage of 'list()' detected, use '[]' instead"

    def visit_Call(self, node):
        name = getattr(node.func, "id", None)
        if name and name == list.__name__ and not node.args:
            self.violations.append((self.filename, node.lineno, self.msg))
Here’s briefly what we’re doing:
- On visiting a `Call` node, we first try to get the name of the function being called.
- We then check if this name is the same as `list.__name__`, i.e. a call of the form `list(...)` is being made.
- Finally, we check that no arguments were passed to the `list` function, i.e. the call being made is indeed `list()`. If so, we flag this line by adding an issue.

Running this file on some example code (ensure that you have updated the `<CHECKER_NAME>` in the boilerplate to `ListDefinitionChecker`):
def build_herd():
    herd = list()
    for a_sheep in sheep:
        herd.append(a_sheep)
    return Herd(herd)
example.py:2: usage of 'list()' detected, use '[]' instead
Detecting too many nested for loops

“For loops” which are nested more than 3 levels deep are unpleasant to look at, difficult for the brain to comprehend, and a headache to maintain, at the very least.

Thus, let’s write a check to detect whenever more than 3 levels of nested for loops are encountered.
Here’s what we’d do: we begin counting as soon as an `ast.For` node is encountered. We also mark this node as a ‘parent’ node. We then check if any of its children are also `ast.For` nodes. If yes, we increment the count and repeat the same procedure for the child node again.
class TooManyForLoopChecker(BaseChecker):
    msg = "too many nested for loops"

    def visit_For(self, node, parent=True):
        if parent:
            self.current_loop_depth = 1
        else:
            self.current_loop_depth += 1

        for child in node.body:
            if type(child) == ast.For:
                self.visit_For(child, parent=False)

        if parent and self.current_loop_depth > 3:
            self.violations.append((self.filename, node.lineno, self.msg))
            self.current_loop_depth = 0
The workflow might look a little convoluted at first, but here’s basically what we’re doing:
- When the `visit` method is called (from the `BaseChecker` class), it starts looking for any `ast.For` nodes in the AST. As soon as it finds one, it calls the method `visit_For` with the default keyword argument `parent=True`.
- We use `parent` as a flag to track the outermost loop, in which case we initialize `self.current_loop_depth` to 1; else, we just increment its value by 1.
- We then iterate over the node’s body, looking for any children which are also `ast.For` nodes. If we find one, we call `visit_For` on it with `parent=False`.
- Finally, when control returns to the outermost loop, we check whether the recorded depth exceeds 3; if it does, we flag the line and reset the counter.

Let’s run our script on some examples:
for _ in range(10):
    for _ in range(5):
        for _ in range(3):
            for _ in range(1):
                print("Baa, Baa, black sheep")

for _ in range(4):
    for _ in range(3):
        print("Have you any wool?")

for _ in range(10):
    for _ in range(5):
        for _ in range(3):
            if True:
                for _ in range(3):
                    print("Yes, sir, yes, sir!")
example.py:1: too many nested for loops
Did you notice the caveat here? If the nested for loop is not a direct child of the parent loop, it is never visited, and hence not reported. However, getting our code to work on that edge case is nuanced, and is out of scope for this post.
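That said, here’s a rough sketch of how that edge case could be handled (my own addition, not from the original post): instead of looking only at a loop’s direct body, compute the deepest chain of `for` loops among all of its descendants with `ast.iter_child_nodes`:

class DeepForLoopChecker(BaseChecker):
    msg = "too many nested for loops"

    def _depth(self, node):
        # Deepest chain of for loops at or below `node`, counting loops
        # hidden inside `if`/`with`/`try` blocks as well.
        child_depth = max(
            (self._depth(child) for child in ast.iter_child_nodes(node)),
            default=0,
        )
        return child_depth + (1 if isinstance(node, ast.For) else 0)

    def visit_For(self, node):
        if self._depth(node) > 3:
            self.violations.append((self.filename, node.lineno, self.msg))
        # Deliberately not recursing into children: inner loops are
        # already counted as part of this chain.

On the example above, this version would flag the loop on line 11 (the third block) as well.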
Detecting unused imports

Detecting unused imports is different from the previous cases because we can’t flag violations immediately while visiting a node; we don’t yet have complete information about which names will be used throughout the module. Therefore, we implement this analyzer in two passes:
- In the first pass, we visit all the import nodes (`ast.Import`, `ast.ImportFrom`), collecting the names of all the modules which have been imported.
- In the second pass, we collect all the names used in the module by visiting all nodes of type `ast.Name`.

class UnusedImportChecker(BaseChecker):
    def __init__(self):
        self.import_map = defaultdict(set)
        self.name_map = defaultdict(set)

    def _add_imports(self, node):
        for import_name in node.names:
            # Store only the top-level module name ("os.path" -> "os").
            # We can't easily detect when "os.path" is used.
            name = import_name.name.partition(".")[0]
            self.import_map[self.filename].add((name, node.lineno))

    def visit_Import(self, node):
        self._add_imports(node)

    def visit_ImportFrom(self, node):
        self._add_imports(node)

    def visit_Name(self, node):
        # We only add those nodes from which a value is being read.
        if isinstance(node.ctx, ast.Load):
            self.name_map[self.filename].add(node.id)

    def report(self):
        for path, imports in self.import_map.items():
            for name, line in imports:
                if name not in self.name_map[path]:
                    print(f"{path}:{line}: unused import '{name}'")
- Whenever an `Import` or `ImportFrom` node is encountered, we store its name in a set.
- We then visit all the `ast.Name` nodes: for each such node, we check if a value is being read from it, which implies that a reference to an already existing name is being made rather than a new object being created (if it is an import name, it has to exist already). If yes, we add the name to the set of used names.
- The `report` method traverses the list of all the import names in a file and checks if they’re present in the set of used names. If not, it prints an error message reporting the violation.

Let’s go ahead and run this script on a few examples:
import antigravity
import os.path.join
import sys
import this
tmpdir = os.path.join(sys.path[0], 'tmp')
example.py:1: unused import 'antigravity'
example.py:4: unused import 'this'
Please note that for the sake of brevity, I went with the simplest version of the code possible. A side effect of this choice is that our code doesn’t handle some tricky corner cases (e.g., when imports are aliased with `import foo as bar`, or when a name is read from the `locals()` dict, etc.).
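For instance, handling aliased imports could look roughly like this (again, a sketch of mine rather than part of the post): each `ast.alias` entry in `node.names` has an `asname` attribute, which holds the bound name for aliased imports and is `None` otherwise.

    def _add_imports(self, node):
        for import_name in node.names:
            # For `import foo as bar`, the name bound in the module is
            # `bar` (the alias), not `foo`.
            name = import_name.asname or import_name.name.partition(".")[0]
            self.import_map[self.filename].add((name, node.lineno))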
Phew! That was a whole lot of things to wrap your head around. But our reward is that the next time we identify a pattern of bug-causing code, we can go right ahead and write a script to automatically detect it.
[1] Early analyzers worked by flagging calls to functions like `strcpy()` that were easy to misuse and should have been inspected as part of a manual source code review.