Analyzing Python Code with Python
A first dip into DIY static code analysis
One of the special things about software engineering as a profession is the possibility of ars-poetic work: part of our job is building tools that target our own work. Perhaps a few surgeons around the globe can design and forge their own scalpels, but for software engineers, building our own tooling is a day-to-day reality.
Recently, I’ve been working on migrating a large repository of code to be built with Bazel. To do this properly I had to generate well over a hundred different BUILD files. Doing so by hand would have been slow, error-prone and exhausting, so instead I opted to write a tool that would do this automatically for me.
When describing a BUILD module for Bazel, each test file must be defined as a *_test rule, which explicitly declares any dependencies it has. For example, in order to get Bazel to run a Python test, one would write something similar to:
py_test(
    name = "test_context",
    srcs = ["test_context.py"],
    main = "test_context.py",
    tags = [],
    deps = [
        ":context",
        "//libs/config",
    ],
)
The first step to writing a program that will generate this block for each test file in our repository is to figure out which of the files in our repository are test files. In the context of the repository I was working on, a test file could be defined as:
- The file has a .py extension
- The file contains a class which inherits from unittest.TestCase
- The class contains at least one method whose name begins with test_
Using ASTs for Fun and Profit
Sounds simple enough, surely something we can do with some regular expressions, right? Well, we could, but there are a lot of edge cases to consider, and wouldn’t it be great if we could see our code the same way the Python interpreter sees it?
It turns out there's an easy way to do that: the Python standard library contains a neat little module named ast, short for Abstract Syntax Trees. Wikipedia defines ASTs as:
In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code.
The ast module documentation describes it as:
The ast module helps Python applications to process trees of the Python abstract syntax grammar. The abstract syntax itself might change with each Python release; this module helps to find out programmatically what the current grammar looks like.
In software engineering, the practice of building programs that analyze another program's source code without actually executing it is called static code analysis. By using an AST parser for the language we are interested in, we transform that source code into data that we can process just like any other. Using the Python standard library's ast module, we can parse a block of Python code into a data structure which we can traverse and analyze. This will be very handy in answering whether a file contains a unit test!
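To make "source code as data" concrete, here is a minimal sketch that lists the functions and classes defined in a snippet without running it. The snippet and the variable names are made up for illustration; ast.parse and ast.walk are the standard-library calls.

```python
import ast

# Illustrative source. The top-level raise would blow up if this code were
# executed, but parsing never runs it.
SOURCE = """
import unittest

raise RuntimeError("never raised: the code is parsed, not executed")

def helper():
    pass

class Test(unittest.TestCase):
    def test_hello(self):
        pass
"""

tree = ast.parse(SOURCE)

# ast.walk yields every node in the tree in no specified order,
# so the names are sorted to get a stable result.
function_names = sorted(
    node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
)
class_names = sorted(
    node.name for node in ast.walk(tree) if isinstance(node, ast.ClassDef)
)

print(function_names)
print(class_names)
```

Note that ast.walk visits nested nodes too, which is why the method inside the class shows up next to the module-level function.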
Finding all test files in the repo
The first step is walking the filesystem to find all candidate files:
import os


def is_test_file(path) -> bool:
    # TODO: impl
    pass


def find_test_files(repo_root):
    test_files = []
    for root, dirs, files in os.walk(repo_root):
        for file in files:
            if not file.endswith(".py"):
                continue
            path = os.path.join(root, file)
            if is_test_file(path):
                test_files.append(path)
    return test_files
Next, let's see what we can do with ast:
Python 3.7.7 (default, Jul 15 2020, 21:51:02)
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> source = """
... class Test(unittest.TestCase):
...     def test_hello(self):
...         pass
... """
>>> import ast
>>> tree = ast.parse(source)
>>> print(tree)
<_ast.Module object at 0x10379c6d0>
Using this neat visualizer, this is what the python AST looks like:
So when parsing a block of source code with ast.parse we get back a tree-like data structure that has:
- A Module at its root
- With a body attribute, which is a list of elements
- In our file there is only one element: a ClassDef object with a name of Test
- The ClassDef has two attributes that interest us, body and bases
- body has a single FunctionDef (our test method definition) which has a name of test_hello
- bases is a list of the base classes our ClassDef inherits from; in our case, something that, if we squint a little, we can see is unittest.TestCase
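The same structure can be inspected directly from Python. The sketch below parses the snippet from the interpreter session above, prints a text rendering with ast.dump, and then reads the attributes described in the list. The variable names are chosen for illustration, and the exact text that ast.dump prints differs between Python versions.

```python
import ast

SOURCE = """
class Test(unittest.TestCase):
    def test_hello(self):
        pass
"""

tree = ast.parse(SOURCE)

# Text rendering of the whole tree; the format varies by Python version.
print(ast.dump(tree))

class_def = tree.body[0]     # the only element in Module.body
method = class_def.body[0]   # the only element in ClassDef.body
base = class_def.bases[0]    # the expression unittest.TestCase

print(type(class_def).__name__, class_def.name)
print(type(method).__name__, method.name)
# unittest.TestCase is an Attribute node: a Name (unittest) plus an attr (TestCase).
print(type(base).__name__, base.value.id, base.attr)
```

Had the class been written as class Test(TestCase), the base would be a plain ast.Name node with an id of TestCase instead.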
Great, there's everything we need in here to make our decision. Our final function will look like:
import ast


def is_test_file(abspath) -> bool:
    with open(abspath, "r") as f:
        data = f.read()
    source_ast = ast.parse(data)
    for node in source_ast.body:
        if isinstance(node, ast.ClassDef) and _is_testcase_class(node):
            for class_node in node.body:
                if isinstance(class_node, ast.FunctionDef) and class_node.name.startswith("test_"):
                    return True
    return False


def _is_testcase_class(classdef: ast.ClassDef) -> bool:
    for base_class in classdef.bases:
        # if the test class looks like class Test(TestCase)
        if isinstance(base_class, ast.Name):
            base_name = base_class.id
        # if the test class looks like class Test(unittest.TestCase):
        elif isinstance(base_class, ast.Attribute):
            base_name = base_class.attr
        else:
            continue
        if base_name == "TestCase":
            return True
    return False
- We read the source file into a string, then parse it with ast.parse
- We then scan the module's body looking for ast.ClassDef (class definition) nodes.
- If we find one, we check whether it inherits from unittest.TestCase by looking at the class definition's bases
- If we find a TestCase class, we iterate through that ClassDef's body looking for a FunctionDef whose name begins with test_
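One practical caveat when running this over a whole repository: ast.parse raises an exception when a file is not valid Python for the running interpreter, so a single unparsable file would abort the scan. The helper below is a hypothetical addition, not part of the original tool, sketching one way to skip such files.

```python
import ast
from typing import Optional


def parse_source_or_none(source: str) -> Optional[ast.Module]:
    """Return the parsed module, or None if the text cannot be parsed."""
    try:
        return ast.parse(source)
    except (SyntaxError, ValueError):
        # SyntaxError: not valid for the running interpreter (e.g. Python 2 code).
        # ValueError: some Python versions raise it for source containing null bytes.
        return None


valid_tree = parse_source_or_none("x = 1\n")
python2_tree = parse_source_or_none('print "hello"\n')

print(valid_tree is not None)
print(python2_tree is None)
```

A caller could treat None as "not a test file" and optionally log the path for manual review.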
Sure, we probably could have gotten OK results using a shell script with some heuristic-based regular expressions, but by using ASTs to parse the source code as the interpreter sees it, we get answers that are robust to edge cases such as odd formatting or code placed inside comments or strings. Being able to create the tools that make your job easier is one of the best things about being a software engineer, and being able to programmatically parse and analyze your own source code takes your tool-building possibilities to the next level!
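The point about code inside strings is easy to check with a small sketch. The regular expression here is a made-up heuristic of the kind a quick script might use, not something taken from a real tool, and the AST check is deliberately simplified to "is there any top-level class definition".

```python
import ast
import re

# A module whose only "test class" lives inside a string literal.
SOURCE = '''
EXAMPLE = """
class TestFake(unittest.TestCase):
    def test_nothing(self):
        pass
"""
'''

# Hypothetical heuristic: a class line that names TestCase among its bases.
TESTCASE_PATTERN = re.compile(r"^\s*class\s+\w+\(.*TestCase.*\):", re.MULTILINE)


def regex_says_test(source: str) -> bool:
    return TESTCASE_PATTERN.search(source) is not None


def ast_says_test(source: str) -> bool:
    # Only real class definitions appear as ClassDef nodes in the module body.
    tree = ast.parse(source)
    return any(isinstance(node, ast.ClassDef) for node in tree.body)


regex_flags_it = regex_says_test(SOURCE)  # fooled by the text inside the string
ast_flags_it = ast_says_test(SOURCE)      # the parser sees a single assignment

print(regex_flags_it, ast_flags_it)
```

The regex reports a test class that does not exist, while the parsed tree contains only an assignment of a string to EXAMPLE.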