
The Story of PIH Part I: The Problem

First published on: 26 January 2016 by J. David Giese

One of the things I love about being a computer programmer is that when something becomes boring or tedious, you can usually automate or abstract it. This is because the same qualities that make a task boring also make it easy to automate or abstract. There are many examples of this in the software world.

Software development tools can also automate mundane tasks. Integrated Development Environments (IDEs) are a prime example.

I don’t use an IDE when programming; instead I prefer to combine Vim with Unix command line tools. There are certainly advantages and disadvantages to this approach. One of the disadvantages is the lack of good command line refactoring tools.

In this series of posts, I describe how I wrote a command line tool for refactoring Python code. This first post focuses on my initial research and subsequent definition of the problem.

Refactoring Imports

A common refactoring task is to move classes and functions from one module to another. This typically occurs when a module has grown too large over time and needs to be split up into several smaller modules. Imagine you move a function c from module a.b to a.d. Every file that imports c must now change its import statement from “from a.b import c” to “from a.d import c”. And there may be a lot of files!

Other import-related refactors include cleaning out unused imports, removing duplicate imports, and adding in missing imports. Hence I would like a command line tool that can:

  1. Move the import location of Python objects
  2. Clean out unused imports
  3. Clean out duplicate imports
  4. Add in missing imports

The first three tasks are relatively simple compared to the last one, because adding missing imports requires knowledge of your full project environment beyond just the files you want to refactor. Hence, for my first pass, I am only attempting to create a tool that accomplishes the first three tasks.

I should note that IDEs like PyCharm and Eclipse can perform all of these refactors out of the box.

Existing Tools

Coding is fun, and side projects are always a great way to learn new things, but I would rather work on something new than rebuild an existing tool. So I started researching existing tools. After some digging, I was only able to find one existing Python refactoring library, called Rope. I also found a Vim plugin which uses Rope to add refactoring functionality to Vim.

In the end I decided the Rope-Vim plugin wasn’t what I was looking for. The primary reason is that I do not think it is appropriate to try to turn Vim into a full-fledged IDE. I went down that road in graduate school, when I attempted to make a Vim plugin that integrates well with IPython. In the end it was a failure for a few reasons. First, Vim is not written for such tight integrations (NeoVim may change this). Second, and more fundamentally, the UI of a terminal is better suited for running separate commands than for behaving like an IDE. There are certainly Vim plugins like Fugitive and NERDTree that are awesome and deal with the limitations of Vim’s UI, but I do think it is a fundamental limitation of Vim. Finally, and most importantly, I am a believer in the Unix philosophy of “do one thing and do it well”. IDEs have the opposite philosophy of “do everything”.

The next question I had to answer was: can I use Rope as a core dependency for my command line tool? After testing out how Rope works, I decided that unfortunately it would not suit my purposes in this way either. Besides not working with newer versions of Python such as 3.5 (and having a quiet pulse), the API of Rope is really designed to work within an IDE. Without getting into details, it seemed like it would require a lot of work, and a deep understanding of Rope’s internals, to make it work for my purposes (e.g. Rope creates a cache database and some other files to represent your project, none of which I would want to include).

All of that aside, Rope looks like a great project, and looking through the source code definitely taught me quite a bit about the problem I was trying to solve.

The Python Import Helper

I decided to call my tool the Python Import Helper, or PIH, and settled on the following command line interface for the tool:

pih(1)

NAME
    pih - refactor python import statements

SYNOPSIS
    pih [-uad] [-m before after] file ...
    pih [-uad] [-m before after]

DESCRIPTION
    pih refactors import statements in the python source code files.  It
    can remove unused imports, remove duplicate imports, and update import
    statements to reflect the new location of global python objects or
    modules.

    pih expects all the provided files to be syntactically valid python
    source code.

    If no source files are provided, it will read from standard input,
    and direct the refactored source code to standard output.
    
OPTIONS
    -u    Remove unused imports unless there is a comment on the same
          line as the import statement.

    -a    Remove all unused or duplicate imports, even if there are
          comments on the same line as the import statement.

    -d    Remove duplicate imports unless there is a comment on the
          same line as the import statement.  When an object is imported
          multiple times in a file, only the last import is kept, as
          determined by the position in the file.

    -m  before after
          Update import statements to reflect a change in a module or
          global object's location.  Locations are specified as
          period-separated python identifiers that are resolved relative
          to the current working directory.  The final identifier may
          either be a module, or a global object in a module.

EXAMPLES
    o   Remove all duplicate and unused imports in your project:

          $ pih -du project/**/*.py

    o   Update import statements after renaming function from c to d in
        a/b.py:

          $ pih -m a.b.c a.b.d project/**/*.py

    o   Update import statements after moving a/b.py to a/x.py:

          $ pih -m a.b a.x project/**/*.py

    o   Use pih as a filter to clear out unused imports:

          $ cat file.py | pih -u > file.tmp && mv file.tmp file.py
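
As a rough illustration of how this synopsis could map onto an argument parser, here is a minimal sketch using Python's standard argparse module. It is not PIH's actual implementation; the function name build_parser and the attribute name files are assumptions made for this example, and the help strings are paraphrased from the man page above.

```python
import argparse


def build_parser():
    """Build a parser for the synopsis: pih [-uad] [-m before after] file ..."""
    parser = argparse.ArgumentParser(
        prog="pih", description="refactor python import statements"
    )
    parser.add_argument("-u", action="store_true",
                        help="remove unused imports (unless commented)")
    parser.add_argument("-a", action="store_true",
                        help="remove all unused or duplicate imports")
    parser.add_argument("-d", action="store_true",
                        help="remove duplicate imports (unless commented)")
    # -m takes exactly two values: the old location and the new location.
    parser.add_argument("-m", nargs=2, metavar=("before", "after"),
                        help="update imports after moving a module or object")
    # Zero files means "read standard input, write standard output".
    parser.add_argument("files", nargs="*", metavar="file")
    return parser


args = build_parser().parse_args(
    ["-du", "-m", "a.b.c", "a.b.d", "x.py", "y.py"]
)
print(args)
```

Combined short flags such as -du are handled by argparse itself, and an empty args.files list would be the signal to fall back to filter mode.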

Comparison to an IDE

Now that the specification for my command line tool is set, it is worth noting a few differences between how an IDE and PIH will be used.

PIH must be executed manually; an IDE will usually highlight unused and duplicate imports as you type. Some IDEs will also add missing imports as you type! Refactoring import locations, however, is always manually executed.

PIH will probably be less efficient; this is because it will have to tokenize and parse all of the files every time you execute the tool. An IDE will be able to keep these structures in memory and perform caching of some sort.

PIH (because it is not integrated with your text editor) is not able to actually move the global object or module to its new location; in other words, it can refactor the import statements, but not actually move the imported object. This step will need to be performed manually in a separate text editor. An IDE, because it includes a text editor, can perform this step itself.

The advantage of the command line tool is simplicity. The tool doesn’t need to know anything about the project, and it has no state. It can also be used independently or combined with other tools.

Finally, in an IDE you will probably be prompted with all of the locations that you may want to change, and you will be able to pre-screen them for errors or places where you may not want to apply the refactor. PIH won’t let you do this, although assuming you are using a source control tool you should be able to see the changes quickly, and revert any changes you didn’t intend to make.

Thus, to summarize, I would say PIH is less integrated than an integrated development environment! Several other tools would probably need to be used in addition to PIH to perform the refactor.

| | IDE | PIH |
| --- | --- | --- |
| Execution Time | Automatic as you edit the code | Must be executed manually (e.g. using a Vim filter) |
| Project Structure | Configured when project is set up | Files must be selected using a Bash glob or a tool like find |
| Moving the Object | Custom UI for performing this step | A separate text editor must be used (e.g. Vim or Emacs) |
| Accept/Reject Changes | IDE has special UI for this | Must be handled using another tool (e.g. git diff) |

First Attempt: Regular Expression

At first, I was thinking I would use regular expressions to detect import statements, and a bunch of text-rewriting tools to then apply the changes. This proved difficult for a number of reasons. Even a simple import statement can take many forms:

import a

from a import b

from a import b as c

from a import (b, c, d, e)

from a import (b, c, d, e,
    f, h)

from a import b, c, d, e, \
    f, h

from a import b, c  # comment
    
from a import b, c, d, \
    e, f  # comment

from a import b; from c import d

from a import \
b, \
c, \
d

# and so on.

You also need to be able to ignore comments that look like imports:

#from a import b

and you need to be able to detect imports that are not at the top of a module, such as imports inside a function body (which are sometimes necessary to avoid circular imports).

So it was starting to look like regular expressions were not a good way to go if I wanted the tool to be robust.
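
To make the difficulty concrete, here is a minimal sketch of what a line-based pattern does with a few of the forms above. The pattern NAIVE_IMPORT and the sample SOURCE are invented for this illustration and are not taken from PIH; the import-like line inside a string literal is an extra case added here, in the same spirit as the commented-out import.

```python
import re

# A deliberately naive pattern: assumes one import statement per line.
NAIVE_IMPORT = re.compile(
    r"^\s*(?:from\s+[\w.]+\s+)?import\s+.+$", re.MULTILINE
)

SOURCE = '''\
import a
from a import (b, c, d, e,
    f, h)
from a import b; from c import d
text = """
import not_really
"""
'''

matches = [m.group(0) for m in NAIVE_IMPORT.finditer(SOURCE)]
for line in matches:
    print(repr(line))
```

The pattern reports four matches. The parenthesized import is cut off at the end of its first line, the two statements separated by a semicolon come back as a single match, and the line inside the string literal is reported as if it were a real import. Each of these can be patched, but every patch makes the pattern harder to trust.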

Using a Parser

Naturally, the next step was to use a full-on parser so that I could modify the abstract syntax tree itself. Fortunately, Python includes a parser as part of the standard library. Besides saving me a lot of work, this is great because it means that I can use an appropriate parser for whichever version of Python is currently available on the path.
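
For illustration, the standard library's ast module can locate every import statement regardless of how it is laid out. The sketch below is an assumption about how such a lookup might be written, not PIH's actual code; the helper name find_imports and the sample SOURCE are hypothetical.

```python
import ast

SOURCE = '''\
import a
#from a import b
from a import b as c, d  # comment
from a import (e,
    f)

def late():
    from g import h  # nested import
    return h
'''


def find_imports(source):
    """Return (line, module, name, alias) for every import in the source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.append((node.lineno, None, alias.name, alias.asname))
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                found.append(
                    (node.lineno, node.module, alias.name, alias.asname)
                )
    # ast.walk does not visit nodes in source order, so sort by line number.
    return sorted(found, key=lambda item: item[0])


imports = find_imports(SOURCE)
for entry in imports:
    print(entry)
```

The commented-out import on line 2 is ignored, the multi-line parenthesized import is reported as a single statement starting on line 4, and the import nested inside the function is found on line 8.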

After switching PIH to use the abstract syntax tree of the source modules, I was quickly able to build out the unused import and duplicate import features. However, I soon realized there was another large problem with my approach: even though I could apply my refactors to the abstract syntax tree quite easily, I had no reliable way of applying these changes back to the original source files while preserving whitespace and comments!
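
The problem is easy to reproduce. The following sketch removes one import from a syntax tree and then regenerates source code from it. It uses ast.unparse, which only exists in Python 3.9 and later and so was not available when this post was written; the DropImport class and the sample SOURCE are hypothetical and are not how PIH works.

```python
import ast

SOURCE = '''\
import os  # needed on Windows
import sys

from a import (b,
    c)   # keep this layout
'''


class DropImport(ast.NodeTransformer):
    """Remove plain `import <name>` statements for one given name."""

    def __init__(self, name):
        self.name = name

    def visit_Import(self, node):
        node.names = [n for n in node.names if n.name != self.name]
        # Returning None deletes the statement when nothing is left.
        return node if node.names else None


tree = DropImport("sys").visit(ast.parse(SOURCE))
regenerated = ast.unparse(tree)  # requires Python 3.9+
print(regenerated)
```

The refactor itself works: the sys import is gone. But the regenerated text has also lost both comments, the blank line, and the original line break inside the parentheses, because none of that layout is stored in the syntax tree.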

Once again, I tried a heuristic approach at first, but soon realized there were simply too many variations to deal with in a robust manner.

To Be Continued …

I will be posting a second article describing the algorithm and implementation that I ended up using to build PIH, and get around the problem of re-generating source code from abstract syntax trees.

As a sneak peek, it is based on an algorithm from this 2011 paper, An Algorithm for Layout Preservation in Refactoring Transformations. You may also enjoy reading this excellent thesis on refactoring algorithms and code analysis, Models and Algorithms for Refactoring Statements (although it is very focused on Java, and thus is less generally applicable to more dynamic and modern programming languages like Python).



