Contact Us

Case: Code Proximity and Utility Modules

Writing clean, maintainable code is an art and a science---it is a challenging skill that is difficult to learn except through experience. We have collected a number of coding examples that demonstrate good coding practices.

Background

Functions (and classes) that are used by several modules should be placed in a separate module, typically called a “utility” module. Sometimes, even when a function is only used in a single location, it should still be extracted into a utility module. In this case study, we consider when it makes sense to split out single-use functions into a utility module and when it should be left near its invocation.

Problem

Here is a small utility module that is part of a html parser codebase consisting of a set of scripts used to extract information from large html pages.

'''
parse_relations.py

Convenience functions for common relations in the DICOM HTML.
Many of these relations are obscure-looking nested accesses,
but they are consistent relations across the HTML source.
'''

def table_rows(table_div):
    return table_div.find('tbody').find_all('tr')

def table_name(table_div):
    return table_div.p.strong.get_text()

def table_id(table_div):
    return table_div.a.get('id')

def table_description(table_div):
    return table_div.parent.find('p')

def section_div_from_id(id_div):
    return id_div.parent.parent.parent.parent.parent

def figure_div_from_id(id_div):
    return id_div.parent

Here is a fragment from a parser script that uses two of the utilities functions from parse_relations.py.

'''
extract_sections.py
'''

import sys
import re
import os
from bs4 import BeautifulSoup

from parse_lib import parse_html_file, write_pretty_json
from parse_relations import section_div_from_id, figure_div_from_id

REFERENCED_IDS_RE = re.compile(r'(sect.*)|(figure.*)|(biblio.*)|(table.*)|(note.*)')


def extract_section_ids(standard):
    return {page: referenced_id_anchors(html) for page, html in standard.items()}


def referenced_id_anchors(html):
    return html.find_all('a', attrs={'id': REFERENCED_IDS_RE})


def section_html_from_id_anchor(sect_id_anchor):
    if re.match(r'sect.*', sect_id_anchor['id']):
        return section_div_from_id(sect_id_anchor)
    elif re.match(r'(biblio.*)|(table.*)|(note.*)|(figure.*)', sect_id_anchor['id']):
        return figure_div_from_id(sect_id_anchor)
    else:
        raise Exception(sect_id_anchor.parent + " didn't match a known pattern.")


def normalize_sections(all_sections):
    return {section['id']: str(section_html_from_id_anchor(section)) for section in all_sections}


if __name__ == '__main__':
    standard = {os.path.basename(f): parse_html_file(f) for f in sys.argv[1:]}
    section_ids = extract_section_ids(standard)
    sections = {page: normalize_sections(html) for page, html in section_ids.items()}
    write_pretty_json(sections)

The functions section_div_from_id and figure_div_from_id are only used once in the entire codebase. Do you think that these functions belong in the parse_relations.py utility module, or in extract_sections.py where they are used? Explain your reasoning.

Solution

As we already mentioned, functions and classes that are used in multiple separate modules should be placed in a utility module. Doing so:

  • makes the utility code easier to find (how would you pick which module to place it in otherwise—any choice would seem arbitrary and illogical)
  • reduces the likelihood of circular imports
  • can make the code easier to read.

Here are some reasons why it could make sense to place even a single-use function in a utility module:

  • It is reasonably likely it will be used in another module in the future—possibly even in other codebases. Moving the function to a utility module makes it more visible to other developers. It is common to have multiple—similar but subtly different—implementations of the same function in a large code base. This extra visibility can prevent this by making the utility more discoverable.
  • There is a clear abstraction separating the utility code from the rest of the module. Splitting it out can help manage complexity because a developer reading the module has less information to keep in their head at once. Splitting the code into separate modules also will help highlight and preserve the abstraction.
  • Even though there isn’t a clean abstraction, the current module is very long. I usually don’t like this reason. If there is a large module containing a lot of very related code, it doesn’t seem like splitting it in two makes it any more readable—all it does is make the developer jump back and forth between two files. Still, there is an argument to be made in extreme cases that it can be better to have multiple files.

If you can think of other reasons besides these, let us know.

On the other hand, splitting code into a separate module can make the code harder to read. If there is not a clean abstraction, splitting the code in two forces the developer to jump back and forth between multiple files unnecessarily.

In this example, section_div_from_id and figure_div_from_id do not have a clean abstraction separating it from the rest of the code in extract_sections.py. Both bits of code rely on the same underlying HTML structure. A developer reading extract_sections.py will likely want to know what these two functions do. The functions parse_html_file and write_pretty_json are examples of good abstractions—a developer reading extract_sections.py doesn’t need to worry or care about how they work.

Although it may not be apparent from the provided snippets, section_div_from_id and figure_div_from_id are unlikely to ever be needed in other parts of the codebase, now or in the future.

Note that parse_relations.py contains functions for walking parse trees. This similarity is a somewhat superficial, perhaps like placing all log-statement-generating functions in a single module. Perhaps one argument that could be made for such a grouping is that it would make it easier to change HTML parser libraries. In this example, however, parser code is spread throughout the code base so this would not be a good justification.

In conclusion, these two functions should not be placed in a utility module—they should stay close to where they are used. Placing them in a utility module makes the code harder to read.