Writing clean, maintainable code is an art and a science---it is a challenging skill that is difficult to learn except through experience. We have collected a number of coding examples that demonstrate good coding practices.
Functions (and classes) that are used by several modules should be placed in a separate module, typically called a “utility” module. Sometimes, even when a function is only used in a single location, it should still be extracted into a utility module. In this case study, we consider when it makes sense to split out single-use functions into a utility module and when it should be left near its invocation.
Here is a small utility module that is part of a html parser codebase consisting of a set of scripts used to extract information from large html pages.
'''
parse_relations.py
Convenience functions for common relations in the DICOM HTML.
Many of these relations are obscure-looking nested accesses,
but they are consistent relations across the HTML source.
'''
def table_rows(table_div):
return table_div.find('tbody').find_all('tr')
def table_name(table_div):
return table_div.p.strong.get_text()
def table_id(table_div):
return table_div.a.get('id')
def table_description(table_div):
return table_div.parent.find('p')
def section_div_from_id(id_div):
return id_div.parent.parent.parent.parent.parent
def figure_div_from_id(id_div):
return id_div.parent
Here is a fragment from a parser script that uses two of the utilities functions from parse_relations.py
.
'''
extract_sections.py
'''
import sys
import re
import os
from bs4 import BeautifulSoup
from parse_lib import parse_html_file, write_pretty_json
from parse_relations import section_div_from_id, figure_div_from_id
REFERENCED_IDS_RE = re.compile(r'(sect.*)|(figure.*)|(biblio.*)|(table.*)|(note.*)')
def extract_section_ids(standard):
return {page: referenced_id_anchors(html) for page, html in standard.items()}
def referenced_id_anchors(html):
return html.find_all('a', attrs={'id': REFERENCED_IDS_RE})
def section_html_from_id_anchor(sect_id_anchor):
if re.match(r'sect.*', sect_id_anchor['id']):
return section_div_from_id(sect_id_anchor)
elif re.match(r'(biblio.*)|(table.*)|(note.*)|(figure.*)', sect_id_anchor['id']):
return figure_div_from_id(sect_id_anchor)
else:
raise Exception(sect_id_anchor.parent + " didn't match a known pattern.")
def normalize_sections(all_sections):
return {section['id']: str(section_html_from_id_anchor(section)) for section in all_sections}
if __name__ == '__main__':
standard = {os.path.basename(f): parse_html_file(f) for f in sys.argv[1:]}
section_ids = extract_section_ids(standard)
sections = {page: normalize_sections(html) for page, html in section_ids.items()}
write_pretty_json(sections)
The functions section_div_from_id
and figure_div_from_id
are only used once in the entire codebase. Do you think that these functions belong in the parse_relations.py
utility module, or in extract_sections.py
where they are used? Explain your reasoning.
As we already mentioned, functions and classes that are used in multiple separate modules should be placed in a utility module. Doing so:
Here are some reasons why it could make sense to place even a single-use function in a utility module:
If you can think of other reasons besides these, let us know.
On the other hand, splitting code into a separate module can make the code harder to read. If there is not a clean abstraction, splitting the code in two forces the developer to jump back and forth between multiple files unnecessarily.
In this example, section_div_from_id
and figure_div_from_id
do not have a clean abstraction separating it from the rest of the code in extract_sections.py
. Both bits of code rely on the same underlying HTML structure. A developer reading extract_sections.py
will likely want to know what these two functions do. The functions parse_html_file
and write_pretty_json
are examples of good abstractions—a developer reading extract_sections.py
doesn’t need to worry or care about how they work.
Although it may not be apparent from the provided snippets, section_div_from_id
and figure_div_from_id
are unlikely to ever be needed in other parts of the codebase, now or in the future.
Note that parse_relations.py
contains functions for walking parse trees. This similarity is a somewhat superficial, perhaps like placing all log-statement-generating functions in a single module. Perhaps one argument that could be made for such a grouping is that it would make it easier to change HTML parser libraries. In this example, however, parser code is spread throughout the code base so this would not be a good justification.
In conclusion, these two functions should not be placed in a utility module—they should stay close to where they are used. Placing them in a utility module makes the code harder to read.