Object-oriented programming for scientists

22 Mar 2015

Sandcastle — Using a bucket to create sandcastles. That is what object-oriented programming is all about.

Introduction

For anyone not familiar with object-oriented programming it can sometimes come across as something mysterious that is used by expert coders. Indeed, any respectable text book on object-oriented programming will try to overwhelm the reader with concepts such as “abstraction”, “encapsulation”, “inheritance” and “polymorphism”.

However, object-oriented programming is not that difficult and can be very useful when dealing with complex data structures. In this post I will illustrate some object-oriented principles using a bioinformatics example, the parsing of FASTA files.

The code will be written in Python as I like it, it has built-in support for object-oriented programming and its syntax is relatively easy to understand.

An example using procedural programming

To set the scene let us write some code using procedural programming to parse the example.fasta file below.

>sp|O76074|PDE5A_HUMAN cGMP-specific 3',5'-cyclic phosphodiesterase OS=Homo sapiens GN=PDE5A PE=1 SV=2
MERAGPSFGQQRQQQQPQQQKQQQRDQDSVEAWLDDHWDFTFSYFVRKATREMVNAWFAE
RVHTIPVCKEGIRGHTESCSCPLQQSPRADNSAPGTPTRKISASEFDRPLRPIVVKDSEG
TVSFLSDSEKKEQMPLTPPRFDHDEGDQCSRLLELVKDISSHLDVTALCHKIFLHIHGLI
SADRYSLFLVCEDSSNDKFLISRLFDVAEGSTLEEVSNNCIRLEWNKGIVGHVAALGEPL
NIKDAYEDPRFNAEVDQITGYKTQSILCMPIKNHREEVVGVAQAINKKSGNGGTFTEKDE
KDFAAYLAFCGIVLHNAQLYETSLLENKRNQVLLDLASLIFEEQQSLEVILKKIAATIIS
FMQVQKCTIFIVDEDCSDSFSSVFHMECEELEKSSDTLTREHDANKINYMYAQYVKNTME
PLNIPDVSKDKRFPWTTENTGNVNQQCIRSLLCTPIKNGKKNKVIGVCQLVNKMEENTGK
VKPFNRNDEQFLEAFVIFCGLGIQNTQMYEAVERAMAKQMVTLEVLSYHASAAEEETREL
QSLAAAVVPSAQTLKITDFSFSDFELSDLETALCTIRMFTDLNLVQNFQMKHEVLCRWIL
SVKKNYRKNVAYHNWRHAFNTAQCMFAALKAGKIQNKLTDLEILALLIAALSHDLDHRGV
NNSYIQRSEHPLAQLYCHSIMEHHHFDQCLMILNSPGNQILSGLSIEEYKTTLKIIKQAI
LATDLALYIKRRGEFFELIRKNQFNLEDPHQKELFLAMLMTACDLSAITKPWPIQQRIAE
LVATEFFDQGDRERKELNIEPTDLMNREKKNKIPSMQVGFIDAICLQLYEALTHVSEDCF
PLLDGCRKNRQKWQALAEQQEKMLINGESGQAKRN
>sp|Q9Y233|PDE10_HUMAN cAMP and cAMP-inhibited cGMP 3',5'-cyclic phosphodiesterase 10A OS=Homo sapiens GN=PDE10A PE=1 SV=1
MRIEERKSQHLTGLTDEKVKAYLSLHPQVLDEFVSESVSAETVEKWLKRKNNKSEDESAP
KEVSRYQDTNMQGVVYELNSYIEQRLDTGGDNQLLLYELSSIIKIATKADGFALYFLGEC
NNSLCIFTPPGIKEGKPRLIPAGPITQGTTVSAYVAKSRKTLLVEDILGDERFPRGTGLE
SGTRIQSVLCLPIVTAIGDLIGILELYRHWGKEAFCLSHQEVATANLAWASVAIHQVQVC
RGLAKQTELNDFLLDVSKTYFDNIVAIDSLLEHIMIYAKNLVNADRCALFQVDHKNKELY
SDLFDIGEEKEGKPVFKKTKEIRFSIEKGIAGQVARTGEVLNIPDAYADPRFNREVDLYT
GYTTRNILCMPIVSRGSVIGVVQMVNKISGSAFSKTDENNFKMFAVFCALALHCANMYHR
IRHSECIYRVTMEKLSYHSICTSEEWQGLMQFTLPVRLCKEIELFHFDIGPFENMWPGIF
VYMVHRSCGTSCFELEKLCRFIMSVKKNYRRVPYHNWKHAVTVAHCMYAILQNNHTLFTD
LERKGLLIACLCHDLDHRGFSNSYLQKFDHPLAALYSTSTMEQHHFSQTVSILQLEGHNI
FSTLSSSEYEQVLEIIRKAIIATDLALYFGNRKQLEEMYQTGSLNLNNQSHRDRVIGLMM
TACDLCSVTKLWPVTKLTANDIYAEFWAEGDEMKKLGIQPIPMMDRDKKDEVPQGQLGFY
NAVAIPCYTTLTQILPPTEPLLKACRDNLSQWEKVIRGEETATWISSPSVAQKAAASED

The aim is to find and print out the FASTA record with the UniProt identifier Q9Y233 (the second entry). The code below achieves this using procedural programming.

with open('example.fasta') as fh:
    match = False
    for line in fh:
        line = line.strip()  # Remove newline at the end of the line.
        if line.startswith('>'):
            # We have encountered a description line.
            # That means the start of a new FASTA record.
            if line.find('Q9Y233') != -1:
                # We have matched our search criteria.
                match = True
            else:
                # We have encountered a new entry and it does
                # not match the search criteria.
                match = False
        if match:
            # We are currently in a section of the FASTA file
            # that matches our search criteria.
            print(line)

It is worth noting that I felt the need to add quite a few comments to explain what was going on, a sign that everything is not as clear as it should be. However, on the whole the code does a job and it works.

Now imagine that you wanted to add a filter based on the length of the sequence. Is it immediately obvious what you would do? How can you ensure that the code remains understandable?

Object-oriented programming to the rescue

Object-oriented programming is all about grouping data and functionality together. This allows one to abstract away some of the complexities of the processing logic and to encapsulate the data.

Let us start by creating an object representing a FASTA record. Save the code below to a file named fasta.py.

class FastaRecord(object):
    """Class representing a FASTA record."""

    def __init__(self, description_line):
        """Initialise an instance of the FastaRecord class."""
        self.description = description_line.strip()
        self.sequences = []

    def add_sequence_line(self, sequence_line):
        """
        Add a sequence line to the FastaRecord instance.
        This function can be called more than once.
        """
        self.sequences.append( sequence_line.strip() )

    def __repr__(self):
        """Representation of the FastaRecord instance."""
        lines = [self.description,]
        lines.extend(self.sequences)
        return '\n'.join(lines)

There are a few things to note in the code above. Particularly if you are new to object-oriented programming and/or Python.

First of all we inherit functionality from the base class object (the first line). This is kind of historical where in Python 2.1 “new style” classes were added. To remain backwards compatible with the “classic” or “old style” classes it was decided that one would have to inherit from object to access the goodness of the new style class. There are more details on the Python wiki.

Secondly, we make use of the “magic” method __init__. This is used to create an instance of a class.

Classes, objects, instances, what is up with all this terminology? What does it all mean?

Okay, let us take a slight detour. You can think of classes as moulds, for example a plastic bucket that you bring to the beach to make a sand castle. You fill the bucket with sand and tip it up-side down, pat it on the top and lift it up. What remains is a tower made out of sand. This sand castle is an “instance” of your bucket “class”. Finally, the term “object”, as in object-oriented programming, tends to be used to refer to classes and instances interchangeably.

Back to the __init__ method, which is used to initialise an instance of the class. The instance created is accessible via the self argument. During the initialisation of the FastaRecord class we also provide the description line.

>>> from fasta import FastaRecord
>>> fasta_record = FastaRecord('>sp|O76074|PDE5A_HUMAN')

Note that the fasta_record variable above is an instance of the FastaRecord class. We can access the description attribute of the FastaRecord instance directly.

>>> fasta_record.description
'>sp|O76074|PDE5A_HUMAN'

The add_sequence_line method simply adds a sequence line to the sequences (list) attribute.

>>> fasta_record.add_sequence_line('MERAGPSFGQQRQQQQPQQQKQQQRDQDSVEAWLDDHWDFTFSYFVRKATREMVNAWFAE')
>>> fasta_record.add_sequence_line('RVHTIPVCKEGIRGHTESCSCPLQQSPRADNSAPGTPTRKISASEFDRPLRPIVVKDSEG')

Finally, we have the “magic” __repr__ method. At this point you are probably screaming out loud, what is a “magic” method? A “magic” method is basically a way to make an object behave like a built-in Python object. For example the __repr__ method is used to describe how the instance should be represented. Let us illustrate this below.

>>> fasta_record
>sp|O76074|PDE5A_HUMAN
MERAGPSFGQQRQQQQPQQQKQQQRDQDSVEAWLDDHWDFTFSYFVRKATREMVNAWFAE
RVHTIPVCKEGIRGHTESCSCPLQQSPRADNSAPGTPTRKISASEFDRPLRPIVVKDSEG

For more information on “magic” methods have a look at Rafe Kettler’s blog post A Guide to Python’s Magic Methods.

A FASTA parser object

Now that we have a basic class for working with FASTA records let us create another class for parsing FASTA files.

class FastaParser(object):
    """Class for parsing FASTA files."""

    def __init__(self, fpath):
        """Initialise an instance of the FastaParser."""
        self.fpath = fpath

    def __iter__(self):
        """Yield FastaRecord instances."""
        fasta_record = None
        with open(self.fpath, 'r') as fh:
            for line in fh:
                if line.startswith('>'):
                    if fasta_record:
                        yield fasta_record
                    fasta_record = FastaRecord(line)
                else:
                    fasta_record.add_sequence_line(line)
        yield fasta_record

In the example above I have used the __iter__ magic method. This basically defines the behaviour the class should display when called as an iterator. In this particular case we want it to yield FastaRecord instances as the FASTA file is parsed.

>>> from fasta import FastaParser
>>> fasta_parser = FastaParser('example.fasta')
>>> for fasta_record in fasta_parser:
...     print(fasta_record.description)
...
>sp|O76074|PDE5A_HUMAN cGMP-specific 3',5'-cyclic phosphodiesterase OS=Homo sapiens GN=PDE5A PE=1 SV=2
>sp|Q9Y233|PDE10_HUMAN cAMP and cAMP-inhibited cGMP 3',5'-cyclic phosphodiesterase 10A OS=Homo sapiens GN=PDE10A PE=1 SV=1

Back to grouping data and functionality

At this point we could write a simple script to loop over the FASTA records and find the hits of interest. However, where should we add the logic for finding hits of interest?

I would argue that this is a great opportunity for abstracting away the logic of identifying a hit by putting it in the FastaRecord class itself. Let us extend the class to do this.

class FastaRecord(object):
    """Class representing a FASTA record."""

    def __init__(self, description_line):
        """Initialise an instance of the FastaRecord class."""
        self.description = description_line.strip()
        self.sequences = []

    def add_sequence_line(self, sequence_line):
        """
        Add a sequence line to the FastaRecord instance.
        This function can be called more than once.
        """
        self.sequences.append( sequence_line.strip() )

    def matches(self, search_term):
        """Return True if the search_term is in the description."""
        return self.description.find(search_term) != -1

    def __repr__(self):
        """Representation of the FastaRecord instance."""
        lines = [self.description,]
        lines.extend(self.sequences)
        return '\n'.join(lines)

Note the addition of the matches method above. Also, note that the addition of more functionality did not make the code any more difficult to understand.

It is now trivial to write a script to do the analysis that we want.

from fasta import FastaParser

for fasta_record in FastaParser('example.fasta'):
    if fasta_record.matches('Q9Y233'):
        print(fasta_record)

Compare the descriptiveness of this code to that of the procedural example at the beginning of this post.

But you had to write so much more code to get to this point, is it really worth it?

I go back to the scenario outlined earlier in this post. Imagine that you had to extend the logic of the code to be able to filter based on the length of the sequence. Which code base would you rather use as a starting point? If you are unsure, try adding this functionality to both code bases to find out which one is more extensible.

Try to avoid re-inventing the wheel

The point of this post was to illustrate object-oriented programming, not to re-invent the wheel. I used the example of parsing FASTA files in this post as they are widely used in biological research and are conceptually easy to understand. However, if you are serious about using Python for bioinformatics I suggest that you check out Biopython.

Conclusion

Object-oriented programming can be very useful when dealing with complex data structures. In particular it can be used to hide complexity by grouping data and functionality together.

Furthermore, it can make your code more understandable and extensible.

Finally, do not let your lack of knowledge about “polymorphism” and “inheritance” hold you back from making use of objects. Yes, these are interesting topics, and please do read up on them. However, they are not essential to your use of object-oriented programming (at least not in Python).

I hope you find this post useful and that it has encouraged you to try out object-oriented programming. Send me a message if you need any help.