# Binary Ops and Gene Compression in Go

August 31, 2022

I’ve been reading Classic Computer Science Problems in Python by David Kopec to review CS algorithms.

One of the first problems is gene compression where the goal is to encode a string of nucleotides (“ACGTA…”) into binary to reduce it’s size in memory.

Compressed data is great because there’s less stuff to download and upload, while receiving the same information content.

The property of nucleotides being limited to the set of four letters ‘A’, ‘C’, ‘G’ and ‘T’ makes it suitable to encode it using simple integers - 0, 1, 2, 3. Storing numbers is more memory-efficient than storing strings, which are collections of bytes representing unique characters in UTF-8 or ASCII.

While the book implements the compression/decompression algorithm in Python, I thought it would be a good learning exercise to write it in Go and provide an overview of binary numbers.

## Binary Basics

We normally use base 10 integer values to represent numbers in most situations. However, base 10 is unwieldy when you want to model data using hardware such as switches in an on/off state. But base 2 can easily be modeled using switches, and is a natural fit for computers.

Decimal Base 2 Expansion Binary
0 0 (20) 0
1 1 (20) 1
2 1 (21) + 0 (20) 10
3 1 (21) + 1 (20) 11
7 1 (22) + 1 (21) + 1 (20) 111
8 1 (23) + 0 (22) + 0 (21) + 0 (20) 1000

The coefficients of the powers give the binary representation of the decimal number.

## Binary Ops in Go

We can have some fun with binary in Go. Binary literals are written with a `0b` prefix such as `0b101`.

Binary ops playground

``````package main

import "fmt"

func main() {
fmt.Println(0b11) // 3

fmt.Println(0b11 == 3)   // true
fmt.Println(0b1000 == 8) // true

// Left shift moves the bits to the left, introducing 0s on the right,
// increasing the number
fmt.Println(0b10 << 1)         // 4 i.e 100
fmt.Println(0b1 << 2)          // also 4
fmt.Println(0b1<<2 == 0b10<<1) // true

// Right shift moves the bits to the right, losing columns and reducing the number.
fmt.Println(0b011>>1 == 1)  // true
fmt.Println(0b1111 >> 2)    // 3 i.e. 11
fmt.Println(0b1110>>2 == 3) // true

// ANDing with & produces a 1 where both columns are 1, else 0.
fmt.Println(0b110&0b100 == 0b100) // true
fmt.Println(0b01&0b00 == 0)       // true

// ORing with | produces a 1 where either column or both are 1, else 0.
fmt.Println(0b11|0b01 == 0b11)    // true
fmt.Println(0b100|0b011 == 0b111) // true

// ExclusiveORing with ^ produces a 1 only where columns are unequal, else 0.
fmt.Println(0b11^0b00 == 0b11)  // true
fmt.Println(0b100^0b101 == 0b1) // true
}``````

## Gene Compression Algorithm

Binary operations really shine when applied to the gene compression problem.

Lets get started.

``````mkdir geneCompression
cd geneCompression``````

Create a `main.go` file with a type `CompressedGene`.

``````package main

type CompressedGene struct {
bits int
}``````

We’ll create a constructor function, and a `compress` private method that’s called within the constructor to store the gene as the bits field.

``````func (cg *CompressedGene) compress(gene string) error {
cg.bits = 1 // sentinel value so we can store 00 and 01 as first nucleotide
for _, nucleotide := range strings.ToUpper(gene) {
cg.bits = cg.bits << 2 // left shift by 2 to introduce 00
switch nucleotide {
case 'A':
cg.bits |= 0b00
case 'C':
cg.bits |= 0b01
case 'G':
cg.bits |= 0b10
case 'T':
cg.bits |= 0b11
default:
return fmt.Errorf("Invalid nucleotide: %c", nucleotide)
}
}
return nil
}

// NewCompressedGene provides a compressed gene struct reference.
func NewCompressedGene(gene string) (*CompressedGene, error) {
cg := CompressedGene{}
err := cg.compress(gene)
if err != nil {
return nil, err
}
return &cg, nil
}``````

For compression, we start out with a `1` as a sentinel value. Starting with `0` would make us lose genes if we added `00` or `01` as the first elements since leading zeros are ignored in binary expressions.

A left shift is performed to create two columns to store the gene. We use the or equals `|=` operator to change the value of the rightmost bits according to the gene and store it.

Decompression will be the inverse of the compression method.

``````// Decompress expands the bits into a string.
func (cg *CompressedGene) Decompress() (string, error) {
gene := ""

// bitLength - 1 to ignore sentinel value
for i := 0; i < bitLength(cg.bits)-1; i += 2 {
rightShifted := cg.bits >> i
fmt.Printf("Rightshifted by %d - %b\n", i, rightShifted)

// get the two rightmost bits
rightPair := rightShifted & 0b11

switch rightPair {
case 0b00:
gene += "A"
case 0b01:
gene += "C"
case 0b10:
gene += "G"
case 0b11:
gene += "T"
default:
return "", fmt.Errorf("Invalid bits: %d", rightPair)
}
}
return reverse(gene), nil
}

func bitLength(n int) int {
return int(math.Floor(math.Log2(float64(n))) + 1)
}

func reverse(s string) string {
r := []rune(s)
for i, j := 0, len(r)-1; i < len(r)/2; i, j = i+1, j-1 {
r[i], r[j] = r[j], r[i]
}
return string(r)
}``````

We use the `bitLength` helper function to determine the number of bits in the integer value, and the `reverse` function to reverse the expanded string. Since we are going right to left, turning bits into genes, the string we build is the reverse of the original. That’s why the final step of reversing the string is needed.

We can test everything works as expected!

``````func main() {
gene := "ACGT"
fmt.Printf("Original gene: %s\n", gene)
compressed, err := NewCompressedGene(gene)
if err != nil {
log.Fatalln(err)
}
fmt.Printf("Bits: %b\n", compressed.bits)
decompressed, err := compressed.Decompress()
if err != nil {
log.Fatalln(err)
}
fmt.Printf("Matches original gene: %v\n", decompressed == gene)
}``````
``````go run main.go
Original gene: ACGT
Bits: 100011011
Rightshifted by 0 - 100011011
Rightshifted by 2 - 1000110
Rightshifted by 4 - 10001
Rightshifted by 6 - 100
Matches original gene: true``````

The only issue with this algorithm is that it fails for large string sequences.

If you change the `gene` value, and rerun - we’ll get quite unexpected results.

``````func main() {
gene := strings.Repeat("TAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATA", 100)
compressed, err := NewCompressedGene(gene)
if err != nil {
log.Fatalln(err)
}
fmt.Printf("Bits: %b\n", compressed.bits)
decompressed, err := compressed.Decompress()
if err != nil {
log.Fatalln(err)
}
fmt.Printf("Matches original gene: %v\n", decompressed == gene)
}``````

Large gene strings will cause this algorithm problems because our `bits int` field will easily overflow once we reach its 64 bit limit! The `int` type in go is 32 bits or 64 bits depending on the underlying system. But we definitely don’t want that limitation, so we’ll have to leverage the mighty `big.Int`.

## Using `big.Int`

The `big.Int` type won’t support the binary operation symbols like `&` and `<<` but it comes with methods that do the same thing. We can use `Or`, `And`, `Lsh` (left shift), and `Rsh` (right shift) methods for our logic.

``````// CompressedGene models a compressed format gene sequence
type CompressedGene struct {
bits *big.Int
}

func (c *CompressedGene) compress(gene string) error {
c.bits = big.NewInt(1) // sentinel value

for _, nucleotide := range strings.ToUpper(gene) {
c.bits.Lsh(c.bits, 2)
switch nucleotide {
case 'A':
c.bits.Or(c.bits, big.NewInt(0b00))
case 'C':
c.bits.Or(c.bits, big.NewInt(0b01))
case 'G':
c.bits.Or(c.bits, big.NewInt(0b10))
case 'T':
c.bits.Or(c.bits, big.NewInt(0b11))
default:
return fmt.Errorf("Invalid nucleotide: %c", nucleotide)
}
}
return nil
}

// Decompress expands the compressed bit sequence into
// a string of nucleotides A,C,G,T.
func (c *CompressedGene) Decompress() (string, error) {
gene := ""

// bitlen - 1 to ignore sentinel value
for i := 0; i < c.bits.BitLen()-1; i += 2 {
rightShifted := &big.Int{} // local value to not mutate the original
rightShifted.Rsh(c.bits, uint(i))
// Getting the two, rightmost bits.
rightBits := (rightShifted.And(rightShifted, big.NewInt(0b11))).Uint64()
switch rightBits {
case 0b00:
gene += "A"
case 0b01:
gene += "C"
case 0b10:
gene += "G"
case 0b11:
gene += "T"
default:
return "", fmt.Errorf("Invalid bits: %d", rightBits)
}
}
return reverse(gene), nil
}``````

The constructor remains the same.

Let’s adjust the main function to print out the byte sizes so we can judge the value of this compression algorithm.

``````func main() {
s := strings.Repeat("TAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATATAGGGATTAACCGTTATATATATATAGCCATGGATCGATTATA", 100)
c, err := NewCompressedGene(s)
if err != nil {
log.Fatalln(err)
}
fmt.Printf("String bytes: %d\n", len([]byte(s)))
fmt.Printf("Compressed bytes: %d\n", len(c.bits.Bytes()))
d, err := c.Decompress()
if err != nil {
log.Fatalln(err)
}
fmt.Printf("Matches original gene: %v\n", d == s)
}``````
``````go run main.go
String bytes: 8600
Compressed bytes: 2151
true``````

There we go! We saved quite a bit of memory by encoding the genes to a big int. By implementing this algorithm in Go, we’ve learned about Go’s `big.Int` type and applying binary operations to encode data for compression.

For me, this was a fun demonstration of just how interesting computing problems can be, and the ingenious ways in which they can be solved.

I can’t wait to discover and implement the rest of the algorithms in the book!