[foaf-dev] Fwd: A Proposal for Universal Identification of Genetic Sequence Features and Biological Parts By Sequence Hashes

Dan Brickley danbri at danbri.org
Mon Dec 20 20:03:07 CET 2010


Just when i was thinking we should hide foaf:dnaChecksum even more
from casual readers of the FOAF spec, ... ;)

(and yes, foaf:dnaChecksum is still a joke...)

Dan


---------- Forwarded message ----------
From: Timothy Ham <timothyham at gmail.com>
Date: Mon, Dec 20, 2010 at 7:54 PM
Subject: A Proposal for Universal Identification of Genetic Sequence
Features and Biological Parts By Sequence Hashes
To: Synthetic Biology Data Exchange Group <synbiodex at googlegroups.com>


Request for Comments [Draft v. 2010-12-20]
The original file of this draft can be found at
http://loche.lbl.gov/tim/static/TSH-RFC-2010-12-20.doc

Title:
A Proposal for Universal Identification of Genetic Sequence Features
and Biological Parts By Sequence Hashes

Authors:
Timothy Ham
Others?

Purpose:
This Proposal specifies a single method to generate a hash for an
arbitrary DNA sequence for identification purposes.

Abstract:
Sequences of arbitrary lengths can be universally and uniquely
identified using a hash generated by the SHA-256 algorithm. A short
identifier generated in a consistent way would facilitate global
identification of genetic sequences, sequence features, and biological
parts across different assembly standards and packaging formats.
Furthermore, the use of the hash as a token would facilitate discovery
of information related to a defined sequence on the Internet by
providing a unique keyword for search engines, supplementing sequence
similarity searches such as BLAST.

Body Text:
Currently, an arbitrary sequence of DNA is identified by its gene name
or protein sequence. For example, the Beta-lactamase gene (bla) can be
identified by its amino acid sequence. However, the amino acid
sequence alone does not give enough specificity to identify a
synthetic biological part. A gene with a different codon usage could
generate the same amino acid sequence, but perform differently under
experimental conditions. A truly distinct identification is achieved
only with an unambiguous DNA sequence.

Using the entire DNA sequence for identification and identity
verification is cumbersome. It is possible to use identifiers
generated by biological databases (NCBI, EMBL, Protein Data Bank, etc.),
but as it is not possible to generate identifiers on the fly, their
use in experimental biology is limited.

In the field of computer science, algorithms have been devised to
convert data of arbitrary length into a fixed-length “digital
fingerprint” or “hash”. This family of algorithms, called
“cryptographic hash functions”, has the property that it is infeasible
to find two different inputs that generate an identical hash
result. The hash values generated by the algorithm are analogous to a
bar-code, except that anyone can generate them and they are always the
same for the same input.
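
The properties described above can be illustrated with a minimal
sketch. It assumes Python and its standard hashlib module, and uses
SHA-256 because that is the algorithm this proposal names; the helper
name digest and the toy inputs are made up for illustration only.

```python
import hashlib


def digest(text: str) -> str:
    """Return the SHA-256 hex digest of ASCII text (illustrative helper)."""
    return hashlib.sha256(text.encode("ascii")).hexdigest()


short = digest("atg")          # a three-letter input
long_ = digest("atg" * 1000)   # a three-thousand-letter input
changed = digest("atc")        # one letter differs from "atg"

# Fixed length: both digests are 64 hex characters (256 bits).
print(len(short), len(long_))
# Deterministic: hashing the same input again gives the same value.
print(short == digest("atg"))
# A single-letter change produces a different hash value.
print(short != changed)
```

Anyone running the same function on the same text obtains the same
value, which is what makes the hash usable as a shared identifier.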

We propose that once an unambiguous DNA sequence for a biological part
is determined, its cryptographic hash value should be calculated and
associated with that sequence, along with any description, measurement
or experiment performed with that DNA sequence.

By adopting a standard method to calculate these hash values, it would
be possible to:
1. Create a globally unique and unambiguous identification token for
any arbitrary DNA sequence.
2. Unambiguously identify the same parts encoded in different assembly
standards.
3. Unambiguously identify the same parts or DNA sequences created by
different people. This enables the association of different
experiments performed by different people with the same DNA sequence.
4. Facilitate search and indexing of biological parts by providing a
short yet unique identifier token to search engines.

We propose that the hash value is calculated as follows:
1. The SHA-256 algorithm must be used. Adopted as a US Federal
Information Processing Standard, it is regarded as a robust, secure
algorithm and is widely implemented and used.
2. The letters a, t, g, c in lower case must be used to represent the
DNA sequence. Ambiguous nucleotide alphabets must not be used.
3. Blank lines, spaces, or other symbols must not be included in the
sequence text.
4. The sequence text must be in ASCII or UTF-8 encoding. For the
alphabets used, the two are identical. UTF-16 encoding must not be
used, as its two-byte code units and Byte Order Mark will change the
calculated hash value.
5. For DNA sequences in assembly formats, two calculations must be
performed: one for the entire sequence including the assembly prefix
and suffix, and another for the desired or functional sequence
enclosed between the prefix and suffix (called the inner hash).
Generally, this would mean the sequence exclusive of the prefix and
suffix.
6. If the desired or functional sequence overlaps parts of the prefix
or the suffix, then the overlapping sequences should be included in
the second hash calculation. For example, when two BglBrick formatted
parts encoding two protein domains are assembled together to create a
single protein, utilizing the reaction scar as a Glycine-Serine
linker, then the portions of the protein domains that extend into the
prefix and suffix regions should be included in the inner hash
calculation for the two original parts.
7. The resulting hash value must be represented in hexadecimal, also
known as a hex digest. Only the digits 0-9 and the letters a-f in
lower case may be used.
8. It is not necessary to store both the forward and reverse values.
However, when performing searches, both directions must be calculated
and checked for completeness.
9. This proposal is only to identify specifically and uniquely a
particular DNA sequence. If different versions of a sequence are used
in an experiment (for example, sequences that lack start or stop
codons, or that carry point mutations, silent mutations, codon
optimizations, etc.), they are considered different sequences with
different hash values. A separate mechanism must be used to deal with
variations.
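
The calculation rules above can be sketched roughly as follows. This
is a minimal illustration in Python using the standard hashlib module,
not a reference implementation. The function names (normalize,
seq_hash, assembly_hashes, search_hashes) are hypothetical, and the
sketch assumes that the "reverse" direction in rule 8 means the
reverse complement strand, which the draft does not state explicitly.

```python
import hashlib
import re
from typing import Set, Tuple

_VALID = re.compile(r"[atgc]+")
_COMPLEMENT = str.maketrans("atgc", "tacg")


def normalize(raw: str) -> str:
    """Drop all whitespace and lower-case the text (rules 2 and 3)."""
    seq = "".join(raw.split()).lower()
    # Reject empty input, ambiguous letters and any other symbols.
    if not _VALID.fullmatch(seq):
        raise ValueError("sequence must contain only a, t, g, c")
    return seq


def seq_hash(raw: str) -> str:
    """Lower-case SHA-256 hex digest of the normalized text (rules 1, 4, 7)."""
    # For a, t, g, c the ASCII and UTF-8 encodings are byte-identical.
    return hashlib.sha256(normalize(raw).encode("ascii")).hexdigest()


def assembly_hashes(prefix: str, inner: str, suffix: str) -> Tuple[str, str]:
    """Return (full hash, inner hash) for a part in an assembly format (rule 5)."""
    return seq_hash(prefix + inner + suffix), seq_hash(inner)


def search_hashes(raw: str) -> Set[str]:
    """Hashes to check when searching in both directions (rule 8).

    Assumption: "reverse" is taken to mean the reverse complement.
    """
    seq = normalize(raw)
    reverse = seq.translate(_COMPLEMENT)[::-1]
    return {seq_hash(seq), seq_hash(reverse)}


# Placeholder prefix/suffix, not taken from any real assembly standard.
full, inner = assembly_hashes("aaaa", "ATG AAA TAA", "tttt")
print(full)
print(inner)
```

Under rule 6, the caller would pass a wider inner argument that
includes the overlapping prefix or suffix bases; the sketch leaves that
decision to the caller because it depends on the part.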

Frequently Asked Questions:
1. Why not BLAST?
The purpose of the hash values is to facilitate search algorithms,
not to replace them. BLAST and other sequence comparison algorithms
are great at finding similar sequences, but that very feature makes
them cumbersome to use for identification. To abuse an old analogy:
BLAST is good for looking for needle-like metallic objects in a
haystack. The hash value is used to find the exact needle in a stack
of similar-looking needles. Furthermore, in order to use BLAST, the
sequences must be deposited into a BLAST database. But not everyone is
inclined to host a BLAST-enabled database or deposit all their parts
into one. Since hash values are easily calculated (via a web form, for
example; see http://loche.lbl.gov/tim/seqhash/), they can quickly
enable searches by Internet search engines.
2. What are some use cases?
a) In part assembly usage, instead of using arbitrary and synthetic
identifiers (a database ID, for example), a composite part can be
defined as a list of sequence hashes (or as a list of inner, scar,
prefix and suffix hashes). As the parts are repackaged in different
assembly formats or in no assembly format at all, inner hashes will
remain constant and predictable, so there is no need to keep track of
arbitrary part numbers.
b) When biological devices composed of multiple composite parts are
exchanged, it is not necessary to exchange all the sub-parts. Each
sub-part's position would be annotated, and when further information
is desired, it can be searched for on the Internet using its hash
value.
c) At JBEI, when a new annotated GenBank file is entered, all the
annotated features are identified, and their hashes are calculated and
stored. By keeping track of known features, we are able to measure the
amount of part/sequence/gene reuse, identify mis-annotations, and
facilitate automatic feature annotation.
3. Would I be required to memorize a 64-character hexadecimal string
all the time?
No. Hash values are primarily for machine use. Users should be
presented with friendly part numbers or names whenever possible.
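
Use cases (a) and (c) in the FAQ above might look something like the
following sketch, again in Python. The part names, the sequences, and
the helpers seq_hash and record_features are all made-up placeholders;
this is not a description of the actual JBEI system.

```python
import hashlib
from collections import Counter


def seq_hash(seq: str) -> str:
    """SHA-256 hex digest of the whitespace-free, lower-cased sequence."""
    normalized = "".join(seq.split()).lower()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


# Hypothetical parts; names and sequences are placeholders only.
parts = {"promoter_x": "ttgaca", "rbs_y": "aggagg", "cds_z": "atgaaataa"}

# Use case (a): a composite part defined as an ordered list of hashes.
composite = [seq_hash(parts[name]) for name in ("promoter_x", "rbs_y", "cds_z")]

# Use case (c): an index of known features, keyed by hash, plus a tally
# of how often each hash has been seen in newly entered files.
feature_index = {seq_hash(seq): name for name, seq in parts.items()}
reuse = Counter()


def record_features(feature_seqs):
    """Tally each feature's hash and return the names of known features."""
    known = []
    for seq in feature_seqs:
        h = seq_hash(seq)
        reuse[h] += 1
        if h in feature_index:
            known.append(feature_index[h])
    return known


# A new file with one known feature (upper case) and one unknown one.
print(record_features(["TTGACA", "cccc"]))
```

Because the index is keyed by hash rather than by name, the same
feature is recognized regardless of what each file calls it.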

Example:
One version of the Beta-lactamase gene has the DNA sequence:

>gi|58000284:100368-101228 Escherichia coli A2363 plasmid pAPEC-O2-R, complete sequence
ATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTC
ACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCACGAGTGGGTTACATCGAACT
GGATCTCAACAGCGGTAAGATCCTTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGATGAGCACTTTT
AAAGTTCTGCTATGTGGCGCGGTATTATCCCGTGTTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATAC
ACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCTTACGGATGGCATGACAGT
AAGAGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCGGCCAACTTACTTCTGACAACGATC
GGAGGACCGAAGGAGCTAACCGCTTTTTTGCACAACATGGGGGATCATGTAACTCGCCTTGATCGTTGGG
AACCGGAGCTGAATGAAGCCATACCAAACGACGAGCGTGACACCACGATGCCTGCAGCAATGGCAACAAC
GTTGCGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAG
GCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGTTTATTGCTGATAAATCTG
GAGCCGGTGAGCGTGGGTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTCCCGTATCGT
AGTTATCTACACGACGGGGAGTCAGGCAACTATGGATGAACGAAATAGACAGATCGCTGAGATAGGTGCC
TCACTGATTAAGCATTGGTAA
After normalization (removal of whitespace and conversion to lower
case), this sequence results in the hash value:
2c5bc10fd145290e60b0b03fb203c474b1add56c3ede4fe26bed5408175c7cf3
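
To check a value like the one above, the normalization and hashing can
be applied to a FASTA record directly. The sketch below is one
possible way to do so in Python; the function name fasta_record_hash is
hypothetical, and the record shown is a shortened placeholder (the
first 20 bases of the example), so its output is not the hash quoted
above.

```python
import hashlib


def fasta_record_hash(record: str) -> str:
    """Hash the sequence lines of one FASTA record, skipping the header."""
    body = "".join(
        line for line in record.splitlines() if not line.startswith(">")
    )
    # Remove any remaining whitespace and convert to lower case.
    seq = "".join(body.split()).lower()
    return hashlib.sha256(seq.encode("ascii")).hexdigest()


# Shortened placeholder record; substitute the full record from the
# example to compare against the hash value given there.
record = """>example header
ATGAGTATTC
AACATTTCCG
"""
print(fasta_record_hash(record))
```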

