Post on 25-Nov-2014
DNA CRYPTOGRAPHY
NIT KURUKSHETRA 1
Chapter 1 Introduction
1.1 DNA Cryptography
DNA cryptography is a newly emerging cryptographic field that arose from research on DNA
computing, in which DNA is used as the information carrier and modern biological technology
is used as the implementation tool. The vast parallelism and extraordinary information density
inherent in DNA molecules are exploited for cryptographic purposes such as encryption,
authentication, signature, and so on.
1.2 DNA
DNA is the abbreviation for deoxyribonucleic acid, which is the genetic material of all life forms.
DNA is a biological macromolecule made of nucleotides. Each nucleotide
contains a single base, and there are four kinds of bases, adenine (A), thymine (T),
cytosine (C) and guanine (G), corresponding to four kinds of nucleotides. A single-stranded
DNA molecule has an orientation: one end is called 5′, and the other end is called 3′. In nature,
DNA usually exists as double-stranded molecules. The two complementary DNA strands are
held together in a double-helix structure by hydrogen bonds between the complementary
bases A and T (or C and G).
Fig 1.2.1 Double helix structure of DNA
1.3 Amino Acid Codes
Amino acid name                                             Code   Codons (DNA)
Alanine                                                     A      GCT GCC GCA GCG
Arginine                                                    R      CGT CGC CGA CGG AGA AGG
Asparagine                                                  N      AAT AAC
Aspartic acid (Aspartate)                                   D      GAT GAC
Cysteine                                                    C      TGT TGC
Glutamine                                                   Q      CAA CAG
Glutamic acid (Glutamate)                                   E      GAA GAG
Glycine                                                     G      GGT GGC GGA GGG
Histidine                                                   H      CAT CAC
Isoleucine                                                  I      ATT ATC ATA
Leucine                                                     L      TTA TTG CTT CTC CTA CTG
Lysine                                                      K      AAA AAG
Methionine                                                  M      ATG
Phenylalanine                                               F      TTT TTC
Proline                                                     P      CCT CCC CCA CCG
Serine                                                      S      TCT TCC TCA TCG AGT AGC
Threonine                                                   T      ACT ACC ACA ACG
Tryptophan                                                  W      TGG
Tyrosine                                                    Y      TAT TAC
Valine                                                      V      GTT GTC GTA GTG
Asparagine or Aspartic acid (Aspartate)                     B      Random codon from D and N
Glutamine or Glutamic acid (Glutamate)                      Z      Random codon from E and Q
Unknown amino acid (any amino acid)                         X      Random codon
Translation stop                                            *      TAA TAG TGA
Gap of indeterminate length                                 -      ---
Unknown character (any character or symbol not in table)    ?      ???
Table 1.3.1 Amino acids and codes
1.4 Primer
A primer is a short synthetic oligonucleotide which is used in many molecular techniques
from PCR to DNA sequencing. These primers are designed to have a sequence which is the
reverse complement of a region of template or target DNA to which we wish the primer to
anneal.
Some thoughts on designing primers
1. primers should be 17-28 bases in length;
2. base composition should be 50-60% (G+C);
3. primers should end (3') in a G or C, or CG or GC: this prevents "breathing" of ends
and increases efficiency of priming;
4. melting temperatures (Tm) between 55 and 80 °C are preferred;
5. 3'-ends of primers should not be complementary (i.e. able to base pair), as otherwise primer
dimers will be synthesised preferentially to any other product;
6. primer self-complementarity (the ability to form secondary structures such as hairpins) should be
avoided;
7. runs of three or more Cs or Gs at the 3'-ends of primers may promote mispriming at G-
or C-rich sequences (because of stability of annealing), and should be avoided.
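Several of the rules above can be checked mechanically. The following is a minimal sketch, not a primer-design tool: the function names are hypothetical, the melting temperature is estimated with the simple Wallace rule (2 °C per A/T plus 4 °C per G/C), which is an assumption not taken from the text, and rule 7 is read as "the last three bases are all G or C". Rules 5 and 6 (primer dimers and hairpins) are not covered.

```python
def gc_fraction(primer):
    """Fraction of bases that are G or C (rule 2)."""
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C using the Wallace rule.

    Assumption: Tm = 2 * (A + T) + 4 * (G + C); real design tools use
    more accurate nearest-neighbour models.
    """
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def check_primer(primer):
    """Return a dict of rule name -> True/False for rules 1, 2, 3, 4 and 7."""
    primer = primer.upper()
    tail = primer[-3:]
    return {
        "length_17_28": 17 <= len(primer) <= 28,
        "gc_50_60_percent": 0.50 <= gc_fraction(primer) <= 0.60,
        "ends_in_g_or_c": primer[-1] in "GC",
        "tm_55_80": 55 <= wallace_tm(primer) <= 80,
        # Assumption: any mix of three G/C bases at the 3' end counts as a run.
        "no_gc_run_at_3_prime": not all(base in "GC" for base in tail),
    }


if __name__ == "__main__":
    for rule, ok in check_primer("ATGCTAGCTAGGATCCATGC").items():
        print(f"{rule}: {'pass' if ok else 'fail'}")
```

A candidate that fails any check would simply be redesigned; the thresholds are the ones listed above.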
1.5 Transcription and Translation
Transcription, or RNA synthesis, is the process of creating an equivalent RNA copy of a
sequence of DNA. Both RNA and DNA are nucleic acids, which use base pairs of nucleotides as
a complementary language that can be converted back and forth from DNA to RNA in the
presence of the correct enzymes. During transcription, a DNA sequence is read by RNA
polymerase, which produces a complementary, anti-parallel RNA strand. As opposed to DNA
replication, transcription results in an RNA complement that includes uracil (U) in all instances
where thymine (T) would have occurred in a DNA complement.
Translation is the first stage of protein biosynthesis (part of the overall process of gene
expression). Translation is the production of proteins by decoding mRNA produced
in transcription. Translation occurs in the cytoplasm where the ribosomes are located.
Ribosomes are made of a small and a large subunit, which surround the mRNA. In
translation, messenger RNA (mRNA) is decoded to produce a specific polypeptide according to
the rules specified by the genetic code. This uses an mRNA sequence as a template to guide the
synthesis of a chain of amino acids that form a protein. Many types of transcribed RNA, such as
transfer RNA, ribosomal RNA, and small nuclear RNA are not necessarily translated into an
amino acid sequence.
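As an illustration of the two processes, the sketch below transcribes a DNA template strand into mRNA and then translates the mRNA into one-letter amino acid codes. It is only a string-level model under stated assumptions: strands are written 5′ to 3′, the codon assignments are those of the standard genetic code written with DNA letters, translation starts at the first base, and the function names are hypothetical.

```python
# Standard genetic code, one line per amino acid code followed by its DNA codons.
CODON_LINES = """A GCT GCC GCA GCG
R CGT CGC CGA CGG AGA AGG
N AAT AAC
D GAT GAC
C TGT TGC
Q CAA CAG
E GAA GAG
G GGT GGC GGA GGG
H CAT CAC
I ATT ATC ATA
L TTA TTG CTT CTC CTA CTG
K AAA AAG
M ATG
F TTT TTC
P CCT CCC CCA CCG
S TCT TCC TCA TCG AGT AGC
T ACT ACC ACA ACG
W TGG
Y TAT TAC
V GTT GTC GTA GTG
* TAA TAG TGA"""

CODON_TO_AA = {
    codon: line.split()[0]
    for line in CODON_LINES.splitlines()
    for codon in line.split()[1:]
}

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Complementary, anti-parallel strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(template):
    """mRNA made from a template strand: its reverse complement with U for T."""
    return reverse_complement(template).replace("T", "U")


def translate(mrna):
    """Read codons from the first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TO_AA[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    coding = "ATGGGTGAAAACGAATAA"
    mrna = transcribe(reverse_complement(coding))
    print(mrna, "->", translate(mrna))
```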
1.6 Cryptography
Data security and cryptography are critical aspects of conventional computing and may also be
important to possible DNA database applications. Here we provide basic terminology used in
cryptography. The goal is to transmit a message between a sender and receiver such that an
eavesdropper is unable to understand it. Plaintext refers to a sequence of characters drawn from
a finite alphabet, such as that of a natural language. Encryption is the process of scrambling the
plaintext using a known algorithm and a secret key. The output is a sequence of characters
known as the ciphertext. Decryption is the reverse process, which transforms the encrypted
message back to the original form using a key. The goal of encryption is to prevent decryption
by an adversary who does not know the secret key. An unbreakable cryptosystem is one for
which successful cryptanalysis is not possible. Such a system is the one-time-pad cipher. It gets
its name from the fact that the sender and receiver each possess identical notepads filled with
random data. Each piece of data is used once to encrypt a message by the sender and to decrypt
it by the receiver, after which it is destroyed.
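The one-time pad just described can be sketched in a few lines. This is an illustrative model only: the pad is drawn from Python's secrets module, combining the message and pad by XOR is a common choice assumed here (the text does not fix the operation), and the function names are hypothetical.

```python
import secrets


def make_pad(length):
    """Random pad shared in advance by sender and receiver."""
    return secrets.token_bytes(length)


def xor_with_pad(data, pad):
    """Encrypt or decrypt: XOR each byte with the matching pad byte.

    The pad must be at least as long as the data and must never be reused.
    """
    if len(pad) < len(data):
        raise ValueError("pad is shorter than the message")
    return bytes(d ^ p for d, p in zip(data, pad))


if __name__ == "__main__":
    message = b"GENECRYPTOGRAPHY"
    pad = make_pad(len(message))
    ciphertext = xor_with_pad(message, pad)
    print(ciphertext.hex())
    print(xor_with_pad(ciphertext, pad))  # same pad recovers the message
```

Applying the same function twice with the same pad returns the original message, which is why both parties need identical pads and must destroy each portion after use.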
The main goal of research on DNA cryptography is to explore the characteristics of DNA
molecules and reactions, establish corresponding theories, discover possible development
directions, search for simple methods of realizing DNA cryptography, and lay the basis
for future development.
1.7 Advantages Of DNA Cryptography
The difficult biological problem referred to here is that it is extremely difficult to amplify the
message-encoded sequence without knowing the correct pair of PCR primers. Polymerase
Chain Reaction (PCR) is a fast DNA amplification technology based on Watson-Crick
complementarity, and is one of the most important inventions in modern biology. Two
complementary oligonucleotide primers are annealed to double-stranded target DNA strands,
and the target DNA is then amplified through a series of polymerase-catalysed cycles. PCR is
a very sensitive method: in theory, a single target DNA molecule can be amplified to about 10^6 copies after 20
cycles (2^20 ≈ 10^6). Thus one can effectively amplify a lot of DNA strands within a very short time.
Considering the high stability of PCR, a primer length of 20 to 27 nucleotides is a
comparatively good choice. In this study, each PCR primer is 20 nucleotides
long. PCR amplification works only with the correct primer pair, so it would
be extremely difficult to amplify the message-encoded sequence without knowing both
primers. An adversary who does not know the correct primer pair and wants to
pick out the message-encoded sequence by PCR amplification must choose two primer
sequences from about 10^23 possibilities (the number of ways of choosing 2 sequences
from 4^20 candidates). So, we believe that this biological problem is difficult and will remain so for a
relatively long time.
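The two figures quoted in this section can be checked with a short calculation. The sketch assumes ideal doubling in every PCR cycle and counts unordered pairs of distinct 20-mers; the function names are hypothetical.

```python
import math


def copies_after(cycles, start=1):
    """Copies of the target after ideal PCR (exact doubling each cycle)."""
    return start * 2 ** cycles


def primer_pair_choices(primer_length=20):
    """Ways to pick 2 distinct primers from all sequences of this length."""
    candidates = 4 ** primer_length  # four bases per position
    return math.comb(candidates, 2)


if __name__ == "__main__":
    print(f"copies after 20 cycles: {copies_after(20):.2e}")
    print(f"possible primer pairs:  {primer_pair_choices(20):.2e}")
```

Twenty ideal cycles give 2^20 = 1,048,576 copies, and the number of primer pairs falls between 10^23 and 10^24, in line with the order of magnitude stated above.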
1.8 Limitations Of DNA Cryptography
(i) Lack of the related theoretical basis.
(ii) Difficult to realize and expensive to apply.
1.9 Comparisons among DNA cryptography, traditional cryptography and
quantum cryptography
1.9.1 Development
Traditional cryptography can be traced back to the Caesar cipher 2000 years ago or even earlier.
Its theory is largely mature, and all ciphers in practical use can be regarded as traditional ones.
Quantum cryptography came into being in the 1970s; its theoretical basis has been prepared,
but implementation is difficult, and by and large it has not yet been put into practical use.
DNA cryptography has a history of only about ten years; its theoretical basis is still under research and its
application is very costly.
1.9.2 Security
Only computational security can be achieved for traditional cryptographic schemes except for
the one-time pad; that is to say, an adversary with infinite computational power can break them
in theory. Quantum computers have been shown to have great computational
potential. Although there is uncertainty about the computational power of quantum computers, it
is possible that all the traditional schemes except for the one-time pad can be broken by
future quantum computers. Quantum cryptographic schemes, by contrast, are unbreakable under current
theories: their security is based on Heisenberg's Uncertainty Principle. Even if an
eavesdropper is given the ability to do whatever he wants and has infinite computing resources,
or even if P = NP, it is still impossible to break such a scheme. Any act of eavesdropping
will change the cipher, so it can be detected. It is impossible for an adversary to obtain a
quantum state identical to the intercepted one, so any attempt to tamper without being detected
is in vain. Therefore, quantum key agreement schemes have unconditional security. For DNA
cryptography, the main security basis is the limitation of biological techniques, which has
nothing to do with computing power and makes DNA cryptographic schemes immune to
attacks using quantum computers. Nonetheless, the extent of this kind of
security and how long it can be maintained are still under exploration.
1.9.3 Application
Traditional cryptosystems are the most convenient: the computation can be executed by
electronic, quantum as well as DNA computers; the data can be transmitted by wire, fiber,
wireless channel and even by a messenger; and the storage can be CDs, magnetic media, DNA
and other storage media. Using traditional cryptography we can realize purposes such as public-
and private-key encryption, identity authentication and digital signature. Quantum cryptosystems
are implemented on quantum channels, and their main advantage lies in real-time communication.
Their disadvantage lies in secure data storage, which makes it infeasible to implement public-
key encryption and digital signatures as easily as traditional cryptography does. At the current level of
technology, the ciphertext of DNA cryptography can be transmitted only by physical means. Due
to the vast parallelism, exceptional energy efficiency and extraordinary information density
inherent in DNA molecules, DNA cryptography can have special advantages for some
cryptographic purposes, such as secure data storage, authentication, digital signature,
steganography, and so on. DNA can even be used to produce unforgeable contracts, cash tickets
and identification cards.
Research on all three kinds of cryptography is still in progress, and a great many
problems remain to be solved, especially for DNA and quantum cryptography, which makes it
hard to predict the future. But from the above discussion we think it is likely that they will coexist,
develop together and complement each other, rather than any one of them falling into disuse
entirely.
1.10 Development directions of DNA cryptography
Since DNA cryptography is still in its immature stage, it is too early to predict the future
development precisely. However, in view of the development of biological techniques and the
requirement of cryptography, we hold the following opinions:
1) DNA cryptography should be implemented by using modern biological techniques as tools
and hard biological problems as the main security basis, so as to fully exploit its special advantages.
Encryption and decryption are data transformations which, if described by mathematical
methods, are easier to implement than physical and chemical ones in the present era of
electronic computers and the Internet. If other kinds of cryptosystems are to be
researched and developed, they should have properties, such as higher security levels and storage
density, that cannot be realized by electronic computers using mathematical methods.
Thus, if DNA cryptography is to be developed, the advantages inherent in DNA
should be fully explored, such as developing nanoscopic storage based on the tiny volume of
DNA, realizing fast encryption and decryption based on the vast parallelism, and using
difficult biological problems, which one can exploit but which are still far from fully understood, as the
security foundation of DNA cryptography, so as to realize novel cryptosystems that can resist
attacks from quantum computers. Since it is not yet certain whether quantum computers
threaten the hardness of the various hard mathematical problems, these problems cannot be absolutely excluded as a security
basis. Encryption and decryption algorithms that are hard to
implement on electronic computers may be feasible on DNA computers, given their
vast parallel computational ability. If these schemes withstand attacks by quantum computers,
their computational security will be inherited by DNA schemes. Thus, DNA cryptography
does not entirely exclude traditional cryptography, and it is possible to construct a hybrid
cryptosystem from the two.
2) Security requirements: Regardless of the many differences between DNA and traditional
cryptography, they both satisfy the same basic characteristics of cryptography. The communication
model for DNA encryption is also made up of two parties, i.e. a sender and a receiver, who
obtain the secret key in a secure or authenticated way and then communicate securely with each
other over an insecure or unauthenticated channel. The security requirements should also be
founded upon the assumption proposed by Kerckhoffs that security should depend only on the
secrecy of the decryption key; that is, an attacker is assumed to be fully aware of all the details of
encryption and decryption except the decryption key. It is under this assumption that a
cryptosystem can be called secure when no attacker can break it. More precisely, it must be
assumed that an attacker knows the basic biological method the designer used, and has enough
knowledge and excellent laboratory equipment to repeat the designer’s operations. The only thing
not known by the attacker is the key. In a DNA cryptosystem, a key is usually some
biological material or a preparation procedure, and sometimes the experimental conditions.
3) For DNA cryptography, the current research target should lie first in security and feasibility,
and second in storage density.
A sound cryptosystem should be secure as well as easy to implement. The development of
modern biological technology makes it possible to express data in DNA, although the related
research is just in its initial stage. In fact, it is still difficult to manipulate nanoscopic DNA
directly. Scientists can easily manipulate DNA with the aid of various restriction enzymes only
after DNA strands are amplified with an amplification technology such as PCR. With current
technology, it is also impossible to store all the world's data in several grams of DNA.
If the only requirement is to improve storage density, it is hard to implement DNA
cryptography at the present level of technology.
It is more practical for cryptographers to make use of the collective properties of large numbers of DNA molecules. For
example, data can be stored on DNA chips and read by hybridization, which makes
input/output operations faster and more convenient. This method is easier to implement than
encoding the message into nucleotides directly, although the storage density is somewhat lower.
4) Currently, the main task for DNA cryptographers is to establish the theoretical foundations and to
accumulate practical experience.
It has been shown that vast parallelism, exceptional energy efficiency and extraordinary
information density are inherent in DNA. This motivates research on DNA computing and
cryptography. The current goal, and difficulty, is to find and exploit this potential to the utmost, but
the related research is in its initial stage. Sound theories have not yet been established for either DNA
computing or cryptography. Modern biology lays particular stress on experiments rather than
theories. There is no efficient way to measure the hardness of a biological problem or the
security level of the cryptosystems based on it. It is certainly urgent to
find such a method, similar to computational complexity. At present, the most important tasks are to find
the properties of DNA that can be used for computation and encryption, to establish the
theoretical basis and to accumulate experience, on the basis of which the design of secure and
practical DNA cryptosystems will be possible.
1.11 DNA Digital Coding Technology
In information science, the most fundamental coding method is binary digital coding, in which
anything can be encoded by the two states 0 and 1 and combinations of them. There are four
kinds of bases in a DNA sequence: adenine (A), thymine (T), cytosine (C) and guanine (G).
The simplest coding pattern for the 4 nucleotide bases (A, T, C, G) uses
4 digits: 0(00), 1(01), 2(10), 3(11). Obviously, there are 4! = 24 possible coding patterns
in this encoding format. In a double-helix DNA molecule, the two strands are
held together by sequence complementarity, that is, A pairs with T and C pairs with G according to the
Watson-Crick complementarity rule. A DNA digital coding should reflect these
biological characteristics of the 4 nucleotide bases, so the complementary rule (~0) = 1 and (~1) = 0
is adopted in this DNA digital coding. Under this rule, 0(00) is complementary to
3(11) and 1(01) to 2(10). So among the 24 patterns, only 8 (0123/CTAG,
0123/CATG, 0123/GTAC, 0123/GATC, 0123/TCGA, 0123/TGCA, 0123/ACGT, 0123/AGCT),
which are topologically identical, fit the complementary rule of the nucleotide bases. It is
suggested that the coding pattern in accordance with the sequence of molecular weight,
0123/CTAG, is the best coding pattern for the nucleotide bases. This pattern reflects
the biological characteristics of the 4 nucleotide bases and has a certain biological
significance. Binary digital coding of DNA sequences has the following advantages over character DNA
coding:
(1). It decreases the redundancy of the information coding and improves the coding efficiency
compared to the traditional character DNA coding.
(2). The digital coding of a DNA sequence is very convenient for mathematical and
logical operations and may have a great impact on DNA bio-computers.
(3). A DNA sequence preprocessed by DNA digital coding can be used in
digital computing and fits the existing computer-processing mode, which facilitates the
direct conversion between biological information and encrypted information in the
cryptography scheme.
(4). With DNA digital coding, traditional encryption methods such as
DES or RSA can be used to preprocess the plaintext in the cryptography scheme.
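The coding pattern 0123/CTAG and the count of 8 admissible patterns can be illustrated with a short sketch. The function names are hypothetical; the only rule used is the one stated above, namely that complementary digits (0 with 3, 1 with 2) must map to Watson-Crick complementary bases.

```python
from itertools import permutations

PATTERN = "CTAG"  # digit i maps to PATTERN[i]: 0=C, 1=T, 2=A, 3=G
WATSON_CRICK = {"A": "T", "T": "A", "C": "G", "G": "C"}


def bits_to_dna(bits, pattern=PATTERN):
    """Encode a binary string, two bits per nucleotide."""
    if len(bits) % 2:
        raise ValueError("number of bits must be even")
    return "".join(pattern[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))


def dna_to_bits(dna, pattern=PATTERN):
    """Decode a nucleotide string back to the binary string."""
    return "".join(format(pattern.index(base), "02b") for base in dna)


def complementary_patterns():
    """All digit-to-base patterns in which digit d and digit 3 - d pair up."""
    found = []
    for bases in permutations("ACGT"):
        if all(WATSON_CRICK[bases[d]] == bases[3 - d] for d in range(4)):
            found.append("".join(bases))
    return found


if __name__ == "__main__":
    print(bits_to_dna("01000111"))   # the byte 0x47, the letter G
    print(complementary_patterns())  # the admissible patterns
```

With this pattern one byte becomes four nucleotides, so a message of n bytes is encoded in 4n bases.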
1.12 System Design Of Encryption Scheme
We now describe the system design of the encryption scheme, whose security is
based mainly on difficult biological problems and difficult mathematical problems. We will
show how to exchange messages safely between two specific persons. We shall call the
sender Alice and the intended receiver Bob. First, we define the
encryption scheme as follows. Suppose there is a sender Alice who owns an encryption key KA,
and an intended receiver Bob who owns a decryption key KB (KA = KB or KA ≠ KB). Alice
uses KA to translate a plaintext M into a ciphertext C by a translation E. Bob uses KB to translate
the ciphertext C back into the plaintext M by a translation D.
The encryption process is:
C = E_{KA}(M)
The decryption process is:
D_{KB}(C) = D_{KB}(E_{KA}(M)) = M
It is difficult to obtain M from C unless one has KB. We call the translation E the encryption process
and C the ciphertext. Here, KA, KB and C are not limited to digital data, but can be any method,
material or data, such as a DNA sequence. E and D are also not limited to mathematical
calculations, but can be any physical, chemical, biological or mathematical process, such as a
traditional encryption method. This paper proposes an encryption scheme with DNA
technologies that uses the traditional RSA cryptosystem to preprocess the plaintext. The
intended receiver Bob has a pair of keys (e, d). We describe the general process of the
encryption scheme as follows.
A. Key Generation
The message sender Alice designs a DNA sequence 20 nucleotides long as a
forward primer for PCR amplification and transmits it to the intended receiver Bob over a secure
channel. The message receiver Bob likewise designs a DNA sequence 20
nucleotides long as a reverse primer for PCR amplification and transmits it to Alice over a
secure channel. After the pair of PCR primers has been designed and exchanged over a
secure communication channel, we have an encryption key KA consisting of the pair of PCR primers
and Bob’s public key e, as well as a decryption key KB consisting of the pair of PCR primers and Bob’s
secret key d.
B. Encryption
First of all, the sender Alice translates the plaintext M into hexadecimal code using the
computer's built-in character code. The hexadecimal code is then translated into the binary plaintext M_ using
third-party software. Finally, Alice translates the binary plaintext M_ into the binary ciphertext
C_ using Bob’s public key e. We call this preprocessing operation data pretreatment.
Through this preprocessing, we can get a completely different
ciphertext from the same plaintext, which can effectively prevent an attack that uses the encoding of a probable word as
PCR primers. Then, Alice translates the binary ciphertext C_ into a DNA sequence according
to the DNA digital coding technology. After coding, Alice synthesizes the secret-message DNA
sequence, which is flanked by forward and reverse PCR primers, each 20 nucleotides
long. Thus, the secret-message DNA sequence is prepared. The last step of the encryption
is that Alice generates a certain number of dummies and puts the secret-message DNA
sequence among them. Each dummy must have the same structure as the secret-message
DNA sequence. In this scheme, the dummies are generated by sonicating human DNA to
roughly 60 to 160 nucleotide pairs (average size) and denaturing it. After mixing the secret-message
DNA sequence with a certain number of dummies, Alice sends the DNA mixture to
Bob over an open communication channel.
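The encryption steps after the RSA stage can be modelled in software as follows. This is an in-silico sketch under several assumptions, not the laboratory procedure: the binary ciphertext C_ is taken as given (RSA is not performed here), strands are written 5′ to 3′ so that the reverse primer appears as its reverse complement at the 3′ end, random sequences of 60 to 160 bases stand in for sonicated human DNA, and all function names are hypothetical.

```python
import random

PATTERN = "CTAG"  # DNA digital coding 0123/CTAG
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def bits_to_dna(bits):
    """Two bits per nucleotide using the pattern 0123/CTAG."""
    return "".join(PATTERN[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))


def random_dna(length, rng):
    return "".join(rng.choice("ACGT") for _ in range(length))


def build_message_strand(cipher_bits, forward_primer, reverse_primer):
    """Encoded message flanked by the two primer sites."""
    return forward_primer + bits_to_dna(cipher_bits) + reverse_complement(reverse_primer)


def hide_among_dummies(strand, n_dummies, rng):
    """Mix the message strand with dummy strands and shuffle the pool."""
    pool = [random_dna(rng.randint(60, 160), rng) for _ in range(n_dummies)]
    pool.append(strand)
    rng.shuffle(pool)
    return pool


if __name__ == "__main__":
    rng = random.Random(2014)        # fixed seed only to make the demo repeatable
    forward = random_dna(20, rng)    # stand-in for Alice's primer
    reverse = random_dna(20, rng)    # stand-in for Bob's primer
    strand = build_message_strand("01000111" * 16, forward, reverse)
    mixture = hide_among_dummies(strand, 100, rng)
    print(len(strand), "bases in the message strand;", len(mixture), "strands in the mixture")
```

A real implementation would use a cryptographically secure random source and primers that satisfy the design rules of Section 1.4; `random.Random` is used here only for a repeatable illustration.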
C. Decryption
After the intended receiver Bob gets the DNA mixture, he can easily find the secret-message
DNA sequence. Since Bob obtained the correct pair of PCR primers
in a secure way, he can amplify the secret-message DNA sequence by performing PCR on the
DNA mixture. After amplifying the secret-message DNA sequence, Bob can retrieve the
plaintext M sent by Alice by reversing the preprocessing operation with his secret key d. This
decryption process is not only a mathematical computation, but also a biological process. The
data pretreatment flow chart is shown in Fig. 1.12.1.
Fig.1.12.1 Data pre(post)treatment flow chart
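The biological part of decryption can be imitated in software by selecting, from the mixture, the strands that carry both primer sites. The sketch below is only an analogy for PCR under stated assumptions: strands are written 5′ to 3′, the reverse primer site appears as the reverse complement of the reverse primer, matching is exact, the coding pattern is 0123/CTAG, and the function names and example primers are hypothetical. The RSA decryption with the secret key d would follow and is not shown.

```python
PATTERN = "CTAG"  # DNA digital coding 0123/CTAG
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def dna_to_bits(dna):
    """Two bits per nucleotide using the pattern 0123/CTAG."""
    return "".join(format(PATTERN.index(base), "02b") for base in dna)


def amplify(mixture, forward_primer, reverse_primer):
    """Return the payloads of strands flanked by both primer sites.

    Strands lacking either site are ignored, mirroring the fact that PCR
    does not amplify them efficiently.
    """
    tail = reverse_complement(reverse_primer)
    payloads = []
    for strand in mixture:
        long_enough = len(strand) >= len(forward_primer) + len(tail)
        if long_enough and strand.startswith(forward_primer) and strand.endswith(tail):
            payloads.append(strand[len(forward_primer):len(strand) - len(tail)])
    return payloads


if __name__ == "__main__":
    forward = "ATGCTAGCTAGGATCCATGC"   # illustrative primers only
    reverse = "GGATCCTTAGCATCGATACG"
    mixture = [
        "ACGT" * 20,
        forward + "TCTG" + reverse_complement(reverse),
        "TTTT" * 25,
    ]
    for payload in amplify(mixture, forward, reverse):
        print(payload, "->", dna_to_bits(payload))
```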
In the following part of this section, we discuss the details of this encryption scheme
with an example shown in Fig. 1.12.2. The result of the PCR amplification is shown in Fig.
1.12.3.
Step 1: Key Generation. The message sender Alice and the message receiver Bob each
design one PCR primer and exchange them over a secure communication channel. The
encryption and decryption keys are this pair of PCR primers. In this scheme, the
primer pair is not designed independently by the sender or the receiver, but
jointly by both. This increases the security of the
encryption scheme, because even if an adversary somehow obtains one primer of the pair,
amplification is not efficient when the other primer is incorrect; only when both
primer sequences are correct can the amplification succeed.
Step 2: Data pretreatment. Here we choose “GENECRYPTOGRAPHY” (gene cryptography) as
the plaintext to encrypt. We first convert this text into hexadecimal code using the computer's built-in
character code (ASCII), that is: “47 45 4E 45 43 52 59 50 54 4F 47 52 41 50 48 59”. Then we translate
the hexadecimal code into the binary plaintext M_ using third-party software, that is:
01000111 01000101 01001110 01000101
01000011 01010010 01011001 01010000
01010100 01001111 01000111 01010010
01000001 01010000 01001000 01011001
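The hexadecimal and binary forms above can be reproduced directly. This is a small sketch assuming ASCII as the character code; the function names are hypothetical.

```python
def to_hex(text):
    """Space-separated hexadecimal ASCII codes, two digits per character."""
    return " ".join(f"{byte:02X}" for byte in text.encode("ascii"))


def to_binary(text):
    """Space-separated 8-bit binary ASCII codes."""
    return " ".join(f"{byte:08b}" for byte in text.encode("ascii"))


if __name__ == "__main__":
    plaintext = "GENECRYPTOGRAPHY"
    print(to_hex(plaintext))
    groups = to_binary(plaintext).split()
    for i in range(0, len(groups), 4):   # four bytes per row, as in the text
        print(" ".join(groups[i:i + 4]))
```

The 16 characters give 16 bytes, that is, 128 bits of binary plaintext M_.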
Fig. 1.12.2. Flow chart of Encryption scheme system.
Fig. 1.12.3. Result of the PCR amplification
Step 3: Encryption. Alice encrypts the binary plaintext M_ into the binary ciphertext C_
using Bob’s public key e. After that, Alice converts the binary ciphertext C_ into a DNA
sequence using the DNA digital coding technology. Finally, she prepares a secret-message DNA sequence
containing an encoded message 64 nucleotides long, flanked by forward and reverse PCR
primers. Thus, the secret-message DNA is prepared. After mixing the secret-message DNA
sequence with a certain number of dummies, Alice sends the DNA mixture to Bob using an open
communication channel, such as DNA ink or a DNA book.
Step 4: Decryption. After the intended receiver Bob gets the DNA mixture, he can easily pick
out the secret-message DNA sequence by using the correct primer pairs. Bob translates the
secret-message DNA sequence into the binary ciphertext C_ by using the DNA digital coding
technology. Then, Bob can decrypt the binary ciphertext C_ into the binary plaintext M_ by
using his secret key d.
Step 5: Data post-treatment. After the binary plaintext M_ has been recovered, Bob can retrieve
the plaintext M, “GENECRYPTOGRAPHY”, from the binary plaintext M_ by using data
post-treatment.
1.13 The codes
The three codes described in detail in this paper are referred to as the Huffman code, the comma
code and the alternating code. It should be stated at the outset that none of them fulfill all the
criteria listed above. The Huffman code is the most economical and would be the best for
encrypting text for short-term storage, providing that this text lacked any sort of punctuation,
symbols or numbers. Both the comma code and the alternating code, while the most
uneconomical of the codes, have the advantage that they generate base sequences which are
obviously artificial, and so would be best suited to the encryption of information for long-term
storage.
1.13.1 The Huffman code
By varying the number of symbols allotted to a character in a code, with the most frequent
character being given the fewest symbols and the least frequent the most
symbols, it is possible to construct very economical codes, i.e. codes in which the text is
encrypted by the minimum number of symbols, so that it is as short as it can possibly be. One of the
best ways of constructing an economical code is to use Huffman’s method (Huffman 1952). As
well as being compact, the message generated by a Huffman code is unambiguous. That is, once
the start point has been specified, there is only one way in which the stream of symbols
comprising the message can be read. The Huffman code constructed with the four DNA bases A,
G, C and T for the letters of the English alphabet is shown in Table 1.13.1. Given the frequencies
of occurrence of these letters, such a code is straightforward to construct (Materials and
methods). In the code, the shortest codon is just one base long (representing e, the most
frequently used letter in the English language), and the longest codon is five bases long
(representing q and z, the most infrequent letters in the English language). The average codon
length is 2.2 bases, shorter than the codons of any of the other codes described in this paper. The
unambiguous nature of the Huffman code shown in Table 1.13.1 can be seen by encoding any group
of letters with it and then decoding them from the beginning of the sequence: there is only one
way it can be done. For instance, the base sequence CATGTAGTCG can only be read from the
beginning as hester; no other interpretation of the message is possible. Given a suitable start
signal, the alternating code is also unambiguous. Of the three codes discussed here, the
Huffman code makes the most economical use of DNA, but it does have two disadvantages. The first is
that it does not cater for any symbols or numbers, as the frequency of these characters will be
heavily text-dependent. Consequently they cannot be included when deriving the Huffman code.
The second disadvantage of the Huffman code relates to its possible use in long-term storage of
information. Because of the variable length of the codons, no obvious pattern emerges when
they are joined together to encode a message. The naive investigator might confuse it with
natural DNA and therefore not appreciate its significance. One could counteract this problem by
using three instead of four bases (e.g. A, C and T), at the expense of economy. The Huffman
code is the only code discussed in this paper with variable length codons. The others all have
fixed length codons. We note that, in a similar manner to the above, the Huffman code has also
been used to construct a ‘perfect’ genetic code comprising variable length codons.
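To illustrate how a Huffman code over the four bases can be built, the sketch below applies Huffman's method with four-way merging. It is not a reconstruction of Table 1.13.1: the letter frequencies are approximate values assumed here for illustration, the assignment of bases to branches is arbitrary, and the function names are hypothetical, so the individual codons will generally differ from those in the table.

```python
import heapq
from itertools import count

# Approximate English letter frequencies in percent (an assumption for illustration).
FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.10, "z": 0.07,
}


def huffman_dna(freq, bases="AGCT"):
    """Build a prefix-free code mapping each symbol to a string of bases."""
    k = len(bases)
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(weight, next(tiebreak), {symbol: ""}) for symbol, weight in freq.items()]
    # Pad with zero-weight dummies so that every merge can take exactly k nodes.
    while (len(heap) - 1) % (k - 1) != 0:
        heap.append((0.0, next(tiebreak), {}))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(k)]
        merged = {}
        for base, (_, _, codes) in zip(bases, group):
            for symbol, suffix in codes.items():
                merged[symbol] = base + suffix
        heapq.heappush(heap, (sum(node[0] for node in group), next(tiebreak), merged))
    return heap[0][2]


def encode(text, code):
    return "".join(code[ch] for ch in text)


def decode(dna, code):
    """Read from the start; a prefix-free code leaves only one possible reading."""
    inverse = {codon: symbol for symbol, codon in code.items()}
    letters, buffer = [], ""
    for base in dna:
        buffer += base
        if buffer in inverse:
            letters.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("sequence ends in the middle of a codon")
    return "".join(letters)


if __name__ == "__main__":
    code = huffman_dna(FREQ)
    message = encode("hester", code)
    print(message, "->", decode(message, code))
```

With these assumed frequencies the letter e receives a one-base codon and the rarest letters receive the longest codons, which mirrors the structure described above.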
Table 1.13.1 The Huffman code
1.13.2 The comma code
In the comma code, consecutive 5-base codons are separated by a single base, the comma, which
is always the same: e.g. G-----G-----G-----G. The repetition of G every six
bases must be construed by any careful sequence analyst as a deliberate device. The codons that
slot into the gaps in the above framework are made up of the remaining bases C, A and T, but
not G, e.g. ATCAC. These codons are further restricted to three A:T base pairs and two G:C
base pairs, with the C of the latter always being located in the top strand. This kind of
arrangement, suggested by unrelated work, has the advantage that it will generate a set of codons
with isothermal melting temperatures, facilitating the construction of message DNA (‘Criteria
for an optimal code’, above). The codons take the general form CWWWC, where W = A or T,
and the C’s and W’s can adopt any arrangement (e.g. WWCWC or WCCWW). There are 80
codons in this set. Most (83%) point mutations give nonsense codons, and therefore the comma
code is good at detecting errors. But the principal attraction of the comma code is the reading
frame established by the regular pattern of repeating G’s. The other two codes described do not
have this advantage, and, unless a start point is specified, it might be difficult to orientate oneself
with respect to the message. With the comma code, the reading frame is clear. Furthermore, it
offers some protection against deletion and insertion mutations, which could further complicate
the interpretation of the other codes. For example, it is not difficult to spot the codon containing
the deletion mutation in the following comma-coded sequence:
GATCACGATTCCGCTATGACTCAG. It should also be noted that the base composition of
the codons will give, when the commas are included, message DNA with the unusual property
of a 1:1 ratio of A:T to G:C base pairs.
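The structure of the comma code can be checked mechanically. The following Java sketch is illustrative only and is not taken from the project's source; the class name CommaCode and its method names are assumptions. It enumerates the codon set described above (five bases drawn from A, C and T, with exactly two C's) and splits a message at each G to find any piece that is not a valid codon.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of the comma code: 5-base codons with two C's and three A/T, separated by G. */
class CommaCode {

    /** A codon is valid if it has 5 bases, exactly two C's, and three bases from {A, T}. */
    static boolean isCodon(String s) {
        if (s.length() != 5) return false;
        int c = 0;
        for (char b : s.toCharArray()) {
            if (b == 'C') c++;
            else if (b != 'A' && b != 'T') return false;
        }
        return c == 2;
    }

    /** Enumerates all 3^5 = 243 strings over {A, C, T} and keeps the valid codons. */
    static List<String> allCodons() {
        char[] bases = {'A', 'C', 'T'};
        List<String> out = new ArrayList<>();
        for (int n = 0; n < 243; n++) {
            StringBuilder sb = new StringBuilder();
            int v = n;
            for (int i = 0; i < 5; i++) {
                sb.append(bases[v % 3]);
                v /= 3;
            }
            if (isCodon(sb.toString())) out.add(sb.toString());
        }
        return out;
    }

    /** Splits a message on the comma base G and returns the pieces that are not valid codons. */
    static List<String> damagedCodons(String message) {
        List<String> bad = new ArrayList<>();
        for (String piece : message.split("G")) {
            if (!piece.isEmpty() && !isCodon(piece)) bad.add(piece);
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println(allCodons().size());                          // 80
        System.out.println(damagedCodons("GATCACGATTCCGCTATGACTCAG"));   // [CTAT]
    }
}
```

Run on the sequence quoted above, the sketch counts 80 codons and flags CTAT, which has lost one base. Splitting on G works only because valid codons never contain G; a G introduced by mutation inside a codon would appear as two short invalid pieces instead.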
1.13.3 The alternating code
The alternating code comprises sixty-four 6-base codons of alternating purines and pyrimidines:
RYRYRY, where R = A or G and Y = C or T (although there is no reason why the purines and
pyrimidines should not alternate YRYRYR, or be fixed in other arrangements such as YYYRRR
or RRYYRY). It is very unlikely that the alternating structure formed by strings of these codons
would go unremarked – even short stretches (8 base pairs) of alternating purines and
pyrimidines have been noted in naturally occurring DNA. As in the comma code, the alternating
structure has the unusual property that, in a given piece of message DNA, the number of G:C
pairs will be the same as the number of A:T pairs. As well as creating message DNA of an
obviously artificial nature, the alternating code has two other advantages of the comma code: it
is isothermal, and it is error-detecting, but less so than the comma code, since 67% of single
point mutations result in nonsense codons. Like the comma code it does not use DNA
economically. Unlike the comma code, there is no automatic reading frame.
The 83% figure for the comma code is obtained as follows. Three possible point mutations can
occur at each of the six positions of the codon GCWWWC (which includes the initial comma), and
therefore there are 18 single point mutations altogether. Of these 18 single point mutations,
three (17%) will produce sense codons (mutation of an A to a T, or vice versa) and therefore the
remaining 83% of single point mutations will give nonsense codons.
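The two error-detection percentages quoted in this section (83% for the comma code and 67% for the alternating code) can be reproduced by brute force. The Java sketch below is an illustration under stated assumptions, not the project's code: the class PointMutations, its method names, and the two sample codons GCATAC and ACGTAC are chosen here only as examples of the forms GCWWWC and RYRYRY.

```java
import java.util.function.Predicate;

/** Illustrative count of single point mutations that turn a valid codon into nonsense. */
class PointMutations {

    /** Comma-code unit: the comma G followed by five bases from {A, C, T} with exactly two C's. */
    static boolean isCommaUnit(String s) {
        if (s.length() != 6 || s.charAt(0) != 'G') return false;
        int c = 0;
        for (int i = 1; i < 6; i++) {
            char b = s.charAt(i);
            if (b == 'C') c++;
            else if (b != 'A' && b != 'T') return false;
        }
        return c == 2;
    }

    /** Alternating codon: purine (A or G) at even positions, pyrimidine (C or T) at odd positions. */
    static boolean isAlternating(String s) {
        if (s.length() != 6) return false;
        for (int i = 0; i < 6; i++) {
            char b = s.charAt(i);
            boolean purine = (b == 'A' || b == 'G');
            boolean pyrimidine = (b == 'C' || b == 'T');
            if (i % 2 == 0 ? !purine : !pyrimidine) return false;
        }
        return true;
    }

    /** Returns {nonsense, total} over every single-base substitution of the given codon. */
    static int[] count(String codon, Predicate<String> isSense) {
        int nonsense = 0, total = 0;
        for (int i = 0; i < codon.length(); i++) {
            for (char b : "ACGT".toCharArray()) {
                if (b == codon.charAt(i)) continue;
                String mutant = codon.substring(0, i) + b + codon.substring(i + 1);
                total++;
                if (!isSense.test(mutant)) nonsense++;
            }
        }
        return new int[] {nonsense, total};
    }

    public static void main(String[] args) {
        int[] comma = count("GCATAC", PointMutations::isCommaUnit);
        int[] alt = count("ACGTAC", PointMutations::isAlternating);
        System.out.println(comma[0] + " of " + comma[1]);   // 15 of 18
        System.out.println(alt[0] + " of " + alt[1]);       // 12 of 18
    }
}
```

Fifteen of 18 is 83% and 12 of 18 is 67%, in agreement with the figures given in the text. In the alternating code each position has exactly one substitution that stays within its purine or pyrimidine class, which is why one third of the mutations go undetected.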
Table 1.13.2 General features of the codes
Table 1.13.3 Advantages of the codes
1.13.4 Other codes
The three codes detailed above are meant to be illustrative rather than exhaustive. They are by
no means the only codes, or the only types of code, possible. Three others are outlined briefly in
this section. Before experimental data for the nature of the genetic code became available, there
were a number of suggestions as to what form it might take. One of these was the comma-free
code. As the name suggests, a comma-free code is just a comma code without the commas. One
might think that removing the commas would give a code without a reading frame. But, by
restricting oneself to a set of fixed-length codons with particular base combinations, the codons
in this set can be chosen such that only one reading frame is ever possible – all the others give
nonsense. For instance, the 3-base codons AGG, ACG and GTG are part of a comma-free code.
Any combination of these codons will give a sequence which can be read in only one way. For
example, in the sequence ACGGTGGTGACGAGG, one could not begin reading one base in, at
CGG, because CGG does not belong to the set. The original paper on the subject showed
that twenty 3-base codons could be selected to act in a comma-free manner. Although twenty
codons is not sufficient to comfortably encrypt text, there is a set of fifty-seven 4-base codons
that would be enough to carry out this task. There is nothing particularly wrong with the comma-
free code as a message-encoding scheme. In fact, since it is quite economical and establishes an
automatic reading frame, it ought to be rather good. However, the only significant clue to the
synthetic nature of message DNA containing text encrypted with a comma-free code would be
the absence of runs of four identical bases (e.g. AAAA), as the comma-free code forbids these.
There are no such absences in natural DNA. Like the alternating and comma codes, the comma-
free code would be error-detecting to a certain extent. One other simple code that should also be
mentioned because it produces DNA that is obviously artificial is one that uses only three
of the four different bases, in a similar manner to the codons of the comma code. In fact,
message DNA has already been constructed with a 3-base codon version of this code. We would
probably use a 4-base codon version of this code, however, to give a larger codon set (3^4 = 81 as
opposed to 3^3 = 27). Finally, perhaps the most obvious code of all is one similar to the genetic
code – a triplet code. Codon assignment in this case may be done in a non-random fashion, such
that a degree of error-protection could be achieved, with error-correcting codons representing
symbols with opposite meaning (e.g. CTT to encode for ’<’ and AAG for ’>’).
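The comma-free property described in this section can be tested directly: a set of fixed-length codons is comma-free if no codon appears at a shifted position inside any two codons placed end to end. The Java sketch below is a minimal illustration; the class CommaFree and its methods are assumed names, and only the three codons quoted in the text are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative check of the comma-free property for a set of n-base codons. */
class CommaFree {

    /** True if no codon appears at a shifted offset inside any concatenation of two codons. */
    static boolean isCommaFree(Set<String> codons, int n) {
        for (String u : codons) {
            for (String v : codons) {
                String uv = u + v;
                for (int off = 1; off < n; off++) {
                    if (codons.contains(uv.substring(off, off + n))) return false;
                }
            }
        }
        return true;
    }

    /** Returns the start offsets (0..n-1) from which every complete n-base window is a codon. */
    static List<Integer> readableFrames(String seq, Set<String> codons, int n) {
        List<Integer> frames = new ArrayList<>();
        for (int start = 0; start < n; start++) {
            boolean ok = true;
            for (int i = start; i + n <= seq.length() && ok; i += n) {
                ok = codons.contains(seq.substring(i, i + n));
            }
            if (ok) frames.add(start);
        }
        return frames;
    }

    public static void main(String[] args) {
        Set<String> codons = new HashSet<>(Arrays.asList("AGG", "ACG", "GTG"));
        System.out.println(isCommaFree(codons, 3));                         // true
        System.out.println(readableFrames("ACGGTGGTGACGAGG", codons, 3));   // [0]
    }
}
```

For the example sequence only the frame starting at the first base is readable; the frames starting one and two bases in fail immediately at CGG and GGT. The sketch assumes every codon in the set has the same length n.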
Chapter 2
Objective
2.1 Objective
The aim of our project is to build a system which fulfills the following objectives:
~ To implement the basic concepts of DNA cryptography.
~ To hide the biological complexity involved in the basic processing of DNA cryptography.
~ To allow users to apply the encoding to textual information.
~ To obtain an encoded text as desired.
Although many encoding techniques are already available, this project aims at
understanding the limitations and configurations involved in applying a new technique (DNA
cryptography) to the encoding of text.
2.2 Product Perspective
The main goal of the project is to implement the basic fundamentals of DNA
cryptography on the Java platform so as to produce an encoding tool capable of applying the
elementary encoding transformations to text. In addition, the project aims to provide a clear
understanding of Java cryptography and its native API.
Chapter 3
System Requirement Analysis
3 System Requirement Analysis
3.1 Characteristics
The important characteristics of the system being developed:
FUNCTIONS
~ Loading the text file from source.
~ Encoding the text using DNA cryptography and PCR
amplifications.
INPUT
~ User input text file for encoder
~ Encoded file for the decoder
OUTPUT
~ Transformed (encoded) text to be sent to the decoder
~ Original text file recovered at the decoder
3.2 System Requirements
The following requirements must be fulfilled to run the software on any computer system.
HARDWARE SPECIFICATIONS
Processor   Intel Pentium III or higher
Monitor     Color monitor, 800 x 600 or higher resolution
Amplifier   PCR (Polymerase Chain Reaction) amplifier
SOFTWARE SUPPORT
Operating System  Windows 9x / XP / NT / 2000, with JVM and JRE installed
Framework         NetBeans 6.0
3.3 Technology Used
Programming Language  Java 5
3.4 Use Case Diagram
3.4.1 Encoder
Fig 3.4.1 Use case diagram (encoder)
3.4.2 Decoder
Fig 3.4.2 Use case diagram (decoder)
Chapter 4
Project Overview
4 Project Overview
Fig 4.1 Project overview
The above figure shows the basic components of a typical general-purpose system used
for DNA cryptography. The function of each component is described below.
The computer is a general computer that can range from a PC to a supercomputer. In dedicated
applications sometimes specialized computers are used to achieve the desired level of
performance.
Text File is a user input that has to be encoded.
PCR Amplifier is the hardware component that will be used for converting the text into a
graphical format which reduces the space consumed. It consists of specialized modules that
perform specific tasks.
[Fig 4.1 block labels: Computer, Text File, PCR Amplifier, Network (To Receiver)]
Chapter 5
Software Design
5 Software Design
5.1 Methodology of Encryption Scheme
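The encryption process is:
C.T. = E_{K_A}(P.T.)
The decryption process is:
D_{K_B}(C.T.) = D_{K_B}(E_{K_A}(P.T.)) = P.T.
Here P.T. is the plaintext, C.T. is the ciphertext, E_{K_A} denotes encryption under the key K_A,
and D_{K_B} denotes decryption under the key K_B.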
STEPS:
1. Key generation
2. Encryption
3. Decryption
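The round-trip property, that decrypting an encrypted plaintext returns the original plaintext, can be illustrated with a toy example. The Java sketch below is not the scheme implemented in this project, which the text does not specify. It assumes, purely for illustration, a single shared key (so that the encryption and decryption keys are equal), an XOR with that key, and a mapping of two bits per base (A = 00, C = 01, G = 10, T = 11). The class DnaRoundTrip and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

/** Toy illustration of D(E(P.T.)) = P.T. using XOR with a shared key and 2 bits per DNA base. */
class DnaRoundTrip {

    // Assumed mapping for this sketch only: A = 00, C = 01, G = 10, T = 11.
    private static final String BASES = "ACGT";

    /** XORs each plaintext byte with the key, then writes it as four bases. Key must be non-empty. */
    static String encrypt(String plainText, byte[] key) {
        byte[] data = plainText.getBytes(StandardCharsets.UTF_8);
        StringBuilder dna = new StringBuilder(data.length * 4);
        for (int i = 0; i < data.length; i++) {
            int b = (data[i] ^ key[i % key.length]) & 0xFF;
            for (int shift = 6; shift >= 0; shift -= 2) {
                dna.append(BASES.charAt((b >> shift) & 3));
            }
        }
        return dna.toString();
    }

    /** Reads four bases per byte and XORs with the same key. Assumes input contains only A, C, G, T. */
    static String decrypt(String dna, byte[] key) {
        byte[] data = new byte[dna.length() / 4];
        for (int i = 0; i < data.length; i++) {
            int b = 0;
            for (int j = 0; j < 4; j++) {
                b = (b << 2) | BASES.indexOf(dna.charAt(4 * i + j));
            }
            data[i] = (byte) (b ^ key[i % key.length]);
        }
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] key = {0x5A, 0x3C};
        String cipherText = encrypt("DNA cryptography", key);
        System.out.println(cipherText);
        System.out.println(decrypt(cipherText, key));   // DNA cryptography
    }
}
```

A repeating-key XOR of this kind offers no real security; it is used here only because it makes the relation between the encryption and decryption steps easy to follow.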
5.2 Flow Diagrams
5.2.1 Encoder
Fig 5.2.1 Flow diagram (encoder)
5.2.2 Decoder
Fig 5.2.2 Flow diagram (decoder)
5.3 Class Diagrams
5.3.1 Encoder
Fig 5.3.1 Class diagram (encoder)
5.3.2 Decoder
Fig 5.3.2 Class diagram (decoder)
5.3.3 KeyGen
Fig 5.3.3 Class diagram (KeyGen)
Chapter 6
Software Testing
6 Testing
6.1 Testing Methodology
Software testing is a critical element of software quality assurance and represents the ultimate
review of specification, design and coding. It is used to detect errors. Testing is a dynamic
method for verification and validation, where the system to be tested is executed and the
behavior of the system is observed.
6.2 Testing Objectives
1. Testing is a process of executing a program with the intent of finding an error.
2. A good test case is one that has a high probability of finding an as-yet-
undiscovered error.
3. A successful test is one that uncovers an as-yet-undiscovered error.
The above objectives imply a dramatic change in viewpoint. They move counter
to the commonly held view that a successful test is one in which no errors are
found. Our objective is to design tests that systematically uncover different
classes of errors and do so with a minimum amount of time and effort.
6.3 Testing Technique
The techniques followed throughout the testing of the system are as follows:
6.3.1 Black-Box Testing
Black box testing focuses on the functional requirements of the software. That is, Black Box
testing enables the software engineer to derive sets of input conditions that will fully exercise all
functional requirements for a program. Black Box Testing is not an alternative to white-box
techniques. Rather, it is a complementary approach that is likely to uncover a different class of
errors than white-box methods. Black-Box Testing attempts to find errors in the following
categories:
~ Incorrect or missing functions.
~ Interface errors.
~ Errors in data structures or external database access.
~ Performance errors.
~ Initialization and termination errors.
Unlike White Box Testing, which is performed early in the testing process, Black Box Testing
tends to be applied during later stages of testing. Because Black Box Testing purposely
disregards control structure, attention is focused on the information domain. Tests are designed
to answer the following questions:
~ How is functional validity tested?
~ What classes of input will make good test cases?
~ Is the system particularly sensitive to certain input values?
~ How are the boundaries of a data class isolated?
~ What data rates and data volume can the system tolerate?
~ What effect will specific combinations of data have on system operation?
By applying black box techniques, we derive a set of test cases that satisfy the following criteria:
~ Test cases that reduce, by a count that is greater than one, the number of
additional test cases that must be designed to achieve reasonable testing, and
~ Test cases that tell us something about the presence or absence of classes of
errors, rather than errors associated only with the specific test at hand.
6.3.2 White-Box Testing
White Box Testing uses knowledge of the internal workings of a product: tests can be conducted
to ensure that internal operations are performed according to specifications and that all internal
components have been adequately exercised.
Using white box testing methods, test cases can be derived that:
~ Exercise all independent paths within a module at least once.
~ Exercise all logical decisions on their true and false sides.
~ Execute all loops at their boundaries and within their operational bounds.
~ Exercise internal data structures to ensure their validity.
6.3.3 Control Structure Testing
6.3.3.1 Condition Testing
Condition testing is a test case design method that exercises the logical conditions
contained in a program module. If a condition is incorrect then at least one component of the
condition is incorrect. Therefore types of errors in a condition include the following:
~ Boolean operator error
~ Boolean variable error
~ Boolean parenthesis error
~ Relational operator error
~ Arithmetic expression error
6.3.3.2 Loop Testing
Loops are the cornerstone of the vast majority of algorithms implemented in software. Loop
testing is a white-box testing technique that focuses exclusively on the validity of loop
constructs. Four different classes of loops can be defined:
~ Simple Loops
~ Nested Loops
~ Concatenated Loops
~ Unstructured Loops
6.3.3.3 Dataflow Testing
The dataflow testing method selects test paths of a program according to the location of
definitions and uses of variables in the program. In this testing approach, assume that each
statement in a program is assigned a unique statement number and that each function does not
modify its parameters or global variables.
It is useful for selecting test paths of a program containing nested if and loop statements. This
approach is effective for error detection. However, the problems of measuring test coverage and
selecting test paths for data flow testing are more difficult than the corresponding problems for
condition testing.
6.4 Testing Strategies
A strategy for software testing integrates software test case design methods into a well planned
series of steps that result in the successful construction of software. A software testing strategy
should be flexible enough to promote a customized testing approach.
6.4.1 Unit Testing
Unit testing focuses verification efforts on the smallest unit of software design. It is white box
oriented. Unit testing is essentially for verification of the code produced during the coding phase
and hence the goal is to test the internal logic of the module. A module is considered for
integration and use by others only after it has been unit tested satisfactorily.
~ The module interface is tested to ensure that information properly flows in and
out of the program.
~ The local data structure is examined to ensure that data stored temporarily maintains
its integrity.
~ Boundary conditions are tested to ensure that modules operate properly at
the boundary limits of processing.
~ All independent paths are exercised to ensure all statements in a module have
been executed at least once.
~ All error-handling paths are tested.
6.4.2 Integration Testing
Integration testing focuses on the design and construction of the software architecture. We
followed a systematic technique for constructing the program structure, that is, “putting the
modules together” (interfacing) while at the same time conducting tests to uncover errors. We
took unit tested components and built the program structure dictated by the design.
6.4.3 Validation Testing
Validation testing is achieved through a series of Black Box tests. An important element of the
validation process is the configuration review, which is intended to ensure that all the elements
are properly configured and cataloged. It is also called an audit.
6.4.4 System Testing
The last high-order testing step falls outside the boundary of software engineering and into the
broader context of computer system engineering. Software, once validated, must be combined
with other system elements (e.g., hardware, people, and databases). System testing verifies that all
elements mesh properly and that overall system function/performance is achieved.
It is a series of different tests whose primary purpose is to fully exercise the computer-based
system. Although each test has a different purpose, all work to verify that system elements have
been properly integrated and perform their allocated functions.
Chapter 7
Project Snapshots
7.1 Text file
Fig 7.1 Snapshot (original text)
7.2 Encoded file
Fig 7.2 Snapshot (encoded text)
7.3 Decoded file
Fig 7.3 Snapshot (decoded text)
Chapter 8
Conclusion
8 Conclusion
The main goal of the project was to study and implement the basic fundamentals of
DNA cryptography on textual information. This project provides an insight into the various
details of DNA and its use for cryptographic purposes. It also provided us with an
opportunity to analyse and practice all the phases of the Software Development Life Cycle.
Chapter 9
Future Prospects & Enhancements
9 Future Prospects and Enhancements
~ This project can be extended to encrypt other data formats.
~ The space complexity can be reduced by practical usage of a PCR amplifier.
~ Ongoing research could be used for the future enhancement of this project.
~ DNA cryptography can be used to prevent cyber crimes like hacking, and to provide
a secure channel for communication.
APPENDIX
Abbreviations Full forms
DNA Deoxyribonucleic Acid
RNA Ribonucleic Acid
PCR Polymerase Chain Reaction
C Cytosine
T Thymine
A Adenine
G Guanine
U Uracil
mRNA Messenger Ribonucleic Acid
tRNA Transfer Ribonucleic Acid
Bibliography
Books & Literature
[1] Herbert Schildt, JAVA2 Complete Reference, Fifth Edition, Tata McGraw-Hill
Publishing Company Limited, 2004
[2] Scott W. Amber, JAVA2 Enterprise Edition 1.4 Bible, Wiley Publishing Inc., 2003
[3] Java 5.0 API Documentation
Websites
[4] Hodorogea Tatiana, Vaida Mircea-Florin, Borda Monica, Streletchi Cosmin,
A Java Crypto Implementation of DNAProvider Featuring Complexity in Theory and
Practice, IEEE 2008
[5] Sherif T. Amin, Magdy Saeb, Salah El-Gindi,
A DNA-based Implementation of YAEA Encryption Algorithm
[6] Guangzhao Cui, Limin Qin, Yanfeng Wang, Xuncai Zhang,
An Encryption Scheme Using DNA Technology, IEEE 2008
[7] Ning Kang, A Pseudo DNA Cryptography Method
[8] Geoff C. Smith, Ceridwyn C. Fiddes, Jonathan P. Hawkins & Jonathan P.L. Cox,
Some possible codes for encrypting data in DNA,
Biotechnology Letters 25: 1125–1130, 2003.