Genes in HIV
This page summarizes the genes in HIV. Each gene is a stretch of the
virus's RNA that encodes one or more proteins.
GAG
The group antigen gene is found in all retroviruses. It makes the
structural proteins that form the virus core and protect its RNA. In
HIV, it has three parts: MA (matrix), CA (capsid), and NC (nucleocapsid).
POL
The polymerase gene is also found in all retroviruses. It makes enzymes
necessary for virus replication. In HIV, it also has three parts:
PR (protease), RT (reverse transcriptase), and IN (integrase, formerly
called endonuclease).
ENV
The envelope gene is also found in all retroviruses. It makes the
proteins of the virus's envelope. In HIV, it has two parts: SU (surface
envelope, gp120) and TM (transmembrane envelope, gp41).
tat
The transactivator gene influences the function of genes some distance
away. It activates expression of all HIV proteins.
rev
The rev gene is a differential regulator of expression of the virus's
protein genes.
vif
The virus infectivity factor gene is required for cell-free virus to be
infectious.
nef
The negative regulator factor retards HIV replication.
vpr
The virus protein R gene has an undetermined function.
vpu
The virus protein U gene is required for efficient viral replication and
release. It is found only in HIV-1.
vpx
The virus protein X gene has an undetermined function. It is found only
in HIV-2 and SIV.
The following are diagrams of the genes in HIV, visna, and HTLV-I.
These diagrams are structured with the virus's
genetic sequence going from left to right. (Imagine all the parts of
this diagram collapsed vertically into one line.)
The top part of the page shows the position of the various genes in the
virus.
The bottom part of the page shows the sequence read in three different
ways, one for each possible starting position. Lines in these three
parts mark "stop codons", explained below.
The large blank areas, which contain no stop codons, could contain
genes; note how they line up with the known genes shown above.
To understand these diagrams, you must know a few facts about how DNA
(RNA) is structured:
The DNA sequence can be expressed as a sequence of the letters c, g, a,
and t. (In RNA, u takes the place of t.)
Genes in the sequence represent proteins. To form a protein, the
letters are taken three at a time; each group of three specifies an
amino acid, according to the genetic code. E.g. aaa -> K, cac -> H, cca
-> P, where K, H, and P represent different amino acids. Thus,
the DNA sequence aaacaccca would be converted into a protein consisting
of three amino acids: KHP.
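The three-letters-at-a-time rule can be sketched in a few lines of
Python. This is only an illustration: the table below holds just the
three codons named above (the full genetic code has 64 entries), and
the names CODON_TABLE and translate are made up for this example.

```python
# Partial codon table: only the three codons named in the text.
# The real genetic code has 64 entries; this subset is an assumption
# made to keep the example short.
CODON_TABLE = {"aaa": "K", "cac": "H", "cca": "P"}


def translate(dna):
    """Translate a lowercase DNA string, three letters at a time."""
    protein = []
    # Step through complete codons only; leftover letters are ignored.
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        # Codons missing from the partial table are shown as "?".
        protein.append(CODON_TABLE.get(codon, "?"))
    return "".join(protein)


print(translate("aaacaccca"))  # KHP
```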
Note that a DNA sequence can be read three different ways, depending on
which base you use to start. For instance, aaacaccca could be read as "..a
aac acc ca..", which would form an entirely different protein sequence.
Thus, the same DNA could encode up to three different genes.
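To make the three readings concrete, here is a small Python sketch that
splits a sequence into codons starting at the first, second, and third
base. The function name reading_frames is hypothetical, and incomplete
codons at the end of a frame are simply dropped.

```python
def reading_frames(dna):
    """Return the complete codons for each of the three start offsets."""
    frames = []
    for offset in range(3):
        # Take three letters at a time, starting at this offset.
        codons = [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]
        frames.append(codons)
    return frames


for number, codons in enumerate(reading_frames("aaacaccca"), start=1):
    print(number, " ".join(codons))
# 1 aaa cac cca
# 2 aac acc
# 3 aca ccc
```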
Finally, the sequences taa, tag, and tga are special; they are "stop
codons". If one of these is encountered, it ends protein synthesis.
Note that a sequence such as ...taa... acts as a stop codon only if the
reading groups those three letters together. If it is read as
..t aa. ..., the letters are split across two codons and protein
synthesis continues. Also note that in a random DNA sequence, 3 out of
64 codons would be stop codons. Thus, we could expect to hit one about
every 21 codons (64/3). Since proteins are normally much longer than
that, DNA that is a gene specifying a protein will show an unusual lack
of stop codons.
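The arithmetic can be checked with a short Python sketch. It assumes
that all 64 codons are equally likely and independent of each other,
which is an idealization of "random DNA"; the name prob_no_stop is made
up for this example.

```python
# taa, tag, and tga are 3 of the 64 possible codons.
STOP_FRACTION = 3 / 64

# Average number of codons between stops in a uniformly random sequence.
expected_spacing = 1 / STOP_FRACTION


def prob_no_stop(n_codons):
    """Chance that n_codons random codons in a row contain no stop."""
    return (1 - STOP_FRACTION) ** n_codons


print(round(expected_spacing, 1))  # 21.3
print(prob_no_stop(100) < 0.01)    # True
```

Under these assumptions, a run of 100 codons with no stop turns up by
chance less than 1% of the time, which is why long stop-free stretches
stand out.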
What this means is that we can go through the DNA, starting at position
1, position 2, and position 3, looking for stop codons. If we find a
long area without them, then it probably represents a gene. This region
is known as an ORF, or "Open Reading Frame".
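The search just described might look like the following Python sketch.
It is not the Perl script that drew the figures; the function name
open_regions and the min_codons cutoff are assumptions chosen for
illustration. Like the description above, it looks only for stop-free
runs and does not check for a start codon.

```python
STOP_CODONS = {"taa", "tag", "tga"}


def open_regions(dna, min_codons=100):
    """List stop-free runs of at least min_codons codons.

    Each result is (frame, start, end): frame is 1, 2, or 3, and
    start/end are 0-based positions in dna, with end exclusive.
    min_codons is an arbitrary cutoff, not a value from the text.
    """
    dna = dna.lower()
    found = []
    for offset in range(3):
        run_start = offset
        # Walk the complete codons of this frame.
        for i in range(offset, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                # A stop codon closes the current run.
                if (i - run_start) // 3 >= min_codons:
                    found.append((offset + 1, run_start, i))
                run_start = i + 3
        # The last run extends to the final complete codon of the frame.
        end = offset + ((len(dna) - offset) // 3) * 3
        if (end - run_start) // 3 >= min_codons:
            found.append((offset + 1, run_start, end))
    return found


# Frame 1 is interrupted by taa; frames 2 and 3 have no stop codons.
print(open_regions("aaacactaacccaaa", min_codons=3))
# [(2, 1, 13), (3, 2, 14)]
```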
Here is the Perl script used to generate these figures. It takes a
GenBank-format file and outputs a MIF file that can be read by
FrameMaker.
Ken Shirriff:
shirriff@eng.sun.com
This page:
http://www.righto.com/theories/hiv_genes.html
Copyright 2000 Ken Shirriff.