Genes in HIV

This page summarizes the genes in HIV. Each gene is a sequence of RNA that encodes one or more proteins.


The group antigen gene (gag) is found in all retroviruses. It makes the structural proteins that form the virus core and protect its RNA. In HIV, it has three parts: MA (matrix), CA (capsid), and NC (nucleocapsid).


The polymerase gene (pol) is also found in all retroviruses. It makes enzymes necessary for virus replication. In HIV, it also has three parts: PR (protease), RT (reverse transcriptase), and IN (integrase, formerly called endonuclease).


The envelope gene (env) is also found in all retroviruses. It makes the proteins of the viral envelope. In HIV, it has two parts: SU (surface envelope, gp120) and TM (transmembrane envelope, gp41).


The transactivator gene (tat) influences the function of genes some distance away. It transactivates the expression of all HIV genes.


The rev gene is a differential regulator of expression of the virus protein genes.


The virus infectivity factor gene (vif) is required for the infectivity of cell-free virus.


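The negative regulatory factor gene (nef) takes its name from early reports that it retards HIV replication in cell culture; later work found that it promotes infection in vivo.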


The virus protein R gene (vpr) has an undetermined function.


The virus protein U gene (vpu) is required for efficient viral replication and release. It is found only in HIV-1.


The virus protein X gene (vpx) has an undetermined function. It is found only in HIV-2 and SIV.

The following are diagrams of the genes in HIV, visna, and HTLV-I. Each diagram runs along the virus's genetic sequence from left to right. (Imagine all the parts of the diagram collapsed vertically into one line.) The top part of the page shows the positions of the various genes in the virus. The bottom of the page shows the sequence read in three different ways. Lines in these three parts mark "stop codons", explained below. The large blank areas could contain genes; note how they line up with the known genes shown above.

To understand these diagrams, you must know a few facts about how DNA (RNA) is structured:

The DNA sequence can be expressed as a sequence of the letters c, g, a, and t. Genes in the sequence encode proteins. To form a protein, the letters are taken three at a time; each group of three (a codon) specifies an amino acid, according to the genetic code. For example, aaa -> K, cac -> H, and cca -> P, where K, H, and P represent different amino acids. Thus, the DNA sequence aaacaccca would be converted into a protein consisting of three amino acids: KHP.
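
As a minimal sketch of this translation step, the following Python snippet builds the standard genetic code table and converts a sequence three letters at a time. The names CODON_TABLE and translate are illustrative assumptions, not taken from the script used for this page. It assumes lowercase DNA letters (c, g, a, t), marks stop codons with "*", and does not halt at them.

```python
from itertools import product

BASES = "tcag"
# Standard genetic code, listed with each codon position cycling through t, c, a, g.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Map every three-letter codon to its one-letter amino acid code ("*" = stop).
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(seq):
    """Translate a DNA string codon by codon; a trailing partial codon is ignored."""
    seq = seq.lower()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

print(translate("aaacaccca"))  # KHP
```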

Note that a DNA sequence can be read three different ways, depending on which base you use to start. For instance, aaacaccca could be read as "..a aac acc ca..", which would form an entirely different protein sequence. Thus, the same DNA could encode up to three different genes.
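
The three readings can be sketched in a few lines of Python. The helper name codons_in_frame is hypothetical; it simply splits a sequence into complete groups of three, starting at offset 0, 1, or 2, and drops any leftover letters at the end.

```python
def codons_in_frame(seq, frame):
    """Split seq into complete codons, starting at offset frame (0, 1, or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

for frame in range(3):
    print(frame, codons_in_frame("aaacaccca", frame))
# 0 ['aaa', 'cac', 'cca']
# 1 ['aac', 'acc']
# 2 ['aca', 'ccc']
```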

Finally, the sequences taa, tag, and tga are special; they are "stop codons". If one of these is encountered, it ends protein synthesis. Note that a sequence such as ...taa... will only act as a stop codon if it is read in the frame that groups those three letters together. If it is read as ..t aa. ..., the letters fall into two different codons and protein synthesis continues. Also note that in a random DNA sequence, 3 out of 64 codons would be stop codons. Thus, we could expect to hit one about every 21 codons (64/3). Since proteins are normally much longer, there will be an unusual lack of stop codons in the DNA that is a gene specifying a protein.
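
The arithmetic can be checked with a short Python sketch. It assumes a purely random sequence in which all 64 codons are equally likely, which real genomes only approximate; the function names are illustrative.

```python
STOP_FRACTION = 3 / 64  # taa, tag, tga out of 4**3 possible codons

def expected_codons_per_stop():
    """Average number of codons read before hitting a stop in random DNA."""
    return 1 / STOP_FRACTION

def prob_stop_free(n_codons):
    """Chance that n_codons consecutive random codons contain no stop codon."""
    return (1 - STOP_FRACTION) ** n_codons

print(f"{expected_codons_per_stop():.1f}")  # 21.3
print(prob_stop_free(100))                  # roughly 0.008, i.e. under 1%
```

Under this assumption, a stop-free stretch of even 100 codons is unlikely to arise by chance, which is why long stop-free stretches suggest a gene.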

What this means is that we can go through the DNA, starting at position 1, position 2, and position 3, looking for stop codons. If we find a long area without them, then it probably represents a gene. This region is known as an ORF, or "Open Reading Frame".
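
A minimal sketch of this scan in Python follows. It is not the script used to generate the figures; the function name find_open_regions and the min_codons threshold are assumptions for illustration. Like the description above, it looks only for stretches free of stop codons on the given strand; practical ORF finders usually also look for a start codon and check the complementary strand.

```python
STOP_CODONS = {"taa", "tag", "tga"}

def find_open_regions(seq, min_codons=1):
    """Return (frame, start, end) for each stop-free run of at least min_codons codons.

    start and end are 0-based offsets into seq; end is exclusive.
    """
    seq = seq.lower()
    regions = []
    for frame in range(3):
        run_start = frame
        # Visit every complete codon in this reading frame.
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    regions.append((frame, run_start, pos))
                run_start = pos + 3  # the next run begins after the stop codon
        # Close the run that reaches the last complete codon in this frame.
        last = frame + ((len(seq) - frame) // 3) * 3
        if (last - run_start) // 3 >= min_codons:
            regions.append((frame, run_start, last))
    return regions

# Frame 0 is interrupted by taa; frames 1 and 2 read straight through it.
print(find_open_regions("aaacactaacccggg", min_codons=3))
# [(1, 1, 13), (2, 2, 14)]
```

On a real genome, min_codons would be set far above the roughly 21 codons expected between stops in random sequence.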

Here is the Perl script to generate these figures. It takes a Genbank format file and outputs a MIF file that can be read by FrameMaker.

Ken Shirriff:
This page: Copyright 2000 Ken Shirriff.