The Florida dentist

One study claims that HIV was transmitted from a Florida dentist to his patients. (Ou et al., "Molecular epidemiology of HIV transmission in a dental practice", Science, 1992 May 22, 256(5060).) This conclusion was reached by epidemiologic investigation and by comparing the genetic sequences of the dentist's virus with those of the virus in his patients and in local HIV-infected control people. The genetic comparison found that five of the patients had closely related viruses, while other patients had unrelated viruses.

There has been controversy about this conclusion, with some people claiming that the patients' viruses weren't any closer to the dentist's than the local control viruses were.

I decided to look at the DNA sequences myself (you can get them from ncbi.nlm.nih.gov). This data includes sequences from the dentist, from patients A,B,C,D,E,F,G,H, and from 35 local residents. I looked at a sequence of 40 envelope amino acids to see how they compared among the different strains.

Number of the 40 amino acids matching, for each comparison:

Dentist's strains with each other: 36 to 40
Dentist and A: 37 to 39
Dentist and B: 37 to 39
Dentist and C: 38 to 40
Dentist and D: 30 to 31
Dentist and E: 37 to 38
Dentist and F: 31 to 33
Dentist and G: 36 to 38
Dentist and H: 28 to 30
Dentist and locals: average of 32, with the best match being 34 to 36

It should be clear from these numbers that the dentist's strains are very close to those of A, B, C, E, and G; they are as similar to the dentist's strains as the dentist's own strains are to each other. The dentist's strains are not so close to those of D, F, H, or the locals.

This is, of course, not at all rigorous. I just wanted to see if the original study seemed believable to me when I looked at the data myself. It makes me inclined to believe the claims that 5 of the patients were infected by the dentist.

What do these sequences mean?

You can take the genetic material from HIV (its genome is RNA, which is sequenced as DNA) and sequence it to get the genome, which is about 10,000 bases: "ctagcagaa...". This genetic sequence encodes various proteins and enzymes that HIV uses, such as reverse transcriptase. Each group of three letters in the DNA indicates an amino acid, according to the genetic code. So, from the DNA, you can determine the sequence of amino acids in the protein, and indicate each amino acid with a letter: "LAEGVIIRS...". The amino acid sequence controls how the protein folds up and what its properties are.
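As a rough illustration of reading three letters at a time, here is a minimal Python sketch of translating DNA into amino acid letters. The names `CODON_TABLE` and `translate` are made up for this example. The table is the standard genetic code written in the conventional T, C, A, G order, and the sketch assumes the sequence starts on a codon boundary.

```python
# Standard genetic code, listed in the conventional T, C, A, G codon order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)

# Map each of the 64 three-letter codons to its one-letter amino acid.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna):
    """Translate a DNA string into one-letter amino acid codes."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(translate("ctagcagaa"))  # LAE
```

The nine bases quoted above come out as L, A, E, which agrees with the first three letters of the amino acid string quoted above.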

Now, the idea is that as things mutate, you end up with changes in the DNA sequence and changes in the amino acid sequences. For example, you can look at the amino acid sequence in hemoglobin in a bunch of different species, look at how closely related they are, and determine how the species evolved from each other.

What the HIV study looked at was a part of the virus's envelope that mutates very rapidly. The envelope is what antibodies attack, so by changing the envelope, the virus can evade antibodies, which is part of why a vaccine is so hard to make. So, they take the virus, sequence the DNA that specifies the amino acids in the appropriate part of the envelope, and then determine the amino acid sequence from the DNA sequence.

Now, I took these sequences and looked at a subsequence of 40 amino acids from each one.

Dentist, sample 1:   EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIH
Patient A, sample 1: EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIR
Local #13:           EVVIRSENFSDNAKTIIVQLKESVEINCTRPNNNTRKRIT

Then it's just a matter of counting up the matches, to see how closely related they are (39 matches out of 40 in this case between the dentist and A). This comparison can be done more rigorously, of course, which the original paper did. I just wanted to take a quick look. The paper also looked at the DNA sequences as well as the amino acid sequences.
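The counting can be sketched in a few lines of Python. This is only an illustration of the position-by-position comparison described here, not the method used in the paper; `count_matches` is a name made up for this example, and it assumes the sequences are already aligned and of equal length.

```python
def count_matches(seq_a, seq_b):
    """Count positions where two equal-length, pre-aligned sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a == b for a, b in zip(seq_a, seq_b))

# The three 40-amino-acid samples shown above.
dentist_1 = "EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIH"
patient_a = "EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIR"
local_13 = "EVVIRSENFSDNAKTIIVQLKESVEINCTRPNNNTRKRIT"

print(count_matches(dentist_1, patient_a))  # 39
print(count_matches(dentist_1, local_13))   # 33
```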

So the assumption is that the less related the two samples are, the more differences there will be between them. This isn't strictly true, since two unrelated samples could happen to mutate to get closer by chance.

Why does the dentist have several different sequences?

HIV keeps mutating in the body, avoiding antibodies, getting AZT resistance, and things like that. So even different samples taken from the same person will have slightly different sequences. There were 6 samples from the dentist, so depending on which one you looked at, there would be a different number of matches. The dentist's own samples had 36 to 40 matches when compared with each other. This also explains why the match numbers are ranges; the value depends on which dentist sample is used.
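Here is a short Python sketch of where the ranges come from: one sample is compared against every dentist sample, and the lowest and highest match counts are kept. The function names are made up for this example, and the sequences are the 40-amino-acid subsequences from the data listing.

```python
from itertools import combinations

# The six dentist samples from the data listing (several are identical).
DENTIST = [
    "EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIH",  # M90848
    "EIVIRSANFTDNAKIIIVQLNASVEIDCTRPNNNTRKGIH",  # M90849
    "EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIH",  # M90850
    "EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNYTRKGIR",  # M90851
    "EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIH",  # M90852
    "EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNYTRKGIR",  # M90853
]
PATIENT_A = "EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIR"  # M90855

def count_matches(seq_a, seq_b):
    """Count positions where two aligned sequences agree."""
    return sum(a == b for a, b in zip(seq_a, seq_b))

def match_range(sample, references):
    """Lowest and highest match count of one sample against a set of samples."""
    scores = [count_matches(sample, ref) for ref in references]
    return min(scores), max(scores)

def within_group_range(samples):
    """Lowest and highest match count over all pairs of distinct samples."""
    scores = [count_matches(a, b) for a, b in combinations(samples, 2)]
    return min(scores), max(scores)

print(within_group_range(DENTIST))      # (36, 40)
print(match_range(PATIENT_A, DENTIST))  # (37, 39)
```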

The data

The following is the sequence data that I used:

M90848-dentist	EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIH
M90849-dentist	EIVIRSANFTDNAKIIIVQLNASVEIDCTRPNNNTRKGIH
M90850-dentist	EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIH
M90851-dentist	EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNYTRKGIR
M90852-dentist	EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIH
M90853-dentist	EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNYTRKGIR
M90855-patientA	EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIR
M90862-patientB	EIVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIH
M90877-patientC	EVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIH
M90882-patientD	EVVIRSANFSDNAKTIIVQLNKSVNITCVRPNNNTRESIP
M90893-patientE	EIVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIN
M90895-patientF	EVVIRSENFMDNVKTIIVQLNESVQINCTRPNNNTRKSIH
M90902-patientG	EVVIRSANFTDNAKIIIVQLNAPVEINCTRPNNNTRGGIH
M90907-patientH	EVIIRSENFTDNAKTIIVQLNATINIICERPHNNTRKSIH
M90916-control	EVVIRSENFTNNAKIIIVHLNKTVNITCTRPNNNTRRSIP
M90917-control	EIVIRSANFTDNTKIIIVQLNESVEINCTRPNNYTGKRLS
M90918-control	EIVIRSANFTDNTKIIIVQLNESVKINCTRPSNNTRKSIP
M90919-control	EIVIRSANFTDNTKIIIVQLNESVEINCTRPNNYTGKRLS
M90920-control	EIVIRSANFTDNTKIIIVQLNESVEINCTRPSNYTGKRLS
M90921-control	EIVIRSANFTDNTKIIIVQLNESVEINCTRPSNNTRKSIP
M90922-control	EIVIRSANFTDNTKIIIVQLNESVEINCTRPSNNTRKSIP
M90924-control	EVVIRSENFTDNTKTIIVQLNTSVTINCTRPGNNTRKSIT
M90925-control	EVIIRSENFTDNTKTIIVQLNTSVTINCTRPGNNTRKSIT
M90926-control	EVVIRSENFTDNTKTIIIQLNTSVTINCTRPGNNTRKSIT
M90927-control	EVVIRSENFTNNAKTIIVQLNTSVTINCTRPGNNTRKSIT
M90928-control	EIVIRSANFTDNTKIIIVQLNESVEINCTRPSNNTSKSIH
M90929-control	EVVIRSENFTNNAKTIIVQLNTSVTINCTRPGNNTRKSIT
M90932-control	EVVIRSENFTNNAKTIIVQLKESVKINCTRPNNNTRKSIN
M90933-control	EVVIRSENFTDNAKTIIVQLNNSVVINCTRPNNNTRRSVH
M90934-control	EVVIRSENFTNNAKTIIVQLKESVKINCIRPNNNTRRSIN
M90936-control	EVVIRSENFTNNSKTIIVQLKESVVINCTRPNNNTRRSIH
M90938-control	EVVIRSENFSDNAKTIIVQLKESVEINCTRPNNNTRKRIT
M90939-control	EVVIRSENFTNNAKTIIVQLNVSVEINCTRPNNNTRKGIH
M90940-control	EVVIRSENFTDNAKTIIVQLKEPVEINCTRPSNNTRKGIP
M90943-control	EVVIRSDNFTDNVKTIIVQLNEAVVINCTRPNNNTRRGIH
M90944-control	EVVIRSENFTDNAKTIIVQLNESIEINCTRPNNNTRKSIP
M90945-control	EVIIRSENLTDNAKTIIVQLKEPVIINCTRPNNNTRKSIH
M90950-control	EVVIRSENISDNAKTIIVQLNESVVINCTRPNNNTRRSIH
M90951-control	EVVIRSDNFSDNARTIIVQLNESVVINCTRPNNNTSRRIS
M90952-control	EVVIRSENFTDNAKTIIVQLNQSVEINCTRPNNNTRRSIH
M90957-control	EIVIRSENFTNNARTIIVHLNESIVINCTRPNNNTGKSIH
M90959-control	EIVIRSDNFTDNAKTIIVQLNQTVEINCTRPNNNTRKSIH
M90962-control	EVVIRSKNFTDNAKTIIVQLNESVAINCSRPNNNTRKGIH
M90963-control	EIVIRSENFTDNLKNIIVQLKEPVEINCTRPGNNTRRSIH

You can simply count the matches to see where I got my numbers; this isn't rocket science. As mentioned above, I only used a short part of the sequence. You can get all the original sequences from ncbi.nlm.nih.gov if you want to do comparisons on your own.
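If you would rather not count by hand, here is a minimal Python sketch that reads lines in the format of the listing above (a label such as "M90848-dentist", whitespace, then the sequence) and reports the range of matches against the dentist for each group. The function names are made up for this example, and only a few lines of the listing are included to keep it short; paste in the rest to get the full numbers.

```python
from collections import defaultdict

# A few lines copied from the listing; the full listing can be pasted here.
LISTING = """\
M90848-dentist\tEVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIH
M90849-dentist\tEIVIRSANFTDNAKIIIVQLNASVEIDCTRPNNNTRKGIH
M90851-dentist\tEVVIRSANFTDNAKIIIVQLNASVEINCTRPNNYTRKGIR
M90855-patientA\tEVVIRSANFTDNAKIIIVQLNASVEINCTRPNNNTRKGIR
M90882-patientD\tEVVIRSANFSDNAKTIIVQLNKSVNITCVRPNNNTRESIP
M90938-control\tEVVIRSENFSDNAKTIIVQLKESVEINCTRPNNNTRKRIT
M90939-control\tEVVIRSENFTNNAKTIIVQLNVSVEINCTRPNNNTRKGIH
"""

def parse_listing(text):
    """Group sequences by the part of the label after the hyphen."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        label, sequence = line.split()
        _identifier, group = label.split("-", 1)
        groups[group].append(sequence)
    return dict(groups)

def count_matches(seq_a, seq_b):
    """Count positions where two aligned sequences agree."""
    return sum(a == b for a, b in zip(seq_a, seq_b))

def summarize(groups, reference="dentist"):
    """For each other group, the (lowest, highest) match count against the reference group."""
    summary = {}
    for group, sequences in groups.items():
        if group == reference:
            continue
        scores = [count_matches(s, r) for s in sequences for r in groups[reference]]
        summary[group] = (min(scores), max(scores))
    return summary

for group, (low, high) in summarize(parse_listing(LISTING)).items():
    print(f"dentist vs {group}: {low} to {high} of 40")
```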

Ken Shirriff: shirriff@eng.sun.com