Re: [ccp4bb] an over refined structure
Subject: Re: an over refined structure
From: Dale Tronrud det102 {- at -} UOXRAY {- dot -} UOREGON {- dot -} EDU
Date: 2008-02-12

> Dirk Kostrewa wrote:
>> Dear Dean and others,
>>
>> Peter Zwart gave me a similar reply. This is a very interesting
>> discussion, and I would like to have a somewhat closer look at this to
>> maybe make things a little bit clearer (please, excuse the general
>> explanations - this might be interesting for beginners as well):
>>
>> 1). Crystallographic symmetry can be applied to the whole crystal and
>> results in symmetry-equivalent intensities in reciprocal space. If you
>> refine your model in a lower space group, there will be reflections in
>> the test-set that are symmetry-equivalent in the higher space group to
>> reflections in the working set. If you refine the
>> (symmetry-equivalent) copies in your crystal independently, they will
>> diverge due to resolution and data quality, and R-work and R-free will
>> diverge to some extent due to this. If you force the copies to be
>> identical, the R-work & R-free will still be different due to
>> observational errors. In both cases, however, the R-free will be very
>> close to the R-work.
>>
> Ah- that's going way too fast for the beginners, at least one of them!
> Can someone explain why the R-free will be very close to the R-work,
> preferably in simple concrete terms like Fo, Fc, at sym-related
> reflections, and the change in the Fc resulting from a step of refinement?
>
> Ed

Dear Ed,

Some years ago I was castigated in a group meeting for stating that the question posed by a post-doc was a "bad question". I gather this is considered rude behavior. My belief is that if you say "good question" to all questions you degrade the value of those truly "good questions" when they come along. Yours is a "good question" and demands a proper answer. Like all good questions, however, the answer is neither easy nor short.
I'm going to make a stab at it, and I may end up far from the mark, but I'm sure someone will point out my failings in follow-up letters. At least I'll get these ideas out of my head so I can get back to my real work.

The other attempts to answer this question, including my own, have included terms such as "error" and "bias" and, without definitions for these terms, are ultimately unsatisfying. It seems to me that the whole point of refinement is to bias the model to the observations, so the real matter is "inappropriate bias". This brings up the question of what a model is intended to fit and what it is not.

When I first implemented an overall anisotropic B correction in TNT I noticed that the correction for a given model would grow larger as more refinement cycles were run. It appears that a model consisting of only atomic positions and isotropic B's can be created where the Fc's have an anisotropic falloff in resolution. When the isotropic model was refined with the anisotropy uncorrected, the parameters managed to find a way to fit that anisotropy. When the anisotropy was properly modeled, the positions and isotropic B's could go back to their job of fitting the signal they were designed to fit.

This is what I would define as "inappropriate bias". The parameters of the model are attempting to fit a signal they were not designed to fit. In this example, the distortion of the parameters is distributed over a large number of them and each parameter is changed by a small amount; an amount usually considered too small to be significant, but in aggregate they produce a significant signal (the anisotropic falloff of the model's Fc's).

A more trivial example would be the location of the side chains of amino acids near the density of an unmodeled ligand. Refinement will tend to move the side chains away from the center of their own density toward the unfilled density, perhaps even inappropriately placing a side chain in the ligand density instead of its own.
Again, the fit of the parameters to the signal they were designed to fit has been degraded by the attempt to fit a signal they were not designed to fit, and could never fit properly.

When well designed parameters fit the signal they were designed to fit, the model has predictive power. I guess that is what "designed" is defined to mean in this case. A model that can't predict things is useless, and that is why the free R is such a good test of a model.

If the parameters of a model are fitting signal in the data that they were not designed to fit, all bets are off. There is no reason to expect that they will have the same predictive power, except by happenstance or (bad) luck. Placing the end of an arginine residue in the density of a ligand does, at least, put a few atoms in places where atoms should be, and that will tend to lower the free R, but the requirement that there be bridging atoms linking those atoms to the main chain of the protein will cause the parameters of the middle atoms to engage in contortions to try to fit the data, and those contortions will harm the ability of the model to make correct predictions. Going back to the first example, there is also no reason to expect that the small perturbations in an isotropic model refined against an anisotropic data set will be able to predict the anisotropic decay in the amplitudes of the test set reflections. Well designed parameters are expected to have predictive power when they are used to fit the signal they were designed for, but not when they are trying to fit some other type of signal.

You will note that I've done my best to avoid the term "error". One man's error is another man's signal, or in more politically correct language, one model's error is another (better) model's signal. The usual vague references to "error" are often just a way of saying "give up".

In a data set there is the signal you want to model, and there is error. The goal is to fit the signal despite the error.
The textbook descriptions of optimization deal with error as a uniform, Gaussian, random signal imposed atop the "true" signal of the data set. Optimization methods are designed to result in a good set of parameters despite the presence of this "error", and therefore it can be ignored. Our "error" is neither uniform, Gaussian, nor random. The methods we refinement package authors have pulled from the textbooks are not robust against our style of "error", and the parameters of our models are inappropriately perturbed from their proper values by its presence. This perturbation causes the predictive ability of the model to be degraded and the free R becomes larger than the working R.

This effect is what we use the word "bias" to describe. A four letter word is certainly easier to type than the last six paragraphs.

Now I've dug myself into a real hole. I've defined bias, and the difference between the free R and the working R, in terms of the unmodeled signal ("error") in our diffraction data set. To discuss how noncrystallographic and crystallographic symmetry are connected to the unmodeled signal I have to know something about the distribution of unmodeled signal in reciprocal space. For pretty much my entire career, the late-night topic at conferences has been "Why can't I get my R factor lower?". There has been endless speculation as to what will be required to fit that last 20% of R factor. Now I'm stuck with at least describing the pattern behind this residual to answer your question.

I'll start by pointing out that the uncertainty of measurement of the Fobs is unimportant. The R merge is usually lower than 10% (on intensity) and, as has been mentioned many times on this bulletin board, the R merge is more a measure of the quality of the intensities before merging. The merged intensities will be of higher quality due to the redundancy of measurements. The remaining uncertainties are tiny compared to the unexplained 20% (on amplitude) of R value in refinement.
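To put the working R and free R in the concrete Fo/Fc terms Ed asked for, here is a minimal sketch in Python. The function names and the amplitudes are made up for illustration, the Fc's are assumed to be already scaled to the Fo's, and none of this is taken from any refinement package.

```python
def r_factor(f_obs, f_calc):
    """Conventional R = sum(| |Fo| - |Fc| |) / sum(|Fo|); Fc assumed already scaled."""
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


def r_work_and_free(f_obs, f_calc, is_free):
    """Split reflections by the free flag and return (R work, R free)."""
    work = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, is_free) if not free]
    test = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, is_free) if free]
    return r_factor(*zip(*work)), r_factor(*zip(*test))


# Hypothetical amplitudes; the last two reflections were kept out of refinement.
f_obs = [100.0, 80.0, 60.0, 40.0, 20.0]
f_calc = [90.0, 85.0, 60.0, 50.0, 15.0]
is_free = [False, False, False, True, True]

r_work, r_free = r_work_and_free(f_obs, f_calc, is_free)
print(f"R work = {r_work:.4f}, R free = {r_free:.4f}")
# prints "R work = 0.0625, R free = 0.2500"
```

The same formula is used for both numbers; the only difference is whether refinement was allowed to see the reflections that go into the sum.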
Here I go out on the ledge and make a proposal. I'm not saying that this idea explains all of the 20% but it is, I believe, a big part and enough to explain the "crosstalk" between reflections in a data set with symmetry.

In my current refinement, I have an R work/free of 16.5/20.9% at 2.2A resolution. I also have a model of a similar protein from another species which is 80% sequence identical, but with a different crystal form. That R work/free, at 2.2A resolution, is 13.0/15.5%. How was I able to achieve such good stats for the second crystal? That crystal diffracts to 1.25A and I have been able to build a model that includes individual anisotropic B's, many alternative conformations and many more water molecules. The first crystal form only diffracts to 2.2A and, while I can see some hints of a few alternative conformations, I have not been confident enough, from the map alone, to build any.

This result indicates to me that a large part of the remaining 20.9% of free R could be eliminated if a model with conventional anisotropic B's and alternative conformations could be constructed and refined based only on the 2.2A data set. (Insert your favorite advertisement for TLS refinement here.)

This long and boring answer is not the answer to a question about refinement methods, but to a question about the difference between the working R and the free R. What I'm proposing is that a large chunk of the R value difference in my 2.2A model is due to the inappropriate fitting of positions and isotropic B's to difference map features that actually result from unmodeled anisotropic B's and alternative conformations. While these parameters can do part of the job, and lower the working R below the free R, their attempt does not improve their ability to predict the amplitudes of test set reflections because they cannot fit this signal in the proper way.

Except...
The difference map features that arise from these unmodeled, or improperly modeled, aspects of the protein have the same symmetry, both crystallographic and noncrystallographic, as the aspects of the model that are being properly fit by the limited parameters. When the location of an atom is pulled to the left trying to fit the data, regardless of whether that attempt is appropriate or inappropriate, every symmetry image of that atom will be pulled in the corresponding way. The symmetry related structure factors, both crystallographic and noncrystallographic, will be affected in the same way and a reflection in the test set will be tied to its mate in the working set.

In summary, this argument depends on two assertions that you can argue with me about:

1) When a parameter is being used to fit the signal it was designed for, the resulting model develops predictive power and can lower both the working and free R. When a signal for which it was not designed is perturbing the value of a parameter, it is unlikely to improve its predictive power and the working R will tend to drop, but the free R will not (and may rise).

2) If the unmodeled signal in the data set is a property in real space and has the same symmetry as the molecule in the unit cell, the inappropriate fitting of parameters will be systematic with respect to that symmetry and the presence of a reflection in the working set will tend to cause its symmetry mate in the test set to be better predicted, despite the fact that this predictive power does not extend to reflections that are unrelated by symmetry.

This "bias" will occur for any kind of "error" as long as that "error" obeys the symmetry of the unit cell in real space.

I'm sorry for the long-winded post, but sometimes I get these things stuck in my head and I can't get any work done until I get them out. I hope it helps, or at least is not complete nonsense.

Dale Tronrud
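The coupling between symmetry mates described above can be checked on a toy case. The Python sketch below is a deliberately simplified 2D example with a single mirror (x -> -x), unit point scatterers and no B factors. The coordinates, the "refinement" shift and the choice of which reflection sits in the working or test set are all hypothetical and only serve the illustration.

```python
import cmath
import math


def with_mirror_images(atoms):
    """Expand an asymmetric unit by the toy mirror x -> -x (fractional coordinates)."""
    return list(atoms) + [(-x, y) for x, y in atoms]


def fc_amplitude(atoms, h, k):
    """|Fc(h,k)| for unit point scatterers: |sum exp(2*pi*i*(h*x + k*y))|."""
    return abs(sum(cmath.exp(2j * math.pi * (h * x + k * y)) for x, y in atoms))


# Hypothetical asymmetric unit, and the same atoms after a hypothetical shift.
before = [(0.10, 0.20), (0.31, 0.47), (0.22, 0.85)]
after = [(x + 0.03, y - 0.02) for x, y in before]

working_hk = (2, 3)    # pretend this reflection is in the working set
test_hk = (-2, 3)      # its mirror mate, pretend it is in the test set
unrelated_hk = (1, 4)  # a test set reflection with no symmetry relation to (2, 3)


def change(hk):
    """Change in |Fc| when the shift is applied to every symmetry image."""
    h, k = hk
    return (fc_amplitude(with_mirror_images(after), h, k)
            - fc_amplitude(with_mirror_images(before), h, k))


d_work, d_mate, d_other = change(working_hk), change(test_hk), change(unrelated_hk)
print(f"working {d_work:+.4f}  mate {d_mate:+.4f}  unrelated {d_other:+.4f}")
```

Because the shift is applied to every symmetry image, the amplitudes of (2,3) and (-2,3) change by the same amount, so whatever the working reflection gains is passed on to its mate in the test set. The unrelated reflection gets no such guarantee.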