Multi-crystal symmetry analysis and scaling with DIALS


Recent additions to DIALS and xia2 have enabled multi-crystal analysis to be performed following integration. These tools are particularly relevant for analysis of many partial-datasets, which may be the only pracitical way of performing data collections for certain crystals. After integration, the space group symmetry can be investigated by testing for the presence of symmetry operations relating the integrated intensities of groups of reflections - the program to perform this is analysis is dials.symmetry (with algorithms similar to those of the program Pointless). Another thing to consider is that for certain space groups (polar space groups), there is an inherent ambiguity in the way that the diffraction pattern can be indexed. In order to combine multiple datasets from these space groups, one must reindex all data to a consistent setting, which can be done with the program dials.cosym (see Gildea and Winter for details). Finally, the data must be scaled, to correct for experimental effects such as differences in crystal size/illuminated volume and radiation damage - this can be done with the program dials.scale (with algorithms similar to those of the program Aimless). After the data has been scaled, choices can then be made about applying a resolution limit to exclude certain regions of the data which may be negatively affected by radiation damage.

In this tutorial, we shall investigate a multi-crystal dataset collected on the VMXi beamline, Diamond’s automated facility for data collection from crystallisation experiments in-situ. The dataset consists of four repeats of a 60-degree rotation measurement on a crystal of Proteinase K, taken at different locations on the crystal. We shall start with the integrated reflections and experiments files generated by running the automated processing software xia2 pipeline=dials. Have a look at ‘Processing in Detail’ tutorial if you want to know more about the different processing steps up to this point.


The data for this tutorial is currently only available for users at diamond on BAG training. After typing module load bagtraining you’ll be moved to a working folder, with the data already located in the tutorial-data/ccp4/integrated_files subdirectory. The processing in this tutorial will produce quite a few files, so it’s recommended to make an move to new directory:

mkdir multi_crystal
cd multi_crystal


The easiest way to run these tools for a multi-dataset analysis is through the program xia2.multiplex. This runs several DIALS programs, including the programs described above, while producing useful plots and output files.

To run xia2.multiplex, we must provide the path to the input integrated files from dials.integrate:

xia2.multiplex ../tutorial-data/ccp4/integrated_files/{reflections_*.pickle,experiments_*.json}

This runs dials.cosym, dials.symmetry, dials.scale, calculates a resolution limit and reruns dials.scale with a resolution cutoff. The final dataset is exported to an unmerged mtz and a html report plot is generated. The easiest way to see the results is to open the html report in your browser of choice e.g.:

firefox xia2-multi-crystal-report.html

Provided is a summary of the merging statistics as well as several plots, please explore these for a few minutes now!

So far the merging statistics look okay but not great, and if you are familiar with this system you will notice that the space group is P 1 2 1, whereas in the published structure the space group is P 43 21 2 (search for the PDB entry for Proteinase K)! Let’s have a look at the log files to see what’s going on. Type ‘ls’ in the current directory to see the files. The files are numbered based on the order of processing (so all 1_* files are from dials.cosym, 2_* from dials.symmetry etc.). First let’s inspect the results from dials.cosym, using the cat or open commands:

cat 1_dials.cosym.log

At the bottom of the log, you’ll notice:

Space groups:
P 4 2 2
[0, 1, 2, 3]
Reindexing operators:
[0, 1, 2, 3]

This looks good so far, in this case the datasets already had a consistent indexing, however in general there may be different reindexing operators applied to different datasets.

Next let’s look at the dials.symmetry results:

cat 2_dials.symmetry.log

Scoring all possible sub-groups
Patterson group     Likelihood  NetZcc  Zcc+   Zcc-   CC     CC-    delta  Reindex operator
P 1 2/m 1        *  0.424        2.72    7.16   4.44   0.72   0.44  0.0    a,b,c
P -1                0.294        2.21    7.20   4.99   0.72   0.49  0.0    a,b,c
C 1 2/m 1           0.081        1.19    6.18   4.99   0.62   0.49  0.0    -a-b,a-b,c
P 1 2/m 1           0.063        0.92    6.00   5.08   0.60   0.49  0.0    b,-c,-a
C 1 2/m 1           0.063        0.91    5.99   5.09   0.58   0.50  0.0    a-b,a+b,c
P 1 2/m 1           0.043        0.49    5.71   5.22   0.56   0.51  0.0    -b,a,c
P 4/m               0.014        0.33    5.55   5.22   0.52   0.52  0.0    a,b,c
P m m m             0.013        1.17    5.83   4.66   0.58   0.46  0.0    a,b,c
C m m m             0.004        0.10    5.40   5.30   0.53   0.51  0.0    -a-b,a-b,c
P 4/m m m           0.000        5.36    5.36   0.00   0.52   0.00  0.0    a,b,c
Best solution: P 1 2/m 1
Unit cell: (68.3974, 68.3974, 104.002, 90, 90, 90)
Reindex operator: a,b,c
Laue group probability: 0.424
Laue group confidence: 0.417

dials.symmetry has found P 1 2/m 1 as the most likely Patterson group, however you’ll notice that the likelihood is not very high (0.424). It is important to bear in mind that this symmetry analysis was performed on unscaled intensities. If the intensities need to be scaled significantly then this can hide the true symmetry of the dataset.

Even though the data was scaled in the wrong space group, scaling will still have helped the intensities to become more consistent. While the developers of DIALS work very hard to implement automated rechecking of the symmetry, we’ll have to reprocess manually from this point forwards.

Manual reprocessing

Let’s try running dials.symmetry again on the output of the first scaling run (before any resolution cutoff has been applied):

dials.symmetry 4_scaled_experiments.json 4_scaled_reflections.pickle

Scoring all possible sub-groups
Patterson group       Likelihood  NetZcc  Zcc+   Zcc-   CC     CC-    delta  Reindex operator
P 4/m m m        ***  0.995        9.35    9.35   0.00   0.93   0.00  0.0    a,b,c
C m m m               0.002        0.06    9.37   9.31   0.94   0.93  0.0    -a+b,a+b,-c
P m m m               0.002        0.09    9.38   9.29   0.94   0.93  0.0    a,b,c
P 4/m                 0.000        0.10    9.40   9.30   0.94   0.93  0.0    a,b,c
C 1 2/m 1             0.000        0.16    9.46   9.30   0.95   0.93  0.0    -a+b,a+b,-c
P 1 2/m 1             0.000        0.14    9.45   9.30   0.95   0.93  0.0    a,c,-b
C 1 2/m 1             0.000        0.14    9.45   9.30   0.95   0.93  0.0    -a-b,-a+b,-c
P 1 2/m 1             0.000        0.11    9.43   9.31   0.94   0.93  0.0    -b,a,c
P 1 2/m 1             0.000        0.23    9.51   9.28   0.95   0.93  0.0    a,b,c
P -1                  0.000        0.31    9.61   9.30   0.96   0.93  0.0    a,b,c
Best solution: P 4/m m m
Unit cell: (68.3815, 68.3815, 103.974, 90, 90, 90)
Reindex operator: a,b,c
Laue group probability: 0.995
Laue group confidence: 0.995

If scale factors from scaling are present, these will be applied before the symmetry analysis. Now the correct solution is easily found with a very high likelihood, that’s much better! We can now run scaling again on the output of dials.symmetry. Let’s also output the results to an unmerged mtz so that we can later create a report:

dials.scale reindexed_reflections.pickle reindexed_experiments.json unmerged_mtz=rescaled.mtz

From the output, you can see that the merging statistics are significantly better than before, with high correlation coefficients close to 1. At this point we could also apply a resolution limit by supplying d_min= to dials.scale, however the correlation coefficient and I/sigma looks good out to the highest resolution so it is unneccesary in this case.

To get a useful summary report, we can generate a dials-report or a xia2-report: scaled.pickle scaled_experiments.json rescaled.mtz

Take a look and the reports, how do the results compare the initial run of xia2.multiplex?

Almost there

If you looked carefully at the reports, you may have noticed that the fourth dataset is giving significantly higher R-merge values and much lower I/sigma. Therefore the question one must ask is if it is better to exclude this dataset. We can get some useful information about the agreement between datasets by running the program dials.compute_delta_cchalf. This program implements a version of the algorithms described in Assmann et al.

dials.compute_delta_cchalf scaled.pickle scaled_experiments.json

# Datasets: 4
# Reflections: 222934
# Unique: 26478
CC 1/2 mean: 94.896
CC 1/2 excluding dataset 0: 92.111
CC 1/2 excluding dataset 1: 92.086
CC 1/2 excluding dataset 2: 92.022
CC 1/2 excluding dataset 3: 99.327
Dataset: 3, Delta CC 1/2: -4.431
Dataset: 0, Delta CC 1/2: 2.785
Dataset: 1, Delta CC 1/2: 2.810
Dataset: 2, Delta CC 1/2: 2.874

It looks like we could get a significantly better CC 1/2 by excluding the final dataset - it has a negative Delta CC 1/2. But how bad is too bad that it warrants exclusion? Unfortunately this is a difficult question to answer and it may be the case that one would need to refine several structures with different data excluded to properly address this question. If we had many datasets and only a small fraction had a very large negative Delta CC 1/2 then one could argue that these measurements are not drawn from the same population as the rest of the data and should be excluded.

To see the effect of removing the last dataset (dataset ‘3’), we can rerun dials.scale (note that this will overwrite the previous scaled files). We have to provide the identifier of the dataset that we want to exclude, which are usually a string of integers (‘0’, ‘1’, ‘2’ … based on the order of input):

dials.scale scaled.pickle scaled_experiments.json exclude_datasets=3 unmerged_mtz=scaled.mtz

We could have also excluded a subset of images, for example using the option exclude_images=3:301:600 to exclude the last 300 images of dataset 3. This option could be used to exclude the end of a dataset that was showing sigificant radiation damage, or if the crystal had moved out of the beam part-way through the measurement.

Looking at the output from dials.scale, the merging statistics have significantly improved again, and although the multiplicity has reduced, we have not sacrificed much completeness. The anomalous correlation has only now become apparant, so maybe it would be best to proceed with only these three datasets for structure solution.