Multi-crystal analysis with DIALS and BLEND

Introduction

BLEND is a CCP4 program for analysis of multiple data sets. It attempts to identify isomorphous clusters that may be scaled and merged together to form a more complete multi-crystal dataset. Clustering in blend is based on the refined cell parameters from integration, so it is important that these are determined accurately. Unfortunately, for narrow wedges of data (where BLEND is most useful) cell refinement may be complicated by issues such as the high correlation between e.g. the detector distance and the cell volume.

One solution is to fix detector parameters during cell refinement so that the detector is the same for every dataset processed. However, this is only feasible if an accurate detector model is determined ahead of time, which might require the use of a well-diffracting sacrificial crystal. If we only have the narrow wedges of data available then it is more complicated to determine what the best detector model would be to fix.

If we can make the assumption that the beam and detector did not move during and between all of the datasets we collected then we can use DIALS joint refinement to refine all of the crystal cells at the same time, simultaneously with shared detector and beam models.

In this tutorial, we will attempt to do that for 73 sweeps of data collected from crystals of TehA, a well-diffracting integral membrane protein measured using in situ diffraction from crystallisation plates at room temperature. Each sweep provides between 4 and 10 degrees of data.

This tutorial is relatively advanced in that it requires high level scripting of the DIALS command line programs, however candidate scripts are provided and the tutorial will hopefully be easy enough to follow.

Individual processing

We start with a directory full of images. It is easy enough to figure out which files belong with which sweep from the filename templates, however note that the pattern between templates is not totally consistent. Most of the sweeps start with the prefix xtal, but some have just xta. One way of getting around this annoyance would be to use the fact that each dataset has a single *.log file associated with it and identify the different datasets that way. However, we would prefer to come up with a protocol that would work more generally, not just for the TehA data. Happily we can just let DIALS figure it out for us:

dials.import /path/to/TehA/*.cbf

The following parameters have been modified:

input {
  datablock = <image files>
}

--------------------------------------------------------------------------------
DataBlock 0
  format: <class 'dxtbx.format.FormatCBFMiniPilatus.FormatCBFMiniPilatus'>
  num images: 2711
  num sweeps: 73
  num stills: 0
--------------------------------------------------------------------------------
Writing datablocks to datablock.json

With a single command we have determined that there are 73 individual sweeps comprising 2711 total images. Running the following command will give us information about each one of these datasets:

dials.show datablock.json

That was a smooth start, but now things get abruptly more difficult. Before we perform the joint analysis, we want to do the individual analysis to compare to. This will also give us intermediate files so that we don’t have to start from scratch when setting up the joint refinement job. Essentially we just want to run a sequence of DIALS commands to process each recorded sweep. However we can’t (currently) split the datablock into individual sweeps with a single command. We will have to start again with dials.import for each sweep individually - but we really don’t want to run this manually 73 times.

The solution is to write a script that will take the datablock.json as input, extract the filename templates, and run the same processing commands for each dataset. This script could be written in BASH, tcsh, perl, ruby - whatever you feel most comfortable with. However here we will use Python, or more specifically dials.python because we will take advantage of features in the cctbx to make it easy to write scripts that take advantage of parallel execution. Also we would like to read datablock.json with the DIALS API rather than extracting the sweep templates using something like grep.

The script we used to do this is reproduced below. You can copy this into a file, save it as process_TehA.py and then run it as follows:

time dials.python process_TehA.py datablock.json

On a Linux desktop with a Core i7 CPU running at 3.07GHz the script took about 8 minutes to run (though file i/o is a significant factor) and successfully processed 41 datasets. If time is short, you might like to start running it now before reading the description of what the script does. If time is really short then try uncommenting the line tasklist = tasklist[0:35] to reduce the number of datasets processed.:

#!/bin/env dials.python
import os
import sys
import glob
from libtbx import easy_run, easy_mp
from dxtbx.datablock import DataBlockFactory
from dials.test import cd

def process_sweep(task):
  """Process a single sweep of data. The parameter 'task' will be a
  tuple, the first element of which is an integer job number and the
  second is the filename template of the images to process"""

  num = task[0]
  template = task[1]

  # create directory
  with cd("sweep_%02d" % num):
    cmd = "dials.import template={0}".format(template)
    easy_run.fully_buffered(command=cmd)
    easy_run.fully_buffered(command="dials.find_spots datablock.json")

    # initial indexing in P 1
    cmd = "dials.index datablock.json strong.pickle " +\
          "output.experiments=P1_experiments.json"
    easy_run.fully_buffered(command=cmd)
    if not os.path.isfile("P1_experiments.json"):
      print "Job %02d failed in initial indexing" % num
      return

    # bootstrap from the refined P 1 cell
    cmd = "dials.index P1_experiments.json strong.pickle space_group='H 3'"
    easy_run.fully_buffered(command=cmd)
    if not os.path.isfile("experiments.json"):
      print "Job %02d failed in indexing" % num
      return

    # static model refinement
    cmd = "dials.refine experiments.json indexed.pickle " + \
          "outlier.algorithm=tukey use_all_reflections=true"
    easy_run.fully_buffered(command=cmd)
    if not os.path.isfile("refined_experiments.json"):
      print "Job %02d failed in refinement" % num
      return

    # WARNING! Fast and dirty integration.
    # Do not use the result for scaling/merging!
    cmd = "dials.integrate refined_experiments.json indexed.pickle " + \
          "profile.fitting=False prediction.dmin=8.0 prediction.dmax=8.1"
    easy_run.fully_buffered(command=cmd)
    if not os.path.isfile("integrated.pickle"):
      print "Job %02d failed during integration" % num
      return

    # create MTZ
    cmd = "dials.export refined_experiments.json integrated.pickle " +\
          "mtz.hklout=integrated.mtz"
    easy_run.fully_buffered(command=cmd)
    if not os.path.isfile("integrated.mtz"):
      print "Job %02d failed during MTZ export" % num
      return

  # if we got this far, return the path to the MTZ
  return "sweep_%02d/integrated.mtz" % num

if __name__ == "__main__":

  if len(sys.argv) != 2:
    sys.exit("Usage: dials.python process_TehA.py datablock.json")

  datablock_path = os.path.abspath(sys.argv[1])
  datablock = DataBlockFactory.from_serialized_format(datablock_path,
    check_format=False)[0]
  sweeps = datablock.extract_sweeps()
  templates = [e.get_template() for e in sweeps]
  tasklist = list(enumerate(sorted(templates)))

  if len(tasklist) == 0: sys.exit("No images found!")

  # uncomment the following line if short on time!
  #tasklist = tasklist[0:35]

  from libtbx import Auto
  nproc = easy_mp.get_processes(Auto)

  print "Attempting to process the following datasets, with {} processes".format(nproc)
  for task in tasklist:
    print "%d: %s" % task

  results = easy_mp.parallel_map(
    func=process_sweep,
    iterable=tasklist,
    processes=nproc,
    preserve_order=True)

  good_results = [e for e in results if e is not None]
  print "Successfully created the following MTZs:"
  for result in good_results:
    print result

We will now describe what is in this script. The first lines are just imports to bring in modules from the Python standard library as well as easy_run and easy_mp from libtbx (part of cctbx), DataBlockFactory from dxtbx to read in the datablock and a class from the dials.test package that simplifies running commands in a new directory. Following that is a definition for the function process_sweep which will perform all the steps required to process one dataset from images to unmerged MTZ. The code block under:

if __name__ == "__main__":

are the lines that are executed when the script starts. First we check that the script has been passed a path to a datablock. We then extract the 73 sweeps from this into a list, then get the filename templates from each element in the list. We associate each of these templates with a number to form a list of ‘tasks’ to pass into process_sweep, but instead of doing this in serial we can use easy_mp to run in parallel. This will be okay because inside process_sweep, we ensure that all results are written into a new directory. First we use a facility of the easy_mp module to determine the number of processes to run in parallel and then we submit the job with parallel_map.

Within process_sweep all external commands are run within a with block where execution is controlled by the context manager cd. If you want the gory details, they are here. Essentially this is a way to write clean code that tidies up after itself properly. In this case, we will create a new directory, execute commands in that directory, then change back to the old directory afterwards. If the directory already exists, this will fail with an error.

The commands that are run inside the managed block are usual dials commands, familiar from other tutorials. There are a couple of interesting points to note though. We know that the correct space group is H 3, but it turns out that if we ask dials.index to find an H 3 cell right from the start then many of the sweeps fail to index. This is simply because the initial models contained in datablock.json are too poor to locate a cell with the symmetry constraints. However, for many of the sweeps the indexing program will refine the P 1 solution to the correct cell. For this reason we first run indexing in P 1:

dials.index datablock.json strong.pickle output.experiments=P1_experiments.json

and then we feed the refined P1_experiments.json back into dials.index specifying the correct symmetry:

dials.index P1_experiments.json strong.pickle space_group='H 3'

When dials.index is passed an experiments.json containing a crystal model rather than just a databock.json then it automatically uses a known_orientation indexer, which avoids doing the basis vector search again. It uses the basis of the refined P 1 cell and just assigns indices under the assumption of H 3 symmetry. The symmetry constraints are then enforced during the refinement steps carried out by dials.index. This procedure gives us a greater success rate of indexing in H 3, and required no manual intervention.

Following indexing we do scan-static cell refinement:

dials.refine experiments.json indexed.pickle outlier.algorithm=tukey use_all_reflections=true

Outlier rejection was switched on in an attempt to avoid any zingers or other errant spots from affecting our refined cells. Without analysing the data closer it is not clear whether there are any particularly bad outliers here. We could repeat the whole analysis with this switched off if we want to investigate more closely, or look through all the dials.refine.log files to see results of the outlier rejection step.

We elected use all reflections rather than taking a random subset because these are narrow wedges and there are few reflections anyway. Taking a random subset is only a time-saving procedure, and it won’t provide much benefit here anyway.

We don’t bother with the time-consuming step of scan-varying refinement, because it is the scan-static cell that will be written into the MTZ header. Scan- varying refinement would give us better models for integration but as we will only be running blend in ‘analysis’ mode we are in the unusual situation of not actually caring what the intensities are. In this case, the MTZ file is just a carrier for the globally refined unit cell!

Following refinement we integrate the data in a very quick and dirty way, simply to get an MTZ file as fast as possible. This is a terrible way to integrate data usually!:

dials.integrate refined_experiments.json indexed.pickle profile.fitting=False prediction.dmin=8.0 prediction.dmax=8.1

The profile.fitting=False option ensures we only do summation integration, no profile fitting, while the prediction.dmin=8.0 and prediction.dmax=8.1 options only integrate data between 8.0 and 8.1 Angstroms. As a result very few reflections will be integrated. The MTZ file here is just being used as a carrier of the cell information into blend. By restricting the resolution range this way we are making it obvious that the content of the file is useless for any other purpose.

Warning

Do not use the data produced by this script for scaling and merging. More careful processing should be done first!

Finally we use dials.export to create an MTZ file:

dials.export refined_experiments.json integrated.pickle mtz.hklout=integrated.mtz

After each of these major steps we check whether the last command ran successfully by checking for the existence of an expected output file. If the file does not exist we make no effort to rescue the dataset, we just return early from the process_sweep function, freeing up a process so that parallel_map can start up the next.

Here is the output of a run of the script:

Attempting to process the following datasets, with 5 processes
0: /home/david/xray/TehA/xta30_1_####.cbf
1: /home/david/xray/TehA/xta31_1_####.cbf
2: /home/david/xray/TehA/xta32_1_####.cbf
3: /home/david/xray/TehA/xta33_1_####.cbf
4: /home/david/xray/TehA/xta34_1_####.cbf
5: /home/david/xray/TehA/xta9_1_####.cbf
6: /home/david/xray/TehA/xta9_2_####.cbf
7: /home/david/xray/TehA/xtal10_1_####.cbf
8: /home/david/xray/TehA/xtal11_1_####.cbf
9: /home/david/xray/TehA/xtal12_1_####.cbf
10: /home/david/xray/TehA/xtal12_2_####.cbf
11: /home/david/xray/TehA/xtal13_1_####.cbf
12: /home/david/xray/TehA/xtal14_1_####.cbf
13: /home/david/xray/TehA/xtal15_1_####.cbf
14: /home/david/xray/TehA/xtal16_1_####.cbf
15: /home/david/xray/TehA/xtal17_1_####.cbf
16: /home/david/xray/TehA/xtal18_1_####.cbf
17: /home/david/xray/TehA/xtal19_1_####.cbf
18: /home/david/xray/TehA/xtal1_1_####.cbf
19: /home/david/xray/TehA/xtal20_1_####.cbf
20: /home/david/xray/TehA/xtal21_1_####.cbf
21: /home/david/xray/TehA/xtal22_1_####.cbf
22: /home/david/xray/TehA/xtal23_1_####.cbf
23: /home/david/xray/TehA/xtal24_1_####.cbf
24: /home/david/xray/TehA/xtal25_1_####.cbf
25: /home/david/xray/TehA/xtal26_1_####.cbf
26: /home/david/xray/TehA/xtal26_2_####.cbf
27: /home/david/xray/TehA/xtal27_1_####.cbf
28: /home/david/xray/TehA/xtal28_1_####.cbf
29: /home/david/xray/TehA/xtal29_1_####.cbf
30: /home/david/xray/TehA/xtal2_1_####.cbf
31: /home/david/xray/TehA/xtal35_1_####.cbf
32: /home/david/xray/TehA/xtal36_1_####.cbf
33: /home/david/xray/TehA/xtal37_1_####.cbf
34: /home/david/xray/TehA/xtal37_2_####.cbf
35: /home/david/xray/TehA/xtal38_1_####.cbf
36: /home/david/xray/TehA/xtal39_1_####.cbf
37: /home/david/xray/TehA/xtal3_2_####.cbf
38: /home/david/xray/TehA/xtal40_1_####.cbf
39: /home/david/xray/TehA/xtal40_2_####.cbf
40: /home/david/xray/TehA/xtal40_3_####.cbf
41: /home/david/xray/TehA/xtal40_4_####.cbf
42: /home/david/xray/TehA/xtal41_1_####.cbf
43: /home/david/xray/TehA/xtal42_1_####.cbf
44: /home/david/xray/TehA/xtal43_1_####.cbf
45: /home/david/xray/TehA/xtal44_1_####.cbf
46: /home/david/xray/TehA/xtal45_1_####.cbf
47: /home/david/xray/TehA/xtal46_1_####.cbf
48: /home/david/xray/TehA/xtal47_1_####.cbf
49: /home/david/xray/TehA/xtal48_1_####.cbf
50: /home/david/xray/TehA/xtal49_1_####.cbf
51: /home/david/xray/TehA/xtal4_3_####.cbf
52: /home/david/xray/TehA/xtal50_1_####.cbf
53: /home/david/xray/TehA/xtal50_2_####.cbf
54: /home/david/xray/TehA/xtal51_1_####.cbf
55: /home/david/xray/TehA/xtal52_1_####.cbf
56: /home/david/xray/TehA/xtal53_1_####.cbf
57: /home/david/xray/TehA/xtal54_1_####.cbf
58: /home/david/xray/TehA/xtal55_1_####.cbf
59: /home/david/xray/TehA/xtal55_2_####.cbf
60: /home/david/xray/TehA/xtal56_1_####.cbf
61: /home/david/xray/TehA/xtal56_2_####.cbf
62: /home/david/xray/TehA/xtal57_1_####.cbf
63: /home/david/xray/TehA/xtal58_1_####.cbf
64: /home/david/xray/TehA/xtal58_2_####.cbf
65: /home/david/xray/TehA/xtal58_3_####.cbf
66: /home/david/xray/TehA/xtal59_1_####.cbf
67: /home/david/xray/TehA/xtal5_1_####.cbf
68: /home/david/xray/TehA/xtal60_1_####.cbf
69: /home/david/xray/TehA/xtal60_2_####.cbf
70: /home/david/xray/TehA/xtal6_1_####.cbf
71: /home/david/xray/TehA/xtal7_1_####.cbf
72: /home/david/xray/TehA/xtal8_1_####.cbf
Job 04 failed in indexing
Job 06 failed in initial indexing
Job 07 failed in indexing
Job 08 failed in indexing
Job 11 failed in indexing
Job 10 failed in indexing
Job 13 failed in indexing
Job 12 failed in indexing
Job 15 failed in initial indexing
Job 21 failed in initial indexing
Job 20 failed in initial indexing
Job 32 failed in initial indexing
Job 37 failed in indexing
Job 35 failed in indexing
Job 38 failed in indexing
Job 39 failed in indexing
Job 41 failed in indexing
Job 40 failed in indexing
Job 45 failed in indexing
Job 44 failed in indexing
Job 47 failed in indexing
Job 52 failed in initial indexing
Job 49 failed in initial indexing
Job 55 failed in initial indexing
Job 57 failed in initial indexing
Job 61 failed in indexing
Job 62 failed in indexing
Job 69 failed in indexing
Job 70 failed in indexing
Job 68 failed in indexing
Job 71 failed in initial indexing
Job 72 failed in indexing
Successfully created the following MTZs:
sweep_00/integrated.mtz
sweep_01/integrated.mtz
sweep_02/integrated.mtz
sweep_03/integrated.mtz
sweep_05/integrated.mtz
sweep_09/integrated.mtz
sweep_14/integrated.mtz
sweep_16/integrated.mtz
sweep_17/integrated.mtz
sweep_18/integrated.mtz
sweep_19/integrated.mtz
sweep_22/integrated.mtz
sweep_23/integrated.mtz
sweep_24/integrated.mtz
sweep_25/integrated.mtz
sweep_26/integrated.mtz
sweep_27/integrated.mtz
sweep_28/integrated.mtz
sweep_29/integrated.mtz
sweep_30/integrated.mtz
sweep_31/integrated.mtz
sweep_33/integrated.mtz
sweep_34/integrated.mtz
sweep_36/integrated.mtz
sweep_42/integrated.mtz
sweep_43/integrated.mtz
sweep_46/integrated.mtz
sweep_48/integrated.mtz
sweep_50/integrated.mtz
sweep_51/integrated.mtz
sweep_53/integrated.mtz
sweep_54/integrated.mtz
sweep_56/integrated.mtz
sweep_58/integrated.mtz
sweep_59/integrated.mtz
sweep_60/integrated.mtz
sweep_63/integrated.mtz
sweep_64/integrated.mtz
sweep_65/integrated.mtz
sweep_66/integrated.mtz
sweep_67/integrated.mtz

real  7m45.656s
user  25m32.532s
sys   1m34.090s

Analysis of individually processed datasets

The paths to integrated.mtz files can be copied directly into a file, say individual_mtzs.dat, and passed to blend for analysis:

echo "END" | blend -a individual_mtzs.dat

The dendrogram resulting from clustering is shown here:

../../_images/tree_01.png

Immediately the dendrogram shows that dataset 27 is an extreme outlier. From FINAL_list_of_files.dat we can see that this refers to sweep_46/integrated.mtz. As we kept all the dials .log files from DIALS processing we could investigate this further, however as this is only one sweep out of 41, we decide just to throw it away and move on. So, edit individual_mtzs.dat to remove the line sweep_46/integrated.mtz and rerun blend.

Now the dendrogram looks better:

../../_images/tree_02.png

The Linear Cell Variation (LCV) is now less than 1%, with an absolute value of 1.03 Angstroms, indicating good isomorphism amongst all the remaining datasets.

Joint refinement

Now that we have done the BLEND analysis for individually processed datasets, we would like to do joint refinement of the crystals to reduce correlations between the detector or beam parameters with individual crystals. As motivation we may look at these correlations for one of these datasets. For example:

cd sweep_00
dials.refine experiments.json indexed.pickle \
  track_parameter_correlation=true correlation_plot.filename=corrplot.png
cd ..

The new file sweep_00/corrplot.png shows correlations between parameters refined with this single 8 degree dataset. Clearly parameters like the detector distance and the crystal metrical matrix parameters are highly correlated.

../../_images/sweep_00_corrplot.png

Although the DIALS toolkit has a sophisticated mechanism for modelling multi-experiment data, the user interface for handling such data is still rather limited. In order to do joint refinement of the sweeps we need to combine them into a single multi-experiment experiments.json and corresponding reflections.pickle. Whilst doing this we want to reduce the separate detector, beam and goniometer models for each experiment into a single shared model of each type. The program dials.combine_experiments can be used for this, but first we have to prepare an input file with a text editor listing the individual sweeps in order. We can use individual_mtzs.dat as a template to start with. In our case the final file looks like this:

input {
  experiments = "sweep_00/refined_experiments.json"
  experiments = "sweep_01/refined_experiments.json"
  experiments = "sweep_02/refined_experiments.json"
  experiments = "sweep_03/refined_experiments.json"
  experiments = "sweep_05/refined_experiments.json"
  experiments = "sweep_09/refined_experiments.json"
  experiments = "sweep_14/refined_experiments.json"
  experiments = "sweep_16/refined_experiments.json"
  experiments = "sweep_17/refined_experiments.json"
  experiments = "sweep_18/refined_experiments.json"
  experiments = "sweep_19/refined_experiments.json"
  experiments = "sweep_22/refined_experiments.json"
  experiments = "sweep_23/refined_experiments.json"
  experiments = "sweep_24/refined_experiments.json"
  experiments = "sweep_25/refined_experiments.json"
  experiments = "sweep_26/refined_experiments.json"
  experiments = "sweep_27/refined_experiments.json"
  experiments = "sweep_28/refined_experiments.json"
  experiments = "sweep_29/refined_experiments.json"
  experiments = "sweep_30/refined_experiments.json"
  experiments = "sweep_31/refined_experiments.json"
  experiments = "sweep_33/refined_experiments.json"
  experiments = "sweep_34/refined_experiments.json"
  experiments = "sweep_36/refined_experiments.json"
  experiments = "sweep_42/refined_experiments.json"
  experiments = "sweep_43/refined_experiments.json"
  experiments = "sweep_48/refined_experiments.json"
  experiments = "sweep_50/refined_experiments.json"
  experiments = "sweep_51/refined_experiments.json"
  experiments = "sweep_53/refined_experiments.json"
  experiments = "sweep_54/refined_experiments.json"
  experiments = "sweep_56/refined_experiments.json"
  experiments = "sweep_58/refined_experiments.json"
  experiments = "sweep_59/refined_experiments.json"
  experiments = "sweep_60/refined_experiments.json"
  experiments = "sweep_63/refined_experiments.json"
  experiments = "sweep_64/refined_experiments.json"
  experiments = "sweep_65/refined_experiments.json"
  experiments = "sweep_66/refined_experiments.json"
  experiments = "sweep_67/refined_experiments.json"
  reflections = "sweep_00/indexed.pickle"
  reflections = "sweep_01/indexed.pickle"
  reflections = "sweep_02/indexed.pickle"
  reflections = "sweep_03/indexed.pickle"
  reflections = "sweep_05/indexed.pickle"
  reflections = "sweep_09/indexed.pickle"
  reflections = "sweep_14/indexed.pickle"
  reflections = "sweep_16/indexed.pickle"
  reflections = "sweep_17/indexed.pickle"
  reflections = "sweep_18/indexed.pickle"
  reflections = "sweep_19/indexed.pickle"
  reflections = "sweep_22/indexed.pickle"
  reflections = "sweep_23/indexed.pickle"
  reflections = "sweep_24/indexed.pickle"
  reflections = "sweep_25/indexed.pickle"
  reflections = "sweep_26/indexed.pickle"
  reflections = "sweep_27/indexed.pickle"
  reflections = "sweep_28/indexed.pickle"
  reflections = "sweep_29/indexed.pickle"
  reflections = "sweep_30/indexed.pickle"
  reflections = "sweep_31/indexed.pickle"
  reflections = "sweep_33/indexed.pickle"
  reflections = "sweep_34/indexed.pickle"
  reflections = "sweep_36/indexed.pickle"
  reflections = "sweep_42/indexed.pickle"
  reflections = "sweep_43/indexed.pickle"
  reflections = "sweep_48/indexed.pickle"
  reflections = "sweep_50/indexed.pickle"
  reflections = "sweep_51/indexed.pickle"
  reflections = "sweep_53/indexed.pickle"
  reflections = "sweep_54/indexed.pickle"
  reflections = "sweep_56/indexed.pickle"
  reflections = "sweep_58/indexed.pickle"
  reflections = "sweep_59/indexed.pickle"
  reflections = "sweep_60/indexed.pickle"
  reflections = "sweep_63/indexed.pickle"
  reflections = "sweep_64/indexed.pickle"
  reflections = "sweep_65/indexed.pickle"
  reflections = "sweep_66/indexed.pickle"
  reflections = "sweep_67/indexed.pickle"
}

We called this file experiments_and_reflections.phil then run dials.combine_experiments like this:

dials.combine_experiments experiments_and_reflections.phil \
  reference_from_experiment.beam=0 \
  reference_from_experiment.goniometer=0 \
  reference_from_experiment.detector=0

The reference_from_experiment options tell the program to replace all beam, goniometer and detector models in the input experiments with those models taken from the first experiment, i.e. experiment ‘0’ using 0-based indexing. The output lists the number of reflections in each sweep contributing to the final combined_reflections.pickle:

---------------------
| Experiment | Nref |
---------------------
| 0          | 1446 |
| 1          | 1422 |
| 2          | 1209 |
| 3          | 1376 |
| 4          | 452  |
| 5          | 1664 |
| 6          | 1528 |
| 7          | 1448 |
| 8          | 1275 |
| 9          | 239  |
| 10         | 1614 |
| 11         | 1052 |
| 12         | 1845 |
| 13         | 1495 |
| 14         | 2041 |
| 15         | 1308 |
| 16         | 1839 |
| 17         | 1828 |
| 18         | 1644 |
| 19         | 243  |
| 20         | 1061 |
| 21         | 2416 |
| 22         | 1885 |
| 23         | 949  |
| 24         | 3569 |
| 25         | 2967 |
| 26         | 935  |
| 27         | 1329 |
| 28         | 650  |
| 29         | 1325 |
| 30         | 633  |
| 31         | 1233 |
| 32         | 2131 |
| 33         | 2094 |
| 34         | 2141 |
| 35         | 1661 |
| 36         | 2544 |
| 37         | 2227 |
| 38         | 982  |
| 39         | 1138 |
---------------------
Saving combined experiments to combined_experiments.json
Saving combined reflections to combined_reflections.pickle

We may also inspect the contents of combined_experiments.json, by using dials.show, for example:

dials.show combined_experiments.json

Useful though this is, it is clear how this could become unwieldy as the number of experiments increases. Work on better interfaces to multi-crystal (or generally, multi-experiment) data is ongoing within the DIALS project. Suggestions are always welcome!

Now we have the joint experiments and reflections files we can run our multi- crystal refinement job. First we try outlier rejection, so that the refinement run is similar to the jobs we ran on individual datasets:

dials.refine combined_experiments.json combined_reflections.pickle \
  use_all_reflections=true outlier.algorithm=tukey
The following parameters have been modified:

refinement {
  reflections {
    outlier {
      algorithm = null *tukey
    }
  }
}
input {
  experiments = combined_experiments.json
  reflections = combined_reflections.pickle
}

Configuring refiner

Summary statistics for observations matched to predictions:
---------------------------------------------------------------------
|                   | Min    | Q1      | Med       | Q3     | Max   |
---------------------------------------------------------------------
| Xc - Xo (mm)      | -14.68 | -0.8191 | -0.0739   | 0.7823 | 15.85 |
| Yc - Yo (mm)      | -21.75 | -0.5103 | -0.01936  | 0.4596 | 17.19 |
| Phic - Phio (deg) | -17.36 | -0.2058 | 0.0004136 | 0.2091 | 28.12 |
| X weights         | 233    | 359.2   | 379.4     | 392.9  | 405.6 |
| Y weights         | 264.7  | 392.9   | 401.3     | 404.4  | 405.6 |
| Phi weights       | 177    | 299.9   | 300       | 300    | 300   |
---------------------------------------------------------------------

16559 reflections have been rejected as outliers
Traceback (most recent call last):
  File "/home/david/bsx/cctbx-svn/build/../sources/dials/command_line/refine.py", line 370, in <module>
    halraiser(e)
  File "/home/david/bsx/cctbx-svn/build/../sources/dials/command_line/refine.py", line 368, in <module>
    script.run()
  File "/home/david/bsx/cctbx-svn/build/../sources/dials/command_line/refine.py", line 274, in run
    reflections, experiments)
  File "/home/david/bsx/cctbx-svn/sources/dials/algorithms/refinement/refiner.py", line 340, in from_parameters_data_experiments
    verbosity=verbosity)
  File "/home/david/bsx/cctbx-svn/sources/dials/algorithms/refinement/refiner.py", line 585, in _build_components
    target = cls.config_target(params, experiments, refman, pred_param, do_stills)
  File "/home/david/bsx/cctbx-svn/sources/dials/algorithms/refinement/refiner.py", line 1008, in config_target
    options.jacobian_max_nref)
  File "/home/david/bsx/cctbx-svn/sources/dials/algorithms/refinement/target.py", line 404, in __init__
    self._reflection_manager.finalise()
  File "/home/david/bsx/cctbx-svn/sources/dials/algorithms/refinement/reflection_manager.py", line 237, in finalise
    self._check_too_few()
  File "/home/david/bsx/cctbx-svn/sources/dials/algorithms/refinement/reflection_manager.py", line 262, in _check_too_few
    raise RuntimeError(msg)
RuntimeError: Please report this error to dials-support@lists.sourceforge.net: Remaining number of reflections = 8, for experiment 19, which is below the configured limit for this reflection manager

Oops! That wasn’t good. Looking at the error we see that experiment 19 provides only 8 reflections to refinement, which is disallowed by a default parameters of dials.refine, namely minimum_number_of_reflections=20. But from the output of dials.combine_experiments we see that experiment 19 has 243 indexed reflections. What happened? Well, forcing the individual experiments to share the beam and detector models of experiment 0 has led to some very poor predictions for some of these experiments. See the Summary statistics table, where the worst positional residuals are greater than 20 mm! We may put this down to the very narrow wedges of data we have. Experiment 19 is one of the narrowest, with only 4 degrees of data. Outlier rejection is not a good idea here because it selectively removes reflections from the worst fitting experiments.

Instead we try without outlier rejection:

dials.refine combined_experiments.json combined_reflections.pickle \
  use_all_reflections=true \
  output.experiments=refined_combined_experiments.json

This worked much better:

The following parameters have been modified:

output {
  experiments = refined_combined_experiments.json
}
refinement {
  reflections {
    use_all_reflections = true
  }
}
input {
  experiments = combined_experiments.json
  reflections = combined_reflections.pickle
}

Configuring refiner

Summary statistics for observations matched to predictions:
---------------------------------------------------------------------
|                   | Min    | Q1      | Med       | Q3     | Max   |
---------------------------------------------------------------------
| Xc - Xo (mm)      | -14.68 | -0.8191 | -0.0739   | 0.7823 | 15.85 |
| Yc - Yo (mm)      | -21.75 | -0.5103 | -0.01936  | 0.4596 | 17.19 |
| Phic - Phio (deg) | -17.36 | -0.2058 | 0.0004136 | 0.2091 | 28.12 |
| X weights         | 233    | 359.2   | 379.4     | 392.9  | 405.6 |
| Y weights         | 264.7  | 392.9   | 401.3     | 404.4  | 405.6 |
| Phi weights       | 177    | 299.9   | 300       | 300    | 300   |
---------------------------------------------------------------------

Performing refinement...

Refinement steps:
-----------------------------------------------
| Step | Nref  | RMSD_X  | RMSD_Y  | RMSD_Phi |
|      |       | (mm)    | (mm)    | (deg)    |
-----------------------------------------------
| 0    | 57629 | 1.6886  | 1.3984  | 1.2926   |
| 1    | 57629 | 1.3726  | 1.0295  | 0.69528  |
| 2    | 57629 | 1.1462  | 0.86286 | 0.64657  |
| 3    | 57629 | 0.88257 | 0.6659  | 0.5764   |
| 4    | 57629 | 0.61437 | 0.47405 | 0.44825  |
| 5    | 57629 | 0.38414 | 0.31317 | 0.28436  |
| 6    | 57629 | 0.22337 | 0.19783 | 0.16576  |
| 7    | 57629 | 0.1759  | 0.16573 | 0.12827  |
| 8    | 57629 | 0.17255 | 0.16354 | 0.12475  |
| 9    | 57629 | 0.17228 | 0.16336 | 0.12463  |
| 10   | 57629 | 0.17217 | 0.16325 | 0.12457  |
| 11   | 57629 | 0.17218 | 0.16322 | 0.12452  |
| 12   | 57629 | 0.17219 | 0.16322 | 0.1245   |
| 13   | 57629 | 0.17219 | 0.16321 | 0.1245   |
-----------------------------------------------
RMSD no longer decreasing

RMSDs by experiment:
---------------------------------------------
| Exp | Nref | RMSD_X  | RMSD_Y  | RMSD_Z   |
|     |      | (px)    | (px)    | (images) |
---------------------------------------------
| 0   | 1374 | 0.63002 | 0.40512 | 0.35154  |
| 1   | 1325 | 0.65204 | 0.38951 | 0.34116  |
| 2   | 1138 | 0.90682 | 0.85212 | 0.75447  |
| 3   | 1294 | 0.67566 | 0.51293 | 0.27902  |
| 4   | 406  | 0.76138 | 0.50378 | 0.36697  |
| 5   | 1579 | 1.059   | 1.5602  | 0.93859  |
| 6   | 1452 | 0.63949 | 0.32975 | 0.3447   |
| 7   | 1376 | 1.0682  | 1.1586  | 0.90346  |
| 8   | 1203 | 1.0566  | 1.4784  | 0.69921  |
| 9   | 213  | 2.0411  | 2.0389  | 1.3643   |
| 10  | 1543 | 0.78169 | 0.47908 | 0.51499  |
| 11  | 980  | 0.96025 | 1.16    | 0.72548  |
| 12  | 1783 | 0.74162 | 0.84784 | 0.6762   |
| 13  | 1424 | 0.73974 | 0.51861 | 0.37127  |
| 14  | 1937 | 1.1603  | 1.4405  | 0.84322  |
| 15  | 1237 | 0.92314 | 0.50443 | 0.42126  |
| 16  | 1751 | 0.71062 | 0.37032 | 0.34264  |
| 17  | 1742 | 0.6608  | 0.40137 | 0.2978   |
| 18  | 1550 | 0.84246 | 1.2565  | 0.71967  |
| 19  | 222  | 1.1222  | 0.77297 | 0.95399  |
---------------------------------------------
Table truncated to show the first 20 experiments only
Re-run with verbosity >= 2 to show all experiments
Saving refined experiments to refined_combined_experiments.json

The overall final RMSDs are 0.17 mm in X, 0.16 mm in Y and 0.12 degrees in \(\phi\). The RMSDs per experiment are also shown, but only for the first 20 experiments. Rerunning with verbosity=2 does give the full table, but also produces a great deal more log output, so it would be easier to find in the file dials.refine.log rather than scrolling up pages in your terminal.

We can compare the RMSDs from individually refined experiments to those from the joint experiments. For example, look at the RSMDs for experiment 0, in the logfile sweep_00/dials.refine.log:

RMSDs by experiment:
--------------------------------------------
| Exp | Nref | RMSD_X  | RMSD_Y | RMSD_Z   |
|     |      | (px)    | (px)   | (images) |
--------------------------------------------
| 0   | 1000 | 0.57553 | 0.3374 | 0.26322  |
--------------------------------------------

Clearly allowing the detector and beam to refine only against this data lets the model better fit the observations, but is it a more accurate description of reality? Given that we know or can comfortably assume that the detector and beam did not move between data collections, then the constraints applied by joint refinement seem appropriate. For better parity with the original results perhaps we should use outlier rejection though. Now the models are close enough it is safe to do so:

dials.refine refined_combined_experiments.json combined_reflections.pickle \
  use_all_reflections=true \
  outlier.algorithm=tukey \
  output.experiments=refined_combined_experiments_outrej.json

The RMSD tables resulting from this:

Refinement steps:
------------------------------------------------
| Step | Nref  | RMSD_X  | RMSD_Y   | RMSD_Phi |
|      |       | (mm)    | (mm)     | (deg)    |
------------------------------------------------
| 0    | 50918 | 0.10361 | 0.06205  | 0.05831  |
| 1    | 50918 | 0.10333 | 0.061719 | 0.057777 |
| 2    | 50918 | 0.10311 | 0.061549 | 0.057746 |
| 3    | 50918 | 0.10277 | 0.061306 | 0.057601 |
| 4    | 50918 | 0.10246 | 0.061116 | 0.057267 |
| 5    | 50918 | 0.10228 | 0.061063 | 0.056877 |
| 6    | 50918 | 0.10215 | 0.061081 | 0.05668  |
| 7    | 50918 | 0.10208 | 0.061099 | 0.05666  |
| 8    | 50918 | 0.10204 | 0.061066 | 0.056661 |
| 9    | 50918 | 0.10201 | 0.060985 | 0.056634 |
| 10   | 50918 | 0.102   | 0.0609   | 0.056573 |
| 11   | 50918 | 0.10203 | 0.060857 | 0.056504 |
| 12   | 50918 | 0.10205 | 0.060845 | 0.056468 |
| 13   | 50918 | 0.10206 | 0.060843 | 0.05646  |
| 14   | 50918 | 0.10206 | 0.060843 | 0.05646  |
------------------------------------------------
RMSD no longer decreasing

RMSDs by experiment:
---------------------------------------------
| Exp | Nref | RMSD_X  | RMSD_Y  | RMSD_Z   |
|     |      | (px)    | (px)    | (images) |
---------------------------------------------
| 0   | 1304 | 0.57371 | 0.34681 | 0.30517  |
| 1   | 1275 | 0.60022 | 0.34285 | 0.30982  |
| 2   | 1004 | 0.67823 | 0.41947 | 0.29667  |
| 3   | 1211 | 0.61019 | 0.42341 | 0.26994  |
| 4   | 374  | 0.66814 | 0.41793 | 0.28288  |
| 5   | 1429 | 0.53542 | 0.30974 | 0.25422  |
| 6   | 1426 | 0.51288 | 0.282   | 0.23681  |
| 7   | 1237 | 0.65645 | 0.32797 | 0.27486  |
| 8   | 1090 | 0.54471 | 0.34442 | 0.2591   |
| 9   | 137  | 1.2492  | 0.48144 | 0.31548  |
| 10  | 1483 | 0.54167 | 0.33374 | 0.25129  |
| 11  | 907  | 0.56563 | 0.39174 | 0.26267  |
| 12  | 1697 | 0.53376 | 0.33867 | 0.25553  |
| 13  | 1354 | 0.59745 | 0.32363 | 0.27096  |
| 14  | 1766 | 0.55775 | 0.30882 | 0.25687  |
| 15  | 1109 | 0.68372 | 0.35892 | 0.31     |
| 16  | 1636 | 0.5659  | 0.3262  | 0.30059  |
| 17  | 1656 | 0.53262 | 0.32716 | 0.26653  |
| 18  | 1401 | 0.51543 | 0.37366 | 0.2767   |
| 19  | 172  | 0.90236 | 0.38946 | 0.39827  |
---------------------------------------------
Table truncated to show the first 20 experiments only

Now we have RMSDs in X down to 0.1 mm, in Y to 0.06 mm and 0.06 degrees in \(\phi\). The RMSDs for experiment 0 are not so much worse than from the individual refinement job. We are happy with this result and move on to re-integrating the data to create MTZs for BLEND.

Analysis of jointly refined datasets

dials.integrate will not work with our refined_combined_experiments_outrej.json and combined_reflections.pickle directly, so we have to separate these into individual files for each experiment. It is best to do this inside a new directory:

mkdir joint
cd !$
dials.split_experiments ../refined_combined_experiments_outrej.json ../combined_reflections.pickle

This fills the directory with 39 individual experiments_##.json and reflections_##.pickle files. To integrate these quickly we want a script to run in parallel, similar to the one used previously:

#!/bin/env dials.python
import os
import sys
import glob
from libtbx import easy_run, easy_mp
from dials.test import cd

def process_sweep(task):
  """Process a single sweep of data. The parameter 'task' will be a
  tuple, the first element of which is an integer job number and the
  second is the path to the directory containing the data"""

  num = task[0]
  datadir = task[1]

  experiments_file = "experiments_%02d.json" % num
  reflections_file = "reflections_%02d.pickle" % num
  experiments_path = os.path.join(datadir, experiments_file)
  reflections_path = os.path.join(datadir, reflections_file)

  # create directory
  with cd("sweep_%02d" % num):
    # WARNING! Fast and dirty integration.
    # Do not use the result for scaling/merging!
    cmd = "dials.integrate %s %s " + \
          "profile.fitting=False prediction.dmin=8.0 prediction.dmax=8.1"
    cmd = cmd % (experiments_path, reflections_path)
    easy_run.fully_buffered(command=cmd)
    if not os.path.isfile("integrated.pickle"):
      print "Job %02d failed during integration" % num
      return

    # create MTZ
    cmd = "dials.export %s integrated.pickle mtz.hklout=integrated.mtz"
    cmd = cmd % experiments_path
    easy_run.fully_buffered(command=cmd)
    if not os.path.isfile("integrated.mtz"):
      print "Job %02d failed during MTZ export" % num
      return

  # if we got this far, return the path to the MTZ
  return "sweep_%02d/integrated.mtz" % num

if __name__ == "__main__":

  if len(sys.argv) != 2:
    sys.exit("Usage: dials.python integrate_joint_TehA.py ..")
  data_dir = os.path.abspath(sys.argv[1])

  pathname = os.path.join(data_dir, "experiments_*.json")
  experiments = glob.glob(pathname)

  templates = [data_dir for f in experiments]
  tasklist = list(enumerate(sorted(templates)))

  from libtbx import Auto
  nproc = easy_mp.get_processes(Auto)

  print "Attempting to process the following datasets, with {} processes".format(nproc)
  for task in tasklist:
    print "%d: %s/experiments%02d" % (task[0], task[1], task[0])

  results = easy_mp.parallel_map(
    func=process_sweep,
    iterable=tasklist,
    processes=nproc,
    preserve_order=True)

  good_results = [e for e in results if e is not None]
  print "Successfully created the following MTZs:"
  for result in good_results:
    print result

This, if saved as integrate_joint_TehA.py in the new joint directory can be run as follows:

dials.python integrate_joint_TehA.py .

As expected this creates all 40 MTZs for the jointly refined sweeps without any problem. We can copy the paths to these into a new file, say joint_mtzs.dat, and run blend:

echo "END" | blend -a joint_mtzs.dat

The tree.png resulting from this is very interesting.

../../_images/tree_03.png

The LCV is now as low as 0.36% (aLCV 0.6 Angstroms). This indicates an even higher degree of isomorphism than detected during after individual processing. So although joint refinement leads to slightly higher RMSDs for each experiment (as we expected) the resulting unit cells are more similar. It is worth remembering that no restraints were applied between unit cells in refinement. Given that we know that the detector and beam did not move between the data collections we might like to think that the joint refinement analysis is a more accurate depiction of reality, and thus the unit cells are closer to the truth.

What to do next?

This has given us a good starting point for analysis with BLEND. However, because of the shortcuts we took with integration we are not yet ready to continue with BLEND’s synthesis mode. At this point we might assess where we are and try a few things:

  • Go back and fix datasets that didn’t index properly. We could edit our processing script to attempt method=fft1d for example if the 3D FFT indexing was unsuccessful.
  • Integrate data properly for BLEND’s synthesis mode. We should remove the resolution limits and allow dials.integrate to do profile fitting as well as summation integration.

Acknowledgements

The TehA project and original BLEND analysis was performed by scientists at Diamond Light Source and the Membrane Protein Laboratory. We thank the following for access to the data: Danny Axford, Nien-Jen Hu, James Foadi, Hassanul Ghani Choudhury, So Iwata, Konstantinos Beis, Pierre Aller, Gwyndaf Evans & Yilmaz Alguel