Multiple run alignment using a minimum spanning tree (MST).
This class will align features across multiple runs using a strictly local approach. It uses a minimum spanning tree (MST) as input, which is expected to allow traversal of all runs while only connecting the most similar runs. This should allow accurate local alignment between two very similar runs and transfer of identifications with high accuracy. Specifically, this should allow scaling to a large number of dissimilar runs, where approaches that rely on a single reference run for alignment might give less accurate results.
Briefly, the algorithm will choose the best scoring peakgroup as a seed and start to traverse the MST from this seed. At each node, it will add the best matching peakgroup (by score, within a specified retention time window) to the result. After traversing all nodes, a new seed can be chosen among the peakgroups not yet belonging to a cluster and the process can be repeated to produce multiple clusters.
For example, consider a case of 5 LC-MS/MS runs where about 6 different features (peakgroups) were found in each run (not all peakgroups were found in all runs):
Run 1:
 pg1_1  pg1_2  pg1_3  pg1_4  pg1_5  pg1_6
Run 2:
 pg2_1  pg2_2  pg2_3  pg2_4  pg2_5  pg2_6
Run 3:
 pg3_0  pg3_1  pg3_2  pg3_3  pg3_4  pg3_5  pg3_6
Run 4:
 pg4_0  pg4_1  pg4_2  pg4_3  pg4_4  pg4_5
Run 5:
 pg5_0  pg5_1  pg5_2  pg5_3  pg5_4  pg5_5
Assume that the corresponding MST looks like this:
                  / Run4
Run1 - Run2 - Run3
                  \ Run5
This is a case where Run1 and Run2 are very similar, and Run3 and Run4 are rather similar, so these pairs should be easy to align. The algorithm will start with the “best” peakgroup overall (the one with the best probability score); assume this peakgroup is pg1_1 from Run1. The algorithm will then use the alignment Run1-Run2 to infer that pg2_1 is the same signal as pg1_1 and add it to the group. Specifically, it will select the highest-scoring peakgroup within a narrow RT window (max_rt_diff) in Run2. Note that if the RT window is too wide, there is a chance of mismatching, e.g. pg2_2 will be selected instead of pg2_1. The alignment Run2-Run3 will be used to add pg3_1. Then a bifurcation in the tree occurs, and Run3-Run4 as well as Run3-Run5 will be used to infer the identity of pg4_1 and pg5_1 and add them to the cluster. In the end, the algorithm will report (pg1_1, pg2_1, pg3_1, pg4_1, pg5_1) as a consistent cluster across multiple runs. This process can be repeated with the next best peakgroup that is not yet part of a cluster (e.g. pg1_2) until no more peakgroups are left (no more peakgroups having a score below fdr_cutoff).
Note how the algorithm only used binary, purely local alignments between the runs that are closest to each other. This stands in contrast to approaches where a single reference is picked and then used for alignment, which might align runs that are substantially different. On the other hand, a single error at one edge in the tree will propagate and could lead to whole subtrees that are wrongly aligned.
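The traversal described above can be illustrated with a minimal, self-contained sketch. It is not the class's actual implementation: the function name cluster_from_seed, the tuple layout (name, rt, score), the adjacency-dict tree and the transform callable are all assumptions made for illustration, and a lower score is assumed to be better. Runs without a matching peakgroup are still traversed using the expected retention time, which is also an assumption.

```python
from collections import deque

def cluster_from_seed(tree, peakgroups, seed_run, seed_pg, transform, max_rt_diff):
    """Walk the tree breadth-first from the seed run and pick one peakgroup per run.

    tree:       dict run -> list of neighbouring runs (undirected tree edges)
    peakgroups: dict run -> list of (name, rt, score); lower score is better (assumed)
    transform:  callable (from_run, to_run, rt) -> expected rt in to_run (hypothetical)
    """
    selected = {seed_run: seed_pg}
    expected_rt = {seed_run: seed_pg[1]}
    queue = deque([seed_run])
    while queue:
        current = queue.popleft()
        for neighbour in tree[current]:
            if neighbour in expected_rt:
                continue  # already visited
            # Only the pairwise alignment along this tree edge is used.
            rt = transform(current, neighbour, expected_rt[current])
            candidates = [pg for pg in peakgroups[neighbour]
                          if abs(pg[1] - rt) <= max_rt_diff]
            if candidates:
                best = min(candidates, key=lambda pg: pg[2])
                selected[neighbour] = best
                rt = best[1]
            expected_rt[neighbour] = rt
            queue.append(neighbour)
    return selected

# Toy data following the five-run example (values are made up).
TREE = {"Run1": ["Run2"], "Run2": ["Run1", "Run3"],
        "Run3": ["Run2", "Run4", "Run5"], "Run4": ["Run3"], "Run5": ["Run3"]}
PEAKGROUPS = {
    "Run1": [("pg1_1", 100.0, 0.001), ("pg1_2", 200.0, 0.005)],
    "Run2": [("pg2_1", 102.0, 0.002), ("pg2_2", 203.0, 0.0005)],
    "Run3": [("pg3_1", 105.0, 0.003)],
    "Run4": [("pg4_1", 104.0, 0.004)],
    "Run5": [("pg5_1", 106.0, 0.004)],
}

def identity(from_run, to_run, rt):
    # Stand-in for a real pairwise RT transformation.
    return rt

seed = PEAKGROUPS["Run1"][0]
narrow = cluster_from_seed(TREE, PEAKGROUPS, "Run1", seed, identity, max_rt_diff=5.0)
wide = cluster_from_seed(TREE, PEAKGROUPS, "Run1", seed, identity, max_rt_diff=150.0)
```

With the narrow window the sketch recovers one peakgroup per run along the tree; with the overly wide window the better-scoring but unrelated pg2_2 is picked in Run2, which is the mismatching risk mentioned above.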
Use the MST to report all clusters.
Briefly, the algorithm will choose the best scoring peakgroup as a seed and start to traverse the MST from this seed. At each node, it will add the best matching peakgroup (by score, within a specified retention time window) to the result. After traversing all nodes, a new seed can be chosen among the peakgroups not yet belonging to a cluster and the process can be repeated to produce multiple clusters. It will add clusters until no more peptides with an fdr score better than self._fdr_cutoff are left.
Parameters: 


Returns:  None 
Use the MST to report the first cluster containing the best peptide (overall).
The algorithm will go through all multipeptides and mark those peakgroups which it deems to belong to the best peakgroup cluster (only the first cluster will be reported).
Parameters: 


Returns:  None 
Compute distance matrix of all runs.
Computes an n x n distance matrix between all runs of an experiment. The reported distance is 1 minus the R-squared value (1 - R^2) from the linear regression.
Parameters: 


Returns:  numpy (n x n) matrix (float) – distance matrix 
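As a rough illustration of the 1 - R^2 distance, the following sketch computes it for retention times of peptides shared between pairs of runs. It uses plain Python lists instead of a numpy matrix; the names r_squared and distance_matrix and the input layout (one dict of peptide id to retention time per run) are assumptions, not the library's API. For a simple linear regression, R^2 equals the squared Pearson correlation, which is what is computed here. The sketch assumes each pair of runs shares at least two peptides with differing retention times.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, equal to R^2 of a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

def distance_matrix(rts_per_run):
    """rts_per_run: list of dicts mapping peptide id -> retention time (hypothetical layout)."""
    n = len(rts_per_run)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Only peptides observed in both runs can be compared.
            shared = sorted(set(rts_per_run[i]) & set(rts_per_run[j]))
            xs = [rts_per_run[i][p] for p in shared]
            ys = [rts_per_run[j][p] for p in shared]
            dist[i][j] = dist[j][i] = 1.0 - r_squared(xs, ys)
    return dist

runs = [
    {"a": 10.0, "b": 20.0, "c": 30.0},
    {"a": 12.0, "b": 22.0, "c": 32.0},  # shifted copy of the first run
    {"a": 10.0, "b": 30.0, "c": 20.0},  # elution order partly swapped
]
dist = distance_matrix(runs)
```

A matrix of this kind is the natural input for building the minimum spanning tree: small distances connect runs whose retention times are almost linearly related.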
A helper class representing a cluster (used in AlignmentAlgorithm)
Calculate the median retention time of a cluster
Calculate the standard deviation of the retention times
Calculate the total score of a cluster (multiplication of probabilities)
Ensure that only one peakgroup is selected per run.
If there are multiple peakgroups selected, only the best one is retained.
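The four operations listed for the cluster helper can be sketched as follows. This is an illustrative stand-in, not the library's class: the name ClusterSketch, its method names, and the (run_id, rt, score) tuple layout are assumptions, as are the choices that a lower score is better and that the population standard deviation is meant.

```python
import statistics

class ClusterSketch:
    """Hypothetical stand-in for a cluster of peakgroups across runs."""

    def __init__(self, peakgroups):
        # Each peakgroup is (run_id, rt, score); lower score is better (assumed).
        self.peakgroups = list(peakgroups)

    def median_rt(self):
        return statistics.median([rt for _, rt, _ in self.peakgroups])

    def rt_std(self):
        # Population standard deviation (an assumption about the definition).
        return statistics.pstdev([rt for _, rt, _ in self.peakgroups])

    def total_score(self):
        # Multiplication of the probability-like scores.
        total = 1.0
        for _, _, score in self.peakgroups:
            total *= score
        return total

    def select_one_per_run(self):
        # Keep only the best-scoring peakgroup of every run.
        best = {}
        for pg in self.peakgroups:
            run_id = pg[0]
            if run_id not in best or pg[2] < best[run_id][2]:
                best[run_id] = pg
        self.peakgroups = list(best.values())

cluster = ClusterSketch([
    ("run1", 100.0, 0.01),
    ("run2", 102.0, 0.02),
    ("run2", 110.0, 0.5),   # second, worse candidate in run2
    ("run3", 104.0, 0.1),
])
cluster.select_one_per_run()
```

After select_one_per_run, the worse run2 candidate is dropped, so the median retention time and the score product are computed over exactly one peakgroup per run.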
A class of alignment algorithms
Perform the alignment on a set of multipeptides
Parameters: 


Returns:  Alignment object with alignment statistics 
Use the data smoothing part of msproteomicstoolslib to align two runs in retention time using splines.
>>> spl_aligner = SplineAligner()
>>> transformations = spl_aligner.rt_align_all_runs(this_exp, multipeptides, options.alignment_score, options.use_scikit)
Get the error of the transformation
Returns:  transformation_error – the error of the transformation 

Return type:  TransformationError 
Align all runs contained in an MRExperiment
Parameters: 


Determine the optimal integration border by using the shortest path in the MST
Parameters: 


Returns:  A tuple of (left_integration_border, right_integration_border) 
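One ingredient of the shortest-path approach can be sketched as follows: in a tree the path between two runs is unique, so a border found in one run can be carried to another run by chaining the pairwise transformations along that path. The names tree_path and map_borders, the adjacency-dict tree and the dict of per-edge transformation callables are assumptions for illustration only; the actual method may combine several peakgroups and is not reproduced here. The sketch assumes the tree is connected.

```python
from collections import deque

def tree_path(tree, start, goal):
    """Return the unique path from start to goal in a connected tree."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbour in tree[node]:
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1]

def map_borders(tree, transforms, source_run, target_run, left, right):
    """Carry (left, right) borders from source_run to target_run edge by edge.

    transforms: dict (from_run, to_run) -> callable rt -> rt (hypothetical layout)
    """
    path = tree_path(tree, source_run, target_run)
    for a, b in zip(path, path[1:]):
        left = transforms[(a, b)](left)
        right = transforms[(a, b)](right)
    return (left, right)

tree = {"Run1": ["Run2"], "Run2": ["Run1", "Run3"],
        "Run3": ["Run2", "Run4", "Run5"], "Run4": ["Run3"], "Run5": ["Run3"]}
# Made-up pairwise transformations (simple shifts) for the edges that are used.
transforms = {
    ("Run1", "Run2"): lambda rt: rt + 2.0,
    ("Run2", "Run3"): lambda rt: rt + 3.0,
    ("Run3", "Run4"): lambda rt: rt - 1.0,
}
borders = map_borders(tree, transforms, "Run1", "Run4", 100.0, 110.0)
```

Chaining only local transformations keeps each step between similar runs, at the cost of accumulating any error made along the path.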
Determine the optimal integration border by using the shortest distance (direct transformation)
Parameters: 


Returns:  A tuple of (left_integration_border, right_integration_border) 
Determine the optimal integration border by taking the mean of all other peakgroup boundaries using a reference run.
Parameters: 


Returns:  A tuple of (left_integration_border, right_integration_border) in the retention time space of the _reference_ run 
A collection of the same precursors (chromatograms) across multiple runs.
It contains individual precursors that can be accessed by their run id.
Returns True if all peakgroups are selected
Find best peakgroup across all peptides
Get precursor group for the given run
Parameters:  runid (str) – Run id of the group 

Returns:  precursor_group – Precursor group from the corresponding run 
Return type:  PrecursorGroup 
Get all precursor groups
Returns:  precursor_group – All precursor groups across the runs 

Return type:  list of PrecursorGroup 
Whether the current peptide is a decoy or not
Returns:  decoy – Whether the peptide is decoy or not 

Return type:  bool 
Get all peakgroups that were selected across all runs and precursor groups
Checks whether a given run has a precursor group
Parameters:  runid (str) – Run id to check 

Returns:  check – Whether the given run has a precursor group 
Return type:  bool 
Whether there are runs in which no peptide was detected (peptide is Null)
Returns:  has_null – Whether there are Null peptides in this object (not detected in some runs) 

Return type:  bool 
Insert a PrecursorGroup into the Multipeptide
Parameters: 


Raises:  Exception – If self.hasPrecursorGroup(runid) is true 
Bases: object
An MR (multirun) Experiment is a container for multiple experimental runs.
In some of the runs the same peptide precursors may be identified, and the job of this object is to keep track of these runs and the precursors identified across them.
Example usage:
>>> # Read the files
>>> fdr_cutoff = 0.01 # 1% FDR
>>> reader = SWATHScoringReader.newReader(infiles, "openswath")
>>> this_exp = Experiment()
>>> this_exp.set_runs( reader.parse_files(options.realign_runs) )
>>> multipeptides = this_exp.get_all_multipeptides(fdr_cutoff)
Match all precursors in different runs to each other.
Find all precursors that pass the fdr cutoff in each run and build a union of those precursors. Then search for each of those precursors in all the other runs and build a multipeptide / multiprecursor.
Parameters: 


Bases: object
Parameter estimation object
In a first step the percentage of decoys of all peakgroups at the target fdr is computed (which is then taken as the “aim”). For this “aim” of decoy percentage, the class will try to estimate an fdr_cutoff such that the percentage of decoy precursors in the final reported result will correspond to the “aim”.
If the parameter min_runs (at initialization) is higher than 1, only precursors that are identified in min_runs above the fdr_cutoff will be reported.
>>> p = ParamEst()
>>> decoy_frac = p.compute_decoy_frac(multipeptides, target_fdr)
>>> fdr_cutoff_calculated = p.find_iterate_fdr(multipeptides, decoy_frac)
Calculate how many of the peakgroups are decoy for a given cutoff.
Iteratively find a q-value cutoff to reach the specified decoy fraction
This function will step through multiple q-value thresholds and evaluate how many peptides have at least one peakgroup whose q-value (using get_fdr_score) is below that threshold. The q-value cutoff is then adapted until the fraction of decoys in the result is equal to the specified fraction given as input.
Parameters: 


Returns:  None 
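The idea of stepping through q-value thresholds can be sketched with a simple linear scan. This is a simplified illustration, not the method's actual search strategy: the names decoy_fraction and find_cutoff, the (is_decoy, q-values) tuple layout and the fixed step size are assumptions. The sketch returns the largest scanned cutoff whose decoy fraction does not exceed the target.

```python
def decoy_fraction(peptides, cutoff):
    """Fraction of decoys among peptides with at least one peakgroup below cutoff.

    peptides: list of (is_decoy, [q-values of its peakgroups]) (hypothetical layout)
    """
    passing = [is_decoy for is_decoy, qvalues in peptides
               if any(q < cutoff for q in qvalues)]
    if not passing:
        return 0.0
    return sum(passing) / len(passing)

def find_cutoff(peptides, target_decoy_frac, start=0.0, stop=1.0, step=0.01):
    """Scan cutoffs upwards and stop before the decoy fraction exceeds the target."""
    previous = start
    n_steps = int(round((stop - start) / step))
    for i in range(1, n_steps + 1):
        cutoff = start + i * step
        if decoy_fraction(peptides, cutoff) > target_decoy_frac:
            return previous
        previous = cutoff
    return previous

peptides = [
    (False, [0.005]),
    (False, [0.015, 0.2]),
    (False, [0.025]),
    (True, [0.035]),
    (False, [0.045]),
    (True, [0.055]),
]
strict = find_cutoff(peptides, target_decoy_frac=0.2)
relaxed = find_cutoff(peptides, target_decoy_frac=0.3)
```

In this toy data the first decoy enters at a cutoff of 0.04, pushing the decoy fraction to 0.25, so the stricter target stops the scan one step earlier than the relaxed one.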