Format

Several classes and functions to deal with common mass spectrometric formats (mostly file I/O).

Transformation Collection Module

TransformationCollection

class msproteomicstoolslib.format.TransformationCollection.TransformationCollection

A class to store a transformation between retention times of multiple runs.

It allows adding transformation data (e.g. a pair of arrays that map coordinates from one RT space to the other). Once all data has been added, the transformations can be initialized from the data:

Compute a new transformation and write to file:
# data1 = reference data (master) with ref_id
# data2 = data to be aligned (slave) with current_id
>>> tcoll = TransformationCollection()
>>> tcoll.setReferenceRunID( ref_id )
>>> tcoll.addTransformationData([data2, data1], current_id, ref_id )
>>> tcoll.writeTransformationData( "outfile", current_id, ref_id)

Read a set of transformations from files:

>>> tcoll = TransformationCollection()
>>> for filename in ["file1.tr", "file2.tr"]:
...     tcoll.readTransformationData(filename)
>>> tcoll.initialize_from_data(reverse=True)

Compute a transformation:

>>> norm_value = tcoll.getTransformation(orig_runid, ref_id).predict( [ value ] )[0]
addTransformationData(data, s_from, s_to)

Add raw data points to the collection

Parameters:
  • data (list(data_slave, data_master)) – two data vectors containing the raw data points from two runs. Following the type annotation and the usage example above, the first data vector is the slave (to be aligned) data and the second one is the master (reference) data.
  • s_from (String) – run ID of the slave (to be aligned) run
  • s_to (String) – run ID of the master (reference) run
addTransformedData(data, s_from, s_to)

Add transformed data points to the collection

The idea is to add the anchor points of s_from in the space of s_to so that one could compute the transformation using a simple linear transform.

Parameters:
  • s_from (String) – run ID of the slave (to be aligned) run
  • s_to (String) – run ID of the master (reference) run
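
As an illustration of the idea, anchor points that are already expressed in the target space can drive a simple piecewise-linear mapping. The sketch below is not the library's implementation: make_linear_interpolator and the sample anchor values are hypothetical, it assumes Python 3 and distinct anchor positions, and it only uses the standard library.

```python
from bisect import bisect_right

def make_linear_interpolator(rt_from, rt_to):
    """Return a function mapping an RT in the s_from space to the s_to space.

    rt_from / rt_to are paired anchor points (hypothetical input layout).
    Values outside the anchor range are extrapolated from the nearest segment.
    """
    pairs = sorted(zip(rt_from, rt_to))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    if len(xs) < 2:
        raise ValueError("need at least two anchor points")

    def predict(x):
        # Index of the segment [xs[i], xs[i + 1]] used for this value,
        # clamped so that out-of-range values use the first or last segment.
        i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
        slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        return ys[i] + slope * (x - xs[i])

    return predict

# Anchor points of the run to be aligned, and their positions in the reference run.
to_reference = make_linear_interpolator([10.0, 20.0, 30.0], [12.0, 24.0, 30.0])
aligned_rt = to_reference(15.0)
```

A smoother such as the one selected in initialize_from_data would replace the straight segments; the linear version is shown only because it is the simplest case.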
getReferenceRunID()
getTransformation(s_from, s_to)
getTransformationData(s_from, s_to)
getTransformedData(s_from, s_to)
initialize_from_data(reverse=False, smoother='lowess')
printTransformationData(s_from, s_to)
readTransformationData(filename)

Read the transformation present in the file.

The header is either:
#Transformation Null
#Transformation Data "from_id" to "to_id" reference_id "ref_id"
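
The two header forms described above could be recognized along the following lines. This is a minimal sketch, not the code used by readTransformationData: parse_transformation_header is a hypothetical helper, and it assumes that the header fields are separated by whitespace and that run IDs contain no whitespace.

```python
def parse_transformation_header(line):
    """Parse a transformation file header line (hypothetical helper).

    Returns None for a null transformation, otherwise a
    (from_id, to_id, ref_id) tuple.
    """
    fields = line.strip().split()
    if fields[:2] == ["#Transformation", "Null"]:
        return None
    if (len(fields) == 7 and fields[:2] == ["#Transformation", "Data"]
            and fields[3] == "to" and fields[5] == "reference_id"):
        # Drop surrounding quotes, in case the IDs are quoted in the file.
        from_id, to_id, ref_id = (fields[i].strip('"') for i in (2, 4, 6))
        return from_id, to_id, ref_id
    raise ValueError("unrecognized transformation header: %r" % line)

parsed = parse_transformation_header(
    "#Transformation Data run_1 to run_0 reference_id run_0")
```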
setReferenceRunID(value)
writeTransformationData(filename, s_from, s_to)

Write the transformation s_from to s_to to a file.

The header is either:
#Transformation Null
#Transformation Data "from_id" to "to_id" reference_id "ref_id"

LightTransformationData

class msproteomicstoolslib.format.TransformationCollection.LightTransformationData(ref=None)

A lightweight data structure to store a transformation between retention times of multiple runs.

addData(run1, data1, run2, data2, doSort=True)

Add raw data for the transformation between two runs

addTrafo(run1, run2, trafo, stdev=None)

Add transformation between two runs

getData(run1, run2)
getReferenceRunID()
getStdev(run1, run2)
getTrafo(run1, run2)
getTransformation(run1, run2)

File Reader Module

SWATHScoringReader

class msproteomicstoolslib.format.SWATHScoringReader.ReadFilter

Bases: object

A callable class which can pre-filter a row and determine whether the row can be skipped.

If the call returns true, the row is examined but if it returns false, the row should be skipped.
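
A custom filter following this contract might look like the sketch below. The call signature is an assumption: the sketch supposes the filter is called with the row and a mapping from column names to positions, which is not stated here. ScoreCutoffFilter and the column names are hypothetical.

```python
class ScoreCutoffFilter(object):
    """Hypothetical pre-filter: examine only rows whose score exceeds a cutoff."""

    def __init__(self, column, cutoff):
        self.column = column
        self.cutoff = cutoff

    def __call__(self, row, header):
        # header maps column names to positions in the row (an assumption).
        return float(row[header[self.column]]) > self.cutoff

header = {"transition_group_id": 0, "d_score": 1}
rows = [["pep_a", "2.5"], ["pep_b", "0.3"]]
keep = ScoreCutoffFilter("d_score", 1.0)

# Rows for which the filter returns False are skipped.
examined = [row for row in rows if keep(row, header)]
```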

class msproteomicstoolslib.format.SWATHScoringReader.SWATHScoringReader
static newReader(infiles, filetype, readmethod="minimal", readfilter=ReadFilter(), errorHandling="strict", enable_isotopic_grouping=False)

Factory to create a new reader

parse_files(read_exp_RT=True, verbosity=10)

Parse the input file(s) (CSV).

Parameters: read_exp_RT (bool) – whether to read the real, experimental retention time (default behavior) or to use the delta iRT instead.
Returns: runs (list(SWATHScoringReader.Run))

A single CSV file might contain more than one run and thus to create unique run ids, we number the runs as xx_yy where xx is the current file number and yy is the run found in the current file. However, if an alignment has already been performed and each run has already obtained a unique run id, we can directly use the previous alignment id.
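
The xx_yy numbering scheme can be pictured as follows. This is only a sketch of the naming idea under stated assumptions: assign_run_ids is a hypothetical helper, and the zero-based counters are an assumption, not something this page specifies.

```python
def assign_run_ids(files):
    """Map (file number, run name) to an id of the form xx_yy.

    files holds, per input file, the run names in the order in which they
    are first seen in that file (hypothetical input layout).
    """
    run_ids = {}
    for file_nr, run_names in enumerate(files):
        for run_nr, name in enumerate(run_names):
            run_ids[(file_nr, name)] = "%s_%s" % (file_nr, run_nr)
    return run_ids

# Two input files: the first contains two runs, the second contains one.
ids = assign_run_ids([["runA", "runB"], ["runC"]])
```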

parse_row(run, this_row, read_exp_RT)
class msproteomicstoolslib.format.SWATHScoringReader.OpenSWATH_SWATHScoringReader(infiles, readmethod='minimal', readfilter=ReadFilter(), errorHandling='strict', enable_isotopic_grouping=False)

Bases: msproteomicstoolslib.format.SWATHScoringReader.SWATHScoringReader

Parser for OpenSWATH output

parse_row(run, this_row, read_exp_RT)
class msproteomicstoolslib.format.SWATHScoringReader.mProphet_SWATHScoringReader(infiles, readmethod='minimal', readfilter=ReadFilter(), enable_isotopic_grouping=False)

Bases: msproteomicstoolslib.format.SWATHScoringReader.SWATHScoringReader

Parser for mProphet output

parse_row(run, this_row, read_exp_RT)
class msproteomicstoolslib.format.SWATHScoringReader.Peakview_SWATHScoringReader(infiles, readmethod='minimal', readfilter=ReadFilter(), enable_isotopic_grouping=False)

Bases: msproteomicstoolslib.format.SWATHScoringReader.SWATHScoringReader

Parser for Peakview output

parse_row(run, this_row, read_exp_RT)
msproteomicstoolslib.format.SWATHScoringReader.inferMapping(rawdata_files, aligned_pg_files, mapping, precursors_mapping, sequences_mapping, verbose=False, throwOnMismatch=False)

Infers a mapping between raw chromatogram files (mzML) and processed feature TSV files

Usually one feature file can contain multiple aligned runs and maps to multiple chromatogram files (mzML). This function will try to guess the original name of the mzML based on the align_origfilename column in the TSV. Note that both files have some typical endings that are _not_ shared; these are generally removed before comparison.

Only an exact match is allowed.
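
The matching idea (strip the endings that are not shared, then require an exact match) can be sketched as follows. This is not the code of inferMapping: the ending lists, strip_ending and match_files are hypothetical, and the real function fills the mapping dictionaries passed to it rather than returning a new one.

```python
import os

# Hypothetical endings; the real lists used by inferMapping are not given here.
RAW_ENDINGS = (".chrom.mzML", ".mzML")
FEATURE_ENDINGS = ("_with_dscore.csv", ".tsv", ".csv")

def strip_ending(path, endings):
    """Return the basename of path without the first matching ending."""
    name = os.path.basename(path)
    for ending in endings:
        if name.endswith(ending):
            return name[:-len(ending)]
    return name

def match_files(rawdata_files, orig_filenames):
    """Pair each raw file with the original feature file name of the same stem."""
    stems = {strip_ending(f, FEATURE_ENDINGS): f for f in orig_filenames}
    mapping = {}
    for raw in rawdata_files:
        stem = strip_ending(raw, RAW_ENDINGS)
        if stem in stems:  # exact match only, no fuzzy comparison
            mapping[raw] = stems[stem]
    return mapping

mapping = match_files(
    ["/data/run1.chrom.mzML", "/data/run2.chrom.mzML"],
    ["/features/run1_with_dscore.csv", "/features/run3.tsv"],
)
```

In this toy input only run1 is paired; run2 and run3 have no exact counterpart and are left unmapped.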

Data Matrix Module

Functions for handling the output data matrix

MatrixWriters

msproteomicstoolslib.format.MatrixWriters.getwriter(matrix_outfile)

Factory function to get the correct writer depending on the file ending

Parameters: matrix_outfile (str) – Filename of the output, used to determine the output format. Valid formats are .xlsx, .xls, .csv and .tsv.
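
Dispatching on the file ending can be sketched as below. This is an illustration rather than the body of getwriter: pick_writer_name is a hypothetical helper that returns a class name instead of a writer instance, and both the case-insensitive comparison and the routing of .csv and .tsv to the same writer are assumptions.

```python
import os

def pick_writer_name(matrix_outfile):
    """Return the name of the writer class suited to the file ending."""
    ending = os.path.splitext(matrix_outfile)[1].lower()
    writers = {
        ".xlsx": "XlsxWriter",
        ".xls": "XlsWriter",
        ".csv": "CsvWriter",   # assumption: CSV and TSV share one writer
        ".tsv": "CsvWriter",
    }
    if ending not in writers:
        raise ValueError("unsupported matrix format: %r" % matrix_outfile)
    return writers[ending]

chosen = pick_writer_name("aligned_matrix.tsv")
```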
class msproteomicstoolslib.format.MatrixWriters.IWriter(outfile, delim=None)

Interface. You need to implement __init__, write, newline and __del__.

newline()
write(entry, color=None)
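
A writer that follows these method names might be sketched as shown below. TextMatrixWriter is hypothetical and does not subclass the real IWriter so that the example stays self-contained. It takes a file-like object instead of a filename, which is an assumption made for easy testing, and it omits the destructor because the caller owns the buffer.

```python
import io

class TextMatrixWriter(object):
    """Hypothetical writer following the IWriter method names."""

    def __init__(self, outfile, delim="\t"):
        self.out = outfile          # any object with a write() method
        self.delim = delim
        self.at_line_start = True

    def write(self, entry, color=None):
        # color is accepted for interface compatibility and ignored here.
        if not self.at_line_start:
            self.out.write(self.delim)
        self.out.write(str(entry))
        self.at_line_start = False

    def newline(self):
        self.out.write("\n")
        self.at_line_start = True

buffer = io.StringIO()
writer = TextMatrixWriter(buffer, delim=",")
for row in [["Peptide", "run_0"], ["PEPTIDER", 1234.5]]:
    for cell in row:
        writer.write(cell)
    writer.newline()
matrix_text = buffer.getvalue()
```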
class msproteomicstoolslib.format.MatrixWriters.CsvWriter(outfile, delim='\t')

Bases: msproteomicstoolslib.format.MatrixWriters.IWriter

newline()
write(entry, color='ignored')
class msproteomicstoolslib.format.MatrixWriters.XlsWriter(outfile, delim='ignored')

Bases: msproteomicstoolslib.format.MatrixWriters.IWriter

newline()
write(entry, color='d')
class msproteomicstoolslib.format.MatrixWriters.XlsxWriter(outfile, delim='ignored')

Bases: msproteomicstoolslib.format.MatrixWriters.IWriter

newline()
write(entry, color='d')

Spectral library Module

Functions for handling SpectraST spectral library format

Spectral library handler

class msproteomicstoolslib.format.speclib_db_lib.Library(lkey=None)

This class contains one spectral library. It provides a read/write interface to the database and to the SpectraST *.splib and *.pepidx files. Spectra can easily be added or retrieved.

add_spectra(s)
all_spectra()

Iterate over all spectra in the library

Yield:
spectrum(Spectra): current spectrum
annotate_with_libkey()

Annotate spectra with the key of the current library

count_modifications()
delete_library_from_DB(library_key, db)

Delete current library from SQL database

delete_reverse_spectra()
find_by_sequence(sequence, db)

This function can be used to access spectra using a sequence search

find_by_sql(query_in, db)

This function can be used to access spectra using an SQL query. The query should produce a single column with spectra_keys. This can be very slow; use find_by_sql_fast instead (~400x faster).

find_by_sql_fast(subQuery, db, tmp_db)

This function can be used to access spectra using an SQL query. The query should produce a single column with spectra_keys (ids), which MUST be called tmp_spectra_keys. You need CREATE TABLE privileges in the database tmp_db for this, but it can be about 400x faster than plain find_by_sql.
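
The required shape of the sub-query (a single column aliased to tmp_spectra_keys) is illustrated below. The sketch uses Python's built-in sqlite3 module purely so that it runs on its own; it is not the database backend or the code of find_by_sql_fast. The table layout is hypothetical, and materializing the keys in a temporary table is only one plausible reading of why CREATE TABLE privileges are needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified spectra table.
conn.execute(
    "CREATE TABLE spectra (spectra_key INTEGER PRIMARY KEY, sequence TEXT)")
conn.executemany(
    "INSERT INTO spectra VALUES (?, ?)",
    [(1, "PEPTIDE"), (2, "ANOTHER"), (3, "PEPTIDE")],
)

# The sub-query yields exactly one column, named tmp_spectra_keys.
sub_query = ("SELECT spectra_key AS tmp_spectra_keys "
             "FROM spectra WHERE sequence = 'PEPTIDE'")

# Materialize the keys once, then fetch all matching spectra with a join.
conn.execute("CREATE TEMP TABLE tmp_keys AS " + sub_query)
rows = conn.execute(
    "SELECT s.spectra_key, s.sequence FROM spectra s "
    "JOIN tmp_keys t ON s.spectra_key = t.tmp_spectra_keys "
    "ORDER BY s.spectra_key"
).fetchall()
conn.close()
```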

get_all_spectra()
get_fileheader(splibFileName)

Get the header preceding the first spectrum in a SpectraST file.

get_first_offset(splibFileName)
get_rawspectrum_with_offset(splibFileName, offset)

Get a raw spectrum, unmodified, from a SpectraST file by using an offset to locate it.

get_spectra_by_sequence(sequence)

Get all spectra that match a specific sequence

init_with_self(library)

Initialize with another library. This does not perform a deep copy.

measure_nr_spectra()
nr_unique_peptides()
read_fromDB(library_key, db)

This function can be used to access one complete library from the DB.

static read_from_db_to_file(library_key, db, filePrefix)

This function can be used to access one complete library from the DB directly to a file.

static read_library_to_db(splibFileName, pepidxFileName, db, library_key)

Read directly from a spectral library into the database.

read_pepidx(filename)
read_spectrum_sptxt_idx(splibFileName, idx, library_key)

Fetch a spectrum from the spectral library by using the binary index.

read_sptxt(filename)
read_sptxt_pepidx(splibFileName, pepidxFileName, library_key)

Read directly from a spectral library into memory.

read_sptxt_with_offset(splibFileName, offset)

Read an sptxt spectral library file by using an offset, to keep memory free.
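
Offset-based access means that only one entry at a time has to be held in memory: the file is indexed once, and a single entry is later read by seeking to its byte offset. The sketch below shows the technique on a toy file. It assumes Python 3, entries that start with a "Name:" line and end at a blank line, and a simplified layout that is not a complete sptxt file; index_entries and read_entry are hypothetical helpers, not library functions.

```python
import os
import tempfile

def index_entries(path):
    """Return the byte offset of every line starting with 'Name:'."""
    offsets = []
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b"Name:"):
                offsets.append(offset)
    return offsets

def read_entry(path, offset):
    """Read one entry starting at offset, up to the next blank line."""
    lines = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        for line in handle:
            if not line.strip():
                break
            lines.append(line.decode("ascii").rstrip("\n"))
    return lines

# Toy file in a simplified, sptxt-like layout.
with tempfile.NamedTemporaryFile("wb", suffix=".sptxt", delete=False) as tmp:
    tmp.write(b"### header\nName: AAK/2\nNumPeaks: 1\n100.0\t5.0\n\n"
              b"Name: PEPTIDER/2\nNumPeaks: 1\n200.0\t7.0\n\n")
offsets = index_entries(tmp.name)
second = read_entry(tmp.name, offsets[1])
os.remove(tmp.name)
```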

remove_duplicate_entries()
set_library_key(lkey)
write(filePrefix, append=False)

Write the current library to a file.

write_sorted(filePrefix)
write_toDB(db, cursor)

Write all spectra into a SQL database

class msproteomicstoolslib.format.speclib_db_lib.SequenceHandler

Container class for all spectra that map to the same sequence inside a spectral library

add_meta(meta)
add_spectra(spectra)
add_spectra_no_duplicates(spectra)
empty()
init_with_self(handler)
remove(s)
remove_duplicate_entries()
class msproteomicstoolslib.format.speclib_db_lib.Spectra

A single spectrum inside a spectral library

acetyl_len()
add_meta(sequence, modifications, library_key)
analyse_mod()
carbamido_len()
escape_string(string)
find(id, db)
get_known_modifications()
get_meta_headers()
get_peaks()
get_spectra_headers()
icat_len()
initialize()

Initialize spectrum

is_tryptic()
methyl_len()
other_known_len()
other_len()
oxidations_len()
parse_SearchEngineInfo(searchEngineInfo)
parse_comments(comment)
parse_sptxt(stack)

Parse an sptxt entry and initialize spectrum

phospho_len()
phosphos_len()
save(db)
to_pepidx_str()

Convert spectrum object to pepidx format

to_splib_str()

Convert spectrum object to splib format

validate()