TantanApp#

class biotite.application.tantan.TantanApp(sequence, matrix=None, bin_path='tantan')#

Bases: LocalApp

Mask sequence repeat regions using tantan. [1]

Parameters:
sequence : (list of) NucleotideSequence or ProteinSequence

The sequence(s) to be masked. Either a single sequence or multiple sequences can be masked. Masking multiple sequences in a single run decreases the run time compared to multiple runs with a single sequence. All sequences must be of the same type.

matrix : SubstitutionMatrix, optional

The substitution matrix to use for repeat identification. A sequence segment is considered a repeat of another segment if the substitution score between these segments is greater than a threshold value.

bin_path : str, optional

Path of the tantan binary.

References

Examples

>>> from biotite.application.tantan import TantanApp
>>> from biotite.sequence import NucleotideSequence
>>> sequence = NucleotideSequence("GGCATCGATATATATATATAGTCAA")
>>> app = TantanApp(sequence)
>>> app.start()
>>> app.join()
>>> repeat_mask = app.get_mask()
>>> print(repeat_mask)
[False False False False False False False False False  True  True  True
  True  True  True  True  True  True  True  True False False False False
 False]
>>> print(sequence, "\n" + "".join(["^" if e else " " for e in repeat_mask]))
GGCATCGATATATATATATAGTCAA
         ^^^^^^^^^^^

add_additional_options(options)#

Add additional options for the command line program. These options are put before the arguments automatically determined by the respective LocalApp subclass.

This method is intended for advanced users who know the available options of the command line program and the options already used by the LocalApp subclasses. Ignoring the already used options may result in conflicting CLI arguments and unexpected results. It is recommended to use this method only when the respective LocalApp subclass does not provide a method to set the desired option.

Parameters:
options : list of str

A list of strings representing the command line options.

Notes

In order to see which options the command line execution used, try the get_command() method.
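The ordering described above can be pictured with a short sketch. The helper `build_command` and the file names are hypothetical and not part of Biotite; the sketch only illustrates that user-supplied options end up between the binary name and the arguments assembled by the subclass.

```python
def build_command(bin_path, arguments, additional_options=()):
    """Hypothetical helper: join binary, extra options and arguments."""
    # Extra options come first, then the automatically determined arguments
    return " ".join([bin_path, *additional_options, *arguments])


# Placeholder arguments; a real subclass determines these itself
arguments = ["--in", "input.fa", "--out", "output.fa"]
plain = build_command("clustalo", arguments)
extended = build_command("clustalo", arguments, ["--full"])
print(plain)
print(extended)
```

If an added option duplicates one of the automatically determined arguments, both appear in the command, which is the kind of conflict the warning above refers to.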

Examples

>>> from biotite.application.clustalo import ClustalOmegaApp
>>> from biotite.sequence import ProteinSequence
>>> seq1 = ProteinSequence("BIQTITE")
>>> seq2 = ProteinSequence("TITANITE")
>>> seq3 = ProteinSequence("BISMITE")
>>> seq4 = ProteinSequence("IQLITE")
>>> # Run application without additional arguments
>>> app = ClustalOmegaApp([seq1, seq2, seq3, seq4])
>>> app.start()
>>> app.join()
>>> print(app.get_command())
clustalo --in ...fa --out ...fa --force --output-order=tree-order --seqtype Protein --guidetree-out ...tree
>>> # Run application with additional argument
>>> app = ClustalOmegaApp([seq1, seq2, seq3, seq4])
>>> app.add_additional_options(["--full"])
>>> app.start()
>>> app.join()
>>> print(app.get_command())
clustalo --full --in ...fa --out ...fa --force --output-order=tree-order --seqtype Protein --guidetree-out ...tree

cancel()#

Cancel the application when in RUNNING or FINISHED state.

clean_up()#

Do clean up work after the application terminates.

PROTECTED: Optionally override when inheriting.

evaluate()#

Evaluate application results. Called in join().

PROTECTED: Override when inheriting.

get_app_state()#

Get the current app state.

Returns:
app_state : AppState

The current app state.

get_command()#

Get the executed command.

Cannot be called until the application has been started.

Returns:
command : str

The executed command.

Examples

>>> from biotite.application.clustalo import ClustalOmegaApp
>>> from biotite.sequence import ProteinSequence
>>> seq1 = ProteinSequence("BIQTITE")
>>> seq2 = ProteinSequence("TITANITE")
>>> seq3 = ProteinSequence("BISMITE")
>>> seq4 = ProteinSequence("IQLITE")
>>> app = ClustalOmegaApp([seq1, seq2, seq3, seq4])
>>> app.start()
>>> print(app.get_command())
clustalo --in ...fa --out ...fa --force --output-order=tree-order --seqtype Protein --guidetree-out ...tree

get_exit_code()#

Get the exit code of the process.

PROTECTED: Do not call from outside.

Returns:
code : int

The exit code.

get_mask()#

Get a boolean mask covering identified repeat regions of each input sequence.

Returns:
repeat_mask : (list of) ndarray, shape=(n,), dtype=bool

A boolean mask that is true for each sequence position that is identified as a repeat. If a list of sequences was given as input, a list of masks is returned instead.
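A per-position mask is often easier to work with once it is collapsed into intervals. The following is a minimal sketch, not part of Biotite: the helper `mask_to_intervals` is hypothetical, and a plain list stands in for the returned array so the snippet runs without the tantan binary. It accepts any iterable of booleans.

```python
def mask_to_intervals(mask):
    """Return (start, stop) pairs, stop exclusive, for each run of True."""
    intervals = []
    start = None
    for i, is_repeat in enumerate(mask):
        if is_repeat and start is None:
            # A new repeat region begins here
            start = i
        elif not is_repeat and start is not None:
            # The current repeat region ended at the previous position
            intervals.append((start, i))
            start = None
    if start is not None:
        # The mask ends inside a repeat region
        intervals.append((start, len(mask)))
    return intervals


# Same values as the mask printed in the class-level example
repeat_mask = [False] * 9 + [True] * 11 + [False] * 5
intervals = mask_to_intervals(repeat_mask)
print(intervals)
```

Each interval can be used directly as a slice, e.g. `sequence[start:stop]`, to pull out the repeat region.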

get_process()#

Get the Popen instance.

PROTECTED: Do not call from outside.

Returns:
process : Popen

The Popen instance.

get_stderr()#

Get the STDERR pipe content of the process.

PROTECTED: Do not call from outside.

Returns:
stderr : str

The standard error.

get_stdout()#

Get the STDOUT pipe content of the process.

PROTECTED: Do not call from outside.

Returns:
stdout : str

The standard output.

is_finished()#

Check if the application has finished.

PROTECTED: Override when inheriting.

Returns:
finished : bool

True if the application has finished, False otherwise.

join(timeout=None)#

Conclude the application run and set its state to JOINED. This can only be done from the RUNNING or FINISHED state.

If the application is FINISHED, joining happens immediately. If the application is still RUNNING, this method waits until it is FINISHED.

Parameters:
timeout : float, optional

If specified, the application waits at most this many seconds to finish. Once this time is exceeded, a TimeoutError is raised and the application is cancelled.

Raises:
TimeoutError

If the joining process exceeds the timeout value.
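The interplay of join(), is_finished(), wait_interval() and the timeout can be sketched with a toy class. This is an illustration under assumptions, not Biotite's implementation: `ToyApp`, its `duration` parameter and its string states are hypothetical, and the "application" is simulated by a timer instead of a real process.

```python
import time


class ToyApp:
    """Hypothetical stand-in for an application; not Biotite code."""

    def __init__(self, duration):
        self._duration = duration  # simulated run time in seconds
        self._start_time = None
        self.state = "CREATED"

    def start(self):
        if self.state != "CREATED":
            raise RuntimeError("start() requires the CREATED state")
        self._start_time = time.monotonic()
        self.state = "RUNNING"

    def is_finished(self):
        # The simulated work is done once the duration has elapsed
        return time.monotonic() - self._start_time >= self._duration

    def wait_interval(self):
        # Seconds between two is_finished() checks
        return 0.01

    def cancel(self):
        self.state = "CANCELLED"

    def join(self, timeout=None):
        if self.state not in ("RUNNING", "FINISHED"):
            raise RuntimeError("join() requires the RUNNING or FINISHED state")
        waited = 0.0
        while not self.is_finished():
            if timeout is not None and waited >= timeout:
                # Give up: cancel the run and report the timeout
                self.cancel()
                raise TimeoutError("The application exceeded its timeout")
            time.sleep(self.wait_interval())
            waited += self.wait_interval()
        self.state = "JOINED"


fast = ToyApp(duration=0.02)
fast.start()
fast.join(timeout=1.0)
print(fast.state)

slow = ToyApp(duration=5.0)
slow.start()
try:
    slow.join(timeout=0.05)
except TimeoutError:
    print(slow.state)
```

The first run finishes well within its timeout and ends up joined, while the second exceeds it and is cancelled before the exception reaches the caller.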

static mask_repeats(sequence, matrix=None, bin_path='tantan')#

Mask repeat regions of the given input sequence(s).

Parameters:
sequence : (list of) NucleotideSequence or ProteinSequence

The sequence(s) to be masked. Either a single sequence or multiple sequences can be masked. Masking multiple sequences in a single run decreases the run time compared to multiple runs with a single sequence. All sequences must be of the same type.

matrix : SubstitutionMatrix, optional

The substitution matrix to use for repeat identification. A sequence segment is considered a repeat of another segment if the substitution score between these segments is greater than a threshold value.

bin_path : str, optional

Path of the tantan binary.

Returns:
repeat_mask : (list of) ndarray, shape=(n,), dtype=bool

A boolean mask that is true for each sequence position that is identified as a repeat. If a list of sequences was given as input, a list of masks is returned instead.
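One way to apply such a mask is soft-masking, where repeat positions are written in lower case. The sketch below is an assumption-laden illustration rather than Biotite functionality: `soft_mask` is a hypothetical helper, and the sequence is handled as a plain string with a plain list as mask so the snippet runs without the tantan binary.

```python
def soft_mask(sequence, mask):
    """Return the sequence with masked positions in lower case."""
    if len(sequence) != len(mask):
        raise ValueError("Sequence and mask must have the same length")
    return "".join(
        symbol.lower() if is_repeat else symbol
        for symbol, is_repeat in zip(sequence, mask)
    )


sequence = "GGCATCGATATATATATATAGTCAA"
# Same values as the mask printed in the class-level example
repeat_mask = [False] * 9 + [True] * 11 + [False] * 5
masked = soft_mask(sequence, repeat_mask)
print(masked)
```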

run()#

Commence the application run. Called in start().

PROTECTED: Override when inheriting.

set_arguments(arguments)#

Set command line arguments for the application run.

PROTECTED: Do not call from outside.

Parameters:
arguments : list of str

A list of strings representing the command line options.

set_exec_dir(exec_dir)#

Set the directory where the application should be executed. If not set, it will be executed in the working directory at the time the application was created.

PROTECTED: Do not call from outside.

Parameters:
exec_dir : str

The execution directory.

set_stdin(file)#

Set a file as standard input for the application run.

PROTECTED: Do not call from outside.

Parameters:
file : file object

The file for the standard input. It must have a valid file descriptor; file-like objects such as StringIO are therefore invalid.
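The file descriptor requirement can be checked with the standard library alone. In this minimal sketch, `has_file_descriptor` is a hypothetical helper and not part of Biotite; it contrasts an in-memory StringIO with a temporary file on disk.

```python
import io
import tempfile


def has_file_descriptor(file):
    """Return True if the file object exposes a usable file descriptor."""
    try:
        file.fileno()
    except (AttributeError, OSError):
        # io.UnsupportedOperation, raised by StringIO, is an OSError subclass
        return False
    return True


in_memory = io.StringIO("ACGT")
memory_ok = has_file_descriptor(in_memory)

with tempfile.TemporaryFile("w+") as on_disk:
    on_disk.write("ACGT")
    on_disk.seek(0)  # rewind so a reader would start at the beginning
    disk_ok = has_file_descriptor(on_disk)

print(disk_ok, memory_ok)
```

Content that only exists in memory therefore needs to be written to a real file, such as a temporary one, before it can serve as standard input.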

start()#

Start the application run and set its state to RUNNING. This can only be done from the CREATED state.

wait_interval()#

The time interval of is_finished() calls in the joining process.

PROTECTED: Override when inheriting.

Returns:
interval : float

Time (in seconds) between calls of is_finished() in join().