API

API reference manual

MHC bound peptides (T antigen)

initialize

snaf.__init__.initialize(df, db_dir, gtex_mode='count', software_path=None, binding_method=None, t_min=20, n_max=3, normal_cutoff=5, tumor_cutoff=20, normal_prevalance_cutoff=0.01, tumor_prevalance_cutoff=0.1, add_control=None)[source]

Setting up global variable for running the program

Parameters
  • df – dataframe, the currently tested cancer cohort dataframe, this will limit number of junctions being loaded into memory for control database

  • db_dir – string, the path to the database

  • gtex_mode – string, either ‘psi’ or ‘count’

  • software_path – string or None, either the path to the netMHCpan4.1 executable or None (using MHCflurry)

  • binding_method – string or None, either ‘netMHCpan’ or ‘MHCflurry’

  • t_min – int, the minimum number of read count the tumor sample should be larget than average number in normal database

  • n_max – int, the maximum number of average read count normal database count have

  • normal_cutoff – int, below which read count we consider a junction is not expressed in normal tissue

  • tumor_cutoff – int, above which read count we consider a junction is expressed in tumor tissue

  • normal_prevalance_cutoff – float, if below this fraction, we consider a junction is not present in normal tissue at an appreciable amount

  • tumor_prevalance_cutoff – float, if above this fraction, we consider a junction is present in tumor tissue at an appreciable amount

  • add_control – None or a dictionary containing additional controls, additional controls can a dataframe or anndata, for instance, if adding two controls asides from internal GTEx, {‘tcga_matched_control’:df,’gtex_skin’:adata}, when added database contains multiple tissue types, it is recommended to pass as a AnnData with the tissue information stored as adata.var[‘tissue’], otherwise, if passed as a dataframe, or samples will be recognized as a single tissue type, it will affect tumor specifcity score calculation if using MLE or bayesian hierarchical models.

Example:

# database directory (where you extract the reference tarball file)
db_dir = '/user/ligk2e/download'
# instantiate (if using netMHCpan)
netMHCpan_path = '/user/ligk2e/netMHCpan-4.1/netMHCpan'
snaf.initialize(db_dir=db_dir,gtex_mode='count',binding_method='netMHCpan',software_path=netMHCpan_path)
# instantiate (if not using netMHCpan)
snaf.initialize(db_dir=db_dir,gtex_mode='count',binding_method='MHCflurry',software_path=None)

JunctionCountMatrixQuery class

class snaf.snaf.JunctionCountMatrixQuery(junction_count_matrix, cores=None, add_control=None, not_in_db=False, outdir='.', filter_mode='maxmin')[source]

Instantiate the JunctionCountMatrixQuery class

Parameters
  • junction_count_matrix – The pandas dataframe for the junction_count_matrix, outputted by AltAnalyze

  • cores – int, how many cores you’d like to use, if None, then will automatically detect the maximum amount of cores in the OS

  • add_control – None or a dictionary containing additional controls, additional controls can a dataframe or anndata, for instance, if adding two controls asides from internal GTEx, {‘tcga_matched_control’:df,’gtex_skin’:adata}, when added database contains multiple tissue types, it is recommended to pass as a AnnData with the tissue information stored as adata.var[‘tissue’], otherwise, if passed as a dataframe, or samples will be recognized as a single tissue type, it will affect tumor specifcity score calculation if using MLE or bayesian hierarchical models.

  • not_in_db – boolean, whether to remove junctions that are present in any Ensembl documented transcript, remember some of the documented transcript in Ensembl are tumor-specific as well, doing so may remove some bona fide hits. But good for reducing number of neoantigens for validation.

  • outdir – string, the output folder for storing results

  • filter_mode – string, either ‘maxmin’ or ‘prevalance’, this correspond to the parameters that will be used in snaf.initialize, if maxmin, use t_min and n_max, if prevalance, use normal and tumor cutoff

Example:

jcmq = JunctionCountMatrixQuery(junction_count_matrix=df,cores=50,add_control=control_df,not_in_db=True)
static generate_results(path, outdir, criterion=None)[source]

wrapper function to automatically generate all necessary output

Parameters
  • path – string, where the pickle result file lie on the disk

  • outdir – string, where to generate all the necessary result

Example:

JunctionCountMatrixQuery.generate_results(path='result/after_prediction.p',outdir='result')
static get_membrane_tuples(df, **kwargs)[source]

this function is used by SurfaceAntigen pipeline to filter out splicing evnets that are not tumor-specific and also compute useful informations for each membrane protein.

Parameters
  • df – pandas dataframe, the junction count matrix

  • **kwargs

    will be passed to __init__ of JunctionCountMatrixQuery class function

Return membrane_tuples

a list, in which each item is a tuple (uid,mean_gtex,df_gtex,ed,freq).

  • uid: the uid of the splicing evnet

  • mean_gtex: the mean raw read count across GTEx

  • df_gtex: a dataframe of gtex sample name, tissue types and read count values

  • ed: expression dictionary for tumor samples

  • freq: expression frequency for tumor samples

Example:

membrane_tuples = snaf.JunctionCountMatrixQuery.get_membrane_tuples(df)
run(hlas, strict=False, outdir='.', name='after_prediction.p')[source]

main function to run mhc bound peptide T antigen pipeline

Parameters
  • hlas – list, each item is a list of 6 HLA types associated with each sample, the order of this list must be consistent with the sample name in the df column

  • strict – boolean, default is False, determine whether to filter out peptide which doesn’t have canonical start codon support

  • outdir – string, the path where all the results will go into

  • name – string, the name of the generated pickle result object

Example:

jcmq.run(hlas=hlas,outdir='result',name='after_prediction.p')
show_neoantigen_as_fasta(outdir, name, stage, verbosity, contain_uid, sample=None, criterion=None)[source]

write the neoantigen as a fasta file for MS validation or other purpose

Parameters
  • outdir – string ,the output directory for the generated fasta file

  • name – string, the name of the output fasta file

  • stage – int, either 0,1,2,3, the stage of neoantigen you want to report

  • verbosity – int, either 1,2,3, the verbosity of candidates

  • contain_uid – boolean, whether you want to contain the junction UID along with the neoantigen

  • sample – string, the name of the sample in which you want to extract the neoantigen

  • criterion – nested list, the criterion for filtering out the candidate

Examples:

jcmq = snaf.JunctionCountMatrixQuery.deserialize('result/after_prediction.p')
sample = 'SRR5933735.Aligned.sortedByCoord.out'
jcmq.show_neoantigen_as_fasta(outdir='./fasta',name='neoantigen_{}.fasta'.format(sample),stage=2,verbosity=1,contain_uid=True,sample=sample)
visualize(uid, sample, outdir, tumor=False, criterion=[('netMHCpan_el', 0, '<=', 2), ('deepimmuno_immunogenicity', 1, '==', 'True')])[source]

Visualize certain Neojunction in certain sample

Parameters
  • uid – string, the uid for the event you want to check

  • sample – string, the sample name

  • outdir – string, where to deposite the figure

  • tumor – bool, whether to show the expression level in tumor sample as well, default is not

  • criterion – a nested tuple, if none of the neoantigens are bound and immunogenic, no candidates will be plotted, so you can relax the criterion a bit based on the format specified.

Example:

jcmq.visualize(uid='ENSG00000167291:E38.6-E39.1',sample='TCGA-DA-A1I1-06A-12R-A18U-07.bed',outdir='./result')

mutation_analysis

snaf.downstream.mutation_analysis(mode, burden, mutation, output, n_sample_cutoff=10, gene_column='gene', genes_to_plot=None)[source]

Run mutation association analysis with neoantigen burden

Parameters
  • mode – string, either ‘compute’ or ‘plot’, compute will compute all the association, plot will visualize certain assciation in box plot

  • burden – pandas dataframe, the burden result file generated by T pipeline, and sample ID should be consistent with the sample ID in mutation dataframe index

  • mutation – pandas dataframe, the mutation data, make sure sample ID column is the index

  • output – string, the path in which the result will be generated, if compute mode, a txt file will be generated, if plot mode, a plot will be generated

  • n_sample_cutoff – int, default is 10, if a mutation occur in < n_sample_cutoff samples, it won’t be considered

  • gene_column – string, default is gene, in the mutation df, which column represents the gene

  • gene_to_plot – list, each item is a gene name, for the plot mode

Example:

# compute mode
snaf.mutation_analysis(mode='compute',burden=burden,mutation=mutation,output='result/stage3_mutation.txt')
# plot mode
snaf.mutation_analysis(mode='plot',burden=burden,mutation=mutation,output='result/stage3_mutation_CAMKK2.pdf',genes_to_plot=['CAMKK2'])

survival_analysis

snaf.downstream.survival_analysis(burden, survival, n, stratification_plot, survival_plot, survival_duration='OS.time', survival_event='OS')[source]

Run survival analysis based on neoantigen burden, the burden`and `survival`must be pandas dataframe, and make sure the sample ID are consistent with each other, we only consider samples in `burden that have clinical survival data.

Parameters
  • burden – pandas dataframe, the burden result file generated by T pipeline

  • survival – pandas dataframe, the survival data

  • n – int, how many groups to stratigy, support 2 (median), 3 (33%,66%), 4 (quantile)

  • stratification_plot – string, the path and name for the stratification plot

  • survival_plot – string, the path plus name for the survival plot

  • survival_duration – string, which column in survival data represent the duration

  • survival_event – string, which column in survival data represent the event

Example:

snaf.survival_analysis(burden,survival,n=2,stratification_plot='result/stage2_stratify.pdf',survival_plot='result/stage2_survival.pdf')

survival_regression_cox_model

snaf.downstream.survival_regression(freq, remove_quote, rename_func, survival, pea, outdir='.', cores=None, mode='binary', survival_duration='OS.time', survival_event='OS', n_effective_obs=3)[source]

conduct cox regression to identify neoantigens whose parental junctions expression is significantly associated with survival

Parameters
  • freq – string, path to the T antigen generated frequency dataframe

  • remove_quote – bool, whether to remove the quote of samples column for freq df or not

  • rename_func – function, a function to rename columns in freq and event annotation file to be consistent with survival file

  • survival – string, the path to the survival dataframe

  • pea – string, the path to the altanalyze generated event annotation file

  • outdir – string, the output directory

  • cores – None or int, how many cores to use for computation

  • mode – binary or psi, if binary, we test whether having a neoantigen can predict survival, psi means whether its parental junction’s psi value is associated with survival

  • survival_duration – string, the column name for duration in survival dataframe

  • survival_event – string, the column name for event in survival dataframe

  • n_effective_obs – int, default is 3, at least three obs whose event=1, otherwise all obs are right-sensored

Examples:

snaf.downstream.survival_regression(freq='result/frequency_stage3_verbosity1_uid_gene_symbol_coord_mean_mle.txt',remove_quote=True,
                                    rename_func=lambda x:'-'.join(x.split('-')[:4]),survival='TCGA-SKCM.survival.tsv',
                                    pea='Hs_RNASeq_top_alt_junctions-PSI_EventAnnotation.txt',outdir='result/survival')

analyze_neoantigens

snaf.downstream.analyze_neoantigens(freq_path, junction_path, total_samples, outdir, fasta=False, mers=None, columns=None, cutoffs=(0.1, 0.9), junction_bl=0.1)[source]

Users need to execute this function to prepare input for the t antigen dash viewer

Parameters
  • freq_path – string, the path to the frequency file in the result folder, with uid appended

  • junction_path – string, the path to the burden0 file in the result folder

  • total_samples – int, the total number of samples, will be used for determine shared versus unique neoantigen

  • outdir – string, where the output will go into

  • fasta – bool, whether to output the fasta file as well to check the motif

  • mers – None or list, if [9,10], it will generate mer9 and mer10 for fasta and separate dataframe.

  • columns – None or string, which column to include in the output df

  • cutoffs – tuple, the cutoffs to determine shared versus unique neoantigen

  • junction_bl – float, the cutff for parent junction, if junction occur < junction_bl sample, we don’t consider neoantigens from those junction

Example:

# just output for viewer
snaf.downstream.analyze_neoantigens(freq_path='result/frequency_stage2_verbosity1_uid.txt',junction_path='result/burden_stage0.txt',total_samples=472,outdir='result',mers=None,fasta=False)
# to run MEME for motif after that
snaf.downstream.analyze_neoantigens(freq_path='result/frequency_stage2_verbosity1_uid.txt',junction_path='result/burden_stage0.txt',total_samples=472,outdir='result',mers=[9,10],fasta=True)

run_dash_T_antigen

snaf.dash_app.app.run_dash_T_antigen(input_abs_path, remove_cols=['uid'], host=None, port='8050', output_abs_path=None)[source]

run the dash T antigen viewer

Parameters
  • input_abs_path – string, the absolute path to the input df

  • remove_cols – list, the column name to remove from the input df

  • host – string or None, if None, program will run hostname to automatically detect

  • port – string, default is 8050

  • output_abs_path – string or None, if you want to have the umap embedding, specify the output path and name

Example:

snaf.run_dash_T_antigen(input_abs_path='/data/salomonis2/LabFiles/Frank-Li/neoantigen/TCGA/SKCM/snaf_analysis/result/shared_vs_unique_neoantigen_all.txt')

report_candidates

snaf.downstream.report_candidates(jcmq, df, sample, outdir, remove_quote=True, metrics={'deepimmuno_immunogenicity': 'immunogenicity', 'netMHCpan_el': 'binding_affinity'}, criterion=[('netMHCpan_el', 0, '<=', 2), ('deepimmuno_immunogenicity', 1, '==', 'True')])[source]

this function will report a txt file for all the T antigen candidate for a specific tumor sample.

Parameters
  • jcmq – the JunctionCountMatrixQuery object, usually you need to deserialize after_prediction.p file produced by T antigen pipeline

  • df – DataFrame,the frequency table generated by T antigen pipeline

  • sample – String, the sample name to report T antigen candiate

  • outdir – string, the path to the output directory

  • remove_quote – boolean, whether to remove the quotation or not, as one column in frequency table df is list, when loaded in memory using pandas, it will be added a quote, we can remove it

  • metrics – dict, the key is the metric we registered to each peptide-hla, values is the corresponding name we want to output in the report for each metric

  • criterion – list with each element as a tuple, this is a very specialized expression to specificy the criterion we are going to use to only report peptide-hla that meet the criterion. each tuple has four element, the first element is the metric name we registered, the second is either 0 or 1, 0 means looking for the continuous value score, 1 means looking for the string identity value. See below note for further explanation. the third is an operator, and the fourth is the cutoff.

Note

By default, we run netMHCpan or MHCflurry for MHC binding prediction, and DeepImmuno for immunogenicity prediction. When we add the predicted value for each peptide-hla pair, we format the prediction result into a dataframe compliant with something like below:

peptide mer hla score identity

AAAAAAAAA 9 HLA-A*01:01 0.3 SB

here is the dataframe for MHC binding prediction, we have a column recording continous value (score), and another column (identity) recording whether it is a SB (Strong Binder) or WB (Weaker Binder). Those are all useful information coming out of the predictor. Now we just need to register this df to the NeoJunction class by running nj.enhanced_peptide.register_attr(df,'netMHCpan_el). We can keep adding scores to the NeoJunction class. This explain what the 0/1 in criterion mean.

Example:

snaf.report_candidates(jcmq,df,'SRR067783.bed','result',True)
# a txt file will be written to the outdir you specified

get_coverage

snaf.downstream.get_coverage(t_result, allele, outdir='.')[source]

Get the population coverage of certain neoantigen, based on the different HLA allele frequency from US bone marrow registry. The 21 race code can be found here (https://www.sciencedirect.com/science/article/pii/S0198885913001821?via%3Dihub#t0005)

Parameters
  • t_result – string, the path to the T_antigen_candidates_all.txt file

  • allele – string, either A, B or C

  • outdir – string, default is current directory

Examples:

snaf.downstream.get_coverage(t_result='result_new/T_candidates/T_antigen_candidates_all.txt',allele='A')
annotation txt file

peptide

hlas

uid

symbol

coord

tumor_specificity_mean

tumor_specificity_mle

AAFA

AFB

AINDI

AMIND

CARB

CARHIS

EURCAU

FILII

JAPI

KORI

MENAFC

MSWHIS

NCHI

SCAHIS

SCSEAI

VIET

AAAAAGHFW

B*57:01,B*46:01,B*35:01

ENSG00000106571:E2.2-E4.1

GLI3

chr7:42223295-42236971(-)

1.3304086

0.423474388

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

AAAAAGSSSW

B*57:01,A*25:01,A*32:01

ENSG00000099960:E2.2-E3.2

SLC7A4

chr22:21031852-21032531(-)

0.9078695

0.37040754

0.01847

0.01771

0.03875

0.04018

0.01601

0.04102

0.05655

0.00405

0.0024

0.00622

0.05907

0.03597

0.00636

0.03612

0.03019

0.00191

AAAAAPPLQR

A*03:01

ENSG00000066697:E20.3-E27.1_100442077

MSANTD3

chr9:100427393-100442077(+)

0.06224294

0.089645936

0.08395

0.07834

0.06358

0.10437

0.07825

0.0863

0.13987

0.01309

0.00897

0.01905

0.10098

0.08147

0.01401

0.07384

0.04987

0.00973

AAAAAPVIGK

A*03:01

ENSG00000105613:E2.1-E4.1

MAST1

chr19:12838300-12840446(+)

0.5440088

0.625632982

0.08395

0.07834

0.06358

0.10437

0.07825

0.0863

0.13987

0.01309

0.00897

0.01905

0.10098

0.08147

0.01401

0.07384

0.04987

0.00973

AAAAFRFPR

A*30:01,A*31:01,A*11:02,A*66:01,A*11:01,A*33:01,A*68:01

ENSG00000270816:E1.3-E2.1

LINC00221

chr14:106482637-106488768(+)

0.45516863

0.999993827

0.16699

0.15773

0.25935

0.18128

0.16724

0.17499

0.14498

0.2024

0.17736

0.19134

0.166

0.19398

0.3579

0.18561

0.2764

0.30457

AAAAGHFWSK

A*03:01

ENSG00000106571:E2.2-E4.1

GLI3

chr7:42223295-42236971(-)

1.3304086

0.423474388

0.08395

0.07834

0.06358

0.10437

0.07825

0.0863

0.13987

0.01309

0.00897

0.01905

0.10098

0.08147

0.01401

0.07384

0.04987

0.00973

AAAAGSSSW

B*35:01,C*03:03,B*13:01,B*44:03,C*02:02,A*26:01,C*16:01,C*03:04,A*25:01,B*57:01,A*32:01,B*15:01

ENSG00000099960:E2.2-E3.2

SLC7A4

chr22:21031852-21032531(-)

0.9078695

0.37040754

0.03303

0.03014

0.08059

0.06006

0.02826

0.07102

0.08743

0.01928

0.0822

0.04373

0.10325

0.06192

0.03005

0.06562

0.0689

0.02475

AAAAGSSSWL

C*03:04,C*03:03,B*07:02,B*07:05

ENSG00000099960:E2.2-E3.2

SLC7A4

chr22:21031852-21032531(-)

0.9078695

0.37040754

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

AAAALAGIL

C*03:03,C*12:02

ENSG00000065308:E9.1-E10.1_52504687

TRAM2

chr6:52504687-52505599(-)

0.25774613

0.155925177

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

Interface to proteomics

chop_normal_pep_db

snaf.proteomics.chop_normal_pep_db(fasta_path, output_path, mers, allow_duplicates)[source]

chop any normal human proteome to certain mers

Parameters
  • fasta_path – string, the path to the human protein fasta file

  • output_path – string, the path to the output fasta file

  • mers – list, like [9,10] will generate 9mer and 10mer

  • allow_duplicates – boolean. whether allow duplicate or not

Example:

chop_normal_pep_db(fasta_path='human_uniprot_proteome.fasta',output_path='./human_uniprot_proteome_mer9_10.fasta',mers=[9,10],allow_duplicates=False)

compare_two_fasta

snaf.proteomics.compare_two_fasta(fa1_path, fa2_path, outdir='.', write_comm=False, write_unique1=False, write_unique2=False, prefix='')[source]

Compare arbitracy two fasta files, report unique and common ones

Parameters
  • fa1_path – the first fasta file path that you want to compare with

  • fa2_path – the second fasta file path that you want to compare with

  • write_comm – boolean, whether write the common one between fa1 and fa2

  • write_unique1 – boolean, whether write the one unique to fa1

  • write_unique2 – boolean, whether write the one unique to fa2

  • prefix – string, whether to add a prefix to the output fasta file name

Example:

snaf.proteomics.compare_two_fasta(fa1_path='./fasta/human_proteome_uniprot_9_10_mers_unique.fasta', 
                              fa2_path='./fasta/neoantigen_{}_unique.fasta'.format(sample),outdir='./fasta',
                              write_unique2=True,prefix='{}_'.format(sample))

remove_redundant

snaf.proteomics.remove_redundant(fasta_path, out_path)[source]

remove redundant entries in any fasta files

Parameters
  • fasta_path – string, the path to the input fasta file

  • out_path – string, the path to the output folder

Example:

snaf.proteomics.remove_redundant('./fasta/neoantigen_{}.fasta'.format(sample),'./fasta/neoantigen_{}_unique.fasta'.format(sample))

set_maxquant_configuration

snaf.proteomics.set_maxquant_configuration(dbs, n_threads, inputs, enzymes, enzyme_mode, outdir, outname='mqpar.xml', protein_fdr=0.01, peptide_fdr=0.01, site_fdr=0.01, include_contaminants=True, var_mods=['Oxidation (M)', 'Acetyl (Protein N-term)', 'Carbamidomethyl (C)'], fix_mods=None, minPepLen=8, maxPeptideMass=4600, minPeptideLengthForUnspecificSearch=8, maxPeptideLengthForUnspecificSearch=25)[source]

automatically build MaxQuant configuration file mqpar.xml, the template mqpar.xml is generated using maxquant 2.0.3.1 on a windows PC GUI, we load a single raw file, a single fasta file, reduce carbamidomethyl from fixed modification to variable modification and allow match run, otherwise, we retain all default parameters.

Parameters
  • dbs – list, each element is the path to the database fasta file

  • n_threads – int, how many threads will be used for MaxQuant search

  • inputs – list, each element is the path to the raw file

  • enzymes – list, what enzymes will be used for search

  • enzyme_mode – int, 0 -> specific, 3 -> semi-specific, 4 -> unspecific, 5 -> no digestion

  • outdir – string, the output directory for the conf file

  • outname – string, default is mqpar.xml

  • protein_fdr – float, default is 0.01

  • peptide_fdr – float, default is 0.01

  • site_fdr – float, default is 0.01

  • include_contaminants – boolean, default is True

  • var_mods – list or None, default is [‘Oxidation (M)’, ‘Acetyl (Protein N-term)’, ‘Carbamidomethyl (C)’]

  • fix_mods – list or None, default is []

  • minPepLen – int, default is 8

  • maxPeptideMass – int, default is 4600

  • minPeptideLengthForUnspecificSearch – int, default is 8

  • maxPeptideLengthForUnspecificSearch – int, default is 25

Example:

dbs = ['/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/RNA/snaf_analysis/fasta/SRR5933726.Aligned.sortedByCoord.out.bed_unique2.fasta']
inputs = ['/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48/OvCa48_classI_Rep#1.raw',
        '/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48/OvCa48_classI_Rep#2.raw',
        '/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48/OvCa48_classI_Rep#3.raw',
        '/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48/OvCa48_classI_Rep#4.raw',
        '/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48/OvCa48_classI_Rep#5.raw',
        '/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48/OvCa48_classI_Rep#6.raw']
outdir = '/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48'
snaf.proteomics.set_maxquant_configuration(dbs=dbs,n_threads=20,inputs=inputs,enzymes=None,enzyme_mode=5,protein_fdr=1,peptide_fdr=0.05,site_fdr=1,outdir=outdir)

summarize_ms_result

snaf.proteomics.summarize_ms_result(peptide, msms, freq, outdir)[source]

After finishing running Maxquant, this function can help figuring out what peptide for further validation

Parameters
  • peptide – string, the path to MaxQuant peptide.txt file

  • msms – string, the path to the MaxQuant msms.txt file

  • freq – string, the path to the SNAF frequency_stage2_verbosity1_uid_gene_symbol_coord_mean_mle.txt file

  • outdir – string, the path to the output folder

Return df_peptide

dataframe, the valid MS-evidenced neoantigens with associated properties

Examples:

df = snaf.proteomics.summarize_ms_result(peptide='MS/OvCa114/combined/txt/peptides.txt',msms='MS/OvCa114/combined/txt/msms.txt',freq='result/frequency_stage2_verbosity1_uid_gene_symbol_coord_mean_mle.txt')

Surface Antigen (B antigen)

initialize

snaf.surface.main.initialize(db_dir)[source]

setting up additional global variables for running B antigen program

Parameters

db_dir – string, the path to the database folder

Examples:

from snaf import surface
surface.initialize(db_dir=db_dir)

run

snaf.surface.main.run(uids, outdir, prediction_mode='short_read', n_stride=2, gtf=None, tmhmm=False, software_path=None, serialize=True)[source]

main function for run B antigen pipeline

Parameters
  • uids – list, the membrane tuples

  • outdir – string, the path where all the output will go into

  • prediction_mode – string, either ‘short_read’ or ‘long_read’

  • n_stride – int, how many exons define a early stop codon, for NMD check

  • gtf – None or string, if prediction mode is long_read, supply the path to the gtf file

  • tmhmm – bool, use tmhmm or not

  • software_path – None or string, if tmhmm=True, specify the tmhmm software path

  • serialize – bool, serialize to the pickle file for the result

Examples:

# if using TMHMM
surface.run(membrane_tuples,outdir='result',tmhmm=True,software_path='/data/salomonis2/LabFiles/Frank-Li/python3/TMHMM/tmhmm-2.0c/bin/tmhmm')
# if not using TMHMM
surface.run(membrane_tuples,outdir='result',tmhmm=False,software_path=None)

generate_full_results

snaf.surface.main.generate_full_results(outdir, freq_path, mode, validation_gtf)[source]

preferred api for wraping both generate_results and report_candiates, good for general usage

Parameters
  • outdir – string, the output folder

  • freq_path – string, the path to the frequency_stage0_verbosity1_uid_gene_symbol_coord_mean_mle.txt file from T antigen pipeline

  • mode – string, either long_read or short_read

  • validation_gtf – string, the path to the validation long_read gtf when mode=short_read

Examples:

# long read mode
surface.generate_full_results(outdir='result/surface',freq_path='result/frequency_stage0_verbosity1_uid_gene_symbol_coord_mean_mle.txt',mode='long_read')
# short read mode
surface.generate_full_results(outdir='result/surface',freq_path='result/frequency_stage0_verbosity1_uid_gene_symbol_coord_mean_mle.txt',mode='short_read',validation_gtf='/data/salomonis2/LabFiles/Frank-Li/neoantigen/TCGA/SKCM/snaf_analysis/SQANTI-all/collapse_isoforms_classification.filtered_lite.gtf')

generate_results

snaf.surface.main.generate_results(pickle_path, strigency=3, outdir='.', gtf=None, long_read=False, style=None, overlap_extracellular=False)[source]

Generate candidates for B antigen

Parameters
  • pickle_path – string, the path to the pickle result file

  • strigency

    int, from 1-5, how strigent you want to program to be for calling B antigen

    • strigency1: novel isoform needs to be absent in uniprot database

    • strigency2: novel isoform also needs to be a documented protein-coding gene

    • strigency3: novel isoform also needs to not be subjected to Nonsense Mediated Decay (NMD)

    • strigency4: novel isoform also needs to have long-read or EST support (as long as the novel junction present in full-length)

    • strigency5: novel isoform also needs to have long-read or EST support (whole ORF needs to be the same as full-length)

  • outdir – string, path to the output folder

  • gtf – string, if strigency>3, you need to specify the path to the long-read or EST gtf file

  • long_read – boolean, whether the last run was using prediction_mode as long_read or not, default to False

  • style – None or string, can either be ‘deletion’ or ‘insertion’ meaning only generate result with deleted or inserted region

  • overlap_extracellular – boolean, default is True, only generate result whose junction overlap with extracellular domain

Example:

# if having gtf file for long-read data
surface.generate_results(pickle_path='./result/surface_antigen.p',outdir='result',strigency=5,gtf='./SQANTI-all/collapse_isoforms_classification.filtered_lite.gtf')
# if not having
surface.generate_results(pickle_path='./result/surface_antigen.p',outdir='result',strigency=3,gtf=None)

run_dash_B_antigen

snaf.surface.main.run_dash_B_antigen(pkl, prediction_mode, candidates, python_executable, host=None, port='8050')[source]

run the dash B antigen viewer

Parameters
  • pkl – string, the path to the surface B pipeline pickle file

  • prediction_mode – string, either short_read or long_read

  • candidates – string ,the path to the candidates.txt file

  • python_executable – string ,the path to the python executable you are using

  • host – None or string, string or None, if None, program will run hostname to automatically detect

  • port – string, default is 8050

Example:

surface.run_dash_B_antigen(pkl='result/surface_antigen.p',candidates='result/candidates_5.txt',prediction_mode='long_read',
                   python_executable='/data/salomonis2/LabFiles/Frank-Li/refactor/neo_env/bin/python3.7')

report_candidates

snaf.surface.main.report_candidates(pickle_path, candidates_path, validation_path, freq_df_path, mode, outdir='.', name=None)[source]

report SNAF-B antigen candidates, a more organized and tidy file format for all the candidate.

Parameters
  • pickle_path – string ,the path to the saved pickle object from run command.

  • candidates_path – string, the path to the saved candidates.txt file which we’d like to reformat

  • validation_path – string, the path to the saved validation.txt file which accoompanying the candidates.txt file

  • freq_df_path – string, the path to the frequency table from SNAF-T pipeline, this will provide the frequency of each neojunction in whole cohort.

  • mode – string, either ‘long_read’ or ‘short_read’,default is short_read

  • outdir – string, the folder for the output directory, will create one if not exist

  • name – string, the name of the saved output file.

Example:

surface.report_candidates('result/surface_antigen.p','result/candidates_3.txt','result/validation_3.txt',
                          'result/frequency_stage0_verbosity1_uid_gene_symbol_coord_mean_mle.txt','short_read',
                          'result','sr_str3_report.txt')

GTEx Viewer (tumor specificity)

gtex_visual_combine

snaf.gtex_viewer.gtex_visual_combine(uid, norm=False, outdir='.', figsize=(6.4, 4.8), tumor=None, ylim=None, group_by_tissue=True)[source]

Visualize the gtex expression and tumor specificity for splicing event (combine into one plot)

Parameters
  • uid – string, the uid for the splicing event

  • norm – bool. whether normalize for the sequencing depth or not

  • outdir – string, where the figure go into

  • figsize – tuple, the (width,height) of the figure

  • tumor – pandas dataframe, the tumor df to compare with

  • ylim – tuple, modify the ylim of the (bottom, top) of the figure

  • group_by_tissue – bool, whether to group the normal smaples by tissue or not, default is True

Example:

snaf.gtex_visual_combine(uid='ENSG00000090339:E4.3-E4.5',norm=True,tumor=df)  
snaf.gtex_visual_combine(uid='ENSG00000112149:E7.1-E9.1',norm=True,tumor=df)

gtex_visual_combine_plotly

snaf.gtex_viewer.gtex_visual_combine_plotly(uid, outdir='', norm=False, tumor=None)[source]

Generate combined plot but interactively

Parameters
  • uid – string, the uid for the splicing event

  • norm – bool. whether normalize for the sequencing depth or not

  • outdir – string, where the figure go into

  • tumor – pandas dataframe, the tumor df to compare with

Example:

snaf.gtex_visual_combine(uid='ENSG00000090339:E4.3-E4.5',norm=True,tumor=df)  
snaf.gtex_visual_combine(uid='ENSG00000112149:E7.1-E9.1',norm=True,tumor=df)

gtex_visual_subplots

snaf.gtex_viewer.gtex_visual_subplots(uid, norm=True, top=100, outdir='.')[source]

Visualize the gtex expression and tumor specificity for splicing event (subplots)

Parameters
  • uid – string, the uid for the splicing event

  • norm – bool. whether normalize for the sequencing depth or not

  • outdir – string, where the figure go into

Example:

snaf.gtex_visual_subplots(query='ENSG00000090339:E4.3-E4.5',norm=True)
snaf.gtex_visual_subplots(query='ENSG00000112149:E7.1-E9.1',norm=True)

gtex_visual_norm_count_combined

snaf.gtex_viewer.gtex_visual_norm_count_combined(uid, xlim=None, ylim=None, save_df=False, outdir='.')[source]

This function is to visualize the normalized count (CPM) distribution, it should follow a half-norm distribution, this plot will help determine the choice in bayesian modeling for tumor specificity score

Parameters
  • uid – string, the uid for the splicing event

  • xlim – None or tuple, whether to constrain the xlim of the histplot

  • ylim – None or tuple, whether to constrain the ylim of the histplot

  • save_df – boolean, whether to save a txt file reporting all the cpm values and the tissue information for debug purpose

  • outdir – string, where the figure go into

Example:

snaf.gtex_visual_norm_count_combined(uid='ENSG00000090339:E4.3-E4.5')  

gtex_visual_per_tissue_count

snaf.gtex_viewer.gtex_visual_per_tissue_count(uid, total_count=10, count_cutoff=1, total=25, outdir='.')[source]

This function is to visualize the random variable X, which repesents the number of samples in each tissue type that express this junction. It should follow a zeroinflated poisson distribution. We do not consider tissues with less than total_count samples, and we use count_cutoff to represent how many count can be defined as “expressed” versus “non-expressed”. We scaled all the number of samples to the total of “total” to make sure they are comparable. Because if a tissue type has 1000 samples, another has 25 samples, this random variable is not directly comparable. this plot will help determine the choice in bayesian modeling for tumor specificity score

Parameters
  • uid – string, the uid for the splicing event

  • outdir – string, where the figure go into

Example:

snaf.gtex_visual_per_tissue_count(uid='ENSG00000090339:E4.3-E4.5')  

Miscellaneous

Add gene symbol

snaf.downstream.add_gene_symbol_frequency_table(df, remove_quote=True)[source]

This function will convert the ENSG id to gene sysmbol and add a column in place for your dataframe

Parameters
  • df – the input df, make sure the index is the uid

  • remove_quote – bool, depending on how your df is loaded, if directly read from the disk, certain column will contain quotation, whether to remove it or not

Return df

the output df, with added gene symbol column

Example:

add_gene_symbol_frequency_table(df=df,remove_quote=False)

Add chromsome coordinate

snaf.snaf.add_coord_frequency_table(df, remove_quote=True)[source]

Convert the uid to chromsome coordinates

Parameters
  • df – the input df, index must be uid

  • remove_quote – bool, depending on how your df is loaded, if directly read from the disk, certain column will contain quotation, whether to remove it or not

Return df

the output df, with added coordinate column

Exmaple:

add_coord_frequency_table(df=df,remove_quote=True)

Add tumor specificity scores

snaf.gtex.add_tumor_specificity_frequency_table(df, method='mean', remove_quote=True, cores=None)[source]

add tumor specificty to each neoantigen-uid in the frequency table produced by SNAF T pipeline

Parameters
  • df – DataFrame, the frequency table produced by SNAF T pipeline

  • method – string, either ‘mean’, or ‘mle’, or ‘bayesian’

  • quote (remove) – boolean, whether to remove the quotation or not, as one column in frequency table df is list, when loaded in memory using pandas, it will be added a quote, we can remove it

  • cores – int, how many cpu cores to use for this computation, default None and use all the cpu the program detected

Return new_df

a dataframe with one added column containing tumor specificity score

Example:

snaf.add_tumor_specificity_frequency_table(df,'mle',remove_quote=True)

Add All three info at once

snaf.snaf.enhance_frequency_table(df, remove_quote=True, save=True, outdir='', name=None)[source]

This is a wrapper function to add (1) gene symbol, (2) chromosome coordinate, (3) tumor specificity to the frequency table at once

Parameters
  • df – the dataframe for the frequency table (freq_stage{n}_verbosity1_uid.txt)

  • remove_quote – boolean, depending on how your df is loaded, if directly read from the disk, certain column will contain quotation, whether to remove it or not

  • save – boolean, whether to save the result or not, default is True

  • outdir – string, the output folder.

  • name – string, the output file name

Example:

enhance_frequency_table(df=df,True,True,'result','freq_stage3_verbosity1_uid_gene_symbol_coord_mean_mle.txt')

Prepare DEG analysis

snaf.downstream.prepare_DEG_analysis(burden_path, patient_strat_path, rename_func, outdir='.', encoding={'high': '2', 'low': '1'})[source]

This function generate groups and comps file for performing AltAnalyze Differential Gene Analysis using empirical Bayes

Parameters
  • burden_path – string, the path to the burden file generated by T pipeline

  • patient_strat_path – string, the path to the patient stratification file returned by survival_analysis function

  • rename_func – function, it will be used to transform sample_id in burden_path (same as exp file) to sample_id in stat file (as it is adjusted for survival file)

  • outdir – string, the folder for generating comps and groups files

  • encoding – dict, the encoding to specificy how to transform identity string and what identities will be used for the DEG analysis

Examples:

snaf.downstream.prepare_DEG_analysis('result/burden_stage3.txt','result/survival/burden3_patient_high_low_group.txt',
                                     '/data/salomonis2/NCI-R01/TCGA-SKCM/TCGA_SKCM_hg38/ExpressionInput/exp.TCGA-SKCM.txt',
                                     outdir='result/survival',encoding={'low':'1','high':'2'})

Visuliaze DEG result (Volcano)

snaf.downstream.visualize_DEG_result(result_path, mode, outdir='.', genes_to_highlight=[], up_cutoff=0.8, down_cutoff=- 0.8, xlims=(- 2, 2), ylims=None)[source]

Generate volcano plot for the AltAnalyze DEG results

Parameters
  • result_path – string, the path to the AltAnalyze DEG results

  • mode – string, either interactive or static

  • outdir – string, the folder where the plots will go into

  • genes_to_highlight – list, the gene names to highlight in the plot, only for static mode

  • up_cutoff – float, the up-regulated cutoff for logfoldchange, only for static mode

  • down_cutoff – float, the down-regulated cutoff for logfoldchagne, only for static mode

  • xlims – None or tuple, force the xlims of the volcano plot, only for static mode

  • ylims – None or tuple, force the ylims of the volcano plot, only for static mode

Examples:

snaf.downstream.visualize_DEG_result('/data/salomonis2/LabFiles/Frank-Li/neoantigen/revision/TCGA_melanoma/result/survival/DEGs-LogFold_0.58_adjp/GE.low_vs_high.txt',
                                     mode='static',outdir='.',genes_to_highlight=['LST1','HCST','IL32','CD3D','S100A8','MZB1','IGLC4','ADAM10','ARFGEF2','MIB1','KIF3B','TNPO1','PTPN11','ANKRD52','TGFBR1'])

Prepare GO analysis

snaf.downstream.prepare_GO_analysis(result_path, lc_cutoff=0.5, direction='>', adjp_cutoff=0.05, n=None, sortby='adjp', ascending=True, outdir='.')[source]

Generate gene list for GO analysis

Parameters
  • result_path – string, the path to the AltAnalyze generated DEG result

  • lc_cutoff – float, the cutoff above which will be considered desirable genes

  • direction – string, either “>” or “<”, indicating whether to use greater than or less than of lc_cutoff

  • adjp_cutoff – float, the cutoff below which will be considered desirable genes

  • n – None or int, the number of genes to contrain for GO analysis

  • sortby – string, ‘adjp’ or ‘LogFold’

  • ascending – bool, sort in ascending order or not

  • outdir – string, the directory to write outputs

Examples:

snaf.downstream.prepare_GO_analysis('result/survival/DEGs-LogFold_0.58_adjp/GE.low_vs_high.txt',outdir='result/survival')

Visualize GO result

snaf.downstream.visualize_GO_result(path_list, skiprows_list, category_list, mode='interactive', outdir='', ontology_to_highlight={}, xlims=None, ylims=None)[source]

Visualize the GO analysis rssults

Parameters
  • path_list – list, each element points to the GO-Elite result txt file under ORA folder

  • skiprows_list – list, each element represents the number of lines needing to be skipped for each result file

  • category_list – list, each element points to the column name in each result file that represent GO terms

  • mode – string, interactive or static

  • outdir – string, the output directory

  • ontology_to_highlight – dict, key is the GO terms in the result, values are the corresponding text you want them to be displayed on the plot

  • xlims – None or tuple.

  • ylims – None or tuple.

Examples:

snaf.downstream.visualize_GO_result(path_list=['result/survival/GO_Elite_result_GeneOntology/GO-Elite_results/CompleteResults/ORA/archived-20221101-142116/gene_list-GO.txt','result/survival/GO_Elite_result_BioMarkers/GO-Elite_results/CompleteResults/ORA/archived-20221101-142037/gene_list-BioMarkers.txt'],
                            skiprows_list=[17,16],category_list=['Ontology Name','Gene-Set Name'],outdir='result/survival',
                            mode='static',ontology_to_highlight={'Adult Peripheral Blood Activated T cell (PMID32214235 top 100)':'T cells','antigen binding':'antigen binding','complement activation':'Complement Activation','immune response':'immune response','humoral immune response':'humoral immune response'},ylims=(10e-50,10e-1))

Prepare Sashimi plot

snaf.downstream.prepare_sashimi_plot(bam_path_list, bai_path_list, outdir, sif_anno_path, bam_contig_rename, query_region, skip_copy=False, min_junction=3, width=10, ann_height=4, height=2, fix_y_scale=True, task_name='')[source]

Given a bunch of bam file and cognate bai files, we want to generate sashimi plot in an automatic way using ggsashimi package singularity image, this image is pulled using:

singularity build ggsashimi.sif docker://gencode.v36.annotation.gtf

Now specifiy a folder (outdir), we are going to copy bam and bai to a subdirectory called ggsashimi_bams, automatically generate input_bams.tsv and palette.txt file, and then return a string to run singularity images:

singularity run -W $PWD -B $PWD:$PWD ggsashimi.sif -b input_bam.tsv -c chr10:27040584-27048100 -C 3 -M 5 --width 10 --fix-y-scale -g gencode.v36.annotation.gtf --ann-height=4 --height=2 -P palette.txt
Parameters
  • bam_path_list – list, each element represents the path to the bam file

  • bai_path_list – list, each element represents the path to the bai file

  • outdir – string, the output directory where we are going to execute singularity run and get the plot, we prefer non-existing path or empty existing folder

  • sif_anno_path – string, the folder where ggsashimi.sif and gencode.v36.annotation.gtf reside

  • bam_contig_rename – list of boolean, whether to convert the respective bam contig from 10 to chr10, require the samtools are loaded

  • query_region – string, format like chr10:27040584-27048100

  • skip_copy – boolean, whether to skip copying step

  • min_junction – int, the minimum number of junction to show sashimi line

  • width – float, the width in pct for the plot

  • ann_height – float, the height in pct for the annotation

  • height – float, the height in pct for the sashimi plot

  • fix_y_scale – bool, whether to fix y scale, default to True

  • task_name – string, default is empty string, add this as a suffix to generated sashimi.pdf

Return cmd

string, the singularity cmd that you should type in the console when cd to the outdir

Examples:

root_dir = '/data/salomonis-archive/BAMs/NCI-R01/TCGA/TCGA-SKCM/TCGA_SKCM-BAMs/bams'
bam_file_path = [root_dir+'/'+item+'.bam' for item in ['TCGA-3N-A9WB-06A-11R-A38C-07','TCGA-3N-A9WC-06A-11R-A38C-07','TCGA-3N-A9WD-06A-11R-A38C-07']]
bai_file_path = [root_dir+'/'+item+'.bam.bai' for item in ['TCGA-3N-A9WB-06A-11R-A38C-07','TCGA-3N-A9WC-06A-11R-A38C-07','TCGA-3N-A9WD-06A-11R-A38C-07']]
sif_anno_path = '/data/salomonis2/software/ggsashimi'
cmd = snaf.downstream.prepare_sashimi_plot(bam_file_path,bai_file_path,'test',sif_anno_path,bam_contig_rename=[True,False,True],query_region='chr3:49099853-49099994')
print(cmd)
# singularity run -W $PWD -B $PWD:$PWD /data/salomonis2/software/ggsashimi/ggsashimi.sif -b input_bams.tsv -c chr3:49099853-49099994 -C 3 -M 3 --width 10 --fix-y-scale -g gencode.v36.annotation.gtf --ann-height=4 --height=2 -P palette.txt

Plot Splicing Event Type

snaf.downstream.plot_event_type(pea, uids, rel, outdir)[source]

plot the event type based on provided uid list

Parameters
  • pea – string, the path to the EventAnnotation file

  • uids – dict, key is the name of each category, value is a list of non-overlapping uids

  • rel – bool, whether to compute relative or absolute value

  • outdir – string, the output directory

Examples:

snaf.downstream.plot_event_type(pea='Hs_RNASeq_top_alt_junctions-PSI_EventAnnotation.txt',uids={'common':common_uid,'unique':unique_uid},rel=True,outdir='result_new')