API¶
API reference manual
MHC bound peptides (T antigen)¶
initialize¶
- snaf.__init__.initialize(df, db_dir, gtex_mode='count', software_path=None, binding_method=None, t_min=20, n_max=3, normal_cutoff=5, tumor_cutoff=20, normal_prevalance_cutoff=0.01, tumor_prevalance_cutoff=0.1, add_control=None)[source]¶
Setting up global variable for running the program
- Parameters
df – dataframe, the currently tested cancer cohort dataframe, this will limit number of junctions being loaded into memory for control database
db_dir – string, the path to the database
gtex_mode – string, either ‘psi’ or ‘count’
software_path – string or None, either the path to the netMHCpan4.1 executable or None (using MHCflurry)
binding_method – string or None, either ‘netMHCpan’ or ‘MHCflurry’
t_min – int, the minimum number of read count the tumor sample should be larget than average number in normal database
n_max – int, the maximum number of average read count normal database count have
normal_cutoff – int, below which read count we consider a junction is not expressed in normal tissue
tumor_cutoff – int, above which read count we consider a junction is expressed in tumor tissue
normal_prevalance_cutoff – float, if below this fraction, we consider a junction is not present in normal tissue at an appreciable amount
tumor_prevalance_cutoff – float, if above this fraction, we consider a junction is present in tumor tissue at an appreciable amount
add_control – None or a dictionary containing additional controls, additional controls can a dataframe or anndata, for instance, if adding two controls asides from internal GTEx, {‘tcga_matched_control’:df,’gtex_skin’:adata}, when added database contains multiple tissue types, it is recommended to pass as a AnnData with the tissue information stored as adata.var[‘tissue’], otherwise, if passed as a dataframe, or samples will be recognized as a single tissue type, it will affect tumor specifcity score calculation if using MLE or bayesian hierarchical models.
Example:
# database directory (where you extract the reference tarball file) db_dir = '/user/ligk2e/download' # instantiate (if using netMHCpan) netMHCpan_path = '/user/ligk2e/netMHCpan-4.1/netMHCpan' snaf.initialize(db_dir=db_dir,gtex_mode='count',binding_method='netMHCpan',software_path=netMHCpan_path) # instantiate (if not using netMHCpan) snaf.initialize(db_dir=db_dir,gtex_mode='count',binding_method='MHCflurry',software_path=None)
JunctionCountMatrixQuery class¶
- class snaf.snaf.JunctionCountMatrixQuery(junction_count_matrix, cores=None, add_control=None, not_in_db=False, outdir='.', filter_mode='maxmin')[source]¶
Instantiate the JunctionCountMatrixQuery class
- Parameters
junction_count_matrix – The pandas dataframe for the junction_count_matrix, outputted by AltAnalyze
cores – int, how many cores you’d like to use, if None, then will automatically detect the maximum amount of cores in the OS
add_control – None or a dictionary containing additional controls, additional controls can a dataframe or anndata, for instance, if adding two controls asides from internal GTEx, {‘tcga_matched_control’:df,’gtex_skin’:adata}, when added database contains multiple tissue types, it is recommended to pass as a AnnData with the tissue information stored as adata.var[‘tissue’], otherwise, if passed as a dataframe, or samples will be recognized as a single tissue type, it will affect tumor specifcity score calculation if using MLE or bayesian hierarchical models.
not_in_db – boolean, whether to remove junctions that are present in any Ensembl documented transcript, remember some of the documented transcript in Ensembl are tumor-specific as well, doing so may remove some bona fide hits. But good for reducing number of neoantigens for validation.
outdir – string, the output folder for storing results
filter_mode – string, either ‘maxmin’ or ‘prevalance’, this correspond to the parameters that will be used in snaf.initialize, if maxmin, use t_min and n_max, if prevalance, use normal and tumor cutoff
Example:
jcmq = JunctionCountMatrixQuery(junction_count_matrix=df,cores=50,add_control=control_df,not_in_db=True)
- static generate_results(path, outdir, criterion=None)[source]¶
wrapper function to automatically generate all necessary output
- Parameters
path – string, where the pickle result file lie on the disk
outdir – string, where to generate all the necessary result
Example:
JunctionCountMatrixQuery.generate_results(path='result/after_prediction.p',outdir='result')
- static get_membrane_tuples(df, **kwargs)[source]¶
this function is used by SurfaceAntigen pipeline to filter out splicing evnets that are not tumor-specific and also compute useful informations for each membrane protein.
- Parameters
df – pandas dataframe, the junction count matrix
**kwargs –
will be passed to __init__ of JunctionCountMatrixQuery class function
- Return membrane_tuples
a list, in which each item is a tuple (uid,mean_gtex,df_gtex,ed,freq).
uid: the uid of the splicing evnet
mean_gtex: the mean raw read count across GTEx
df_gtex: a dataframe of gtex sample name, tissue types and read count values
ed: expression dictionary for tumor samples
freq: expression frequency for tumor samples
Example:
membrane_tuples = snaf.JunctionCountMatrixQuery.get_membrane_tuples(df)
- run(hlas, strict=False, outdir='.', name='after_prediction.p')[source]¶
main function to run mhc bound peptide T antigen pipeline
- Parameters
hlas – list, each item is a list of 6 HLA types associated with each sample, the order of this list must be consistent with the sample name in the df column
strict – boolean, default is False, determine whether to filter out peptide which doesn’t have canonical start codon support
outdir – string, the path where all the results will go into
name – string, the name of the generated pickle result object
Example:
jcmq.run(hlas=hlas,outdir='result',name='after_prediction.p')
- show_neoantigen_as_fasta(outdir, name, stage, verbosity, contain_uid, sample=None, criterion=None)[source]¶
write the neoantigen as a fasta file for MS validation or other purpose
- Parameters
outdir – string ,the output directory for the generated fasta file
name – string, the name of the output fasta file
stage – int, either 0,1,2,3, the stage of neoantigen you want to report
verbosity – int, either 1,2,3, the verbosity of candidates
contain_uid – boolean, whether you want to contain the junction UID along with the neoantigen
sample – string, the name of the sample in which you want to extract the neoantigen
criterion – nested list, the criterion for filtering out the candidate
Examples:
jcmq = snaf.JunctionCountMatrixQuery.deserialize('result/after_prediction.p') sample = 'SRR5933735.Aligned.sortedByCoord.out' jcmq.show_neoantigen_as_fasta(outdir='./fasta',name='neoantigen_{}.fasta'.format(sample),stage=2,verbosity=1,contain_uid=True,sample=sample)
- visualize(uid, sample, outdir, tumor=False, criterion=[('netMHCpan_el', 0, '<=', 2), ('deepimmuno_immunogenicity', 1, '==', 'True')])[source]¶
Visualize certain Neojunction in certain sample
- Parameters
uid – string, the uid for the event you want to check
sample – string, the sample name
outdir – string, where to deposite the figure
tumor – bool, whether to show the expression level in tumor sample as well, default is not
criterion – a nested tuple, if none of the neoantigens are bound and immunogenic, no candidates will be plotted, so you can relax the criterion a bit based on the format specified.
Example:
jcmq.visualize(uid='ENSG00000167291:E38.6-E39.1',sample='TCGA-DA-A1I1-06A-12R-A18U-07.bed',outdir='./result')
mutation_analysis¶
- snaf.downstream.mutation_analysis(mode, burden, mutation, output, n_sample_cutoff=10, gene_column='gene', genes_to_plot=None)[source]¶
Run mutation association analysis with neoantigen burden
- Parameters
mode – string, either ‘compute’ or ‘plot’, compute will compute all the association, plot will visualize certain assciation in box plot
burden – pandas dataframe, the burden result file generated by T pipeline, and sample ID should be consistent with the sample ID in mutation dataframe index
mutation – pandas dataframe, the mutation data, make sure sample ID column is the index
output – string, the path in which the result will be generated, if compute mode, a txt file will be generated, if plot mode, a plot will be generated
n_sample_cutoff – int, default is 10, if a mutation occur in < n_sample_cutoff samples, it won’t be considered
gene_column – string, default is gene, in the mutation df, which column represents the gene
gene_to_plot – list, each item is a gene name, for the plot mode
Example:
# compute mode snaf.mutation_analysis(mode='compute',burden=burden,mutation=mutation,output='result/stage3_mutation.txt') # plot mode snaf.mutation_analysis(mode='plot',burden=burden,mutation=mutation,output='result/stage3_mutation_CAMKK2.pdf',genes_to_plot=['CAMKK2'])
survival_analysis¶
- snaf.downstream.survival_analysis(burden, survival, n, stratification_plot, survival_plot, survival_duration='OS.time', survival_event='OS')[source]¶
Run survival analysis based on neoantigen burden, the burden`and `survival`must be pandas dataframe, and make sure the sample ID are consistent with each other, we only consider samples in `burden that have clinical survival data.
- Parameters
burden – pandas dataframe, the burden result file generated by T pipeline
survival – pandas dataframe, the survival data
n – int, how many groups to stratigy, support 2 (median), 3 (33%,66%), 4 (quantile)
stratification_plot – string, the path and name for the stratification plot
survival_plot – string, the path plus name for the survival plot
survival_duration – string, which column in survival data represent the duration
survival_event – string, which column in survival data represent the event
Example:
snaf.survival_analysis(burden,survival,n=2,stratification_plot='result/stage2_stratify.pdf',survival_plot='result/stage2_survival.pdf')
survival_regression_cox_model¶
- snaf.downstream.survival_regression(freq, remove_quote, rename_func, survival, pea, outdir='.', cores=None, mode='binary', survival_duration='OS.time', survival_event='OS', n_effective_obs=3)[source]¶
conduct cox regression to identify neoantigens whose parental junctions expression is significantly associated with survival
- Parameters
freq – string, path to the T antigen generated frequency dataframe
remove_quote – bool, whether to remove the quote of samples column for freq df or not
rename_func – function, a function to rename columns in freq and event annotation file to be consistent with survival file
survival – string, the path to the survival dataframe
pea – string, the path to the altanalyze generated event annotation file
outdir – string, the output directory
cores – None or int, how many cores to use for computation
mode – binary or psi, if binary, we test whether having a neoantigen can predict survival, psi means whether its parental junction’s psi value is associated with survival
survival_duration – string, the column name for duration in survival dataframe
survival_event – string, the column name for event in survival dataframe
n_effective_obs – int, default is 3, at least three obs whose event=1, otherwise all obs are right-sensored
Examples:
snaf.downstream.survival_regression(freq='result/frequency_stage3_verbosity1_uid_gene_symbol_coord_mean_mle.txt',remove_quote=True, rename_func=lambda x:'-'.join(x.split('-')[:4]),survival='TCGA-SKCM.survival.tsv', pea='Hs_RNASeq_top_alt_junctions-PSI_EventAnnotation.txt',outdir='result/survival')
analyze_neoantigens¶
- snaf.downstream.analyze_neoantigens(freq_path, junction_path, total_samples, outdir, fasta=False, mers=None, columns=None, cutoffs=(0.1, 0.9), junction_bl=0.1)[source]¶
Users need to execute this function to prepare input for the t antigen dash viewer
- Parameters
freq_path – string, the path to the frequency file in the result folder, with uid appended
junction_path – string, the path to the burden0 file in the result folder
total_samples – int, the total number of samples, will be used for determine shared versus unique neoantigen
outdir – string, where the output will go into
fasta – bool, whether to output the fasta file as well to check the motif
mers – None or list, if [9,10], it will generate mer9 and mer10 for fasta and separate dataframe.
columns – None or string, which column to include in the output df
cutoffs – tuple, the cutoffs to determine shared versus unique neoantigen
junction_bl – float, the cutff for parent junction, if junction occur < junction_bl sample, we don’t consider neoantigens from those junction
Example:
# just output for viewer snaf.downstream.analyze_neoantigens(freq_path='result/frequency_stage2_verbosity1_uid.txt',junction_path='result/burden_stage0.txt',total_samples=472,outdir='result',mers=None,fasta=False) # to run MEME for motif after that snaf.downstream.analyze_neoantigens(freq_path='result/frequency_stage2_verbosity1_uid.txt',junction_path='result/burden_stage0.txt',total_samples=472,outdir='result',mers=[9,10],fasta=True)
run_dash_T_antigen¶
- snaf.dash_app.app.run_dash_T_antigen(input_abs_path, remove_cols=['uid'], host=None, port='8050', output_abs_path=None)[source]¶
run the dash T antigen viewer
- Parameters
input_abs_path – string, the absolute path to the input df
remove_cols – list, the column name to remove from the input df
host – string or None, if None, program will run hostname to automatically detect
port – string, default is 8050
output_abs_path – string or None, if you want to have the umap embedding, specify the output path and name
Example:
snaf.run_dash_T_antigen(input_abs_path='/data/salomonis2/LabFiles/Frank-Li/neoantigen/TCGA/SKCM/snaf_analysis/result/shared_vs_unique_neoantigen_all.txt')
report_candidates¶
- snaf.downstream.report_candidates(jcmq, df, sample, outdir, remove_quote=True, metrics={'deepimmuno_immunogenicity': 'immunogenicity', 'netMHCpan_el': 'binding_affinity'}, criterion=[('netMHCpan_el', 0, '<=', 2), ('deepimmuno_immunogenicity', 1, '==', 'True')])[source]¶
this function will report a txt file for all the T antigen candidate for a specific tumor sample.
- Parameters
jcmq – the JunctionCountMatrixQuery object, usually you need to deserialize after_prediction.p file produced by T antigen pipeline
df – DataFrame,the frequency table generated by T antigen pipeline
sample – String, the sample name to report T antigen candiate
outdir – string, the path to the output directory
remove_quote – boolean, whether to remove the quotation or not, as one column in frequency table df is list, when loaded in memory using pandas, it will be added a quote, we can remove it
metrics – dict, the key is the metric we registered to each peptide-hla, values is the corresponding name we want to output in the report for each metric
criterion – list with each element as a tuple, this is a very specialized expression to specificy the criterion we are going to use to only report peptide-hla that meet the criterion. each tuple has four element, the first element is the metric name we registered, the second is either 0 or 1, 0 means looking for the continuous value score, 1 means looking for the string identity value. See below note for further explanation. the third is an operator, and the fourth is the cutoff.
Note
By default, we run netMHCpan or MHCflurry for MHC binding prediction, and DeepImmuno for immunogenicity prediction. When we add the predicted value for each peptide-hla pair, we format the prediction result into a dataframe compliant with something like below:
peptide mer hla score identity
AAAAAAAAA 9 HLA-A*01:01 0.3 SB
here is the dataframe for MHC binding prediction, we have a column recording continous value (score), and another column (identity) recording whether it is a SB (Strong Binder) or WB (Weaker Binder). Those are all useful information coming out of the predictor. Now we just need to register this df to the NeoJunction class by running
nj.enhanced_peptide.register_attr(df,'netMHCpan_el). We can keep adding scores to the NeoJunction class. This explain what the 0/1 in criterion mean.Example:
snaf.report_candidates(jcmq,df,'SRR067783.bed','result',True) # a txt file will be written to the outdir you specified
get_coverage¶
- snaf.downstream.get_coverage(t_result, allele, outdir='.')[source]¶
Get the population coverage of certain neoantigen, based on the different HLA allele frequency from US bone marrow registry. The 21 race code can be found here (https://www.sciencedirect.com/science/article/pii/S0198885913001821?via%3Dihub#t0005)
- Parameters
t_result – string, the path to the T_antigen_candidates_all.txt file
allele – string, either A, B or C
outdir – string, default is current directory
Examples:
snaf.downstream.get_coverage(t_result='result_new/T_candidates/T_antigen_candidates_all.txt',allele='A')
annotation txt file¶ peptide
hlas
uid
symbol
coord
tumor_specificity_mean
tumor_specificity_mle
AAFA
AFB
AINDI
AMIND
CARB
CARHIS
EURCAU
FILII
JAPI
KORI
MENAFC
MSWHIS
NCHI
SCAHIS
SCSEAI
VIET
AAAAAGHFW
B*57:01,B*46:01,B*35:01
ENSG00000106571:E2.2-E4.1
GLI3
chr7:42223295-42236971(-)
1.3304086
0.423474388
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
AAAAAGSSSW
B*57:01,A*25:01,A*32:01
ENSG00000099960:E2.2-E3.2
SLC7A4
chr22:21031852-21032531(-)
0.9078695
0.37040754
0.01847
0.01771
0.03875
0.04018
0.01601
0.04102
0.05655
0.00405
0.0024
0.00622
0.05907
0.03597
0.00636
0.03612
0.03019
0.00191
AAAAAPPLQR
A*03:01
ENSG00000066697:E20.3-E27.1_100442077
MSANTD3
chr9:100427393-100442077(+)
0.06224294
0.089645936
0.08395
0.07834
0.06358
0.10437
0.07825
0.0863
0.13987
0.01309
0.00897
0.01905
0.10098
0.08147
0.01401
0.07384
0.04987
0.00973
AAAAAPVIGK
A*03:01
ENSG00000105613:E2.1-E4.1
MAST1
chr19:12838300-12840446(+)
0.5440088
0.625632982
0.08395
0.07834
0.06358
0.10437
0.07825
0.0863
0.13987
0.01309
0.00897
0.01905
0.10098
0.08147
0.01401
0.07384
0.04987
0.00973
AAAAFRFPR
A*30:01,A*31:01,A*11:02,A*66:01,A*11:01,A*33:01,A*68:01
ENSG00000270816:E1.3-E2.1
LINC00221
chr14:106482637-106488768(+)
0.45516863
0.999993827
0.16699
0.15773
0.25935
0.18128
0.16724
0.17499
0.14498
0.2024
0.17736
0.19134
0.166
0.19398
0.3579
0.18561
0.2764
0.30457
AAAAGHFWSK
A*03:01
ENSG00000106571:E2.2-E4.1
GLI3
chr7:42223295-42236971(-)
1.3304086
0.423474388
0.08395
0.07834
0.06358
0.10437
0.07825
0.0863
0.13987
0.01309
0.00897
0.01905
0.10098
0.08147
0.01401
0.07384
0.04987
0.00973
AAAAGSSSW
B*35:01,C*03:03,B*13:01,B*44:03,C*02:02,A*26:01,C*16:01,C*03:04,A*25:01,B*57:01,A*32:01,B*15:01
ENSG00000099960:E2.2-E3.2
SLC7A4
chr22:21031852-21032531(-)
0.9078695
0.37040754
0.03303
0.03014
0.08059
0.06006
0.02826
0.07102
0.08743
0.01928
0.0822
0.04373
0.10325
0.06192
0.03005
0.06562
0.0689
0.02475
AAAAGSSSWL
C*03:04,C*03:03,B*07:02,B*07:05
ENSG00000099960:E2.2-E3.2
SLC7A4
chr22:21031852-21032531(-)
0.9078695
0.37040754
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
AAAALAGIL
C*03:03,C*12:02
ENSG00000065308:E9.1-E10.1_52504687
TRAM2
chr6:52504687-52505599(-)
0.25774613
0.155925177
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Interface to proteomics¶
chop_normal_pep_db¶
- snaf.proteomics.chop_normal_pep_db(fasta_path, output_path, mers, allow_duplicates)[source]¶
chop any normal human proteome to certain mers
- Parameters
fasta_path – string, the path to the human protein fasta file
output_path – string, the path to the output fasta file
mers – list, like [9,10] will generate 9mer and 10mer
allow_duplicates – boolean. whether allow duplicate or not
Example:
chop_normal_pep_db(fasta_path='human_uniprot_proteome.fasta',output_path='./human_uniprot_proteome_mer9_10.fasta',mers=[9,10],allow_duplicates=False)
compare_two_fasta¶
- snaf.proteomics.compare_two_fasta(fa1_path, fa2_path, outdir='.', write_comm=False, write_unique1=False, write_unique2=False, prefix='')[source]¶
Compare arbitracy two fasta files, report unique and common ones
- Parameters
fa1_path – the first fasta file path that you want to compare with
fa2_path – the second fasta file path that you want to compare with
write_comm – boolean, whether write the common one between fa1 and fa2
write_unique1 – boolean, whether write the one unique to fa1
write_unique2 – boolean, whether write the one unique to fa2
prefix – string, whether to add a prefix to the output fasta file name
Example:
snaf.proteomics.compare_two_fasta(fa1_path='./fasta/human_proteome_uniprot_9_10_mers_unique.fasta', fa2_path='./fasta/neoantigen_{}_unique.fasta'.format(sample),outdir='./fasta', write_unique2=True,prefix='{}_'.format(sample))
remove_redundant¶
- snaf.proteomics.remove_redundant(fasta_path, out_path)[source]¶
remove redundant entries in any fasta files
- Parameters
fasta_path – string, the path to the input fasta file
out_path – string, the path to the output folder
Example:
snaf.proteomics.remove_redundant('./fasta/neoantigen_{}.fasta'.format(sample),'./fasta/neoantigen_{}_unique.fasta'.format(sample))
set_maxquant_configuration¶
- snaf.proteomics.set_maxquant_configuration(dbs, n_threads, inputs, enzymes, enzyme_mode, outdir, outname='mqpar.xml', protein_fdr=0.01, peptide_fdr=0.01, site_fdr=0.01, include_contaminants=True, var_mods=['Oxidation (M)', 'Acetyl (Protein N-term)', 'Carbamidomethyl (C)'], fix_mods=None, minPepLen=8, maxPeptideMass=4600, minPeptideLengthForUnspecificSearch=8, maxPeptideLengthForUnspecificSearch=25)[source]¶
automatically build MaxQuant configuration file mqpar.xml, the template mqpar.xml is generated using maxquant 2.0.3.1 on a windows PC GUI, we load a single raw file, a single fasta file, reduce carbamidomethyl from fixed modification to variable modification and allow match run, otherwise, we retain all default parameters.
- Parameters
dbs – list, each element is the path to the database fasta file
n_threads – int, how many threads will be used for MaxQuant search
inputs – list, each element is the path to the raw file
enzymes – list, what enzymes will be used for search
enzyme_mode – int, 0 -> specific, 3 -> semi-specific, 4 -> unspecific, 5 -> no digestion
outdir – string, the output directory for the conf file
outname – string, default is mqpar.xml
protein_fdr – float, default is 0.01
peptide_fdr – float, default is 0.01
site_fdr – float, default is 0.01
include_contaminants – boolean, default is True
var_mods – list or None, default is [‘Oxidation (M)’, ‘Acetyl (Protein N-term)’, ‘Carbamidomethyl (C)’]
fix_mods – list or None, default is []
minPepLen – int, default is 8
maxPeptideMass – int, default is 4600
minPeptideLengthForUnspecificSearch – int, default is 8
maxPeptideLengthForUnspecificSearch – int, default is 25
Example:
dbs = ['/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/RNA/snaf_analysis/fasta/SRR5933726.Aligned.sortedByCoord.out.bed_unique2.fasta'] inputs = ['/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48/OvCa48_classI_Rep#1.raw', '/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48/OvCa48_classI_Rep#2.raw', '/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48/OvCa48_classI_Rep#3.raw', '/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48/OvCa48_classI_Rep#4.raw', '/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48/OvCa48_classI_Rep#5.raw', '/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48/OvCa48_classI_Rep#6.raw'] outdir = '/data/salomonis2/LabFiles/Frank-Li/neoantigen/MS/schuster/MS/OvCa48' snaf.proteomics.set_maxquant_configuration(dbs=dbs,n_threads=20,inputs=inputs,enzymes=None,enzyme_mode=5,protein_fdr=1,peptide_fdr=0.05,site_fdr=1,outdir=outdir)
summarize_ms_result¶
- snaf.proteomics.summarize_ms_result(peptide, msms, freq, outdir)[source]¶
After finishing running Maxquant, this function can help figuring out what peptide for further validation
- Parameters
peptide – string, the path to MaxQuant peptide.txt file
msms – string, the path to the MaxQuant msms.txt file
freq – string, the path to the SNAF frequency_stage2_verbosity1_uid_gene_symbol_coord_mean_mle.txt file
outdir – string, the path to the output folder
- Return df_peptide
dataframe, the valid MS-evidenced neoantigens with associated properties
Examples:
df = snaf.proteomics.summarize_ms_result(peptide='MS/OvCa114/combined/txt/peptides.txt',msms='MS/OvCa114/combined/txt/msms.txt',freq='result/frequency_stage2_verbosity1_uid_gene_symbol_coord_mean_mle.txt')
Surface Antigen (B antigen)¶
initialize¶
run¶
- snaf.surface.main.run(uids, outdir, prediction_mode='short_read', n_stride=2, gtf=None, tmhmm=False, software_path=None, serialize=True)[source]¶
main function for run B antigen pipeline
- Parameters
uids – list, the membrane tuples
outdir – string, the path where all the output will go into
prediction_mode – string, either ‘short_read’ or ‘long_read’
n_stride – int, how many exons define a early stop codon, for NMD check
gtf – None or string, if prediction mode is long_read, supply the path to the gtf file
tmhmm – bool, use tmhmm or not
software_path – None or string, if tmhmm=True, specify the tmhmm software path
serialize – bool, serialize to the pickle file for the result
Examples:
# if using TMHMM surface.run(membrane_tuples,outdir='result',tmhmm=True,software_path='/data/salomonis2/LabFiles/Frank-Li/python3/TMHMM/tmhmm-2.0c/bin/tmhmm') # if not using TMHMM surface.run(membrane_tuples,outdir='result',tmhmm=False,software_path=None)
generate_full_results¶
- snaf.surface.main.generate_full_results(outdir, freq_path, mode, validation_gtf)[source]¶
preferred api for wraping both generate_results and report_candiates, good for general usage
- Parameters
outdir – string, the output folder
freq_path – string, the path to the frequency_stage0_verbosity1_uid_gene_symbol_coord_mean_mle.txt file from T antigen pipeline
mode – string, either long_read or short_read
validation_gtf – string, the path to the validation long_read gtf when mode=short_read
Examples:
# long read mode surface.generate_full_results(outdir='result/surface',freq_path='result/frequency_stage0_verbosity1_uid_gene_symbol_coord_mean_mle.txt',mode='long_read') # short read mode surface.generate_full_results(outdir='result/surface',freq_path='result/frequency_stage0_verbosity1_uid_gene_symbol_coord_mean_mle.txt',mode='short_read',validation_gtf='/data/salomonis2/LabFiles/Frank-Li/neoantigen/TCGA/SKCM/snaf_analysis/SQANTI-all/collapse_isoforms_classification.filtered_lite.gtf')
generate_results¶
- snaf.surface.main.generate_results(pickle_path, strigency=3, outdir='.', gtf=None, long_read=False, style=None, overlap_extracellular=False)[source]¶
Generate candidates for B antigen
- Parameters
pickle_path – string, the path to the pickle result file
strigency –
int, from 1-5, how strigent you want to program to be for calling B antigen
strigency1: novel isoform needs to be absent in uniprot database
strigency2: novel isoform also needs to be a documented protein-coding gene
strigency3: novel isoform also needs to not be subjected to Nonsense Mediated Decay (NMD)
strigency4: novel isoform also needs to have long-read or EST support (as long as the novel junction present in full-length)
strigency5: novel isoform also needs to have long-read or EST support (whole ORF needs to be the same as full-length)
outdir – string, path to the output folder
gtf – string, if strigency>3, you need to specify the path to the long-read or EST gtf file
long_read – boolean, whether the last run was using prediction_mode as long_read or not, default to False
style – None or string, can either be ‘deletion’ or ‘insertion’ meaning only generate result with deleted or inserted region
overlap_extracellular – boolean, default is True, only generate result whose junction overlap with extracellular domain
Example:
# if having gtf file for long-read data surface.generate_results(pickle_path='./result/surface_antigen.p',outdir='result',strigency=5,gtf='./SQANTI-all/collapse_isoforms_classification.filtered_lite.gtf') # if not having surface.generate_results(pickle_path='./result/surface_antigen.p',outdir='result',strigency=3,gtf=None)
run_dash_B_antigen¶
- snaf.surface.main.run_dash_B_antigen(pkl, prediction_mode, candidates, python_executable, host=None, port='8050')[source]¶
run the dash B antigen viewer
- Parameters
pkl – string, the path to the surface B pipeline pickle file
prediction_mode – string, either short_read or long_read
candidates – string ,the path to the candidates.txt file
python_executable – string ,the path to the python executable you are using
host – None or string, string or None, if None, program will run hostname to automatically detect
port – string, default is 8050
Example:
surface.run_dash_B_antigen(pkl='result/surface_antigen.p',candidates='result/candidates_5.txt',prediction_mode='long_read', python_executable='/data/salomonis2/LabFiles/Frank-Li/refactor/neo_env/bin/python3.7')
report_candidates¶
- snaf.surface.main.report_candidates(pickle_path, candidates_path, validation_path, freq_df_path, mode, outdir='.', name=None)[source]¶
report SNAF-B antigen candidates, a more organized and tidy file format for all the candidate.
- Parameters
pickle_path – string ,the path to the saved pickle object from run command.
candidates_path – string, the path to the saved candidates.txt file which we’d like to reformat
validation_path – string, the path to the saved validation.txt file which accoompanying the candidates.txt file
freq_df_path – string, the path to the frequency table from SNAF-T pipeline, this will provide the frequency of each neojunction in whole cohort.
mode – string, either ‘long_read’ or ‘short_read’,default is short_read
outdir – string, the folder for the output directory, will create one if not exist
name – string, the name of the saved output file.
Example:
surface.report_candidates('result/surface_antigen.p','result/candidates_3.txt','result/validation_3.txt', 'result/frequency_stage0_verbosity1_uid_gene_symbol_coord_mean_mle.txt','short_read', 'result','sr_str3_report.txt')
GTEx Viewer (tumor specificity)¶
gtex_visual_combine¶
- snaf.gtex_viewer.gtex_visual_combine(uid, norm=False, outdir='.', figsize=(6.4, 4.8), tumor=None, ylim=None, group_by_tissue=True)[source]¶
Visualize the gtex expression and tumor specificity for splicing event (combine into one plot)
- Parameters
uid – string, the uid for the splicing event
norm – bool. whether normalize for the sequencing depth or not
outdir – string, where the figure go into
figsize – tuple, the (width,height) of the figure
tumor – pandas dataframe, the tumor df to compare with
ylim – tuple, modify the ylim of the (bottom, top) of the figure
group_by_tissue – bool, whether to group the normal smaples by tissue or not, default is True
Example:
snaf.gtex_visual_combine(uid='ENSG00000090339:E4.3-E4.5',norm=True,tumor=df) snaf.gtex_visual_combine(uid='ENSG00000112149:E7.1-E9.1',norm=True,tumor=df)
gtex_visual_combine_plotly¶
- snaf.gtex_viewer.gtex_visual_combine_plotly(uid, outdir='', norm=False, tumor=None)[source]¶
Generate combined plot but interactively
- Parameters
uid – string, the uid for the splicing event
norm – bool. whether normalize for the sequencing depth or not
outdir – string, where the figure go into
tumor – pandas dataframe, the tumor df to compare with
Example:
snaf.gtex_visual_combine(uid='ENSG00000090339:E4.3-E4.5',norm=True,tumor=df) snaf.gtex_visual_combine(uid='ENSG00000112149:E7.1-E9.1',norm=True,tumor=df)
gtex_visual_subplots¶
- snaf.gtex_viewer.gtex_visual_subplots(uid, norm=True, top=100, outdir='.')[source]¶
Visualize the gtex expression and tumor specificity for splicing event (subplots)
- Parameters
uid – string, the uid for the splicing event
norm – bool. whether normalize for the sequencing depth or not
outdir – string, where the figure go into
Example:
snaf.gtex_visual_subplots(query='ENSG00000090339:E4.3-E4.5',norm=True) snaf.gtex_visual_subplots(query='ENSG00000112149:E7.1-E9.1',norm=True)
gtex_visual_norm_count_combined¶
- snaf.gtex_viewer.gtex_visual_norm_count_combined(uid, xlim=None, ylim=None, save_df=False, outdir='.')[source]¶
This function is to visualize the normalized count (CPM) distribution, it should follow a half-norm distribution, this plot will help determine the choice in bayesian modeling for tumor specificity score
- Parameters
uid – string, the uid for the splicing event
xlim – None or tuple, whether to constrain the xlim of the histplot
ylim – None or tuple, whether to constrain the ylim of the histplot
save_df – boolean, whether to save a txt file reporting all the cpm values and the tissue information for debug purpose
outdir – string, where the figure go into
Example:
snaf.gtex_visual_norm_count_combined(uid='ENSG00000090339:E4.3-E4.5')
gtex_visual_per_tissue_count¶
- snaf.gtex_viewer.gtex_visual_per_tissue_count(uid, total_count=10, count_cutoff=1, total=25, outdir='.')[source]¶
This function is to visualize the random variable X, which repesents the number of samples in each tissue type that express this junction. It should follow a zeroinflated poisson distribution. We do not consider tissues with less than total_count samples, and we use count_cutoff to represent how many count can be defined as “expressed” versus “non-expressed”. We scaled all the number of samples to the total of “total” to make sure they are comparable. Because if a tissue type has 1000 samples, another has 25 samples, this random variable is not directly comparable. this plot will help determine the choice in bayesian modeling for tumor specificity score
- Parameters
uid – string, the uid for the splicing event
outdir – string, where the figure go into
Example:
snaf.gtex_visual_per_tissue_count(uid='ENSG00000090339:E4.3-E4.5')
Miscellaneous¶
Add gene symbol¶
- snaf.downstream.add_gene_symbol_frequency_table(df, remove_quote=True)[source]¶
This function will convert the ENSG id to gene sysmbol and add a column in place for your dataframe
- Parameters
df – the input df, make sure the index is the uid
remove_quote – bool, depending on how your df is loaded, if directly read from the disk, certain column will contain quotation, whether to remove it or not
- Return df
the output df, with added gene symbol column
Example:
add_gene_symbol_frequency_table(df=df,remove_quote=False)
Add chromsome coordinate¶
- snaf.snaf.add_coord_frequency_table(df, remove_quote=True)[source]¶
Convert the uid to chromsome coordinates
- Parameters
df – the input df, index must be uid
remove_quote – bool, depending on how your df is loaded, if directly read from the disk, certain column will contain quotation, whether to remove it or not
- Return df
the output df, with added coordinate column
Exmaple:
add_coord_frequency_table(df=df,remove_quote=True)
Add tumor specificity scores¶
- snaf.gtex.add_tumor_specificity_frequency_table(df, method='mean', remove_quote=True, cores=None)[source]¶
add tumor specificty to each neoantigen-uid in the frequency table produced by SNAF T pipeline
- Parameters
df – DataFrame, the frequency table produced by SNAF T pipeline
method – string, either ‘mean’, or ‘mle’, or ‘bayesian’
quote (remove) – boolean, whether to remove the quotation or not, as one column in frequency table df is list, when loaded in memory using pandas, it will be added a quote, we can remove it
cores – int, how many cpu cores to use for this computation, default None and use all the cpu the program detected
- Return new_df
a dataframe with one added column containing tumor specificity score
Example:
snaf.add_tumor_specificity_frequency_table(df,'mle',remove_quote=True)
Add All three info at once¶
- snaf.snaf.enhance_frequency_table(df, remove_quote=True, save=True, outdir='', name=None)[source]¶
This is a wrapper function to add (1) gene symbol, (2) chromosome coordinate, (3) tumor specificity to the frequency table at once
- Parameters
df – the dataframe for the frequency table (freq_stage{n}_verbosity1_uid.txt)
remove_quote – boolean, depending on how your df is loaded, if directly read from the disk, certain column will contain quotation, whether to remove it or not
save – boolean, whether to save the result or not, default is True
outdir – string, the output folder.
name – string, the output file name
Example:
enhance_frequency_table(df=df,True,True,'result','freq_stage3_verbosity1_uid_gene_symbol_coord_mean_mle.txt')
Prepare DEG analysis¶
- snaf.downstream.prepare_DEG_analysis(burden_path, patient_strat_path, rename_func, outdir='.', encoding={'high': '2', 'low': '1'})[source]¶
This function generate groups and comps file for performing AltAnalyze Differential Gene Analysis using empirical Bayes
- Parameters
burden_path – string, the path to the burden file generated by T pipeline
patient_strat_path – string, the path to the patient stratification file returned by survival_analysis function
rename_func – function, it will be used to transform sample_id in burden_path (same as exp file) to sample_id in stat file (as it is adjusted for survival file)
outdir – string, the folder for generating comps and groups files
encoding – dict, the encoding to specificy how to transform identity string and what identities will be used for the DEG analysis
Examples:
snaf.downstream.prepare_DEG_analysis('result/burden_stage3.txt','result/survival/burden3_patient_high_low_group.txt', '/data/salomonis2/NCI-R01/TCGA-SKCM/TCGA_SKCM_hg38/ExpressionInput/exp.TCGA-SKCM.txt', outdir='result/survival',encoding={'low':'1','high':'2'})
Visuliaze DEG result (Volcano)¶
- snaf.downstream.visualize_DEG_result(result_path, mode, outdir='.', genes_to_highlight=[], up_cutoff=0.8, down_cutoff=- 0.8, xlims=(- 2, 2), ylims=None)[source]¶
Generate volcano plot for the AltAnalyze DEG results
- Parameters
result_path – string, the path to the AltAnalyze DEG results
mode – string, either interactive or static
outdir – string, the folder where the plots will go into
genes_to_highlight – list, the gene names to highlight in the plot, only for static mode
up_cutoff – float, the up-regulated cutoff for logfoldchange, only for static mode
down_cutoff – float, the down-regulated cutoff for logfoldchagne, only for static mode
xlims – None or tuple, force the xlims of the volcano plot, only for static mode
ylims – None or tuple, force the ylims of the volcano plot, only for static mode
Examples:
snaf.downstream.visualize_DEG_result('/data/salomonis2/LabFiles/Frank-Li/neoantigen/revision/TCGA_melanoma/result/survival/DEGs-LogFold_0.58_adjp/GE.low_vs_high.txt', mode='static',outdir='.',genes_to_highlight=['LST1','HCST','IL32','CD3D','S100A8','MZB1','IGLC4','ADAM10','ARFGEF2','MIB1','KIF3B','TNPO1','PTPN11','ANKRD52','TGFBR1'])
Prepare GO analysis¶
- snaf.downstream.prepare_GO_analysis(result_path, lc_cutoff=0.5, direction='>', adjp_cutoff=0.05, n=None, sortby='adjp', ascending=True, outdir='.')[source]¶
Generate gene list for GO analysis
- Parameters
result_path – string, the path to the AltAnalyze generated DEG result
lc_cutoff – float, the cutoff above which will be considered desirable genes
direction – string, either “>” or “<”, indicating whether to use greater than or less than of lc_cutoff
adjp_cutoff – float, the cutoff below which will be considered desirable genes
n – None or int, the number of genes to contrain for GO analysis
sortby – string, ‘adjp’ or ‘LogFold’
ascending – bool, sort in ascending order or not
outdir – string, the directory to write outputs
Examples:
snaf.downstream.prepare_GO_analysis('result/survival/DEGs-LogFold_0.58_adjp/GE.low_vs_high.txt',outdir='result/survival')
Visualize GO result¶
- snaf.downstream.visualize_GO_result(path_list, skiprows_list, category_list, mode='interactive', outdir='', ontology_to_highlight={}, xlims=None, ylims=None)[source]¶
Visualize the GO analysis rssults
- Parameters
path_list – list, each element points to the GO-Elite result txt file under ORA folder
skiprows_list – list, each element represents the number of lines needing to be skipped for each result file
category_list – list, each element points to the column name in each result file that represent GO terms
mode – string, interactive or static
outdir – string, the output directory
ontology_to_highlight – dict, key is the GO terms in the result, values are the corresponding text you want them to be displayed on the plot
xlims – None or tuple.
ylims – None or tuple.
Examples:
snaf.downstream.visualize_GO_result(path_list=['result/survival/GO_Elite_result_GeneOntology/GO-Elite_results/CompleteResults/ORA/archived-20221101-142116/gene_list-GO.txt','result/survival/GO_Elite_result_BioMarkers/GO-Elite_results/CompleteResults/ORA/archived-20221101-142037/gene_list-BioMarkers.txt'], skiprows_list=[17,16],category_list=['Ontology Name','Gene-Set Name'],outdir='result/survival', mode='static',ontology_to_highlight={'Adult Peripheral Blood Activated T cell (PMID32214235 top 100)':'T cells','antigen binding':'antigen binding','complement activation':'Complement Activation','immune response':'immune response','humoral immune response':'humoral immune response'},ylims=(10e-50,10e-1))
Prepare Sashimi plot¶
- snaf.downstream.prepare_sashimi_plot(bam_path_list, bai_path_list, outdir, sif_anno_path, bam_contig_rename, query_region, skip_copy=False, min_junction=3, width=10, ann_height=4, height=2, fix_y_scale=True, task_name='')[source]¶
Given a bunch of bam file and cognate bai files, we want to generate sashimi plot in an automatic way using ggsashimi package singularity image, this image is pulled using:
singularity build ggsashimi.sif docker://gencode.v36.annotation.gtf
Now specifiy a folder (outdir), we are going to copy bam and bai to a subdirectory called ggsashimi_bams, automatically generate input_bams.tsv and palette.txt file, and then return a string to run singularity images:
singularity run -W $PWD -B $PWD:$PWD ggsashimi.sif -b input_bam.tsv -c chr10:27040584-27048100 -C 3 -M 5 --width 10 --fix-y-scale -g gencode.v36.annotation.gtf --ann-height=4 --height=2 -P palette.txt
- Parameters
bam_path_list – list, each element represents the path to the bam file
bai_path_list – list, each element represents the path to the bai file
outdir – string, the output directory where we are going to execute singularity run and get the plot, we prefer non-existing path or empty existing folder
sif_anno_path – string, the folder where ggsashimi.sif and gencode.v36.annotation.gtf reside
bam_contig_rename – list of boolean, whether to convert the respective bam contig from 10 to chr10, require the samtools are loaded
query_region – string, format like chr10:27040584-27048100
skip_copy – boolean, whether to skip copying step
min_junction – int, the minimum number of junction to show sashimi line
width – float, the width in pct for the plot
ann_height – float, the height in pct for the annotation
height – float, the height in pct for the sashimi plot
fix_y_scale – bool, whether to fix y scale, default to True
task_name – string, default is empty string, add this as a suffix to generated sashimi.pdf
- Return cmd
string, the singularity cmd that you should type in the console when cd to the outdir
Examples:
root_dir = '/data/salomonis-archive/BAMs/NCI-R01/TCGA/TCGA-SKCM/TCGA_SKCM-BAMs/bams' bam_file_path = [root_dir+'/'+item+'.bam' for item in ['TCGA-3N-A9WB-06A-11R-A38C-07','TCGA-3N-A9WC-06A-11R-A38C-07','TCGA-3N-A9WD-06A-11R-A38C-07']] bai_file_path = [root_dir+'/'+item+'.bam.bai' for item in ['TCGA-3N-A9WB-06A-11R-A38C-07','TCGA-3N-A9WC-06A-11R-A38C-07','TCGA-3N-A9WD-06A-11R-A38C-07']] sif_anno_path = '/data/salomonis2/software/ggsashimi' cmd = snaf.downstream.prepare_sashimi_plot(bam_file_path,bai_file_path,'test',sif_anno_path,bam_contig_rename=[True,False,True],query_region='chr3:49099853-49099994') print(cmd) # singularity run -W $PWD -B $PWD:$PWD /data/salomonis2/software/ggsashimi/ggsashimi.sif -b input_bams.tsv -c chr3:49099853-49099994 -C 3 -M 3 --width 10 --fix-y-scale -g gencode.v36.annotation.gtf --ann-height=4 --height=2 -P palette.txt
Plot Splicing Event Type¶
- snaf.downstream.plot_event_type(pea, uids, rel, outdir)[source]¶
plot the event type based on provided uid list
- Parameters
pea – string, the path to the EventAnnotation file
uids – dict, key is the name of each category, value is a list of non-overlapping uids
rel – bool, whether to compute relative or absolute value
outdir – string, the output directory
Examples:
snaf.downstream.plot_event_type(pea='Hs_RNASeq_top_alt_junctions-PSI_EventAnnotation.txt',uids={'common':common_uid,'unique':unique_uid},rel=True,outdir='result_new')