R/junction_annot.R
, R/junction_filter.R
, R/junction_norm.R
, and 2 more
junction_process.Rd
The set of functions prefixed with "junction_" are used to
process junction data. They are designed to be run in a sequential manner
in the order junction_annot
, junction_filter
, junction_norm
,
junction_score
. Or, alternatively the wrapper function junction_process
can be used to run all 4 of the functions stated above in one go. For more
details of the individual functions, see "Details".
junction_annot(
junctions,
ref,
ref_cols = c("gene_id", "tx_name", "exon_id"),
ref_cols_to_merge = c("gene_id")
)
junction_filter(
junctions,
count_thresh = c(raw = 5),
n_samp = c(raw = 1),
width_range = NULL,
types = NULL,
regions = NULL
)
junction_norm(junctions)
junction_process(
junctions,
ref,
ref_cols = c("gene_id", "tx_name", "exon_name"),
ref_cols_to_merge = c("gene_id"),
count_thresh = c(raw = 5),
n_samp = c(raw = 1),
width_range = NULL,
types = NULL,
regions = NULL,
score_func = .zscore,
...
)
junction_score(junctions, score_func = .zscore, ...)
junction data as a RangedSummarizedExperiment-class object.
either path to gtf/gff3 or object of class TxDb-class or EnsDb-class. EnsDb-class is required if you intend to annotate junctions with gene symbols/names.
character vector listing the names of the columns in ref
for which to annotate junctions with. Must contain "gene_id", used for
categorising junctions.
character vector listing which of the annotation
columns ref_cols
should be merged into in columns to merge into a single
column per junction. Must contain "gene_id", used for categorising
junctions.
named vector with names matching the names of the
assays in junctions
.
Values denote the number of counts below which a junction will be filtered
out.
named vector with names matching the names of the
assays in junctions
.
Values denotes number of samples that have to express the junction above
the count_thresh
in order for that junction to not be filtered.
numeric vector of length 2. The first element denoting the lower limit of junction width and the second the upper limit. Junctions with widths outside this range will be filtered out.
any junctions matching these types, derived form junction_annot will be filtered out.
any junctions overlapping this set of regions (in a GRanges-class format) will be filtered out.
function to score junctions by their abnormality. By
default, will use a z-score but can be switched to a user-defined function.
This function must take as input an x
and y
argument, containing case
and control counts respectively. This must return a numeric vector equal to
the length of x
with elements corresponding to a abnormality of each
junction.
additional arguments passed to score_func
.
RangedSummarizedExperiment-classobject containing filtered, annotated, normalised junction data with abnormality scores.
junction_process
wraps all "junction_" prefixed functions in
dasper except junction_load. This is designed to
simplify processing of the junction data for those familiar or uninterested
with the intermediates.
junction_annot
annotates junctions by 1. whether their start and/or end
position precisely overlaps with an annotated exon boundary and 2. whether
that junction matches an intron definition from existing annotation. Using
this information along with the strand, junctions are categorised into
"annotated", "novel_acceptor", "novel_donor", "novel_combo",
"novel_exon_skip", "ambig_gene" and "unannotated".
junction_filter
filters out "noisy" junctions based on counts, the width
of junctions, annotation category of the junction returned from
junction_annot and whether the junction overlaps with a set of
(blacklist) regions.
junction_norm
normalises the raw junction counts by 1. building junction
clusters by finding junctions that share an acceptor or donor position and
2. calculating a proportion-spliced-in (PSI) for each junction by dividing
the raw junction count by the total number of counts in it's associated
cluster.
junction_score
will use the counts contained within the "norm"
assay to calculate a
deviation of each patient junction from the expected distribution of
control junction counts. The function used to calculate this abnormality
score can be user-inputted or left as the default z-score. Junctions will
also be labelled based on whether they are up-regulated (+1) or
down-regulated (-1) with respect to controls junction and this information
is stored in the assay
"direction" for use in outlier_aggregate.
junction_annot
: Annotate junctions using reference annotation
junction_filter
: Filter junctions by count, width, annotation or region
junction_norm
: Normalise junction counts by cluster
junction_score
: Score patient junctions by their abnormality
ENCODE blacklist regions are recommended to be included as regions
for junction_filter
and can be downloaded from
https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz.
Further information can be found via the publication
https://www.nature.com/articles/s41598-019-45839-z.
##### Set up txdb #####
# use GenomicState to load txdb (GENCODE v31)
ref <- GenomicState::GenomicStateHub(
version = "31",
genome = "hg38",
filetype = "TxDb"
)[[1]]
#> snapshotDate(): 2021-10-20
#> loading from cache
##### junction_annot #####
junctions <- junction_annot(junctions_example, ref)
#> [1] "2022-03-26 18:59:24 - Obtaining co-ordinates of annotated exons and junctions..."
#> [1] "2022-03-26 18:59:38 - Getting junction annotation using overlapping exons..."
#> [1] "2022-03-26 18:59:41 - Tidying junction annotation..."
#> [1] "2022-03-26 18:59:41 - Deriving junction categories..."
#> [1] "2022-03-26 18:59:46 - done!"
##### junction_filter #####
junctions <- junction_filter(
junctions,
types = c("ambig_gene", "unannotated")
)
#> [1] "2022-03-26 18:59:46 - Filtering junctions..."
#> [1] "2022-03-26 18:59:46 - by count..."
#> [1] "2022-03-26 18:59:46 - by type..."
#> [1] "2022-03-26 18:59:46 - done!"
##### junction_norm #####
junctions <- junction_norm(junctions)
#> [1] "2022-03-26 18:59:46 - Clustering junctions..."
#> [1] "2022-03-26 18:59:47 - Normalising junction counts..."
#> [1] "2022-03-26 18:59:47 - done!"
##### junction_score #####
junctions <- junction_score(junctions)
#> [1] "2022-03-26 18:59:47 - Calculating the direction of change of junctions..."
#> [1] "2022-03-26 18:59:47 - Generating junction abnormality score..."
#> [1] "2022-03-26 18:59:47 - done!"
##### junction_process #####
junctions_processed <- junction_process(
junctions_example,
ref,
types = c("ambig_gene", "unannotated")
)
#> [1] "# Filtering junctions -----------------------------------------------------"
#> [1] "2022-03-26 18:59:47 - Filtering junctions..."
#> [1] "2022-03-26 18:59:47 - by count..."
#> [1] "2022-03-26 18:59:47 - done!"
#> [1] "# Annotating junctions ----------------------------------------------------"
#> [1] "2022-03-26 18:59:47 - Obtaining co-ordinates of annotated exons and junctions..."
#> [1] "2022-03-26 19:00:01 - Getting junction annotation using overlapping exons..."
#> [1] "2022-03-26 19:00:02 - Tidying junction annotation..."
#> [1] "2022-03-26 19:00:03 - Deriving junction categories..."
#> [1] "2022-03-26 19:00:03 - done!"
#> [1] "# Filtering junctions -----------------------------------------------------"
#> [1] "2022-03-26 19:00:03 - Filtering junctions..."
#> [1] "2022-03-26 19:00:03 - by type..."
#> [1] "2022-03-26 19:00:03 - done!"
#> [1] "# Normalise junctions -----------------------------------------------------"
#> [1] "2022-03-26 19:00:03 - Clustering junctions..."
#> [1] "2022-03-26 19:00:03 - Normalising junction counts..."
#> [1] "2022-03-26 19:00:04 - done!"
#> [1] "# Score junctions ---------------------------------------------------------"
#> [1] "2022-03-26 19:00:04 - Calculating the direction of change of junctions..."
#> [1] "2022-03-26 19:00:04 - Generating junction abnormality score..."
#> [1] "2022-03-26 19:00:04 - done!"
# the two objects are equivalent
all.equal(junctions_processed, junctions, check.attributes = FALSE)
#> [1] TRUE