R/outlier_aggregate.R, R/outlier_detect.R, R/outlier_process.R
outlier_process.Rd
The set of functions prefixed with "outlier_" are used to detect outliers. They are designed to be run after you have extracted your junction and coverage-based features, in the order outlier_detect, then outlier_aggregate. Alternatively, the wrapper function outlier_process can be used to run these two functions in one go. For more details on the individual functions, see "Details".
outlier_aggregate(
    junctions,
    samp_id_col = "samp_id",
    bp_param = BiocParallel::SerialParam()
)

outlier_detect(
    junctions,
    feature_names = c("score", "coverage_score"),
    bp_param = BiocParallel::SerialParam(),
    ...
)

outlier_process(
    junctions,
    feature_names = c("score", "coverage_score"),
    samp_id_col = "samp_id",
    bp_param = BiocParallel::SerialParam(),
    ...
)
junctions: junction data as a RangedSummarizedExperiment-class object.

samp_id_col: name of the column in the SummarizedExperiment that details the sample ids.

bp_param: a BiocParallelParam-class instance denoting whether to parallelise the calculation of outlier scores across samples.

feature_names: names of the assays in junctions that are to be used as input into the outlier detection model.

...: additional arguments passed to the outlier detection model (isolation forest) for setting parameters.
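For example, the per-sample outlier scoring can be parallelised by supplying a parallel back-end via bp_param. A minimal sketch, assuming junctions_w_coverage has been prepared as in the examples below and that you are on a Unix-alike machine (on Windows, BiocParallel::SnowParam() could be used instead):

# run the outlier scoring across samples on 2 workers
junctions_w_outliers <- outlier_detect(
    junctions_w_coverage,
    bp_param = BiocParallel::MulticoreParam(workers = 2)
)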
A DataFrame with one row per cluster, detailing each cluster's associated junctions, outlier scores, ranks and genes.
outlier_process wraps all "outlier_"-prefixed functions in dasper. It is designed to simplify the detection of outlier junctions for those who are unfamiliar with, or uninterested in, the intermediate steps.
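As a minimal sketch, assuming junctions already carries the junction- and coverage-based features (e.g. from junction_process and coverage_process), the two routes below give equivalent results:

# route 1: run the two steps individually
junctions_w_outliers <- outlier_detect(junctions)
outlier_scores <- outlier_aggregate(junctions_w_outliers)

# route 2: use the wrapper in one go
outlier_scores <- outlier_process(junctions)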
outlier_detect will use the features in the assays named in feature_names as input into an unsupervised outlier detection algorithm, scoring each junction based on how much of an outlier it appears in relation to the other junctions in the patient. The default expected features, score and coverage_score, can be calculated using junction_process and coverage_process respectively.
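For instance, a minimal sketch of scoring junctions with the default features named explicitly, assuming junctions_w_coverage was produced by junction_process followed by coverage_process as in the examples below:

# check that the expected feature assays are present
SummarizedExperiment::assayNames(junctions_w_coverage)

# score each junction using the default features
junctions_w_outliers <- outlier_detect(
    junctions_w_coverage,
    feature_names = c("score", "coverage_score")
)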
outlier_aggregate will aggregate the outlier scores from the junction level to the cluster level. It will then rank each cluster based on this aggregated score and annotate each cluster with its associated gene and transcript.
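A minimal sketch of the aggregation step and a quick inspection of the returned DataFrame (the exact columns are not shown here and may vary between dasper versions):

# aggregate per-junction outlier scores to one row per cluster
outlier_scores <- outlier_aggregate(junctions_w_outliers)

# inspect the cluster-level results
dim(outlier_scores)
head(outlier_scores)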
outlier_aggregate: Aggregate outlier scores from the junction level to the cluster level.

outlier_detect: Detect outlier junctions.
See https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html for more details on the isolation forest model used.
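Parameters of the isolation forest can be tuned by passing them through ...; a minimal sketch, where the parameter names follow scikit-learn's IsolationForest (as echoed in the log output of the examples below) and the values are purely illustrative:

# pass isolation forest parameters through ... (illustrative values only)
junctions_w_outliers <- outlier_detect(
    junctions_w_coverage,
    n_estimators = 200L,
    random_state = 32L
)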
# \donttest{
##### Set up txdb #####
# use GenomicState to load txdb (GENCODE v31)
ref <- GenomicState::GenomicStateHub(
    version = "31",
    genome = "hg38",
    filetype = "TxDb"
)[[1]]
#> snapshotDate(): 2021-10-20
#> loading from cache
##### Set up BigWig #####
# obtain path to example bw on recount2
bw_path <- recount::download_study(
    project = "SRP012682",
    type = "samples",
    download = FALSE
)[[1]]
# cache the bw for speed in later
# examples/testing during R CMD Check
bw_path <- dasper:::.file_cache(bw_path)
##### junction_process #####
junctions_processed <- junction_process(
    junctions_example,
    ref,
    types = c("ambig_gene", "unannotated")
)
#> [1] "# Filtering junctions -----------------------------------------------------"
#> [1] "2022-03-26 19:00:13 - Filtering junctions..."
#> [1] "2022-03-26 19:00:13 - by count..."
#> [1] "2022-03-26 19:00:14 - done!"
#> [1] "# Annotating junctions ----------------------------------------------------"
#> [1] "2022-03-26 19:00:14 - Obtaining co-ordinates of annotated exons and junctions..."
#> [1] "2022-03-26 19:00:27 - Getting junction annotation using overlapping exons..."
#> [1] "2022-03-26 19:00:28 - Tidying junction annotation..."
#> [1] "2022-03-26 19:00:28 - Deriving junction categories..."
#> [1] "2022-03-26 19:00:29 - done!"
#> [1] "# Filtering junctions -----------------------------------------------------"
#> [1] "2022-03-26 19:00:29 - Filtering junctions..."
#> [1] "2022-03-26 19:00:29 - by type..."
#> [1] "2022-03-26 19:00:29 - done!"
#> [1] "# Normalise junctions -----------------------------------------------------"
#> [1] "2022-03-26 19:00:29 - Clustering junctions..."
#> [1] "2022-03-26 19:00:29 - Normalising junction counts..."
#> [1] "2022-03-26 19:00:30 - done!"
#> [1] "# Score junctions ---------------------------------------------------------"
#> [1] "2022-03-26 19:00:30 - Calculating the direction of change of junctions..."
#> [1] "2022-03-26 19:00:30 - Generating junction abnormality score..."
#> [1] "2022-03-26 19:00:30 - done!"
##### coverage_process #####
junctions_w_coverage <- coverage_process(
    junctions_processed,
    ref,
    coverage_paths_case = rep(bw_path, 2),
    coverage_paths_control = rep(bw_path, 3)
)
#> [1] "# Loading and normalising coverage ---------------------------------------------"
#> [1] "2022-03-26 19:00:30 - Obtaining exonic and intronic regions to load coverage from..."
#> [1] "2022-03-26 19:00:30 - Obtaining regions to use to normalise coverage..."
#> [1] "2022-03-26 19:00:35 - Loading coverage..."
#> [1] "2022-03-26 19:00:47 - Normalising coverage..."
#> [1] "2022-03-26 19:00:47 - done!"
#> [1] "# Scoring coverage ---------------------------------------------"
#> [1] "2022-03-26 19:00:47 - Generating coverage abnormality score..."
#> [1] "2022-03-26 19:00:47 - Obtaining regions with greatest coverage dysruption..."
#> [1] "2022-03-26 19:00:47 - done!"
##### outlier_detect #####
junctions_w_outliers <- outlier_detect(junctions_w_coverage)
#> [1] "2022-03-26 19:00:47 - generating outlier scores for sample 1/2"
#> [1] "2022-03-26 19:00:50 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:51 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:51 - generating outlier scores for sample 2/2"
#> [1] "2022-03-26 19:00:51 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:51 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:51 - done!"
##### outlier_aggregate #####
outlier_scores <- outlier_aggregate(junctions_w_outliers)
#> [1] "2022-03-26 19:00:52 - Aggregating outlier scores to cluster level... "
#> [1] "2022-03-26 19:00:52 - Annotating clusters with gene details..."
#> [1] "2022-03-26 19:00:52 - done!"
##### outlier_process #####
# this wrapper will obtain outlier scores identical to those
# obtained by running the individual wrapped functions shown above
outlier_processed <- outlier_process(junctions_w_coverage)
#> [1] "# Detecting outliers using an isolation forest ------------------------------"
#> [1] "2022-03-26 19:00:52 - generating outlier scores for sample 1/2"
#> [1] "2022-03-26 19:00:52 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:53 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:53 - generating outlier scores for sample 2/2"
#> [1] "2022-03-26 19:00:53 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:54 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:54 - done!"
#> [1] "# Aggregating outlier scores to cluster-level -----------------------------"
#> [1] "2022-03-26 19:00:54 - Aggregating outlier scores to cluster level... "
#> [1] "2022-03-26 19:00:54 - Annotating clusters with gene details..."
#> [1] "2022-03-26 19:00:54 - done!"
# the two objects are equivalent
all.equal(outlier_processed, outlier_scores, check.attributes = FALSE)
#> [1] TRUE
# }