The functions prefixed with "outlier_" are used to detect outliers. They are designed to be run after you have extracted your junction- and coverage-based features, in the order outlier_detect, then outlier_aggregate. Alternatively, the wrapper function outlier_process can be used to run these two functions in one go. For more details on the individual functions, see "Details".

outlier_aggregate(
  junctions,
  samp_id_col = "samp_id",
  bp_param = BiocParallel::SerialParam()
)

outlier_detect(
  junctions,
  feature_names = c("score", "coverage_score"),
  bp_param = BiocParallel::SerialParam(),
  ...
)

outlier_process(
  junctions,
  feature_names = c("score", "coverage_score"),
  samp_id_col = "samp_id",
  bp_param = BiocParallel::SerialParam(),
  ...
)

Arguments

junctions

junction data as a RangedSummarizedExperiment-class object.

samp_id_col

name of the column in the SummarizedExperiment that contains the sample IDs.

bp_param

a BiocParallelParam-class instance denoting whether to parallelise the calculation of outlier scores across samples.
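
For example, outlier scores could be computed across samples in parallel (a minimal sketch; MulticoreParam is one of several BiocParallelParam back-ends and falls back to serial evaluation on Windows):

# `junctions_w_coverage` as produced by coverage_process() in the Examples section
junctions_w_outliers <- outlier_detect(
    junctions_w_coverage,
    bp_param = BiocParallel::MulticoreParam(workers = 2)
)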

feature_names

names of the assays in junctions that are to be used as input to the outlier detection model.

...

additional arguments passed to the outlier detection model (an isolation forest) to set its parameters.
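
For example, parameters of the isolation forest can be tuned through ... (a sketch assuming these are forwarded to scikit-learn's IsolationForest, as the parameter names printed in the Examples' log output suggest):

# hypothetical tuning of the isolation forest via `...`; integer-valued
# parameters are given as integers (e.g. 200L) in case they are passed to Python
junctions_w_outliers <- outlier_detect(
    junctions_w_coverage,
    n_estimators = 200L,
    random_state = 32L
)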

Value

DataFrame with one row per cluster detailing each cluster's associated junctions, outlier scores, ranks and genes.

Details

outlier_process wraps all "outlier_"-prefixed functions in dasper. It is designed to simplify the detection of outlier junctions for users who are unfamiliar with, or uninterested in, the intermediate steps.
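
Conceptually, the wrapper should be equivalent to chaining the two steps yourself (a minimal sketch, assuming default arguments throughout):

# these two routes are expected to give equivalent cluster-level results
outlier_scores <- outlier_process(junctions_w_coverage)
outlier_scores <- outlier_aggregate(outlier_detect(junctions_w_coverage))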

outlier_detect will use the features in the assays named in feature_names as input to an unsupervised outlier detection algorithm, scoring each junction by how much of an outlier it appears in relation to the other junctions in the patient. The default expected features, score and coverage_score, can be calculated using junction_process and coverage_process respectively.
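
For example, if coverage_process has not been run and no coverage_score assay is available, one could (as an assumption, not a documented workflow) restrict the input to the junction-derived feature:

# use only the junction-level "score" assay as input to the outlier model
junctions_w_outliers <- outlier_detect(
    junctions_processed,
    feature_names = "score"
)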

outlier_aggregate will aggregate the outlier scores to the cluster level. It will then rank each cluster based on this aggregated score and annotate each cluster with its associated gene and transcript.
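
For example (a sketch mirroring the Examples section, with the default sample ID column made explicit):

# aggregate per-junction outlier scores to cluster level, rank the clusters
# and annotate them with gene details
outlier_scores <- outlier_aggregate(
    junctions_w_outliers,
    samp_id_col = "samp_id"
)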

Functions

  • outlier_aggregate: Aggregate outlier scores from the junction level to the cluster level

  • outlier_detect: Detect outlier junctions

See also

For more details on the isolation forest model used, see: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html

Examples

# \donttest{

##### Set up txdb #####

# use GenomicState to load txdb (GENCODE v31)
ref <- GenomicState::GenomicStateHub(
    version = "31",
    genome = "hg38",
    filetype = "TxDb"
)[[1]]
#> snapshotDate(): 2021-10-20
#> loading from cache

##### Set up BigWig #####

# obtain path to example bw on recount2
bw_path <- recount::download_study(
    project = "SRP012682",
    type = "samples",
    download = FALSE
)[[1]]

# cache the bw for speed in later
# examples/testing during R CMD check
bw_path <- dasper:::.file_cache(bw_path)


##### junction_process #####

junctions_processed <- junction_process(
    junctions_example,
    ref,
    types = c("ambig_gene", "unannotated")
)
#> [1] "# Filtering junctions -----------------------------------------------------"
#> [1] "2022-03-26 19:00:13 - Filtering junctions..."
#> [1] "2022-03-26 19:00:13 - by count..."
#> [1] "2022-03-26 19:00:14 - done!"
#> [1] "# Annotating junctions ----------------------------------------------------"
#> [1] "2022-03-26 19:00:14 - Obtaining co-ordinates of annotated exons and junctions..."
#> [1] "2022-03-26 19:00:27 - Getting junction annotation using overlapping exons..."
#> [1] "2022-03-26 19:00:28 - Tidying junction annotation..."
#> [1] "2022-03-26 19:00:28 - Deriving junction categories..."
#> [1] "2022-03-26 19:00:29 - done!"
#> [1] "# Filtering junctions -----------------------------------------------------"
#> [1] "2022-03-26 19:00:29 - Filtering junctions..."
#> [1] "2022-03-26 19:00:29 - by type..."
#> [1] "2022-03-26 19:00:29 - done!"
#> [1] "# Normalise junctions -----------------------------------------------------"
#> [1] "2022-03-26 19:00:29 - Clustering junctions..."
#> [1] "2022-03-26 19:00:29 - Normalising junction counts..."
#> [1] "2022-03-26 19:00:30 - done!"
#> [1] "# Score junctions ---------------------------------------------------------"
#> [1] "2022-03-26 19:00:30 - Calculating the direction of change of junctions..."
#> [1] "2022-03-26 19:00:30 - Generating junction abnormality score..."
#> [1] "2022-03-26 19:00:30 - done!"

##### coverage_process #####

junctions_w_coverage <- coverage_process(
    junctions_processed,
    ref,
    coverage_paths_case = rep(bw_path, 2),
    coverage_paths_control = rep(bw_path, 3)
)
#> [1] "# Loading and normalising coverage ---------------------------------------------"
#> [1] "2022-03-26 19:00:30 - Obtaining exonic and intronic regions to load coverage from..."
#> [1] "2022-03-26 19:00:30 - Obtaining regions to use to normalise coverage..."
#> [1] "2022-03-26 19:00:35 - Loading coverage..."
#> [1] "2022-03-26 19:00:47 - Normalising coverage..."
#> [1] "2022-03-26 19:00:47 - done!"
#> [1] "# Scoring coverage ---------------------------------------------"
#> [1] "2022-03-26 19:00:47 - Generating coverage abnormality score..."
#> [1] "2022-03-26 19:00:47 - Obtaining regions with greatest coverage dysruption..."
#> [1] "2022-03-26 19:00:47 - done!"

##### outlier_detect #####

junctions_w_outliers <- outlier_detect(junctions_w_coverage)
#> [1] "2022-03-26 19:00:47 - generating outlier scores for sample 1/2"
#> [1] "2022-03-26 19:00:50 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:51 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:51 - generating outlier scores for sample 2/2"
#> [1] "2022-03-26 19:00:51 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:51 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:51 - done!"

##### outlier_aggregate #####

outlier_scores <- outlier_aggregate(junctions_w_outliers)
#> [1] "2022-03-26 19:00:52 - Aggregating outlier scores to cluster level... "
#> [1] "2022-03-26 19:00:52 - Annotating clusters with gene details..."
#> [1] "2022-03-26 19:00:52 - done!"

##### outlier_process #####

# this wrapper will obtain outlier scores identical to those
# obtained by running the individual wrapped functions shown above
outlier_processed <- outlier_process(junctions_w_coverage)
#> [1] "# Detecting outliers using an isolation forest ------------------------------"
#> [1] "2022-03-26 19:00:52 - generating outlier scores for sample 1/2"
#> [1] "2022-03-26 19:00:52 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:53 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:53 - generating outlier scores for sample 2/2"
#> [1] "2022-03-26 19:00:53 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:54 - fitting outlier detection model with parameters: behaviour=deprecated, bootstrap=FALSE, contamination=auto, max_features=1, max_samples=auto, n_estimators=100, n_jobs=NULL, random_state=NULL, verbose=0, warm_start=FALSE"
#> [1] "2022-03-26 19:00:54 - done!"
#> [1] "# Aggregating outlier scores to cluster-level -----------------------------"
#> [1] "2022-03-26 19:00:54 - Aggregating outlier scores to cluster level... "
#> [1] "2022-03-26 19:00:54 - Annotating clusters with gene details..."
#> [1] "2022-03-26 19:00:54 - done!"

# the two objects are equivalent
all.equal(outlier_processed, outlier_scores, check.attributes = FALSE)
#> [1] TRUE
# }