R/coverage_norm.R
, R/coverage_process.R
, R/coverage_score.R
coverage_process.Rd
The set of functions prefixed with "coverage_" are used to
process coverage data. They are designed to be run after you have processed
your junctions in the order coverage_norm
, coverage_score
. Or,
alternatively the wrapper function coverage_process
can be used to run
the 2 functions stated above in one go. For more details of the individual
functions, see "Details".
coverage_norm(
junctions,
ref,
unannot_width = 20,
coverage_paths_case,
coverage_paths_control,
coverage_chr_control = NULL,
load_func = .coverage_load,
bp_param = BiocParallel::SerialParam(),
norm_const = 1
)
coverage_process(
junctions,
ref,
unannot_width = 20,
coverage_paths_case,
coverage_paths_control,
coverage_chr_control = NULL,
load_func = .coverage_load,
bp_param = BiocParallel::SerialParam(),
norm_const = 1,
score_func = .zscore,
...
)
coverage_score(junctions, coverage, score_func = .zscore, ...)
junction data as a RangedSummarizedExperiment-class object.
either path to gtf/gff3 or object of class TxDb-class or EnsDb-class. EnsDb-class is required if you intend to annotate junctions with gene symbols/names.
integer scalar determining the width of the region to obtain coverage from when the end of of a junction does not overlap an existing exon.
paths to the BigWig files containing the coverage of your case samples.
paths to the BigWig files for control samples.
either "chr" or "no_chr", indicating the chromosome format of control coverage data. Only required if the chromosome format of the control BigWig files is different to that of your cases.
a function to use to load coverage. Currently only for internal use to increase testing speed.
a BiocParallelParam-class instance denoting whether to parallelise the loading of coverage across BigWig files.
numeric scaler to add to the normalisation coverage to avoid dividing by 0s and resulting NaN or Inf values.
function to score junctions by their abnormality. By
default, will use a z-score but can be switched to a user-defined function.
This function must take as input an x
and y
argument, containing case
and control counts respectively. This must return a numeric vector equal to
the length of x
with elements corresponding to a abnormality of each
junction.
additional arguments passed to score_func
.
list containing normalised coverage data that is outputted from coverage_norm.
junctions as
SummarizedExperimentobject with additional assays
named "coverage_region" and
"coverage_score". "coverage_region" labels the region of greatest
disruption (1 = exon_start, 2 = exon_end, 3 = intron) and "coverage_score"
contains the abnormality scores of the region with the greatest disruption.
coverage_process
wraps all "coverage_" prefixed functions in
dasper. This is designed to simplify processing of the coverage data for
those familiar or uninterested with the intermediates.
coverage_norm
obtains regions of interest for each junction where
coverage disruptions would be expected. These consist of the intron itself
the overlapping exon definitions (if ends of junctions are annotated),
picking the shortest exon when multiple overlap one end. If ends are
unannotated, coverage_norm
will use a user-defined width set by
unannot_width
. Then, coverage will be loaded using
megadepth and
normalised to a set region per junction. By default, the boundaries of
each gene associated to a junction are used as the region to normalise to.
coverage_score
will score disruptions in the coverage across the
intronic/exonic regions associated with each junction. This abnormality
score generated by score_func
operates by calculating the deviation of
the coverage in patients to a coverage across the same regions in controls.
Then, for each junction it obtains the score of the region with the
greatest disruption.
coverage_norm
: Load and normalise coverage from RNA-sequencing data
coverage_score
: Score coverage by their abnormality
##### Set up txdb #####
# use GenomicState to load txdb (GENCODE v31)
ref <- GenomicState::GenomicStateHub(
version = "31",
genome = "hg38",
filetype = "TxDb"
)[[1]]
#> snapshotDate(): 2021-10-20
#> loading from cache
#> Loading required package: GenomicFeatures
#> Loading required package: BiocGenerics
#>
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: ‘S4Vectors’
#> The following objects are masked from ‘package:base’:
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: GenomicRanges
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
##### Set up BigWig #####
# obtain path to example bw on recount2
bw_path <- recount::download_study(
project = "SRP012682",
type = "samples",
download = FALSE
)[[1]]
#> Setting options('download.file.method.GEOquery'='auto')
#> Setting options('GEOquery.inmemory.gpl'=FALSE)
# \dontshow{
# cache the bw for speed in later
# examples/testing during R CMD Check
bw_path <- dasper:::.file_cache(bw_path)
# }
##### junction_process #####
junctions_processed <- junction_process(
junctions_example,
ref,
types = c("ambig_gene", "unannotated"),
)
#> [1] "# Filtering junctions -----------------------------------------------------"
#> [1] "2022-03-26 18:58:13 - Filtering junctions..."
#> [1] "2022-03-26 18:58:13 - by count..."
#> [1] "2022-03-26 18:58:13 - done!"
#> [1] "# Annotating junctions ----------------------------------------------------"
#> [1] "2022-03-26 18:58:13 - Obtaining co-ordinates of annotated exons and junctions..."
#> [1] "2022-03-26 18:58:33 - Getting junction annotation using overlapping exons..."
#> [1] "2022-03-26 18:58:34 - Tidying junction annotation..."
#> [1] "2022-03-26 18:58:34 - Deriving junction categories..."
#> [1] "2022-03-26 18:58:35 - done!"
#> [1] "# Filtering junctions -----------------------------------------------------"
#> [1] "2022-03-26 18:58:35 - Filtering junctions..."
#> [1] "2022-03-26 18:58:35 - by type..."
#> [1] "2022-03-26 18:58:35 - done!"
#> [1] "# Normalise junctions -----------------------------------------------------"
#> [1] "2022-03-26 18:58:35 - Clustering junctions..."
#> [1] "2022-03-26 18:58:35 - Normalising junction counts..."
#> [1] "2022-03-26 18:58:35 - done!"
#> [1] "# Score junctions ---------------------------------------------------------"
#> [1] "2022-03-26 18:58:35 - Calculating the direction of change of junctions..."
#> [1] "2022-03-26 18:58:35 - Generating junction abnormality score..."
#> [1] "2022-03-26 18:58:36 - done!"
##### install megadepth #####
# required to load coverage in coverage_norm()
megadepth::install_megadepth(force = FALSE)
#> The latest megadepth version is 1.2.0
#> This is not an interactive session, therefore megadepth has been installed temporarily to
#> /tmp/RtmpMrdPeC/megadepth
##### coverage_norm #####
coverage_normed <- coverage_norm(
junctions_processed,
ref,
unannot_width = 20,
coverage_paths_case = rep(bw_path, 2),
coverage_paths_control = rep(bw_path, 2)
)
#> [1] "2022-03-26 18:58:37 - Obtaining exonic and intronic regions to load coverage from..."
#> [1] "2022-03-26 18:58:38 - Obtaining regions to use to normalise coverage..."
#> [1] "2022-03-26 18:58:43 - Loading coverage..."
#> [1] "2022-03-26 18:58:53 - Normalising coverage..."
#> [1] "2022-03-26 18:58:53 - done!"
##### coverage_score #####
junctions <- coverage_score(junctions_processed, coverage_normed)
#> [1] "2022-03-26 18:58:53 - Generating coverage abnormality score..."
#> [1] "2022-03-26 18:58:54 - Obtaining regions with greatest coverage dysruption..."
#> [1] "2022-03-26 18:58:54 - done!"
##### coverage_process #####
# this wrapper will obtain coverage scores identical to those
# obtained through running the individual wrapped functions shown below
junctions_w_coverage <- coverage_process(
junctions_processed,
ref,
coverage_paths_case = rep(bw_path, 2),
coverage_paths_control = rep(bw_path, 3)
)
#> [1] "# Loading and normalising coverage ---------------------------------------------"
#> [1] "2022-03-26 18:58:54 - Obtaining exonic and intronic regions to load coverage from..."
#> [1] "2022-03-26 18:58:54 - Obtaining regions to use to normalise coverage..."
#> [1] "2022-03-26 18:59:00 - Loading coverage..."
#> [1] "2022-03-26 18:59:12 - Normalising coverage..."
#> [1] "2022-03-26 18:59:12 - done!"
#> [1] "# Scoring coverage ---------------------------------------------"
#> [1] "2022-03-26 18:59:12 - Generating coverage abnormality score..."
#> [1] "2022-03-26 18:59:13 - Obtaining regions with greatest coverage dysruption..."
#> [1] "2022-03-26 18:59:13 - done!"
# the two objects are equivalent
all.equal(junctions_w_coverage, junctions, check.attributes = FALSE)
#> [1] TRUE