Processing coverage — coverage

The set of functions prefixed with "coverage_" are used to process coverage data. They are designed to be run after you have processed your junctions in the order coverage_norm, coverage_score. Or, alternatively the wrapper function coverage_process can be used to run the 2 functions stated above in one go. For more details of the individual functions, see "Details".

coverage_norm(
  junctions,
  ref,
  unannot_width = 20,
  coverage_paths_case,
  coverage_paths_control,
  coverage_chr_control = NULL,
  load_func = .coverage_load,
  bp_param = BiocParallel::SerialParam(),
  norm_const = 1
)

coverage_process(
  junctions,
  ref,
  unannot_width = 20,
  coverage_paths_case,
  coverage_paths_control,
  coverage_chr_control = NULL,
  load_func = .coverage_load,
  bp_param = BiocParallel::SerialParam(),
  norm_const = 1,
  score_func = .zscore,
  ...
)

coverage_score(junctions, coverage, score_func = .zscore, ...)

Arguments

junctions: junction data as a RangedSummarizedExperiment-class object.
ref: either path to gtf/gff3 or object of class TxDb-class or EnsDb-class. EnsDb-class is required if you intend to annotate junctions with gene symbols/names.
unannot_width: integer scalar determining the width of the region to obtain coverage from when the end of of a junction does not overlap an existing exon.
coverage_paths_case: paths to the BigWig files containing the coverage of your case samples.
coverage_paths_control: paths to the BigWig files for control samples.
coverage_chr_control: either "chr" or "no_chr", indicating the chromosome format of control coverage data. Only required if the chromosome format of the control BigWig files is different to that of your cases.
load_func: a function to use to load coverage. Currently only for internal use to increase testing speed.
bp_param: a BiocParallelParam-class instance denoting whether to parallelise the loading of coverage across BigWig files.
norm_const: numeric scaler to add to the normalisation coverage to avoid dividing by 0s and resulting NaN or Inf values.
score_func: function to score junctions by their abnormality. By default, will use a z-score but can be switched to a user-defined function. This function must take as input an x and y argument, containing case and control counts respectively. This must return a numeric vector equal to the length of x with elements corresponding to a abnormality of each junction.
...: additional arguments passed to score_func.
coverage: list containing normalised coverage data that is outputted from coverage_norm.

Value

junctions as SummarizedExperimentobject with additional assays named "coverage_region" and "coverage_score". "coverage_region" labels the region of greatest disruption (1 = exon_start, 2 = exon_end, 3 = intron) and "coverage_score" contains the abnormality scores of the region with the greatest disruption.

Details

coverage_process wraps all "coverage_" prefixed functions in dasper. This is designed to simplify processing of the coverage data for those familiar or uninterested with the intermediates.

coverage_norm obtains regions of interest for each junction where coverage disruptions would be expected. These consist of the intron itself the overlapping exon definitions (if ends of junctions are annotated), picking the shortest exon when multiple overlap one end. If ends are unannotated, coverage_norm will use a user-defined width set by unannot_width. Then, coverage will be loaded using megadepth and normalised to a set region per junction. By default, the boundaries of each gene associated to a junction are used as the region to normalise to.

coverage_score will score disruptions in the coverage across the intronic/exonic regions associated with each junction. This abnormality score generated by score_func operates by calculating the deviation of the coverage in patients to a coverage across the same regions in controls. Then, for each junction it obtains the score of the region with the greatest disruption.

Functions

coverage_norm: Load and normalise coverage from RNA-sequencing data
coverage_score: Score coverage by their abnormality

Examples


##### Set up txdb #####

# use GenomicState to load txdb (GENCODE v31)
ref <- GenomicState::GenomicStateHub(
    version = "31",
    genome = "hg38",
    filetype = "TxDb"
)[[1]]
#> snapshotDate(): 2021-10-20
#> loading from cache
#> Loading required package: GenomicFeatures
#> Loading required package: BiocGenerics
#> 
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: ‘S4Vectors’
#> The following objects are masked from ‘package:base’:
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: GenomicRanges
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.

##### Set up BigWig #####

# obtain path to example bw on recount2
bw_path <- recount::download_study(
    project = "SRP012682",
    type = "samples",
    download = FALSE
)[[1]]
#> Setting options('download.file.method.GEOquery'='auto')
#> Setting options('GEOquery.inmemory.gpl'=FALSE)
# \dontshow{
# cache the bw for speed in later
# examples/testing during R CMD Check
bw_path <- dasper:::.file_cache(bw_path)
# }

##### junction_process #####

junctions_processed <- junction_process(
    junctions_example,
    ref,
    types = c("ambig_gene", "unannotated"),
)
#> [1] "# Filtering junctions -----------------------------------------------------"
#> [1] "2022-03-26 18:58:13 - Filtering junctions..."
#> [1] "2022-03-26 18:58:13 - by count..."
#> [1] "2022-03-26 18:58:13 - done!"
#> [1] "# Annotating junctions ----------------------------------------------------"
#> [1] "2022-03-26 18:58:13 - Obtaining co-ordinates of annotated exons and junctions..."
#> [1] "2022-03-26 18:58:33 - Getting junction annotation using overlapping exons..."
#> [1] "2022-03-26 18:58:34 - Tidying junction annotation..."
#> [1] "2022-03-26 18:58:34 - Deriving junction categories..."
#> [1] "2022-03-26 18:58:35 - done!"
#> [1] "# Filtering junctions -----------------------------------------------------"
#> [1] "2022-03-26 18:58:35 - Filtering junctions..."
#> [1] "2022-03-26 18:58:35 - by type..."
#> [1] "2022-03-26 18:58:35 - done!"
#> [1] "# Normalise junctions -----------------------------------------------------"
#> [1] "2022-03-26 18:58:35 - Clustering junctions..."
#> [1] "2022-03-26 18:58:35 - Normalising junction counts..."
#> [1] "2022-03-26 18:58:35 - done!"
#> [1] "# Score junctions ---------------------------------------------------------"
#> [1] "2022-03-26 18:58:35 - Calculating the direction of change of junctions..."
#> [1] "2022-03-26 18:58:35 - Generating junction abnormality score..."
#> [1] "2022-03-26 18:58:36 - done!"

##### install megadepth #####

# required to load coverage in coverage_norm()
megadepth::install_megadepth(force = FALSE)
#> The latest megadepth version is 1.2.0
#> This is not an interactive session, therefore megadepth has been installed temporarily to 
#> /tmp/RtmpMrdPeC/megadepth

##### coverage_norm #####

coverage_normed <- coverage_norm(
    junctions_processed,
    ref,
    unannot_width = 20,
    coverage_paths_case = rep(bw_path, 2),
    coverage_paths_control = rep(bw_path, 2)
)
#> [1] "2022-03-26 18:58:37 - Obtaining exonic and intronic regions to load coverage from..."
#> [1] "2022-03-26 18:58:38 - Obtaining regions to use to normalise coverage..."
#> [1] "2022-03-26 18:58:43 - Loading coverage..."
#> [1] "2022-03-26 18:58:53 - Normalising coverage..."
#> [1] "2022-03-26 18:58:53 - done!"

##### coverage_score #####

junctions <- coverage_score(junctions_processed, coverage_normed)
#> [1] "2022-03-26 18:58:53 - Generating coverage abnormality score..."
#> [1] "2022-03-26 18:58:54 - Obtaining regions with greatest coverage dysruption..."
#> [1] "2022-03-26 18:58:54 - done!"

##### coverage_process #####

# this wrapper will obtain coverage scores identical to those
# obtained through running the individual wrapped functions shown below
junctions_w_coverage <- coverage_process(
    junctions_processed,
    ref,
    coverage_paths_case = rep(bw_path, 2),
    coverage_paths_control = rep(bw_path, 3)
)
#> [1] "# Loading and normalising coverage ---------------------------------------------"
#> [1] "2022-03-26 18:58:54 - Obtaining exonic and intronic regions to load coverage from..."
#> [1] "2022-03-26 18:58:54 - Obtaining regions to use to normalise coverage..."
#> [1] "2022-03-26 18:59:00 - Loading coverage..."
#> [1] "2022-03-26 18:59:12 - Normalising coverage..."
#> [1] "2022-03-26 18:59:12 - done!"
#> [1] "# Scoring coverage ---------------------------------------------"
#> [1] "2022-03-26 18:59:12 - Generating coverage abnormality score..."
#> [1] "2022-03-26 18:59:13 - Obtaining regions with greatest coverage dysruption..."
#> [1] "2022-03-26 18:59:13 - done!"

# the two objects are equivalent
all.equal(junctions_w_coverage, junctions, check.attributes = FALSE)
#> [1] TRUE