Processing junctions

The set of functions prefixed with "junction_" are used to process junction data. They are designed to be run in a sequential manner in the order junction_annot, junction_filter, junction_norm, junction_score. Or, alternatively the wrapper function junction_process can be used to run all 4 of the functions stated above in one go. For more details of the individual functions, see "Details".

junction_annot(
  junctions,
  ref,
  ref_cols = c("gene_id", "tx_name", "exon_id"),
  ref_cols_to_merge = c("gene_id")
)

junction_filter(
  junctions,
  count_thresh = c(raw = 5),
  n_samp = c(raw = 1),
  width_range = NULL,
  types = NULL,
  regions = NULL
)

junction_norm(junctions)

junction_process(
  junctions,
  ref,
  ref_cols = c("gene_id", "tx_name", "exon_name"),
  ref_cols_to_merge = c("gene_id"),
  count_thresh = c(raw = 5),
  n_samp = c(raw = 1),
  width_range = NULL,
  types = NULL,
  regions = NULL,
  score_func = .zscore,
  ...
)

junction_score(junctions, score_func = .zscore, ...)

Arguments

junctions: junction data as a RangedSummarizedExperiment-class object.
ref: either path to gtf/gff3 or object of class TxDb-class or EnsDb-class. EnsDb-class is required if you intend to annotate junctions with gene symbols/names.
ref_cols: character vector listing the names of the columns in ref for which to annotate junctions with. Must contain "gene_id", used for categorising junctions.
ref_cols_to_merge: character vector listing which of the annotation columns ref_cols should be merged into in columns to merge into a single column per junction. Must contain "gene_id", used for categorising junctions.
count_thresh: named vector with names matching the names of the assays in junctions. Values denote the number of counts below which a junction will be filtered out.
n_samp: named vector with names matching the names of the assays in junctions. Values denotes number of samples that have to express the junction above the count_thresh in order for that junction to not be filtered.
width_range: numeric vector of length 2. The first element denoting the lower limit of junction width and the second the upper limit. Junctions with widths outside this range will be filtered out.
types: any junctions matching these types, derived form junction_annot will be filtered out.
regions: any junctions overlapping this set of regions (in a GRanges-class format) will be filtered out.
score_func: function to score junctions by their abnormality. By default, will use a z-score but can be switched to a user-defined function. This function must take as input an x and y argument, containing case and control counts respectively. This must return a numeric vector equal to the length of x with elements corresponding to a abnormality of each junction.
...: additional arguments passed to score_func.

Value

RangedSummarizedExperiment-classobject containing filtered, annotated, normalised junction data with abnormality scores.

Details

junction_process wraps all "junction_" prefixed functions in dasper except junction_load. This is designed to simplify processing of the junction data for those familiar or uninterested with the intermediates.

junction_annot annotates junctions by 1. whether their start and/or end position precisely overlaps with an annotated exon boundary and 2. whether that junction matches an intron definition from existing annotation. Using this information along with the strand, junctions are categorised into "annotated", "novel_acceptor", "novel_donor", "novel_combo", "novel_exon_skip", "ambig_gene" and "unannotated".

junction_filter filters out "noisy" junctions based on counts, the width of junctions, annotation category of the junction returned from junction_annot and whether the junction overlaps with a set of (blacklist) regions.

junction_norm normalises the raw junction counts by 1. building junction clusters by finding junctions that share an acceptor or donor position and 2. calculating a proportion-spliced-in (PSI) for each junction by dividing the raw junction count by the total number of counts in it's associated cluster.

junction_score will use the counts contained within the "norm" assay to calculate a deviation of each patient junction from the expected distribution of control junction counts. The function used to calculate this abnormality score can be user-inputted or left as the default z-score. Junctions will also be labelled based on whether they are up-regulated (+1) or down-regulated (-1) with respect to controls junction and this information is stored in the assay "direction" for use in outlier_aggregate.

Functions

junction_annot: Annotate junctions using reference annotation
junction_filter: Filter junctions by count, width, annotation or region
junction_norm: Normalise junction counts by cluster
junction_score: Score patient junctions by their abnormality

Examples


##### Set up txdb #####

# use GenomicState to load txdb (GENCODE v31)
ref <- GenomicState::GenomicStateHub(
    version = "31",
    genome = "hg38",
    filetype = "TxDb"
)[[1]]
#> snapshotDate(): 2021-10-20
#> loading from cache

##### junction_annot #####

junctions <- junction_annot(junctions_example, ref)
#> [1] "2022-03-26 18:59:24 - Obtaining co-ordinates of annotated exons and junctions..."
#> [1] "2022-03-26 18:59:38 - Getting junction annotation using overlapping exons..."
#> [1] "2022-03-26 18:59:41 - Tidying junction annotation..."
#> [1] "2022-03-26 18:59:41 - Deriving junction categories..."
#> [1] "2022-03-26 18:59:46 - done!"

##### junction_filter #####

junctions <- junction_filter(
    junctions,
    types = c("ambig_gene", "unannotated")
)
#> [1] "2022-03-26 18:59:46 - Filtering junctions..."
#> [1] "2022-03-26 18:59:46 - by count..."
#> [1] "2022-03-26 18:59:46 - by type..."
#> [1] "2022-03-26 18:59:46 - done!"

##### junction_norm #####

junctions <- junction_norm(junctions)
#> [1] "2022-03-26 18:59:46 - Clustering junctions..."
#> [1] "2022-03-26 18:59:47 - Normalising junction counts..."
#> [1] "2022-03-26 18:59:47 - done!"

##### junction_score #####

junctions <- junction_score(junctions)
#> [1] "2022-03-26 18:59:47 - Calculating the direction of change of junctions..."
#> [1] "2022-03-26 18:59:47 - Generating junction abnormality score..."
#> [1] "2022-03-26 18:59:47 - done!"

##### junction_process #####

junctions_processed <- junction_process(
    junctions_example,
    ref,
    types = c("ambig_gene", "unannotated")
)
#> [1] "# Filtering junctions -----------------------------------------------------"
#> [1] "2022-03-26 18:59:47 - Filtering junctions..."
#> [1] "2022-03-26 18:59:47 - by count..."
#> [1] "2022-03-26 18:59:47 - done!"
#> [1] "# Annotating junctions ----------------------------------------------------"
#> [1] "2022-03-26 18:59:47 - Obtaining co-ordinates of annotated exons and junctions..."
#> [1] "2022-03-26 19:00:01 - Getting junction annotation using overlapping exons..."
#> [1] "2022-03-26 19:00:02 - Tidying junction annotation..."
#> [1] "2022-03-26 19:00:03 - Deriving junction categories..."
#> [1] "2022-03-26 19:00:03 - done!"
#> [1] "# Filtering junctions -----------------------------------------------------"
#> [1] "2022-03-26 19:00:03 - Filtering junctions..."
#> [1] "2022-03-26 19:00:03 - by type..."
#> [1] "2022-03-26 19:00:03 - done!"
#> [1] "# Normalise junctions -----------------------------------------------------"
#> [1] "2022-03-26 19:00:03 - Clustering junctions..."
#> [1] "2022-03-26 19:00:03 - Normalising junction counts..."
#> [1] "2022-03-26 19:00:04 - done!"
#> [1] "# Score junctions ---------------------------------------------------------"
#> [1] "2022-03-26 19:00:04 - Calculating the direction of change of junctions..."
#> [1] "2022-03-26 19:00:04 - Generating junction abnormality score..."
#> [1] "2022-03-26 19:00:04 - done!"

# the two objects are equivalent
all.equal(junctions_processed, junctions, check.attributes = FALSE)
#> [1] TRUE

Arguments

Value

Details

Functions

See also

Examples