Capture all matches in a single subject string

2025-01-21

The nc::capture_all_str function is for the common case of extracting each match from a multi-line text file (a single large subject string). In this section we demonstrate how to extract data tables from such loosely structured text data. For example we consider the following track hub meta-data file:

trackDb.txt.gz <- system.file(
  "extdata", "trackDb.txt.gz", package="nc")
trackDb.vec <- readLines(trackDb.txt.gz)

Some representative lines from that file are shown below.

cat(trackDb.vec[78:107], sep="\n")
#> track peaks_summary
#> type bigBed 5
#> shortLabel _model_peaks_summary
#> longLabel Regions with a peak in at least one sample
#> visibility pack
#> itemRgb off
#> spectrum on
#> bigDataUrl http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/peaks_summary.bigBed
#> 
#> 
#>  track bcell_McGill0091
#>  parent bcell
#>  container multiWig
#>  type bigWig
#>  shortLabel bcell_McGill0091
#>  longLabel bcell | McGill0091
#>  graphType points
#>  aggregate transparentOverlay
#>  showSubtrackColorOnUi on
#>  maxHeightPixels 25:12:8
#>  visibility full
#>  autoScale on
#> 
#>   track bcell_McGill0091Coverage
#>   bigDataUrl http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/bcell/McGill0091/coverage.bigWig
#>   shortLabel bcell_McGill0091Coverage
#>   longLabel bcell | McGill0091 | Coverage
#>   parent bcell_McGill0091
#>   type bigWig
#>   color 141,211,199

Match all tracks in the text file

Each block of text begins with “track” and includes several lines of data before the block ends with two consecutive newlines. That pattern is coded below using a regex:

tracks.dt <- nc::capture_all_str(
  trackDb.vec, 
  "track ",
  track="\\S+",
  fields="(?:\n[^\n]+)*",
  "\n")
str(tracks.dt)
#> Classes 'data.table' and 'data.frame':   123 obs. of  2 variables:
#>  $ track : chr  "bcell" "kidneyCancer" "kidney" "leukemiaCD19CD10BCells" ...
#>  $ fields: chr  "\nsuperTrack on show\nshortLabel bcell\nlongLabel bcell ChIP-seq samples" "\nsuperTrack on show\nshortLabel kidneyCancer\nlongLabel kidneyCancer ChIP-seq samples" "\nsuperTrack on show\nshortLabel kidney\nlongLabel kidney ChIP-seq samples" "\nsuperTrack on show\nshortLabel leukemiaCD19CD10BCells\nlongLabel leukemiaCD19CD10BCells ChIP-seq samples" ...
#>  - attr(*, ".internal.selfref")=<externalptr>

The result is a data.table with one row for each track block that matches the regex. There are two character columns: track is a unique name, and fields is a string with the rest of the data in that block:

tracks.dt[, .(track, fields.start=substr(fields, 1, 30))]
#>                         track                     fields.start
#>                        <char>                           <char>
#>   1:                    bcell \nsuperTrack on show\nshortLabel
#>   2:             kidneyCancer \nsuperTrack on show\nshortLabel
#>   3:                   kidney \nsuperTrack on show\nshortLabel
#>   4:   leukemiaCD19CD10BCells \nsuperTrack on show\nshortLabel
#>   5:                 monocyte \nsuperTrack on show\nshortLabel
#>  ---                                                          
#> 119: tcell_McGill0106Coverage  \n  bigDataUrl http://hubs.hpc.
#> 120:    tcell_McGill0106Peaks  \n  bigDataUrl http://hubs.hpc.
#> 121:         tcell_McGill0107 \n parent tcell\n container mult
#> 122: tcell_McGill0107Coverage  \n  bigDataUrl http://hubs.hpc.
#> 123:    tcell_McGill0107Peaks  \n  bigDataUrl http://hubs.hpc.

Match all fields in each track

Each block has a variable number of lines/fields. Each line starts with a field name, followed by a space, followed by the field value. That regex is coded below:

(fields.dt <- tracks.dt[, nc::capture_all_str(
  fields,
  "\\s+",
  variable=".*?",
  " ",
  value="[^\n]+"),  
  by=track])
#>                      track   variable                      value
#>                     <char>     <char>                     <char>
#>   1:                 bcell superTrack                    on show
#>   2:                 bcell shortLabel                      bcell
#>   3:                 bcell  longLabel     bcell ChIP-seq samples
#>   4:          kidneyCancer superTrack                    on show
#>   5:          kidneyCancer shortLabel               kidneyCancer
#>  ---                                                            
#> 899: tcell_McGill0107Peaks shortLabel      tcell_McGill0107Peaks
#> 900: tcell_McGill0107Peaks  longLabel tcell | McGill0107 | Peaks
#> 901: tcell_McGill0107Peaks     parent           tcell_McGill0107
#> 902: tcell_McGill0107Peaks       type                     bigWig
#> 903: tcell_McGill0107Peaks      color                      0,0,0
str(fields.dt)
#> Classes 'data.table' and 'data.frame':   903 obs. of  3 variables:
#>  $ track   : chr  "bcell" "bcell" "bcell" "kidneyCancer" ...
#>  $ variable: chr  "superTrack" "shortLabel" "longLabel" "superTrack" ...
#>  $ value   : chr  "on show" "bcell" "bcell ChIP-seq samples" "on show" ...
#>  - attr(*, ".internal.selfref")=<externalptr>

Note that because by=track was specified, nc::capture_all_str is called for each unique value of track (i.e. each row). The results are combined into a single data.table with one row for each field. This data.table can be easily queried, e.g.

fields.dt[
  J("tcell_McGill0107Coverage", "bigDataUrl"),
  value,
  on=.(track, variable)]
#> [1] "http://hubs.hpc.mcgill.ca/~thocking/PeakSegFPOP-/samples/tcell/McGill0107/coverage.bigWig"
fields.dt[, .(count=.N), by=variable][order(count)]
#>                  variable count
#>                    <char> <int>
#>  1:               itemRgb     4
#>  2:              spectrum     4
#>  3:            superTrack     8
#>  4:             container    37
#>  5:             graphType    37
#>  6:             aggregate    37
#>  7: showSubtrackColorOnUi    37
#>  8:       maxHeightPixels    37
#>  9:             autoScale    37
#> 10:            visibility    41
#> 11:                 color    74
#> 12:            bigDataUrl    78
#> 13:                parent   111
#> 14:                  type   115
#> 15:            shortLabel   123
#> 16:             longLabel   123

For more information about data.table syntax, read vignette("datatable-intro", package="data.table").

Match all tracks and some fields with one regex

In the examples above we extracted all fields from all tracks (using two regexes, one for the track, one for the field). In the example below we extract only the track name, split into separate columns (using a single regex for the track).

cell.sample.type <- list(
  cellType="[^ ]*?",
  "_",
  sampleName=list(
    "McGill",
    sampleID="[0-9]+", as.integer),
  dataType="Coverage|Peaks")
nc::capture_all_str(trackDb.vec, cell.sample.type)
#>      cellType sampleName sampleID dataType
#>        <char>     <char>    <int>   <char>
#>   1:    bcell McGill0091       91 Coverage
#>   2:    bcell McGill0091       91 Coverage
#>   3:    bcell McGill0091       91    Peaks
#>   4:    bcell McGill0091       91    Peaks
#>   5:    bcell McGill0322      322 Coverage
#>  ---                                      
#> 144:    tcell McGill0106      106    Peaks
#> 145:    tcell McGill0107      107 Coverage
#> 146:    tcell McGill0107      107 Coverage
#> 147:    tcell McGill0107      107    Peaks
#> 148:    tcell McGill0107      107    Peaks

Note that the pattern above defines nested capture groups via named lists (e.g. sampleID is a subset of sampleName). The pattern below matches either the previously specified track pattern, or any other type of track name:

sample.or.anything <- list(
  cell.sample.type,
  "|",
  "[^\n]+")
track.pattern.old <- list(
  "track ",
  track=sample.or.anything)
nc::capture_all_str(trackDb.vec, track.pattern.old)
#>                         track cellType sampleName sampleID dataType
#>                        <char>   <char>     <char>    <int>   <char>
#>   1:                    bcell                           NA         
#>   2:             kidneyCancer                           NA         
#>   3:                   kidney                           NA         
#>   4:   leukemiaCD19CD10BCells                           NA         
#>   5:                 monocyte                           NA         
#>  ---                                                               
#> 119: tcell_McGill0106Coverage    tcell McGill0106      106 Coverage
#> 120:    tcell_McGill0106Peaks    tcell McGill0106      106    Peaks
#> 121:         tcell_McGill0107                           NA         
#> 122: tcell_McGill0107Coverage    tcell McGill0107      107 Coverage
#> 123:    tcell_McGill0107Peaks    tcell McGill0107      107    Peaks

Notice the repetition of track in the pattern above. This can be avoided by using the nc::field helper function, which takes three arguments, that are pasted together to form a pattern:

The example above can thus be re-written as below, avoiding the repetition of track which was present above:

track.pattern <- nc::field("track", " ", sample.or.anything)
nc::capture_all_str(trackDb.vec, track.pattern)
#>                         track cellType sampleName sampleID dataType
#>                        <char>   <char>     <char>    <int>   <char>
#>   1:                    bcell                           NA         
#>   2:             kidneyCancer                           NA         
#>   3:                   kidney                           NA         
#>   4:   leukemiaCD19CD10BCells                           NA         
#>   5:                 monocyte                           NA         
#>  ---                                                               
#> 119: tcell_McGill0106Coverage    tcell McGill0106      106 Coverage
#> 120:    tcell_McGill0106Peaks    tcell McGill0106      106    Peaks
#> 121:         tcell_McGill0107                           NA         
#> 122: tcell_McGill0107Coverage    tcell McGill0107      107 Coverage
#> 123:    tcell_McGill0107Peaks    tcell McGill0107      107    Peaks

Finally we use field again to match the type column:

any.lines.pattern <- "(?:\n[^\n]+)*"
nc::capture_all_str(
  trackDb.vec,
  track.pattern,
  any.lines.pattern,
  "\\s+",
  nc::field("type", " ", "[^\n]+"))
#>                         track cellType sampleName sampleID dataType     type
#>                        <char>   <char>     <char>    <int>   <char>   <char>
#>   1:               all_labels                           NA          bigBed 9
#>   2:                 problems                           NA          bigBed 3
#>   3:            jointProblems                           NA          bigBed 3
#>   4:            peaks_summary                           NA          bigBed 5
#>   5:         bcell_McGill0091                           NA            bigWig
#>  ---                                                                        
#> 111: tcell_McGill0106Coverage    tcell McGill0106      106 Coverage   bigWig
#> 112:    tcell_McGill0106Peaks    tcell McGill0106      106    Peaks   bigWig
#> 113:         tcell_McGill0107                           NA            bigWig
#> 114: tcell_McGill0107Coverage    tcell McGill0107      107 Coverage   bigWig
#> 115:    tcell_McGill0107Peaks    tcell McGill0107      107    Peaks   bigWig

Exercise for the reader (easy): modify the above regex in order to capture the bigDataUrl field, and three additional columns (red, green, blue) from the color field. Assume that bigDataUrl occurs before color in each track. Note that this is a limitation of the single regex approach — using two regex, as described in previous sections, could extract any/all fields, even if they appear in different orders in different tracks.

Exercise for the reader (hard): note that the last code block only matches tracks which define the type field. How would you optionally match the type field? Hint: the current any.lines.pattern can match the type field.

Parsing SweeD output files

Thanks to Marc Tollis for providing the example data used in this section (from the SweeD bioinformatics program). Some representative lines from one output file are shown below.

info.txt.gz <- system.file(
  "extdata", "SweeD_Info.txt.gz", package="nc")
info.vec <- readLines(info.txt.gz)
info.vec[20:50]
#>  [1] " Total number of samples in the VCF:\t13"
#>  [2] " Samples excluded from the analysis:\t6" 
#>  [3] ""                                        
#>  [4] ""                                        
#>  [5] " Alignment 1"                            
#>  [6] ""                                        
#>  [7] "\t\tChromosome:\t\tscaffold_0"           
#>  [8] "\t\tSequences:\t\t14"                    
#>  [9] "\t\tSites:\t\t\t1670366"                 
#> [10] "\t\tDiscarded sites:\t1264068"           
#> [11] ""                                        
#> [12] "\t\tProcessing:\t\t155.53 seconds"       
#> [13] ""                                        
#> [14] "\t\tPosition:\t\t8.936200e+07"           
#> [15] "\t\tLikelihood:\t\t4.105582e+02"         
#> [16] "\t\tAlpha:\t\t\t6.616326e-06"            
#> [17] ""                                        
#> [18] ""                                        
#> [19] " Alignment 2"                            
#> [20] ""                                        
#> [21] "\t\tChromosome:\t\tscaffold_1"           
#> [22] "\t\tSequences:\t\t14"                    
#> [23] "\t\tSites:\t\t\t1447008"                 
#> [24] "\t\tDiscarded sites:\t1093595"           
#> [25] ""                                        
#> [26] "\t\tProcessing:\t\t138.83 seconds"       
#> [27] ""                                        
#> [28] "\t\tPosition:\t\t8.722482e+07"           
#> [29] "\t\tLikelihood:\t\t2.531514e+02"         
#> [30] "\t\tAlpha:\t\t\t1.031963e-05"            
#> [31] ""

The Alignment numbers must be matched with the numbers before slashes in the other file,

report.txt.gz <- system.file(
  "extdata", "SweeD_Report.txt.gz", package="nc")
report.vec <- readLines(report.txt.gz)
cat(report.vec[1:10], sep="\n")
#> 
#> //1
#> Position Likelihood  Alpha
#> 700.0000 4.637328e-03    2.763840e+02
#> 130585.6172  3.781283e-01    8.490200e-04
#> 260471.2344  3.602315e-02    4.691340e-03
#> 390356.8516  7.618749e-01    5.377668e-04
#> 520242.4688  2.979971e-08    1.411765e-01
#> 650128.0859  3.552965e-03    7.790821e-03
#> 780013.7031  4.637359e-03    1.727400e-02
cat(report.vec[1000:1010], sep="\n")
#> 129366774.7188   1.218965e-01    2.215489e-02
#> 129496660.3359   1.165627e-02    3.384931e-02
#> 129626545.9531   2.233934e-02    3.602669e-02
#> 129756434.0000   3.850623e-01    4.812648e+01
#> 
#> //2
#> Position Likelihood  Alpha
#> 135.0000 7.282316e-01    3.163686e+01
#> 111533.0625  2.548831e+00    4.932014e-04
#> 222931.1250  1.369720e-02    1.044774e+00
#> 334329.1875  7.118828e+00    3.965791e-04

The goal is to produce a bed file, which has tab-separated values with four columns: chrom, chromStart, chromEnd, Likelihood. The chrom values appear in the info file (Chromosome) so we will need to join the two files based on alignment ID. First we capture all alignments in the info file:

(info.dt <- nc::capture_all_str(
  info.vec,
  "Alignment ",
  alignment="[0-9]+",
  "\n\n\t\tChromosome:\t\t",
  chrom=".*",
  "\n"))
#>     alignment      chrom
#>        <char>     <char>
#>  1:         1 scaffold_0
#>  2:         2 scaffold_1
#>  3:         3 scaffold_2
#>  4:         4 scaffold_3
#>  5:         5 scaffold_4
#>  6:         6 scaffold_5
#>  7:         7 scaffold_6
#>  8:         8 scaffold_7
#>  9:         9 scaffold_8
#> 10:        10 scaffold_9

Then we capture all alignment/csv blocks in the report file:

(report.dt <- nc::capture_all_str(
  report.vec,
  "//",
  alignment="[0-9]+",
  "\n",
  csv="[^/]+"
)[, {
  data.table::fread(text=csv)
}, by=alignment])
#>        alignment   Position   Likelihood        Alpha
#>           <char>      <num>        <num>        <num>
#>     1:         1      700.0 4.637328e-03 2.763840e+02
#>     2:         1   130585.6 3.781283e-01 8.490200e-04
#>     3:         1   260471.2 3.602315e-02 4.691340e-03
#>     4:         1   390356.9 7.618749e-01 5.377668e-04
#>     5:         1   520242.5 2.979971e-08 1.411765e-01
#>    ---                                               
#>  9996:        10 82991564.8 8.051006e-03 1.357819e-03
#>  9997:        10 83074967.8 7.048433e-03 1.825764e-03
#>  9998:        10 83158370.8 1.012360e-07 7.999999e-03
#>  9999:        10 83241773.8 3.977189e-08 9.999997e-01
#> 10000:        10 83325174.0 3.980538e-08 1.200000e+03

Note that because by=alignment was specified, fread is called for each unique value of alignment (i.e. each row). The results are combined into a single data.table with all of the csv data from the original file, plus the additional alignment column. Next, we join this table to the previous table in order to get the chrom column:

(join.dt <- report.dt[info.dt, on=.(alignment)])
#>        alignment   Position   Likelihood        Alpha      chrom
#>           <char>      <num>        <num>        <num>     <char>
#>     1:         1      700.0 4.637328e-03 2.763840e+02 scaffold_0
#>     2:         1   130585.6 3.781283e-01 8.490200e-04 scaffold_0
#>     3:         1   260471.2 3.602315e-02 4.691340e-03 scaffold_0
#>     4:         1   390356.9 7.618749e-01 5.377668e-04 scaffold_0
#>     5:         1   520242.5 2.979971e-08 1.411765e-01 scaffold_0
#>    ---                                                          
#>  9996:        10 82991564.8 8.051006e-03 1.357819e-03 scaffold_9
#>  9997:        10 83074967.8 7.048433e-03 1.825764e-03 scaffold_9
#>  9998:        10 83158370.8 1.012360e-07 7.999999e-03 scaffold_9
#>  9999:        10 83241773.8 3.977189e-08 9.999997e-01 scaffold_9
#> 10000:        10 83325174.0 3.980538e-08 1.200000e+03 scaffold_9

Finally the desired bed table can be created via

join.dt[, .(
  chrom,
  chromStart=as.integer(Position-1),
  chromEnd=as.integer(Position),
  Likelihood)]
#>             chrom chromStart chromEnd   Likelihood
#>            <char>      <int>    <int>        <num>
#>     1: scaffold_0        699      700 4.637328e-03
#>     2: scaffold_0     130584   130585 3.781283e-01
#>     3: scaffold_0     260470   260471 3.602315e-02
#>     4: scaffold_0     390355   390356 7.618749e-01
#>     5: scaffold_0     520241   520242 2.979971e-08
#>    ---                                            
#>  9996: scaffold_9   82991563 82991564 8.051006e-03
#>  9997: scaffold_9   83074966 83074967 7.048433e-03
#>  9998: scaffold_9   83158369 83158370 1.012360e-07
#>  9999: scaffold_9   83241772 83241773 3.977189e-08
#> 10000: scaffold_9   83325173 83325174 3.980538e-08

Exercise for the reader (easy): notice that the code above for creating info.dt involves repetition in the pattern and group names (alignment, Alignment, chrom, Chromosome). Re-write the pattern using nc::field in order to eliminate that repetition.

Exercise for the reader (hard): notice that Chromosome is only the first field – how could you extract the other fields as well? Hint: use nc::field in a helper function in order to avoid repetition.