--- title: "Quick Start" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Quick Start} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` `tidysq` package is meant to store and conduct operations on biological sequences. This vignette provides a guide to basic usage of `tidysq`, i.e. reading, manipulating and writing sequences to file. The most recent version of `tidysq` can be installed with `install_github()` function from `devtools`. ```{r setup} # devtools::install_github("BioGenies/tidysq") library(tidysq) ``` ## Sequence creation Biological sequences can be and often are represented as strings -- sequences of letters. For example, a DNA sequence can take the form of `"TAGGCCCTAGACCTG"`, where `A` means adenine, `C` -- cytosine, `G` -- guanine and `T` -- thymine. Exact IUPAC recommendations for one-letter codes can be found in *Cornish-Bowden A. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. 1985 May 10;13(9):3021-30. doi: 10.1093/nar/13.9.3021. PMID: 2582368; PMCID: PMC341218*. Within `tidysq` package sequence data is stored in `sq` objects, that is, vectors of biological sequences. They can be created from string vectors as above: ```{r sq_from_string} sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG")) sq_dna ``` There are several thing to note. First, each sequence is an element of `sq` object. Many operations are vectorized --- they are applied to all sequences of a vector --- and `sq` objects are no different in this regard. Second, the first line of output says: `basic DNA sequences list`. This means that all sequences of this object are of DNA type and do not use ambiguous letters (more about that in "Advanced alphabet techniques" vignette). ## Subsetting sequences Manipulating sequence objects is an integral part of `tidysq`. `sq` objects can be easily subsetted using usual R syntax: ```{r sq_subset} sq_dna[1] ``` Extracting subsequences is a bit more complicated than that --- because it uses designated function `bite()`. Its syntax, however, closely resembles that of base R --- indexing starts with one and negative indices are interpreted as "anything except that". It returns an `sq` object with all sequences subsetted: ```{r sq_bite} bite(sq_dna, 5:10) bite(sq_dna, c(-9, -11, -13)) ``` It's possible to reverse sequences using this function: ```{r sq_bite_reversing} # Don't do it like that! bite(sq_dna, 15:1) ``` However, this usage is strongly discouraged, because it's both ineffective and works badly with sequences of different lengths. Instead, there is a designated function `reverse()`: ```{r sq_reverse} reverse(sq_dna) ``` Note that it is very different to base `rev()`, which reverses only the order of sequences, not letters: ```{r sq_rev} rev(sq_dna) ``` We can combine two or more `sq` objects using base `c()` function: ```{r sq_c} sq_dna <- c(sq_dna, reverse(sq_dna)) sq_dna ``` ## Biological interpretation `tidysq` offers two functions specific to DNA/RNA sequences, namely `complement()` and `translate()`. The former creates sequences with complementary bases, that is, replaces `A` with `T`, `C` with `G` and *vice versa*. The latter translates input to amino acid sequences using [the translation table with three-letter codons](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables). These functions can be called as shown below: ```{r sq_complement_translate} complement(sq_dna) translate(sq_dna) ``` One noteworthy feature here is that translation can be done with any genetic code table of those listed [on this Wikipedia page](https://en.wikipedia.org/wiki/List_of_genetic_codes): ```{r sq_translate_other_table} translate(sq_dna, table = 6) ``` ## Finding motifs Motifs are short subsequences. These are often searched for in biological sequences. `tidysq` has two distinct functions that allow the user to perform such search. One of them is a `%has%` operator that takes `sq` object and character vector as parameters respectively. It returns a logical vector of the same length as `sq` object, where each element says whether all motifs passed as strings were found in given sequence: ```{r sq_has} sq_dna %has% "ATC" # It can be used to subset sq sq_dna[sq_dna %has% c("AG", "CC")] ``` It says nothing about motif placement within sequence nor it exact form, however. In this case, there is `find_motifs()` function that returns a whole `tibble` (from `tibble` package; basically improved version of `data.frame`) with various info about found motifs. Important thing to note here is that the second argument is a character vector of sequence names to avoid embedding potentially long sequences in resulting `tibble` potentially many times: ```{r sq_find_motifs} find_motifs(sq_dna, c("seq1", "seq2", "rev1", "rev2"), c("ATC", "TAG")) ``` You can also provide this function with a `data.frame` (or, what we recommend, `tibble`) containing one column called `sq`, containing the sequences and the other column `name` containing the names. ```{r sqibble_find_motifs} sqibble <- tibble::tibble(sq = sq_dna, name = c("seq1", "seq2", "rev1", "rev2")) # does the same as the call from previous chunk of code find_motifs(sqibble, c("ATC", "TAG")) ``` There are ambiguous DNA bases in IUPAC codes and these can be used in motifs. One of them is `"N"` --- its meaning is "any of `A`, `C`, `G` or `T`: ```{r sq_find_motifs_amb} find_motifs(sqibble, "GNCC") ``` This example displays the difference between `"sought"` and `"found"` columns. The former contains the string representation of motif that the user was looking for, while the latter contains a `tidysq`-encoded sequence with an "instance" of motif. Two additional characters are reserved because of their special meaning in motifs. `"^"` means that this motif must be found at the start of a sequence, while `"$"` means the same, but with the end instead. They can be mixed with ambiguous letters, of course: ```{r sq_find_motifs_start_end} find_motifs(sqibble, c("^TAG", "ATN$")) ``` ## Exporting sq objects After doing computations the user might wish to save their sequences for future use. One of the most popular formats for storing biological sequences is FASTA. `tidysq` allows the user to write sequences to FASTA file with `write_fasta()` function. Important thing to remember here that the arguments for the function are analogous to those used in `find_motifs()` -- either `sq` object and a vector of names or a `tibble` with columns of sequences and names: ```{r write_fasta, eval=FALSE} write_fasta(sq_dna, c("seq1", "seq2", "rev1", "rev2"), "just_your_ordinary_fasta_file.fasta") # or write_fasta(sqibble, "just_your_ordinary_fasta_file.fasta") ```