% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/fansi-package.R
\docType{package}
\name{fansi}
\alias{fansi}
\alias{fansi-package}
\title{Details About Manipulation of Strings Containing Control Sequences}
\description{
Counterparts to R string manipulation functions that account for
the effects of ANSI text formatting control sequences.
}
\section{Control Characters and Sequences}{


Control characters and sequences are non-printing inline characters that can
be used to modify terminal display and behavior, for example by changing text
color or cursor position.

We will refer to ANSI control characters and sequences as "\emph{Control
Sequences}" hereafter.

There are three types of \emph{Control Sequences} that \code{fansi} treats specially:
\itemize{
\item "C0" control characters, such as tabs and carriage returns (we include
delete in this set, even though technically it is not part of it).
\item Sequences starting in "ESC[", also known as ANSI CSI sequences.
\item Sequences starting in "ESC" and followed by something other than "[".
}

All of these are considered zero display-width for purposes of string width
calculations.

\emph{Control Sequences} starting with ESC are assumed to be two characters
long (including the ESC) unless they are of the CSI variety, in which case
their length is computed as per the \href{http://www.ecma-international.org/publications/standards/Ecma-048.htm}{ECMA-48 specification}.
There are non-CSI escape sequences that may be longer than two characters,
but \code{fansi} will (incorrectly) treat them as if they were two characters
long.
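
As an illustration of the CSI length rule, the following base R sketch matches
CSI sequences with a regular expression built from the ECMA-48 byte classes
(parameter bytes, intermediate bytes, one final byte).  It is not the code
\code{fansi} uses internally; the variable names are ours, and
\code{intToUtf8(27)} is used only to produce the ESC character.

\preformatted{
esc <- intToUtf8(27)                        # the ESC character (0x1B)
x <- paste0(esc, "[1;31mred", esc, "[0m")   # bold red "red", then a reset

# CSI: ESC "[", parameter bytes 0x30-0x3F, intermediate bytes 0x20-0x2F,
# then exactly one final byte in 0x40-0x7E.
csi <- paste0(esc, "[[][0-9:;<=>?]*[ -/]*[@-~]")

seqs <- regmatches(x, gregexpr(csi, x))[[1]]
nchar(seqs)                # 7 4: lengths of the two CSI sequences
plain <- gsub(csi, "", x)  # "red": what remains once they are removed
nchar(x)                   # 14: base R counts the control characters too
nchar(plain)               # 3
}

The two sequences have different lengths (7 and 4 characters), which is why a
fixed two-character assumption only makes sense for the non-CSI case.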

In theory it is possible to encode \emph{Control Sequences} with a single-byte
introducing character in the C1 range (0x80-0x9F) instead of the traditional
"ESC[".  Since this is rare and it conflicts with UTF-8 encoding, we do
not support it.
}

\section{ANSI CSI SGR Control Sequences}{


\strong{NOTE}: not all displays support ANSI CSI SGR sequences; run
\link{term_cap_test} to see whether your display supports them.

ANSI CSI SGR Control Sequences are the subset of CSI sequences that can be
used to change text appearance (e.g. color).  These sequences begin with
"ESC[" and end in "m".  \code{fansi} interprets these sequences and writes new
ones to the output strings in such a way that the original formatting is
preserved.  In most cases this should be transparent to the user.

Occasionally there may be mismatches between how \code{fansi} and a display
interpret the CSI SGR sequences, which may produce display artifacts.  The
most likely sources of artifacts are \emph{Control Sequences} that move
the cursor or change the display, or that \code{fansi} otherwise fails to
interpret, such as:
\itemize{
\item Unknown SGR substrings.
\item "C0" control characters like tabs and carriage returns.
\item Other escape sequences.
}

Another possible source of problems is that different displays parse
and interpret control sequences differently.  The common CSI SGR sequences
that you are likely to encounter in formatted text tend to be treated
consistently, but less common ones are not.  \code{fansi} tries to hew to the
ECMA-48 specification \strong{for CSI control sequences}, but not all terminals
do.

The most likely source of problems will be 24-bit CSI SGR sequences.
For example, a 24-bit color sequence such as "ESC[38;2;31;42;4m" is a
single foreground color to a terminal that supports it, or separate
foreground, background, faint, and underline specifications for one that does
not.  To mitigate this particular problem you can tell \code{fansi} what your
terminal capabilities are via the \code{term.cap} parameter or the
"fansi.term.cap" global option, although \code{fansi} does try to detect them by
default.
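
The two readings of the parameters "38;2;31;42;4" can be made concrete with a
small base R sketch.  This is only an illustration: the \code{meaning} lookup
table is a hypothetical, deliberately incomplete mapping of SGR codes, and
neither it nor the variable names come from \code{fansi}.

\preformatted{
params <- as.integer(strsplit("38;2;31;42;4", ";", fixed = TRUE)[[1]])

# Reading 1 (24-bit capable terminal): 38;2;r;g;b is one foreground color.
truecolor <- if (identical(params[1:2], c(38L, 2L))) params[3:5]
truecolor    # 31 42 4, i.e. red = 31, green = 42, blue = 4

# Reading 2 (no 24-bit support): every parameter is a separate SGR code.
# Hypothetical partial lookup table; 38 has no standalone entry in it.
meaning <- c("2" = "faint", "4" = "underline",
             "31" = "red foreground", "42" = "green background")
separate <- unname(meaning[as.character(params)])
separate     # NA "faint" "red foreground" "green background" "underline"
}

The same five numbers therefore describe either one color or four unrelated
style changes, depending on what the terminal supports.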

\code{fansi} will warn if it encounters \emph{Control Sequences} that it cannot
interpret or that might conflict with terminal capabilities.  You can turn
off warnings via the \code{warn} parameter or via the "fansi.warn" global option.

\code{fansi} can work around "C0" tab control characters by turning them into
spaces first with \link{tabs_as_spaces} or with the \code{tabs.as.spaces} parameter
available in some of the \code{fansi} functions.

We chose to interpret ANSI CSI SGR sequences because this reduces how
much string transcription we need to do during string manipulation.  If we do
not interpret the sequences then we need to record all of them from the
beginning of the string and prepend all the accumulated tags up to the beginning
of a substring to the substring.  In many cases the bulk of those accumulated
tags will be irrelevant as their effects will have been superseded by
subsequent tags.

\code{fansi} assumes that ANSI CSI SGR sequences should be interpreted in
cumulative "Graphic Rendition Combination Mode".  This means new SGR
sequences add to rather than replace previous ones, although in some cases
the effect is the same as replacement (e.g. if you have a color active and
pick another one).
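
The trade-off between recording and interpreting sequences can be sketched in
base R.  This is a simplified illustration that assumes only basic foreground
colors (30-37) and bold (1) occur; it is not how \code{fansi} tracks state,
and all names are ours.

\preformatted{
esc <- intToUtf8(27)             # the ESC character (0x1B)
codes <- c(31L, 32L, 1L, 34L)    # SGR codes seen before the substring starts

# Without interpretation: prepend every sequence seen so far.
naive <- paste0(esc, "[", codes, "m", collapse = "")

# With interpretation in cumulative mode: a later foreground color
# supersedes an earlier one, while bold remains active.
is.fg <- codes >= 30L & codes <= 37L
keep <- c(codes[!is.fg], tail(codes[is.fg], 1))
state <- paste0(esc, "[", paste(keep, collapse = ";"), "m")

keep            # 1 34
nchar(naive)    # 19
nchar(state)    # 7
}

Only the active state (bold, blue foreground) needs to be written in front of
the substring; the superseded red and green codes are dropped.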
}

\section{Encodings / UTF-8}{


\code{fansi} will convert any non-ASCII strings to UTF-8 before processing them,
and \code{fansi} functions that return strings will return them encoded in UTF-8.
In some cases this will be different to what base R does.  For example,
\code{substr} re-encodes substrings to their original encoding.
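
A base R sketch of what the conversion means for a Latin-1 string follows.
\code{enc2utf8} stands in for the conversion step purely for illustration; we
are not claiming it is the function \code{fansi} calls.

\preformatted{
x <- rawToChar(as.raw(c(99, 97, 102, 233)))   # "caf" plus byte 0xE9
Encoding(x) <- "latin1"                       # declare the bytes as Latin-1

Encoding(x)                  # "latin1"
Encoding(substr(x, 1, 4))    # base R result, for comparison

u <- enc2utf8(x)             # explicit conversion to UTF-8
Encoding(u)                  # "UTF-8"
nchar(x, type = "bytes")     # 4
nchar(u, type = "bytes")     # 5: the last character takes two bytes in UTF-8
}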

Interpretation of UTF-8 strings is intended to be consistent with base R.
There are three ways things may not work out exactly as desired:
\enumerate{
\item \code{fansi}, despite its best intentions, handles a UTF-8 sequence differently
to the way R does.
\item R incorrectly handles a UTF-8 sequence.
\item Your display incorrectly handles a UTF-8 sequence.
}

These issues are most likely to occur with invalid UTF-8 sequences,
combining character sequences, and emoji.  For example, as of this writing R
(and the OSX terminal) consider emoji to be one-wide characters, when in
reality they are two wide.  Do not expect the \code{fansi} width
calculations to work correctly with strings containing emoji.

Internally, \code{fansi} computes the width of every UTF-8 character sequence
outside of the ASCII range using the native \code{R_nchar} function.  This will
cause such characters to be processed more slowly than ASCII characters.
Additionally, \code{fansi} character width computations can differ from R width
computations despite the use of \code{R_nchar}. \code{fansi} always computes width for
each character individually, which assumes that the sum of the widths of each
character is equal to the width of a sequence.  However, it is theoretically
possible for a character sequence that forms a single grapheme to break that
assumption. In informal testing we have found this to be rare because in the
most common multi-character graphemes the trailing characters are computed as
zero width.
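
The assumption can be checked for a given string in base R by comparing the
sum of per-character widths with the width of the whole sequence.  The sketch
below uses "e" followed by a combining acute accent (U+0301) as an example of
a two-character grapheme; results depend on the R version and locale, so no
expected values are shown.

\preformatted{
x <- intToUtf8(c(101, 769))        # "e" followed by U+0301
chars <- strsplit(x, "")[[1]]      # split into individual characters

per.char <- nchar(chars, type = "width")
sum(per.char)                      # width as a sum of character widths
nchar(x, type = "width")           # width of the sequence as a whole
}

If the two values differ for a string, width-based \code{fansi} results for it
may not match what base R reports.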

As of R 3.4.0 \code{substr} appears to use UTF-8 character byte sizes as indicated
by the leading byte, irrespective of whether the subsequent bytes lead to a
valid sequence.  Additionally, UTF-8 byte sequences as long as 5 or 6 bytes
may be allowed, which is likely a holdover from older Unicode versions.
\code{fansi} mimics this behavior.  It is likely \code{substr} will start failing with
invalid UTF-8 byte sequences with R 3.6.0 (as per SVN r74488).  In general,
you should assume that \code{fansi} may not replicate base R exactly when there
are illegal UTF-8 sequences present.

Our long-term objective is to implement proper UTF-8 character width
computations, but for simplicity, and because R and our terminal do not
do it properly either, we are deferring the issue for now.
}

\section{Miscellaneous}{


The native code in this package assumes that all strings are NUL-terminated
and no longer than (32-bit) INT_MAX (excluding the NUL).  This should be a
safe assumption since the code is designed to work with STRSXPs and CHARSXPs.
Behavior is undefined and probably bad if you somehow manage to provide
\code{fansi} with strings that do not adhere to these assumptions.
}

