                        Dear Emacs, please make this -*-Text-*- mode!

                CHANGES IN tm VERSION 0.5-7

SIGNIFICANT USER-VISIBLE CHANGES

    o   The function termFreq() provides two new arguments for
        generalized bounds checking of term frequencies and word
        lengths. This replaces the arguments minDocFreq and
        minWordLength.

    o   The function termFreq() is now sensitive to the order of
        control options.

NEW FEATURES

    o   Weighting schemata for term-document matrices in SMART notation.

    o   Local and global options for term-document matrix
        construction.

    o   SMART stopword list was added.

                CHANGES IN tm VERSION 0.5-5

NEW FEATURES

    o   Access documents in a corpus by names (fallback to IDs if names are
        not set), i.e., allow a string for the corpus operator `[[`.

BUG FIXES

    o   The function findFreqTerms() now checks bounds on a global level
        (to comply with the manual page) instead per document. Reported
        and fixed by Thomas Zapf-Schramm.

                CHANGES IN tm VERSION 0.5-4

SIGNIFICANT USER-VISIBLE CHANGES

    o   Use IETF language tags for language codes (instead of ISO 639-2).

NEW FEATURES

    o   The function tm_tag_score() provides functionality to score
        documents based on the number of tags found. This is useful for
        sentiment analysis.

    o   The weighting function for term frequency-inverse document
        frequency weightTfIdf() has now an option for term
        normalization.

    o   Plotting functions to test for Zipf's and Heaps' law on a
        term-document matrix were added: Zipf_plot() and
        Heaps_plot(). Contributed by Kurt Hornik.

                CHANGES IN tm VERSION 0.5-3

NEW FEATURES

    o   The reader function readRCV1asPlain() was added and combines the
        functionality of readRCV1() and as.PlainTextDocument().

    o   The function stemCompletion() has a set of new heuristics.

                CHANGES IN tm VERSION 0.5-2

SIGNIFICANT USER-VISIBLE CHANGES

    o   The function termFreq() which is used for building a
        term-document matrix now uses a whitespace oriented tokenizer
        as default.

NEW FEATURES

    o   A combine method for merging multiple term-document matrices
        was added (c.TermDocumentMatrix()).

    o   The function termFreq() has now an option to remove
        punctuation characters.

DEPRECATED & DEFUNCT

    o   Following functions have been removed:

          CSVSource()     (use DataframeSource(read.csv(..., stringsAsFactors = FALSE)) instead), and
          TermDocMatrix() (use DocumentTermMatrix() instead).

BUG FIXES

    o   removeWords() no longer skips words at the beginning or the end
        of a line. Reported by Mark Kimpel.

                CHANGES IN tm VERSION 0.5-1

BUG FIXES

    o   preprocessReut21578XML() no longer generates invalid file names.

                CHANGES IN tm VERSION 0.5

SIGNIFICANT USER-VISIBLE CHANGES

    o   All classes, functions, and generics are reimplemented using
        the S3 class system.

    o   Following functions have been renamed:

          activateCluster()   to tm_startCluster()     ,
          asPlain()           to as.PlainTextDocument(),
          deactivateCluster() to tm_stopCluster()      ,
          tmFilter()          to tm_filter()           ,
          tmIndex()           to tm_index()            ,
          tmIntersect()       to tm_intersect()        , and
          tmMap()             to tm_map()              .

    o   Mail handling functionality is factored out to the
        tm.plugin.mail package.

DEPRECATED & DEFUNCT

    o   Following functions have been removed:

          tmTolower()         (use tolower() instead), and
          replacePatterns()   (use gsub()    instead).

                CHANGES IN tm VERSION 0.4

SIGNIFICANT USER-VISIBLE CHANGES

    o   The Corpus class is now virtual providing an abstract
        interface.

    o   VCorpus, the default implementation of the abstract corpus
        interface (by subclassing), provides a corpus with volatile (=
        standard R object) semantics. It loads all documents into
        memory, and is designed for small to medium-sized data sets.

    o   PCorpus, an implementation of the abstract corpus interface (by
        subclassing), provides a corpus with permanent storage
        semantics. The actual data is stored in an external database
        (file) object (as supported by the filehash package), with
        automatic (un-)loading into memory. It is designed for systems
        with small memory.

    o   Language codes are now in ISO 639-2 (instead of ISO 639-1).

    o   Reader functions no longer have a load argument for lazy
        loading.

NEW FEATURES

    o   The reader function readReut21578XMLasPlain() was added and
    	combines the functionality of readReut21578XML() and asPlain().

BUG FIXES

    o   weightTfIdf() no longer applies a binary weighting to an input
        matrix in term frequency format (which happened only in 0.3-4).

                CHANGES IN tm VERSION 0.3-4

SIGNIFICANT USER-VISIBLE CHANGES

    o   .onLoad() no longer tries to start a MPI cluster (which often
        failed in misconfigured environments). Use activateCluster()
        and deactivateCluster() instead.

    o   DocumentTermMatrix (the improved reimplementation of defunct
        TermDocMatrix) does not use the Matrix package anymore.

NEW FEATURES

    o   The DirSource() constructor now accepts the two new (optional)
        arguments pattern and ignore.case. With pattern one can define
        a regular expression for selecting only matching files, and
        ignore.case specifies whether pattern-matching is
        case-sensitive.

    o   The readNewsgroup() reader function can now be configured for
        custom date formats (via the DateFormat argument).

    o   The readPDF() reader function can now be configured (via the
        PdfinfoOptions and PdftotextOptions arguments).

    o   The readDOC() reader function can now be configured (via the
        AntiwordOptions argument).

    o   Sources now can be vectorized. This allows faster corpus
        construction.

    o   New XMLSource class for arbitrary XML files.

    o   The new readTabular() reader function allows to create a custom
        tailor-made reader configured via mappings from a tabular data
        structure.

    o   The new readXML() reader function allows to read in arbitrary
        XML files which are described with a specification.

    o   The new tmReduce() transformation allows to combine multiple
        maps into one transformation.

DEPRECATED & DEFUNCT

    o   CSVSource is defunct (use DataframeSource instead).

    o   weightLogical is defunct.

    o   TermDocMatrix is defunct (use DocumentTermMatrix or
        TermDocumentMatrix instead).

                CHANGES IN tm VERSION 0.3-3

NEW FEATURES

    o   The abstract Source class gets a default implementation for
        the stepNext() method. It increments the position counter by
        one, a reasonable value for most sources. For special purposes
        custom methods can be created via overloading stepNext() of
        the subclass.

    o   New URISource class for a single document identified by a
        Uniform Resource Identifier.

    o   New DataframeSource for documents stored in a data frame. Each
        row is interpreted as a single document.

BUG FIXES

    o   Fix off-by-one error in convertMboxEml() function. Reported by
        Angela Bohn.

    o   Sort row indices in sparse term-document matrices. Kudos to
        Martin Mächler for his suggestions.

    o   Sources and readers no longer evaluate calls in a non-standard
        way.

                CHANGES IN tm VERSION 0.3-2

NEW FEATURES

    o	Weighting functions now have an Acronym slot containing
        abbreviations of the weighting functions' names. This is highly
        useful when generating tables with indications which weighting
        scheme was actually used for your experiments.

    o   The functions tmFilter(), tmIndex(), tmMap() and TermDocMatrix()
        now can use a MPI cluster (via the snow and Rmpi packages) if
        available. Use (de)activateCluster() to manually override
        cluster usage settings. Special thanks to Stefan Theussl for
        his constructive comments.

    o   The Source class receives a new Length slot. It contains the
        number of elements provided by the source (although there
        might be rare cases where the number cannot be determined in
        advance---then it should be set to zero).
