NEWS

text2vec 0.6

2019-12-17

breaking change - removed construction of a vocabulary in parallel on Windows
use rsparse package for SVD and GloVe factorizations
updated RWMD implementation (hopefully bug free)

2018-09-10

breaking change - changed IDF formula - see #280 for details.

2018-05-28

Added postag_lemma_tokenizer() (wrapper around udpipe::udpipe_annotate). Can be used as a drop-in replacement for simpler tokenizers in text2vec.

2018-05-25

Made combine_vocabularies() part of public API - see #260 for details.

2018-05-10

Added coherence() function for comprehensive coherence metrics. Thanks to Manuel Bickel ( @manuelbickel ) for the contribution.

2018-05-02

Fixed bug in LSA model - document embeddings are now calculated as left singular vectors multiplied by singular values (not by the square root of the values as before). Thanks to Sloane Simmons ( @singularperturbation )
Now fit_transform and transform methods in LDA model produce the same results. Thanks to @jiunsiew for reporting. Also, LDA now has an n_iter_inference parameter. It controls the number of samples drawn from the converged distribution for document-topic inference. This leads to more robust document-topic probabilities (reduced variance). Default value is 10.

2018-01-17

more numerically robust PMI, LFMD - thanks to @andland. Also adds iteration number iter to collocation_stat. iter shows iteration number when collocation stats (and counters) were calculated.

text2vec 0.5.1 [2018-01-10]

2018-01-10

removed rank* columns from collocation_stat - were never used internally. Users can easily calculate ranks themselves

2018-01-09

Added Bi-Normal Separation transformation, thanks to Pavel Shashkin ( @pshashk )
Added Dunning’s log-likelihood ratio for collocations, thanks to Chris Lee ( @Chrisss93 )
Early stopping for collocations learning

2017-12-18

fixed several bugs #219 #217 #205
decreased number of dependencies - no more magrittr, uuid, tokenizers
removed distributed LDA which didn’t work correctly

2017-10-18

Now tokenization is based on the tokenizers and stringi packages.
models API follows the mlapi package. No API changes on text2vec side - we just put abstract scikit-learn-like classes into a separate package in order to make them more reusable.

text2vec 0.5.0

2017-06-12

Add additional filters to prune_vocabulary - filter by document counts
Clean up LSA, fixed transform method. Added option to use randomized SVD algorithm from irlba.

2017-05-17

Improve dist2 performance for RWMD - incorporate ideas from gensim PR discussion.

2017-05-17

API breaking change - vocabulary format change - now plain data.frame with meta-information in attributes (stopwords, ngram, number of docs, etc).

2017-03-25

No longer relies on RcppModules
API breaking change - removed lda_c from formats in DTM construction
added ifiles_parallel, itoken_parallel high-level functions for parallel computing
API breaking change - chunks_number parameter renamed to n_chunks

2017-01-02

API breaking change - removed create_corpus from public API, moved co-occurrence related options to create_tcm from vectorizers
add ability to add custom weights for co-occurrence statistics calculations

2016-12-30

Noticeable speedup (1.5x) and even more noticeable improvement in memory usage (2x less!) for create_dtm, create_tcm. Now the package relies on the sparsepp library for underlying hash maps.

2016-10-30

Collocations - detection of multi-word phrases using different heuristics - PMI, gensim, LFMD.

2016-10-20

Fixed bug in as.lda_c() function

text2vec 0.4.0

Now under GPL (>= 2) Licence

“immutable” iterators - no need to reinitialize them

unified models interface

New models: LSA, LDA, GloVe with L1 regularization

Fast similarity and distances calculation: Cosine, Jaccard, Relaxed Word Mover’s Distance, Euclidean

Better handling of UTF-8 strings, thanks to @qinwf

iterators and models rely on R6 package

text2vec 0.3.0

2016-01-13 fix for #46, thanks to @buhrmann for reporting

2016-01-16 format of vocabulary changed.

do not keep doc_proportions. see #52.
add stop_words argument to prune_vocabulary. signature also was changed.

2016-01-17 fix for #51. if iterator over tokens returns list with names, these names will be:

stored as attr(corpus, 'ids')
rownames in dtm
names for dtm list in lda_c format

2016-02-02 high level function for corpus and vocabulary construction.

construction of vocabulary from list of itoken.
construction of dtm from list of itoken.

2016-02-10 rename transformers

now all transformers start with transform_* - more intuitive + simpler usage with autocompletion

2016-03-29 (accumulated since 2016-02-10)

rename vocabulary to create_vocabulary.
new functions create_dtm, create_tcm.
All core functions are able to benefit from multicore machines (users have to register a parallel backend themselves)
Fix for progress bars. Now they are able to reach 100% and ticks increased after computation.
ids argument to itoken. Simplifies assignment of ids to rows of DTM
create_vocabulary now can handle stopwords

2016-03-30 more robust split_into() util.

text2vec 0.2.0 (2016-01-10)

First CRAN release of text2vec.

Fast text vectorization with stable streaming API on arbitrary n-grams.

Functions for vocabulary extraction and management
Hash vectorizer (based on digest murmurhash3)
Vocabulary vectorizer

GloVe algorithm word embeddings.

Fast term-co-occurrence matrix factorization via parallel async AdaGrad.

All core functions written in C++.
