NEWS | R Documentation |
tm version >=0.6 is required.
The data set AssociatedPress, as well as other code checking document-term matrices, now conforms to the data structure of document-term matrices in tm version >=0.6.
Specifying a seed for Gibbs sampling now leads to a call to set.seed, and the external code used for fitting accesses the state of the R random number generator. The seed can also be set to NA (the default) so that the seed of the R random number generator is not changed when fitting the model.
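The effect of the seed argument can be illustrated with a minimal sketch. The subset size, the number of topics k, the number of iterations and the seed value below are arbitrary choices made for illustration, not recommendations from the package authors.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:20, ]  # small subset to keep the run short

# Illustrative control settings; the values are assumptions.
ctrl <- list(seed = 2010, iter = 100)

fit1 <- LDA(dtm, k = 2, method = "Gibbs", control = ctrl)
fit2 <- LDA(dtm, k = 2, method = "Gibbs", control = ctrl)

# With an identical seed, both runs are expected to give the same
# (log) term distributions.
all.equal(fit1@beta, fit2@beta)
```

Leaving seed at its default for Gibbs sampling means the fit draws from whatever state the R random number generator is currently in.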
The Gibbs sampling method for fitting the LDA model now also returns the current topic assignments for all words, which makes it possible to initialize Gibbs sampling using either the current term distribution of topics or these assignments.
The Gibbs sampling method for fitting the LDA model now allows seed words to be specified, i.e., higher a priori weights can be assigned to some words for some topics.
The word assignment matrix contained in the fitted models no longer has dimnames.
Package corpus.JSS.papers is now listed in the DESCRIPTION file together with the information that it is available from the additional repository http://datacube.wu.ac.at.
Package topicmodels now depends on package methods instead of importing it.
Package SnowballC is now suggested instead of Snowball.
A check was added to ensure that no empty documents are in the data. Thanks to Terry Therneau for pointing the problem out.
The first argument in the functions printf_vector and printf_matrix defined in the C code for the CTM was corrected to be const char *. Thanks to Murray Stokely for providing the patch.
A bug in function posterior was fixed where the rownames of the wrong object were used. Thanks to Benjamin S. Porter for pointing the problem out.
Dependency structure changed such that some packages are now only imported.
The information printed during the VEM algorithm when verbose is larger than 0 was improved.
The code in the vignette for removing HTML markup was modified due to changes in package XML.
A memory leak in the code of the fit function for LDA with method "VEM" was corrected. Thanks to Ramis Yamilov for pointing the problem out.
The included dataset AssociatedPress had row names which were of type integer and not of type character. The object was re-saved omitting the row names.
Vignettes moved from /inst/doc to /vignettes.
The source code for fitting the model using Gibbs sampling was modified because the code did not compile on Solaris. Thanks to Prof. Brian D. Ripley for pointing the problem out.
dtm2ldaformat() was modified to ensure that the resulting matrices for the documents contain integers. In addition, dtm2ldaformat() and ldaformat2dtm() were changed to also work for document-term matrices containing empty documents, and an argument was introduced to indicate whether empty documents should be removed. Thanks to Eu Jin Lok for pointing the problems out.
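A round trip between the two formats might look like the sketch below. The argument name omit_empty is taken from the current package documentation as far as known here and should be checked against the installed version; the subset of 20 documents is an arbitrary choice.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:20, ]

# Convert to the list format used by package lda:
# a list of per-document matrices plus a vocabulary vector.
lda_data <- dtm2ldaformat(dtm, omit_empty = TRUE)
str(lda_data$documents[[1]])  # expected to be an integer matrix
head(lda_data$vocab)

# Convert back to a "DocumentTermMatrix".
dtm_back <- ldaformat2dtm(lda_data$documents, lda_data$vocab,
                          omit_empty = TRUE)
dim(dtm_back)
```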
Missing 'Suggests' entries added in the DESCRIPTION file. Thanks to Prof. Brian D. Ripley for pointing the problem out.
Name tags for Rd files changed to not contain slashes. Thanks to Prof. Brian D. Ripley for pointing the problem out as indicated in bug PR14707.
A small bug fixed when saving interim results for fitting an LDA model using Gibbs sampling. Thanks to Nicholas Switanek for pointing the problem out.
Makevars.win changed due to changes on CRAN for making libgsl for Windows. Thanks to Prof. Brian D. Ripley for pointing that out.
The package vignette has been published in the Journal of Statistical Software, Volume 40, Issue 13 (http://www.jstatsoft.org/v40/i13), and the paper should be used as citation for the package; run citation("topicmodels") for details.
C code changed to allow the package to compile on Solaris systems. Thanks to Prof. Brian D. Ripley for pointing the problems out and recommending suitable changes.
C code changed to avoid warnings of unused variables.
The slots for documents and terms names are no longer restricted to be of class "vector" to allow for document-term matrices where no row and/or column names are provided.
A function perplexity() added for model validation and selection.
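As a hedged sketch of how perplexity() could be used for validation, the example below fits a model on one part of the data and evaluates it on another. The split, the number of topics and the seed are assumptions chosen only to keep the example small.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
train <- AssociatedPress[1:100, ]
test  <- AssociatedPress[101:150, ]

# Fit an LDA model with the default VEM method; k = 5 is arbitrary.
fit <- LDA(train, k = 5, control = list(seed = 1))

perplexity(fit)        # perplexity on the training data
perplexity(fit, test)  # perplexity on held-out documents
```

Lower held-out perplexity indicates a better predictive fit, so repeating this for several values of k is one way to compare models.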
The input data for LDA() and CTM() can now either be a "DocumentTermMatrix" with term-frequency weighting or an object coercible to a "simple_triplet_matrix" with integer entries.
A bug in the C++ Gibbs sampling code fixed for the random number generation. Thanks to Uwe Ligges for pointing the problem out which he noted when checking the package for the Windows platform.
New control arguments added for keeping intermediate log-likelihood values during estimation and performing repeated runs with random initialization. In addition the number of iterations made is now saved with the fitted model.
Functions ldaformat2dtm() and dtm2ldaformat() added to transform data from the lda package into a "DocumentTermMatrix" object and vice versa.
Bug fixed in rctm.c where for estimate.beta = FALSE one EM step was performed.
The control for topic models now also has a seed argument to ensure reproducibility of results and an estimate.beta argument which can be used to fix the term distribution over topics after initialization.
The control for Gibbs sampling allows repeated draws to be returned in a list using the arguments burnin, thin and iter.
In slot beta for class "TopicModel" the log parameters are stored to have a higher accuracy for the VEM code if parameter values are close to zero.
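Because beta holds logarithms, the term probabilities have to be recovered by exponentiating. The sketch below assumes a small VEM fit with arbitrary k and seed.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
fit <- LDA(AssociatedPress[1:20, ], k = 2, control = list(seed = 1))

# Slot beta stores log probabilities: one row per topic,
# one column per term.
dim(fit@beta)

# Back-transform to probabilities; each row is expected to sum
# to approximately 1.
term_probs <- exp(fit@beta)
rowSums(term_probs)
```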
Call to assert removed in C code to avoid termination of R.
Class "TopicModel" now has a slot loglikelihood. For models fitted using Gibbs sampling this contains the log-likelihood of the corpus; for VEM-fitted models it contains the vector of log-likelihoods for each document separately.
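For a VEM fit, the per-document structure of this slot can be inspected as in the following sketch. The subset, k and the seed are assumptions, and the behaviour shown is that of the package as described in this entry.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:20, ]
vem_fit <- LDA(dtm, k = 2, method = "VEM", control = list(seed = 1))

# For VEM, one log-likelihood value per document is expected.
length(vem_fit@loglikelihood)

# logLik() reports a single value for the fitted model.
logLik(vem_fit)
```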
Memory bug fixed in returnObjectGibbsLDA.
A slot save is added to the control objects to specify whether intermediate results are saved into files and, if so, with which step size.
Header files changed in utilities.cpp following advice by Prof. Brian D. Ripley.
Code for installing the package corpus.JSS.papers in the vignette improved.
dir.create() now called with showWarnings = FALSE.
Bug fixed in get_most_likely() for maximum possible k.
First version released on CRAN: 0.0-3.