LDA
class pyspark.mllib.clustering.LDA
Train Latent Dirichlet Allocation (LDA) model.

New in version 1.5.0.

Methods

train(rdd[, k, maxIterations, …])
    Train an LDA model.

Methods Documentation
classmethod train(rdd: pyspark.rdd.RDD[Tuple[int, VectorLike]], k: int = 10, maxIterations: int = 20, docConcentration: float = -1.0, topicConcentration: float = -1.0, seed: Optional[int] = None, checkpointInterval: int = 10, optimizer: str = 'em') → pyspark.mllib.clustering.LDAModel
Train an LDA model.

New in version 1.5.0.

Parameters
rdd : pyspark.RDD
    RDD of documents, which are tuples of document IDs and term (word) count vectors. The term count vectors are “bags of words” with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
k : int, optional
    Number of topics to infer, i.e., the number of soft cluster centers. (default: 10)
maxIterations : int, optional
    Maximum number of iterations allowed. (default: 20)
docConcentration : float, optional
    Concentration parameter (commonly named “alpha”) for the prior placed on documents’ distributions over topics (“theta”). (default: -1.0)
topicConcentration : float, optional
    Concentration parameter (commonly named “beta” or “eta”) for the prior placed on topics’ distributions over terms. (default: -1.0)
seed : int, optional
    Random seed for cluster initialization. Set as None to generate a seed based on system time. (default: None)
checkpointInterval : int, optional
    Period (in iterations) between checkpoints. (default: 10)
optimizer : str, optional
    LDAOptimizer used to perform the actual calculation. Currently “em” and “online” are supported. (default: “em”)
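
A minimal training sketch, not taken from the official examples: the local SparkContext setup and the toy two-document corpus over a two-term vocabulary are illustrative assumptions only.

    from pyspark import SparkContext
    from pyspark.mllib.linalg import Vectors, SparseVector
    from pyspark.mllib.clustering import LDA

    sc = SparkContext(appName="LDATrainExample")  # assumed setup for a standalone script

    # Each document is a (unique non-negative ID, term-count vector) pair;
    # dense and sparse vectors can be mixed in the same corpus.
    corpus = sc.parallelize([
        (1, Vectors.dense([0.0, 1.0])),
        (2, SparseVector(2, {0: 1.0})),
    ])

    # Infer k=2 topics with the default EM optimizer; a fixed seed makes the run reproducible.
    model = LDA.train(corpus, k=2, seed=1)

    print(model.vocabSize())     # 2: the vocabulary size equals the vector length
    print(model.topicsMatrix())  # vocabSize x k matrix of topic-term weights

    sc.stop()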
 
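The “online” optimizer is selected through the same entry point. In the sketch below, the corpus and the concentration values are assumptions chosen for illustration, not recommended settings; the online optimizer accepts fractional concentration values, whereas the EM optimizer is stricter (values greater than 1.0, or the -1.0 sentinel that lets Spark pick defaults).

    from pyspark import SparkContext
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.clustering import LDA

    sc = SparkContext(appName="OnlineLDAExample")  # assumed setup for a standalone script

    corpus = sc.parallelize([
        (0, Vectors.dense([1.0, 0.0, 2.0])),
        (1, Vectors.dense([0.0, 3.0, 1.0])),
        (2, Vectors.dense([2.0, 1.0, 0.0])),
    ])

    model = LDA.train(
        corpus,
        k=2,
        maxIterations=50,
        docConcentration=0.5,    # "alpha": prior over per-document topic mixtures (assumed value)
        topicConcentration=0.5,  # "beta"/"eta": prior over per-topic term weights (assumed value)
        optimizer="online",
        seed=1,
    )

    # Top terms per topic: one (term indices, term weights) pair per topic.
    for topic in model.describeTopics(maxTermsPerTopic=2):
        print(topic)

    sc.stop()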