Modifying the MALLET code to compute P(w|t)

2012/5/7 · mirror sync · 3 replies

I've recently been reading through the MALLET project (from UMass Amherst) and I don't follow it very well. There are plenty of experts here, so I'm hoping someone who has worked with it can give me a brief introduction to the project.

I also need P(w|t), the probability of a word given a topic. As far as I can find, MALLET only provides a method for computing the probability of each topic in a document. I found the following exchange online:

> On Tue, May 3, 2011 at 10:48 AM, Steven Bethard wrote:
>> TopicInferencer.getSampledDistribution gives you a double[] representing the topic distribution for the entire instance (document). Is there a way to get the per-word topic distributions?

On May 9, 2011, at 6:26 PM, David Mimno wrote:
> It doesn't look like there's an easy way without digging into the
> sampling code. You'd need to add an additional data structure to store
> token-topic distributions, and update it from the "topics" array after
> every sampling round. Once you're done, you'll need a way to pass it
> back -- keeping the token-topic distributions as a state variable and
> adding a callback function to pick up the distribution after every
> document might be the best option.

Thanks for the response. I ended up using the Stanford Topic Modeling Toolbox instead, which supports per-word topic distributions out of the box, but the approach above sounds plausible if I ever end up going back to the Mallet code.

The URL is: http://article.gmane.org/gmane.comp.ai.mallet.devel/1482/match=getting+topic +distribution

If anyone has made a similar modification, please share your experience. I'd be very grateful.
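The bookkeeping David Mimno describes can be sketched outside MALLET. The Python below is a toy illustration and not MALLET's code: the function name `infer_token_topics`, its parameters, and the example numbers are all hypothetical. It Gibbs-samples topics for one document with the topic-word matrix held fixed and, after every sampling round, updates an extra per-token count table from the `topics` array. It simply returns the result instead of using a callback.

```python
import random


def infer_token_topics(tokens, phi, alpha, sweeps=200, burn_in=50, seed=0):
    """Estimate a topic distribution for every token of one document.

    tokens: list of word ids; phi[k][w]: P(word w | topic k), held fixed.
    """
    rng = random.Random(seed)
    num_topics = len(phi)
    # Current topic of each token (the analogue of the "topics" array).
    topics = [rng.randrange(num_topics) for _ in tokens]
    doc_topic = [0] * num_topics
    for k in topics:
        doc_topic[k] += 1
    # Extra data structure: token_topic[i][k] counts how often token i
    # was found in topic k at the end of a sampling round.
    token_topic = [[0] * num_topics for _ in tokens]

    for sweep in range(sweeps):
        for i, w in enumerate(tokens):
            # Remove the token, resample its topic, put it back.
            doc_topic[topics[i]] -= 1
            weights = [(doc_topic[k] + alpha) * phi[k][w]
                       for k in range(num_topics)]
            topics[i] = rng.choices(range(num_topics), weights=weights)[0]
            doc_topic[topics[i]] += 1
        if sweep >= burn_in:
            # Update the per-token counts from the topics array.
            for i, k in enumerate(topics):
                token_topic[i][k] += 1

    kept = sweeps - burn_in
    return [[count / kept for count in row] for row in token_topic]


# Hypothetical two-topic model over a three-word vocabulary.
phi = [[0.80, 0.15, 0.05],
       [0.05, 0.15, 0.80]]
tokens = [0, 0, 1, 2]
result = infer_token_topics(tokens, phi, alpha=0.5)
for word_id, dist in zip(tokens, result):
    print(word_id, [round(p, 2) for p in dist])
```

Note that this yields a topic distribution per token, P(t | token), which is what the mailing-list thread asks about. It is a different quantity from the P(w|t) asked for in this post.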


3 replies

oypz [bot] #1 · 2012/5/8

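Bump~

[zhangjun0806 (气有浩然) wrote:]
: I've recently been reading through the MALLET project (from UMass Amherst) and I don't follow it very well. There are plenty of experts here, so I'm hoping someone who has worked with it can give me
: a brief introduction to the project.
: I also need P(w|t), the probability of a word given a topic. MALLET only provides the probability of each topic
: ...................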

zhangjun0806 [bot] #2 · 2012/5/8

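A senior member on 水木社区 (the SMTH forum) answered my question quite carefully, so I'm sharing his answer here. I've only just started with topic models, so I hope someone who understands this can explain his answer in more detail and in plainer terms:

[Reply from 水木社区] It's the same as computing the probability of a topic in a document: one is the \theta matrix, the other is the \phi matrix.

P(topic | document) = (number of tokens in the document assigned to the topic + alpha) / sum_k (number of tokens in the document assigned to topic k + alpha)

P(word | topic) = (number of times the word is assigned to the topic + beta) / sum_w (number of times word w is assigned to the topic + beta)

[zhangjun0806 wrote:]
: I've recently been reading through the MALLET project (from UMass Amherst) and I don't follow it very well. There are plenty of experts here, so I'm hoping someone who has worked with it can give me
: a brief introduction to the project.
: I also need P(w|t), the probability of a word given a topic. MALLET only provides the probability of each topic
: ...................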
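To make the two formulas concrete, here is a minimal Python sketch that applies them to small count tables. The counts, the `normalize_counts` helper, and the prior values are made up for illustration, and symmetric alpha and beta are assumed. This is not MALLET code; in practice the counts would come from the topic assignments of a trained model.

```python
def normalize_counts(counts, prior):
    """Turn each row of counts into a smoothed probability distribution:
    (count + prior) / sum over the row of (count + prior)."""
    result = []
    for row in counts:
        total = sum(row) + prior * len(row)
        result.append([(c + prior) / total for c in row])
    return result


# Hypothetical toy counts for 2 documents, 2 topics, 3 words.
# doc_topic[d][k]: tokens in document d assigned to topic k.
doc_topic = [[3, 1],
             [0, 4]]
# topic_word[k][w]: times word w is assigned to topic k.
topic_word = [[2, 1, 0],
              [0, 1, 4]]
alpha, beta = 0.5, 0.1  # assumed symmetric priors

theta = normalize_counts(doc_topic, alpha)  # theta[d][k] = P(topic k | document d)
phi = normalize_counts(topic_word, beta)    # phi[k][w]   = P(word w | topic k)

for k, row in enumerate(phi):
    print("topic", k, [round(p, 3) for p in row])
```

Each row of `phi` is the P(w|t) distribution for one topic and sums to 1. The same function gives `theta` because the two formulas differ only in which count table and which prior they use.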

namisan [bot] #3 · 2012/5/23

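The MALLET code is extremely messy. David M. Blei's C code is comparatively clear, and if you want to study an implementation, I recommend that version. It uses variational Bayes (VB) inference, though, whereas Gibbs sampling is the popular choice for inference these days. There is also a Java implementation of the Gibbs sampler that is clear and easy to follow.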