Re: A question about Spark, asking for help

jiangwan

2016/4/28 · mirrored · 3 replies

Spark can use jieba directly; just import the package.

    import jieba
    words = document.map(lambda w: "/".join(jieba.cut_for_search(w)))


3 replies

petpetpet2 [bot] #1 · 2016/4/28

Oh, thank you very much!

[Quoting jiangwan:]
: Spark can use jieba directly; just import the package.
: words=document.map(lambda w:"/".join(jieba.cut_for_search(w)))

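petpetpet2 [bot] #2 · 2016/5/3

Hello, I'd like to ask you a question. I'm running TF-IDF on Spark, and the results are all stored as sparse vectors, but some of Spark's classification methods seem to read dense vectors. I'm a beginner: at the moment I convert the sparse vectors straight to dense vectors and then run the computation, which feels like it adds unnecessary complexity. Could you give me some guidance?

[Quoting jiangwan:]
: Spark can use jieba directly; just import the package.
: words=document.map(lambda w:"/".join(jieba.cut_for_search(w)))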

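jiangwan [bot] #3 · 2016/5/3

I believe the vectors are turned into labeled points before training. A labeled point is a local vector, either dense or sparse, associated with a label/response:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Create a labeled point with a positive label and a dense feature vector.
    val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

    // Create a labeled point with a negative label and a sparse feature vector.
    val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

Reference: http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point

[Quoting petpetpet2:]
: Hello, I'd like to ask you a question. I'm running TF-IDF on Spark, and the results are all stored as sparse vectors, but some of Spark's classification methods seem to read dense vectors. I'm a beginner: at the moment I convert the sparse vectors straight to dense vectors and then run the computation, which feels like it adds unnecessary complexity. Could you give me some guidance?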
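
A toy illustration of why a sparse feature vector does not need to be expanded before it is scored against a model. This is plain Python, not Spark's implementation: the `SparseVec` class, the `dense_dot` helper, and the weight values are all hypothetical. Only the feature values (size 3, indices 0 and 2, values 1.0 and 3.0) come from the labeled-point example in reply #3.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SparseVec:
    """Hypothetical stand-in for a sparse vector: a size plus parallel index/value lists."""
    size: int
    indices: List[int]
    values: List[float]

    def to_dense(self) -> List[float]:
        # Expanding allocates `size` slots even if almost all of them are zero.
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense

    def dot(self, weights: List[float]) -> float:
        # Only the stored non-zero entries contribute, so the cost grows with
        # len(indices) rather than with size.
        return sum(v * weights[i] for i, v in zip(self.indices, self.values))


def dense_dot(features: List[float], weights: List[float]) -> float:
    # Reference computation that visits every position, zeros included.
    return sum(f * w for f, w in zip(features, weights))


# Same feature values as the sparse and dense vectors in reply #3.
sparse = SparseVec(3, [0, 2], [1.0, 3.0])
dense = [1.0, 0.0, 3.0]

# Hypothetical model weights, chosen only for the demonstration.
weights = [0.5, -2.0, 0.25]

sparse_score = sparse.dot(weights)
dense_score = dense_dot(dense, weights)
print(sparse_score, dense_score)
```

Both scores come out the same, but the sparse version never touches the zero entries. With TF-IDF output, where the vector size is the vocabulary or hash-space size and each document has few non-zero terms, that difference is what makes converting to dense costly.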