[问题]Spark stalling during flatMap

2014/9/24镜像同步14 回复

Hi all, Has anyone observed spark stalling during a flatMap operation with the following messages : INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to [worker host] INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to [worker host] INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to [worker host] And the version of spark is 1.0.2. What we want to do is to find those similar elements in the RDD. In details, the dataset includes two parts of data, which is flagged as 1 and 2 respectively.For each element flagged as 1, we aim to find all counterparts flagged as 2. The RDD named inRDD is created from a file in HDFS. The code is as follows: inRDD.flatMap{/*do something*/}.flatMap{/*do something*/}.map{/*do some transform*/}.groupByKey() .flatMap{}.map(/*do some transform*/).groupByKey().flatMap{v=>{ val citation = new ArrayBuffer[String]() // save all elements flagged as 1 val journal = new ArrayBuffer[String]() // save all elements flagged as 2 val iter = v._2.iterator while (iter.hasNext) { val tokens = iter.next().split("#") if(tokens(2).equals("1")) citation+= tokens(4)+"#"+tokens(1) else if(tokens(2).equals("2") ) journal+= tokens(4)+"#"+tokens(1) } var bufferString = new StringBuffer() for(i<-0 until citation.length){ val citaTokens = citation(i).split("#") for(j<-0 until journal.length){ val qikanTokens = journal(j).split("#") if(distance(citaTokens(0),qikanTokens(0))) // defined similarity function bufferString = bufferString.concat(citaTokens(1)+"#"+qikanTokens(1)+"@") } } bufferString.deleteCharAt(bufferString.length()-1) bufferString.toString.split("@") } }.saveAsTextFile(args(1)) Through a series of flatmap and groupby operations, we marked and then grouped the data in order to reduce the computation space of the last flatMap operation. By the last flatMap operation, we transformed the data into tuples matched successfully.I guess it was just the flatMap operations caused the spark stalling. I wonder where the errors came from and the corresponding solutions. (谅解我比较懒，直接把发在mail list里的英文的复制过来了[ema0][ema0][ema0][ema0]) 跪求大神指点交流。

订阅后，新回复会通过你的通知中心匿名送达。

9 条回复

Hemingway机器人#1 · 2014/9/24

Hemingway机器人#2 · 2014/9/24

@nuanyangyang 召唤大神啦

nuanyangyang机器人#3 · 2014/9/24

【在 Hemingway 的大作中提到: 】 : @nuanyangyang 召唤大神啦不懂。不好意思。

Hemingway机器人#4 · 2014/9/24

[ema16] 【在 nuanyangyang 的大作中提到: 】 : : 不懂。不好意思。

zhb007机器人#5 · 2014/10/2

lz的job应该会被分成三个stage。影响速度的因素很多，建议lz通过界面看一下到底是那个stage速度慢。可以点进看一下task的运行时间，运行时的locality level，gc情况等等。 By the way, spark 的优势在于迭代运算，lz 的case不会比在 MR 上好多少。【在 Hemingway 的大作中提到: 】 : Hi all, : Has anyone observed spark stalling during a flatMap operation with the following messages : : INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to [worker host] : ...................

Hemingway机器人#6 · 2014/10/3

慢就慢在了我给出代码的flatMap里。不知道该如何有效的解决。Spark一般还是比MR快不少的，曾经有试过类似两表entity的match运算，spark比MR快了三倍，算法实现是一样的。【在 zhb007 的大作中提到: 】 : lz的job应该会被分成三个stage。 : 影响速度的因素很多，建议lz通过界面看一下到底是那个stage速度慢。可以点进看一下task的运行时间，运行时的locality level，gc情况等等。 : By the way, spark 的优势在于迭代运算，lz 的case不会比在 MR 上好多少。 : ...................

wy3434000机器人#7 · 2014/10/3

我觉得最崩溃的是那个双层for 找找矩阵运算的算法看能不能优化

zhb007机器人#8 · 2014/10/3

如果是task慢，那应该能找出是gc 的问题还是数据规模的问题。如果task本身耗时很长，可以试着优化算法，提高并行度等。尝试定位一下瓶颈所在？应该不难【在 Hemingway 的大作中提到: 】 : 慢就慢在了我给出代码的flatMap里。不知道该如何有效的解决。Spark一般还是比MR快不少的，曾经有试过类似两表entity的match运算，spark比MR快了三倍，算法实现是一样的。

Hemingway机器人#9 · 2014/10/4

开始是GC问题，主要在permanent区，默认分配24MB，job一运行瞬间99%。然后做了优化，分配了512MB空间给永久区。还是有out of heap size error吧，记不清了，忙paper放那里两周没动了。我觉得问题是出在最后的flatMap中，有new两个ArrayBuffer.这两个ArrayBuuffer主要用于遍历RDD，找到journal中每个与citation中的一个记录相似的记录。【在 zhb007 的大作中提到: 】 : 如果是task慢，那应该能找出是gc 的问题还是数据规模的问题。如果task本身耗时很长，可以试着优化算法，提高并行度等。 : 尝试定位一下瓶颈所在？应该不难 :