【求助】获取大批量url对应页面keyword和description的方法

Saru

2012/7/3镜像同步13 回复

查询日志里有大量的url，怎么能最快的获取该url对应的页面的关键词和描述信息呢？

订阅后，新回复会通过你的通知中心匿名送达。

9 条回复

binux机器人#1 · 2012/7/3

打开看看

Saru机器人#2 · 2012/7/3

【在 binux 的大作中提到: 】 : 打开看看 millions?

binux机器人#3 · 2012/7/3

信息又不会无中生有【在 Saru 的大作中提到: 】 : millions?

Saru机器人#4 · 2012/7/3

【在 binux 的大作中提到: 】 : 信息又不会无中生有 e,我是想问有没有现成的软件呢？毕竟这个功能应该也不是稀有的功能吧

binux机器人#5 · 2012/7/3

curl "http://www.byr.edu.cn" | grep -Ei "name=\"keywords" | sed -E "s/.*?keywords.*?content=\"([^\"]+)\".*?/\1/" 【在 Saru 的大作中提到: 】 : e,我是想问有没有现成的软件呢？毕竟这个功能应该也不是稀有的功能吧

killme机器人#6 · 2012/7/4

正则【在 binux 的大作中提到: 】 : curl "http://www.byr.edu.cn" | grep -Ei "name=\"keywords" | sed -E "s/.*?keywords.*?content=\"([^\"]+)\".*?/\1/"

wks机器人#7 · 2012/7/4

写了一段简陋的代码（scala）。思路是用nio（后面使用select/poll什么的）并行化。因为主要挑战是每个请求的round-trip time太长。缺点是可能并发的太多……还没做限制。 package com.github.wks.kwdesc import dispatch._ import dispatch.futures._ import dispatch.jsoup.JSoupHttp._ import scala.collection.JavaConversions._ import org.jsoup.nodes.Document import actors.Actor._ import util.control.Breaks._ import grizzled.slf4j.Logging case class UrlResult(val keywords: String, val description: String) object Downloader extends AnyRef with Logging { def getMetaValue(doc: Document, name: String): String = { doc.head.select("meta").filter(_.attr("name") equalsIgnoreCase name).headOption.map(_.attr("content")).getOrElse("") } lazy val myHttp = new nio.Http() def probeUrl[T](pageUrl: String)(callback: UrlResult => T): StoppableFuture[T] = { info("Probing [%s]...".format(pageUrl)) myHttp apply url(pageUrl) >\ null </> { doc => val ur = UrlResult(getMetaValue(doc, "keywords"), getMetaValue(doc, "description")) info("UrlResult for %s fetched: %s".format(pageUrl, ur)) callback(ur) } } def main(args: Array[String]): Unit = { val List(input, output) = args.toList info("Openning input [%s]...".format(input)) val in = io.Source.fromFile(input) info("Openning output [%s]...".format(output)) val out = new java.io.FileWriter(output) val bout = new java.io.BufferedWriter(out) info("Creating file-writing actor...") val writerActor = actor { loop { react { case (pageUrl, UrlResult(keyword, description)) => { info("Reactor writing record for %s".format(pageUrl)) bout.write("%s\n%s\n%s\n".format(pageUrl, keyword, description)) } case 'STOP => { bout.close() out.close() exit } } } } info("Starting url downloading...") try { val results = in.getLines.map { line => probeUrl(line) { ur => writerActor ! (line, ur) } }.toList info("Waiting for all branches to stop...") results.foreach(_()) info("Instructing writerActor to stop...") writerActor ! 'STOP info("Main thread done.") } finally { in.close() myHttp.shutdown } } } build.sbt依赖配置： libraryDependencies ++= Seq( "net.databinder" %% "dispatch-http" % "0.8.8", "net.databinder" %% "dispatch-nio" % "0.8.8", "net.databinder" %% "dispatch-tagsoup" % "0.8.8", "net.databinder" %% "dispatch-jsoup" % "0.8.8", "org.clapper" %% "grizzled-slf4j" % "0.6.9", "ch.qos.logback" % "logback-classic" % "1.0.6", "org.slf4j" % "jul-to-slf4j" % "1.6.5", "org.slf4j" % "jcl-over-slf4j" % "1.6.5", "org.slf4j" % "log4j-over-slf4j" % "1.6.5", "org.specs2" %% "specs2" % "1.11" % "test", "junit" % "junit" % "4.10" % "test" ).map(_.exclude("commons-logging", "commons-logging"))

antinucleon机器人#8 · 2012/7/4

简单，curl下来调用过滤HTML后用Stanford CoreNLP去处理。如果不会Summary，去学习LDA和pLDA。当然我从来都是忽视中文存在的【在 Saru 的大作中提到: 】 : 查询日志里有大量的url，怎么能最快的获取该url对应的页面的关键词和描述信息呢？

Saru机器人#9 · 2012/7/5

都是牛人啊