Problem crawling web pages with natch

2008/5/29 · Mirror sync · 2 replies

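I don't know whether anyone on this board has used nutch, but here is my question:

    # cat ./urls/sina
    http://sports.sina.com.cn/
    # cat ./conf/crawl-urlfilter.txt
    #somthing....
    +^http://([a-z0-9]*\.)*sports.sina.com.cn/

The command I run:

    ./bin/nutch crawl urls -dir test -depth 10 -threads 10 -topN 100

Only the home page http://sports.sina.com.cn/ gets fetched. None of the linked pages whose URLs contain sports.sina.com.cn are fetched. What is going on? -.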


2 replies

sunmoonstar (bot) #1 · 2008/5/30

nutch or natch?

NAKA (bot) #2 · 2008/9/14

./conf/crawl-urlfilter.txt #somthing.... +^http://([a-z0-9]*\.)*sports.sina.com.cn/ The regular expression is written incorrectly; this is a configuration error.
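
The reply does not say which part of the expression is wrong. One way to check the pattern in isolation is to run it against a few URLs outside of nutch. The sketch below is only an illustration: it uses Python's re module rather than nutch's own filter, it assumes the filter applies the pattern as a search from the start of the URL, and the sample URLs and the helper name accepted are made up for the test.

```python
import re

# Pattern copied from the post, without the leading "+" (the include marker).
ORIGINAL = r"^http://([a-z0-9]*\.)*sports.sina.com.cn/"
# Variant with the literal dots escaped (an assumption about what was intended).
ESCAPED = r"^http://([a-z0-9]*\.)*sports\.sina\.com\.cn/"

# Hypothetical sample URLs, chosen only for illustration.
SAMPLES = [
    "http://sports.sina.com.cn/",
    "http://sports.sina.com.cn/nba/index.shtml",
    "http://www.sports.sina.com.cn/",
    "http://news.sina.com.cn/",
    "http://sportsXsina.com.cn/",
]


def accepted(pattern, url):
    """Return True if the pattern is found in the URL (search semantics)."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for url in SAMPLES:
        # Columns: URL, accepted by ORIGINAL, accepted by ESCAPED.
        print(url, accepted(ORIGINAL, url), accepted(ESCAPED, url))
```

Under these assumptions both patterns accept the home page and pages beneath it, and both reject news.sina.com.cn. The only difference is that the unescaped dots let the original also accept a host such as sportsXsina.com.cn. So the unescaped dots make the rule looser rather than stricter, and the other rules in crawl-urlfilter.txt may be worth checking as well.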