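Here's the situation: I originally designed the crawler to walk the forum one section at a time, enter each board in the section, and collect the post links from the first page. That works fine and takes a bit over 40 seconds.
Now I want to use two threads, written as follows: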
q = Queue.Queue()
Num = 2

class ThreadUrl(threading.Thread):
    def __init__(self, q):
        threading.Thread.__init__(self)
        self.q = q

    def run(self):
        while True:
            http_query = self.q.get()
            crawler = Crawler()
            urls = crawler._parse_html_to_urls(**http_query)
            crawler._put_urls_into_redis(urls)
            self.q.task_done()
class Crawler:
    .......
    .......
    .......
    def run(self):
        print "start crawler ..."
        start_time = time.time()
        for i in range(Num):
            t = ThreadUrl(q)
            t.setDaemon(True)
            t.start()
        for http_query in self.http_querys:  # http_querys is a tuple holding the forum section info
            q.put(http_query)
        q.join()
        print time.time() - start_time
        print "finish crawler ..."
HTTP_QUERYS = (
    {
        'host' : 'http://bbs.byr.cn',
        'url' : 'http://bbs.byr.cn/section/3',
        'headers' : {
            "X-Requested-With" : "XMLHttpRequest",
        },
        'href' : "^/board/(.*?)$",
        'source' : u'北邮人论坛-信息社会',
    },
    {
        'host' : 'http://bbs.byr.cn',
        'url' : 'http://bbs.byr.cn/section/2',
        'headers' : {
            "X-Requested-With" : "XMLHttpRequest",
        },
        'href' : "^/board/(.*?)$",
        'source' : u'北邮人论坛-学术科技',
    },
    {
        'host' : 'http://bbs.byr.cn',
        'url' : 'http://bbs.byr.cn/section/0',
        'headers' : {
            "X-Requested-With" : "XMLHttpRequest",
        },
        'href' : "^/board/(.*?)$",
        'source' : u'北邮人论坛-本站站务',
    },
    {
        'host' : 'http://bbs.byr.cn',
        'url' : 'http://bbs.byr.cn/section/1',
        'headers' : {
            "X-Requested-With" : "XMLHttpRequest",
        },
        'href' : "^/board/(.*?)$",
        'source' : u'北邮人论坛-北邮校园',
    },
    {
        'host' : 'http://bbs.byr.cn',
        'url' : 'http://bbs.byr.cn/section/4',
        'headers' : {
            "X-Requested-With" : "XMLHttpRequest",
        },
        'href' : "^/board/(.*?)$",
        'source' : u'北邮人论坛-人文艺术',
    },
    {
        'host' : 'http://bbs.byr.cn',
        'url' : 'http://bbs.byr.cn/section/5',
        'headers' : {
            "X-Requested-With" : "XMLHttpRequest",
        },
        'href' : "^/board/(.*?)$",
        'source' : u'北邮人论坛-生活时尚',
    },
    {
        'host' : 'http://bbs.byr.cn',
        'url' : 'http://bbs.byr.cn/section/6',
        'headers' : {
            "X-Requested-With" : "XMLHttpRequest",
        },
        'href' : "^/board/(.*?)$",
        'source' : u'北邮人论坛-休闲娱乐',
    },
    {
        'host' : 'http://bbs.byr.cn',
        'url' : 'http://bbs.byr.cn/section/7',
        'headers' : {
            "X-Requested-With" : "XMLHttpRequest",
        },
        'href' : "^/board/(.*?)$",
        'source' : u'北邮人论坛-体育健身',
    },
    {
        'host' : 'http://bbs.byr.cn',
        'url' : 'http://bbs.byr.cn/section/8',
        'headers' : {
            "X-Requested-With" : "XMLHttpRequest",
        },
        'href' : "^/board/(.*?)$",
        'source' : u'北邮人论坛-游戏对战',
    },
)
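A side note on the `ThreadUrl.run` loop above: if `_parse_html_to_urls` raises, the thread exits before `task_done()` is called, so `q.join()` can never return. Below is a minimal sketch of a worker that always marks the item done and records the failure instead. It is written for Python 3 (`queue` rather than `Queue`), and `fetch_urls` is a hypothetical stand-in for the crawler call, not the code from the post.

```python
import queue
import threading


def fetch_urls(http_query):
    # Hypothetical stand-in for crawler._parse_html_to_urls(**http_query).
    # A query with a truthy "fail" key simulates a refused connection.
    if http_query.get("fail"):
        raise IOError("Connection refused")
    return [http_query["url"] + "/post/1"]


def worker(q, results, errors):
    while True:
        http_query = q.get()
        try:
            results.extend(fetch_urls(http_query))
        except Exception as exc:
            # Record the failure instead of letting the thread die.
            errors.append((http_query["url"], exc))
        finally:
            # Always mark the item done so q.join() can return.
            q.task_done()


def crawl(http_queries, num_threads=2):
    q = queue.Queue()
    results, errors = [], []
    for _ in range(num_threads):
        t = threading.Thread(target=worker, args=(q, results, errors))
        t.daemon = True  # daemon threads do not block interpreter exit
        t.start()
    for http_query in http_queries:
        q.put(http_query)
    q.join()  # returns once every queued item has been marked done
    return results, errors
```

With this shape, one failing section shows up in `errors` while the remaining sections are still crawled.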
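However, I'm not sure whether the problem is in how I'm using threads or somewhere else. It fails with the errors below.
Could someone experienced give me a few pointers?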
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 808, in __bootstrap_inner
    self.run()
  File "thread-main.py", line 48, in run
    urls = crawler._parse_html_to_urls(**http_query)
  File "thread-main.py", line 114, in _parse_html_to_urls
    r2 = requests.get(url['href'], headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 60, in get
    return request('get', url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 49, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 457, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 569, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 407, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', error(111, 'Connection refused'))

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 808, in __bootstrap_inner
    self.run()
  File "thread-main.py", line 48, in run
    urls = crawler._parse_html_to_urls(**http_query)
  File "thread-main.py", line 114, in _parse_html_to_urls
    r2 = requests.get(url['href'], headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 60, in get
    return request('get', url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 49, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 457, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 569, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 407, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', error(111, 'Connection refused'))
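Both tracebacks end in error 111 (Connection refused) raised from `requests.get`, meaning the TCP connection itself was refused. The thread does not establish why; one possibility, which is only an assumption here, is that the server turns away requests arriving too quickly. If that is the case, retrying with a growing delay may help. A minimal Python 3 sketch of such a helper, where `retry` and its parameters are hypothetical names for illustration:

```python
import time


def retry(func, attempts=3, base_delay=0.5, exceptions=(IOError,), sleep=time.sleep):
    """Call func(); on a listed exception wait and try again, doubling the delay."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
```

In the crawler this might be used as `retry(lambda: requests.get(url, headers=headers), exceptions=(requests.exceptions.ConnectionError,))`. The `sleep` argument exists only so the delay can be replaced when testing.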
This is a mirrored post. Source: 北邮人论坛 / python / #4073, synced on 2014/11/9
Python (posted by bot)
A question about multithreading in a crawler
buptmuye
2014/11/9 · mirror synced · 13 replies
9 replies
Async? How so?
【 reverland (从未如此热爱过生活) wrote: 】
: I don't think this kind of job suits threads. Async?
: From 「北邮人论坛手机版」
Posted via 『我邮2.0』
....=.=!
【 byr10th (JUST DO NOT GIVE UP || 4X粉丝团团长) wrote: 】
: asynchronous
Posted via 『我邮2.0』
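For what the async suggestion could look like, here is a minimal Python 3 sketch using the standard `asyncio` module with a semaphore capping how many fetches run at once. `fetch_board` is a hypothetical stand-in that only sleeps; a real crawler would need a non-blocking HTTP client in its place, and nothing here comes from the original code.

```python
import asyncio


async def fetch_board(url):
    # Hypothetical stand-in for a non-blocking HTTP request.
    await asyncio.sleep(0.01)
    return url + "/post/1"


async def crawl(urls, limit=2):
    # At most `limit` fetches are in flight at any moment.
    sem = asyncio.Semaphore(limit)

    async def bounded(url):
        async with sem:
            return await fetch_board(url)

    # gather returns results in the same order as the input urls.
    return await asyncio.gather(*(bounded(u) for u in urls))


def main(urls):
    return asyncio.run(crawl(urls))
```

The `limit` argument plays the same role as `Num` in the threaded version: it bounds concurrency without creating one thread per request.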
Can't Python be used to write servers too?
【 namowen wrote: 】
: Isn't nodejs for writing servers?
: You could take a look at nodejs. I feel it is better suited to crawlers than Python.
Does nodejs have crawler modules as mature as Python's? I haven't used it, so I don't know.
【 json123 wrote: 】
Can't Python be used to write servers too?
【 namowen...
phantomjs is worth a look.
【 namowen wrote: 】
: Does nodejs have crawler modules as mature as Python's? I haven't used it, so I don't know.
: Can't Python be used to write servers too?