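I want to scrape a page on newsmth (水木), for example http://www.newsmth.net/nForum/#!article/Love/5967086, but the page content is not returned as a plain HTML document. It comes back in the response of a separate request instead (see the Firebug screenshot).
The code is as follows: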
import requests

def smthspider():
    # Request headers copied from the browser session seen in Firebug
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Host": "www.newsmth.net",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Cookie": "Hm_lvt_9c7f4d9b7c00cb5aba2c637c64a41567=1421654832,1421728515,1421827791,1421982989; tma=88525828.4893887.1420332716520.1421901287602.1421982989863.14; tmd=91.88525828.4893887.1420332716520.; nforum-left=00100; left-index=00000000000; main[UTMPUSERID]=batulu12; main[UTMPKEY]=41480197; main[UTMPNUM]=16797; Hm_lpvt_9c7f4d9b7c00cb5aba2c637c64a41567=1422001897; main[PASSWORD]=o%257E%257Fi%252C%250D%2528%257C%257D%2504U%255C%2540uKLB%251E%2529%251D%250A%2523%2509%2508; main[XWJOKE]=hoho; bfd_session_id=bfd_g=b56c782bcb75035d00006ef20011174a54a88f0d; tmc=1.88525828.74332260.1421986459313.1421986459313.1421986459313",
        "Referer": "http://www.newsmth.net/nForum/",
        "X-Requested-With": "XMLHttpRequest",
    }
    #page_url = 'http://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0'
    sms_url = 'http://www.newsmth.net/nForum/#!article/Love/5967086?ajax'
    #r = requests.get(page_url)
    r = requests.get(sms_url, headers=headers)
    # Print the decoded response body
    print(r.text)

smthspider()
But when I run it, the article data is not printed. In this situation, how can I get the page content that comes back in the response?
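One likely cause worth checking: everything after `#` in a URL is a fragment, and HTTP clients do not send the fragment to the server. The request above would therefore fetch only `http://www.newsmth.net/nForum/`, not the article. The sketch below rewrites a `#!` URL into a plain path with an `ajax` query flag. The helper name `hashbang_to_ajax_url` is hypothetical, and the target URL layout (route appended to the base path, plus `?ajax`) is an assumption inferred from the `?ajax` suffix in the code above; it should be confirmed against the request URL shown in Firebug.

```python
from urllib.parse import urlsplit, urlunsplit


def hashbang_to_ajax_url(url):
    """Rewrite a '#!' page URL into a plain URL the server can see.

    Assumptions (not confirmed by the site): the route after '#!' is
    relative to the page's base path, and the server expects an
    'ajax' query flag on the resulting URL.
    """
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # no hash-bang route, nothing to rewrite

    # The fragment may itself carry a '?query', e.g. '!article/Love/1?ajax'
    route, _, query = parts.fragment[1:].partition("?")
    path = parts.path.rstrip("/") + "/" + route.lstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, query or "ajax", ""))


if __name__ == "__main__":
    page = "http://www.newsmth.net/nForum/#!article/Love/5967086?ajax"
    # Only the path and query reach the server; the fragment does not.
    print(urlsplit(page).path, repr(urlsplit(page).query))
    print(hashbang_to_ajax_url(page))
```

The rewritten URL could then be passed to `requests.get(ajax_url, headers=headers)` in place of `sms_url`.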
This is a mirrored post. Source: BYR Forum (北邮人论坛) / python / #5013, synced on 2015/1/26.
Problem scraping newsmth (水木), advice wanted
batulu12
2015/1/26 · mirror sync · 6 replies