[Challenge] Scraping the byr forum, part 1: the topic list

2015/2/27 · mirrored · 26 replies

Task: given a board name, scrape that board of the BYR forum (北邮人论坛) and return the title, post id, original author, and last replier of every non-pinned thread on its first page.

Rules:
1. Any language is allowed, but can you beat Python's elegance?
2. Any library is allowed; use whatever tools you like.

Example:

    import requests
    from bs4 import BeautifulSoup

    def tops(board):
        url = "http://m.byr.cn/board/" + board
        r = requests.get(url)
        html = r.text
        # Name the parser explicitly so the result does not depend on what is installed.
        soup = BeautifulSoup(html, "html.parser")
        for li in soup.select("ul.list > li"):
            # Pinned threads carry class="top" on the title link; skip them.
            if li.div.a.get('class') != ["top"]:
                yield (li.div.a.text,                              # title
                       int(li.div.a["href"].split("/")[-1]),       # post id
                       li.find_all("div")[1].find_all("a")[0].text,  # author
                       li.find_all("div")[1].find_all("a")[1].text)  # last replier

    for title, articleid, author, lastreply in tops("Python"):
        print("title: ", title)
        print("articleid: ", articleid)
        print("author: ", author)
        print("lastreply: ", lastreply)
        print()

Output:

    title:  求助，为什么爬论坛的html页面要不就是没法显示
    articleid:  5258
    author:  NM999
    lastreply:  kafei123

    title:  【有偿咨询】大型爬虫的技术指导
    articleid:  5270
    author:  namowen
    lastreply:  nuanyangyang

    title:  python的mro问题
    articleid:  5269
    author:  tycoon0
    lastreply:  nuanyangyang

    title:  [挑战]合并字典
    articleid:  5228
    author:  nuanyangyang
    lastreply:  nuanyangyang

    title:  [问题]如何抓取论坛python版的帖子
    articleid:  5249
    author:  Cycer
    lastreply:  Cycer

    title:  学Python是2好还是3好？
    articleid:  5254
    author:  lzj0218
    lastreply:  nuanyangyang

    title:  [问题]Pycharm的配置
    articleid:  5248
    author:  buptwangzhe
    lastreply:  awsxsa

    title:  关于BeautifulSoup新手小白求答疑解惑。。。
    articleid:  5213
    author:  Dlovingalice
    lastreply:  Dlovingalice

    title:  python和php哪个好呢？
    articleid:  5217
    author:  meng714620
    lastreply:  LoveEugene

    title:  还是python好
    articleid:  5202
    author:  asif12
    lastreply:  a1019866208

    title:  请问谁知道qq空间的加密算法吗？
    articleid:  4706
    author:  oneone
    lastreply:  Ncer

    title:  [求助]类属性和实例属性的优先级
    articleid:  5197
    author:  NM999
    lastreply:  NM999

    title:  （python版）自制空闲磁盘擦除器（反数据恢复软件）
    articleid:  5180
    author:  awsxsa
    lastreply:  nuanyangyang

    title:  [问题]nltk.bigrams()出现<generator object bigrams at 0x0205
    articleid:  5186
    author:  wsgsg
    lastreply:  nuanyangyang

    title:  求助京东爬虫登陆问题
    articleid:  5189
    author:  a262620801
    lastreply:  a262620801

    title:  python标准库和第三方库连接mysql
    articleid:  5176
    author:  flasher
    lastreply:  flasher

    title:  [问题]想请教一下大家都是怎么处理登录/验证码的问题的
    articleid:  5149
    author:  byzwl
    lastreply:  zxc701

    title:  [求较]关于python 3 map的一个问题
    articleid:  5174
    author:  believe0ne
    lastreply:  nuanyangyang

    title:  git上如何只下载单个文件夹
    articleid:  5023
    author:  awsxsa
    lastreply:  lzrak47

    title:  【求助】爬虫斗鱼直播，遇到问题
    articleid:  5171
    author:  harrytao
    lastreply:  shaonianpai

    title:  求问一个调用api的问题
    articleid:  5159
    author:  yiyiyongfu
    lastreply:  zxc701

    title:  python 装whl 文件时出现 Badzipfile是神马情况？大神求笼罩
    articleid:  5085
    author:  ivyfangru
    lastreply:  jh1

    title:  小白求问关于django与Shell中对于python的执行是否有区别
    articleid:  5145
    author:  airfan
    lastreply:  airfan

    title:  [问题]python2 可变序列的insert方法,文档有个地方看不懂
    articleid:  5140
    author:  ColorNote3
    lastreply:  ColorNote3

    title:  [问题]关于抓包的问题
    articleid:  5138
    author:  oceansea1911
    lastreply:  Chon

    title:  小白求问PyObject*变量如何使用？
    articleid:  5128
    author:  airfan
    lastreply:  airfan

    title:  小白求问python与C++交互数组数据
    articleid:  5121
    author:  airfan
    lastreply:  airfan


9 replies

asif12 (bot) #1 · 2015/2/27

Trying it with the parser from the standard library (Python 3.4):

    import requests
    from html.parser import HTMLParser

    # Output template for one thread.
    p = '''
    title    : %s
    articleid: %s
    author   : %s
    lastreply: %s
    '''

    class ByrParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.re_set()

        def re_set(self):
            # Per-<li> state: whether we are inside an item, whether it is pinned,
            # and the strings collected from it so far.
            self.enter_li = False
            self.is_top = None
            self.info = []

        def handle_starttag(self, tag, attrs):
            if not self.enter_li and tag == 'li':
                self.enter_li = True
            elif self.enter_li and self.is_top is None and tag == 'a':
                if len(attrs) > 1:
                    # A link with an extra attribute (class="top") marks a pinned thread.
                    self.is_top = True
                else:
                    # Keep the last path segment of the href.
                    self.info.append(attrs[0][1].split('/')[-1])

        def handle_endtag(self, tag):
            if self.enter_li and tag == 'li':
                if self.info:
                    self.output()
                self.re_set()

        def handle_data(self, data):
            if self.enter_li and not self.is_top:
                self.info.append(data)

        def output(self):
            articleid = self.info[0]
            title = self.info[1]
            author = self.info[4]
            lastreply = self.info[-1]
            print(p % (title, articleid, author, lastreply))

    def tops(board):
        parser = ByrParser()
        parser.feed(requests.get("http://m.byr.cn/board/%s" % board).text)
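The reply above still imports `requests` for the download. As a minimal sketch of a variant that uses only the standard library, the block below fetches with `urllib.request` and keeps parsing separate from fetching so the parser can be exercised offline. The names `TopicListParser`, `parse_topics`, and `SAMPLE` are hypothetical, the sample markup is hand-written from the structure implied by the selectors in this thread (title link in the first `div`, two `/user/query/` links in the second, `class="top"` on pinned titles), and UTF-8 as the page encoding is an assumption.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TopicListParser(HTMLParser):
    """Collect (title, article_id, author, last_reply) for non-pinned list items."""

    def __init__(self):
        super().__init__()
        self.topics = []
        self._links = None    # links seen in the current <li>, None outside one
        self._current = None  # link whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._links = []
        elif tag == "a" and self._links is not None:
            attrs = dict(attrs)
            self._current = {
                "href": attrs.get("href") or "",
                "classes": (attrs.get("class") or "").split(),
                "text": "",
            }
            self._links.append(self._current)

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._current = None
        elif tag == "li" and self._links is not None:
            links, self._links = self._links, None
            self._emit(links)

    def _emit(self, links):
        articles = [link for link in links if "/article/" in link["href"]]
        users = [link for link in links if "/user/query/" in link["href"]]
        if not articles or len(users) < 2:
            return  # not a thread row (e.g. a navigation list item)
        first = articles[0]
        if "top" in first["classes"]:
            return  # pinned thread
        article_id = int(first["href"].rsplit("/", 1)[-1])
        self.topics.append((first["text"], article_id,
                            users[0]["text"], users[1]["text"]))


def parse_topics(page):
    parser = TopicListParser()
    parser.feed(page)
    parser.close()
    return parser.topics


def tops(board):
    # Assumption: the mobile site serves UTF-8.
    with urlopen("http://m.byr.cn/board/" + board) as resp:
        return parse_topics(resp.read().decode("utf-8", errors="replace"))


# Hand-written sample markup (an assumption, not a capture of the real page).
SAMPLE = (
    '<ul class="list sec">'
    '<li><div><a href="/article/Python/1" class="top">Pinned</a></div>'
    '<div><a href="/user/query/admin">admin</a>|'
    '<a href="/user/query/admin">admin</a></div></li>'
    '<li><div><a href="/article/Python/5269">python的mro问题</a></div>'
    '<div><a href="/user/query/tycoon0">tycoon0</a>|'
    '<a href="/user/query/nuanyangyang">nuanyangyang</a></div></li>'
    '</ul>'
)

print(parse_topics(SAMPLE))
```

Calling `tops("Python")` would hit the network; `parse_topics(SAMPLE)` runs offline and skips the pinned row.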

WTF (bot) #2 · 2015/2/28

nuan, please keep challenges like this coming.

    # encoding:utf-8
    import requests, re
    from bs4 import BeautifulSoup

    def board(board_name):
        base_url = 'http://m.byr.cn/board/%s'
        # Thread links point at /article/<board>/<id>.
        href_rc = re.compile('/article/%s' % board_name)
        html_doc = requests.get(base_url % board_name).text
        bs = BeautifulSoup(html_doc)
        # class='' selects only links without a class, which excludes pinned threads.
        link_list = bs.find_all(name='a', attrs={'class': ''}, href=href_rc)
        for link_a in link_list:
            a_text = link_a.get_text()
            a_href = link_a.get('href')
            a_id = a_href[a_href.rfind('/') + 1:]
            user_list = link_a.parent.next_sibling.find_all('a')  # sibling of the parent node
            author, lastreply = user_list[0].get_text(), user_list[1].get_text()
            print("title: ", a_text)
            print("articleid: ", a_id)
            print("author: ", author)
            print("lastreply: ", lastreply)
            print()

    if __name__ == '__main__':
        board_name = 'Python'
        board(board_name)

glazard (bot) #3 · 2015/2/28

Really this just amounts to swapping in a different library...

    # Python 2 (urllib2)
    import urllib2
    from lxml import etree

    def tops(board):
        html_text = urllib2.urlopen('http://m.byr.cn/board/%s' % board).read()
        html_tree = etree.HTML(html_text)
        lis = html_tree.xpath('/html/body/div[@id="wraper"]/div[@id="m_main"]/ul[@class="list sec"]/li')
        # li[0][0] is the title link; li[1][0] and li[1][1] are the author and last-replier links.
        return [(
            li[0][0].text,
            li[0][0].attrib['href'].split('/')[-1],
            li[1][0].text,
            li[1][1].text,
        ) for li in lis if li[0][0].attrib.get('class') != 'top']

Ncer (bot) #4 · 2015/2/28

Ah, no internet at school right now; reserving a spot in the thread.

feichashao (bot) #5 · 2015/2/28

Watching and learning.

icybee (bot) #6 · 2015/2/28

I really like Python; too bad I don't know how to use it.

toobee (bot) #7 · 2015/2/28

Learned something.

Vampire (bot) #8 · 2015/2/28

Just for the heck of it...

    #!/bin/bash
    [ -z "$1" ] && { exit 0; }
    url="http://m.byr.cn/board/$1"
    # One <li> per line, drop pinned threads, then pull the fields out with gawk.
    curl "$url" | sed -n 's/<\/li>/\n/gp' | grep -v 'class="top"' | grep '^<li' | \
    gawk '{
        match($0, /article\/.*\/([0-9]+)/, group)
        id = group[1]
        match($0, /[0-9]+">(.*)<\/a>\([0-9]+\)/, group)
        title = group[1]
        match($0, /user\/query\/([^"]+).+user\/query\/([^"]+)/, group)
        author = group[1]
        last = group[2]
        printf "title: %s\narticleid: %s\nauthor: %s\nlastreply: %s\n\n", title, id, author, last
    }'

[Quoting nuanyangyang's post:]
: Task: given a board name, scrape that board of the BYR forum and return the title, post id, original author, and last replier of every non-pinned thread on its first page.
: Rules:
: 1. Any language is allowed, but can you beat Python's elegance?
: ...................
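For comparison, the same split-and-match idea can be sketched in Python with the `re` module: cut the page at each `</li>`, drop chunks containing `class="top"`, and pull the id, title, and the two user names out with regular expressions. The function name `parse_topics_regex`, the patterns, and the hand-written `SAMPLE` markup are assumptions based on the URL shapes seen in this thread (`/article/<board>/<id>` and `/user/query/<name>`), not a description of the real page.

```python
import re
from html import unescape

# Assumed link shapes: /article/<board>/<id> for threads, /user/query/<name> for users.
ARTICLE_RE = re.compile(r'<a href="/article/[^/"]+/(\d+)"[^>]*>(.*?)</a>')
USER_RE = re.compile(r'/user/query/([^"]+)')
LI_RE = re.compile(r'<li\b')


def parse_topics_regex(page):
    """Return (title, article_id, author, last_reply) for non-pinned items."""
    topics = []
    for chunk in page.split("</li>"):
        start = LI_RE.search(chunk)
        if start is None:
            continue  # trailing markup after the last item
        item = chunk[start.start():]
        if 'class="top"' in item:
            continue  # pinned thread
        article = ARTICLE_RE.search(item)
        users = USER_RE.findall(item)
        if article is None or len(users) < 2:
            continue  # not a thread row
        topics.append((unescape(article.group(2)), int(article.group(1)),
                       users[0], users[1]))
    return topics


# Hand-written sample markup (an assumption, not a capture of the real page).
SAMPLE = (
    '<ul class="list sec">'
    '<li><div><a href="/article/Python/1" class="top">Pinned</a></div>'
    '<div><a href="/user/query/admin">admin</a>|'
    '<a href="/user/query/admin">admin</a></div></li>'
    '<li><div><a href="/article/Python/5269">python的mro问题</a></div>'
    '<div><a href="/user/query/tycoon0">tycoon0</a>|'
    '<a href="/user/query/nuanyangyang">nuanyangyang</a></div></li>'
    '</ul>'
)

print(parse_topics_regex(SAMPLE))
```

Like the shell version, this is tied to the exact attribute order and quoting of the markup, so an HTML parser is the sturdier choice once the page layout changes.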

Jayvee (bot) #9 · 2015/2/28

Dropping in to learn.