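Requirement: given a board name, crawl that board on the BYR forum (北邮人论坛) and return the title, thread id, original author, and last replier of every non-pinned thread on its first page.
Restrictions:
1. Any language is allowed, but ask yourself: can you beat Python's elegance?
2. Any library is allowed; use whatever tools you like.
Example: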
import requests
from bs4 import BeautifulSoup

def tops(board):
    """Yield (title, articleid, author, lastreply) for each non-pinned thread."""
    url = "http://m.byr.cn/board/" + board
    r = requests.get(url)
    html = r.text
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(html, "html.parser")
    for li in soup.select("ul.list > li"):
        # Pinned threads carry class="top" on their title link.
        if li.div.a.get('class') != ["top"]:
            yield (li.div.a.text,
                   int(li.div.a["href"].split("/")[-1]),
                   li.find_all("div")[1].find_all("a")[0].text,
                   li.find_all("div")[1].find_all("a")[1].text)

for title, articleid, author, lastreply in tops("Python"):
    print("title: ", title)
    print("articleid: ", articleid)
    print("author: ", author)
    print("lastreply: ", lastreply)
    print()
Output:
title: 求助,为什么爬论坛的html页面要不就是没法显示
articleid: 5258
author: NM999
lastreply: kafei123
title: 【有偿咨询】大型爬虫的技术指导
articleid: 5270
author: namowen
lastreply: nuanyangyang
title: python的mro问题
articleid: 5269
author: tycoon0
lastreply: nuanyangyang
title: [挑战]合并字典
articleid: 5228
author: nuanyangyang
lastreply: nuanyangyang
title: [问题]如何抓取论坛python版的帖子
articleid: 5249
author: Cycer
lastreply: Cycer
title: 学Python是2好还是3好?
articleid: 5254
author: lzj0218
lastreply: nuanyangyang
title: [问题]Pycharm的配置
articleid: 5248
author: buptwangzhe
lastreply: awsxsa
title: 关于BeautifulSoup新手小白求答疑解惑。。。
articleid: 5213
author: Dlovingalice
lastreply: Dlovingalice
title: python和php哪个好呢?
articleid: 5217
author: meng714620
lastreply: LoveEugene
title: 还是python好
articleid: 5202
author: asif12
lastreply: a1019866208
title: 请问谁知道qq空间的加密算法吗?
articleid: 4706
author: oneone
lastreply: Ncer
title: [求助]类属性和实例属性的优先级
articleid: 5197
author: NM999
lastreply: NM999
title: (python版)自制空闲磁盘擦除器(反数据恢复软件)
articleid: 5180
author: awsxsa
lastreply: nuanyangyang
title: [问题]nltk.bigrams()出现<generator object bigrams at 0x0205
articleid: 5186
author: wsgsg
lastreply: nuanyangyang
title: 求助京东爬虫登陆问题
articleid: 5189
author: a262620801
lastreply: a262620801
title: python标准库和第三方库连接mysql
articleid: 5176
author: flasher
lastreply: flasher
title: [问题]想请教一下大家都是怎么处理登录/验证码的问题的
articleid: 5149
author: byzwl
lastreply: zxc701
title: [求较]关于python 3 map的一个问题
articleid: 5174
author: believe0ne
lastreply: nuanyangyang
title: git上如何只下载单个文件夹
articleid: 5023
author: awsxsa
lastreply: lzrak47
title: 【求助】爬虫斗鱼直播,遇到问题
articleid: 5171
author: harrytao
lastreply: shaonianpai
title: 求问一个调用api的问题
articleid: 5159
author: yiyiyongfu
lastreply: zxc701
title: python 装whl 文件时出现 Badzipfile是神马情况?大神求笼罩
articleid: 5085
author: ivyfangru
lastreply: jh1
title: 小白求问关于django与Shell中对于python的执行是否有区别
articleid: 5145
author: airfan
lastreply: airfan
title: [问题]python2 可变序列的insert方法,文档有个地方看不懂
articleid: 5140
author: ColorNote3
lastreply: ColorNote3
title: [问题]关于抓包的问题
articleid: 5138
author: oceansea1911
lastreply: Chon
title: 小白求问PyObject*变量如何使用?
articleid: 5128
author: airfan
lastreply: airfan
title: 小白求问python与C++交互数组数据
articleid: 5121
author: airfan
lastreply: airfan
This is a mirrored post. Source: 北邮人论坛 (BYR forum) / python / #5289, synced on 2015/2/27
[Challenge] Crawling the BYR forum, part 1: thread list
nuanyangyang
2015/2/27, mirror synced, 26 replies
9 replies
Let's try the built-in library (Python 3.4):
import requests
from html.parser import HTMLParser

# Output template for one thread.
p='''
title : %s
articleid: %s
author : %s
lastreply: %s
'''

class ByrParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.re_set()

    def re_set(self):
        # Per-<li> state: inside an <li>?, pinned?, collected strings.
        self.enter_li=False
        self.is_top=None
        self.info=[]

    def handle_starttag(self, tag, attrs):
        if not self.enter_li and tag=='li':
            self.enter_li=True
        elif self.enter_li and self.is_top is None and tag=='a':
            # More than one attribute on the link is taken to mean a pinned thread (class="top").
            if len(attrs)>1:
                self.is_top = True
            else:
                self.info.append(attrs[0][1].split('/')[-1])

    def handle_endtag(self, tag):
        if self.enter_li and tag=='li':
            if self.info:
                self.output()
            self.re_set()

    def handle_data(self, data):
        if self.enter_li and not self.is_top:
            self.info.append(data)

    def output(self):
        articleid=self.info[0]
        title=self.info[1]
        author=self.info[4]
        lastreply=self.info[-1]
        print(p%(title,articleid,author,lastreply))

def tops(board):
    parser = ByrParser()
    parser.feed(requests.get("http://m.byr.cn/board/%s"%board).text)
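The event-driven approach above can be tried without network access. What follows is a minimal sketch, not the poster's code: it uses only the standard library, returns tuples instead of printing, and runs on a hypothetical fragment (`SAMPLE`) whose shape is inferred from the selectors used in this thread, so the live page may differ. `ThreadListParser` and `parse_threads` are illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical markup, modeled on what the parsers in this thread expect.
SAMPLE = '''
<ul class="list sec">
<li><div><a class="top" href="/article/Python/1">Pinned notice</a>(5)</div>
<div>2015-02-01 <a href="/user/query/admin">admin</a>|2015-02-02 <a href="/user/query/bob">bob</a></div></li>
<li><div><a href="/article/Python/5258">First thread</a>(3)</div>
<div>2015-02-26 <a href="/user/query/NM999">NM999</a>|2015-02-27 <a href="/user/query/kafei123">kafei123</a></div></li>
</ul>
'''

class ThreadListParser(HTMLParser):
    """Collect (title, articleid, author, lastreply) for non-pinned threads."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # dict for the <li> being parsed, or None
        self._field = None  # "title" or "user" while inside a relevant <a>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._row = {"pinned": False, "users": []}
        elif tag == "a" and self._row is not None:
            attrs = dict(attrs)
            href = attrs.get("href") or ""
            if "/article/" in href:
                # Assumption: pinned threads are marked with class="top".
                self._row["pinned"] = "top" in (attrs.get("class") or "").split()
                self._row["articleid"] = int(href.rsplit("/", 1)[-1])
                self._row["title"] = ""
                self._field = "title"
            elif "/user/query/" in href:
                self._row["users"].append("")
                self._field = "user"

    def handle_data(self, data):
        if self._row is None or self._field is None:
            return  # text outside the links we care about
        if self._field == "title":
            self._row["title"] += data
        else:
            self._row["users"][-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._field = None
        elif tag == "li" and self._row is not None:
            row, self._row = self._row, None
            if not row["pinned"] and "title" in row and len(row["users"]) >= 2:
                self.rows.append((row["title"], row["articleid"],
                                  row["users"][0], row["users"][-1]))

def parse_threads(html):
    parser = ThreadListParser()
    parser.feed(html)
    parser.close()
    return parser.rows

print(parse_threads(SAMPLE))
```

Keying on the `href` prefixes rather than on positions in a flat list means stray text such as the reply count or the dates does not shift the fields.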
nuan, please keep these challenges coming!
# encoding:utf-8
import requests, re
from bs4 import BeautifulSoup

def board(board_name):
    base_url = 'http://m.byr.cn/board/%s'
    href_rc = re.compile('/article/%s' % board_name)
    html_doc = requests.get(base_url % board_name).text
    # Name the parser explicitly so bs4 does not have to guess one.
    bs = BeautifulSoup(html_doc, 'html.parser')
    # class='' is meant to keep only links without a class attribute, i.e. the non-pinned threads.
    link_list = bs.find_all(name='a', attrs={'class': ''}, href=href_rc)
    for link_a in link_list:
        a_text = link_a.get_text()
        a_href = link_a.get('href')
        a_id = a_href[a_href.rfind('/') + 1:]
        user_list = link_a.parent.next_sibling.find_all('a')  # the parent node's sibling
        author, lastreply = user_list[0].get_text(), user_list[1].get_text()
        print("title: ", a_text)
        print("articleid: ", a_id)
        print("author: ", author)
        print("lastreply: ", lastreply)
        print()

if __name__ == '__main__':
    board_name = 'Python'
    board(board_name)
Really, this just amounts to swapping in a different library...
# Python 2: urllib2 was folded into urllib.request in Python 3.
import urllib2
from lxml import etree

def tops(board):
    html_text = urllib2.urlopen('http://m.byr.cn/board/%s' % board).read()
    html_tree = etree.HTML(html_text)
    lis = html_tree.xpath('/html/body/div[@id="wraper"]/div[@id="m_main"]/ul[@class="list sec"]/li')
    return [(
        li[0][0].text,
        li[0][0].attrib['href'].split('/')[-1],
        li[1][0].text,
        li[1][1].text,
    ) for li in lis if li[0][0].attrib.get('class') != 'top']
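The positional indexing used above (`li[0][0]`, `li[1][1]`) can be checked offline. This is only a sketch under stated assumptions: it swaps lxml for the standard library's `xml.etree.ElementTree`, which needs well-formed input, and it runs on a hypothetical fragment (`SAMPLE`) modeled on the structure this thread's code expects. `tops_from_fragment` is an illustrative name; real pages are HTML and would generally still need a forgiving parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the board's <ul class="list sec">.
SAMPLE = (
    '<ul class="list sec">'
    '<li><div><a class="top" href="/article/Python/1">Pinned notice</a>(5)</div>'
    '<div>2015-02-01 <a href="/user/query/admin">admin</a>|'
    '2015-02-02 <a href="/user/query/bob">bob</a></div></li>'
    '<li><div><a href="/article/Python/5258">First thread</a>(3)</div>'
    '<div>2015-02-26 <a href="/user/query/NM999">NM999</a>|'
    '2015-02-27 <a href="/user/query/kafei123">kafei123</a></div></li>'
    '</ul>'
)

def tops_from_fragment(xml_text):
    """Return (title, articleid, author, lastreply) tuples for non-pinned items."""
    root = ET.fromstring(xml_text)  # root is the <ul> element
    return [(
        li[0][0].text,                           # first <div>, first <a>: title
        li[0][0].attrib['href'].split('/')[-1],  # last path segment: thread id
        li[1][0].text,                           # second <div>, first <a>: author
        li[1][1].text,                           # second <div>, second <a>: last replier
    ) for li in root.findall('li') if li[0][0].attrib.get('class') != 'top']

print(tops_from_fragment(SAMPLE))
```

As in the reply above, the thread id stays a string here; wrap it in `int()` if a number is wanted.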
Just for the heck of it...
#!/bin/bash
[ -z "$1" ] && { exit 0; }
url="http://m.byr.cn/board/$1"
# Split the page at every </li>, drop pinned threads, keep the <li> lines,
# then extract the fields with gawk.
curl "$url" | sed -n 's/<\/li>/\n/gp' | grep -v 'class="top"' | grep '^<li' | \
gawk '{
    match($0, /article\/.*\/([0-9]+)/, group)
    id = group[1]
    match($0, /[0-9]+">(.*)<\/a>\([0-9]+\)/, group)
    title = group[1]
    match($0, /user\/query\/([^"]+).+user\/query\/([^"]+)/, group)
    author = group[1]
    last = group[2]
    printf "title: %s\narticleid: %s\nauthor: %s\nlastreply: %s\n\n", title, id, author, last
}'
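[ nuanyangyang wrote: ]
: Requirement: given a board name, crawl that board on the BYR forum (北邮人论坛) and return the title, thread id, original author, and last replier of every non-pinned thread on its first page.
: Restrictions:
: 1. Any language is allowed, but ask yourself: can you beat Python's elegance?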
: ...................
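For comparison, the three gawk `match()` patterns from the shell reply above translate almost directly to Python's `re` module. This is a rough sketch, not a port anyone in the thread posted: the id pattern is lightly tightened (`[^/"]+` instead of `.*`), `parse_line` is an illustrative name, and `SAMPLE_LINE` is a hypothetical line of the kind the `sed` step would produce (one `<li>` with its closing `</li>` already consumed).

```python
import re

# Hypothetical input line, shaped like one <li> after the sed split.
SAMPLE_LINE = (
    '<li><div><a href="/article/Python/5258">First thread</a>(3)</div>'
    '<div>2015-02-26 <a href="/user/query/NM999">NM999</a>|'
    '2015-02-27 <a href="/user/query/kafei123">kafei123</a></div>'
)

ID_RE = re.compile(r'article/[^/"]+/([0-9]+)')
TITLE_RE = re.compile(r'[0-9]+">(.*)</a>\([0-9]+\)')
USERS_RE = re.compile(r'user/query/([^"]+).+user/query/([^"]+)')

def parse_line(line):
    """Return (title, articleid, author, lastreply), or None if the line is skipped."""
    # Same filters as the two grep calls: drop pinned items, keep <li> lines.
    if 'class="top"' in line or not line.startswith('<li'):
        return None
    id_m = ID_RE.search(line)
    title_m = TITLE_RE.search(line)
    users_m = USERS_RE.search(line)
    if not (id_m and title_m and users_m):
        return None
    return (title_m.group(1), id_m.group(1), users_m.group(1), users_m.group(2))

print(parse_line(SAMPLE_LINE))
```

Like the shell version, this relies on the markup keeping each item on one line in this exact form, so it is more brittle than the parser-based answers.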