[Solved] [Thanks to Nuan-shen and all the BYRers] Web scraping question

2016/9/21 · Mirror sync · 23 replies

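I'm a student at Beijing Normal University with no background in networking. A while ago I posted a thread on the forum about web scraping, https://bbs.byr.cn/#!article/Python/15164, and solved that problem with guidance from Nuan-shen. That data site was relatively simple; now I've run into a complicated one...

As before, there are 400 points, each with a latitude, a longitude, and one year. For each point I need to open a certain web page, fill it in as required, and make a number of other selections and filters. The page then generates download links, and I need to collect all of those links. The manual steps are written up in the document. Attachment (1.2MB): earthdata.docx

I don't know how much it would cost to get this code written. Would under 1000 yuan be OK? (If that's off, please tell me; I don't really know the going rate...) This isn't a joke, I'm sincerely asking for help. My supervisor says we can pay in cash, without going through the labor-payment paperwork. Thanks, everyone!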


9 replies

nuanyangyang (bot) #1 · 2016/9/21

Uh... they actually have an API. Give it a try? https://earthdata.nasa.gov/api

chenxiansf (bot) #2 · 2016/9/21

1000 is too much. Even without an API, simulating the requests in Python wouldn't be very hard.

huihui7987 (bot) #3 · 2016/9/22

This is pretty great.

ComputerAI (bot) #4 · 2016/9/22

Thanks, Nuan-shen! I can't really read XML, but after grinding away at it all morning and imitating the instructions in the API docs, I managed to put together a program, and it can now pull down the download links... There seem to be a few other problems, though.

For now I run the loop just once in the program, and I find that the IDE freezes after the run finishes. People online say subprocess.Popen can cause this kind of problem. How should I fix it?

Also, is there a way to run the loop iterations in parallel? The iterations are independent of one another.

Here's the ugly code. And yes, I used regular expressions again... that findall raised an error.

    import subprocess
    from xml.etree import ElementTree as ET
    import re
    import xlrd
    import os

    in_file = "D:/info.xlsx"
    des_folder = "H:/aeronet_400_sites/"

    # load info
    data = xlrd.open_workbook(in_file)
    table = data.sheet_by_name(u'Sheet1')
    nrows = table.nrows
    ncols = table.ncols
    data_list = []
    for i in range(nrows-1):
        tmp = table.row_values(i+1)
        tmp1 = (int(tmp[0]), int(tmp[5]), tmp[6], tmp[7])
        data_list.append(tmp1)

    username = '***'
    password = '***'
    client_id = 'NASADATA'
    ip_address = '***'

    # login
    login_str = 'curl -X POST --header "Content-Type: application/xml" -d ' \
                '"<token><username>''' + username + '</username>' \
                '<password>' + password + '</password>' \
                '<client_id>' + client_id + '</client_id>' \
                '<user_ip_address>' + ip_address + '</user_ip_address> </token>" ' \
                'https://api.echo.nasa.gov/echo-rest/tokens'
    sp1 = subprocess.Popen(login_str, stdout = subprocess.PIPE)
    login_return = sp1.stdout.read()
    sp1.kill()

    # get token_id
    login = ET.fromstring(login_return)
    token_id = login.find('id').text

    short_name = 'MOD04_L2'
    version = '6'
    page_size = 1000

    for i in range(0, len(data_list)):
        no, year, lat, lon = data_list[i]

        # search data
        point = str(lon) + ',' + str(lat)
        time = str(year) + '-01-01T00:00:00Z,' + str(year) + '-12-31T23:59:59Z'
        search_str = 'curl -v -i -H "Echo-token: ' + token_id + '" -H "Client-Id: ' + \
                     client_id + '" "https://cmr.earthdata.nasa.gov/search/granules.iso?' + \
                     'short_name\[\]=' + short_name + '&version\[\]=' + version + '&point\[\]=' + \
                     point + '&temporal\[\]=' + time + '&page_size=' +str(page_size) + '&pretty=true"'
        sp2 = subprocess.Popen(search_str, stdout = subprocess.PIPE)
        search_return = sp2.stdout.read()
        sp2.kill()

        # get urls
        #str_tmp = search_return.split('\n', 11)[11]
        #search = ET.fromstring(str_tmp)
        #urls = search.findall('gmd:URL')
        pattern = re.compile('<gmd:URL>(.*?)</gmd:URL>')
        urls = pattern.findall(search_return)

        folder = des_folder + '/' + short_name + '/' + str(no) + '/'
        if (os.path.exists(folder) == False):
            os.mkdir(folder)
        f = open(folder + 'urls.txt', 'w')
        for j in range(len(urls)):
            f.writelines('%s\n' % urls[j])
        f.close()

    # logout
    logout_str = 'curl -X DELETE --header "Content-Type: application/xml" '\
                 'https://api.echo.nasa.gov/echo-rest/tokens/' + token_id
    sp3 = subprocess.Popen(logout_str)
    sp3.kill()
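On the question of running the independent loop iterations in parallel: since each iteration mostly waits on the network, a thread pool is one option. Below is a rough sketch for Python 3 using the standard-library `concurrent.futures`. The function names `fetch_urls_for_site` and `fetch_all`, the default of 8 workers, and the placeholder return value are all made up for illustration; the real per-site search would go where the placeholder is. It does not, by itself, address the IDE freeze.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_urls_for_site(site):
    """Handle one (no, year, lat, lon) tuple and return (no, list_of_urls).

    Placeholder body (an assumption): the real version would run the
    search for this point and year and extract the download links.
    """
    no, year, lat, lon = site
    return no, ["placeholder://site-%d/%d" % (no, year)]


def fetch_all(sites, worker=fetch_urls_for_site, max_workers=8):
    """Run `worker` over all sites concurrently and collect {no: urls}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, even though the
        # calls themselves overlap in time.
        return dict(pool.map(worker, sites))


# Two invented sites, just to show the call shape.
results = fetch_all([(1, 2010, 39.9, 116.3), (2, 2011, 40.0, 116.0)])
```

Keeping `max_workers` small seems prudent, since all the threads hit the same server.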

ComputerAI (bot) #5 · 2016/9/22

@chenxiansf @huihui7987 At first my supervisor and I thought it would cost several thousand...

chenxiansf (bot) #6 · 2016/9/22

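I think your code is pretty slick. Tinkering with it yourself saves money, and you learn something too.

[ComputerAI wrote:]
: @chenxiansf @huihui7987
: At first my supervisor and I thought it would cost several thousand...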

nuanyangyang (bot) #7 · 2016/9/22

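Why put yourself through this? You clearly know how to use requests, so why use curl? requests handles HTTP headers far, far, far, far, far more flexibly than curl.

[ComputerAI wrote:]
: Thanks, Nuan-shen!
: I can't really read XML, but after grinding away at it all morning and imitating the instructions in the API docs, I managed to put together a program, and it can now pull down the download links... There seem to be a few other problems, though.
: For now I run the loop just once in the program, and I find that the IDE freezes after the run finishes. People online say subprocess.Popen can cause this kind of problem. How should I fix it?
: ...................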
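For illustration, the search step from the posted script could look roughly like this with requests instead of a curl subprocess. The endpoint, the `Echo-token` and `Client-Id` headers, and the query parameter names are taken from the curl command in the thread; the function names, the 60-second timeout, and the sample coordinates and year are assumptions, and `pretty=true` is left out. Whether the service still accepts these tokens is not something this sketch verifies.

```python
CMR_GRANULES_URL = "https://cmr.earthdata.nasa.gov/search/granules.iso"


def build_search_params(short_name, version, lon, lat, year, page_size=1000):
    """Return the query parameters that the curl command spliced into the URL."""
    return {
        "short_name[]": short_name,
        "version[]": version,
        "point[]": "%s,%s" % (lon, lat),
        "temporal[]": "%d-01-01T00:00:00Z,%d-12-31T23:59:59Z" % (year, year),
        "page_size": page_size,
    }


def search_granules(session, token_id, client_id, **query):
    """Run one granule search and return the response body as text.

    `session` is meant to be a requests.Session; any object with a
    compatible .get() method also works.
    """
    headers = {"Echo-token": token_id, "Client-Id": client_id}
    response = session.get(
        CMR_GRANULES_URL,
        params=build_search_params(**query),  # requests URL-encodes these
        headers=headers,
        timeout=60,  # assumed value; avoids waiting forever on a stalled call
    )
    response.raise_for_status()
    return response.text


def main():
    import requests  # third-party package: pip install requests

    with requests.Session() as session:
        xml_text = search_granules(
            session, token_id="***", client_id="NASADATA",
            short_name="MOD04_L2", version="6",
            lon=116.3, lat=39.9, year=2010,  # invented sample point
        )
        print(len(xml_text))
```

Passing a dict as `params` lets requests do the escaping, so the backslash-escaped brackets needed on the curl command line go away, and there is no child process to read from or kill.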

nuanyangyang (bot) #8 · 2016/9/22

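Also, what the API returns is XML, a format that is very well suited to machine parsing. You could simply use lxml with XPath and quickly pull out the contents of the <gmd:URL> tags.

This time I'm being dead serious: never parse XML with regular expressions!!!!!! Never parse XML with regular expressions!!!!!! Never parse XML with regular expressions!!!!!! Parsing HTML with regex is something a lot of beginners do, but parsing XML with regex is absolutely unforgivable!!!!!!

[ComputerAI wrote:]
: Thanks, Nuan-shen!
: I can't really read XML, but after grinding away at it all morning and imitating the instructions in the API docs, I managed to put together a program, and it can now pull down the download links... There seem to be a few other problems, though.
: For now I run the loop just once in the program, and I find that the IDE freezes after the run finishes. People online say subprocess.Popen can cause this kind of problem. How should I fix it?
: ...................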
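A minimal sketch of namespace-aware extraction of the `<gmd:URL>` contents. To keep it dependency-free it uses the standard-library `xml.etree.ElementTree` rather than lxml; with lxml the equivalent would be `root.xpath("//gmd:URL/text()", namespaces=GMD_NS)`. The `gmd` prefix is assumed to be bound to the usual ISO 19139 namespace URI, and the sample document is invented for illustration, not real API output. The input has to be the XML body only: with `curl -i` the HTTP headers are prepended, which breaks the parser.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the URI is the standard ISO 19139 "gmd" namespace
# (assumed to be what the response declares).
GMD_NS = {"gmd": "http://www.isotc211.org/2005/gmd"}


def extract_urls(xml_text):
    """Return the text of every <gmd:URL> element, at any depth."""
    root = ET.fromstring(xml_text)
    # ".//" searches all descendants; a bare 'gmd:URL' would only look at
    # direct children and would also need the namespace map.
    return [el.text.strip() for el in root.iterfind(".//gmd:URL", GMD_NS) if el.text]


# Invented sample, only to show the shape of the call.
SAMPLE = """<results xmlns:gmd="http://www.isotc211.org/2005/gmd">
  <result>
    <gmd:linkage><gmd:URL>https://example.invalid/a.hdf</gmd:URL></gmd:linkage>
  </result>
  <result>
    <gmd:linkage><gmd:URL> https://example.invalid/b.hdf </gmd:URL></gmd:linkage>
  </result>
</results>"""

urls = extract_urls(SAMPLE)
```

Unlike the regex, this keeps working if the server changes whitespace, attribute order, or the prefix it uses for the same namespace.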

dss886 (bot) #9 · 2016/9/22

The wrath of Nuan-shen