[ The following text is reposted from the NLP board ]
From: zibuyu (得之我幸), Board: NLP
Subject: Great data for blog research
Posted at: Shuimu Community (Wed Mar 25 14:28:28 2009), on-site
http://www.icwsm.org/2009/data/
3rd Int'l AAAI Conference on Weblogs and Social Media
May 17 - 20, 2009, San Jose, California
Sponsored by the Association for the Advancement of Artificial Intelligence.
ICWSM 2009 Data Challenge
Continuing the ICWSM tradition, ICWSM 2009 is making a dataset available to researchers in the blog and social media fields. We invite you to download the dataset, explore it, learn something interesting about it, and submit a paper about it to ICWSM 2009.
Good research topics might include...
* link analysis
* social network extraction
* tracing the evolution of news
* blog search and filtering
* psychological, sociological, ethnographic, or personality-based studies
* analysis of influence among bloggers
* blog summarization and discourse analysis
But you should feel free to explore any aspect of the data that you feel would be of interest to the ICWSM community.
List of papers accepted to the Data Challenge Workshop
Identifying Personal Stories in Millions of Weblog Entries
Andrew Gordon and Reid Swanson
SentiSearch: Exploring Mood on the Web
Sara Sood and Lucy Vasserman
Flash Floods and Ripples: The Spread of Media Content through the Blogosphere
Meeyoung Cha, Juan Antonio Navarro Perez, and Hamed Haddadi
Event Intensity Tracking in Weblog Collections
Viet Ha Thuc, Yelena Mejova, Christopher Harris and Padmini Srinivasan
Quantification of Topic Propagation using Percolation Theory: A study of the ICWSM Network
Ali Azimi Bolourian, Yashar Moshfeghi and C. J. van Rijsbergen
Authors are invited to submit papers to a special data challenge workshop, to be held on the last day of ICWSM. Papers for the workshop may be submitted here. The deadline for workshop submissions is March 1st. Submissions may be up to 8 pages in length, must be in PDF format, and must follow the ICWSM formatting guidelines. The workshop itself will feature presentations by authors as well as a broader discussion of data issues and opportunities confronting the social media community.
We also welcome authors to submit papers on the dataset to the main ICWSM conference. Time permitting, we will invite authors of accepted ICWSM papers on the dataset to also briefly present their work at the workshop.
The best paper (main conference or workshop) on the dataset will be selected by the data chairs and will receive a prize at the conference.
Please note that the datasets made available through ICWSM are not restricted to only ICWSM 2009 or even ICWSM in general. Our long-term goal is to make weblog and social media datasets available to the research community, and while we hope that ICWSM will be a premier venue for presenting that research, we are happy to see the ICWSM datasets used far and wide.
ICWSM 2009 Spinn3r Blog Dataset
190 people have downloaded the dataset so far! (as of 4 Feb 2009)
The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. Each post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers that roughly approximate search engine ranking. The total size of the dataset is 142 GB uncompressed (27 GB compressed).
This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial crisis; ...) as well as everything else you might expect to find posted to blogs.
To get access to the Spinn3r dataset, please download and sign the usage agreement, and email it to dataset-request (at) icwsm.org. Once your form is processed (usually within 1-3 days), you will be sent a URL and password where you can download the collection.
Here is a sample of blog posts from the collection. The XML format is described on the Spinn3r website.
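Since the collection is far too large to load into memory at once, a streaming parser is one reasonable way to take a first look at it. The following is a minimal sketch only: the element names `item` and `tier` are hypothetical placeholders (the real names are in the Spinn3r format description), and the sample XML is synthetic, not taken from the dataset.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip an XML namespace prefix of the form '{uri}name'."""
    return tag.rsplit("}", 1)[-1]


def count_posts_by_field(stream, post_tag="item", field_tag="tier"):
    """Stream through an XML file and count posts per value of one child field.

    post_tag and field_tag are hypothetical names; replace them with the
    element names given in the Spinn3r format description.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != post_tag:
            continue
        value = None
        for child in elem:
            if local_name(child.tag) == field_tag:
                value = (child.text or "").strip()
                break
        counts[value] += 1
        elem.clear()  # drop the post's text and children to keep memory low
    return counts


# Synthetic sample, NOT real Spinn3r data or schema.
SAMPLE = b"""<dataset>
  <item><title>a</title><tier>1</tier></item>
  <item><title>b</title><tier>2</tier></item>
  <item><title>c</title><tier>1</tier></item>
</dataset>"""

if __name__ == "__main__":
    print(count_posts_by_field(io.BytesIO(SAMPLE)))
```

For a real file, pass an open binary file handle instead of the `io.BytesIO` wrapper; posts that lack the chosen field are counted under `None`.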
Spinn3r provides free access to researchers. If you are interested in making use of their data beyond the ICWSM collection, for example to crawl linked posts or earlier stories from certain blogs, visit their site, spinn3r.com.
Community
We have a mailing list for discussing the datasets at http://groups.google.com/group/icwsm-data. Please join to talk about whatever you're doing with the data. In particular, if you are looking for groups to collaborate with, here's a forum for you. We also have a project at Google Code, http://code.google.com/p/icwsm-data/, where we can host tools and resources that you create to go along with the datasets.
Data Chairs
Ian Soboroff, NIST
Akshay Java, Live Labs, Microsoft
This is a mirrored post. Source: BYR Forum (北邮人论坛) / ml-dm / #4526, synced on 2009/3/26.
Posted by the ML_DM bot
[zz] Great data for blog research
PtwCJ
2009/3/26, mirror synced, 4 replies
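4 replies
An enthusiastic thumbs-up
[ In the post by PtwCJ (鲜的每日C|女共产党员的男朋友), it was mentioned: ]
: [ The following text is reposted from the NLP board ]
: From: zibuyu (得之我幸), Board: NLP
: Subject: Great data for blog research
: ...................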