BYR Achieve · Mirror Forum

[The following text is reposted from the NLP board]
Posted by: zibuyu (得之我幸), Board: NLP
Title: Great data for blog research
Posting site: 水木社区 (Wed Mar 25 14:28:28 2009), on-site

http://www.icwsm.org/2009/data/

3rd Int'l AAAI Conference on Weblogs and Social Media
May 17 - 20, 2009, San Jose, California
Sponsored by the Association for the Advancement of Artificial Intelligence.

ICWSM 2009 Data Challenge

Continuing the ICWSM tradition, ICWSM 2009 is making a dataset available to researchers in the blog and social media fields. We invite you to download the dataset, explore it, learn something interesting about it, and submit a paper about it to ICWSM 2009. Good research topics might include...

* link analysis
* social network extraction
* tracing the evolution of news
* blog search and filtering
* psychological, sociological, ethnographic, or personality-based studies
* analysis of influence among bloggers
* blog summarization and discourse analysis

But you should feel free to explore any aspect of the data that you feel would be of interest to the ICWSM community.

List of papers accepted to the Data Challenge Workshop

* Identifying Personal Stories in Millions of Weblog Entries. Andrew Gordon and Reid Swanson
* SentiSearch: Exploring Mood on the Web. Sara Sood and Lucy Vasserman
* Flash Floods and Ripples: The Spread of Media Content through the Blogosphere. Meeyoung Cha, Juan Antonio Navarro Perez, and Hamed Haddadi
* Event Intensity Tracking in Weblog Collections. Viet Ha Thuc, Yelena Mejova, Christopher Harris and Padmini Srinivasan
* Quantification of Topic Propagation using Percolation Theory: A study of the ICWSM Network. Ali Azimi Bolourian, Yashar Moshfeghi and C. J. van Rijsbergen

Authors are invited to submit papers to a special data challenge workshop, to be held on the last day of ICWSM. Papers for the workshop may be submitted here. The deadline for workshop submissions is March 1st. Submissions may be up to 8 pages in length, must be in PDF format, and must follow the ICWSM formatting guidelines.
The workshop itself will feature presentations by authors as well as a broader discussion of data issues and opportunities confronting the social media community.

We also welcome authors to submit papers on the dataset to the main ICWSM conference. Time permitting, we will invite authors of accepted ICWSM papers on the dataset to also briefly present their work at the workshop. The best paper (main conference or workshop) on the dataset will be selected by the data chairs and will receive a prize at the conference.

Please note that the datasets made available through ICWSM are not restricted to ICWSM 2009 or even to ICWSM in general. Our long-term goal is to make weblog and social media datasets available to the research community, and while we hope that ICWSM will be a premier venue for presenting that research, we are happy to see the ICWSM datasets used far and wide.

ICWSM 2009 Spinn3r Blog Dataset

190 people have downloaded the dataset so far! (as of 4 Feb 2009)

The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. Each post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers that roughly approximate search engine ranking. The total size of the dataset is 142 GB uncompressed (27 GB compressed).

This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial crisis; ...) as well as everything else you might expect to find posted to blogs.

To get access to the Spinn3r dataset, please download and sign the usage agreement, and email it to dataset-request (at) icwsm.org. Once your form is processed (usually within 1-3 days), you will be sent a URL and password where you can download the collection.

Here is a sample of blog posts from the collection. The XML format is described on the Spinn3r website.
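At 142 GB uncompressed, the collection will not fit in memory on most machines, so it is probably easier to read the XML as a stream. Below is a minimal sketch of that idea in Python using the standard library's `iterparse`. The element names (`item`, `title`, `link`, `pubDate`) and the sample data are assumptions made for illustration only; check the format description on the Spinn3r website for the real schema.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Dict, Iterator

# Hypothetical element names: the real Spinn3r schema may differ.
POST_TAG = "item"
FIELDS = ("title", "link", "pubDate")


def iter_posts(stream: BinaryIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per post without building the whole tree in memory."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == POST_TAG:
            yield {name: (elem.findtext(name) or "").strip() for name in FIELDS}
            elem.clear()  # drop the text and children of the post just handled


# Made-up sample data, only to show the function running.
SAMPLE = b"""<dataset>
  <item><title>First post</title><link>http://example.org/1</link><pubDate>2008-08-01</pubDate></item>
  <item><title>Second post</title><link>http://example.org/2</link><pubDate>2008-08-02</pubDate></item>
</dataset>"""

posts = list(iter_posts(io.BytesIO(SAMPLE)))
print(len(posts), posts[0]["title"])
```

For a real file, pass an open binary file object (for example `open(path, "rb")`) instead of the `io.BytesIO` sample, and consume the generator one post at a time rather than collecting it into a list.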
Spinn3r provides free access to researchers. If you are interested in making use of their data beyond the ICWSM collection, for example to crawl linked posts or earlier stories from certain blogs, visit their site, spinn3r.com.

Community

We have a mailing list for discussing the datasets at http://groups.google.com/group/icwsm-data. Please join to talk about whatever you're doing with the data. In particular, if you are looking for groups to collaborate with, here's a forum for you.

We also have a project at Google Code, http://code.google.com/p/icwsm-data/, where we can host tools and resources that you create to go along with the datasets.

Data Chairs

* Ian Soboroff, NIST
* Akshay Java, Live Labs, Microsoft

[Repost] Great data for blog research