Blog Miner

Group Members

Course Reference

EECS 439, Web Data Mining

Project Proposal

Objective

The overall goal of this project is to mine web blogs for useful information, such as popular culture trends and blogger social networks. The outcomes of the project include an implementation of a prototype blog mining system and a set of data and diagrams mined by the system.

Overview

In recent years, blogging has quickly emerged as a popular and important means of communication among Internet users. More and more individuals and groups have created blogs to share their lives, opinions, and other information online. It was reported that by October 2005 over 100 million blogs had been created worldwide [1]. With this explosive growth, blogs have become an important source of information on the Internet. Blogs often contain fresh ideas from individual Internet users and cover up-to-date topics. Because of this, analyzing blogs is likely to reveal popular culture trends, public opinion, and other information that is important for business.

Mining Tasks

In this section, we briefly describe some methods we might use for blog mining. Further technical details will be presented in the final report.

Blogger Network

Bloggers often put links to their friends' blogs on their own blog space. These friend links form a blogger social network, much like a social network in real life, and it can be viewed as a reduced version of the whole real-life social network. Analyzing it could help us understand the real-life social network. For example, we could test the six degrees of separation hypothesis, which conjectures that any two people on Earth can be connected through no more than five intermediate people. If we can build a large enough blogger network, we can test whether six degrees of separation holds.
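As a rough illustration of how such a test might be run, the sketch below measures the number of friend-link hops between two bloggers with a breadth-first search. The graph, the blogger names and the function name are all hypothetical, and links are assumed to have been made symmetric beforehand.

```python
from collections import deque

def separation(graph, start, goal):
    """Return the number of link hops between two bloggers, or None if unreachable.

    graph maps a blogger to the collection of bloggers they are linked with.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

# Hypothetical friend links (made symmetric by hand for this example).
friends = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
    "eve": [],
}
print(separation(friends, "ann", "dan"))
```

Under this definition, two bloggers joined through five intermediate people are six hops apart, so the hypothesis would hold on a crawled network if no reachable pair is more than six hops apart.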

Most Cited Topics

People talk about different topics on their blogs. By analyzing the contents of blogs, we can find the topics that are most popular in the blogging world. For example, the most cited news, people, and companies are possible topics to mine.

Opinion Collection

People express their opinions on merchandise, movies, bands and so on. Mining blogs can collect public attitudes toward a given topic. For example, a record company may be interested in what bloggers think of The Strokes' new album. As another example, instead of doing a survey, the school may be able to collect students' thoughts about the Case Daily mail by mining the Case blog system.

Blog Summarization/Search

Certain topics and concepts can be summarized from a blog. Blog summarization can help to rank blog pages and build a blog search engine. Traditional citation-based ranking schemes, such as PageRank, may not work well for blogs, because a personal blog may not be cited by many pages. A more practical blog ranking scheme can be built upon concept extraction and blog summarization.

Blogs to mine

There are several large blog hosting services, e.g., MSN Spaces and Blogger.com. Our school also provides blog space [2]. For efficiency, we may first work on the Case blog system and later extend our system to mine other blog communities.

Milestones

The estimated schedule of this project is outlined below:

Literature Study

Schedule: 02/03 – 03/03

  • Investigate RSS and Feeds
  • Investigate natural language processing
  • Blog Crawler Development
  • Progress Report 1

Blog Analyzing

Schedule: 03/06 – 03/31

  • Design and Implement algorithms for blog context analyzing
  • Progress Report 2

System Execution

Schedule: 04/03 – 04/21

Combine crawler and analyzer to perform mining tasks

Final Writing

Schedule: 04/24 – 04/30

Final Report.

Member Responsibilities

Zhihua Wen has expertise in network programming and will be responsible for developing the blog crawler. John Tantalo will contribute to incorporating RSS and feeds into the blog crawler and also to blog analysis. Meng Hu will be responsible for the blog analysis module.

Web Portal

http://vorlon.case.edu/~mxh147/eecs439.htm

Project progress will be posted on our project website. All documents (proposal, progress reports and final report), source code and formatted mined data will also be made available online.

References

  1. http://www.blogherald.com/2005/10/10/the-blog-herald-blog-count-october-2005/
  2. http://blog.case.edu The Case Blog System
  3. http://web.media.mit.edu/~hugo/conceptnet/ ConceptNet Project, A Very-Large Semantic Network of Common Sense Knowledge
  4. http://web.media.mit.edu/~hugo/montylingua/ MontyLingua, A Free, Commonsense-Enriched Natural Language Understander for English
  5. Fellbaum C (Ed): ‘WordNet: An Electronic Lexical Database’, MIT Press (1998).
  6. Lenat D B: ‘CYC: A Large-Scale Investment in Knowledge Infrastructure’, Communications of the ACM, 38, No 11, pp 33-38 (1995).
  7. http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html What is RSS

Requirements Documentation

What does the software need to do?

Blog Miner will provide useful textual analysis of blog posts. It will perform two main types of analysis: identifying the most-discussed topics, and summarization and search.

Most-discussed topics

General topics can be inferred from a blog post using natural language processing. Given a set of posts from multiple blogs over several days, the most frequent topics should be identified. This process should avoid creating topics which are very closely related or semantically equivalent, such as "weblog" and "blog". After frequent topics are identified, posts containing these topics should be aggregated for a user to browse by post or by topic.

Summarization and search

The blog miner will summarize each blog and each blog post using a few concepts. The concepts include not only frequently cited words within blog posts, but also high-level concepts inferred from them. For example, if a blog post mentions "guns", "army" and "death", then "war" might be inferred as a summarization concept.

A blog search tool can be enhanced by employing blog summarization. Instead of relying only on exact keyword matching, the blog miner also extends the keywords with related concepts, which guide it to blogs or blog posts summarized by those concepts. As a result, more posts of interest to the user can be located.
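A minimal sketch of this kind of keyword extension follows. The table of related concepts and the per-post summaries are hand-written, hypothetical data (the page does not specify where related concepts come from), and the function names are illustrative only.

```python
def expand(keywords, related):
    """Return the query keywords plus any concepts related to them."""
    expanded = set(keywords)
    for keyword in keywords:
        expanded |= related.get(keyword, set())
    return expanded

def search(summaries, keywords, related):
    """Return ids of posts whose summary concepts overlap the expanded query."""
    wanted = expand(keywords, related)
    return sorted(post_id for post_id, concepts in summaries.items()
                  if wanted & concepts)

# Hypothetical data: post 1 was summarized by the inferred concept "war".
RELATED = {"army": {"war"}, "guns": {"war"}}
SUMMARIES = {1: {"war"}, 2: {"podcast"}}

print(search(SUMMARIES, ["army"], RELATED))  # found through the related concept
print(search(SUMMARIES, ["army"], {}))       # exact matching alone finds nothing
```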

Specifications Documentation

What interfaces will the software use? How are these interfaces defined?

Web Miner

The web miner will employ two interfaces: a web crawler and a feed aggregator.

Web Crawler

This component is assigned to Zhihua Wen.

The miner's web crawler will crawl the web or a specific domain using HTTP for RSS feeds (XML documents) corresponding to blogs. These feeds are usually very easy to identify in HTML documents. In most blogs, a link to the feed is included in the <head> section of the blog's main page. The web crawler should store the locations of all RSS feeds encountered.

In this example from Mano Singham's Web Journal, the RSS feed is located at http://blog.case.edu/mxs24/rss20.xml.

<link rel="alternate" type="application/rss+xml" title="RSS" href="http://blog.case.edu/mxs24/rss20.xml" />
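As an illustration only, feed discovery from a page's markup could look like the sketch below. It uses Python's standard html.parser rather than the project's C# crawler, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate" type="application/rss+xml"> tags."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rel = attributes.get("rel", "").lower().split()
        mime = attributes.get("type", "").lower()
        href = attributes.get("href", "")
        if "alternate" in rel and mime == "application/rss+xml" and href:
            self.feeds.append(href)

def find_feeds(html):
    """Return the RSS feed URLs advertised in an HTML document."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds

sample = ('<html><head><link rel="alternate" type="application/rss+xml" '
          'title="RSS" href="http://blog.case.edu/mxs24/rss20.xml" /></head></html>')
print(find_feeds(sample))
```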

Crawler design:

Advantage of crawling blogs

Blogs under the same blog site usually have very simple and similar structures.

First of all, they share the same IP address, which saves a lot of time in DNS resolution and lets us use persistent HTTP/1.1 connections to fetch multiple web pages over one TCP connection. For example, blogs on the Case blog system and MSN Spaces follow the link patterns http://blog.case.edu/(user)/ and http://spaces.msn.com/(user)/. Blogs on each site share one domain name and therefore the same IP address. Some other blog sites, such as blogger.com, use a different pattern like (user).blogspot.com. In this case, although different blogs appear to have different domain names, all of these names are in fact aliases for the same domain, with canonical name blogspot.blogger.com.

Secondly, blogs also share very similar external and internal patterns. The external pattern of a blog is its entry link from outside, e.g., http://blog.case.edu/(user)/ for blog.case.edu; we also call it the blog pattern. The internal pattern is the structure inside each blog. Unlike personal websites, which tend to differ greatly from one another, blogs under the same host normally have very similar directory layouts. For example, in the Case blog system, every blog's directory has subdirectories such as /archives and /categories. Blog users may customize their blogs to make them look personal, but even when the appearances differ, the internal structures are much the same.

These features make blogs very easy to identify. If a link looks like a Case blog, i.e., it matches the blog pattern @"^(http://)?blog.(case|cwru).edu(:\d+)?/(?<name>[^/]+)/" as a regular expression, we may guess that it is a blog. Not every URL that matches the blog pattern is necessarily a blog; e.g., http://blog.case.edu/stats/ is not a blog but the statistics page for the Case blog system. However, we believe such organizational links are very few compared to the number of blogs, so our guess will have a very high hit rate. Furthermore, when we are unsure whether a URL that matches the blog pattern is a blog, we can check the internal pattern to verify it. For example, if we see the link http://blog.case.edu/9/2005/10/15/eitheror#comments, we extract http://blog.case.edu/9/ from it and check whether the subdirectories http://blog.case.edu/9/archives and http://blog.case.edu/9/categories exist. Then we can tell whether http://blog.case.edu/9/ is a blog or not. Once a link is identified as a blog, we can use our existing knowledge of that blog site to extract only the important information and ignore unimportant links.

Since the number of blog sites is so small compared to the number of blogs they host, we can use our prior knowledge of their structures to conduct this “supervised” crawling.
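The external-pattern check can be sketched as follows. This is a Python transcription of the C# pattern quoted above, not the crawler's actual code; as an assumption it escapes the dots in the host name (the original pattern leaves them unescaped), and the helper names are hypothetical.

```python
import re

# Python form of the Case blog pattern; the dots are escaped here (an assumption).
BLOG_PATTERN = re.compile(
    r"^(?:http://)?(?P<host>blog\.(?:case|cwru)\.edu(?::\d+)?)/(?P<name>[^/]+)/"
)

def blog_root(url):
    """Return the candidate blog root for a URL, or None if the pattern does not match."""
    match = BLOG_PATTERN.match(url)
    if match is None:
        return None
    return "http://{}/{}/".format(match.group("host"), match.group("name"))

def internal_checks(root):
    """URLs whose existence would support the guess that root is a blog."""
    return [root + "archives", root + "categories"]

root = blog_root("http://blog.case.edu/9/2005/10/15/eitheror#comments")
print(root)
print(internal_checks(root))
```

A match only makes a URL a candidate: http://blog.case.edu/stats/ also matches, which is why the internal pattern (or the RSS feed) still has to be checked.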

RSS feeds make things even easier

An RSS feed is a special XML file under a blog that contains all the important information about that blog (this is at least true for Case blogs). An RSS feed file normally contains one channel element and several item elements. The channel holds general information about the blog, and each item holds the detailed information for one post or other blog element. The <content:encoded> node under each item contains the main content of that item. Other useful fields, such as <description> and <title>, can be used for further keyword search. For crawling purposes, we only need to check whether this feed file exists to tell whether a URL belongs to a blog, and the feed file is also the only file we need to download for that blog. As in the previous example, if we see blog.case.edu/9/2005/10/15/eitheror#comments, we simply try to get blog.case.edu/9/rss20.xml. If we succeed, we mark blog.case.edu/9/ as a blog, and we have already finished crawling it. If we fail, we mark blog.case.edu/9/ as a non-blog and continue by retrieving the original blog.case.edu/9/2005/10/15/eitheror#comments as a normal URL.
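To make the feed layout concrete, the sketch below reads the channel title and the per-item <title>, <description> and <content:encoded> fields with Python's standard ElementTree. The sample feed is invented for illustration, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, which defines <content:encoded>.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"

def parse_feed(xml_text):
    """Return the channel title and a list of items from an RSS 2.0 document."""
    channel = ET.fromstring(xml_text).find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")
    items = []
    for item in channel.findall("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            "content": item.findtext(CONTENT_NS + "encoded", default=""),
        })
    return {"title": channel.findtext("title", default=""), "items": items}

# Invented sample feed, for illustration only.
SAMPLE_FEED = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <description>Short summary</description>
      <content:encoded><![CDATA[<p>Full text of the post</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

feed = parse_feed(SAMPLE_FEED)
print(feed["title"], len(feed["items"]))
```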

The logical dataflow of a crawler socket takes advantage of the fixed format of Case blogs and their RSS feeds. We use a hash table to store a Boolean value recording whether a link such as http://blog.case.edu/something/ is a blog. For example, blog.case.edu/zhihuawen/ is a blog, so Hash.get(http://blog.case.edu/zhihuawen/) returns true; http://blog.case.edu/features/ is not a blog, so Hash.get(http://blog.case.edu/features/) returns false. If we see a page http://blog.case.edu/xxx/yyy/zzz.html and Hash.get(http://blog.case.edu/xxx/) returns null, we have never encountered any link under http://blog.case.edu/xxx/ before. We then try to get the page http://blog.case.edu/xxx/rss20.xml. If we get a 200 OK success response, we call Hash.put(http://blog.case.edu/xxx/, true) to mark the site as a blog and save the information inside http://blog.case.edu/xxx/rss20.xml. If we get a 404 File Not Found response, or the page we get is not in proper RSS format, we call Hash.put(http://blog.case.edu/xxx/, false) to mark it as a non-blog URL and go back to retrieve the original link http://blog.case.edu/xxx/yyy/zzz.html.
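The decision logic above can be sketched as below. A Python dict stands in for the hash table, the fetch function is passed in so the example needs no network, and the RSS check is deliberately crude; all names are hypothetical and absolute URLs are assumed.

```python
from urllib.parse import urlsplit

def blog_root(url):
    """Return scheme://host/first-directory/ for an absolute URL, else None."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if not parts.netloc or not segments:
        return None
    return "{}://{}/{}/".format(parts.scheme, parts.netloc, segments[0])

def looks_like_rss(body):
    """Crude format check (an assumption; a real crawler would parse the XML)."""
    return "<rss" in body and "<channel" in body

def handle_url(url, known, fetch):
    """Return ("blog", root) or ("page", url).

    known maps a blog root to True/False; a missing key means "never seen".
    fetch(url) must return a (status_code, body) pair.
    """
    root = blog_root(url)
    if root is None:
        return ("page", url)
    is_blog = known.get(root)
    if is_blog is None:
        status, body = fetch(root + "rss20.xml")
        is_blog = status == 200 and looks_like_rss(body)
        known[root] = is_blog
    return ("blog", root) if is_blog else ("page", url)
```

Because the result is cached per blog root, the feed is requested at most once per directory, no matter how many links under it the crawler meets.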


Some log records from our crawler show how blog.case.edu/moof/ is identified as a blog and then used:

http://blog.case.edu/moof/2006/01/13/case_daily not sure a blog, bloglink:http://blog.case.edu/moof/
Retrieve http://blog.case.edu/moof/rss20.xml succeed, http://blog.case.edu/moof/ identified as blog link
http://blog.case.edu/moof/2005/09/22/crazy_thought is a blog, bloglink:http://blog.case.edu/moof/

Asynchronous non-blocking sockets

We see basically four different methods for network multiplexing: creating a new thread or process per connection, using an existing thread or process pool, using select for multiplexed I/O, and using asynchronous sockets. Since this was our first time using C#, we first wanted to use select to handle multiple client TCP socket connections. We then found an MSDN article [8] that criticizes select as “much less responsive than using a thread-based server, and still not very scalable, performance degrade significantly after a thousand or so clients have connected”; its author also encourages the use of asynchronous sockets.

We found asynchronous sockets in .NET very easy to use. In traditional blocking mode, you call a function dosomething() and block until it finishes. In non-blocking mode, you call a function of the form begin_dosomething(dosomething_callback, state) and then continue with other work. When the job finishes, dosomething_callback is called and the state parameter is passed to it. According to MSDN, these callback functions are thread-safe as long as they are static. So we can use the scenario (beginconnect -> connectcallback -> beginsend -> sendcallback -> beginreceive -> receivecallback) in non-blocking mode just as we would use (connect -> send -> receive) in blocking mode to download a web page, except that we can run multiple jobs like a pipeline. The following graph compares these two scenarios.

Although we do not need to explicitly poll or create threads, this is in fact still a thread pool solution. However, the author of [8] claims that the asynchronous model can accept and open more connections and is much more responsive than a select model. We did not implement a select-based crawler to compare with ours, but in our experience the current crawler can retrieve multiple pages concurrently quite fast.

We also use a mutex and a semaphore to resolve concurrency issues around getting and putting URLs.
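For readers unfamiliar with the pattern, here is a loose analogue in Python's asyncio. It is not the .NET begin/callback API described above: coroutines replace the callbacks, a semaphore caps the number of downloads in flight, and the stub fetch function and all names are hypothetical.

```python
import asyncio

async def crawl(urls, fetch, max_concurrent=100):
    """Fetch all urls concurrently, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results = {}

    async def worker(url):
        async with semaphore:  # wait here while too many downloads are running
            results[url] = await fetch(url)

    await asyncio.gather(*(worker(url) for url in urls))
    return results

async def fake_fetch(url):
    """Stand-in for a real download; sleeps briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return len(url)
```

A call such as asyncio.run(crawl(urls, fake_fetch, max_concurrent=2)) returns a dict from each URL to its fetched value; the limit of 100 mirrors the 100 concurrent sockets mentioned for the crawler.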

Implementation and Crawling result:

We have implemented two crawlers in C# to crawl the website at blog.case.edu. The first downloads all files except those bigger than 10MB; the second tries to download only the RSS feed file from each blog directory. The crawlers ran on a Dell 9150 desktop machine with a 2.8GHz Pentium Dual Core CPU, 2GB of memory, and 2 gigabit network card. With the first crawler, we can download more than 900 blogs in less than 20 minutes using 100 concurrent sockets. The second crawler runs at about the same speed, but it downloads fewer pages from blog.case.edu and thus puts less load on the server. We also found that fewer than 20 subdirectories under blog.case.edu are not blogs; compared to over 900 blogs, this number is quite small. We have also analyzed the relationships between Case blogs and found that only 53 of them have a non-zero in-degree or out-degree with respect to other Case blogs.
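The in-degree and out-degree figures can be computed from a list of blog-to-blog links along the lines of the sketch below. The edge list is invented, the function names are hypothetical, and the choice to ignore self-links and duplicate links is an assumption.

```python
from collections import defaultdict

def degrees(edges):
    """Return (in_degree, out_degree) dicts for a list of (source, target) links."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in set(edges):  # count each distinct link once
        if source == target:
            continue  # ignore links from a blog to itself
        out_degree[source] += 1
        in_degree[target] += 1
    return dict(in_degree), dict(out_degree)

def linked_blogs(edges):
    """Return the blogs with a non-zero in-degree or out-degree."""
    in_degree, out_degree = degrees(edges)
    return set(in_degree) | set(out_degree)

# Invented links between hypothetical blogs.
links = [("a", "b"), ("b", "a"), ("a", "c"), ("a", "b"), ("d", "d")]
print(sorted(linked_blogs(links)))
```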

Database table design:

Currently, we store all crawled pages in local files, but in the future we want to store them in a database with specialized tables for blogs and RSS feeds. We have already created these tables using SQL Server 2005. The following is the diagram for our database. As it shows, everything centers on the blog table. Since a blog's RSS feed has only one channel but may have multiple items, we merge the channel into the blog table and put blog items in a separate blogitem table, linked through blogid as a foreign key. The blogrelation table represents the link graph between blogs, and the URL table serves both as the page record for normal URLs and as an indication of whether a directory is a blog. We also have a DNS resolution table that is not shown in this diagram. The format of these tables may change in the future.

The table diagram of our project; everything centers on the blog table.

Feed Aggregator

This component is assigned to John Tantalo.

RSS feeds are collected because they are currently the best way to detect when a new post is added to a blog. Each RSS feed includes a pubDate timestamp of the last update. All the aggregator needs to do is compare the pubDate to the last time the feed was read to detect a new post.

More generally, for each feed, the aggregator will store a timestamp corresponding to the last time the feed was downloaded. Then, according to a schedule, the aggregator will download the feed over HTTP and look for new posts by parsing the XML and comparing the pubDate to the timestamp. The text contents of new posts (referenced in the feed) will then be passed to the post analyzer.
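The pubDate comparison can be sketched as follows. RSS dates use the RFC 822 style seen elsewhere on this page, which Python's email.utils can parse. The function name is hypothetical, and skipping items dated in the future is an assumption taken from the sample output described under the Feed Aggregator demo.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def new_items(items, last_read, now=None):
    """Return titles of items published after last_read and not later than now.

    items is a list of (title, pubDate string) pairs; last_read and now are
    timezone-aware datetimes.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    fresh = []
    for title, pub_date in items:
        published = parsedate_to_datetime(pub_date)
        if last_read < published <= now:
            fresh.append(title)
    return fresh

last_read = parsedate_to_datetime("Tue, 21 Feb 2006 00:00:00 -0500")
now = parsedate_to_datetime("Sun, 26 Mar 2006 18:47:37 -0500")
items = [
    ("Blink 143", "Fri, 10 Feb 2006 16:41:46 -0500"),
    ("Project Creation", "Tue, 28 Feb 2006 16:23:27 -0500"),
]
print(new_items(items, last_read, now))
```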

Topic Miner

The topic miner will employ two interfaces: the post analyzer and topic aggregator.

Post Analyzer

This component is assigned to Meng Hu.

The post analyzer consists of two components: XML parser and text analyzer.

The RSS XML files of blogs are fed into the XML parser, which extracts the corresponding contents. For example, the blog title and blog items (posts) will be extracted from the XML files of crawled blogs.

Given the text of a post, the post analyzer will perform natural language processing on it to identify topics. These topics will be compared with topics discussed in other posts to identify frequently discussed topics. The post analyzer will then compile a list of frequently discussed topics within each blog or across all blogs.

The post analyzer may be integrated into the web crawler in the future.

Topic Aggregator

This component is assigned to John Tantalo.

Once frequently discussed topics are identified for blog posts, the topic aggregator must provide a user interface to display which topics were discussed in a given post and which posts discussed a given topic. The social bookmarking website del.icio.us may be best suited to host this interface, as it provides functionality to implement these specifications using tags.

Topics detected by the post analyzer should be stored in the database. The topic aggregator should then make the table that relates blog items to topics available through an interface, such as a bookmark tagging website like del.icio.us. This may be accomplished by reading the table periodically and adding new records to the aggregation interface.

Here is a proposal for two tables to handle topic aggregation:

CREATE TABLE ITEMTOPIC(
    ITEMID int NOT NULL,
    TOPICID int NOT NULL,
    CREATEDATE datetime NOT NULL
) ;

CREATE TABLE TOPIC(
    TOPICID int NOT NULL,
    TOPIC varchar(32) NOT NULL
) ;
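To show how these two tables could answer the aggregator's questions, the sketch below loads them into an in-memory SQLite database (a stand-in for the project's SQL Server 2005) and runs three lookups: topics for a post, posts for a topic, and pairs created since a given time. The rows and function names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TOPIC(TOPICID int NOT NULL, TOPIC varchar(32) NOT NULL);
CREATE TABLE ITEMTOPIC(ITEMID int NOT NULL, TOPICID int NOT NULL,
                       CREATEDATE datetime NOT NULL);
""")
# Invented sample rows.
conn.executemany("INSERT INTO TOPIC VALUES (?, ?)", [(1, "podcast"), (2, "ipod")])
conn.executemany("INSERT INTO ITEMTOPIC VALUES (?, ?, ?)", [
    (10, 1, "2006-03-25 12:00:00"),
    (10, 2, "2006-03-25 12:00:00"),
    (11, 1, "2006-03-26 18:47:37"),
])

def topics_for_item(conn, item_id):
    """Topics discussed in a given post."""
    rows = conn.execute(
        "SELECT t.TOPIC FROM ITEMTOPIC it JOIN TOPIC t ON t.TOPICID = it.TOPICID "
        "WHERE it.ITEMID = ? ORDER BY t.TOPIC", (item_id,))
    return [row[0] for row in rows]

def items_for_topic(conn, topic):
    """Posts that discuss a given topic."""
    rows = conn.execute(
        "SELECT it.ITEMID FROM ITEMTOPIC it JOIN TOPIC t ON t.TOPICID = it.TOPICID "
        "WHERE t.TOPIC = ? ORDER BY it.ITEMID", (topic,))
    return [row[0] for row in rows]

def pairs_since(conn, since):
    """(ITEMID, TOPICID) pairs created after the given timestamp string."""
    rows = conn.execute(
        "SELECT ITEMID, TOPICID FROM ITEMTOPIC WHERE CREATEDATE > ? "
        "ORDER BY CREATEDATE, ITEMID, TOPICID", (since,))
    return list(rows)

print(topics_for_item(conn, 10), items_for_topic(conn, "podcast"))
```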

Design Documentation

How is the software organized? What is the goal of this organization? How does the design address the requirements? What functionality do the classes, functions, scripts, etc. provide?

Web Miner

Web Crawler

This component is assigned to Zhihua Wen.

The logical dataflow of a crawler socket, which takes advantage of the fixed format of Case blogs and their RSS feeds.

Our asynchronous non-blocking socket design versus a single-threaded blocking socket design.

Feed Aggregator

This component is assigned to John Tantalo.

The feed aggregator will periodically check all feeds that haven't been checked in a certain time interval. For example, once an hour, the feed aggregator will download all the posts from blogs which haven't been examined for at least 1 day. This will hopefully distribute the set of feeds evenly through the update period.
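One possible shape for the selection step is sketched below. The function and parameter names are hypothetical; a limit below 1 meaning "no limit" mirrors the command-line behaviour described under Usage, and processing the stalest feeds first is an assumption.

```python
from datetime import datetime, timedelta

def feeds_due(last_checked, now, min_age=timedelta(days=1), limit=0):
    """Return feed URLs not checked for at least min_age, stalest first.

    last_checked maps a feed URL to the datetime it was last examined.
    A limit below 1 means "return every feed that is due".
    """
    due = sorted((checked, url) for url, checked in last_checked.items()
                 if now - checked >= min_age)
    urls = [url for _, url in due]
    return urls if limit < 1 else urls[:limit]

now = datetime(2006, 3, 26, 18, 47, 37)
last_checked = {
    "http://blog.case.edu/djt/rss20.xml": datetime(2006, 3, 24, 9, 0, 0),
    "http://blog.case.edu/djurek/rss20.xml": datetime(2006, 3, 26, 12, 0, 0),
    "http://blog.case.edu/dkh2/rss20.xml": datetime(2006, 3, 25, 10, 0, 0),
}
print(feeds_due(last_checked, now))
```

Running this once an hour means each feed is picked up in the first run after it turns a day old, which is what spreads the downloads across the update period.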

Image:BlogMinerFeedAggregatorFlowChart.png

Topic Miner

Post Analyzer

This component is assigned to Meng Hu.

Here is a proposed flowchart of Post Analyzer.

Image:PostAnalyzerFlowchart.JPG

Currently, a Perl XML parser has been built for fast prototyping. The XML parser takes RSS XML files as input and outputs a single plain text file that contains the titles of the blogs and all blog posts. Here is a sample output file: Sample Blog Text file

The blogs in plain text format are then processed by a part-of-speech tagger, which is implemented in Java by calling MontyLingua (a natural language processing toolkit). Here is the tagged version of the sample file given above: Sample Tagged Blog Text file

The tag set used here is the Penn Treebank tag set.

Topic Aggregator

This component is assigned to John Tantalo.

The topic aggregator adds new item/topic pairs to the aggregation interface (del.icio.us). New records are defined as records created since the last time the topic aggregator was executed.

Image:BlogMinerTopicAggregatorFlowChart.png

Implementation Documentation

How are the most essential components of the software implemented?

Supplemental Documentation

Demo and other data

Case Blog Link Network

The network below is built from a snapshot of the links between Case blogs taken on Feb. 22nd.

Post Analyzer Demo

XML Parser output

The compressed file below contains all posts of Case blogs in plain text format (built from a snapshot taken on Feb. 22nd).

All Case Blogs in plain text file

Part-of-Speech Tagger output

The compressed file below contains all tagged posts of Case blogs (built from a snapshot taken on Feb. 22nd).

All tagged Case Blogs

The file below contains statistics on the topics extracted from Case blog posts. Every tagged noun is considered a "topic". The number after each topic is the number of times it occurs across all blog posts. Currently, topics are ranked by exact occurrence count. In the future, more reasonable ranking schemes will be investigated and defined to filter out "non-interesting" topics.

Topic Stats
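Counting tagged nouns as topics can be sketched as below. The word/TAG token layout and the lowercasing of topics are assumptions for illustration (the page does not show the tagged file format); the noun tags NN, NNS, NNP and NNPS are those of the Penn Treebank tag set.

```python
from collections import Counter

# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def topic_counts(tagged_text):
    """Count nouns in text whose tokens are assumed to look like word/TAG."""
    counts = Counter()
    for token in tagged_text.split():
        word, separator, tag = token.rpartition("/")
        if separator and tag in NOUN_TAGS:
            counts[word.lower()] += 1
    return counts

# Invented tagged sentence in the assumed layout.
sample = "My/PRP$ blog/NN is/VBZ like/IN a/DT hammer/NN ./. Blog/NN posts/NNS"
print(topic_counts(sample).most_common())
```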

Feed Aggregator Demo

Sample Output

This portion of output shows the feed and the items which were recognized as new and added to the database. Old items or items which are dated in the future are ignored.

blogId: 15 (Tue, 01 Jan 1980 00:00:00 -0500, http://blog.case.edu/adam.evans/rss20.xml)
	blogItem: 0 (Fri, 10 Feb 2006 16:41:46 -0500, Blink 143)
	blogItem: 1 (Thu, 16 Feb 2006 14:05:27 -0500, Link to Harvard's IAT test)
	blogItem: 2 (Fri, 17 Feb 2006 11:32:13 -0500, dont blink but sleep on it)
	blogItem: 3 (Tue, 28 Feb 2006 16:23:27 -0500, Project Creation)
	blogItem: 4 (Tue, 21 Mar 2006 22:11:36 -0500, Chapter Highlights)
blogId: 14 (Tue, 01 Jan 1980 00:00:00 -0500, http://blog.case.edu/act9/rss20.xml)
blogId: 13 (Tue, 01 Jan 1980 00:00:00 -0500, http://blog.case.edu/aau2/rss20.xml)
blogId: 12 (Tue, 01 Jan 1980 00:00:00 -0500, http://blog.case.edu/aaron.shaffer/rss20.xml)
	blogItem: 0 (Fri, 06 Jan 2006 11:33:31 -0500, Google Newsletter for Librarians)
	blogItem: 1 (Sun, 05 Feb 2006 13:20:00 -0500, Three Services from ITS: #3) Public Printing Solution)
	blogItem: 2 (Tue, 07 Feb 2006 18:58:35 -0500, Three Services From ITS:#2) A Network Drive)
	blogItem: 3 (Thu, 09 Feb 2006 20:14:19 -0500, Three Services From ITS:#1) ITS Open Forum Podcast)
	blogItem: 4 (Fri, 17 Feb 2006 16:00:00 -0500, A Review of Seven Microphones: The Heart of All Podcasting)
	blogItem: 5 (Tue, 21 Feb 2006 21:47:01 -0500, My Blog Is Like A Hammer)
	blogItem: 6 (Thu, 23 Feb 2006 23:44:50 -0500, Hundert has to go?)
	blogItem: 7 (Wed, 01 Mar 2006 18:59:14 -0500, A Game of Cat and Mouse:ISPs and Their Paying Customers)
	blogItem: 8 (Wed, 08 Mar 2006 10:47:55 -0500, Blog Against Sexism Day)
	blogItem: 9 (Thu, 09 Mar 2006 20:07:48 -0500, Gadget Review: Logitech Cordless Presenter)
	blogItem: 10 (Thu, 16 Mar 2006 09:44:52 -0500, Windows XP on Intel Macs)
	blogItem: 11 (Fri, 17 Mar 2006 09:28:12 -0500, Wikipedia on my iPod)
	blogItem: 12 (Mon, 20 Mar 2006 22:10:27 -0500, Help Support DMCA Reform: Pass HR 1201)
	blogItem: 13 (Wed, 22 Mar 2006 09:55:28 -0500, Which Team Are You On?)
	blogItem: 14 (Wed, 22 Mar 2006 16:55:10 -0500, Goodbye Television. Hello iTunes.)
blogId: 11 (Tue, 01 Jan 1980 00:00:00 -0500, http://blog.case.edu/aap8/rss20.xml)
blogId: 10 (Tue, 01 Jan 1980 00:00:00 -0500, http://blog.case.edu/aak8/rss20.xml)

Usage

The feed aggregator can accept as an argument the maximum number of feeds to examine, which is five in this example. If this value is less than 1, then the program will attempt to examine every feed in the blog table.

[lynn:EECS 439/BlogMiner/bin] john% java -cp ../lib/jtds-1.2.jar:. FeedAggregator 5
blogId: 219 (Sun, 26 Mar 2006 18:47:37 -0500, http://blog.case.edu/djt/rss20.xml)
blogId: 220 (Sun, 26 Mar 2006 18:47:37 -0500, http://blog.case.edu/djurek/rss20.xml)
blogId: 221 (Sun, 26 Mar 2006 18:47:37 -0500, http://blog.case.edu/dkh2/rss20.xml)
blogId: 222 (Sun, 26 Mar 2006 18:47:37 -0500, http://blog.case.edu/dkr3/rss20.xml)
blogId: 223 (Sun, 26 Mar 2006 18:47:37 -0500, http://blog.case.edu/dlf4/rss20.xml)

total inserted blog items: 0
This page was last modified 19:45, March 26, 2006 by John Tantalo.