Friday, December 20, 2013

A Data Warehouse for Social Media for Learning


Recently I started to redesign our Mediabase, the database that holds data crawled from social media. A data warehouse model is more convenient for analyzing social media, and here I am going to explain why.

All in all, the Mediabase design is not wrong. Its creators based it on actor-network theory (ANT), which treats all items in a system as actors. The following picture helps to understand how this idea is applied to the Mediabase.


The Mediabase started as an empty database with a set of defined tables. Now it is filled on a regular basis with the help of watcher scripts. A watcher crawls one of the Media: blogs, feeds, podcasts, e-mail archives. The traces that users leave in a medium consist of texts, tags or labels, URLs and many other components. We collect these Artifacts in the Mediabase so that we can later use them for different analytic tasks. Moreover, we examine users, e.g. e-mail senders, blog owners, authors of forum posts and others. We can apply different analyses that investigate how the behavior of uniquely identified users changes. Such an analysis shows the activity patterns of users, their roles in communities, their interests and their closest peers in networks.

The Artifact and the Medium are actors according to ANT. The Community, the Agent (or User) and the Process are actors as well. All of them influence the activities in online networks, and the Mediabase aims to capture the resulting changes.

But why do we need to collect all this data? Let us follow a teacher who wants to understand what problems students have when solving an exercise. He can try to find out by observing online student communities and answering the following questions:
  • Are there any questions about the exercise?
  • How many people have viewed the topic?
  • How many people have taken part in the discussions?
  • What resources were used?
  • How active are the students who talk about the exercise?
  • What are the sentiments and intents expressed in the discussions?

Finding the answers manually can be time-consuming or even impossible, depending on the number of students and the number of online communities involved. There are two reasons. 1) Communities have unique structures: the positions of users in these structures determine their roles, such as community experts, communicative members and brokers, i.e. members who connect several groups. 2) A huge amount of text mixes information that is relevant for learning with information that is not. It may include interesting explanations and links that a person can easily overlook.

So the help of machines is needed, and not only because the communities are big. We are talking about situations in which teachers, managers or stakeholders of online learning communities want answers to different questions about different time intervals in the life of their communities.

Therefore, we need a database model that provides the right information in the right place at the right time and at the right cost in order to support the right decision. In traditional databases we get a data item by specifying two parameters. For example, by specifying the time and the medium we get the posts that appeared in the medium in the given time period, or the users who were active in that medium during that period. But more sophisticated queries take longer to execute because of joins and similar operations. Moreover, to analyze communities researchers have to mine the data and afterwards run experiments to find patterns.
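To make the "two parameters" query concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (medium, agent, post, posted_on) and the sample rows are assumptions made up for illustration; they are not the real Mediabase schema. Even this simple question already needs two joins.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema, not the real Mediabase)."""
    conn = sqlite3.connect(":memory:")
    # posted_on holds ISO dates as text, so string comparison orders them correctly.
    conn.executescript("""
        CREATE TABLE medium (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE agent (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE post (
            id INTEGER PRIMARY KEY,
            medium_id INTEGER REFERENCES medium(id),
            agent_id INTEGER REFERENCES agent(id),
            posted_on TEXT
        );
    """)
    conn.executemany("INSERT INTO medium VALUES (?, ?)", [(1, "blog"), (2, "forum")])
    conn.executemany("INSERT INTO agent VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    conn.executemany("INSERT INTO post VALUES (?, ?, ?, ?)", [
        (1, 1, 1, "2013-11-03"),
        (2, 2, 1, "2013-11-10"),
        (3, 2, 2, "2013-11-21"),
        (4, 2, 2, "2013-12-02"),
    ])
    return conn


def active_users(conn, medium, start, end):
    """Return the names of users who posted in `medium` between `start` and `end`."""
    rows = conn.execute(
        """
        SELECT DISTINCT agent.name
        FROM post
        JOIN medium ON medium.id = post.medium_id
        JOIN agent ON agent.id = post.agent_id
        WHERE medium.name = ? AND post.posted_on BETWEEN ? AND ?
        ORDER BY agent.name
        """,
        (medium, start, end),
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(active_users(conn, "forum", "2013-11-01", "2013-11-30"))
```

Every further question (roles, sentiments, resources used) would add more tables and more joins to a query like this one.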

To simplify the life of researchers, I propose a data cube: a multidimensional model with a set of measures (facts). We can navigate the data cube by specifying dimensions that correspond to the actors of the ANT model. In particular, we can choose a User who posted at a Time in a Medium, and we get output that characterizes this User at the given Time point in the given Medium. In this case, however, the output includes different measures captured by mining social networks, texts or user activities.
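The idea of navigating a cube by its dimensions can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the three dimensions (user, time, medium) come from the text above, while the measures (posts, replies), the sample facts and the function name query_cube are hypothetical.

```python
from collections import defaultdict

# Each fact is one cell of the cube: three dimension values plus measures.
# The measure names and numbers are invented sample data.
FACTS = [
    {"user": "alice", "time": "2013-11", "medium": "blog", "posts": 3, "replies": 1},
    {"user": "alice", "time": "2013-11", "medium": "forum", "posts": 2, "replies": 4},
    {"user": "bob", "time": "2013-11", "medium": "forum", "posts": 5, "replies": 2},
    {"user": "bob", "time": "2013-12", "medium": "forum", "posts": 1, "replies": 0},
]


def query_cube(facts, measure, group_by=(), **fixed):
    """Sum `measure` over all facts matching the fixed dimension values.

    `fixed` pins dimensions to a value (a slice or dice of the cube);
    `group_by` lists the dimensions kept in the result keys.
    """
    totals = defaultdict(int)
    for fact in facts:
        if all(fact[dim] == value for dim, value in fixed.items()):
            key = tuple(fact[dim] for dim in group_by)
            totals[key] += fact[measure]
    return dict(totals)


if __name__ == "__main__":
    # One cell: a User at a Time in a Medium.
    print(query_cube(FACTS, "posts", user="alice", time="2013-11", medium="forum"))
    # A slice: all media in November, summed over users.
    print(query_cube(FACTS, "posts", group_by=("medium",), time="2013-11"))
```

Fixing all three dimensions returns a single cell; leaving a dimension out of both `fixed` and `group_by` aggregates over it, which is what makes the cube convenient for questions about different time intervals or media.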

In the following picture you can see the hierarchy of dimensions of the proposed data cube. The Mediabase cube is not finished yet, as it should reflect all the actors we find important in online networks. Thus I still need to elaborate the Process, Agent (types), Community (types) and Artifact dimensions.
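A dimension hierarchy lets us roll measures up from a fine level to a coarser one. The sketch below assumes a Time hierarchy of day, month and year, which is a common choice but is not taken from the picture; the function name roll_up and the sample counts are hypothetical as well.

```python
from collections import Counter

# Assumed Time hierarchy: each level is a prefix of an ISO date string
# ("2013-11-03" -> "2013-11" -> "2013").
TIME_LEVELS = {"day": 10, "month": 7, "year": 4}


def roll_up(daily_counts, level):
    """Aggregate per-day counts to the requested level of the Time hierarchy."""
    width = TIME_LEVELS[level]
    totals = Counter()
    for day, count in daily_counts.items():
        totals[day[:width]] += count
    return dict(totals)


if __name__ == "__main__":
    posts_per_day = {"2013-11-03": 2, "2013-11-21": 1, "2013-12-02": 4}
    print(roll_up(posts_per_day, "month"))
    print(roll_up(posts_per_day, "year"))
```

The same pattern would apply to the other dimensions once their levels are defined, e.g. rolling up from a single Agent to an Agent type.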