Showing posts with label workingpaper. Show all posts

Friday, December 20, 2013

A Datawarehouse for Social Media for Learning


Recently I started to redesign our Mediabase, the database that holds data crawled from social media. A data warehouse model is more convenient for analyzing social media. Here I'm going to explain why.

All in all, the Mediabase design is not wrong. Its creators based it on actor-network theory (ANT), which treats all items in a system as actors. The following picture helps to show how this idea is applied to the Mediabase.


The Mediabase started as an empty database with a set of defined tables. Now it is filled on a regular basis with the help of watcher scripts. A watcher crawls one of the Media: blogs, feeds, podcasts, e-mail archives. The traces left in a medium by its users consist of texts, tags or labels, URLs and many other components. We collect these Artifacts in the Mediabase so that we can later use them for different analytic tasks. Moreover, we examine users, e.g. e-mail senders, blog owners, forum post authors and others. We can apply different analyses to investigate changes in the behavior of uniquely identified users. Such an analysis shows the activity patterns of users, their roles in communities, their interests and their closest peers in networks.

The Artifact and the Medium are actors according to ANT. The Community, the Agent (or User) and the Process are actors as well. They all influence activities in online networks. That is why the Mediabase aims to capture these changes.

But why do we need to collect all these data? Let us follow a teacher who wants to understand what problems students have when solving an exercise. He can try to find out by observing online student communities and answering the following questions:
  • Are there any questions about the exercise?
  • How many people have viewed the topic?
  • How many people have taken part in the discussions?
  • What resources were used?
  • How active are the students that talk about the exercise?
  • What are the sentiments and intents expressed in the discussions?

Finding the answers manually can be time-consuming or even impossible, depending on the number of students and the number of online communities involved. 1) Communities have unique structures: the positions of users in these structures define roles such as community experts, communicative members, and brokers, who are members connecting several groups. 2) A huge amount of text contains information that is both relevant and irrelevant for learning. It may include interesting explanations and links that a person can easily overlook.

So the help of machines is needed, and not only because the communities are big. We are talking about situations in which teachers, managers or stakeholders of online learning communities want answers to different questions about different time intervals of their communities.

Therefore, we need a database model that provides the right information in the right place at the right time at the right cost in order to support the right decision. In a traditional database we get a data item by specifying a couple of parameters, e.g. by specifying the time and the medium we get the posts that appeared in the medium in the given time period. Or we get the users that were active in the given medium in the given time period. But more sophisticated queries take time to execute because of joins, etc. Moreover, to analyze communities researchers have to mine the data first and then run experiments to find patterns.

To simplify the life of researchers, I propose a data cube, that is, a multidimensional model with a set of measures (facts). We can navigate the data cube by specifying dimensions that correspond to the Actors of the ANT model. In particular, we can choose a User who posted at a Time in a Medium and get output that characterizes the User at the given Time in the given Medium. In this case, however, the output includes different measures obtained by mining social networks, texts or user activities.
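To make the idea of navigating the cube more concrete, here is a minimal sketch in plain Python. The fact rows, the user names and the measure names (posts, words) are invented for illustration and are not the actual Mediabase schema; a real data warehouse would keep the facts in a fact table with dimension tables around it.

```python
from collections import defaultdict

# Hypothetical fact rows: one row per (user, time, medium) cell of the cube.
# The measures "posts" and "words" are illustrative, not the Mediabase schema.
FACTS = [
    {"user": "alice", "time": "2013-11", "medium": "blog", "posts": 3, "words": 420},
    {"user": "alice", "time": "2013-12", "medium": "blog", "posts": 1, "words": 150},
    {"user": "alice", "time": "2013-12", "medium": "forum", "posts": 4, "words": 200},
    {"user": "bob", "time": "2013-12", "medium": "blog", "posts": 2, "words": 310},
]

MEASURES = ("posts", "words")


def slice_cube(facts, **fixed):
    """Keep only the facts whose dimensions match the fixed values."""
    return [f for f in facts if all(f[dim] == val for dim, val in fixed.items())]


def roll_up(facts, *dims):
    """Sum all measures, grouped by the given dimensions."""
    totals = defaultdict(lambda: dict.fromkeys(MEASURES, 0))
    for f in facts:
        key = tuple(f[d] for d in dims)
        for m in MEASURES:
            totals[key][m] += f[m]
    return dict(totals)


# One cell of the cube: a User at a Time in a Medium.
cell = slice_cube(FACTS, user="alice", time="2013-12", medium="blog")

# Roll up over users and time: the measures per Medium.
per_medium = roll_up(FACTS, "medium")
```

Fixing all three dimensions returns a single cell with its measures, while rolling up over fewer dimensions aggregates the measures along the others.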

In the following picture you can see the hierarchy of dimensions of the proposed data cube. The construction of the Mediabase cube is not finished yet, as it should reflect all actors we find important in online networks. Thus I still need to elaborate the Process, Agent (types), Community (types) and Artifact dimensions.

Wednesday, September 26, 2012

Users editing articles in several Wikipedias

I was wondering about contributors who edit in several Wikipedias. Who are they? What countries are they from?

The English Wikipedia is surely the leader in the number of contributors, articles and article edits, and the number of users who contribute both to the English and to some other Wikipedia is high. I never doubted it. But what about the other Wikipedias? Looking at them, we can find out which nations interact with each other more often than others.

A bit more than 1% of all contributors in our data set edit or create articles in 2 Wikipedias. Only 12 of 4,919,026 users have ever worked in 5 Wikipedias.
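Counts like these can be derived from an edit log by collecting the set of Wikipedia instances each contributor has touched. The following is only a sketch: the edit records are made up, and it assumes that the same user identifier refers to the same person across instances, which has to be established separately for real data.

```python
from collections import Counter, defaultdict

# Hypothetical edit log: (user, Wikipedia language code) pairs.
edits = [
    ("u1", "ru"), ("u1", "uk"), ("u1", "ru"),
    ("u2", "ja"),
    ("u3", "es"), ("u3", "ca"), ("u3", "ru"),
    ("u4", "da"),
]

# Collect the distinct Wikipedia instances per user.
wikis_per_user = defaultdict(set)
for user, wiki in edits:
    wikis_per_user[user].add(wiki)

# Number of users who edited in exactly k instances.
users_by_wiki_count = Counter(len(wikis) for wikis in wikis_per_user.values())

# Share of users active in at least two instances.
cross_wiki_users = sum(n for k, n in users_by_wiki_count.items() if k >= 2)
share_cross = cross_wiki_users / len(wikis_per_user)
```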



Most of the cross-Wikipedia authors are registered; therefore, we cannot identify where these authors come from. Among the anonymous cross-Wikipedia authors, the largest groups come from Germany (15%) and the United States (11%).













The cross-Wikipedia authors contributing to more than 3 Wikipedias are summarized in the following table.



We found 1,718 contributors who made 618,584 edits. The table summarizes the contributors' work: the number of edits and the number of edits per cross-Wikipedia user. Cross-Wikipedia users contributing to 3 Wikipedias often participate in the Russian, the Japanese and one other Wikipedia from the set. The highest activity (more than 100 edits per cross-Wikipedia user) was observed in the Macedonian, Russian, Catalan and Arabic Wikipedia instances. In general, cross-Wikipedia users are more active than contributors working in only one Wikipedia.
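As a quick check of the overall numbers, the average number of edits per cross-Wikipedia user follows directly from the two totals given above. This is the average over all instances together; the per-instance values are in the table and are not recomputed here.

```python
# Totals taken from the text above.
cross_wiki_users = 1_718
cross_wiki_edits = 618_584

# Average edits per cross-Wikipedia user over all instances together.
edits_per_user = cross_wiki_edits / cross_wiki_users
print(f"{edits_per_user:.1f} edits per cross-Wikipedia user")
```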

Data set:


We analyzed Wikipedia data from June 30th, 2001 until January 1st, 2009 and divided this period into 16 equal time intervals to keep the visualizations simple. The 1st interval runs from June 30th, 2001 until December 31st, 2001, the 2nd interval from January 1st, 2002 until June 29th, 2002, and so on. Because of hardware limitations, we did not consider all revisions made in the period when constructing author networks. As the Wikipedia instances have different numbers of articles, we chose the number of revisions depending on the total number of revisions in each instance. The revisions are picked along an article timeline. The overall structure of the Wikipedia network and the activities of Wikipedians are not impaired by this data reduction. In the end, we obtained data sets with comparable numbers of revisions.
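Assigning a revision to its interval can be sketched as follows. The function assumes that the boundaries named above repeat every year, i.e. that intervals alternate between "June 30th to December 31st" and "January 1st to June 29th"; the function name and this reading of "and so on" are assumptions, not the code used for the analysis.

```python
from datetime import date

START = date(2001, 6, 30)
END = date(2009, 1, 1)


def interval_index(day):
    """Return the 1-based half-year interval that a revision date falls into."""
    if not START <= day <= END:
        raise ValueError("date is outside the analyzed period")
    # The second half of each year starts on June 30th.
    second_half = 1 if (day.month, day.day) >= (6, 30) else 0
    return (day.year - START.year) * 2 + second_half


print(interval_index(date(2001, 12, 31)))  # last day of the 1st interval
print(interval_index(date(2002, 1, 1)))    # first day of the 2nd interval
```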

We chose both European and Asian Wikipedias. The instances were selected according to their size: large European Wikipedias (Spanish and Russian), large Asian Wikipedias (Japanese and Turkish), small European Wikipedias (Bulgarian, Catalan, Danish, Greek, Macedonian, and Ukrainian) and small Asian Wikipedias (Arabic, Hindi, and Korean). The list of small European Wikipedias includes Wikipedias in different Slavic languages (Bulgarian, Macedonian, Ukrainian) and the Catalan Wikipedia, the Wikipedia of a minority language group in Spain.

Monday, August 20, 2012

Cultural analysis of Wikipedia


I'm working on a paper devoted to cultural differences in Wikipedia. For it, I have done some literature research and found some interesting resources devoted to this topic. I'm sharing them with you below.
Wikipedia was founded in early 2001 and has since been one of the most successful and most referenced sources of knowledge with high-quality information[1]. The open encyclopedia can include the opinions of any person, wherever she is located geographically. Wikipedia is organized in such a way that any culture can open its own Wikipedia and use its own language for providing content.
According to Hofstede (1991), "culture is the collective programming of the mind which distinguish[sic] one group of people from the other". Cultural patterns together create a complex culture structure that can be examined by studying cultural dimensions (Hall, 1976; 1983; Hofstede, 1991; Kluckhohn and Strodbeck, 1961; Trompenaars & Hampden-Turner, 1998). Hofstede defined 5 dimensions of culture and empirically calculated dimension scores for many nationalities in the world. The dimensions have recently become a standard framework for cross-cultural research projects. Some works about the cross-cultural nature of Wikipedia use Hofstede's dimensions. Hara et al. (2010) explain courtesy behaviors in the Eastern Wikipedias by a high power distance and a preference for collective work. They claim that authors from Western Wikipedias show more conflict and disagreement behaviors. Moreover, they argue that patterns of author behavior differ with the size of a Wikipedia. They analyzed only 4 Wikipedias of various sizes, focusing on differences between Eastern and Western cultures.
Hofstede's dimensions were also used in measuring the quality of the article "game" (Pfeil et al., 2006). The researchers analyzed this Wikipedia article in the French, German, Japanese and Dutch Wikipedias. They found that even where cultures have positively correlated Hofstede dimensions, the quality criteria of the Wikipedias considered differ. More evidence of cultural influence in Wikipedia was found by Pembe & Bingol (2006) when comparing the linguistic structures of the English and German Wikipedias. Although the English Wikipedia is larger, the German Wikipedia at that point included more words associated with the concept of family.
Rask (2007) analyzed differences between aspects of Wikipedia in developed and developing countries. He considered the number of contributors and the number of edits per article for different countries and concluded that richer countries profit more from the knowledge shared in Wikipedia. Rask compared different Wikipedias from an economic point of view, using the human development index[2] in his observations.
Attempts to define cultural patterns include the examination of Wikipedia user talk pages and their contents, the contents of articles and the numbers of edits. In this paper we focus on a still untouched area of cultural patterns in Wikipedia. We observe author networks over the course of time and differentiate between registered and anonymous authors. Moreover, the geographical location of authors and their migration to other countries are taken into consideration. Furthermore, we find Wikipedia users contributing to several Wikipedias and analyze their behavior and geographical location. These and other findings on Wikipedia growth and editing behavior are used to define cultural patterns.
Voss (2005) was the first of many researchers to analyze the fundamentals of Wikipedia and its networks. His main focus was on the German Wikipedia and its graph of links. He showed that the Wikipedia network is scale-free (Barabási et al., 1999). Moreover, Voss found that the number of user talk pages is much higher in the Japanese than in the German, Danish or Croatian Wikipedia, although he left questions concerning cultural differences between Wikipedias unanswered.
In the next section, we review existing research examining Wikipedia networks. Then we present our methodology and explain the data set we are using. The results include findings about authors, their behaviors, author networks, and articles. The paper concludes with a discussion and an outlook on future work.

Dynamic development of Wikipedia network

The dynamic development of networks has been the focus of many works (Klamma and Haasler, 2008a, 2008b; Capocci et al., 2006; Zlatic et al., 2006). Klamma and Haasler (2008a, 2008b) visualized different wiki projects (Berlin Wiki, Google Wiki, Aachen Wiki) and observed their changes over the course of time. They found that registered users often serve as connectors in networks of anonymous users. Moreover, they showed that a tiny number of Wikipedia contributors created or edited the majority of articles. Klamma and Haasler created the Wikiwatcher tool, which can be used to retrieve Wikipedia data, visualize the networks and calculate simple SNA values.
The analysis of Wikipedia as a complex network reveals that Wikipedia grows according to preferential attachment (Barabási et al., 1999). Capocci et al. (2006) showed similarities between the evolution patterns of the complex networks of the WWW and of different Wikipedias: new nodes are more likely to connect to existing nodes with high degrees. Zlatic et al. (2006) examined the article networks of 11 Wikipedias. The researchers concentrated on article network measures and their comparison. They argue that the growth of Wikipedia networks is unique for different language versions of Wikipedia.
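
Preferential attachment itself is easy to sketch. The toy simulation below is a simplified growth process in the spirit of the Barabási model, in which every new node links to one existing node chosen with probability proportional to its degree. It is not the model fitted to Wikipedia by Capocci et al., and the function name, the seed and the network size are arbitrary choices for illustration.

```python
import random
from collections import Counter


def grow_network(n_nodes, seed=0):
    """Grow a network in which each new node attaches to one existing node,
    chosen with probability proportional to that node's degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every node appears in this list once per incident edge, so a uniform
    # choice from the list is a degree-proportional choice of node.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges


edges = grow_network(1000)
degree = Counter(node for edge in edges for node in edge)

# A few early nodes collect many links, while most nodes keep a low degree.
print(degree.most_common(5))
```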

References:
Barabási, A.-L., Albert, R., & Jeong, H. (1999). Mean-field theory for scale-free random networks.
Hall, E.T. (1976). Beyond culture. Garden City, New York: Doubleday.
Hall, E.T. (1983). The dance of life. Garden City, New York: Doubleday.
Hara, N., Shachaf, P., & Hew, K. (2010). Cross cultural analysis of the Wikipedia community. Journal of the American Society for Information Science and Technology, 61(10), 2097-2108.
Hofstede, G.H. (1991). Cultures and organizations: Software of the mind. London: McGraw Hill.
Kluckhohn, C., & Strodbeck, F.L. (1961). Variation in value orientation. Evanston, IL: Row and Peterson.
Pembe, F., & Bingol, H. (2006). Complex networks in different languages: A study of an emergent multilingual encyclopedia. Proceedings of Sixth International Conference on Complex Systems, June 25-30, 2006, Boston, MA, USA.
Pfeil, U., Zaphiris, P., & Ang, C.S. (2006). Cultural differences in collaborative authoring of Wikipedia. Journal of Computer-Mediated Communication, 12(1), article 5.
Rask, M. (2007). The Richness and Reach of Wikinomics: Is the Free Web-Based Encyclopedia Wikipedia Only for the Rich Countries? Proceedings of the Joint Conference of The International Society of Marketing Development and the Macromarketing Society, June 2-5, 2007.
Trompenaars, F., & Hampden-Turner, C. (1998). Riding the waves of culture: Understanding cultural diversity in global business. New York: McGraw-Hill.
Voß, J. (2005). Measuring Wikipedia. Proceedings of the 10th ISSI 2005 Conference, July 24-28, 2005, Stockholm, Sweden, 1-12.

Tuesday, July 17, 2012

Cultural dimensions of Hofstede


I'm working on a paper devoted to cultural differences on the Web and find Hofstede's cultural dimensions extremely important for anyone looking for cultural differences. Here I'm sharing my findings.

"The great work of Hofstede proposes cultural dimensions of people working at IBM in over 50 countries (Hofstede, 1991). He identified 5 dimensions of differences between national cultures: power distance, collectivism versus individualism, femininity versus masculinity, uncertainty avoidance, long-term orientation. Following we are going to explain these dimensions.
The power distance of a society shows how far the power of members is accepted and expected within societal institutions such as the family, the school and the community at work. This dimension is about respect according to the distance between people at different levels of a hierarchy.
In an individualistic culture everyone is responsible for herself, while in a collectivistic culture groups are responsible for their members. In other words, individualism describes the situation in which a person thinks first about her own interests, whereas collectivism describes the situation in which a person thinks first about the group's interests.
Masculinity and femininity refer to social gender roles. Where the roles are clearly distinct, i.e. “men are focused on material success, whereas women are concerned with the quality of life”, masculinity is higher. Where the roles overlap, i.e. both women and men care about quality of life and material success, masculinity is lower.
Attitudes to uncertainty are reflected in the uncertainty avoidance (UA) dimension. Representatives of some cultures react negatively to uncertainty and need a clear idea of what is going to happen. These cultures score high on the UA dimension.
Long-term vs. short-term orientation in life explains cultural differences in spending and saving money and in accepting quick results. People with a long-term orientation in life (LTOL) tend to save more money and adapt themselves to the modern context, while people with a short-term orientation tend to overspend but respect traditions with no or only minor adaptations. Moreover, people with a short-term orientation expect quick results, while people with a long-term orientation accept that results come slowly."

Saturday, November 22, 2008

Reflective patterns in online communities

The return on investment (ROI) tells us how much money was gained or lost. It measures the gain as a percentage of the investment, e.g. if you gained 5 € on a 10 € investment, the ROI is 5/10 = 50%. The ROI can also be negative.
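
The calculation can be written down in a few lines. This is just the arithmetic from the example above; the function name and the check for a positive investment are my own additions.

```python
def roi(gain, investment):
    """Return the gain (or loss, if negative) as a fraction of the investment."""
    if investment <= 0:
        raise ValueError("investment must be positive")
    return gain / investment


print(f"{roi(5, 10):.0%}")   # the example from the text: 50%
print(f"{roi(-2, 10):.0%}")  # a loss gives a negative ROI: -20%
```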

How can we analyse the return on investment in online communities? What events cause positive or negative activities in networks? Some papers try to explain this on the basis of community contributions:

In Burke et al., the scholars calculated the probability of getting a response on discussion boards. The probability depends strongly on the type of introduction ("I've been lurking here for several months" is a group introduction, "working as a professional in this area" is a topic introduction) and sometimes on the type of group.

The wiki case is examined in "Wikis as social networks: dynamics and evolution", where vandalism was found on wiki pages that regulate the information flow within wikis. In particular, if a page is a hub between many other pages, the probability that it gets trashed is high.

Anyway, finding the ROI of human interactions is more complex than calculating the financial ROI, where monetary values are used. The investments in human interactions and the returns on them are not uniform and are hardly comparable. You know this from childhood: if you use rude words in a conversation, your listener will reply coarsely as well (in most cases). The rule "if ... then ..." is at work. If we have the pattern "rude question", will we get the pattern "rude answer"? Why yes and why not? What are the dependencies? Shouldn't the activities, events and actions in communities be modeled? And will such modeling help to find the return they provoke?

In computer science there is a bag of artificial intelligence problems such as reasoning, learning and decision making. Approaches to these are aware, or at least try to be aware, of all the inconsistencies and uncertainties of human interactions. They should be used to find the reflective patterns.

Thursday, July 24, 2008

Emotional Web Intelligence

I have always been interested in how the non-cognitive side of human interactions can be analysed. In my master thesis I computed the term frequency of category words for each thread in a mailing list. The categories were kindly provided by James W. Pennebaker. He and his colleagues did a great job creating a repository of words classified according to the emotions and feelings people try to express in their writing. They now use the repository in their text analysis software.
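
The term frequency of category words can be sketched roughly like this. The two categories and their word lists are invented placeholders, not Pennebaker's repository, and the tokenization is deliberately naive.

```python
import re
from collections import Counter

# Hypothetical categories with tiny word lists, for illustration only.
CATEGORIES = {
    "positive": {"great", "thanks", "happy"},
    "negative": {"bad", "angry", "wrong"},
}


def category_frequencies(text, categories=CATEGORIES):
    """Return, per category, the share of words in the text that belong to it."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for category, vocabulary in categories.items():
            if word in vocabulary:
                counts[category] += 1
    total = len(words) or 1  # avoid dividing by zero for an empty thread
    return {category: counts[category] / total for category in categories}


thread = "Thanks, that was a great answer. The first patch was wrong."
vector = category_frequencies(thread)
print(vector)
```

The resulting dictionary, read in a fixed category order, is one way to obtain a vector that describes a thread.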

The community of a mailing list was characterized by an emotional vector as well as by a structural vector. I examined the vectors of all communities in order to find similar communities. The clusters based on structure and the clusters based on emotional analysis turned out to be different.
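
One simple way to compare such vectors is cosine similarity. The sketch below uses it as an assumption: the measure, the community names and the numbers are all illustrative and say nothing about the clustering actually used in the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(name, vectors):
    """Return the community whose vector is closest to the given community."""
    others = (other for other in vectors if other != name)
    return max(others, key=lambda other: cosine(vectors[name], vectors[other]))


# Hypothetical emotional vectors of three mailing list communities.
emotional = {
    "list_a": [0.2, 0.1, 0.0],
    "list_b": [0.4, 0.2, 0.0],
    "list_c": [0.0, 0.1, 0.5],
}

print(most_similar("list_a", emotional))
```

Running the same comparison on the structural vectors and checking whether the same pairs come out as similar is one way to see the difference described above.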

Analysing all the results, I suppose that what we see in a network structure does not correlate with emotional preferences. Human interactions depend on many other factors such as prestige (created through a network structure), duties (imposed by an organizational structure) and many others.

Sometimes we have to communicate with somebody we don't like. But will such communication be successful, efficient and enduring?

The human interactions can be perfectly visualized with the help of graphs:


and can be structurally analysed:


In our example, node 25 and node 60 do not interact, though they produce and consume a correlating number of words. However, those words are possibly produced by node 26, which lies between those nodes. For the other nodes, no evident correlation in characteristics was discovered. If we switch to the characteristics of node 36, we cannot find any value that correlates with the cognitive characteristics of the other nodes. Node 36 has no edges to the others, and that might be a reason. Nevertheless, in numerous experiments we could not prove the assumption that members interacting with each other possess correlating or non-correlating emotional vectors.

Anyway, emotions can be used in graph visualization to make clear who is connected with whom non-cognitively (by colors or by positions in the graph).

I would like to follow the evolution of communities and its dependency on emotions. Moreover, it is useful to look for other dependencies that influence the success or failure of communities.

Monday, June 30, 2008

Cross-Platform Aspects of the Social Web paper

I'm now making some last corrections to the paper for the special track on Cross-Platform Aspects of the Social Web (CPASW '08) at the I-Know conference in Graz:
CPASW

I've done a kind of survey of prevalent SNS (social network sites). After talking with my colleague David, I decided to write a full survey of SNS and websites with SNS elements at some point. It might get cited well.