My main area of research includes Large scale analysis of unstructured text.  I’m interested in finding patterns in Big Data for example patterns in the contents of the global media sphere and social media content using big data technologies and text analytics. I also work in creating scalable online preference learning modules that learn adaptively using machine learning techniques. Before being introduced to the world of text mining and big data I did research in the area of e-health informatics.

The following are the research projects and topics I have worked on.

Content Analysis of 150 years of British Periodicals

What could be learnt about the world if we could read the news from over 100 local newspapers for a period of 150 years? This is what we have done in this project, together with a social scientist and a historian, who had access to 150 years of British regional newspapers.

The patterns that emerged from the automated analysis of 35 million articles ranged from the detection of major events, to the subtle variations in gender bias across the decades. The study has investigated transitions such as the uptake of new technologies and even new political ideas, in a new way that is more like genomic studies than traditional historical investigation. The main focus of the study was to establish if major historical and cultural changes could be detected from the subtle statistical footprints left in the collective content of local newspapers. How many women were mentioned? In which year did electricity start being mentioned more than steam? Crucially, this work goes well beyond counting words, and deploys AI methods to identify people and their gender, or locations and their position on the map. We have demonstrated that computational approaches can establish meaningful relationships between a given signal in large-scale textual corpora and verifiable historical moments.

Journal Publication: PNAS
Media Coverage: Times Magazine,, Phys.orgInternational Business Coverage (UK)Tech Times, iNewsScience Daily, WN.comBritish Library Newsroom blogSciences et Avenir
Gender Representation in Online News

It has long been argued that women are under-represented and marginalised in relation to men in the world’s news media. In this project we analysed over two million articles to find out how gender is represented in online news. The study, which is the largest undertaken to date, found men’s views and voices are represented more in online news than women’s.

We analysed both words and images so as to give a broader picture of how gender is represented in online news. The corpus of news content examined consists of 2,353,652 articles collected over a period of six months from more than 950 different news outlets. From this initial data set, we extracted 2,171,239 references to named persons and 1,376,824 images resolving the gender of names and faces using automated computational methods. We found that males were represented more often than females in both images and text, but in proportions that changed across topics, news outlets and mode. Moreover, the proportion of females was consistently higher in images than in text, for virtually all topics and news outlets; women were more likely to be represented visually than they were mentioned as a news actor or source. Our large-scale, data-driven analysis offers important empirical evidence of macroscopic patterns in news content concerning the way men and women are represented.

Journal Publication: PLOS ONE
Media Coverage: Science DailyScience Mag
Scalable Preference Learning from Data Streams

In this project we study the task of learning the preferences of online readers of news, based on their past choices. Previous work has shown that it is possible to model this situation as a competition between articles, where the most appealing articles of the day are those selected by the most users.

The appeal of an article can be computed from its textual content, and the evaluation function can be learned from training data. In this paper, we show how this task can benefit from an efficient algorithm, based on hashing representations, which enables it to be deployed on high intensity data streams. To solve this problem, our prior results indicate that one needs to solve a preference learning task that consists of classifying ordered pairs of articles [9]. We call this classifier a ranker. For the ranking procedure we learn an appeal function that separates pairs of more-appealing (i.e. popular) and less-appealing (i.e. non-popular) articles. We demonstrate the effectiveness of this approach on four real world news streams, compare it with standard approaches, and describe a new online demonstration based on this technology. We have trained rankers on data streams within a global interval of six years between 2008 and 2014 using both TFIDF representations (words) and scalable hash representations (h12).

scalable preference learning
Conference Publication: WWW 2015
Coverage of Science in the Media: A study on the impact of the Fukushima disaster

The contents of English-language online-news over 5 years have been analyzed to explore the impact of the Fukushima disaster on the media coverage of nuclear power. This big data study, based on millions of news articles, involves the extraction of narrative networks, association networks, and sentiment time series.

The key finding is that media attitude towards nuclear power has significantly changed in the wake of the Fukushima disaster, in terms of sentiment and in terms of framing, showing a long lasting effect that does not appear to recover before the end of the period covered by this study. In particular, we find that the media discourse has shifted from one of public debate about nuclear power as a viable option for energy supply needs to a re-emergence of the public views of nuclear power and the risks associated with it.

Conference Publication: IEEE Big Data 2014
CompLACS: Composing Learning for Artificial Cognitive Systems

The purpose of this project is to develop a unified toolkit for intelligent control in many different problem areas. I was involved in the creation of a web agent platform. The goal of the web-agent platform was to test algorithms and concepts, and to inspire new algorithms and concepts, within the domain of composing learning systems. The four scenarios identified within this goal were: Scenario 1: Annotation of News Articles, Scenario 2: Prioritising News Feeds and Scenario 3: Exploring the Web.Several modules were developed for online classifiers for learning news topics such as sports, politics, entertainment, education and technology and also to learn appeal of readers of different news outlets such as BBC, NPR etc. Another study was done combining modular adaptive modules into a single system, where an intermediate representation of the data was learnt (hidden layer) by supervised online learning based on web streams. This was then used as an input to train another system, realising a combined architecture.

Project Website: CompLACS
Detecting patterns in 2012 US Elections news coverage

The automated parsing of 130k news articles about the 2012 US presidential elections produces a network formed by the key political actors and issues, which were linked by relations of support and opposition. The nodes are formed by noun phrases and links by verbs, directly expressing the action of one node upon the other. This network is studied by applying insights from several theories and techniques, and by combining existing tools in an innovative way, including: graph partitioning, centrality, assortativity, hierarchy and structural balance.First, we observe that the fundamental split between the Republican and Democrat camps can be easily detected by network partitioning, which provides a strong validation check of the approach adopted, as well as a sound way to assign actors and topics to one of the two camps. Second, we identify the most central nodes of the political camps. This is the first study in which political positions are automatically extracted and derived from a very large corpus of online news, generating a network that goes well beyond traditional word-association networks by means of richer linguistic analysis of texts.

Journal Publication: Big Data & Society
Conference Publication: EACL 2012, IPP 2012
Media Coverage: Science Daily, Scientific Computing, ECN Mag,,, Significance
Automated analysis of Gutenberg project stories for digital humanities

The wide availability of digitized literary works has given rise to a computational approach to analyzing these texts. The extraction of narrative networks from large corpora, discusses various issues relating to the validation of the resulting information, and shows an application of this methodology, to the analysis of three selected books containing stories from the Gutenberg Project. We demonstrate how this approach correctly identifies the key actors in each narrative and also some key features such as: “The Indian Tales” by Rudyard Kipling does not necessarily talk a lot about Indians, instead highlights British soldiers as the main heroes of the narration. The analysis of the “Heroic stories of American History” by Albert F. Blaisdell and Francis K. Ball identifies the most important generals who served the American army during the revolutionary war. This method can automate the labour intensive “coding” part of “Distant Reading” among other tasks, and therefore have relevance in the social sciences and the humanities.

Journal Publication: International Journal of Advanced Computer Science
Automating Large scale analysis of narrative text: The methodology

We present a methodology for the extraction of narrative information from a large corpus. The key idea is to transform the corpus into a network, formed by linking the key actors and objects of the narration, and then to analyse this network to extract information about their relations. By representing information into a single network it is possible to infer relations between these entities, including when they have never been mentioned together. Various measures have been introduced to increase the precision of the system thereby making sure information extracted is of high confidence.The contribution of this study is not in the improvement of tools for the processing of language (e.g. parsers) but in the development of a new methodology for the extraction of knowledge from a large corpus. The information we extract is not found in any individual document, but inferred from information distributed across the corpus, by effect of analysing a large network assembled by using all the documents.



Journal Publication: Natural Language Engineering
Conference Publication: WAPA 2011
Multi-document summarisation using event based extractive summarisation

With the continuing growth of online information, it has become increasingly important to provide improved mechanisms to find and present textual information effectively. Event-based summarization is of great interest in recent research. Atomic events, which are the relationships between the important named entities can be automatically extracted from text and used for summarization.We developed a method to do automatic summaries using the key entites, actions and triplets found by our system pipeline incorporating a sentence scoring method based on textual features and using it to identify the summary sentences. This method proves to be useful by ranking 24 and 23 in ROUGE-2 and ROUGE-SU4 recall scores when compared across the scores of other 43 participants in the TAC 2010 guided summarization task. Future work is to be conducted in order to generate much more cleaner summaries using this information with other more sophisticated techniques.

 PhD Thesis: Chapter 
Vidusuwa: A patient-centric telemedicine solution

The pilot project of ViduSuwa, links the Specialist in a city with a patient in a peripheral setting via an electronic media; the solution is presently implemented in two hospitals in Sri Lanka. Patients in rural areas incur heavy expenditure in traveling long distances spending lot of time to consult Specialists in cities due to the lack of Specialists in their areas. This issue can be addressed by an eSolution that makes appropriate use of Electronic Medical Records (EMR) and Telemedicine technologies which enables the patient to consult a Specialist through eConsultation. The main benefit of implementing such an eSolution is that, it ensures the availability of the Specialist across a distance at many eClinics within the shortest possible time frame.Information and Communication Technology Agency of Sri Lanka (ICTA) together with University of Colombo School of Computing (UCSC) and Health Ministry of Sri Lanka launched the e-consultancy project, which is full of benefits to both patients and physicians.

Journal Publication: Sri Lanka Journal of Bio-Medical Informatics
Conference Publication: ETELEMED’10 (e-health solution), ETELEMED’10 (m-health solution), eAsia 2009IST-Africa (e-health business model), IST-Africa (e-health change-management model)
Social media & sharing icons powered by UltimatelySocial