Microsoft Azure will support research on Wikipedia quality

Dr Krzysztof Węcel from the Department of Information Systems has received a grant to conduct research using the tools available through the Microsoft Azure cloud. The award was presented as part of the Microsoft Azure for Research Award programme after a positive assessment of the project proposal entitled “Data Science for improving the quality of crowdsourced information. The case of Wikipedia”. The project will also involve doctoral student Włodzimierz Lewoniewski and students completing summer internships in the field at the Faculty of Informatics and Electronic Economy.

The aim of the study is to develop methods of gathering complete, precise, reliable and current information, i.e. so-called high-quality information, based on the analysis of information provided by independent suppliers (crowdsourcing). The best-known example of a source co-created by multiple authors is Wikipedia. It currently contains over 44 million articles in nearly 300 languages and is the fifth most popular website in the world. It is also the source that attracts the most online traffic through search engines: 37.5% (source: Alexa).

The data volume poses a particular challenge. The English Wikipedia contains over 5 million articles. The compressed text of the articles alone takes up 13GB. Pages with discussions on article contents add another 25GB. Including information about who changed each page and when (without the changed content itself) would require a further 50GB. The expected volume for the planned scope of research is 15-20 terabytes (1TB = 1024GB). The use of Azure services can significantly improve the quality and speed of the research. It will not only help us overcome the challenges related to the data volume, but also significantly increase our computational capacity, mainly in machine learning, for building quality assessment models.

The research conducted at Poznań University of Economics and Business may contribute towards overcoming a range of social and economic issues related to information quality. One example is the spread of fake news. The team of PUEB researchers will also gain valuable experience in working with large-scale data, which will strengthen the scientific potential of the EU grant applications submitted.