Wikipedia data quality: automatic evaluation of infoboxes in different languages

An infobox provides a summary of the most important information about the object described in a Wikipedia article. In other words, a Wikipedia infobox summarizes factoid knowledge.

An infobox looks like a table and is usually placed at the top right of a Wikipedia article. Depending on the topic, an infobox consists of various parameters. For example, an infobox that describes a person often lists the date and place of birth, education, citizenship, and so on. An infobox about a city often shows the population, mayor, postal code, country, date of town rights, and other details.
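In an article's wikitext source, an infobox is written as a template call with `name = value` parameters. The following is a minimal sketch of how such parameters could be read into a dictionary. The function name `parse_infobox` and the sample values are made up for illustration, and the parser handles only the simple case of a single, well-formed template; real wikitext has many edge cases it does not cover.

```python
def parse_infobox(wikitext: str) -> dict[str, str]:
    """Parse the first {{Infobox ...}} template into a name -> value dict.

    Simplified, illustrative parser: it splits on '|' only at the top level
    of the template, so pipes inside [[links]] or nested {{templates}} are kept.
    """
    start = wikitext.find("{{Infobox")
    if start == -1:
        return {}

    depth = 0          # nesting level of {{...}} and [[...]]
    parts = []         # top-level segments separated by '|'
    current = []
    i = start
    while i < len(wikitext):
        two = wikitext[i:i + 2]
        if two in ("{{", "[["):
            depth += 1
            current.append(two)
            i += 2
        elif two in ("}}", "]]"):
            depth -= 1
            if depth == 0:      # closing braces of the infobox itself
                break
            current.append(two)
            i += 2
        elif wikitext[i] == "|" and depth == 1:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(wikitext[i])
            i += 1
    parts.append("".join(current))

    params = {}
    for part in parts[1:]:      # parts[0] holds the template name
        name, sep, value = part.partition("=")
        if sep and name.strip():
            params[name.strip()] = value.strip()
    return params


# Made-up sample wikitext, not taken from a real article.
sample = """{{Infobox settlement
| name = Exampleville
| population_total = 12345
| leader_name = [[Jane Doe|Mayor Doe]]
}}"""
print(parse_infobox(sample))
```

Each key in the resulting dictionary corresponds to one row of the rendered infobox table.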

Because each language version of Wikipedia is edited independently, information in infoboxes about the same topic may differ. For example, if somebody updates the population in the article about London in English Wikipedia, that does not mean the other language versions (over 200) will receive the same update. Often other Wikipedia users must make the relevant changes in each language separately.
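To make this divergence concrete, the sketch below compares infoboxes about the same topic from several language versions. It is a deliberately naive illustration, not the method used by any particular tool: it assumes the parameters have already been mapped to shared names across languages (which is itself a hard problem), and the function name `compare_infoboxes` and all sample values are hypothetical.

```python
def compare_infoboxes(infoboxes: dict[str, dict[str, str]]) -> dict:
    """Compare infoboxes keyed by language code.

    Returns how many non-empty parameters each language has, plus the
    parameters whose non-empty values disagree between languages.
    Assumes parameter names are already aligned across languages.
    """
    # Count filled (non-empty) parameters per language.
    filled = {
        lang: sum(1 for value in params.values() if value.strip())
        for lang, params in infoboxes.items()
    }

    # Collect every parameter name seen in any language.
    all_keys = set().union(*(params.keys() for params in infoboxes.values()))

    conflicts = {}
    for key in sorted(all_keys):
        values = {
            lang: params[key].strip()
            for lang, params in infoboxes.items()
            if params.get(key, "").strip()
        }
        if len(set(values.values())) > 1:
            conflicts[key] = values

    return {"filled": filled, "conflicts": conflicts}


# Made-up values: one edition was updated, the other was not.
report = compare_infoboxes({
    "en": {"population": "12500", "mayor": "Jane Doe", "postal_code": "00-001"},
    "pl": {"population": "12345", "mayor": "Jane Doe", "postal_code": ""},
})
print(report)
```

Here the report would flag `population` as a conflict and show that the `en` infobox has more filled parameters than the `pl` one. A raw count like this says nothing about which value is correct or current, which is why a real quality assessment needs more than string comparison.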

To compare information quality in Wikipedia infoboxes across language versions, we would normally need to understand those languages. Fortunately, we can automate this process by using machine learning techniques to assess the quality of multilingual information. One tool built for this purpose is a recently released Chrome extension that helps compare the quality of infoboxes between Wikipedia language versions. See the short video on how it works.

The highest-quality language versions can help improve article quality in less developed Wikipedia language editions, and can also enrich other popular open knowledge bases such as DBpedia, Wikidata, and YAGO.

The source code of the extension is available on GitHub.