Dataset with Quality Assessment of 47 Million Wikipedia Articles

On the Hugging Face platform, an extensive dataset has been published containing the results of automated quality assessments for 47 million Wikipedia articles in 55 language versions. These evaluations were carried out using the algorithms employed by WikiRank.net, a tool that compares the quality of Wikipedia articles across different languages.

The WikiRank service assigns each article a synthetic score on a scale of 0–100 based on various metrics, including text length, the number of sources (references), sections, and illustrations. As a result, every article receives a unified quality score, simplifying comparisons between language versions that typically use different evaluation criteria.
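WikiRank's exact formula is not reproduced in this text; purely as an illustration of the idea, a synthetic 0–100 score could combine normalized metrics like this (the caps and weights below are hypothetical, not WikiRank's actual values):

```python
def synthetic_score(length_chars, references, sections, images):
    """Illustrative 0-100 quality score: each metric is normalized
    against a saturation cap, then weighted. Caps and weights are
    assumptions for demonstration, not WikiRank's real parameters."""
    def norm(value, cap):
        return min(value / cap, 1.0)

    return round(
        40 * norm(length_chars, 50_000)  # text length
        + 30 * norm(references, 100)     # number of sources
        + 15 * norm(sections, 20)        # section count
        + 15 * norm(images, 10),         # illustrations
        1,
    )

print(synthetic_score(25_000, 40, 10, 4))  # mid-sized article -> 45.5
```

The saturation caps ensure that a very long article without sources cannot reach a top score, which mirrors the general idea of combining several independent metrics.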

The publication of this dataset offers many potential applications and benefits.

Comparison of Content Quality in Different Languages

The unified 0–100 scoring system allows for direct comparisons of article quality across various language versions. This makes it possible to identify the languages in which a given article is best developed, as well as those where further improvement is needed. It is the first time such a comprehensive multilingual analysis of Wikipedia's quality has been possible.
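Because every language version uses the same scale, the comparison reduces to a lookup over per-language scores. A minimal sketch (the scores below are invented example values, not taken from the dataset):

```python
# Hypothetical unified 0-100 scores for one article across languages.
scores = {"en": 87.4, "de": 72.1, "pl": 55.0, "ru": 40.3}

# Language version where the article is best developed.
best_lang = max(scores, key=scores.get)

# Versions that would benefit from further improvement.
needs_work = [lang for lang, s in scores.items() if s < 60]

print(best_lang)    # -> en
print(needs_work)   # -> ['pl', 'ru']
```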

Research on Information Quality and NLP

This dataset serves as a valuable resource for researchers in information science and specialists in natural language processing (NLP). It enables the analysis of quality trends on a massive scale and can be used to train artificial intelligence models to predict text quality. Previous studies have already leveraged similar WikiRank data to explore which topics are best represented in different language versions of Wikipedia, demonstrating the usefulness of these evaluations in comparative analyses. Now, such research will become even more accessible thanks to this publicly available dataset.
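As a sketch of how such scores could supervise a quality-prediction model, here is a deliberately minimal one-feature regression in plain Python. The training pairs are synthetic; a real setup would use the published scores as labels together with much richer text features:

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression: y ~ a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# Synthetic training pairs: (number of references, quality score).
refs = [0, 10, 25, 50, 80]
quality = [5, 20, 40, 70, 95]

a, b = fit_line(refs, quality)
predicted = a * 40 + b  # predicted quality for an unseen article with 40 refs
```

In practice one would train a far richer model (e.g. on text embeddings), but the supervision signal is exactly this: features of an article on one side, its quality score on the other.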

Support for Wikipedia Editors

Automated evaluations can assist Wikipedia editors in identifying articles that require improvement. In many language versions, the majority of entries have not received any community-assigned quality ratings (in some Wikipedias, over 99% of articles remain unevaluated by humans). With WikiRank data, editors can easily pinpoint lower-quality articles—for instance, those with few sources or brief content—and focus their efforts on enhancing them. This tool can highlight gaps and set editorial priorities for each language version.
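Building an editorial worklist from such data is a simple filter-and-sort. A sketch, assuming hypothetical record fields (the field names and threshold are not the dataset's actual schema):

```python
# Hypothetical article records; "score" is the unified 0-100 rating.
articles = [
    {"title": "A", "score": 91.2, "references": 120},
    {"title": "B", "score": 34.5, "references": 3},
    {"title": "C", "score": 58.0, "references": 12},
    {"title": "D", "score": 12.9, "references": 0},
]

# Flag articles below a quality threshold or with very few sources,
# and surface the weakest ones first for editorial triage.
worklist = sorted(
    (a for a in articles if a["score"] < 60 or a["references"] < 5),
    key=lambda a: a["score"],
)

print([a["title"] for a in worklist])  # -> ['D', 'B', 'C']
```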

Development of AI Algorithms for Content Quality Analysis

Releasing such a large and diverse dataset will facilitate the development of AI algorithms for assessing the quality of texts on the Internet. AI models can be trained with millions of examples of articles along with their quality scores, enabling them to distinguish between reliable and less substantiated content. Such automated evaluation systems may find applications not only on Wikipedia but also in filtering online information—from detecting unreliable articles to improving search engine results based on content quality.

The complete dataset is available for download on the Hugging Face platform.
