Wikipedia is a massive repository of human knowledge. The largest edition, the English Wikipedia, contains over 65.5 million pages, including 7.17 million articles (excluding redirects). Connecting this vast network are 1.63 billion unique page-to-page links (commonly known as “wikilinks”). An analysis of this dataset identifies the most cited articles on the English Wikipedia.

When considering which Wikipedia articles might be the most cited, one might assume that prominent topics like “United States” or “World War II” hold the top positions. However, when the entire link graph of Wikipedia is processed to calculate the number of incoming links, or citations (quantifying exactly how many articles link to a specific page), the true leaders of the ranking emerge.

Top Cited Articles

Analysis of the English Wikipedia dump files from May 2026 made it possible to build a ranking of the most cited articles. Here are the top 50 Wikipedia articles, with the number of incoming links (citations) shown in brackets:

  1. ISBN (1,640,723)
  2. Geographic coordinate system (1,257,791)
  3. Digital object identifier (711,302)
  4. Wayback Machine (675,714)
  5. Wikidata (509,749)
  6. ISSN (505,509)
  7. Taxonomy (biology) (496,532)
  8. Global Biodiversity Information Facility (464,041)
  9. Time zone (455,438)
  10. United States (437,284)
  11. IMDb (406,083)
  12. Open Tree of Life (399,021)
  13. Binomial nomenclature (392,021)
  14. Animal (376,148)
  15. Catalogue of Life (368,553)
  16. Interim Register of Marine and Nonmarine Genera (313,153)
  17. iNaturalist (295,815)
  18. Encyclopedia of Life (275,091)
  19. Association football (274,176)
  20. Daylight saving time (273,633)
  21. Wikispecies (267,809)
  22. France (265,679)
  23. Semantic Scholar (253,534)
  24. OCLC (251,623)
  25. Record label (242,084)
  26. Arthropod (241,604)
  27. National Center for Biotechnology Information (237,741)
  28. PubMed (230,779)
  29. World War II (227,499)
  30. Music genre (221,588)
  31. Pancrustacea (216,225)
  32. Insect (214,763)
  33. Germany (214,199)
  34. United Kingdom (213,409)
  35. Record producer (200,934)
  36. The New York Times (200,715)
  37. Political party (194,625)
  38. Australia (190,367)
  39. Italy (189,720)
  40. Synonym (taxonomy) (185,121)
  41. India (184,424)
  42. Bibcode (174,702)
  43. Integrated Taxonomic Information System (174,431)
  44. Surname (171,530)
  45. Japan (168,687)
  46. Russia (166,259)
  47. Canada (165,593)
  48. Spain (162,773)
  49. UTC+02:00 (160,743)
  50. Poland (160,034)

A more complete ranking is presented in this video and is also available on Hugging Face and Kaggle.

Methodology: Processing the Data

A simple script that loads the content of each Wikipedia article (e.g. in wiki markup format) is insufficient for producing this ranking: such an approach would be prohibitively slow and would fail to account for hidden complexities within the website’s structure. Instead, Wikipedia’s raw SQL database dumps are processed. To improve accuracy, the data pipeline merges four core files (from Wikimedia Downloads, as of May 2026):

  • 1. Master Article Registry
    The process begins with enwiki-20260501-page.sql.gz (see page table). This file assigns a unique numerical ID to every page on Wikipedia. It enables the filtering out of talk pages or user pages, retaining only actual encyclopedia articles (namespace 0). Crucially, it also indicates whether a page is a “Redirect” (a shortcut page).
  • 2. Link Target Translator
    In modern Wikipedia architecture, links in the database do not point directly to text; they point to numerical target IDs. The file enwiki-20260501-linktarget.sql.gz (see linktarget table) is used as a dictionary to translate these target IDs back into readable article titles.
  • 3. Redirect Resolver
    Wikipedia relies heavily on alias pages (e.g., “USA” redirects to “United States”). The file enwiki-20260501-redirect.sql.gz (see redirect table) is utilized to build a map of every single redirect so the final destination of those shortcuts can be accurately determined.
  • 4. Connectivity Graph
    Finally, the massive enwiki-20260501-pagelinks.sql.gz file (see pagelinks table) is processed. This is the raw graph containing billions of links, stating simply that “Page A has a link to Target ID B”.
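
Reading these files does not require a database server. The sketch below is a minimal illustration, not the pipeline's actual code. It assumes the dumps follow the usual mysqldump layout, in which each data line is a single "INSERT INTO ... VALUES (...),(...);" statement. The function names parse_insert_line and iter_rows are hypothetical. Fields are returned as raw strings, and mapping field positions to column names is left to the CREATE TABLE statement at the top of each dump.

```python
import gzip
from typing import Iterator


def parse_insert_line(line: str) -> Iterator[tuple]:
    """Yield one tuple of raw string fields per row of an INSERT ... VALUES line."""
    start = line.find(" VALUES ")
    if not line.startswith("INSERT INTO") or start == -1:
        return  # schema statements, comments and blank lines carry no rows
    i = start + len(" VALUES ")
    n = len(line)
    while i < n and line[i] == "(":
        i += 1  # step past the opening parenthesis of the row
        fields = []
        while True:
            if line[i] == "'":
                # Quoted string: read up to the unescaped closing quote.
                i += 1
                buf = []
                while line[i] != "'":
                    if line[i] == "\\":
                        # Simplification: keep the escaped character literally
                        # (enough for \' and \\, not for \n or \t).
                        i += 1
                    buf.append(line[i])
                    i += 1
                i += 1  # step past the closing quote
                fields.append("".join(buf))
            else:
                # Bare token such as a number or NULL.
                j = i
                while line[j] not in ",)":
                    j += 1
                fields.append(line[i:j])
                i = j
            if line[i] == ")":
                i += 1
                break
            i += 1  # step past the comma between fields
        yield tuple(fields)
        if i < n and line[i] == ",":
            i += 1  # step past the comma between rows


def iter_rows(path: str) -> Iterator[tuple]:
    """Stream every row of a gzipped SQL dump without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield from parse_insert_line(line)
```

Streaming line by line keeps memory use bounded by the longest INSERT statement, so a file does not have to fit in memory as a whole.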

The Strict Counting Rules

Once the database files are merged, the final citation scores are calculated using a strict set of rules to ensure the integrity of the ranking:

  • True Articles Only: Only links originating from actual articles (Namespace 0) are counted. Links from talk pages, categories, or user profiles are entirely excluded. Furthermore, links originating from redirect pages are ignored.
  • Absolute Deduplication: Only unique links are considered. A single source article can provide a maximum of one citation to a target article. Even if an article links to a target multiple times throughout its text, or links to it through several different redirect aliases, the extra links do not inflate the citation score. The relationship is counted strictly once.
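
The first rule amounts to a filter over the page table. The following sketch shows one way to express it. The Page record and the eligible_sources function are hypothetical names introduced for illustration, with fields modelled on the page ID, namespace, title, and redirect flag described above.

```python
from typing import Iterable, NamedTuple


class Page(NamedTuple):
    """Hypothetical record holding the page-table fields used by the filter."""
    page_id: int
    namespace: int
    title: str
    is_redirect: bool


def eligible_sources(pages: Iterable[Page]) -> set[int]:
    """Return the IDs of pages whose outgoing links may be counted:
    namespace 0 (articles) and not a redirect."""
    return {p.page_id for p in pages if p.namespace == 0 and not p.is_redirect}


sample = [
    Page(10, 0, "United_States", False),  # real article: eligible
    Page(20, 0, "USA", True),             # redirect: its links are ignored
    Page(30, 1, "United_States", False),  # talk page (namespace 1): excluded
]
sources = eligible_sources(sample)  # {10}
```

A link row would then be kept only if its source page ID is in this set.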

The Challenge: Resolving Aliases and Ensuring Unique Citations

Target IDs Pointing to Redirects

One of the technical challenges in building this ranking is handling instances where a link target points directly to a redirect. A target ID can, and very often does, point to a redirect name.

When an editor writes [[USA]] in an article's wiki markup (a link to the page titled “USA”), the database assigns that link a target ID. However, “USA” is not a final article; it exists in the page table (Master Article Registry) as a redirect page (flagged as page_is_redirect = 1). If such links were counted naively, “USA” and “United States” would each receive a portion of the citations, fracturing the ranking.

The processing pipeline dynamically resolves this challenge:

  1. The link target ID is extracted from the Connectivity Graph.
  2. The Link Target Translator is queried to find the title behind that target ID, and the title is matched to its page ID.
  3. The Master Article Registry is checked to determine if this page ID is marked as a redirect.
  4. If it is, the script intercepts the link and consults the Redirect Resolver to find the final destination page ID.
  5. The citation credit is then properly forwarded to the main article (e.g., “United States”).
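
The steps above can be sketched with three in-memory lookup tables. This is an illustration under stated assumptions rather than the pipeline's actual code: the function name resolve_target, the dictionary layouts, and the max_hops guard against redirect chains or loops are all hypothetical.

```python
from typing import Optional

# Hypothetical lookup tables built from the dumps:
#   linktarget:    target ID        -> (namespace, title)
#   page_by_title: (namespace, title) -> (page ID, is_redirect)
#   redirect_to:   redirect page ID -> (namespace, title) of its destination
TitleKey = tuple[int, str]


def resolve_target(
    target_id: int,
    linktarget: dict[int, TitleKey],
    page_by_title: dict[TitleKey, tuple[int, bool]],
    redirect_to: dict[int, TitleKey],
    max_hops: int = 5,
) -> Optional[int]:
    """Return the page ID that finally receives the citation, or None."""
    key = linktarget.get(target_id)
    for _ in range(max_hops + 1):
        if key is None:
            return None  # unknown target ID or redirect without a destination
        entry = page_by_title.get(key)
        if entry is None:
            return None  # the title has no page (a "red link")
        page_id, is_redirect = entry
        if not is_redirect:
            return page_id  # a real page: credit goes here
        key = redirect_to.get(page_id)  # follow the redirect one hop
    return None  # chain too long or circular


linktarget = {1: (0, "USA"), 2: (0, "United_States")}
page_by_title = {(0, "USA"): (20, True), (0, "United_States"): (10, False)}
redirect_to = {20: (0, "United_States")}

via_alias = resolve_target(1, linktarget, page_by_title, redirect_to)  # 10
direct = resolve_target(2, linktarget, page_by_title, redirect_to)     # 10
```

Both the alias and the direct link resolve to the same page ID, which is what allows the later counting step to treat them as one target.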

Ensuring Unique Citations (Deduplication)

Another important aspect of the methodology is the strict requirement for deduplication: only one unique link from a given source article to a target article is counted. If a single article contains multiple links to the same destination, it does not artificially inflate the citation score.

Furthermore, because redirects are completely resolved before the final counting phase, this deduplication automatically applies to aliases. For example, if an article contains a link to “USA” and another link to “United States” further down the page, the pipeline resolves “USA” to “United States”. It then recognizes both links as pointing to the identical target entity, removing the duplicate. The final result is exactly one citation credited to “United States” from the analyzed article.
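
A toy version of this counting step is shown below. It is a sketch only: count_citations is a hypothetical name, article titles stand in for numeric page IDs, and a plain Python set is used for deduplication, which would not be practical for the full graph without sorting or partitioning the pairs first.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Optional


def count_citations(
    links: Iterable[tuple[Hashable, Hashable]],
    resolve: Callable[[Hashable], Optional[Hashable]],
) -> Counter:
    """Count incoming links per target, at most once per (source, target) pair.

    links   -- (source, raw target) pairs as they appear in the link table
    resolve -- maps a raw target to its final article, or None if unresolvable
    """
    unique_pairs = set()
    for source, raw_target in links:
        final_target = resolve(raw_target)
        if final_target is not None:
            # Resolving before inserting makes aliases collapse into one pair.
            unique_pairs.add((source, final_target))
    return Counter(target for _, target in unique_pairs)


# Toy data: "France" links to the same article twice, once through an alias.
aliases = {"USA": "United States", "United States": "United States"}
links = [("France", "USA"), ("France", "United States"), ("Germany", "USA")]

scores = count_citations(links, aliases.get)
# scores["United States"] is 2: one citation from France, one from Germany
```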

The Advantages of the SQL Approach

The rationale for processing raw SQL databases instead of developing a program to extract wikitext and parse it (via regular expressions) is threefold.

First, Wikipedia text is highly irregular – links are frequently embedded inside complex templates, infoboxes, or formatting macros (which generate the massive number of ISBN and DOI links seen at the top of the ranking). By using the pre-compiled SQL dumps, the analysis works with exactly the links that Wikipedia’s own servers parsed, which improves accuracy.
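
The gap can be illustrated with a deliberately naive extractor. The regular expression and the sample sentence below are made up for illustration, and the sketch assumes that a citation-style template such as {{ISBN|...}} is expanded by MediaWiki into a link to the ISBN article, which is the behaviour described above.

```python
import re

# Naive pattern: the title part of an explicit [[Title]] or [[Title|label]] link.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)")


def extract_wikilinks(wikitext: str) -> list[str]:
    """Return the titles of links written explicitly in raw wikitext."""
    return [title.strip() for title in WIKILINK_RE.findall(wikitext)]


sample = "The book was printed in [[France]] ({{ISBN|978-3-16-148410-0}})."
found = extract_wikilinks(sample)
# found == ['France']; the link produced by the ISBN template is not seen,
# because it only exists after the template has been expanded.
```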

Second, a naive text-scraping method would split a topic’s popularity across its aliases and likely double-count multiple links on the same page. Strictly integrating the redirect table and enforcing unique pair counting consolidates the data, ensuring the final metrics accurately reflect an article’s true structural significance.

Third, attempting to collect this volume of data via web scraping (HTTP requests) would be computationally inefficient, require months of processing time, and place an unnecessary burden on Wikipedia’s servers. Using the SQL dumps allows the entire graph of 1.63 billion links to be processed and analyzed offline with high efficiency.

Templates, Infoboxes, and Automatically Generated Links

A fair question is whether rankings like this are heavily influenced by templates, infoboxes, citation templates, and other automatically generated elements. Indeed, a substantial portion of Wikipedia’s internal link structure originates from such mechanisms.

The main difficulty lies in deciding where the boundary should be drawn. If only links directly visible in raw wikitext were counted, many important connections would be missed, including links produced by inline templates, links generated by infoboxes, and links inserted through Wikidata-driven modules.

It is also worth noting that some links originally written manually in wikitext may later have been replaced by templates for convenience, consistency, or standardization. As a result, distinguishing between “manual” and “template-generated” links is not always straightforward.

A separate ranking based exclusively on article-body (wikitext) links would certainly be interesting. However, such an approach would require many additional methodological decisions regarding what should and should not be counted: for example, whether links located in the “References” section, navigation boxes, maintenance templates, or metadata-generated elements should be included.

For this reason, the present analysis relies on the fully processed internal link graph recorded in Wikipedia’s SQL database dumps, reflecting the actual connectivity structure generated by MediaWiki itself.