A research paper on the automatic identification of reliable sources of information about companies in multilingual Wikipedia has been published on the IEEE website. The information source assessment models presented in the research work can help Internet users find valuable sources of information about companies using open data from Wikipedia, Wikidata and DBpedia.
First, references were identified in each of the considered Wikipedia language versions. For example, the English Wikipedia contained about 70.3 million references (including 52 million unique ones), and the German Wikipedia about 12.7 million (including 10.1 million unique ones). Then Wikipedia articles about companies from over 40 different language versions were selected using semantic knowledge bases such as DBpedia and Wikidata. From these articles, sources of information were selected and assessed on the basis of the five models described in the paper.
Wikidata
The semantic knowledge base Wikidata works in a similar way to Wikipedia, with one notable difference: here facts about subjects are entered as property and value statements rather than natural language sentences. Each Wikidata item contains a collection of statements arranged in the form “Subject-Predicate-Object” (in the context of Wikidata: “Item-Property-Value”). For example, information about the Apple Inc. company can be found on a separate page in Wikidata:
Within the above page, we can find statements that are described using different properties. For example, the following statements link Apple Inc. through the P31 property (“instance of”) to other objects (the object identifier is given in parentheses):
- Apple Inc. – instance of – enterprise (Q6881511)
- Apple Inc. – instance of – business (Q4830453)
- Apple Inc. – instance of – public company (Q891723)
- (and others…)
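As a minimal sketch, the “instance of” values can be read from an item once it is available in Wikidata's JSON entity format, where statements are grouped under `claims` by property identifier. The function name `instance_of_ids` and the hand-written `sample_entity` fragment are illustrative assumptions, not part of the published research; the fragment contains only the three statements listed above.

```python
from typing import Any


def instance_of_ids(entity: dict[str, Any]) -> list[str]:
    """Return the Q-identifiers stored under P31 ("instance of") for one entity."""
    ids = []
    for statement in entity.get("claims", {}).get("P31", []):
        snak = statement.get("mainsnak", {})
        # Snaks of type "novalue" or "somevalue" carry no datavalue, so skip them.
        if snak.get("snaktype") != "value":
            continue
        ids.append(snak["datavalue"]["value"]["id"])
    return ids


# Hand-written fragment (assumption) mirroring the three statements listed above.
sample_entity = {
    "claims": {
        "P31": [
            {"mainsnak": {"snaktype": "value",
                          "datavalue": {"value": {"id": "Q6881511"}}}},
            {"mainsnak": {"snaktype": "value",
                          "datavalue": {"value": {"id": "Q4830453"}}}},
            {"mainsnak": {"snaktype": "value",
                          "datavalue": {"value": {"id": "Q891723"}}}},
        ]
    }
}

print(instance_of_ids(sample_entity))  # ['Q6881511', 'Q4830453', 'Q891723']
```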
Wikidata is also considered to be the central data management platform for Wikipedia and most of its sister projects. This means that using Wikidata we can find links to Wikipedia articles in different languages describing the same object. Thus, having a list of Wikidata items of a certain type (e.g. companies), we can also find corresponding Wikipedia article names.
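The mapping from a Wikidata item to Wikipedia article names can be sketched as follows, assuming the item is available in Wikidata's JSON entity format, where links to Wikimedia projects are stored under `sitelinks`. The function name `wikipedia_titles`, the exclusion set and the sample titles are illustrative assumptions; the exclusion set is not guaranteed to be complete.

```python
# Site keys that end in "wiki" but are not Wikipedia language versions
# (assumption: this list is illustrative, not exhaustive).
NON_WIKIPEDIA = {"commonswiki", "specieswiki", "metawiki",
                 "wikidatawiki", "mediawikiwiki", "sourceswiki"}


def wikipedia_titles(entity: dict) -> dict[str, str]:
    """Map a language code to the article title for each Wikipedia sitelink."""
    titles = {}
    for site, link in entity.get("sitelinks", {}).items():
        # Wikipedia keys look like "enwiki"; sister projects use other
        # suffixes such as "wikiquote" or "wikinews".
        if site.endswith("wiki") and site not in NON_WIKIPEDIA:
            titles[site[: -len("wiki")]] = link["title"]
    return titles


# Hand-written sample (assumption); titles are placeholders for illustration.
sample_entity = {
    "sitelinks": {
        "enwiki": {"site": "enwiki", "title": "Apple Inc."},
        "dewiki": {"site": "dewiki", "title": "Apple"},
        "commonswiki": {"site": "commonswiki", "title": "Apple Inc."},
        "enwikiquote": {"site": "enwikiquote", "title": "Apple Inc."},
    }
}

print(wikipedia_titles(sample_entity))  # {'en': 'Apple Inc.', 'de': 'Apple'}
```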
Currently, Wikidata has over 100 million items (described objects), while the number of Wikipedia articles in all language versions is around 60 million. This means that not every Wikidata item corresponds to a separate Wikipedia article on a specific topic.
If we leave only those Wikidata items that are linked to at least one Wikipedia article, the most frequently used values under the P31 property (“instance of”) can be represented as the following value cloud (own calculations in 2022):
The following values were excluded from the illustration above: Q4167410 (“Wikimedia disambiguation page”), Q13406463 (“Wikimedia list article”), Q22808320 (“Wikimedia human name disambiguation page”), Q18340514 (“events in a specific year or time period”).
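The filtering and counting described above can be sketched roughly as follows. This is not necessarily how the original calculations were performed: the input format (pairs of P31 values and a count of linked Wikipedia articles), the function name `instance_of_frequencies` and the toy data are assumptions made for illustration. Only the four excluded identifiers come from the text.

```python
from collections import Counter

# Values excluded from the illustration, as listed above.
EXCLUDED = {"Q4167410", "Q13406463", "Q22808320", "Q18340514"}


def instance_of_frequencies(items):
    """Count P31 values over items linked to at least one Wikipedia article.

    `items` is an iterable of (p31_ids, wikipedia_link_count) pairs
    (a hypothetical, pre-extracted representation of Wikidata items).
    """
    counts = Counter()
    for p31_ids, wikipedia_link_count in items:
        if wikipedia_link_count < 1:
            continue  # item has no Wikipedia article in any language
        # Use a set so a repeated value on one item is counted once.
        counts.update(q for q in set(p31_ids) if q not in EXCLUDED)
    return counts


# Toy data (assumption), not real statistics.
toy_items = [
    (["Q4830453", "Q891723"], 12),
    (["Q4830453"], 3),
    (["Q4167410"], 5),   # disambiguation page: excluded value
    (["Q6881511"], 0),   # no Wikipedia article: skipped
]

counts = instance_of_frequencies(toy_items)
print(counts.most_common())  # [('Q4830453', 2), ('Q891723', 1)]
```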
DBpedia
The semantic knowledge base DBpedia is automatically enriched using structured information from Wikipedia articles in different languages. The acquired knowledge on a given topic is available on a separate page. For example, semantic data about Apple Inc. as a DBpedia resource, extracted from the English Wikipedia, can be found at:
On such DBpedia pages, among the various properties, we can also find information about the type(s) of the described object. For our example, DBpedia indicates that the object belongs to classes such as dbo:Organisation and dbo:Company, among others. Having the names of the classes we are interested in, we can find all objects of a certain type within DBpedia.
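One way to list objects of a given class is a SPARQL query against DBpedia. The sketch below only builds the query text and a request URL, without sending anything. It assumes the public endpoint at `https://dbpedia.org/sparql` and its `query` and `format` request parameters; the helper names `class_members_query` and `request_url` are hypothetical.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint (assumption: still served at this address).
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def class_members_query(class_name: str, limit: int = 10) -> str:
    """Build a SPARQL query selecting resources typed with a DBpedia ontology class."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?resource WHERE {\n"
        f"  ?resource a dbo:{class_name} .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def request_url(query: str) -> str:
    """Encode the query as a GET request URL asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)


query = class_members_query("Company", limit=5)
print(query)
print(request_url(query))
```

The resulting URL can be fetched with any HTTP client. The `LIMIT` clause keeps the sketch small; retrieving all objects of a class would require paging through the results.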
The most commonly used classes from the DBpedia ontology are shown in the following figure (own calculations in 2022):
The research results were presented at the FedCSIS 2022 conference. The scientific publication can be found on the websites of IEEE and ACSIS.