Wikidata: Summing the sum of human knowledge

A lowdown on Wikidata, a free, structured database built on MediaWiki, which is currently taking in data from Wikipedias in different languages.

Have you ever wished you could query Wikipedia with a question? Have you ever wanted to run a search on Wikipedia to extract data rather than trawl through reams of Wikipedia pages and interlinks? Have you ever imagined a world where everyone could freely pool knowledge even without sharing a common language? Wikidata is an attempt to do all this, and more.


Wikidata, deployed in October this year, is the newest project by the Wikimedia Foundation, the organisation that runs Wikipedia and its sister projects. It is a semantic, multilingual, structured database that anyone can read and edit. It is machine readable, which makes searching, sorting and handling data efficient and fast. As with its siblings, Wikidata’s content and software are published under a free license.


Central repository of links

Wikidata aims to streamline Wikipedia’s rich but chaotic linking structure. Every Wikipedia page has ‘interwiki links’ that link it to other Wikipedia pages. ‘Interlanguage links’ connect pages to other pages on the same topic published on Wikipedias in different languages, creating a mesh of links. With around 23 million articles in 285 languages, the linking structure leads to immense duplication. Wikidata intends to act as a central repository for all language links on all Wikipedias, with the ultimate objective of levelling off the raw data in the language versions. Under the first phase of Wikidata deployment, which is now live, editors can add interlanguage links (sitelinks) on Wikidata pages.
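A little arithmetic shows the scale of the duplication. The sketch below assumes the extreme case of a topic that has an article in every one of the 285 language editions. It compares the number of links needed when each article links to every other version with the number needed when each version is listed once on a central page. The function names are invented for illustration and are not part of Wikidata.

```python
def links_without_hub(languages: int) -> int:
    # Each language version carries a link to every other version.
    return languages * (languages - 1)


def links_with_hub(languages: int) -> int:
    # A central page lists each language version exactly once.
    return languages


n = 285  # number of Wikipedia language editions cited in the article
print(links_without_hub(n))  # 80940
print(links_with_hub(n))     # 285
```

Under that assumption, a single topic needs 80,940 links when every version points to every other, against 285 entries on one central page.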


The layout of a Wikidata page showing links to Wikipedias in various languages on the topic The Elder Scrolls IV: Oblivion (Image credit: Wikimedia Commons/ Sven Manguard)



Ask Wikipedia a question

The future phases of Wikidata deployment involve bringing together the data found in Infoboxes (December 2012 or January 2013) and enabling the creation of lists on Wikipedia (March or April 2013).


Infoboxes -- tables that summarise the most important information on a page -- are not uniform across Wikipedias. Internationalised export and import of the data held in Wikipedia templates such as the Infobox would greatly enrich Wikipedia’s encyclopaedic value. The information in all Infoboxes will then sit in one central repository, from where an update can reach every language version at the same time.


Wikipedia houses lists on assorted topics created by users -- episodes of the most popular TV shows, heads of state, and even inventors killed by their own inventions. However, it is currently not possible to query these lists and cull out specific information. For example, searching the string “countries with highest GDP” leads to a search results page with the entries ‘List of countries with the highest GDP per capita’ and ‘Historical list of ten largest countries by GDP’. But if you were to search for ‘ten countries with highest GDP in Asia’ you wouldn’t quickly reach the relevant information even though it exists in these lists. Wikidata will automatically create a user-queried list to answer semantic queries such as, “What were the ten countries with highest GDP in Asia last year?”
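Such a question becomes answerable once the facts are stored as structured records rather than as prose. The sketch below is only an illustration of the idea, not Wikidata's query mechanism: the record layout, the function name and all the figures are made up, and the country names are placeholders.

```python
from dataclasses import dataclass


@dataclass
class CountryRecord:
    name: str
    continent: str
    gdp: float  # placeholder figure in arbitrary units


# Made-up records for illustration only; these are not real GDP figures.
records = [
    CountryRecord("Country A", "Asia", 500.0),
    CountryRecord("Country B", "Europe", 900.0),
    CountryRecord("Country C", "Asia", 750.0),
    CountryRecord("Country D", "Asia", 120.0),
]


def top_by_gdp(records, continent, limit=10):
    # Keep only the countries on the requested continent,
    # then rank them from highest to lowest GDP.
    in_continent = [r for r in records if r.continent == continent]
    ranked = sorted(in_continent, key=lambda r: r.gdp, reverse=True)
    return [r.name for r in ranked[:limit]]


print(top_by_gdp(records, "Asia", limit=2))  # ['Country C', 'Country A']
```

The same records could answer a different question, such as the top countries in Europe, without anyone compiling a new list by hand.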


Konarak Ratnakar, who has registered over 8,000 edits on Wikidata, says, “I found that Wikidata had very few India-related entries. So I started adding information on Indian cities and the capitals of small countries. I like Infoboxes very much, and the fact that this project would make Infoboxes grow motivated me to contribute.”


Supporting smaller Wikipedias

Of the 285 Wikipedias, only four have more than a million articles. Forty Wikipedias have over 100,000 articles. Big Wikipedias have legions of editors; most Wikipedias with a smaller article base have fewer speakers of the language or less volunteer manpower. Wikidata aspires to enable smaller Wikipedias to extend their tendrils to bigger ones, and benefit from the vast amounts of language-independent data present on them. Having up-to-date data at their disposal could draw in and retain editors and contributors on smaller Wikipedias, where researching and writing voluminous content from scratch may be a laborious task. The web and Wikipedia would get more content, which, in turn, would be useful to those who speak only the languages in which smaller Wikipedias exist.


Verifiability and difference of opinion

Wikidata can handle differences of opinion. If different sources give discrepant or conflicting information, Wikidata can record each of the differing values along with its nuances. Each value will be accompanied by its source, in keeping with verifiability, one of Wikipedia’s cornerstones. Lydia Pintscher of Wikimedia Deutschland, who handles community communications for Wikidata, explains, “Say you have the number of inhabitants of Israel. People don't agree on an exact number for that for various reasons. In Wikidata it will be possible to add all of them (within reason) with sources. So it'll not be about what is right but about what some source says.”
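One way to picture this is as a property that holds a list of values, each tagged with the source that reports it. The sketch below is an assumption about how such data could be modelled, not Wikidata's actual data model: the class names, the sources and the figures are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SourcedValue:
    value: int
    source: str  # who reports this value


@dataclass
class Property:
    name: str
    values: list = field(default_factory=list)

    def add(self, value: int, source: str) -> None:
        # Conflicting values are kept side by side; none overwrites another.
        self.values.append(SourcedValue(value, source))


# Hypothetical figures for an unnamed place, for illustration only.
population = Property("number of inhabitants")
population.add(1_000_000, "Hypothetical source A")
population.add(1_050_000, "Hypothetical source B")

for v in population.values:
    print(f"{v.value} according to {v.source}")
```

Nothing in this model decides which figure is right; it only records what each source says, which is the point Pintscher makes.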


Pushing the envelope at Wikipedias and the web

Once Wikidata is fully deployed, and if all goes as planned, it will allow for more new articles to be created with fewer contributions. Editors will not need to write full articles on topics that already exist on other Wikipedias. Instead, they will need to enter data at the specified data points to get an article with the preliminary framework ready. Wikidata will also enable micro-contributions to articles that already exist by allowing easy data entry for editors to expand or update articles using the available datasets.


Machine-readable data makes it possible to automate content updates that are currently done manually on Wikipedia, such as census figures or election results. Wikipedia lists, too, will be created and updated automatically.


Editing on Wikidata is done through a form-based editor rather than wiki syntax, the mark-up used on Wikipedia, which new editors generally find tedious.


Once Wikidata is fully operational, it will act as an interchange point for more Wikimedia projects, such as Wikimedia Commons, a media repository. Data on Wikidata is released under the Creative Commons CC0 public domain dedication, which frees it of copyright restrictions. Because the Wikidata API is freely available and the software is openly licensed, you could run your own instance. The availability of interwiki data and an API opens up many possibilities -- data extraction, data mash-ups, using interwiki metadata for regional or localised Wikipedias, and expanding the scope of apps such as the Interwiki Redirect Service. Pintscher says, “In the near future we'll see the roll-out of phase 2 of Wikidata and then the adaption of that on the Wikipedias. With time I hope that more and more players outside Wikimedia will make use of Wikidata and build great things on top of it. I'm sure they'll come up with many things we haven't even thought of yet.”


Wikidata has the potential to be a repository of repositories where the sum total of human knowledge coalesces, a database that glues together Wikipedias of all 285 (and counting) languages.


Disclosure: The author is a member of the Wikimedia India Chapter and a Wikipedia editor.


Cover image: Wikimedia Commons
