Unification in Academic Literature

Libraries store the energy that fuels the imagination. They open up windows to the world and inspire us to explore and achieve, and contribute to improving our quality of life. - Sidney Sheldon

Chaos is always present. Whether in the weather or in crowds, things tend toward disarray. The same can be said for data storage and retrieval systems.

The concept of data is odd. In the end, data is stored as a sequence of 0s and 1s in a serializable format. What makes it special is that this format can be shaped into anything we want. We have the power to build our own realities inside these complex logic gates. Objects are created at will through an approach known as “object-oriented programming.” However, the greatest advancement has come in the form of data storage.

Libraries are nothing new. They date back millennia. These strongholds of knowledge have been commonplace since words could be put to paper. Libraries also tended to be magnets for civilizations: wherever there was a library, people followed. While they were a beacon of hope for some, they were a beacon of destruction for others.

Times have changed. Information is now disseminated at the speed of light. Google has proven to be an invaluable system for retrieving and categorizing material. Yet even with all of this at our fingertips, there always seems to be an element of chaos to it.

For example, some things are a given. Information is always written by someone: the author. In addition, fields such as a title, date written, and date published will almost always be found in the “metadata” of information. Knowing this, we can treat a piece of information as a sort of object that carries these fields. Once we become more domain specific, this system breaks down.
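As a rough illustration of treating information as an object, here is a minimal Python sketch. The class name `Document`, its field names, and the `extra` dictionary are assumptions made for this example, not part of any existing standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Document:
    """Generic metadata that almost any piece of written information carries."""
    title: str
    authors: list[str]
    date_written: Optional[date] = None
    date_published: Optional[date] = None
    # Domain-specific fields have no agreed-upon home, so in this sketch they
    # land in a free-form dictionary. This is where consistency starts to erode.
    extra: dict[str, str] = field(default_factory=dict)


# Placeholder values, purely for illustration.
article = Document(
    title="An Example Article",
    authors=["A. Author"],
    date_published=date(2020, 1, 1),
    extra={"peer_reviewed": "no"},
)
```

The generic fields are easy to agree on. Anything domain specific ends up in `extra`, where two systems are free to use different keys for the same idea.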

Academic literature is a special case of a system that mostly works. What sets academia apart from mainstream information is an extra element of scrutiny known as “peer review.” Because peer review is relatively slow and researchers need to publish quickly, the concept of a preprint was born. A preprint is an academic paper that hasn’t undergone peer review yet. Publisso provides a more precise description:

The Sherpa Romeo database makes the following distinction: preprints are all the versions of an academic article or other publication before it has been submitted for peer review, while the postprint is the form of the article after all the peer review changes are in place.

When one thinks of a preprint, arXiv usually comes to mind as one of the largest repositories for these articles. This is where the chaos starts.

arXiv has its own system for identifying articles, known as the “arXiv ID.” Every preprint uploaded there has an ID attached to it for as long as it remains relevant. However, if that article moves out of preprint, it receives a DOI as well. That’s not the end of it. A preprint may also be uploaded to various other sites, accumulating more identifiers along the way. Things such as an eid (Elsevier ID), ISSN, or eISSN will slowly bloat the article’s metadata. What if a researcher specifically wanted to search by eISSN, but an article didn’t have one? Inconsistencies like this pop up as we increase the number of systems we use to catalog information.
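The eISSN problem can be sketched in a few lines of Python. Everything here is hypothetical: the records, the identifier values (made-up placeholders), and the helper names `find_by` and `coverage` exist only for this example.

```python
# Hypothetical records; every identifier value is a made-up placeholder.
records = [
    {
        "title": "Paper A",
        "arxiv_id": "0000.00001",
        "doi": "10.0000/example.a",
        "eissn": "0000-0001",
    },
    # Paper B only ever lived on a preprint server, so it has no DOI or eISSN.
    {"title": "Paper B", "arxiv_id": "0000.00002"},
]


def find_by(records: list[dict], scheme: str, value: str) -> list[dict]:
    """Return the records whose identifier under `scheme` equals `value`."""
    return [r for r in records if r.get(scheme) == value]


def coverage(records: list[dict], scheme: str) -> float:
    """Fraction of records that carry an identifier under `scheme` at all."""
    return sum(1 for r in records if scheme in r) / len(records)


# A search by eISSN can never surface Paper B, whatever value is queried.
by_eissn = find_by(records, "eissn", "0000-0001")
eissn_coverage = coverage(records, "eissn")
```

In this toy catalog only half of the records can be reached through an eISSN search, even though every record has an arXiv ID. Each new identifier scheme adds another partial view of the same collection.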

These problems are why we need a unified system for cataloging information. Left alone, small inconsistencies snowball into larger and larger problems. One great example of a comprehensive repository is the CORD-19 initiative, led by researchers at different institutions.

The Covid-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on Covid-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers.

However, what makes CORD-19 stand out is its generation of “harmonized and deduplicated metadata as well as structured full text parses of paper documents as output.” This makes searching and indexing easy for researchers and enables a phenomenon known as data-based discovery. With this organized database, researchers were able to methodically glean information from the data and make rapid scientific progress. However, much of this happened under the threat of a global pandemic, which spurred the need for such a system and initiative. Looking at the success of this approach, we could achieve so much more with a unified system in disciplines beyond COVID-19 research.
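To make “harmonized and deduplicated” a little more concrete, here is a minimal Python sketch of the general idea. It is not the CORD-19 pipeline. The function names, the choice to key on a normalized DOI with a normalized title as fallback, and the sample records are all assumptions made for illustration.

```python
import re


def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi)


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def dedup_key(record: dict) -> str:
    """Prefer the DOI as the identity of a paper; fall back to its title."""
    if record.get("doi"):
        return "doi:" + normalize_doi(record["doi"])
    return "title:" + normalize_title(record["title"])


def harmonize(records: list[dict]) -> list[dict]:
    """Merge records that share a key; the first non-empty value wins."""
    merged: dict[str, dict] = {}
    for record in records:
        entry = merged.setdefault(dedup_key(record), {})
        for name, value in record.items():
            if not value:
                continue  # skip empty fields so a later source can fill them
            if name == "doi":
                value = normalize_doi(value)
            entry.setdefault(name, value)
    return list(merged.values())


# Two hypothetical listings of the same paper from different sites.
sources = [
    {"title": "Some Paper", "doi": "https://doi.org/10.0000/ABC", "source": "site-one"},
    {"title": "Some paper.", "doi": "10.0000/abc", "arxiv_id": "0000.00001"},
]
unified = harmonize(sources)
```

After harmonizing, the two listings collapse into a single record that carries both the DOI and the arXiv ID. One known gap in this sketch: a listing without a DOI will not merge with a listing of the same paper that has one, which is the kind of case a real pipeline has to handle with more care.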

Building on the success of the CORD-19 initiative, future researchers can adopt its methodology and direction to construct similar corpora for other fields, not only COVID-19. While some traditions are tried and true and may not change until the next generation of researchers takes over, we can learn from current trends to build a better future not only for academia, but for information as a whole.

External Resources and References

  1. Publisso: A difference between a preprint and postprint
  2. The CORD-19 initiative