07 May 2014 | Research article | Information and Communications Technologies
Big Data and Genomics: What’s the relationship?
It’s hard to imagine a world without the Internet. We start the day by checking our emails, the weather, public transit, our favourite social networks, etc. In fact, we often spend the entire day online. At the same time, recent scientific advances in health research require more powerful computing capabilities and increased storage capacity on Internet servers.
In 2013, there were more than 2.2 billion Internet users worldwide. This massive usage generates tremendous volumes of information that have to be stored, analyzed and reused.
Several problems have arisen from this new lifestyle, including the misuse of new terminology or, in the best-case scenario, its use out of context. One such term is “Big Data.” We hear this term all the time in industrial and academic circles, especially in relation to health and more specifically genomics.
Big Data is usually described by three characteristics: volume, variety and velocity. Firstly, the word “volume” expresses the fact that Big Data deals with high-volume data sets that are hard to process using traditional information management tools, such as relational databases. But let’s start by defining what we mean by “high volume.” Data saved on discs or drives is usually measured in GB (gigabytes) or TB (terabytes): for example, 750 GB for a disc or 1 TB for a personal computer.
Big Data is also high variety, which means that the data is very complex. Traditional data is usually in the form of text and is thus easy to manage using today’s databases. It is also easy to convert into relational structures. The information that Big Data handles, by contrast, comes from a wide variety of sources. It can come from the Internet, be the result of an analysis procedure, or simply be texts, images or videos. The data can be public or private and organized by IP address, server or country. That’s why it can be difficult to process using traditional tools.
Finally, Big Data is high velocity, which means that data is delivered at a very fast rate. In other words, it is generated, captured and shared quickly. Big Data applications have to be able to process a data set before a new cycle of information is generated.
Classic relational databases are not able to manage the high volume, high variety and high velocity of information that characterize Big Data. New representation models can make better use of the information. For example, the Hadoop framework’s MapReduce programming model and HBase databases can be a good solution. In this system, the work is split up and distributed across different nodes, which process their shares in parallel. The results are then collected and combined. That’s the essence of the MapReduce model.
Those who adopt this model have to rely on systems with high horizontal scalability and on solutions based on NoSQL architecture, such as HBase.
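The split, process and collect flow described above can be illustrated with a minimal sketch in plain Python. It is not Hadoop code and does not use the Hadoop or HBase APIs: the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the toy input are assumptions made for illustration, and a thread pool merely stands in for the cluster's nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit one (key, 1) pair for every record in this chunk."""
    return [(chromosome, 1) for chromosome in chunk]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values under their key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def map_reduce(chunks, workers=2):
    # Each chunk stands in for the share of the data held by one node;
    # the thread pool only simulates the nodes working in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


# Hypothetical input: the chromosome on which each record was found,
# already split across two "nodes".
chunks = [["chr1", "chr2", "chr1"], ["chr2", "chr2", "chrX"]]
counts = map_reduce(chunks)
print(counts)  # {'chr1': 2, 'chr2': 3, 'chrX': 1}
```

In a real cluster the chunks would live on different machines and the framework would handle the shuffle, but the three phases keep the same roles.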
We will now explain the term “genomics,” keeping it short and simple. As part of modern biology, genomics is the branch of science that studies the way an organism, organ or illness (such as cancer) works at the genetic level. It’s not about looking at single genes, but rather at the interaction of several genes. A genome is the complete set of all the genetic material of an individual or a species. It is coded in DNA or, for some viruses, RNA. The genome comprises all the sequences of DNA. By way of analogy, the genome can be compared to a set of encyclopedias, with each volume being a chromosome (of which humans have 23 pairs).
Let’s look at an example: A research lab wishes to characterize secondary (metastatic) liver cancer from a genomic point of view. In other words, they want to find the genes and the gene interactions that cause the disease. To begin with, the human genome has an approximate size of 3.5 GB and is made up of approximately 22,000 genes. What is more, the differences between human genomes are very small, on the order of 0.1%. To effectively characterize this type of liver cancer, the lab has to collect several genomes from different types of people, including individuals with the disease, individuals without it, and all possible combinations of parents with or without a history of cancer. The entire five-step process is as follows:
- Each person’s genome is compared to those of the reference (or control) group. As previously mentioned, they differ by only about 0.1%.
- The differences between the analyzed genome and the reference genome—called variations or mutations—are then highlighted. These variations may be evidence of a problem (or a disease) and they must be saved on a drive.
- Next, the user of the application compares the variations across all patients. The goal is to determine which variations are common to all individuals affected by the disease and thus characterize the disease in question.
- Another comparison is then necessary, as it is possible that a person may carry the same variation found in the genome of the sick individuals, but may not have the disease themselves. It is very important to update the information and exclude patients who do not have the disease despite carrying the variation. The purpose of this step is to refine the results.
- Finally, once a variation has been detected, we must determine whether the variation is hereditary or not by examining the genome of the patient’s family.
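The five steps above can be sketched on toy data as follows. This is a deliberately simplified illustration, not the lab's actual application: genomes are represented as short, equal-length, pre-aligned strings, the function names are hypothetical, and step 4 is interpreted here as discarding variations that also appear in people without the disease, which is only one possible reading of the refinement step. Real genomic comparison relies on sequence alignment and dedicated variant-calling tools.

```python
def find_variations(genome, reference):
    """Steps 1-2: return the (position, base) pairs where the genome
    differs from the reference. Assumes pre-aligned, equal-length strings."""
    return {
        (pos, base)
        for pos, (base, ref) in enumerate(zip(genome, reference))
        if base != ref
    }


def candidate_variations(affected, unaffected, reference):
    """Steps 3-4: variations shared by all affected individuals,
    minus those also carried by unaffected individuals."""
    per_patient = [find_variations(g, reference) for g in affected]
    if not per_patient:
        return set()
    # Step 3: keep only variations common to every affected individual.
    shared = set.intersection(*per_patient)
    # Step 4 (one interpretation): drop variations that healthy people carry too.
    for genome in unaffected:
        shared -= find_variations(genome, reference)
    return shared


def is_hereditary(variation, family_genomes, reference):
    """Step 5: treat a variation as hereditary if any relative carries it."""
    return any(
        variation in find_variations(g, reference) for g in family_genomes
    )


# Toy data, invented for illustration only.
reference = "ACGTACGT"
affected = ["ACGTTCGA", "ACCTTCGA"]  # individuals with the disease
unaffected = ["ACGTACGA"]            # carries one variation, but is healthy
family = ["ACGTTCGT"]                # a relative of one patient

candidates = candidate_variations(affected, unaffected, reference)
print(candidates)  # {(4, 'T')}
print(is_hereditary((4, "T"), family, reference))  # True
```

Even in this toy form, each comparison touches every position of every genome, which hints at why the storage and processing requirements grow so quickly at full genome scale.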
Using this application, we can see how a relational database used to manage this type of information can easily surpass the terabyte level. In fact, the example above uses 4 TB of data. That’s enough to qualify this “volume” as Big Data. The “variety” criterion is also met, as the result must be expressed in several different formats. And finally, the “velocity” criterion is also a factor, as the results must be presented to researchers quickly.
Abraham Gómez is an IT researcher and software developer. His research topics cover artificial intelligence and Cloud Computing applications, with a focus on Big Data in genetic applications. He is currently completing a PhD at ÉTS.