The header image was purchased on Istock.com and is protected by copyright.
Nowadays, data is one of the most valuable assets in the world. Huge amounts of data are produced and compiled every day of our lives. This ocean of data and information should be used the right way to optimize the factors that impact our daily lives.
To benefit from data or treat it as an asset, we must first collect it, analyze it, and adapt it to specific needs or requirements. In the past, this need was met by the concept of Data Mining. Data mining is the practice of examining large databases in order to generate new information, which is then used to increase efficiency or solve complex problems.
Over time, this concept evolved into what is now called Data Science. Data science has several definitions. Generally, it is a multidisciplinary field that combines knowledge of data, mathematics (statistics), algorithms, and technology to propose solutions to complex situations and problems. The Data Science journal explains this concept as follows:
“Data science means almost everything that has something to do with data: collecting, analyzing, modeling… yet, the most important part is its applications—all kinds of applications.”
In other words, the objective of data science is not to build complicated models or produce an amazing visual, and it is not limited to reading code. It is a science that creates impact or added value in different ways by using data for our benefit. The figure below illustrates the kind of knowledge required to be a data scientist, data analyst, or data engineer.
Data Science Lifecycle
The data science lifecycle has been described in several ways. Depending on the source, it is divided into five, six, or seven phases. Here, we consider the complete version, which contains seven stages.
- Business Understanding: the idea behind this stage is to identify the needs and requirements of the system and the factors likely to influence the project and its products. The ultimate goal should be determined in this phase.
- Data Collection: simply put, this is where the data is gathered. The important thing at this stage is to collect the data related to the factors specified in the previous stage.
- Data Preparation: also called data cleaning, this means increasing data quality for the next level of analysis. Inconsistencies, misspelled attributes, and missing or duplicate values must be eliminated.
- Exploratory Data Analysis: this step aims to find patterns in the collected data. In other words, it defines and refines the selection of features and variables that will be used in model development. This is the most important step in the data science lifecycle because all the modeling and analysis will be based on it.
- Modeling: once prepared, the dataset needs to be modeled with one of the available techniques, for instance, a machine learning technique (KNN, Decision Tree, Naive Bayes).
- Model Evaluation: each proposed model should be evaluated to validate its performance. This evaluation allows the data scientist to choose the model that best fits the business requirements.
- Model Deployment: after the model is validated, a deployment plan is developed. The result may be a project summary, a tool, or a dashboard used on a regular basis.
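To make a few of these stages more concrete, here is a minimal sketch in plain Python of data preparation, modeling with KNN, and model evaluation. Everything in it is an assumption made for illustration: the toy records, the held-out test points, the function names (`prepare`, `knn_predict`, `accuracy`), and the choice of mean imputation and k = 3. A real project would more likely rely on a dedicated library and a much larger dataset.

```python
import math
from collections import Counter

# Hypothetical toy records: (hours_studied, hours_slept, label).
# None marks a missing value.
RAW = [
    (1.0, 4.0, "fail"),
    (1.5, 5.0, "fail"),
    (2.0, None, "fail"),
    (1.0, 4.0, "fail"),   # exact duplicate of the first record
    (6.0, 7.0, "pass"),
    (7.0, 8.0, "pass"),
    (8.0, 7.5, "pass"),
    (None, 8.0, "pass"),
]

# Hypothetical held-out records used only for evaluation.
TEST = [(1.2, 4.5, "fail"), (7.5, 8.0, "pass")]


def prepare(rows):
    """Data preparation: drop exact duplicates, then fill missing
    feature values with the mean of their column."""
    unique = list(dict.fromkeys(rows))  # keeps the first occurrence, in order
    n_features = len(unique[0]) - 1
    means = []
    for j in range(n_features):
        present = [r[j] for r in unique if r[j] is not None]
        means.append(sum(present) / len(present))
    return [
        tuple(means[j] if r[j] is None else r[j] for j in range(n_features))
        + (r[-1],)
        for r in unique
    ]


def knn_predict(train, point, k=3):
    """Modeling: return the majority label among the k training
    records closest to `point` (Euclidean distance)."""
    nearest = sorted(train, key=lambda r: math.dist(r[:-1], point))[:k]
    return Counter(r[-1] for r in nearest).most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Model evaluation: share of held-out records predicted correctly."""
    hits = sum(knn_predict(train, r[:-1], k) == r[-1] for r in test)
    return hits / len(test)


clean = prepare(RAW)
print(len(RAW), "raw records ->", len(clean), "clean records")
print("accuracy on held-out records:", accuracy(clean, TEST))
```

The same evaluation function could be reused to compare different values of `k`, or different models, before choosing the one that fits the business requirements best.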
The Importance of Data Science
Data science can be used for different types of requirements in our daily lives. For instance, below are some of the domains using data science:
- Genomic data provide a deeper understanding of genetic issues.
- Logistics companies like DHL or FedEx use data to discover the best shipping times and routes.
- Human resource managers can predict employee attrition and understand the variables that influence employee turnover.
- Airline companies can easily predict flight delays and notify passengers.
Now, the question is: why should you become a data scientist? According to Glassdoor’s 2018 rankings, data scientist was named the best job in the United States for three years in a row (2016, 2017, and 2018). Also, since the amount of data is increasing daily, demand for this position is likely to increase, offering incredible opportunities for those in the field!