Data Science Workflow: Understanding Key Terminologies
Chapter 1: Introduction to Data Science Terminology
Entering the world of data science can be daunting, especially due to the vast array of terminology one must learn to progress in the field. This terminology can be categorized into several segments: general vocabulary, specialized focus terms, tool-specific language, and workflow definitions.
For newcomers, the extensive list of terms can be both discouraging and perplexing, particularly when seeking specific information without knowing the correct terminology. I have been in that position before—feeling lost amidst a sea of jargon while trying to grasp various concepts. This prompted me to create this article, and potentially more in the future, to clarify these terms for those new to the field or contemplating a career in data science.
This piece will specifically address the terminologies associated with various steps in a data science workflow. While workflows may differ based on project specifics, several fundamental steps are common across most data science endeavors.
So, let’s dive in…
Section 1.1: Data Exploration
Data science fundamentally revolves around data—its collection, structure, and the narratives it conveys. Addressing questions like why the data was collected and what insights it offers is the initial and most crucial step in any data science project.
To uncover these insights, data scientists employ both manual and automated analytical techniques to deepen their understanding of the data. Data exploration often involves visualizing the data using various tools, which aids in identifying patterns and trends that facilitate more insightful analyses.
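Below is a minimal exploration sketch in Python using pandas and matplotlib. The file name "sales.csv" and the "revenue" column are hypothetical placeholders; the point is simply to show the kind of quick summaries and plots exploration usually starts with.

```python
# A minimal exploration sketch using pandas and matplotlib.
# The file name "sales.csv" and the "revenue" column are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")

# Quick structural overview: shape, types, missing values, summary statistics
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.describe())

# Visualize a numeric column to spot skew, outliers, and obvious patterns
df["revenue"].hist(bins=30)
plt.title("Revenue distribution")
plt.xlabel("Revenue")
plt.ylabel("Count")
plt.show()
```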
Section 1.2: Data Mining
Data mining refers to the process of organizing, analyzing, and interpreting raw data to uncover patterns and anomalies using mathematical and computational algorithms. This technique is essential for extracting actionable insights from datasets, enabling the development of useful applications.
To derive meaningful information from data, it is crucial that the data is cleaned, structured, and well-organized—elements that are central to the data mining process. Effective data mining plays a vital role in interpreting data, guiding decisions that may influence future data collection.
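As one concrete (and deliberately simple) example of pattern discovery, here is a sketch that clusters customers with scikit-learn's k-means. The input file and column names are assumptions made for illustration; any numeric features would work the same way.

```python
# A small pattern-discovery sketch: k-means clustering with scikit-learn.
# The file "customers.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")          # hypothetical input file
features = df[["age", "annual_spend"]]     # hypothetical numeric columns

# Scale features so no single column dominates the distance metric
scaled = StandardScaler().fit_transform(features)

# Group observations into 3 clusters and attach the labels for inspection
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(scaled)
print(df.groupby("cluster")[["age", "annual_spend"]].mean())
```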
Section 1.3: Data Pipelines
In any data science project, data must undergo a series of processing steps to yield valuable results. This series of procedures is known as a data pipeline. Each step in a pipeline produces output that serves as the input for the next step, starting from raw data and culminating in the final desired output.
Typically, data pipelines consist of three key components: sources, processing steps, and destinations. The arrangement of these components varies based on the specific application and the results being targeted.
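The toy pipeline below illustrates that structure: a source, a chain of processing steps where each output feeds the next step, and a destination. The file names, the cleaning rules, and the "category" column are all illustrative assumptions.

```python
# A toy pipeline: each step takes the previous step's output as its input.
# File names, cleaning rules, and the "category" column are assumptions.
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Source: read raw data from disk."""
    return pd.read_csv(path)

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Processing step: drop duplicates and rows with missing values."""
    return df.drop_duplicates().dropna()

def aggregate(df: pd.DataFrame) -> pd.DataFrame:
    """Processing step: summarize by a (hypothetical) 'category' column."""
    return df.groupby("category", as_index=False).sum(numeric_only=True)

def load(df: pd.DataFrame, path: str) -> None:
    """Destination: write the final result."""
    df.to_csv(path, index=False)

# Raw data in, final output out: each step's output feeds the next step.
load(aggregate(clean(extract("raw_data.csv"))), "summary.csv")
```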
Section 1.4: Data Wrangling
Data wrangling is a comprehensive term that encompasses the processes of collecting, selecting, and transforming data to address analytical queries. It is also referred to as data cleaning or data munging. The primary aim of data wrangling is to ensure consistency across all datasets.
Remarkably, data wrangling can consume up to 80% of the time allocated for a project, while modeling and data exploration account for the remaining 20%. When data scientists wrangle data, they typically transform it into one of four structures: an analytic base table (the most common), denormalized transactions, a time series, or a document library.
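The sketch below shows a typical wrangling task: standardizing formats across two (hypothetical) tables and combining them into an analytic base table with one row per customer. All file and column names are placeholders chosen for illustration.

```python
# A minimal wrangling sketch: combine two hypothetical tables into a
# single analytic base table with one row per customer.
import pandas as pd

customers = pd.read_csv("customers.csv")        # hypothetical source
transactions = pd.read_csv("transactions.csv")  # hypothetical source

# Standardize formats so the datasets are consistent before joining
customers["signup_date"] = pd.to_datetime(customers["signup_date"])
transactions["amount"] = pd.to_numeric(transactions["amount"], errors="coerce")

# Denormalized transactions -> one aggregated row per customer
per_customer = (
    transactions
    .dropna(subset=["amount"])
    .groupby("customer_id", as_index=False)
    .agg(total_spend=("amount", "sum"), n_orders=("amount", "count"))
)

# Analytic base table: one row per entity, ready for exploration or modeling
abt = customers.merge(per_customer, on="customer_id", how="left")
abt = abt.fillna({"total_spend": 0, "n_orders": 0})
print(abt.head())
```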
Section 1.5: ETL Process
The ETL process, short for Extract, Transform, and Load, consists of three sub-processes. Its purpose is to take data that is not yet ready for analysis and convert it into a form optimized for analytical use. Although these three steps are standard, how they are executed depends on the ETL tool being used.
Conducting ETL is critical for ensuring accurate analysis results. The quality of your analysis hinges on the input data, making ETL a vital step in structuring data for analysis and modeling.
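A bare-bones ETL sketch is shown below: extract raw records from a CSV file, transform them with pandas, and load the result into a local SQLite database. Every name here (the input file, the columns, the "warehouse.db" destination) is an assumption for illustration, not a prescribed setup.

```python
# A bare-bones ETL sketch: extract from a CSV file, transform with pandas,
# and load into a local SQLite database. All names are assumptions.
import sqlite3
import pandas as pd

# Extract: pull raw records from the source system (here, a CSV file)
raw = pd.read_csv("orders_raw.csv")

# Transform: enforce types, drop bad rows, derive an analysis-ready column
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
clean = raw.dropna(subset=["order_date", "amount"])
clean = clean.assign(order_month=clean["order_date"].dt.to_period("M").astype(str))

# Load: write the prepared table to the analytical destination
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```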
Section 1.6: Web Scraping
While the terms discussed so far describe procedures applied to data that has already been collected, web scraping is about acquiring that data in the first place. When scraping, data scientists or developers write scripts that automatically extract information on specific topics from websites for further analysis and modeling.
A variety of tools can facilitate data collection from the web, including BeautifulSoup and Scrapy in Python, and Cheerio for JavaScript.
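Here is a small scraping sketch using requests and BeautifulSoup. The URL and the CSS selector are placeholders, not a real target site; in practice, always check a site's terms of service and robots.txt before scraping it.

```python
# A small scraping sketch with requests and BeautifulSoup.
# The URL and the "h2.article-title" selector are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"   # hypothetical page
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every article headline on the page
headlines = [h2.get_text(strip=True) for h2 in soup.select("h2.article-title")]
print(headlines)
```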
Takeaways
Navigating the field of data science can be overwhelming for many, given the plethora of online resources and the extensive terminology required to engage fully and create impactful projects.
One valuable strategy that aided my comprehension of the field was developing a dictionary of essential terms for reference whenever I felt lost or confused. This article serves as a segment of that dictionary, and I plan to write additional articles focusing on specific topics, such as machine learning and statistical tools, to further assist those who are new or uncertain in this exciting field.