“Big Data” is a broad field that spans at least five disciplines and three job titles. While data science, data engineering, and data analytics are certainly overlapping career paths, there are essential differences between them. Here’s how they differ.
When you think of today’s data industry, affectionately called “Big Data,” it’s easy to lump all “data people” into the generic term “data scientist.” But in reality, many related disciplines are necessary to solve Big Data problems at the enterprise level.
Setting aside database administrators (often called DBAs) for the time being leaves data analysts, data engineers, and data scientists. While human resources people don’t always know how these related roles in a company differ, they are pretty different in terms of day-to-day responsibilities and experience.
What’s the difference between a data analyst and a data engineer?
Data analysts typically work in “data warehousing,” using tools like Snowflake, Amazon Redshift, and Google BigQuery. They are generally responsible for moving structured data, neatly organized in accounting systems, into high-performance data warehouses and “data marts” that specific teams use to create analytical reports and business intelligence (BI) reports.
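To make the warehouse-to-report flow concrete, here is a minimal sketch in Python. It is only an illustration: an in-memory SQLite database stands in for a real warehouse such as Snowflake or BigQuery, and the table names (orders, sales_mart) and the sample rows are invented for this example. The team-specific summary table plays the role of what is often called a data mart.

```python
import sqlite3

# An in-memory SQLite database stands in for a real data warehouse here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EMEA", 120.0), (2, "EMEA", 80.0), (3, "APAC", 250.0), (4, "AMER", 50.0)],
)

# A team-specific summary table: narrow and pre-aggregated from the warehouse.
conn.execute(
    """
    CREATE TABLE sales_mart AS
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    """
)

# The BI report is then a simple query against the summary table.
report = conn.execute(
    "SELECT region, order_count, revenue FROM sales_mart ORDER BY revenue DESC"
).fetchall()
for region, order_count, revenue in report:
    print(f"{region}: {order_count} orders, {revenue:.2f} revenue")
```

In a real warehouse the same idea would usually be expressed as scheduled SQL against much larger tables, but the shape of the work (load structured data, aggregate it for one audience, report on it) is the same.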
Data engineers typically handle “data engineering” and “event streaming” projects. The role of a data engineer is conceptually similar to that of a data analyst, but the main difference is that a data engineer is more likely to specialize in working with semi-structured, unstructured, and streaming data (such as from real-time events) than a “pure” data analyst.
To work with data that may contain duplicate or incomplete records, data engineers commonly use tools such as Airflow, dbt, Fivetran, or Airbyte to extract, transform, and load (ETL) data. (In fact, many data engineers now prefer to load data before transforming it, which is known as the ELT process.) These complex processes are often partly manual and can involve data lakes and streaming data mechanisms – software such as Apache Spark, Kafka, and Amazon Kinesis.
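The difference between ETL and ELT is mostly about where the cleanup happens. The sketch below uses only Python's standard library rather than any of the tools named above; the event format, the field names, and the functions transform, run_etl, and run_elt are all hypothetical and exist only to show the ordering of the steps.

```python
import json

# Hypothetical raw events, as they might arrive from a stream or an export file.
RAW_EVENTS = [
    '{"event_id": "a1", "user": "ana", "amount": 10.0}',
    '{"event_id": "a1", "user": "ana", "amount": 10.0}',  # duplicate record
    '{"event_id": "b2", "user": "bo"}',                    # incomplete: no amount
    '{"event_id": "c3", "user": "cy", "amount": 7.5}',
]

REQUIRED_FIELDS = ("event_id", "user", "amount")


def transform(records):
    """Drop incomplete records and keep the first copy of each event_id."""
    seen = set()
    clean = []
    for record in records:
        if any(record.get(field) is None for field in REQUIRED_FIELDS):
            continue
        if record["event_id"] in seen:
            continue
        seen.add(record["event_id"])
        clean.append(record)
    return clean


def run_etl(raw_lines):
    """ETL: extract, transform in flight, then load only the cleaned rows."""
    extracted = [json.loads(line) for line in raw_lines]
    return transform(extracted)


def run_elt(raw_lines):
    """ELT: extract, load everything as-is, then transform inside the target."""
    raw_table = [json.loads(line) for line in raw_lines]  # loaded untouched
    clean_table = transform(raw_table)                     # derived afterwards
    return raw_table, clean_table


etl_rows = run_etl(RAW_EVENTS)
raw_rows, elt_rows = run_elt(RAW_EVENTS)
print(len(etl_rows), len(raw_rows), len(elt_rows))
```

Both paths end with the same cleaned rows. The practical difference is that the ELT path also keeps the untouched raw data in the target system, so the transformation can be revised and rerun later without extracting again.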
What is the difference between a data scientist and a data engineer?
“Data science” and “machine learning” (ML) are the last two data-related disciplines we’ll look at, and these projects tend to be done by people with titles like “data scientist.” Data scientists, like data engineers, are often used to working with all types of data – so data scientists may use the same data lakes and data preparation tools as data engineers. However, a data scientist typically transforms data with the ultimate goal of solving data science or ML problems. In contrast, data engineers are stereotypically more interested in creating repeatable engineering processes to support other parts of their organizations.
Compared to data analysts, who may deal with creating many one-off reports for business intelligence and competitive analysis, data scientists tend to seek statistical inference (to prove or disprove a hypothesis) or help create ML applications (e.g., image recognition with ML). This means that data scientists like to use software such as scikit-learn, TensorFlow, or PyTorch for their data science and ML work. These frameworks tend to be more specialized for data science or ML work than the corresponding data engineering tools, which may not support, for example, ML model selection, training, and evaluation.
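As a small illustration of the statistical inference side of this work, the following sketch runs a two-sided permutation test using only Python's standard library. The two groups of measurements are made up, and the function name permutation_p_value is hypothetical; a data scientist would more likely reach for a dedicated statistics or ML library in practice.

```python
import random
from statistics import mean

# Hypothetical measurements for two groups (e.g. a metric before and after a change).
control = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
treatment = [11.0, 11.4, 10.9, 11.2, 11.1, 11.3]


def permutation_p_value(a, b, n_permutations=5_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(b) - mean(a))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them into two pseudo-groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[len(a):]) - mean(pooled[:len(a)]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


p_value = permutation_p_value(control, treatment)
print(f"p-value: {p_value:.4f}")
```

The hypothesis being tested is that both groups come from the same distribution. A small p-value means that a difference as large as the observed one rarely appears when the group labels are shuffled at random, which counts as evidence against that hypothesis.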
Meanwhile, data engineers typically take data from data warehouses, data marts, and analytical reports, convert that data into various formats, and then hand it off to data scientists or data analysts. They are more likely to get their hands dirty with software setup and configuration as part of complex data development projects that can take months. Creating analytics within a product for a software-as-a-service (SaaS) company is an example of a project that typically requires a team of data engineers. In these types of projects, data scientists are somewhat less likely to be involved unless statistical analysis or ML features are needed.
The differences between data analysts, data engineers, and data scientists
We’ve seen that these three career paths in Big Data are related and overlap in many ways, but the main differences between data engineers, data scientists, and data analysts come down to two things: 1) the typical problems they are trying to solve, and 2) their choice of tools to do so.
A data analyst is likely to be associated with “business intelligence” (BI) problems, which means they are tasked with generating actionable BI for the company. While they often use data engineering tools and probably know how to create data warehouses, data analysts in the organization are likely engaged in creating analytical reports for specific teams using data marts. They may be assigned to teams of business analysts or to particular functions of the organization (e.g., marketing) and may regularly report to executive management.
Meanwhile, a data engineer tends to be less focused on BI reporting and instead is responsible for cleaning up and processing complex data. They may use more “software” approaches (like software engineers) and are probably comfortable doing manual data extraction, loading, and transformation (ELT) activities. Data engineers are probably familiar with the difference between a data warehouse and a data lake, and they are often involved in platform-level initiatives related to event-driven architecture for real-time streaming analytics.
Last but certainly not least, data scientists are likely to be more involved in research, at least in terms of their formal training and education. Specialists in machine learning (ML) and statistical analysis are much more likely to use the title data scientist, although many work as statisticians (statistical analysts), information scientists, or ML engineers. Given that ML can theoretically be applied to almost any problem imaginable, data scientists are in very high demand as organizations try to optimize their business and provide value to customers. But they are typically not the ones providing BI up the chain to the CEO.
Conclusion
While the job descriptions for each data-related discipline are far from unambiguous, it is helpful to understand the similarities and differences between data science, data engineering, and data analytics.
In general, there is a continuum between statistical machine learning on one side – “pure” data science and ML – and ad hoc manual reports to support management decision-making on the other side – “pure” data analytics and BI. Data engineers are somewhere in the middle, and they are often deeply involved in software development and product architecture.
There are no hard and fast rules in Big Data, and data-related disciplines are changing faster than almost any other part of the technology space as the volume of data continues to grow. If you’re not quite sure what someone’s background is in data science, analytics, or engineering, just ask them about the types of projects they like to work on and the tools they prefer to use.
You can also ask if they prefer specific specialties (e.g., designing software flow architecture) or if they are generally comfortable working on a wide range of data-related projects. In the end, remember that Big Data job titles mean both a lot and a little; they can be useful to deepen your understanding, but they should not be used to box someone in.