Data Engineering is an interdisciplinary profession requiring a mix of technical and business knowledge to have the most impact. When starting a career in data engineering, it is not always clear what is needed to be successful. Some people believe you need to learn particular technologies (e.g., Big Data); others believe you need a high degree of software engineering expertise; others believe the focus should be on the business.
There are five main tips I would give to data engineers starting their career:
1. Learn fast
2. Don’t succumb to the technology hype
3. Data Engineering is not all about coding or technology or data
4. Coding is still an important part of the job
5. Know your data and where it is coming from
Learning fast is essential to growing your career as a data engineer. There are many areas to cover — general data engineering, big data engineering, software engineering, machine learning engineering — domain knowledge to acquire, soft skills to master, different tools to grasp, and datasets to understand. Data engineering is also a domain where things evolve quickly, and a specific technology can quickly become outdated. It is crucial to keep learning new things in order not to be sidelined.
Multiple factors influence learning speed:
Within data engineering, there are different concepts to learn. They range from quite academic, such as the underlying concepts of probabilistic data structures, to quite practical, such as optimizing an ETL pipeline.
Every individual has a learning style that works best for them. For some, having a specific problem and going through documentation or notes is sufficient; others need more structure and explanation.
Depending on the preferred learning style, different companies and organizations might be more appropriate.
Some companies offer structured training programs for graduates. This is quite common, particularly in data consulting companies, where you get introduced to the basic principles of data processing.
Consulting companies are also good at providing structured training at later stages of individual careers by pushing for certifications such as “AWS Certified Solutions Architect.” They usually offer time, support, and incentives for preparing for these exams.
Some tech companies also offer structured training — although the focus there is much less on teaching the fundamentals or core concepts than on teaching the tools, frameworks, and ways of working used in the company.
For more academic types of learning, besides the possibility of picking up a part-time specialized master’s degree, MOOCs such as Coursera’s Machine Learning course provide a structured learning path for certain subjects, with specific deadlines. These MOOCs are sometimes grouped into bundles of courses, such as the Deep Learning Specialization.
In general, the more expertise is required, the less hand-holding is available, and the more you need to learn through unstructured training — either by being faced with challenging problems and figuring them out, or by deep-diving into how people have approached the issues before.
In practice, for a data engineer, this type of learning can take multiple forms, such as going through the documentation of a specific framework you intend to use, or deep-diving into open source code to understand what is truly going on under the hood.
Some of the best unstructured training comes from practical project work, such as trying to implement features that are not yet available, or going through machine learning articles to figure out how to implement a new algorithm.
The time you can dedicate to learning is another significant factor in the rate at which you can absorb and master new concepts and technologies. It is possible to increase that time, but past a certain point it will impact either your personal life or your performance at work.
Eight or more hours of your day are spent at work, so it is a high-leverage area if you can turn your daily work into a learning experience.
Learning experiences also differ in quality. Trying to understand a cryptic book will not be as effective as a nicely written one with a clear breakdown of the problems and solutions. The same goes for learning in the workplace. There are three main factors in the quality of the learning experience: People, Projects & Environment.
People: To optimize learning, it is essential to find places where you can learn from people who know more than you. If you are offered the chance to join a startup as their first engineer or data engineer early on, you will have the opportunity to touch a bit of everything. But you will also, very likely, not get the same grasp of the technologies that you would in an environment with more knowledgeable peers or managers.
Projects: It is vital to have a stream of projects that stretch you. If you keep doing the same thing and specialize, say, in one area, you will end up with a very narrow focus. Becoming an expert at a single thing is hard and slow; it is often more effective to aim for good mastery of a subject and, once you reach that level, to aim for breadth. This applies to both the technical and the business components.
Environment: The speed of the organization will have a significant impact on your learning speed. If you work in a very bureaucratic environment, you will not have the time to do much or to have a considerable impact. You will end up focusing on internal organizational aspects rather than spending time applying and learning your craft.
If, on top of that, the organization has a heavily bureaucratic process for deploying code to production, you will not be able to learn from your mistakes as fast as in a more agile environment. For this reason, it is good to avoid large non-digital corporations and instead prefer more dynamic environments such as tech or e-commerce companies. Startups can be fine as well, but it is worth being careful: they might not give you the right “things” to learn early in your career, especially if the right “support system” isn’t there.
It is worth switching projects and environments every few years. There is a lot to learn from how things are done differently in another department or organization. Your position might also be different, giving you new opportunities, new experiences, and new people to learn from.
Besides its purpose as a gatekeeper against introducing bugs into the codebase, code/peer review is also an excellent tool for learning from others. It can teach you which approaches are more efficient, best practices in code design, and new, alternative implementations. To a certain extent, code/peer review forces and facilitates knowledge sharing across the team.
It can also expose you to other domains, products, or features, widen your domain/business knowledge, and expose you to different datasets. It is also one way to get a more holistic picture of how the company works.
Code review also teaches you to write readable code that others can adopt. It helps you understand what should be commented and what can flow directly from the code, and how to structure the code so that it can be easily understood.
As a data engineer, there is a need to work with different types of technologies: cloud, infrastructure, big data, streaming, … And as with everything in technology, there is a lot of hype. Most of these technologies were built to address specific problems and have since been hyped as the cure to everyone’s problems.
When I worked at Facebook, sometimes a simple query would take more than a day to run. Query failures could happen at any point in the process, with nodes becoming overloaded. The reasons for failure were sometimes under the control of the data engineers and sometimes due to infrastructure issues. The larger the data to process and the more nodes used, the larger the chance of failure.
Working with Big data allows you to get some perspective on how to work in a distributed setting. It is something that data engineers should be exposed to at some point to be able to grow in their careers. However, it isn’t the only skill that a modern data engineer needs, and always working with Big data might slow down learning in other areas due to a lower iteration speed.
Plain and simple, technology changes, and often the fanciest tools are only really beneficial in certain situations. I started out using MySQL/Oracle, which was slowly replaced by Redshift. I moved on to Hive, then Presto, then Spark. And after having picked up Spark, I went back to using databases like Postgres and MSSQL.
For most purposes, traditional databases can still do the job. It was only in 2014 that Amazon, for instance, started replacing its internal data warehouse built on Oracle. Back then, Amazon was an $89B digital business; chances are that you are working with smaller datasets than Amazon had at the time. Improvements in computer hardware, database performance improvements (e.g., see the PostgreSQL 13 release notes), and distributed MPP features such as the Hyperscale (Citus) extension keep relational databases relevant.
It is important to select the right tool for the job, not necessarily the fanciest tool. It is also important to be, or have been, in an environment where you can pick up some of the fundamental principles needed for the next step in your career. New tools will come along, but the principles won’t change.
The role of a data engineer requires strong communication. It is not just about coding in a specific area, but also about gathering requirements and understanding the business logic to implement, all of which takes time and a lot of communication (and back and forth).
It is also important to inform the different data users about changes in the data model or in the reports created, and to communicate how the various data structures can be used (e.g., through sample queries).
There also needs to be communication with software engineers to source some of the data, be it through instrumenting additional logic, getting database extracts, or getting direct access to the data structures used.
A decent amount of communication happens in writing. It is used to discuss/document the requirements, implementation details, or generally communicate a plan. Data engineers need to learn to write and communicate in written form.
Getting to know the business and gaining domain knowledge is one of the critical factors to success as a data engineer. It helps you better understand what is genuinely required, better anticipate the requirements, and prepare your data structures to accommodate these changes.
Getting this domain knowledge also helps data engineers understand what is expected from the data and detect if there could be data anomalies.
An integral component of the role of a data engineer is still coding. Coding is still one of the best ways to define and abstract the different transformations that need to be computed on datasets. It also empowers data engineers to take on tasks that have typically been backend- or DevOps-related.
Data engineers need to know how to code. Data engineering is progressively becoming a specialized subset of backend engineering and DevOps. These days, it is not just about leveraging SQL (PL/SQL) or GUI tools, but about coding full pipelines and infrastructure as code.
If you are in an organization that only asks or allows you to leverage SQL and GUI tools, it is worth considering getting some experience elsewhere to progress your career.
Besides running queries on databases or big data environments, data engineering now starts to encompass API integration, exposing datasets through APIs, leveraging stream processing, and putting machine learning models into production. These days, more and more data engineering frameworks require programming skills: Airflow requires Python knowledge; dbt leverages Jinja templates; queries can be executed through processing engines such as Pandas, Dask, or Spark.
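To make that concrete, here is a minimal sketch of what “a pipeline as code” can look like with Airflow 2.x. The DAG name and the extract_orders/load_orders tasks are hypothetical placeholders, not taken from a real system:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_orders():
    # e.g., pull yesterday's orders from a source system (placeholder)
    print("extracting orders")

def load_orders():
    # e.g., write the cleaned orders into the warehouse (placeholder)
    print("loading orders")

with DAG(
    dag_id="orders_pipeline",          # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_orders)
    load = PythonOperator(task_id="load", python_callable=load_orders)
    extract >> load                    # plain Python defines the dependency graph
```

The point is not this specific DAG, but that the pipeline, its schedule, and its dependencies are all expressed in ordinary Python rather than in a GUI.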
These programming skills are also needed to best exploit features such as infrastructure scaling (cloud), scraping, or machine learning use cases.
Besides being able to write code at all, a data engineer needs to write efficient code. As a data engineer, you need to understand how your code will scale and how it can be improved. Not having this visibility might lead you to spend resources on premature optimization, or to run unoptimized code that ends up costing significantly more because you need to scale up your hardware resources.
Within this area, there are two important things for a data engineer to master: 1) an understanding of the complexity of their code, and 2) an understanding of what optimizations are available within their processing platform.
Complexity: Getting a sense of the complexity of the different operations gives an understanding of how the code will scale as the amount of data grows.
In a data engineering context, it is important to understand the complexity of the code both in its naive form and in its optimized form. Take, for example, a query performing a self-join (a.user_id = b.user_id). Naively, the join would be executed as a nested loop, and the time complexity of the query would be O(N²). Add an index, and the join can become a hash join: the complexity is now reading O(N) rows plus performing O(N) hash lookups. As can be seen, there can be significant differences in performance based on the actual implementation, and a data engineer must keep both perspectives in mind.
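A rough illustration of those two execution strategies, written in plain Python over hypothetical row dictionaries rather than SQL:

```python
from collections import defaultdict

rows = [{"user_id": 1, "event": "view"},
        {"user_id": 1, "event": "click"},
        {"user_id": 2, "event": "view"}]

# Naive nested-loop self-join on user_id: O(N^2) comparisons.
def nested_loop_self_join(rows):
    return [(a, b) for a in rows for b in rows if a["user_id"] == b["user_id"]]

# Hash join: O(N) to build the index, then O(1) expected lookups per probe row.
def hash_self_join(rows):
    index = defaultdict(list)
    for b in rows:
        index[b["user_id"]].append(b)
    return [(a, b) for a in rows for b in index[a["user_id"]]]

# Both strategies return the same matches; only the cost differs.
assert sorted(map(str, nested_loop_self_join(rows))) == sorted(map(str, hash_self_join(rows)))
```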
From a data engineering perspective, both time and space complexity are important. In the realm of big data, memory is often a bottleneck. Knowing what can be done to perform queries in a memory-efficient way, for example by leveraging probabilistic data structures, can have a massive impact on performance.
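As one example of the idea, a minimal Bloom filter (sketched below with arbitrary sizing) answers “have I probably seen this key?” in a small, fixed amount of memory, at the cost of occasional false positives:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: "probably seen" / "definitely not seen"
    using a fixed-size bit array instead of keeping every key in memory."""

    def __init__(self, size_bits=1_000_000, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests of the key.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely never added"; True means "probably added".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

seen = BloomFilter()
seen.add("user_42")
print(seen.might_contain("user_42"))   # True
print(seen.might_contain("user_999"))  # almost certainly False
```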
Optimizations available: There are a significant number of optimizations that a data engineer can apply, either to the transformations or to the data structures. They can involve setting up partitions and buckets, setting up the right indices, having the data stored in the right file format, or writing pipelines that are distributed efficiently across nodes.
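As a small illustration of one of these levers, partitioning a dataset when writing it out lets downstream queries skip irrelevant files. The sketch below assumes pandas with the pyarrow engine installed, and the events table and its columns are made up for illustration:

```python
import pandas as pd

# Hypothetical events table; partitioning on event_date means a query that
# filters on a single day only has to scan that day's files.
events = pd.DataFrame({
    "event_date": ["2021-03-01", "2021-03-01", "2021-03-02"],
    "user_id": [1, 2, 3],
    "amount": [9.99, 4.50, 20.00],
})

# partition_cols writes one sub-directory per distinct event_date value.
events.to_parquet("events_parquet", partition_cols=["event_date"])
```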
There is a strong need to balance structure and flexibility in data engineering, and it is important to understand the tradeoffs between specificity and flexibility that you make as a data engineer. These tradeoffs impact the overall performance of the system as well as the implementation and maintenance effort. Understanding the right level of abstraction and specificity requires quite a lot of experience and domain knowledge, which makes it particularly crucial for bringing about the most impact.
As a data engineer, it is essential to understand your data and how it is generated. This allows you to structure the different transformations in the most appropriate way and avoid many of the pitfalls that come from data quality issues.
There can be a lot of data quality pitfalls in the data being used — for instance, customers needing to be deduplicated, or traffic coming from bots. Data engineers should learn how the data is set up and help find a pragmatic way forward. Getting an in-depth understanding of your dataset allows you to find the right way to make sense of the data and turn it into information.
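A tiny, hypothetical illustration of the deduplication pitfall: if the key is not normalised first, “duplicates” slip through unnoticed.

```python
import pandas as pd

customers = pd.DataFrame({
    "email": ["a@x.com", " A@X.com", "b@y.com"],
    "channel": ["web", "mobile", "web"],
})

# Normalise the key before deduplicating; otherwise " A@X.com" and "a@x.com"
# survive as two different customers.
customers["email_norm"] = customers["email"].str.strip().str.lower()
deduped = customers.drop_duplicates(subset="email_norm")
print(len(deduped))  # 2 customers, not 3
```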
There are a lot of pitfalls that can arise from the setup of your source systems. For instance, the way dates are implemented can vary significantly. Systems using a UNIX timestamp commonly output 1970–01–01 as the default date (UNIX time = 0), while systems using an MSSQL server as their database, such as Salesforce Marketing Cloud, might set their default to 1753–01–01. Other issues with dates can arise from how the data is set up in the source system and exported, such as a UTC date rather than a local timezone, requiring a conversion on the date.
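A small, illustrative sketch of guarding against such sentinel defaults and the UTC assumption (the clean_event_time helper is made up for illustration):

```python
from datetime import datetime, timezone

# Sentinel "default" dates usually mean "value missing", not a real event time.
UNIX_EPOCH_DEFAULT = datetime(1970, 1, 1)
MSSQL_MIN_DEFAULT = datetime(1753, 1, 1)

def clean_event_time(raw: datetime):
    """Map known sentinel defaults to None and attach the assumed UTC timezone."""
    if raw.tzinfo is None:
        raw = raw.replace(tzinfo=timezone.utc)  # assume source exports in UTC
    if raw.replace(tzinfo=None) in (UNIX_EPOCH_DEFAULT, MSSQL_MIN_DEFAULT):
        return None
    return raw

print(clean_event_time(datetime(1970, 1, 1)))         # None, not a real date
print(clean_event_time(datetime(2021, 3, 1, 12, 0)))  # 2021-03-01 12:00:00+00:00
```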
Distributed systems, in turn, bring their own set of challenges, such as consistency (be it eventual or partial) and dealing with out-of-order events.
It can be quite complex to navigate one’s early career as a data engineer. It is not about learning a particular technology, although getting exposure to big data or distributed systems can be helpful, but rather about building up the fundamentals across the different components that make up data engineering.
It is particularly important to learn how to adapt and learn independently, as the more you progress, the less structured teaching will be available to you.
It is also worth remembering that data engineering is not just about leveraging technologies, but about collaborating with other functions. Softer skills and experience can be incredibly helpful in that area.