Idiosyncrasies of Data Source Systems

Chayan Shrang Raj
6 min read · Nov 12, 2023


How is a data element born?

The job of a data engineer is to bring data into existence, do something with it, make it useful, and serve the people who want to play with it.

In psychology, you may have heard about the effects of childhood trauma in shaping the behavior of adults. Well, data also has a life, and I think you know where I am going with this: how data is generated has a big impact on how it traverses the data engineering world.

To understand how data will be treated throughout its lifecycle, it is important to understand its childhood: where it was born, how it is generated, its environment, its characteristics, and its quirks.

Data Engineering Lifecycle (Image credits: Author)

Data is an unorganized, context-less collection of facts and figures. Its sources can be many things, both analog and digital. Data is everywhere in the world around us; each day, humans and machines together produce exabytes of data.

Analog data arises everywhere around us: vocal speech, pre-computer handwritten records, writing on paper, readings from IoT temperature sensors, and much more…

Digital data has even more sources: it is either created by converting analog data to digital form or is the native product of a digital system. Scrolling through Netflix? That’s all data. Paying for groceries with a mobile wallet or credit card creates data. Placing an order on an e-commerce website? That’s data too.

Understanding source systems and how they generate data is a key step in creating efficient and robust data pipelines. Think of it as bricklaying: if you do not understand the ins and outs of the brick (different architectures require different types of bricks), your foundation may not be as strong as intended. And that’s not good.

Data Sources (Non-Exhaustive)

  • Files and Unstructured Data — A file is a series of bytes, usually saved on a storage disk, and applications frequently use files to store various types of data such as local parameters, events, logs, images, and audio. In data engineering, you’ll frequently encounter source files that originate either through manual input or as outputs from system processes. Common formats include Excel, CSV, TXT, JSON, and XML, each with its own characteristics. Files may be structured (e.g., Excel, CSV), semi-structured (JSON, XML), or unstructured (TXT, PNG), each presenting its own set of nuances, as sketched in the example below.
Files and unstructured data (Image credits: Author)
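To make these nuances concrete, here is a minimal Python sketch of reading each flavor of file; the file names and sample contents are invented purely for illustration.

```python
import json
import pandas as pd

# Structured: a CSV with a fixed, tabular schema (tiny sample written inline).
with open("orders.csv", "w") as f:
    f.write("order_id,amount\n1,59.90\n2,12.50\n")
orders = pd.read_csv("orders.csv")

# Semi-structured: JSON records may carry nested or optional fields.
with open("events.json", "w") as f:
    json.dump([{"user": "alice", "meta": {"device": "ios"}}, {"user": "bob"}], f)
with open("events.json") as f:
    events = json.load(f)

# Unstructured: raw bytes (images, audio, free text); no schema to infer.
image_bytes = b"\x89PNG..."  # stand-in for the contents of a PNG file

print(orders.dtypes)          # schema inferred from headers and values
print(events[1].get("meta"))  # missing fields are normal in semi-structured data
print(len(image_bytes), "opaque bytes")
```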
  • Application Databases (OLTP Systems) — An application database is designed to store the current state of an application; a common example is a system that maintains account balances for bank accounts. This type of database, often categorized as an online transaction processing (OLTP) system, efficiently handles a constant flow of individual data record reads and writes. While commonly known as transactional databases, it’s important to note that not all OLTP systems necessarily support atomic transactions. Support for atomic transactions is one of the critical database characteristics widely known as the ACID (Atomicity, Consistency, Isolation, Durability) properties. Atomicity means that a transaction either completes in its entirety or not at all, with no partial writes. Consistency ensures that any database read will yield the most recently written version of the requested item. Isolation guarantees that if two updates are simultaneously in progress for the same entity, the final database state will match the sequential execution of those updates in the order of their submission. Durability signifies that once data is committed, it remains intact even in the event of a power loss. A minimal transaction sketch follows the figure below.
OLTP RDBMS System (Image credits: Link)
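As a rough sketch of what atomicity buys you, here is a toy bank-transfer transaction using Python’s built-in sqlite3 module; the table, account names, and amounts are invented for illustration, and a production OLTP system would of course use a server database rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 'bob'")
        # If anything above raises, neither UPDATE is applied (atomicity).
except sqlite3.Error:
    pass  # the transaction was rolled back; balances are unchanged

print(conn.execute("SELECT id, balance FROM accounts").fetchall())
```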
  • Application programming interfaces (APIs) — APIs serve as a standardized means for systems to exchange data. They are sets of rules and protocols that enable different software applications to communicate with each other, defining the methods and data formats that applications can use to request and exchange information. APIs are crucial for integrating diverse systems and allowing them to work together seamlessly. Data engineers commonly work with various types of APIs, depending on their specific tasks and the nature of the data they are handling. Some types of APIs relevant to data engineering include the following (a small request example follows the list):
API workflow (Image credits: Link)
  1. Web APIs (RESTful APIs): These are commonly used in web development and data engineering. RESTful APIs (Representational State Transfer) follow a set of principles for creating scalable and stateless services. For example, the Twitter API, which allows developers to access and interact with Twitter data programmatically, and the GitHub API for managing and retrieving information from GitHub repositories.
  2. Database APIs: Data engineers often interact with APIs provided by databases to perform tasks such as data extraction, transformation, and loading (ETL). For instance, the SQL API for Microsoft SQL Server allows engineers to query and manipulate data in SQL Server databases.
  3. Streaming APIs: In cases where real-time data is crucial, streaming APIs are used. For example, Twitter’s streaming API provides a continuous flow of tweets in real time, allowing data engineers to capture and process live data streams.
  4. GraphQL APIs: This query language for APIs allows clients to request only the data they need. The GitHub GraphQL API (Link) is an example, providing a more flexible and efficient way to interact with the GitHub platform compared to traditional RESTful APIs.
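As a small example of the RESTful case, the sketch below calls GitHub’s public REST API to list repositories for the demo account octocat; any user name would work, and the two fields printed are just a sample of what the endpoint returns.

```python
import requests

# List public repositories for a GitHub user via the REST API.
response = requests.get(
    "https://api.github.com/users/octocat/repos",
    headers={"Accept": "application/vnd.github+json"},
    timeout=10,
)
response.raise_for_status()

for repo in response.json():
    print(repo["name"], repo.get("stargazers_count", 0))
```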
  • Logs — A log serves to record information pertaining to occurrences within systems. It can document various events such as web server traffic, usage patterns, and operating system activities, including application launches and crashes. Logs constitute a valuable data source for subsequent analysis, machine learning, and automation. Common log origins encompass operating systems, applications, servers, containers, networks, and IoT devices. Regardless of the source, logs uniformly capture event details and metadata, focusing on the “who,” “what,” and “when” of an event:
  1. Who: Identifies the human, system, or service account linked to the event, such as a web browser user agent or a user ID.
  2. What happened: Encompasses the event itself along with relevant metadata.
  3. When: Specifies the timestamp indicating when the event occurred.
Database Logs (Image credits: Link)

Logs come in various resolutions and levels, with log resolution indicating the extent of event data captured. For instance, database logs provide sufficient information to reconstruct the database state at any given moment. In contrast, big data system logs may not capture every data change but rather note specific commit events. Log levels determine the conditions under which a log entry is recorded, such as errors or debugging information. Software configurations often allow for logging every event or only specific conditions, like errors. Log latency is categorized into batch or real-time processing, where batch logs are continuously written to a file, and real-time applications may utilize messaging systems like Kafka or Pulsar for immediate log entries.
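As a rough sketch of how an application might emit such “who/what/when” entries, here is a small Python logging snippet; the service name, field names, and events are purely illustrative rather than any standard log format.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout-service")

def log_event(who: str, what: str, **metadata) -> None:
    entry = {
        "who": who,                                      # user, system, or service account
        "what": what,                                    # the event itself
        "when": datetime.now(timezone.utc).isoformat(),  # when it occurred
        **metadata,                                      # any extra event metadata
    }
    logger.info(json.dumps(entry))

log_event("user_42", "order_placed", order_id="A-1001", amount=59.90)
log_event("payment-worker", "payment_failed", order_id="A-1001")
```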

  • Message Queues and Streams — In the realm of event-driven architecture, two terms that are commonly used interchangeably are “message queue” and “streaming platform.” However, it’s crucial to recognize a subtle yet significant distinction between the two, as they encompass key concepts related to source systems, practices, and technologies throughout the data engineering lifecycle.
Simplified streaming data model (Image credits: Link)

A “message” refers to raw data transmitted between two or more systems. Typically, a message is routed through a message queue from a publisher to a consumer, and once delivered, it is removed from the queue. Messages in an event-driven system are discrete and singular signals.

On the other hand, a “stream” refers to an append-only log of event records. Events accumulate in an ordered sequence over time, with a timestamp or ID determining their order. Streams are employed when there’s a need to analyze a series of events.
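To make the distinction tangible, here is a toy, in-memory Python sketch (no real broker involved): the queue hands each message to a consumer once and then drops it, while the stream is an append-only log that consumers read by offset and can replay.

```python
from collections import deque

# Message queue: each message is delivered once and then removed.
queue = deque(["signup:alice", "signup:bob"])
while queue:
    message = queue.popleft()          # consumed once, then gone
    print("queue delivered:", message)

# Stream: events accumulate in order; consumers track their own offset.
stream = []                            # append-only log of event records
stream.append({"offset": 0, "event": "page_view", "user": "alice"})
stream.append({"offset": 1, "event": "add_to_cart", "user": "alice"})

consumer_offset = 0
for record in stream[consumer_offset:]:
    print("stream read:", record)
# The records remain in `stream` and can be re-read from any offset.
```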

There are many other data sources that are not covered here. Maybe in the next post… Thanks!

Github: https://github.com/chayansraj


Written by Chayan Shrang Raj

Learning by writing. Writing by experiencing. Experiencing by taking action. GitHub - https://github.com/chayansraj
