Data & AI Lifecycle 101: Unpacking the Unique Stages, Tools, and Technologies
Over the past two years, the pressure to embed AI into applications has skyrocketed, putting data engineering and machine learning teams in the spotlight. For many tech-forward organizations, these teams have long supported business functions like reporting and analytics. But today, their roles have evolved into developing business-critical AI-powered applications—from traditional ML systems to generative AI and LLMs—that drive innovation and impact.
This renewed focus has also brought attention to the underlying lifecycle of processes, tools, and technologies these teams build and use to harness the power of AI within their applications.
This is the Data & AI Lifecycle.
Data & AI Lifecycle 101
The Data & AI Lifecycle is made up of new processes, tools (e.g., Jupyter notebooks, data pipelines, and MLOps tools), open source components, and runtime technologies (ML and GenAI models). It operates separately from the SDLC, managed by different teams in distinct workflows. Since the Data & AI Lifecycle is still part of the application, however, the responsibility for securing it falls to the same teams responsible for traditional software security: application and product security teams.
Over the last year, we have spoken with hundreds of application security leaders and practitioners and heard the same thing time and time again: The Data & AI Lifecycle feels like a black box.
That’s why we’re writing this post—to break down the Data & AI Lifecycle’s unique stages and tools, as well as the various components, technologies, and artifacts throughout.
Data & AI Lifecycle Stages
Similar to the SDLC, which is made up of several stages (i.e., design, development, build/deploy, and runtime), the Data & AI Lifecycle has distinct phases that have different processes, tools, gates, hand-offs, and outcomes:
1. Data preparation and curation
2. Model development, training, and analysis
3. Model deployment and serving
4. AI runtime operations

Data Preparation and Curation
One of the major differences between the software lifecycle and the Data & AI Lifecycle is the part data plays. ML/AI models depend on data to learn and perform effectively, underscoring the importance of relevant and accurate data.
This phase involves cleaning, enriching, transforming, and moving data to prepare it for the final goal of either analytics or AI model training. For GenAI specifically, this data might also be used for retrieval-augmented generation (RAG), a technique in which the LLM retrieves and uses the data during runtime.
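As a concrete illustration, the clean/enrich/transform steps described above might look like this minimal, standard-library Python sketch. The record fields and rules here are illustrative assumptions, not a real pipeline:

```python
# Illustrative raw records; in practice these come from internal systems
# or third-party sources, not an in-memory list.
raw_records = [
    {"user_id": "42", "country": "us", "spend": "19.99"},
    {"user_id": "42", "country": "us", "spend": "19.99"},  # exact duplicate
    {"user_id": "7", "country": None, "spend": "5.00"},    # missing field
]

def clean(records):
    seen, cleaned = set(), []
    for r in records:
        if None in r.values():            # drop incomplete rows
            continue
        key = tuple(sorted(r.items()))
        if key in seen:                   # deduplicate
            continue
        seen.add(key)
        cleaned.append({
            "user_id": int(r["user_id"]),        # cast types
            "country": r["country"].upper(),     # normalize values
            "spend": float(r["spend"]),
        })
    return cleaned

prepared = clean(raw_records)
```

Real pipelines do the same kinds of operations at scale with dedicated frameworks, but the shape of the work, dropping bad rows, deduplicating, and normalizing, is the same.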
Datasets may be sourced internally or from third parties such as Hugging Face. This is often one of the more time-consuming stages, requiring robust underlying infrastructure of data pipelines (which we’ll talk about more in the next section) and structured storage solutions, such as databases and data lakes.
Model Development, Training & Analysis
Once data is prepared in alignment with the model’s goal, development begins. This is a highly non-linear, iterative phase involving many cycles of development, training, analysis, and refinement based on performance. Data science teams build models, adjust parameters and data, experiment, and test the results, repeating the cycle until model performance meets the goals.
For GenAI and LLMs, this phase might include fine-tuning to maximize model performance on a more specific task. Teams may also incorporate open source models from ecosystems such as Hugging Face, or even embed third-party models such as those from OpenAI.
This stage relies not only on data pipelines but also on development environments called Notebooks (which we’ll also cover in the next section) that have ready access to data.
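The train/evaluate/refine cycle described above can be sketched with a deliberately tiny toy: fitting a one-parameter linear model by gradient descent until its error meets a target. All the numbers and the stopping threshold are illustrative; real teams run this loop with ML frameworks inside notebooks:

```python
# Toy dataset with a known relationship: y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def mse(w):
    """Evaluation step: mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w, lr, target = 0.0, 0.05, 1e-6
for epoch in range(1000):                 # many training cycles
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad                        # refinement step
    if mse(w) < target:                   # stop once performance meets the goal
        break
```

The loop structure, train, measure, refine, stop when a performance bar is met, is the same pattern data scientists iterate on at far larger scale.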
For GenAI models, another layer on top of the modeling phase we covered is the AI application architecture. If the application includes “agents,” meaning models that independently interact with their environment and take actions, the AI application architecture defines which tools and data are available to the agent.
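One way to picture that architectural decision is as an explicit tool registry: anything not registered is simply unavailable to the agent. The tools below are hypothetical stubs, and the "agent" is a stand-in that dispatches by name rather than a real model choosing for itself:

```python
def search_docs(query: str) -> str:
    return f"results for '{query}'"          # hypothetical internal data source

def get_weather(city: str) -> str:
    return f"weather in {city}: sunny"       # hypothetical external API

# The application architecture defines the allowed tool surface explicitly.
TOOLS = {"search_docs": search_docs, "get_weather": get_weather}

def run_agent(tool_name: str, argument: str) -> str:
    if tool_name not in TOOLS:               # deny anything not registered
        raise PermissionError(f"tool not allowed: {tool_name}")
    return TOOLS[tool_name](argument)

answer = run_agent("get_weather", "Lisbon")
```

Agent frameworks are more sophisticated, but the security-relevant idea is the same: the architecture, not the model, decides which tools and data an agent can reach.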
Model Deployment & Serving
Next, once the outcomes meet expectations, the model is ready to be deployed to production and integrated into existing applications.
Custom-built ML models require packaging with dependencies, setting up dedicated environments (either on-premises or cloud), deploying via a model serving framework, and making them accessible through APIs for efficient prediction requests. For LLMs, deployment options include using third-party APIs for immediate access or self-hosting open source models on high-performance infrastructure with GPUs/TPUs; this process often involves containerization (e.g., Docker, Kubernetes), load balancing for scalability, API management, and caching to optimize response times and costs.
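Stripped to its essentials, "deploy the model behind an API" means wrapping inference in an HTTP endpoint. The sketch below does this with only the Python standard library; the model is a hypothetical fixed linear scorer standing in for a real trained artifact, and production systems would use a dedicated serving framework instead:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def predict(features):
    # Hypothetical stand-in for a trained model: a fixed linear scorer.
    weights = [0.4, 0.6]
    return sum(w * x for w, x in zip(weights, features))

class ModelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

server = HTTPServer(("127.0.0.1", 0), ModelHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# A client sends features and gets a prediction back over HTTP.
request = Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"features": [1.0, 2.0]}).encode(),
    headers={"Content-Type": "application/json"},
)
response = json.loads(urlopen(request).read())
server.shutdown()
```

Everything else in this stage, containers, load balancers, caching, exists to make this request/response path reliable and fast at scale.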
AI Runtime Operations
And then we get to runtime. While data assets, models, and code lay the groundwork for AI systems, it is the runtime processes that bring AI applications to life, and they are equally important for AppSec teams to understand. At runtime, models act as an intelligent function, receiving inputs and returning outputs, whether they are ML models for prediction and classification or LLMs that communicate in natural language through prompts and responses. This involves the inference process, where trained models respond to input prompts to generate responses. Prompts guide these interactions, while response generation adapts dynamically and uses contextual memory to maintain coherence with prior interactions. Inference pipelines ensure the smooth, scalable processing of these operations, making AI systems responsive and capable of handling complex user interactions seamlessly.
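The contextual-memory idea above can be made concrete with a minimal chat session sketch: prior turns are replayed into each new prompt so the model can stay coherent. The model here is a stub function (it just reports how many user turns it saw); in production it would be an LLM behind an API:

```python
def model(prompt: str) -> str:
    # Hypothetical model stub: counts the user turns in its context window.
    turns = prompt.count("User:")
    return f"response #{turns}"

class ChatSession:
    def __init__(self, system_prompt: str):
        self.history = [("System", system_prompt)]

    def send(self, user_message: str) -> str:
        self.history.append(("User", user_message))
        # Replay prior turns so the model keeps coherence across interactions.
        prompt = "\n".join(f"{role}: {text}" for role, text in self.history)
        reply = model(prompt)
        self.history.append(("Assistant", reply))
        return reply

session = ChatSession("You are a helpful assistant.")
first = session.send("Hello")
second = session.send("And again")
```

Note the security implication: everything appended to the history, including earlier model outputs, flows back into future prompts.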
The Data & AI Lifecycle never ends. Continuous monitoring, maintenance, and updates are required to ensure models perform as expected, stay up to date with new data, and meet evolving business needs based on user feedback. Unlike conventional applications, AI-powered applications have unique considerations that require purpose-built tools and ongoing training, fine-tuning, and analysis.
Noma Security provides protection through every stage of the AI lifecycle. See how it works.
Data & AI Lifecycle Tools
Now that we have a high-level understanding of the Data & AI Lifecycle, let’s explore the stack of tools that support it. Facilitating the use of data for modeling, analysis, and fine-tuning requires unique development environments, pipelines, and tools.

Jupyter Notebooks
Jupyter Notebooks are to the Data & AI Lifecycle what IDEs and SCMs are to the software lifecycle—although they have some key differences. Notebooks are unique in that they give data scientists direct access to the data they need as they work, enabling them to dynamically and iteratively transform data and tweak models. This also makes these environments much more sensitive.
Jupyter Notebooks are versatile and can be set up in various configurations: simple installations on local servers or cloud-based infrastructure; managed cloud environments such as Google Colab, AWS SageMaker, or Azure Notebooks; or SaaS offerings within AI/ML platforms such as Databricks, Snowflake, or Domino.
Data Pipelines
Data pipelines are essential to collect, prepare, validate, transform, and transport data for any number of use cases, including model training and fine-tuning. Within these pipelines, pipeline jobs (tasks within the pipeline) can be orchestrated to schedule, monitor, and execute data workflows seamlessly, enabling scalable, distributed processing for complex tasks.
Data pipelines can be deployed on local servers or cloud-based platforms like AWS Glue, Google Cloud Dataflow, Azure Data Factory, and Databricks. Databricks specifically stands out for its ability to integrate ETL and ELT processes with big data and machine learning workflows. Advanced configurations may include hybrid pipelines that combine batch and real-time processing, managed through orchestration tools like Apache Airflow, Prefect, or built-in Databricks Jobs for streamlined automation.
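The orchestration concept, jobs with dependencies run in the right order, can be sketched in a few lines of standard-library Python, in the spirit of (but far simpler than) tools like Apache Airflow. The job names and logic are illustrative:

```python
from graphlib import TopologicalSorter

results = {}

def extract():
    results["extract"] = [3, 1, 2]            # pretend source data

def transform():
    results["transform"] = sorted(results["extract"])

def load():
    results["load"] = f"loaded {len(results['transform'])} rows"

JOBS = {"extract": extract, "transform": transform, "load": load}
# Each job maps to the set of jobs it depends on.
DEPENDENCIES = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

# The orchestrator's core duty: run every job after its dependencies.
for job_name in TopologicalSorter(DEPENDENCIES).static_order():
    JOBS[job_name]()
```

Production orchestrators add scheduling, retries, monitoring, and distributed execution on top of this same dependency-graph idea.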
Model Registries
It goes without saying that models—whether custom-built or open source—are the AI crown jewels. To manage and organize different versions of models throughout their lifecycle, MLOps tools such as model registries are essential. They support tasks such as tracking model metadata, version control, and model lineage, ensuring easy access to the right model versions during deployment.
Popular model registries include tools like MLflow, which offers experiment tracking to streamline model management and documentation. For leveraging open source models, ecosystems like Hugging Face facilitate sharing to accelerate time-to-value.
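To make the registry responsibilities concrete, here is a minimal in-memory sketch of version tracking, metadata, and stage promotion. The model name, artifact paths, and metrics are hypothetical, and real registries such as MLflow persist all of this and integrate with deployment tooling:

```python
class ModelRegistry:
    def __init__(self):
        self.models = {}                       # model name -> list of versions

    def register(self, name, artifact, metadata):
        versions = self.models.setdefault(name, [])
        versions.append({
            "version": len(versions) + 1,      # monotonically increasing
            "artifact": artifact,              # where the model lives
            "metadata": metadata,              # lineage, metrics, etc.
            "stage": "staging",
        })
        return versions[-1]["version"]

    def promote(self, name, version):
        for entry in self.models[name]:
            if entry["version"] == version:
                entry["stage"] = "production"

    def get_production(self, name):
        for entry in self.models[name]:
            if entry["stage"] == "production":
                return entry
        return None

registry = ModelRegistry()
registry.register("churn-model", "models/churn/v1.pkl", {"accuracy": 0.91})
v2 = registry.register("churn-model", "models/churn/v2.pkl", {"accuracy": 0.94})
registry.promote("churn-model", v2)
```

The value during deployment is exactly the last call: a serving system can ask "which version of this model is production?" and get an unambiguous answer.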
Source Control Managers (SCMs)
The Data & AI Lifecycle includes distinct tools and environments, but it intersects with the software lifecycle when models are embedded into applications. As the source of truth for code, source control managers (SCMs) like GitHub, GitLab, and Bitbucket hold the AI application code—including training scripts, data processing logic, and inference routines—that defines how models learn, make predictions, and integrate with applications and systems. SCMs also serve as the foundation for prompt engineering, a new frontier in AI development that involves crafting precise prompts and logic to guide LLMs in generating relevant, task-specific outputs, balancing the nuances of language with the technical demands of model interaction.
And, of course, SCMs allow teams to manage versions and changes to models efficiently, collaborate on model updates, and ensure traceability across both lifecycles.
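What does versioned prompt engineering actually look like in a repository? Often just a template plus a small builder function, like the hypothetical example below (the template wording and ticket text are illustrative):

```python
# A prompt template kept under source control, so changes to it are
# reviewed and versioned like any other application logic.
SUMMARIZE_TEMPLATE = (
    "You are a support analyst. Summarize the ticket below in one sentence.\n"
    "Respond with plain text only, no markdown.\n\n"
    "Ticket:\n{ticket}"
)

def build_prompt(ticket: str) -> str:
    # Normalize input before it reaches the model.
    return SUMMARIZE_TEMPLATE.format(ticket=ticket.strip())

prompt = build_prompt("  App crashes when uploading files over 100 MB.  ")
```

Treating prompts as code gives teams diffs, reviews, and rollbacks for the very text that steers model behavior.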
Model Servers
Model servers are MLOps tools that play a crucial role in the deployment and production phases of the Data & AI Lifecycle. They are designed to serve models as APIs or endpoints, enabling real-time predictions in production environments. Model servers ensure scalability, efficient load balancing, and smooth integration with other applications, and they are integral to maintaining model performance and reliability, forming the backbone of machine learning inference in production.
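One of those responsibilities, load balancing inference across replicas, can be illustrated with a toy round-robin dispatcher. The replicas here are stub functions rather than real model processes, and production servers use far richer policies (health checks, queue depth, batching):

```python
import itertools

def make_replica(replica_id):
    # Stub for a model replica; real replicas would be separate processes.
    def replica(features):
        return {"replica": replica_id, "prediction": sum(features)}
    return replica

replicas = [make_replica(i) for i in range(3)]
rotation = itertools.cycle(replicas)        # simple round-robin policy

def serve(features):
    # Each request goes to the next replica in the rotation.
    return next(rotation)(features)

calls = [serve([1.0, 2.0]) for _ in range(4)]
```

Spreading requests this way is what lets a model server keep latency flat as traffic grows.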
Data & AI Lifecycle Risks
As is true for most (if not all) emerging tech, the Data & AI Lifecycle is not secure by design. While vendors and suppliers producing third-party systems have responsibilities to secure customer data and access to some extent, users have their own responsibilities, much like the Shared Responsibility Model for the cloud.
Today, however, despite the abundance of application security testing tools—from SCA and SAST to DAST and API security—the Data & AI Lifecycle remains a huge blind spot for risk. From classic application security risks and misconfigurations across this new data and AI supply chain to training data and AI-specific vulnerabilities at runtime, the Data & AI Lifecycle has introduced a whole new attack surface.

To start detecting, preventing, and remediating those risks, the first step is an understanding of the underlying assets and processes. That’s where Noma comes in! We’re on a mission to shed light on this blind spot and bridge the gap between data engineering and application security teams.
To learn more about how Noma can help your organization secure its AI efforts, contact us.