Thursday, July 10, 2025

AI and ML perspective: Performance optimization

This document in the Well-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to help you to optimize the performance of your AI and ML workloads on Google Cloud. The recommendations in this document align with the performance optimization pillar of the Google Cloud Well-Architected Framework.

AI and ML systems enable new automation and decision-making capabilities for your organization. The performance of these systems can directly affect your business drivers like revenue, costs, and customer satisfaction. To realize the full potential of your AI and ML systems, you need to optimize their performance based on your business goals and technical requirements. The performance optimization process often involves certain trade-offs. For example, a design choice that provides the required performance might lead to higher costs. The recommendations in this document prioritize performance over other considerations like costs.

Translate business goals to performance objectives

To make architectural decisions that optimize performance, start with a clear set of business goals. Design AI and ML systems that provide the technical performance that's required to support your business goals and priorities. Your technical teams must understand the mapping between performance objectives and business goals.

Consider the following recommendations:

Translate business objectives into technical requirements: Translate the business objectives of your AI and ML systems into specific technical performance requirements and assess the effects of not meeting the requirements. For example, for an application that predicts customer churn, the ML model should perform well on standard metrics, like accuracy and recall, and the application should meet operational requirements like low latency.

Monitor performance at all stages of the model lifecycle: During experimentation and training, and after model deployment, monitor your key performance indicators (KPIs) and watch for any deviations from business objectives.

Automate evaluation to make it reproducible and standardized: With a standardized and comparable platform and methodology for experiment evaluation, your engineers can increase the pace of performance improvement.

Run and track frequent experiments

To transform innovation and creativity into performance improvements, you need a culture and a platform that supports experimentation. Performance improvement is an ongoing process because AI and ML technologies are developing continuously and quickly. To maintain a fast-paced, iterative process, you need to separate the experimentation space from your training and serving platforms. A standardized and robust experimentation process is important.

Consider the following recommendations:

Build an experimentation environment: Performance improvements require a dedicated, powerful, and interactive environment that supports the experimentation and collaborative development of ML pipelines.

Embed experimentation as a culture: Run experiments before any production deployment. Release new versions iteratively and always collect performance data. Experiment with different data types, feature transformations, algorithms, and hyperparameters.
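To keep experiments standardized and comparable across your team, you can record the parameters and metrics of each run in a tracking service. The following is a minimal sketch of experiment tracking with Vertex AI Experiments, assuming the google-cloud-aiplatform Python SDK; the project, experiment name, parameters, and metric values are placeholders.

from google.cloud import aiplatform

# Initialize the SDK and attach this session to a named experiment (placeholder values).
aiplatform.init(
    project="your-project-id",
    location="us-central1",
    experiment="churn-model-experiments",
)

# Each run records the configuration that produced its metrics, so results stay comparable.
aiplatform.start_run("run-xgboost-lr-0-01")
aiplatform.log_params({"algorithm": "xgboost", "learning_rate": 0.01, "batch_size": 64})

# ... train and evaluate the model here ...

aiplatform.log_metrics({"accuracy": 0.91, "recall": 0.87})
aiplatform.end_run()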

Build and automate training and serving services

Training and serving AI models are core components of your AI services. You need robust platforms and practices that support fast and reliable creation, deployment, and serving of AI models. Invest time and effort to create foundational platforms for your core AI training and serving tasks. These foundational platforms help to reduce time and effort for your teams and improve the quality of outputs in the medium and long term.

Consider the following recommendations:

Use AI-specialized components of a training service: Such components include high-performance compute and MLOps components like feature stores, model registries, metadata stores, and model performance-evaluation services.

Use AI-specialized components of a prediction service: Such components provide high-performance and scalable resources, support feature monitoring, and enable model performance monitoring. To prevent and manage performance degradation, implement reliable deployment and rollback strategies.

Match design choices to performance requirements

When you make design choices to improve performance, carefully assess whether the choices support your business requirements or are wasteful and counterproductive. To choose the appropriate infrastructure, models, or configurations, identify performance bottlenecks and assess how they're linked to your performance measures. For example, even on very powerful GPU accelerators, your training tasks can experience performance bottlenecks due to data I/O issues from the storage layer or due to performance limitations of the model itself.

Consider the following recommendations:

Optimize hardware consumption based on performance goals: To train and serve ML models that meet your performance requirements, you need to optimize infrastructure at the compute, storage, and network layers. You must measure and understand the variables that affect your performance goals. These variables are different for training and inference.

Focus on workload-specific requirements: Focus your performance optimization efforts on the unique requirements of your AI and ML workloads. Rely on managed services for the performance of the underlying infrastructure.

Choose appropriate training strategies: Many pre-trained and foundation models are available, and new ones are released frequently. Choose a training strategy that can deliver optimal performance for your task. Decide whether you should build your own model, tune a pre-trained model on your data, or use a pre-trained model API.

Recognize that performance-optimization strategies can have diminishing returns: When a particular performance-optimization strategy doesn't provide incremental business value that's measurable, stop pursuing that strategy.

Link performance metrics to design and configuration choices

To innovate, troubleshoot, and investigate performance issues, establish a clear link between design choices and performance outcomes. In addition to experimentation, you must reliably record the lineage of your assets, deployments, model outputs, and the configurations and inputs that produced the outputs.

Consider the following recommendations:

Build a data and model lineage system: All of your deployed assets and their performance metrics must be linked back to the data, configurations, code, and the choices that resulted in the deployed systems. In addition, model outputs must be linked to specific model versions and how the outputs were produced.

Use explainability tools to improve model performance: Adopt and standardize tools and benchmarks for model exploration and explainability. These tools help your ML engineers understand model behavior and improve performance or remove biases.


A summary of what we can do for cost optimization of AI and ML

Reduce training and development costs

Select an appropriate model or API for each ML task and combine them to create an end-to-end ML development process.


Vertex AI Model Garden offers a vast collection of pre-trained models for tasks such as image classification, object detection, and natural language processing. The models are grouped into the following categories:


Google models like the Gemini family of models and Imagen for image generation.

Open-source models like Gemma and Llama.

Third-party models from partners like Anthropic and Mistral AI.

Google Cloud provides AI and ML APIs that let developers integrate powerful AI capabilities into applications without the need to build models from scratch.


Cloud Vision API lets you derive insights from images. This API is valuable for applications like image analysis, content moderation, and automated data entry.

Cloud Natural Language API lets you analyze text to understand its structure and meaning. This API is useful for tasks like customer feedback analysis, content categorization, and understanding social media trends.

Speech-to-Text API converts audio to text. This API supports a wide range of languages and dialects.

Video Intelligence API analyzes video content to identify objects, scenes, and actions. Use this API for video content analysis, content moderation, and video search.

Document AI API processes documents to extract, classify, and understand data. This API helps you automate document processing workflows.

Dialogflow API enables the creation of conversational interfaces, such as chatbots and voice assistants. You can use this API to create customer service bots and virtual assistants.

Gemini API in Vertex AI provides access to Google's most capable and general-purpose AI model.
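As an illustration of how little code these APIs require compared to training a model from scratch, the following minimal sketch calls the Cloud Vision API to label an image. It assumes the google-cloud-vision Python client library and application default credentials; the image path is a placeholder.

from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Point the API at an image stored in Cloud Storage (placeholder URI).
image = vision.Image()
image.source.image_uri = "gs://your-bucket/product-photo.jpg"

# Request label detection and print the predicted labels with confidence scores.
response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 2))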

Reduce tuning costs

To help reduce the need for extensive data and compute time, fine-tune your pre-trained models on specific datasets. We recommend the following approaches:


Transfer learning: Use the knowledge from a pre-trained model for a new task, instead of starting from scratch. This approach requires less data and compute time, which helps to reduce costs.

Adapter tuning (parameter-efficient tuning): Adapt models to new tasks or domains without full fine-tuning. This approach requires significantly lower computational resources and a smaller dataset (see the sketch after this list).

Supervised fine-tuning: Adapt model behavior with a labeled dataset. This approach simplifies management of the underlying infrastructure and reduces the development effort that's required for a custom training job.
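As an illustration of the adapter-tuning approach above, the following sketch uses the open-source Hugging Face peft library to wrap a pre-trained model with LoRA adapters so that only a small fraction of the parameters is trained. The library choice, base model, and hyperparameters are illustrative assumptions, not part of the original guidance.

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Load a pre-trained base model (placeholder checkpoint and task).
base_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# LoRA adapters: train small low-rank matrices instead of the full model.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],
    task_type=TaskType.SEQ_CLS,
)
model = get_peft_model(base_model, lora_config)

# Typically only a small percentage of parameters is trainable, which reduces compute cost.
model.print_trainable_parameters()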

Explore and experiment by using Vertex AI Studio

Vertex AI Studio lets you rapidly test, prototype, and deploy generative AI applications.


Integration with Model Garden: Provides quick access to the latest models and lets you efficiently deploy the models to save time and costs.

Unified access to specialized models: Consolidates access to a wide range of pre-trained models and APIs, including those for chat, text, media, translation, and speech. This unified access can help you reduce the time spent searching for and integrating individual services.

Use managed services to train or serve models

Managed services can help reduce the cost of model training and simplify the infrastructure management, which lets you focus on model development and optimization. This approach can result in significant cost benefits and increased efficiency.


Reduce operational overhead

To reduce the complexity and cost of infrastructure management, use managed services such as the following:


Vertex AI training provides a fully managed environment for training your models at scale. You can choose from various prebuilt containers with popular ML frameworks or use your own custom containers. Google Cloud handles infrastructure provisioning, scaling, and maintenance, so you incur lower operational overhead.

Vertex AI Prediction handles infrastructure scaling, load balancing, and request routing. You get high availability and performance without manual intervention.

Ray on Vertex AI provides a fully managed Ray cluster. You can use the cluster to run complex custom AI workloads that perform many computations (hyperparameter tuning, model fine-tuning, distributed model training, and reinforcement learning from human feedback) without the need to manage your own infrastructure.
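The following minimal sketch shows what a managed training job can look like with the google-cloud-aiplatform Python SDK. The script name, container image, bucket, and machine type are placeholders; check the current list of prebuilt training containers for an image that matches your framework.

from google.cloud import aiplatform

aiplatform.init(
    project="your-project-id",
    location="us-central1",
    staging_bucket="gs://your-staging-bucket",
)

# A managed custom training job: Google Cloud provisions, scales, and tears down
# the training infrastructure for you.
job = aiplatform.CustomTrainingJob(
    display_name="churn-training",
    script_path="train.py",  # your local training script
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-13.py310:latest",  # example prebuilt image
    requirements=["pandas", "scikit-learn"],
)

job.run(
    machine_type="n1-standard-8",
    replica_count=1,
)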

Use managed services to optimize resource utilization


Optimize resource allocation


To achieve cost efficiency for your AI and ML workloads in Google Cloud, you must optimize resource allocation. To help you avoid unnecessary expenses and ensure that your workloads have the resources that they need to perform optimally, align resource allocation with the needs of your workloads.

To optimize the allocation of cloud resources to AI and ML workloads, consider the following recommendations.


Use autoscaling to dynamically adjust resources

Use Google Cloud services that support autoscaling, which automatically adjusts resource allocation to match the current demand. Autoscaling provides the following benefits:


Cost and performance optimization: You avoid paying for idle resources. At the same time, autoscaling ensures that your systems have the necessary resources to perform optimally, even at peak load.

Improved efficiency: You free up your team to focus on other tasks.

Increased agility: You can respond quickly to changing demands and maintain high availability for your applications.

The following techniques can help you implement autoscaling at different stages of your AI projects.

Training:

Use managed services like Vertex AI or GKE, which offer built-in autoscaling capabilities for training jobs.

Configure autoscaling policies to scale the number of training instances based on metrics like CPU utilization, memory usage, and job queue length.

Use custom scaling metrics to fine-tune autoscaling behavior for your specific workloads.

Inference:

Deploy your models on scalable platforms like Vertex AI Prediction, GPUs on GKE, or TPUs on GKE.

Use autoscaling features to adjust the number of replicas based on metrics like request rate, latency, and resource utilization.

Implement load balancing to distribute traffic evenly across replicas and ensure high availability.
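For the inference techniques above, the following minimal sketch deploys a registered model to a Vertex AI endpoint with autoscaling between one and five replicas. It assumes the google-cloud-aiplatform Python SDK; the model resource name and machine type are placeholders.

from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Reference a model that is already registered in Vertex AI Model Registry (placeholder ID).
model = aiplatform.Model("projects/your-project-id/locations/us-central1/models/1234567890")

# Autoscaling between min and max replicas: you pay for at least one replica, and the
# endpoint scales out under load instead of staying over-provisioned.
endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=5,
)
print(endpoint.resource_name)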


Start with small models and datasets

To help reduce costs, test ML hypotheses at a small scale when possible and use an iterative approach. This approach, with smaller models and datasets, provides the following benefits:


Reduced costs from the start: Less compute power, storage, and processing time can result in lower costs during the initial experimentation and development phases.

Faster iteration: Less training time is required, which lets you iterate faster, explore alternative approaches, and identify promising directions more efficiently.

Reduced complexity: Simpler debugging, analysis, and interpretation of results, which leads to faster development cycles.

Efficient resource utilization: Reduced chance of over-provisioning resources. You provision only the resources that are necessary for the current workload.



Consider the following recommendations:


Use sample data first: Train your models on a representative subset of your data. This approach lets you assess the model's performance and identify potential issues without processing the entire dataset.

Experiment by using notebooks: Start with smaller instances and scale as needed. You can use Vertex AI Workbench, a managed Jupyter notebook environment that's well suited for experimentation with different model architectures and datasets.

Start with simpler or pre-trained models: Use Vertex AI Model Garden to discover and explore the pre-trained models. Such models require fewer computational resources. Gradually increase the complexity as needed based on performance requirements.


Use pre-trained models for tasks like image classification and natural language processing. To save on training costs, you can fine-tune the models on smaller datasets initially.

Use BigQuery ML for structured data. BigQuery ML lets you create and deploy models directly within BigQuery. This approach can be cost-effective for initial experimentation, because you can take advantage of the pay-per-query pricing model for BigQuery (see the sketch after this list).

Scale for resource optimization: Use Google Cloud's flexible infrastructure to scale resources as needed. Start with smaller instances and adjust their size or number when necessary.
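As a sketch of the BigQuery ML approach mentioned in the list above, the following code trains a model directly in BigQuery with a single query, so you pay only for the query processing. It assumes the google-cloud-bigquery Python client; the dataset, table, and label column names are placeholders.

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

# Train a logistic regression model in place, without exporting data or managing
# any training infrastructure.
query = """
CREATE OR REPLACE MODEL `your_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT * FROM `your_dataset.customer_features`
"""
client.query(query).result()  # wait for the training query to finish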


Discover resource requirements through experimentation

Resource requirements for AI and ML workloads can vary significantly. To optimize resource allocation and costs, you must understand the specific needs of your workloads through systematic experimentation. To identify the most efficient configuration for your models, test different configurations and analyze their performance. Then, based on the requirements, right-size the resources that you used for training and serving.


We recommend the following approach for experimentation:


Start with a baseline: Begin with a baseline configuration based on your initial estimates of the workload requirements. To create a baseline, you can use the cost estimator for new workloads or use an existing billing report. For more information, see Unlock the true cost of enterprise AI on Google Cloud.

Understand your quotas: Before launching extensive experiments, familiarize yourself with your Google Cloud project quotas for the resources and APIs that you plan to use. The quotas determine the range of configurations that you can realistically test. By becoming familiar with quotas, you can work within the available resource limits during the experimentation phase.

Experiment systematically: Adjust parameters like the number of CPUs, amount of memory, number and type of GPUs and TPUs, and storage capacity. Vertex AI Training and Vertex AI Prediction let you experiment with different machine types and configurations.


Monitor utilization, cost, and performance: Track resource utilization, cost, and key performance metrics, such as training time, inference latency, and model accuracy, for each configuration that you experiment with.


To track resource utilization and performance metrics, you can use the Vertex AI console.

To collect and analyze detailed performance metrics, use Cloud Monitoring.

To view costs, use Cloud Billing reports and Cloud Monitoring dashboards.

To identify performance bottlenecks in your models and optimize resource utilization, use profiling tools like Vertex AI TensorBoard.



Implement the data governance framework

Google Cloud provides the following services and tools to help you implement a robust data governance framework:


Dataplex Universal Catalog is an intelligent data fabric that helps you unify distributed data and automate data governance without consolidating data sets in one place. This helps reduce the cost of distributing and maintaining data, facilitates data discovery, and promotes reuse.


To organize data, use Dataplex Universal Catalog abstractions and set up logical data lakes and zones.

To administer access to data lakes and zones, use Google Groups and Dataplex Universal Catalog roles.

To streamline data quality processes, enable auto data quality.

Dataplex Universal Catalog is also a fully managed and scalable metadata management service. The catalog provides a foundation that ensures that data assets are accessible and reusable.


Metadata from the supported Google Cloud sources is automatically ingested into the universal catalog. For data sources outside of Google Cloud, create custom entries.

To improve the discoverability and management of data assets, enrich technical metadata with business metadata by using aspects.

Ensure that data scientists and ML practitioners have sufficient permissions to access Dataplex Universal Catalog and use the search function.


Expand reusability beyond pipelines

Look for opportunities to expand reusability beyond training pipelines. The following are examples of Google Cloud capabilities that let you reuse ML features, datasets, models, and code.


Vertex AI Feature Store provides a centralized repository for organizing, storing, and serving ML features. It lets you reuse features across different projects and models, which can improve consistency and reduce feature engineering effort. You can store, share, and access features for both online and offline use cases.

Vertex AI datasets enable teams to create and manage datasets centrally, so your organization can maximize reusability and reduce data duplication. Your teams can search and discover the datasets by using Dataplex Universal Catalog.

Vertex AI Model Registry lets you store, manage, and deploy your trained models. You can reuse registered models in subsequent pipelines or for online prediction, which helps you take advantage of previous training efforts (see the sketch after this list).

Custom containers let you package your training code and dependencies into containers and store the containers in Artifact Registry. Custom containers let you provide consistent and reproducible training environments across different pipelines and projects.
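As a minimal sketch of the Model Registry reuse pattern above, the following code uploads a trained model so that other pipelines and teams can discover and deploy it. It assumes the google-cloud-aiplatform Python SDK; the artifact location and serving container image are placeholders.

from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Register the trained artifacts; later uploads with the same display name can be
# tracked as new versions of this model.
model = aiplatform.Model.upload(
    display_name="churn-model",
    artifact_uri="gs://your-bucket/models/churn/",  # trained model artifacts
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest",  # example prebuilt image
)
print(model.resource_name, model.version_id)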

AI and ML perspective: Cost optimization

This document in the Well-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to optimize the cost of your AI systems throughout the ML lifecycle. By adopting a proactive and informed cost management approach, your organization can realize the full potential of AI and ML systems and also maintain financial discipline.

AI and ML systems can help you unlock valuable insights and predictive capabilities from data. For example, you can reduce friction in internal processes, improve user experiences, and gain deeper customer insights. The cloud offers vast amounts of resources and quick time-to-value without large up-front investments for AI and ML workloads. To maximize business value and to align the spending with your business goals, you need to understand the cost drivers, proactively optimize costs, set up spending controls, and adopt FinOps practices.

The recommendations in this document are mapped to the following core principles:

Define and measure costs and returns

Optimize resource allocation

Enforce data management and governance practices

Automate and streamline with MLOps

Use managed services and pre-trained models


Define and measure costs and returns

To effectively manage AI and ML costs in Google Cloud, you must define and measure the cloud resource costs and the business value of your AI and ML initiatives. To help you track expenses granularly, Google Cloud provides comprehensive billing and cost management tools, such as the following:


Cloud Billing reports and tables

Looker Studio dashboards, budgets, and alerts

Cloud Monitoring

Cloud Logging

To make informed decisions about resource allocation and optimization, consider the following recommendations.




Establish business goals and KPIs

Align the technical choices in your AI and ML projects with business goals and key performance indicators (KPIs).


Define strategic objectives and ROI-focused KPIs

Ensure that AI and ML projects are aligned with strategic objectives like revenue growth, cost reduction, customer satisfaction, and efficiency. Engage stakeholders to understand the business priorities. Define AI and ML objectives that are specific, measurable, attainable, relevant, and time-bound (SMART). For example, a SMART objective is: "Reduce chat handling time for customer support by 15% in 6 months by using an AI chatbot".


To make progress towards your business goals and to measure the return on investment (ROI), define KPIs for the following categories of metrics:


Costs for training, inference, storage, and network resources, including specific unit costs (such as the cost per inference, data point, or task). These metrics help you gain insights into efficiency and cost optimization opportunities. You can track these costs by using Cloud Billing reports and Cloud Monitoring dashboards.


Project-specific metrics. You can track these metrics by using Vertex AI Experiments and evaluation.


Predictive AI: measure accuracy and precision

Generative AI: measure adoption, satisfaction, and content quality

Computer vision AI: measure accuracy


To validate your ROI hypotheses, start with pilot projects and use the following iterative optimization cycle:


Monitor continuously and analyze data: Monitor KPIs and costs to identify deviations and opportunities for optimization.

Make data-driven adjustments: Optimize strategies, models, infrastructure, and resource allocation based on data insights.

Refine iteratively: Adapt business objectives and KPIs based on what you learn and on evolving business needs. This iteration helps you maintain relevance and strategic alignment.

Establish a feedback loop: Review performance, costs, and value with stakeholders to inform ongoing optimization and future project planning.



Use Cloud Monitoring to collect metrics from various sources, including your applications, infrastructure, and Google Cloud services like Compute Engine, Google Kubernetes Engine (GKE), and Cloud Run functions. To visualize metrics and logs in real time, you can use the prebuilt Cloud Monitoring dashboard or create custom dashboards. Custom dashboards let you define and add metrics to track specific aspects of your systems, like model performance, API calls, or business-level KPIs.
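The following minimal sketch writes one data point of a custom, business-level metric to Cloud Monitoring so that it can appear on a dashboard or drive an alert. It assumes the google-cloud-monitoring Python client; the project, metric name, and value are placeholders.

import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/your-project-id"

# A custom metric type under custom.googleapis.com (hypothetical KPI name).
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/ml/chatbot_resolution_rate"
series.resource.type = "global"

# Record a single point with the current timestamp.
now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"double_value": 0.83}})
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])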


Use Cloud Logging for centralized collection and storage of logs from your applications, systems, and Google Cloud services. Use the logs for the following purposes:


Track costs and utilization of resources like CPU, memory, storage, and network.

Identify cases of over-provisioning (where resources aren't fully utilized) and under-provisioning (where there are insufficient resources). Over-provisioning results in unnecessary costs. Under-provisioning slows training times and might cause performance issues.

Identify idle or underutilized resources, such as VMs and GPUs, and take steps to shut down or rightsize them to optimize costs.

Identify cost spikes to detect sudden and unexpected increases in resource usage or costs.

Use Looker or Looker Studio to create interactive dashboards and reports. Connect the dashboards and reports to various data sources, including BigQuery and Cloud Monitoring.



AI and ML perspective: Reliability

This document in the Well-Architected Framework: AI and ML perspective provides an overview of the principles and recommendations to design and operate reliable AI and ML systems on Google Cloud. It explores how to integrate advanced reliability practices and observability into your architectural blueprints. The recommendations in this document align with the reliability pillar of the Google Cloud Well-Architected Framework.


By architecting for scalability and availability, you enable your applications to handle varying levels of demand without service disruptions or performance degradation. This means that your AI services are still available to users during infrastructure outages and when traffic is very high.


Consider the following recommendations:


Design your AI systems with automatic and dynamic scaling capabilities to handle fluctuations in demand. This helps to ensure optimal performance, even during traffic spikes.

Manage resources proactively and anticipate future needs through load testing and performance monitoring. Use historical data and predictive analytics to make informed decisions about resource allocation.

Design for high availability and fault tolerance by adopting the multi-zone and multi-region deployment archetypes in Google Cloud and by implementing redundancy and replication.

Distribute incoming traffic across multiple instances of your AI and ML services and endpoints. Load balancing helps to prevent any single instance from being overloaded and helps to ensure consistent performance and availability.


Use a modular and loosely coupled architecture

To make your AI systems resilient to failures in individual components, use a modular architecture. For example, design the data processing and data validation components as separate modules. When a particular component fails, the modular architecture helps to minimize downtime and lets your teams develop and deploy fixes faster.


Consider the following recommendations:


Separate your AI and ML system into small self-contained modules or components. This approach promotes code reusability, simplifies testing and maintenance, and lets you develop and deploy individual components independently.

Design the loosely coupled modules with well-defined interfaces. This approach minimizes dependencies, and it lets you make independent updates and changes without impacting the entire system.

Plan for graceful degradation. When a component fails, the other parts of the system must continue to provide an adequate level of functionality (see the sketch after this list).

Use APIs to create clear boundaries between modules and to hide the module-level implementation details. This approach lets you update or replace individual components without affecting interactions with other parts of the system.
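The following library-agnostic sketch illustrates the graceful-degradation point above: if the primary model endpoint fails, the application falls back to a simple heuristic instead of failing outright. The function names and the fallback rule are illustrative only.

def predict_churn(features, endpoint_client):
    """Return a churn score, degrading gracefully if the model endpoint is unavailable."""
    try:
        # Primary path: call the deployed model endpoint.
        return endpoint_client.predict(features)
    except Exception:
        # Fallback path: a coarse rule keeps the feature usable while operators investigate.
        # Log and alert here so that the degradation is visible.
        return 0.9 if features.get("days_since_last_login", 0) > 30 else 0.1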


Build an automated MLOps platform

With an automated MLOps platform, the stages and outputs of your model lifecycle are more reliable. By promoting consistency, loose coupling, and modularity, and by expressing operations and infrastructure as code, you remove fragile manual steps and maintain AI and ML systems that are more robust and reliable.


Consider the following recommendations:


Automate the model development lifecycle, from data preparation and validation to model training, evaluation, deployment, and monitoring.

Manage your infrastructure as code (IaC). This approach enables efficient version control, quick rollbacks when necessary, and repeatable deployments.

Validate that your models behave as expected with relevant data. Automate performance monitoring of your models, and build appropriate alerts for unexpected outputs.

Validate the inputs and outputs of your AI and ML pipelines. For example, validate data, configurations, command arguments, files, and predictions. Configure alerts for unexpected or unallowed values.

Adopt a managed version-control strategy for your model endpoints. This kind of strategy enables incremental releases and quick recovery in the event of problems.
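To make the automation and validation points above concrete, the following minimal sketch defines a two-step pipeline (validate the data, then train) with the Kubeflow Pipelines (KFP) v2 SDK, which Vertex AI Pipelines can run. The component logic, dataset path, and pipeline name are placeholders.

from kfp import compiler, dsl

@dsl.component(base_image="python:3.11")
def validate_data(dataset_uri: str) -> str:
    # Placeholder check: a real component would validate schema, ranges, and null rates,
    # and raise an exception to stop the pipeline on unexpected values.
    if not dataset_uri.startswith("gs://"):
        raise ValueError(f"Unexpected dataset location: {dataset_uri}")
    return dataset_uri

@dsl.component(base_image="python:3.11")
def train_model(validated_uri: str):
    # Placeholder training step: a real component would launch the training code here.
    print(f"Training on {validated_uri}")

@dsl.pipeline(name="validate-then-train")
def pipeline(dataset_uri: str = "gs://your-bucket/data.csv"):
    validated = validate_data(dataset_uri=dataset_uri)
    train_model(validated_uri=validated.output)

# Compile to a pipeline specification that can be version-controlled and submitted
# to Vertex AI Pipelines.
compiler.Compiler().compile(pipeline, "pipeline.yaml")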


Maintain trust and control through data and model governance

The reliability of AI and ML systems depends on the trust and governance capabilities of your data and models. AI outputs can fail to meet expectations in silent ways. For example, the outputs might be formally consistent but they might be incorrect or unwanted. By implementing traceability and strong governance, you can ensure that the outputs are reliable and trustworthy.


Consider the following recommendations:


Use a data and model catalog to track and manage your assets effectively. To facilitate tracing and audits, maintain a comprehensive record of data and model versions throughout the lifecycle.

Implement strict access controls and audit trails to protect sensitive data and models.

Address the critical issue of bias in AI, particularly in generative AI applications. To build trust, strive for transparency and explainability in model outputs.

Automate the generation of feature statistics and implement anomaly detection to proactively identify data issues. To ensure model reliability, establish mechanisms to detect and mitigate the impact of changes in data distributions.




Implement holistic AI and ML observability and reliability practices

To continuously improve your AI operations, you need to define meaningful reliability goals and measure progress. Observability is a foundational element of reliable systems. Observability lets you manage ongoing operations and critical events. Well-implemented observability helps you to build and maintain a reliable service for your users.


Consider the following recommendations:


Track infrastructure metrics for processors (CPUs, GPUs, and TPUs) and for other resources like memory usage, network latency, and disk usage. Perform load testing and performance monitoring. Use the test results and metrics from monitoring to manage scaling and capacity for your AI and ML systems.

Establish reliability goals and track application metrics. Measure metrics like throughput and latency for the AI applications that you build. Monitor the usage patterns of your applications and the exposed endpoints.

Establish model-specific metrics like accuracy or safety indicators in order to evaluate model reliability. Track these metrics over time to identify any drift or degradation. For efficient version control and automation, define the monitoring configurations as code.

Define and track business-level metrics to understand the impact of your models and reliability on business outcomes. To measure the reliability of your AI and ML services, consider adopting the SRE approach and define service level objectives (SLOs).

AI and ML perspective: Security

This document in the Well-Architected Framework: AI and ML perspective provides an overview of principles and recommendations to ensure that your AI and ML deployments meet the security and compliance requirements of your organization. The recommendations in this document align with the security pillar of the Google Cloud Well-Architected Framework.

Establish clear goals and requirements

Consider the following recommendations:

Identify potential attack vectors and adopt a security and compliance perspective from the start. As you design and evolve your AI systems, keep track of the attack surface, potential risks, and obligations that you might face.

Align your AI and ML security efforts with your business goals and ensure that security is an integral part of your overall strategy. Understand the effects of your security choices on your main business goals.

Keep data secure and prevent loss or mishandling

Consider the following recommendations:

Don't collect, keep, or use data that's not strictly necessary for your business goals. If possible, use synthetic or fully anonymized data.

Monitor data collection, storage, and transformation. Maintain logs for all data access and manipulation activities. The logs help you to audit data access, detect unauthorized access attempts, and prevent unwanted access.

Implement different levels of access (for example, no-access, read-only, or write) based on user roles. Ensure that permissions are assigned based on the principle of least privilege. Users must have only the minimum permissions that are necessary to let them perform their role activities.

Implement measures like encryption, secure perimeters, and restrictions on data movement. These measures help you to prevent data exfiltration and data loss.

Guard against data poisoning for your ML training systems.


Keep AI pipelines secure and robust against tampering

Your AI and ML code and the code-defined pipelines are critical assets. Code that isn't secured can be tampered with, which can lead to data leaks, compliance failure, and disruption of critical business activities. Keeping your AI and ML code secure helps to ensure the integrity and value of your models and model outputs.


Consider the following recommendations:

Use secure coding practices, such as dependency management or input validation and sanitization, during model development to prevent vulnerabilities.

Protect your pipeline code and your model artifacts, like files, model weights, and deployment specifications, from unauthorized access. Implement different access levels for each artifact based on user roles and needs.

Enforce lineage and tracking of your assets and pipeline runs. This enforcement helps you to meet compliance requirements and to avoid compromising production systems.


Deploy on secure systems with secure tools and artifacts

Ensure that your code and models run in a secure environment that has a robust access control system with security assurances for the tools and artifacts that are deployed in the environment.


Consider the following recommendations:


Train and deploy your models in a secure environment that has appropriate access controls and protection against unauthorized use or manipulation.

Follow standard Supply-chain Levels for Software Artifacts (SLSA) guidelines for your AI-specific artifacts, like models and software packages.

Prefer using validated prebuilt container images that are specifically designed for AI workloads.


Protect and monitor inputs

AI systems need inputs to make predictions, generate content, or automate actions. Some inputs might pose risks or be used as attack vectors that must be detected and sanitized. Detecting potential malicious inputs early helps you to keep your AI systems secure and operating as intended.


Consider the following recommendations:

Implement secure practices to develop and manage prompts for generative AI systems, and ensure that the prompts are screened for harmful intent (see the sketch after this list).

Monitor inputs to predictive or generative systems to prevent issues like overloaded endpoints or prompts that the systems aren't designed to handle.

Ensure that only the intended users of a deployed system can use it.
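The following library-agnostic sketch illustrates one simple form of input screening for a generative system, as recommended in the list above. The block list and length limit are placeholders and do not represent a complete safety solution.

BLOCKED_TERMS = {"ignore previous instructions", "system prompt"}  # hypothetical examples
MAX_PROMPT_CHARS = 4000

def screen_prompt(prompt: str) -> str:
    """Reject prompts that are too long or that match known-bad patterns."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("Prompt exceeds the allowed length")
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        raise ValueError("Prompt rejected by input screening")
    return prompt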



Monitor, evaluate, and prepare to respond to outputs

AI systems deliver value because they produce outputs that augment, optimize, or automate human decision-making. To maintain the integrity and trustworthiness of your AI systems and applications, you need to make sure that the outputs are secure and within expected parameters. You also need a plan to respond to incidents.


Consider the following recommendations:


Monitor the outputs of your AI and ML models in production, and identify any performance, security, and compliance issues.

Evaluate model performance by implementing robust metrics and security measures, like identifying out-of-scope generative responses or extreme outputs in predictive models. Collect user feedback on model performance.

Implement robust alerting and incident response procedures to address any potential issues.


Build a robust foundation for model development (Well-Architected Framework, AI and ML perspective: Part 2)

Implement observability

The behavior of AI and ML systems can change over time due to changes in the data or environment and updates to the models. This dynamic nature makes observability crucial to detect performance issues, biases, or unexpected behavior. This is especially true for generative AI models because the outputs can be highly variable and subjective. Observability lets you proactively address unexpected behavior and ensure that your AI and ML systems remain reliable, accurate, and fair.


You can use Vertex AI Model Monitoring to proactively track model performance, identify training-serving skew and prediction drift, and receive alerts to trigger necessary model retraining or other interventions. To effectively monitor for training-serving skew, construct a golden dataset that represents the ideal data distribution, and use TensorFlow Data Validation (TFDV) to analyze your training data and establish a baseline schema.
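The following minimal sketch shows the TFDV workflow described above: generate baseline statistics from training data, infer a schema, and then validate new data against that baseline. It assumes the tensorflow-data-validation package; the file paths are placeholders.

import tensorflow_data_validation as tfdv

# Establish the baseline from training data.
train_stats = tfdv.generate_statistics_from_csv(data_location="gs://your-bucket/train.csv")
schema = tfdv.infer_schema(statistics=train_stats)

# Compare newly collected serving data against the baseline schema.
serving_stats = tfdv.generate_statistics_from_csv(data_location="gs://your-bucket/serving.csv")
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)

# Inspect any detected schema violations or distribution anomalies.
tfdv.display_anomalies(anomalies)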


Configure Model Monitoring to compare the distribution of input data against the golden dataset for automatic skew detection. For traditional ML models, focus on metrics like accuracy, precision, recall, F1-score, AUC-ROC, and log loss. Define custom thresholds for alerts in Model Monitoring. 


You can also enable automatic evaluation metrics for response quality, safety, instruction adherence, grounding, writing style, and verbosity. To assess the generated outputs for quality, relevance, safety, and adherence to guidelines, you can incorporate human-in-the-loop evaluation.


Use Vertex AI rapid evaluation to let Google Cloud automatically run evaluations based on the dataset and prompts that you provide.



Use adversarial testing to identify vulnerabilities and potential failure modes. To identify and mitigate potential biases, use techniques like subgroup analysis and counterfactual generation. Use the insights gathered from the evaluations that were completed during the development phase to define your model monitoring strategy in production. Prepare your solution for continuous monitoring as described in the Monitor performance continuously section of this document.


Monitor for availability

To gain visibility into the health and performance of your deployed endpoints and infrastructure, use Cloud Monitoring. For your Vertex AI endpoints, track key metrics like request rate, error rate, latency, and resource utilization, and set up alerts for anomalies. For more information, see Cloud Monitoring metrics for Vertex AI.


Monitor the health of the underlying infrastructure, which can include Compute Engine instances, Google Kubernetes Engine (GKE) clusters, and TPUs and GPUs. Get automated optimization recommendations from Active Assist. If you use autoscaling, monitor the scaling behavior to ensure that autoscaling responds appropriately to changes in traffic patterns.



For timely identification and rectification of anomalies and issues, set up custom alerting based on thresholds that are specific to your business objectives. Examples of Google Cloud products that you can use to implement a custom alerting system include the following:


Cloud Logging: Collect, store, and analyze logs from all components of your AI and ML system.

Cloud Monitoring: Create custom dashboards to visualize key metrics and trends, and define custom metrics based on your needs. Configure alerts to get notifications about critical issues, and integrate the alerts with your incident management tools like PagerDuty or Slack.

Error Reporting: Automatically capture and analyze errors and exceptions.

Cloud Trace: Analyze the performance of distributed systems and identify bottlenecks. Tracing is particularly useful for understanding latency between different components of your AI and ML pipeline.

Cloud Profiler: Continuously analyze the performance of your code in production and identify performance bottlenecks in CPU or memory usage.


Prepare for peak events

Ensure that your system can handle sudden spikes in traffic or workload during peak events. Document your peak event strategy and conduct regular drills to test your system's ability to handle increased load.


To aggressively scale up resources when the demand spikes, configure autoscaling policies in Compute Engine and GKE. For predictable peak patterns, consider using predictive autoscaling. To trigger autoscaling based on application-specific signals, use custom metrics in Cloud Monitoring.