
Engineering teams need to adapt to AI’s scaling challenges

AI is not a linear process. To scale effectively, engineering leaders must account for varied edge cases, presenting a new set of challenges.


AI offers huge wins for engineering teams, but almost as many headaches on the path to successful implementation. Tech leaders need to adapt development processes as GenAI projects move from pilot to rollout. With demand for high-quality data growing, and testing and security moving up a gear, teams need to rethink previous ways of working.

In the latest Leaders of Code podcast episode, Maureen Makes, Recursion's VP of Engineering, and Ellen Brandenberger, Senior Director of Product Management for Overflow API, discuss the infrastructure and data challenges engineering leaders face when scaling AI programs.

The thorny challenge: AI development isn't a linear, step-by-step process. Between generating content, establishing AI agents, and managing data, Brandenberger says there are many edge cases when things can behave unexpectedly.

The growing pains of scaling AI models

Many engineering and product teams are building AI concepts that may use a foundational model or internal data, but Brandenberger believes scaling up is proving challenging. "How do we make it better? How do we iterate over time? My team's conversations with industry folks center on that problem. Some see agents as a solution. Capacity, throttling, and protection of data are central across the board."

Thanks to AI, customers now expect quick solutions. "What people used to expect to wait on, they no longer do. And in some cases, what they didn't expect to wait on, they want even faster now."

The ‘need for speed’ is challenging development teams. Recursion uses AI to bring drugs that treat disease to market faster. Makes says, “We want to be constantly pushing ourselves forward to say, 'Can we do this better? Can we do this faster?' For rare diseases and things that don't have cures, time matters to those patients."

They use foundational image models, NVIDIA's Phenom-Beta platform, and different LLMs, rather than a one-size-fits-all approach. Makes asks, "What are the tools that make sense for this job?"

Infrastructure needs to change to meet the needs of AI

Data has exploded in volume and complexity in recent years, but the rise of AI has shown enterprises how difficult it can be to work with unstructured data. It's overwhelming the traditional infrastructure; bottlenecks and quality issues can hamper the success of AI programs.

With this, infrastructure needs to change as AI models evolve. Databricks reports that only 22% of organizations believe their current architecture can support AI workloads without modification. As organizations start to build AI agents, infrastructure will be one of their biggest investments.

AI demand is hard to plan for at the organizational level, and it’s even harder for AI developers and data centers. OpenAI CEO Sam Altman claimed a new image model that set off a craze of Studio Ghibli-style images was "melting" its GPUs. New feature releases and other spikes of interest place demands on the networks that power AI applications. Data centers need low latency and reliable connectivity for developers to rely on the infrastructure. Blackstone estimates that the US will invest more than $1 trillion in data centers in the next five years, with a further $1 trillion invested internationally.

For Recursion, data storage and retrieval became challenging as the speed of data creation exceeded its fiber bandwidth capacity, requiring an upgrade. Their global data solution combines Google Cloud with Biohive, a Utah-based supercomputer, planning data location to minimize cross-region movement. For global operators, this makes sense: Moving data across borders is practically and legally tricky due to legislation like the GDPR in the EU. Gartner previously estimated that by 2025, 75% of the personal data of the world’s population will be covered by privacy regulations. Makes says, "We don't want to constantly be thinking about egress and moving [data] across regions."

Global organizations need to balance performance, compliance, and cost in their data strategy. This starts with mapping where customers are, where data is collected, and where it can legally be moved under digital sovereignty regulations. Many split workloads across cloud providers or across regions within the same provider. For instance, European customer data should be kept in the territory to comply with GDPR, while US customer data should sit in a US region, with data-location policies that avoid unnecessary egress costs. Data is then paired with local high-performance computing to handle the heavy lifting closer to where the data sits.
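The residency mapping described above can be sketched as a simple routing table. This is a minimal illustration, not any provider's real configuration: the region names and policy entries are assumptions.

```python
# Hypothetical residency router: pick a region-local store for each customer
# record so data never crosses a sovereignty boundary. Region names and the
# policy table below are illustrative assumptions.

RESIDENCY_POLICY = {
    "EU": "europe-west1",  # GDPR: EU customer data stays in an EU region
    "US": "us-central1",   # US data stays stateside, avoiding egress costs
}

def storage_region(territory: str) -> str:
    """Return the region a customer's data must be stored in."""
    if territory not in RESIDENCY_POLICY:
        raise ValueError(f"no residency policy defined for {territory!r}")
    return RESIDENCY_POLICY[territory]
```

Failing loudly on an unmapped territory, rather than falling back to a default region, keeps data from silently landing somewhere non-compliant.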

Data quality trumps data quantity

Previous thinking was that LLMs were hungry for more data: More powerful LLMs needed more training data. But with the growth of smaller and faster models like DeepSeek, this is now up for debate.

Organizations are finding that clean, relevant data produces better results than just high volumes. Gartner estimates 30% of internal AI projects fail due to poor data quality, while a Deloitte survey shows infrastructure and data quality are the top barriers to adoption. In a previous episode of Leaders of Code, Don Woodlock, Head of Global Healthcare Solutions at InterSystems, explains how data needs to be fine-tuned and cleaned to support training AI models. Less is more.
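As a toy illustration of "less is more," a training-data pipeline might drop incomplete records and deduplicate before fine-tuning. The schema and field names here are hypothetical, chosen only to show the idea.

```python
# Toy data-quality filter (hypothetical schema): keep only complete,
# deduplicated records rather than feeding every raw row into training.

def clean_training_data(records, required_fields=("text", "label")):
    """Drop records missing required fields, then remove exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        if any(not rec.get(field) for field in required_fields):
            continue  # incomplete record: skip rather than train on noise
        key = tuple(rec[field] for field in required_fields)
        if key in seen:
            continue  # exact duplicate: adds volume, not signal
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```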

Integrating data into AI models: It’s complicated

70% of top-performing tech leaders surveyed by McKinsey said they’ve experienced difficulties integrating data into AI models, including issues with data quality, securing sufficient training data, and defining data governance processes.

Data governance is especially crucial for successful AI programs. McKinsey’s 2025 AI Global Survey shows that 46% of organizations manage data governance centrally, under a dedicated data strategy and data leader.

Recursion manages governance through dedicated object storage teams focused on management and archiving. This keeps data accessible while controlling costs and maintaining contextual relevance and performance.
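One way to sketch that kind of archiving policy is an age-based tiering rule. This is not Recursion's actual tooling; the 90-day threshold and tier names are assumptions for illustration.

```python
# Illustrative age-based tiering rule: objects untouched past a threshold
# move to a cheaper archive tier, keeping hot storage costs in check.

from datetime import datetime, timedelta, timezone

ARCHIVE_AFTER = timedelta(days=90)  # assumed threshold, purely illustrative

def storage_tier(last_accessed: datetime, now: datetime) -> str:
    """Return "hot" for recently used objects, "archive" otherwise."""
    return "archive" if now - last_accessed > ARCHIVE_AFTER else "hot"
```

In practice, managed object stores can apply rules like this automatically via lifecycle policies, so no application code has to move objects by hand.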

Strengthening security for production AI

For organizations building AI applications, security concerns deepen as data volumes grow. AI creates new vulnerabilities while also offering better detection capabilities. For regulated industries like financial services and healthcare, these challenges are acute, requiring careful planning for privacy and legal compliance.

Yet many organizations are lax about monitoring GenAI outputs; 30% of them review just 20% of outputs.

Brandenberger has a neat analogy for exploring risk appetite. "How do you make the best cup of coffee? It's subjective. We all have an opinion. But we're less inclined to re-use things like legal or healthcare knowledge, which are higher risk."

She thinks data access is a growing concern. "CTOs are interested right now in understanding who has access to their data, what data is going into the models, and the services their teams are consuming. We're thinking about how we store data, who can access it, when and how."

What engineering teams need to know

Engineering leaders are experiencing the growing pains of implementing AI at scale. It takes a practical mindset to balance data quality and new ways of working, supported with the right tools and infrastructure, for AI pilots to scale up for success.
