PhD Thesis: Frugal and Agentic AI for Knowledge-Intensive Tasks, an Adaptation Framework for Low-Resource Languages and Specialized Domains (M/F)

Doctorat.Gouv.Fr

  • Paris - 75
  • Fixed-term contract (CDD)
  • Master's level (Bac +5)
  • State civil service

Job details

Institution: Université Paris-Saclay, GS Informatique et sciences du numérique
Doctoral school: Sciences et Technologies de l'Information et de la Communication
Research laboratory: CEA/LIST - Laboratoire d'intégration de systèmes et de technologies
Thesis supervisor: Nasredine SEMMAR (ORCID 0000-0002-4685-6053)
Thesis start date: 2026-04-15
Application deadline: 2026-04-15, 23:59

Historically, the NLP domain was defined by specialized models built from explicit expert knowledge to handle specific, narrow tasks. The field has since undergone a radical paradigm shift, transitioning toward single, multi-purpose models capable of following instructions across a vast range of contexts. Remarkably, these models have demonstrated 'emergent abilities', performing extraordinarily well on varied tasks and frequently bypassing the need for traditional specialized architectures [1]. While this might suggest the end of the specialized era, the reality of deploying AI for professional, high-stakes applications indicates that we now need domain adaptation more than ever [2].

This necessity is compounded in specialized sectors such as Engineering, where technical precision and specific nomenclature are paramount, and Private Equity, where the synthesis of confidential, complex financial datasets requires high-fidelity reasoning. In these high-stakes professional environments, general-purpose models often lack the nuanced 'expert' grounding required for reliable output. Consequently, the lack of domain-specific data and professional corpora creates a secondary barrier to the effective deployment of AI tools [3].

While NLP has progressed drastically with the advent of Large Language Models (LLMs) capable of unprecedented text understanding and generation, such advances primarily benefit well-resourced languages like English and French. Low-resource languages, by contrast, face persistent challenges including a lack of corpora, absence of basic tools (e.g., spell checkers, translators, or speech recognition systems), and under-representation in multilingual models. This disparity makes it difficult to use these languages effectively in digital contexts. Today, thousands of languages remain largely excluded from modern digital technologies, particularly those based on artificial intelligence, and lack the textual resources and platforms necessary for their preservation and use. Of the nearly 7,000 languages spoken worldwide, only a small fraction benefits from recent advances in Natural Language Processing (NLP) [4].

Several factors contribute to this divide. Digital data are scarce, whether annotated or not, limiting linguistic diversity in computational models. Existing texts often cover narrow domains, and the parallel corpora necessary for translation are rare. Thanks to deep learning techniques trained on large annotated datasets, the current performance of Neural Machine Translation (NMT) systems for language pairs with sufficient parallel training corpora is close to human level, particularly in general-purpose domains. Conversely, achieving human-level performance on low-resource language pairs remains challenging due to the scarcity of linguistic resources and training data.

To address these challenges, several innovative techniques such as zero-shot and few-shot learning have been employed to facilitate translation with limited data [5]. Furthermore, data augmentation techniques have played a vital role in improving translation quality in low-resource settings [6]. Back-translation has been widely used to generate synthetic data that compensates for the lack of large parallel corpora. In prior work on machine translation at CEA List, we proposed an approach to assess back-translation effectiveness, focusing on sentence length and Part-Of-Speech (POS) tags [7]. Our findings suggest that specialized models may be needed to handle sentences outside typical length ranges, and that fine-tuning translation models to emphasize specific POS tags (e.g., verbs, nouns) can improve performance. However, this approach requires a POS tagger, which is an uncommon resource for most low-resource languages. Despite the existence of some NLP tools for low-resource languages, the major issue lies in the lack of language corpora on the one hand, and the absence of scientifically grounded approaches and operational tools on the other.
This technical and data-driven scarcity creates a direct parallel with highly specialized professional domains; just as a lack of basic linguistic tools cripples a low-resource language, the absence of domain-specific corpora and specialized expert lexicons prevents the reliable application of AI in complex fields like engineering or private equity. Being able to leverage expert knowledge stored in ontologies or specialized corpora might help reduce this gap.
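As a concrete illustration of the back-translation idea discussed above, the following minimal Python sketch shows the core loop: monolingual target-side text is paired with machine-generated source sentences. The `translate_to_source` function here is a hypothetical placeholder (a real pipeline would call a trained target-to-source MT model), and the length filter merely mirrors the sentence-length sensitivity observed in [7]; the thresholds are illustrative, not values from the cited work.

```python
def translate_to_source(target_sentence: str) -> str:
    """Placeholder for a target->source MT model (hypothetical stub)."""
    return "[synthetic-src] " + target_sentence

def back_translate(monolingual_target: list[str]) -> list[tuple[str, str]]:
    """Turn monolingual target-side text into synthetic (source, target) pairs."""
    pairs = []
    for tgt in monolingual_target:
        synthetic_src = translate_to_source(tgt)  # machine-generated source side
        pairs.append((synthetic_src, tgt))        # human-written target side is kept
    return pairs

def within_length_range(pair: tuple[str, str], lo: int = 3, hi: int = 40) -> bool:
    """Illustrative length filter: drop pairs outside a typical sentence-length band."""
    n = len(pair[1].split())
    return lo <= n <= hi

corpus = ["La maison est grande.", "Il pleut."]
synthetic_pairs = [p for p in back_translate(corpus) if within_length_range(p)]
```

In a real setting the filtered pairs would then be mixed with genuine parallel data to train or fine-tune the source-to-target model.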

In today's AI landscape, the prevailing strategy for deploying technology in professional contexts is the use of AI agents, modular systems driven by Large Language Models (LLMs) that are designed to plan, reason, and interact with external tools to perform task-specific automation [8]. However, this approach often falls victim to a 'specialized knowledge fallacy', where general-purpose models are expected to perform expert tasks despite the necessary professional grounding being absent from their training data. Because foundational models are primarily trained on broad, well-resourced datasets, they frequently lack the high-fidelity reasoning required for specialized fields like engineering or private equity. To overcome this and reach acceptable performance, there is a critical need to adapt these agents, which is currently done by either modifying the model itself (optimizing its internal planning and strategy via holistic rewards) or by adapting its tools through agent-supervised refinement and specialized memory stores [9].

Recent research into agentic adaptation reveals a significant 'Generalist Model Paradox': while models are marketed as universal solvers, intensive specialized training remains necessary for every new professional frontier. Current literature often overlooks the critical role of data quality and data leverage, focusing instead on architecture while failing to reflect on how 'noise' in specialized datasets can lead to catastrophic reasoning failures. Furthermore, there is insufficient exploration into how to frugally leverage unstructured or 'analog' data, a gap that is particularly visible when dealing with low-resource languages.

A central theoretical tension in this research lies in the 'chicken and egg' relationship between agentic systems and domain adaptation: is the agent the object to be adapted, or is agentic architecture the mechanism for adaptation? On one hand, the thesis explores whether domain adaptation of agents represents a new frontier in NLP, where the challenge shifts from simple linguistic alignment to the adaptation of an agent's internal reasoning and tool-use strategies for specific professional tasks. In this view, the 'frugal' constraint requires developing methods to fine-tune these complex behaviors with minimal data. On the other hand, the thesis investigates whether agentic AI is itself a sovereign solution to the domain adaptation problem. In this paradigm, an autonomous pipeline, capable of multisource data fusion, source validation, and identifying relevant fragments from heterogeneous analog and digital sources, serves as the engine that creates structured resources where none existed. By treating the agent as both the result and the facilitator of adaptation, this work aims to determine if an agentic framework can bypass traditional data-scarcity bottlenecks to support professional applications in low-resource languages and domains.

This thesis proposes a Frugal Agentic AI pipeline for developing a wide range of NLP tasks and applications, and will focus on several use cases, such as machine translation for low-resource languages and domains, and the application of agents to specialized enterprise activities (Legal, Private Equity, Engineering, OSINT).

The thesis will exploit all available data, digital and analog, to enhance these languages and domains and support the creation of suitable linguistic tools and knowledge bases. It proposes a complete framework encompassing data collection and analysis, language identification, source validation, information extraction from multimodal inputs (text, image, and video), and multisource data fusion, leading ultimately to the development of specialized large language models.

We propose to tackle these issues along the following key aspects:
- Identification, collection, analysis, and evaluation of relevant data, as well as the selection of useful resources, forming a critical step in the enhancement and enrichment of low-resource languages and domains. This thesis handles various types of data: texts and multimedia content (videos and audio recordings, historical narratives, or cultural content). Each collected source is evaluated for quality and linguistic value, to determine its relevance to the target language. The data are then organized and categorized, linking each linguistic fragment with its corresponding metadata and sources.
- Transforming, aggregating, and exploiting heterogeneous external data to create structured, usable resources for documenting and preserving low-resource languages. This involves converting the unstructured and dispersed sources collected into reliable, annotated textual data while preserving their linguistic and cultural richness.
- Adapting, fine-tuning, and optimizing agentic architectures to enhance performance in knowledge-intensive tasks within specialized domains and low-resource languages. Rather than focusing solely on linguistic translation, this thesis seeks to refine existing models to handle both the technical specificities of professional domains, such as Engineering and Private Equity, and the data scarcity typical of under-represented languages. By leveraging specialized ontologies, heterogeneous data augmentation, and agent-supervised refinement, these systems will be optimized for complex reasoning, expert information extraction, and cross-lingual knowledge transfer.

Furthermore, the research will explore how agentic loops can autonomously validate source quality and structure 'analog' or unstructured data into usable professional corpora. To ensure high-fidelity outputs in high-stakes environments, Reinforcement Learning from Human Feedback (RLHF) will be employed to align model reasoning with the specific logic, nomenclature, and expectations of domain experts and native speakers. This approach moves beyond general-purpose instruction following, providing a frugal yet robust framework for deploying AI in specialized, resource-constrained contexts.
- To operationalize these agentic architectures, we will focus on developing the underlying datasets and tools required to study and preserve low-resource languages while capturing expert knowledge in specific fields. These resources may include datasets on specific characters and alphabets, dictionaries, annotated text corpora, audio and video content, or specialized lexicons.
- Exploring new data augmentation strategies based on Large Language Models [10].
- Investigating the use of LLMs as a solution for NMT by leveraging in-context learning (ICL) [11]. In particular, we will study how LLMs exploit different parts of the provided context (e.g., few-shot examples, source text) in translation, with the goal of identifying and quantifying the minimum number of examples required to enhance translation quality [12].
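The in-context-learning study in the last point above amounts to varying the number of few-shot demonstrations in the prompt and measuring translation quality. The sketch below shows one way such a k-shot prompt could be assembled; the Breton-French pair and the demonstration words are purely illustrative, and the prompt format is an assumption rather than a format prescribed by [11] or [12].

```python
def build_fewshot_prompt(examples, source_sentence, k,
                         src_lang="Breton", tgt_lang="French"):
    """Assemble a k-shot translation prompt for an instruction-tuned LLM."""
    parts = [f"Translate from {src_lang} to {tgt_lang}."]
    for src, tgt in examples[:k]:  # vary k to locate the minimum useful context
        parts.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    parts.append(f"{src_lang}: {source_sentence}\n{tgt_lang}:")
    return "\n\n".join(parts)

# Illustrative demonstrations (hypothetical data, not project corpora)
demos = [("Demat", "Bonjour"), ("Kenavo", "Au revoir")]
prompt = build_fewshot_prompt(demos, "Trugarez", k=1)
```

Sweeping `k` over 0, 1, 2, ... while scoring the model's outputs would give the example-count/quality curve the thesis aims to quantify.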

The objective of this thesis is to develop frugal adaptation techniques that enable Generative and Agentic AI systems to excel at knowledge-intensive tasks where current leading LLMs struggle, or that allow smaller models to match the performance of larger ones at a fraction of the computational cost.

The research will follow a two-phase methodology, with an optional third phase:

Phase 1: Task Identification and Resource Construction
We will identify precise use cases spanning diverse knowledge-intensive challenges, including low-resource machine translation, financial data analysis, and specialized domain report generation. For each selected task, we will: (1) gather specialized resources (ontologies, expert lexicons, domain corpora), (2) construct evaluation datasets with clearly defined quality metrics, and (3) collect feedback from domain specialists to establish production-grade performance thresholds.
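Step (3) of Phase 1 can be pictured as attaching an expert-set acceptance bar to each evaluation task. The sketch below is a minimal illustration of that bookkeeping; the task names, metrics, and threshold values are invented for the example and do not come from the project.

```python
from dataclasses import dataclass

@dataclass
class EvalTask:
    """One Phase-1 task with its metric and an expert-set production bar."""
    name: str
    metric: str              # e.g. "chrF" or "expert rating" (illustrative)
    expert_threshold: float  # threshold elicited from domain specialists

def meets_production_bar(task: EvalTask, score: float) -> bool:
    """True when a system's score clears the expert-defined threshold."""
    return score >= task.expert_threshold

# Hypothetical tasks with hypothetical thresholds
mt_task = EvalTask("low-resource MT", "chrF", 0.55)
report_task = EvalTask("domain report generation", "expert rating", 0.80)
```

The point of making thresholds explicit is that Phase 2's "below-threshold" failures become machine-checkable, which is what drives the adaptation loop.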

Phase 2: Agentic System Development and Frugal Adaptation
Using existing frameworks for data processing and agent orchestration, we will develop AI agents designed to solve each identified task. This phase addresses our primary research question: Can agentic architectures, equipped with appropriate tooling, external knowledge access, and orchestration strategies, solve NLP tasks where single-LLM systems fail or fall short of state-of-the-art performance? We anticipate that initial systems will exhibit error rates above acceptable thresholds on certain tasks, primarily due to LLMs' limited understanding of domain-specific knowledge and task requirements. The core research challenge will be identifying and validating frugal techniques to bridge this gap, including:
- Enhanced data integration: Connecting heterogeneous knowledge sources (structured ontologies, unstructured corpora, expert feedback) to provide grounded context for reasoning.
- Frugal reinforcement learning: Adapting models through sample-efficient RL techniques that leverage limited expert feedback to align with domain-specific logic and nomenclature.
- Feedback-driven synthetic data augmentation: Expanding training corpora with LLM-generated synthetic examples, guided by evaluation insights and expert validation to target specific failure modes.
The success of these techniques will be measured against production-grade benchmarks established in Phase 1, with emphasis on achieving comparable performance to larger models while minimizing computational cost, training data requirements, and expert annotation burden.
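The feedback-driven augmentation item above combines two simple mechanisms: ranking observed failure modes so synthetic generation targets the worst ones, and filtering generated candidates through an evaluator score. The sketch below illustrates both; the error labels are invented, and `score_fn` is a hypothetical stand-in for expert validation or a learned reward model.

```python
from collections import Counter

def target_failure_modes(error_labels):
    """Rank error categories so synthetic generation targets the most frequent."""
    return [cat for cat, _ in Counter(error_labels).most_common()]

def filter_by_feedback(candidates, score_fn, min_score=0.8):
    """Keep only synthetic examples whose evaluator score clears min_score."""
    return [c for c in candidates if score_fn(c) >= min_score]

# Hypothetical evaluation log and candidate pool
errors = ["terminology", "terminology", "number-format", "terminology", "omission"]
priority = target_failure_modes(errors)  # most frequent category comes first
kept = filter_by_feedback(["good ex", "bad ex"],
                          lambda c: 0.9 if c == "good ex" else 0.3)
```

Only candidates that survive the filter would be added to the training corpus, keeping the expert annotation burden focused on scoring rather than authoring.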

Phase 3: Domain and Task Transfer, Optimization
Time permitting, we will evaluate the generalizability of these adaptation techniques across additional domains and tasks to assess their robustness to domain variation, while maintaining our focus on computational efficiency.

Candidate profile

The candidate must have:
- a Research Master's degree or equivalent in computer science or computational linguistics;
- programming skills (Linux, C++, Java, Python, Perl, etc.);
- knowledge of generative AI, natural language processing, statistics, and machine learning.

Published on 03/04/2026 - Ref: 3310efae92b9d88dcebf1bcdff9da022
