Pedro Ortiz Suarez

Senior Research Scientist

Common Crawl Foundation

About me

I am a senior research scientist at the Common Crawl Foundation.

I am interested in large corpora for training language models, particularly for under-resourced and historical languages. I am also interested in tasks such as Named Entity Recognition (NER), dependency parsing, part-of-speech tagging, machine translation, and document structuring.

I love coffee, cookies, and math. ☕🍪

Interests

Language Modeling
Corpus Linguistics
Named Entity Recognition
Computational Linguistics
Machine Translation

Education

PhD in Computer Science, 2022
Sorbonne Université
BASc MIASHS, 2018
Université Paris 8
MSc in Mathematics, 2017
Aix-Marseille Université
BSc in Mathematics, 2016
Universidad Nacional de Colombia

Selected Publications

A Data-driven Approach to Named Entity Recognition for Early Modern French

We opt for a data-driven approach by developing a new corpus with fine-grained entity annotation, covering three centuries of literature corresponding to the early modern period. We then fine-tune existing state-of-the-art architectures, obtaining results that are on par with those of the current state-of-the-art NER systems for Contemporary English.

Simon Gabay, Pedro Ortiz Suarez

A Data-driven Approach to Natural Language Processing for Contemporary and Historical French

We determine that the importance of the pre-training dataset size was largely overestimated, as we are able to repeatedly show that language models can be pre-trained with corpora of a modest size.

Pedro Ortiz Suarez

Le projet FREEM : ressources, outils et enjeux pour l’étude du français d’Ancien Régime

We present annotated corpora and NLP models for some downstream tasks in Early Modern French.

Simon Gabay, Pedro Ortiz Suarez, Rachel Bawden, Alexandre Bartz, Philippe Gambette, Benoît Sagot

BERTrade: Using Contextual Embeddings to Parse Old French

We consider several neural language models, some of which are trained or fine-tuned on a new corpus of raw Old and Middle French texts, and use their internal representations of words as inputs to train taggers and parsers on the SRCMF treebank.

Loïc Grobol, Mathilde Regnault, Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary, Benoît Crabbé

From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

We present the $\text{FreEM}_{\text{max}}$ corpus of Early Modern French and D’AlemBERT, a RoBERTa-based language model trained on $\text{FreEM}_{\text{max}}$.

Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué, Rachel Bawden, Philippe Gambette, Benoît Sagot

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

We take the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR.

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

We audit 5 multilingual corpora, finding that lower-resource corpora have systematic issues.

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi

Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus

We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data.

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

SinNer@CLEF-HIPE2020: Sinful Adaptation of SotA models for Named Entity Recognition in Historical French and German Newspapers

In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing on old newspapers.

Pedro Ortiz Suarez, Yoann Dupont, Gaël Lejeune, Tian Tian

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.

Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell

We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect.

Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Ortiz Suarez, Benoît Sagot, Abhishek Srivastava

Les modèles de langue contextuels Camembert pour le Français : impact de la taille et de l'hétérogénéité des données d'entrainement

We explore the impact of the training data size and heterogeneity on French language modeling. (Equal contribution by the first three authors).

Louis Martin, Benjamin Muller, Pedro Ortiz Suarez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Benoît Sagot, Djamé Seddah

Establishing a New State-of-the-Art for French Named Entity Recognition

We convert the NER annotations of the French TreeBank to a more user-friendly format and establish a new state of the art for French NER.

Pedro Ortiz Suarez, Yoann Dupont, Benjamin Muller, Laurent Romary, Benoît Sagot

French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus

We investigate the impact of different types and sizes of training corpora on language models.

Murielle Popa-Fabre, Pedro Ortiz Suarez, Benoît Sagot, Éric de la Clergerie

How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures

We explore the impact of the OCR quality on grobid-dictionaries models.

Mohamed Khemakhem, Ioana Galleron, Geoffrey Williams, Laurent Romary, Pedro Ortiz Suarez

Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures

We propose a new pipeline to filter, clean, and classify Common Crawl by language. We publish the final corpus under the name OSCAR.

Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary

Recent Publications

A Data-driven Approach to Named Entity Recognition for Early Modern French

We opt for a data-driven approach by developing a new corpus with fine-grained entity annotation, covering three centuries of literature corresponding to the early modern period. We then fine-tune existing state-of-the-art architectures, obtaining results that are on par with those of the current state-of-the-art NER systems for Contemporary English.

Simon Gabay, Pedro Ortiz Suarez

A Data-driven Approach to Natural Language Processing for Contemporary and Historical French

We determine that the importance of the pre-training dataset size was largely overestimated, as we are able to repeatedly show that language models can be pre-trained with corpora of a modest size.

Pedro Ortiz Suarez

Le projet FREEM : ressources, outils et enjeux pour l’étude du français d’Ancien Régime

We present annotated corpora and NLP models for some downstream tasks in Early Modern French.

Simon Gabay, Pedro Ortiz Suarez, Rachel Bawden, Alexandre Bartz, Philippe Gambette, Benoît Sagot

BERTrade: Using Contextual Embeddings to Parse Old French

We consider several neural language models, some of which are trained or fine-tuned on a new corpus of raw Old and Middle French texts, and use their internal representations of words as inputs to train taggers and parsers on the SRCMF treebank.

Loïc Grobol, Mathilde Regnault, Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary, Benoît Crabbé

From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

We present the $\text{FreEM}_{\text{max}}$ corpus of Early Modern French and D’AlemBERT, a RoBERTa-based language model trained on $\text{FreEM}_{\text{max}}$.

Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué, Rachel Bawden, Philippe Gambette, Benoît Sagot

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

We take the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR.

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

We audit 5 multilingual corpora, finding that lower-resource corpora have systematic issues.

Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi

Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus

We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data.

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

SinNer@CLEF-HIPE2020: Sinful Adaptation of SotA models for Named Entity Recognition in Historical French and German Newspapers

In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing on old newspapers.

Pedro Ortiz Suarez, Yoann Dupont, Gaël Lejeune, Tian Tian

Projects

Digitization and analysis of Basnage de Beauval’s Universal Dictionary: lexicography and scientific networks

BASNUM

A state-of-the-art language model for French.

CamemBERT

OSCAR, or Open Super-large Crawled Aggregated coRpus, is a huge multilingual corpus.

OSCAR

Recent and Upcoming Talks

Des Méthodes de TAL modernes pour l'Enrichissement de Documents

We present a pipeline for processing and enriching documents, based on the latest neural learning methods.

Pedro Ortiz Suarez

Sept. 22, 2020

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.

Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

July 6, 2020

Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures

We propose a new pipeline to filter, clean, and classify Common Crawl by language. We publish the final corpus under the name OSCAR.

Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary

July 22, 2019

Preparing the Dictionnaire Universel for Automatic Enrichment

A talk about automatic enrichment of dictionaries.

Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

June 13, 2019

Reducing computation time by months by rewriting Bash scripts in Go

Pedro Ortiz Suarez

March 24, 2019

Contact