
About PaFreS


The French/Spanish parallel corpus PaFreS is part of a larger ongoing project, PaCorES (Parallel Corpora Spanish), which aims to compile a series of bilingual parallel corpora with Spanish as the central language. So far the project comprises a German/Spanish corpus (www.corpuspages.eu), an English/Spanish corpus (www.corpuspaens.eu) and the present French/Spanish one.

The PaFreS corpus consists of original texts in French or Spanish together with their translations, as well as French and Spanish translations of texts originally written in a third language. So far PaFreS contains some 3,000,000 tokens, segmented into 58,000 bisegments, i.e. pairs of text chunks aligned at sentence or subsentence level.

We aim to build a multifunctional and representative language resource for the language pair French/Spanish that can meet the differing needs of users and serve multiple purposes, such as general research in contrastive linguistics, linguistic typology, translation studies and bilingual lexicography, as well as supplying training data for machine translation systems.

The main purpose of the PaFreS corpus is to be a useful and easy-to-use tool for translators and for learners of French or Spanish as a foreign language at intermediate and advanced levels. With this tool they can obtain a multitude of human-made translation suggestions, presented within examples of real language use.
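
To make the idea of a bisegment and of looking up translation suggestions more concrete, here is a minimal Python sketch. The `Bisegment` class, the `find_suggestions` function and the two sample sentence pairs are hypothetical, invented for illustration only; they do not reflect the actual PaFreS data format or search interface.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Bisegment:
    """One aligned pair of text chunks (hypothetical structure)."""
    french: str
    spanish: str


def find_suggestions(corpus: Iterable[Bisegment], french_word: str) -> List[Bisegment]:
    """Return every bisegment whose French side contains the given word."""
    # \b restricts matches to whole words, so "mer" does not match "Merci".
    pattern = re.compile(rf"\b{re.escape(french_word)}\b", re.IGNORECASE)
    return [pair for pair in corpus if pattern.search(pair.french)]


# Invented sample pairs, not taken from the corpus.
sample = [
    Bisegment("La mer était calme.", "El mar estaba en calma."),
    Bisegment("Merci beaucoup.", "Muchas gracias."),
]

for pair in find_suggestions(sample, "mer"):
    print(pair.french, "->", pair.spanish)
```

Each hit shows the search word in its original sentence next to the aligned translation, which is the kind of in-context suggestion described above.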

So far it includes:

  1. A collection of 12 books of classic literature, 5 originally written in French and 7 in English:

    Brontë, Charlotte (1847): Jane Eyre.
               [Jane Eyre.  Translation: András Farkas]
               Review of the alignment: I. Doval. [32001]

    Carroll, Lewis (1865): Alice's Adventures in Wonderland.
               [Alice au pays des merveilles.  Translation: András Farkas]
               Review of the alignment: I. Doval. [32002]

    Defoe, Daniel (1719): Robinson Crusoe.
               [Robinson Crusoe.  Translation: András Farkas]
               Review of the alignment: I. Doval. [32003]

    Doyle, Arthur Conan (1902): The Hound of the Baskervilles.
               [Le Chien des Baskerville.  Translation: András Farkas]
               Review of the alignment: I. Doval. [32004]

    Doyle, Arthur Conan (1887): A Study in Scarlet.
               [Une étude en rouge.  Translation: András Farkas]
               Review of the alignment: I. Doval. [32005]

    Poe, Edgar Allan (1839): The Fall of the House of Usher.
               [La chute de la maison Usher.  Translation: András Farkas]
               Review of the alignment: I. Doval. [32006]

    Wilde, Oscar (1890): The Picture of Dorian Gray.
               [Le Portrait de Dorian Gray.  Translation: András Farkas]
               Review of the alignment: I. Doval. [32007]

    Dumas, Alexandre (1844): Les Trois Mousquetaires.
               [Los tres mosqueteros.  Translation: András Farkas]
               Review of the alignment: I. Doval. [32008]

    Verne, Jules (1870): Vingt mille lieues sous les mers.
               [Veinte mil leguas de viaje submarino.  Translation: András Farkas]
               Review of the alignment: I. Doval. [32009]

    Verne, Jules (1875): L'île mystérieuse.
               [La isla misteriosa.  Translation: András Farkas]
               Review of the alignment: I. Doval. [32010]

    Verne, Jules (1872): Le Tour du monde en quatre-vingts jours.
               [La vuelta al mundo en 80 días.  Translation: András Farkas]
               Review of the alignment: I. Doval. [32011]

    Voltaire (1759): Candide, ou l'Optimisme.
               [Cándido o el optimismo.  Translation: András Farkas]
               Review of the alignment: I. Doval. [32012]

  2. TED-Talks, a corpus of French and Spanish translations of the transcripts of 2,008 TED Talks given between 2006 and 2016.
  3. Europarl v7, a corpus that collects the proceedings (verbatim reports) of the European Parliament from 1996 to 2011.
  4. Global Voices, a corpus of 2,362 texts written by an international, multilingual, primarily volunteer community of writers, translators, academics, and human rights activists.

This is an ongoing project, and we plan to add new collections of bilingual texts of diverse origin in the future.

Despite our best efforts, some mistakes have undoubtedly slipped through. If you come across any, please let us know.

Notice:

If you use PaFreS in your work, please acknowledge it and let us know at corpuspafres@usc.es. In this way you contribute to the sustainability of the project.

Statistics PaFreS

COLLECTION     LANGUAGE      TOKENS       WORDS  MSTTR*  BISEGMENTS
Literature     French     1,399,200   1,151,212   0.545      57,733
               Spanish    1,282,476   1,106,637   0.537
Europarl v7    French    59,651,196  51,954,734   0.496   1,944,439
               Spanish   53,583,854  48,664,574   0.482
TED-Talks      French     5,197,553   4,332,950   0.496     254,222
               Spanish    4,686,514   4,062,259   0.504
Global Voices  French     1,179,414   1,016,690   0.539      50,270
               Spanish    1,097,488     985,542   0.553
Total          French    67,427,363  58,455,586   0.523   2,306,664
               Spanish   60,650,332  54,819,012   0.515

*MSTTR (mean segmental type/token ratio) is the average of the TTR (type/token ratio) values computed for consecutive non-overlapping segments of equal size (in this case 1,000 tokens).
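
As a rough illustration of the definition above, the following Python sketch computes MSTTR for a list of tokens. The function name `msttr`, the lowercasing step and the decision to drop a trailing incomplete segment are assumptions made for this example; the page does not say how PaFreS tokenizes text or handles leftover tokens.

```python
from typing import Sequence


def msttr(tokens: Sequence[str], segment_size: int = 1000) -> float:
    """Mean segmental type/token ratio over full, non-overlapping segments."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    ratios = []
    # Step through the token list one full segment at a time;
    # a trailing incomplete segment is ignored (assumption).
    for start in range(0, len(tokens) - segment_size + 1, segment_size):
        segment = [t.lower() for t in tokens[start:start + segment_size]]
        # TTR = number of distinct forms (types) / number of tokens.
        ratios.append(len(set(segment)) / segment_size)
    if not ratios:
        raise ValueError("need at least one full segment of tokens")
    return sum(ratios) / len(ratios)


# Toy example with a segment size of 4 instead of 1000.
toy = "le chat voit le chien et le chien voit le chat".split()
print(msttr(toy, segment_size=4))  # prints 0.75
```

In the toy example both full segments contain 3 distinct forms among 4 tokens, so each TTR is 0.75 and so is their mean. Averaging over fixed-size segments makes the measure less sensitive to text length than a single TTR over the whole text.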

                                                    
PaFreS Vers. 1.0
Last updated: 15.04.2023
©PaCorES
Creative Commons License
University of Santiago de Compostela
This project is funded by the State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (PID2021-125313OB-I00).