About PaFreS

The French/Spanish parallel corpus PaFreS is part of a larger ongoing project, PaCorES (Parallel Corpora Spanish), which aims to compile a series of bilingual parallel corpora with Spanish as the central language. So far the project comprises a German/Spanish corpus (www.corpuspages.eu), an English/Spanish corpus (www.corpuspaens.eu) and the present French/Spanish corpus.

PaFreS comprises original texts in French or Spanish together with their translations, as well as French and Spanish translations of texts originally written in a third language. So far PaFreS contains some 3,000,000 tokens, segmented into 58,000 bisegments, i.e. pairs of text chunks aligned at the sentence or subsentence level.

We aim to build a multifunctional and representative language resource for the language pair French/Spanish that can meet the varied needs of users and be exploited for multiple purposes, such as general research in contrastive linguistics, linguistic typology, translation studies and bilingual lexicography, as well as the supply of training data for machine translation systems.

The main purpose of PaFreS is to be a useful and easy-to-use tool for translators and for learners of French or Spanish as a foreign language at intermediate and advanced levels. With this tool they can obtain a multitude of translation suggestions made by humans and presented within examples of real language use.

So far it includes:

A collection of 12 books of classical literature, 5 of them originally written in French and 7 in English:

Brontë, Charlotte (1847): Jane Eyre.
[Jane Eyre. Translation: András Farkas]
Review of the alignment: I. Doval. [32001]

Carroll, Lewis (1865): Alice's Adventures in Wonderland.
[Alice au pays des merveilles. Translation: András Farkas]
Review of the alignment: I. Doval. [32002]

Defoe, Daniel (1719): Robinson Crusoe.
[Robinson Crusoe. Translation: András Farkas]
Review of the alignment: I. Doval. [32003]

Doyle, Arthur Conan (1902): The Hound of the Baskervilles.
[Le Chien des Baskerville. Translation: András Farkas]
Review of the alignment: I. Doval. [32004]

Doyle, Arthur Conan (1887): A Study in Scarlet.
[Une étude en rouge. Translation: András Farkas]
Review of the alignment: I. Doval. [32005]

Poe, Edgar Allan (1839): The Fall of the House of Usher.
[La chute de la maison Usher. Translation: András Farkas]
Review of the alignment: I. Doval. [32006]

Wilde, Oscar (1890): The Picture of Dorian Gray.
[Le Portrait de Dorian Gray. Translation: András Farkas]
Review of the alignment: I. Doval. [32007]

Dumas, Alexandre (1844): Les Trois Mousquetaires.
[Los tres mosqueteros. Translation: András Farkas]
Review of the alignment: I. Doval. [32008]

Verne, Jules (1870): Vingt mille lieues sous les mers.
[Veinte mil leguas de viaje submarino. Translation: András Farkas]
Review of the alignment: I. Doval. [32009]

Verne, Jules (1875): L'île mystérieuse.
[La isla misteriosa. Translation: András Farkas]
Review of the alignment: I. Doval. [32010]

Verne, Jules (1872): Le Tour du monde en quatre-vingts jours.
[La vuelta al mundo en 80 días. Translation: András Farkas]
Review of the alignment: I. Doval. [32011]

Voltaire (1759): Candide, ou l'Optimisme.
[Cándido o el optimismo. Translation: András Farkas]
Review of the alignment: I. Doval. [32012]

TED-Talks, a corpus that collects the French and Spanish translations of the transcriptions of 2008 TED Talks from 2006 to 2016.
Europarl v7, a corpus that collects the proceedings (verbatim reports) of the European Parliament from 1996 to 2011.
Global Voices, a corpus of 2362 texts written by an international, multilingual, primarily volunteer community of writers, translators, academics, and human rights activists.

This is an ongoing project, and we plan to add new collections of bilingual texts of diverse origin in the future.

Despite our best efforts, some mistakes have undoubtedly slipped through. If you come across any, please let us know by clicking here.

Notice:

If you use PaFreS in your work, please acknowledge it and let us know at corpuspafres@usc.es. In this way you contribute to the sustainability of the project.

PaFreS Statistics

COLLECTION	LANGUAGE	TOKENS	WORDS	MSTTR*	BISEGMENTS
Literature	French	1.399.200	1.151.212	0,545	57.733
Literature	Spanish	1.282.476	1.106.637	0,537	57.733
Europarl v7	French	59.651.196	51.954.734	0,496	1.944.439
Europarl v7	Spanish	53.583.854	48.664.574	0,482	1.944.439
TED-Talks	French	5.197.553	4.332.950	0,496	254.222
TED-Talks	Spanish	4.686.514	4.062.259	0,504	254.222
Global Voices	French	1.179.414	1.016.690	0,539	50.270
Global Voices	Spanish	1.097.488	985.542	0,553	50.270
Total	French	67.427.363	58.455.586	0,523	2.306.664
Total	Spanish	60.650.332	54.819.012	0,515	2.306.664

*MSTTR (mean segmental type/token ratio) is the average TTR (type/token ratio) computed over non-overlapping segments of equal size (in this case 1000 tokens).
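
As a rough illustration of the MSTTR definition above, the following Python sketch averages the type/token ratio over consecutive, non-overlapping segments. It is not the procedure used to produce the PaFreS figures: the function name msttr, the treatment of tokens as plain case-sensitive strings, and the choice to discard a trailing segment shorter than the segment size are all assumptions made for the example.

```python
def msttr(tokens, segment_size=1000):
    """Mean segmental type/token ratio (illustrative sketch).

    Assumptions (not taken from the PaFreS documentation):
    - tokens is a list of already-tokenized strings, compared case-sensitively;
    - a trailing segment shorter than segment_size is discarded.
    """
    ratios = []
    # Walk over the token list in full, non-overlapping segments.
    for start in range(0, len(tokens) - segment_size + 1, segment_size):
        segment = tokens[start:start + segment_size]
        # TTR of one segment: distinct tokens (types) divided by tokens.
        ratios.append(len(set(segment)) / segment_size)
    if not ratios:
        raise ValueError("need at least one full segment of tokens")
    # MSTTR is the plain average of the per-segment ratios.
    return sum(ratios) / len(ratios)


# Toy usage with a segment size of 3 instead of 1000.
sample = ["a", "b", "a", "c", "d", "d", "e"]
print(msttr(sample, segment_size=3))
```

In the toy usage, the two full segments each contain 2 types among 3 tokens, and the final token is dropped because it does not fill a segment. Because every segment has the same length, the measure is less sensitive to overall text length than a single TTR over the whole text, which makes the values for collections of very different sizes easier to compare.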