Untitled

Funder

Organization

Publications

Building a Corpus of Errors and Quality in Machine Translation

Publication . Costa, Ângela; Correia, Rui; Coheur, Luísa; Centro de Linguística da UNL (CLUNL)

In this paper we describe a corpus of automatic translations annotated with both error type and quality. The 300 sentences that we selected were generated by Google Translate, Systran and two in-house Machine Translation systems that use Moses technology. The errors present in the translations were annotated with an error taxonomy that divides errors into five main linguistic categories (Orthography, Lexis, Grammar, Semantics and Discourse), reflecting the language level where the error is located. After the error annotation process, we assessed the translation quality of each sentence using a four point comprehension scale from 1 to 5. Both the error and the quality annotation tasks were performed by two different annotators, achieving good levels of inter-annotator agreement. The resulting corpus served as training data for a translation quality classifier. We drew conclusions about error severity by observing the outputs of two machine learning classifiers: a decision tree and a regression model.

2016 Conference paper

Open access

El-WOZ

Publication . Pellegrini, Thomas; Hedayati, Vahid; Costa, Ângela; Centro de Linguística da UNL (CLUNL)

In this paper, we present a speech recording interface developed in the context of a project on automatic speech recognition for elderly native speakers of European Portuguese. In order to collect spontaneous speech in a situation of interaction with a machine, this interface was designed as a Wizard-of-Oz (WOZ) platform. In this setup, users interact with a fake automated dialog system controlled by a human wizard. It was implemented as a client-server application in which the subjects interact with a talking head. The human wizard chooses pre-defined questions or sentences in a graphical user interface, which are then synthesized and spoken aloud by the avatar on the client side. A small spontaneous speech corpus was collected in a day center. Eight speakers between 75 and 90 years old were recorded. They appreciated the interface and felt at ease with the avatar. Manual orthographic transcriptions were created for the total of about 45 minutes of speech.

2014 Book chapter

Open access

Translation errors from English to Portuguese

Publication . Costa, Ângela; Luís, Tiago; Coheur, Luísa; Centro de Linguística da UNL (CLUNL)

Analysing translation errors is a task that can help us find and describe translation problems in greater detail, and can also suggest where the automatic engines should be improved. With these aims in mind, we created a corpus of 150 sentences: 50 from the TAP magazine, 50 from a TED talk and the other 50 from the TREC collection of factoid questions. We automatically translated these sentences from English into Portuguese using Google Translate and Moses. After we had analysed the errors and created the error annotation taxonomy, the corpus was annotated by a linguist who is a native speaker of Portuguese. Although Google's overall performance was better in the translation task (we also calculated the BLEU and NIST scores), there are some error types that Moses was better at coping with, especially discourse-level errors.

2014 Conference paper

Open access

Funding entity

Fundação para a Ciência e a Tecnologia

Funding program

3599-PPCDT

Award number

PEst-OE/EEI/LA0021/2013