Syntax Archive

The Syntax Archive is a research archive specialising in the development and maintenance of Finnish language corpora. It houses digitised text, video and sound corpora available to researchers and students. Most of the data is organised into digitised annotated corpora.

We have developed five digital corpora which have been morphologically and syntactically annotated. These corpora can be accessed through the Language Bank of Finland. The collected data ranges from the earliest written Finnish to modern everyday conversations, and covers Finnish produced by native speakers as well as by non-native learners of Finnish. In addition to these corpora, the Syntax Archive houses audio, video and text data available to researchers and students. The Syntax Archive participates in the Digilang project, which aims to develop the digital language resources at the School of Languages and Translation Studies.

Grammatically annotated corpora

Dialect corpus

Arkisyn corpus

The Arkisyn corpus contains Finnish everyday conversations which have been morphologically and syntactically annotated. The data comes from the Conversation Analysis Archive at the University of Helsinki and the Finnish language Recording Archive at the University of Turku.

The corpus currently contains 46,808 clauses, 6,246 NPs, 18,583 particles or particle chains, 4,969 grammatical fragments, and 279,023 words.

The Agricola corpus

Corpus of advanced learners of Finnish (LAS2)

The Corpus of advanced learners of Finnish contains written data produced by advanced learners of Finnish in various academic settings. Additionally, the corpus contains reference material produced by native speakers of Finnish. All the material has been morphologically and syntactically annotated.

LAS2 contains 41,628 clauses and 271,331 words.
The reference corpus of native speakers contains 176,526 words.

Corpus of Academic Finnish (LAS1)

The Sapu Corpus

The Sapu* corpus contains 21st-century colloquial Finnish spoken in the province of Satakunta. Recorded interviews and conversations were collected in 2007–2013 and 2016–2019 in the Sapu project. During these years, over 300 recordings (totalling 262 hours) were made for the project, and over 200 hours of this data were transcribed for dialectological and sociolinguistic analyses. A set of 35 recordings was selected for the Sapu Corpus, and this dataset was lemmatized and morphologically and syntactically annotated.

These 35 data units represent six dialects (three representing Southwestern Dialects and three Transitional Dialects) and five age cohorts.
Total duration: 1,912 minutes.

* Sapu is an abbreviation of Satakuntalaisuus puheessa ('Satakunta in the Speech'), the Finnish name of the project. The official English name of the project is "Linguistic Variation in the Province of Satakunta in the 21st Century".

Non-annotated corpora

The non-annotated corpora include recordings and transcriptions from the Satakunta region (Sapu), comprising 255 transcriptions that cover over 200 hours of recordings, and the data from the Prosovar project, which studies regional variation in Finnish prosody.

Sound and video recordings

The Syntax Archive houses the sound and video recordings of the Finnish language recording archive at the University of Turku. All the data is available in digital format. In total, the data comprises more than 5,600 hours of recordings.

Contact information

Visiting address
Hämeenkatu 1, Turku

Postal address
Syntax Archive
Department of Finnish and Finno-Ugric Studies
FI-20014 University of Turku, Finland

Keywords

Finnish Language and Finno-Ugric Languages