Dissertation defence (Computer Science): FM Jenna Kanerva
Time
15.3.2024 at 11.00 - 15.00
FM Jenna Kanerva defends the dissertation in Computer Science titled “Understanding the Structure and Meaning of Finnish Texts: From Corpus Creation to Deep Language Modelling” at the University of Turku on 15 March 2024 at 11.00 (University of Turku, Educarium, Edu2, Assistentinkatu 5, Turku).
The audience can participate in the defence by remote access: https://echo360.org.uk/section/bcc6b487-8fee-445c-a37f-eba6fa28d8a0/public
Opponent: Associate Professor Kairit Sirts (University of Tartu, Estonia)
Custos: Professor Tapio Salakoski (University of Turku)
Doctoral Dissertation at UTUPub: https://urn.fi/URN:ISBN:978-951-29-9623-0
***
Summary of the Doctoral Dissertation:
Natural Language Processing (NLP) is a field that aims to develop methods for analysing, understanding or generating human language. The primary aim of this thesis is to advance NLP in Finnish by providing more resources and investigating machine-learning-based practices for their use. While NLP includes various topics involving textual or speech data, this thesis specifically focuses on understanding the structure and meaning of written language. The research concentrates on structural and grammatical analysis (syntactic parsing) as well as on statements that convey the same meaning but use different words (paraphrase modelling).
The first set of contributions of this thesis centres on the development of a state-of-the-art Finnish parser, a tool for analysing Finnish text by its grammatical structure. The overall outcome of this line of research is a machine-learned tool that approaches human performance in analysing standard written Finnish. Major advances were obtained by using pre-trained neural language models.
The success of large language models in syntactic parsing, as well as in many other tasks, raises the question of whether these models genuinely comprehend language. However, datasets designed to measure semantic comprehension in Finnish have been non-existent or very scarce. To address this limitation, the second part of the thesis shifts its focus to language understanding through paraphrase modelling. The second contribution of the thesis is the creation of a novel, large-scale, manually annotated corpus of Finnish paraphrases, which can be used, for example, to measure the ability of language models to handle variation in expressing similar ideas.
University Communications