Large Language Models Being Used In Thematic Analysis: How It Works
by @textmodels

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Jakub DRÁPAL, Institute of State and Law of the Czech Academy of Sciences, Czechia, Institute of Criminal Law and Criminology, Leiden University, the Netherlands;

(2) Hannes WESTERMANN, Cyberjustice Laboratory, Université de Montréal, Canada;

(3) Jaromir SAVELKA, School of Computer Science, Carnegie Mellon University, USA.

Abstract & Introduction

Related Work

Dataset

Proposed Framework

Experimental Design

Results and Discussion

Conclusions, Future Work and References

There have been several studies exploring the use of LLMs in thematic analysis. De Paoli evaluated to what extent GPT-3.5 can carry out a full-blown thematic analysis of semi-structured interviews, finding that the LLM was indeed able to perform some of the steps while also cautioning about the methodological implications of using the approach [11].


Gao et al. developed a collaborative coding platform powered by GPT-3.5 that provides code and code group suggestions to support the process of defining a codebook [12]. Gamieldien et al. used GPT-3.5 to generate codes for automatically clustered comments, finding that the produced codes were granular but not coherent, as similar clusters were assigned very different names [13].


There is a long tradition of studies identifying patterns in criminal justice data, including those focused on offense categories. The studies typically employed content analysis together with approaches such as factor, latent profile or cluster analyses. Santtila et al. identified 14 types of burglaries from the descriptions of crime scene behavior [14]. Higgs et al. collated descriptions of 700 sexual murderers to describe the overall patterns and motives underlying the offense [15].


Canter et al. performed a thematic classification of stranger rapes [16]. Gřivna and Drápal focused on criminal offenses involving computer data and systems (cybercrime) in the Czech Republic, identifying the most frequent types of such criminal behavior [17].


There is a similar tradition focused on discovering stereotypical patterns in court opinions in AI & Law. Ashley identified factors from trade secret law through reading cases and doctrine [18]. Similar analysis was performed by Gray et al. to discover typical factors of suspicion considered in auto stop cases [19].


Westermann et al. used the grounded theory approach (a close kin of thematic analysis) to discover relevant factors considered by judges in certain types of landlord-tenant disputes [20].


Notably, Salaun et al. used a topic modelling approach in the same domain to identify the factors automatically, finding that 33% of the discovered topics were relevant [4]. To the best of our knowledge, that study is the state-of-the-art attempt at inductive coding of legal texts. Our work differs in that it subscribes to a well-established framework (i.e., thematic analysis) and uses GPT-4, which enables a subject matter expert to drive and influence the analysis through specified research questions and instructions.


Multiple frameworks have been proposed to support legal experts in the deductive coding of legal texts. Branting et al. proposed manual annotation of factors in a small number of cases, which were then projected across a much larger dataset [2]. Westermann et al. described an approach where legal experts formulated sets of complex search terms (as classifiers) based on constantly updated dataset statistics [1].


Westermann et al. also proposed a framework utilizing sentence embeddings and similarity retrieval to support annotators in annotating legal documents [3]. Recently, Savelka et al. explored performing annotations with LLMs (GPT-3.5 and GPT-4) in zero-shot settings by providing the model with excerpts from annotation guidelines [21,22,23].
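To illustrate the general idea of similarity retrieval for annotation support, the following is a minimal sketch and not the framework described in [3]. It assumes a toy bag-of-words vector in place of a learned sentence embedding, and the function names, the stop-word list, and the example sentences are all hypothetical. Given one sentence an annotator has already labeled, it ranks unlabeled sentences by cosine similarity so that likely candidates for the same label surface first.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stop-word list, chosen only for this illustration.
STOPWORDS = {"the", "to", "a", "of", "for", "has", "not", "did", "since"}

def embed(sentence):
    """Toy stand-in for a sentence embedding: bag-of-words counts."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query, candidates, top_k=2):
    """Return the top_k (score, sentence) pairs, most similar first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical sentences, invented for this example.
unlabeled = [
    "The tenant did not pay rent for three months.",
    "The landlord failed to repair the heating.",
    "The tenant has not paid the rent since March.",
]
labeled_example = "The tenant failed to pay the rent."

suggestions = most_similar(labeled_example, unlabeled, top_k=2)
for score, sentence in suggestions:
    print(f"{score:.2f}  {sentence}")
```

In practice the `embed` function would be replaced by a trained sentence encoder; the ranking step stays the same.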


In this paper, we perform deductive coding when predicting the themes for the descriptions of case facts (RQ3), as one of the steps in a predominantly inductive coding-based analysis.
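As a rough illustration of what deductive coding with an LLM can look like in a zero-shot setting, the sketch below builds a prompt from a fixed list of themes and maps the model's reply back to one of them. It is not the prompt or pipeline used in this paper: the theme names, their definitions, the prompt wording, and the helper names are all assumptions made for the example, and the call to the model itself is deliberately left out.

```python
def build_prompt(themes, facts):
    """Compose a zero-shot prompt from (name, definition) pairs and a case description."""
    lines = [
        "Assign exactly one theme from the list to the case description.",
        "",
        "Themes:",
    ]
    for i, (name, definition) in enumerate(themes, start=1):
        lines.append(f"{i}. {name}: {definition}")
    lines += ["", "Case description:", facts, "", "Answer with the theme name only."]
    return "\n".join(lines)

def parse_theme(reply, themes):
    """Map a free-text reply to a known theme name, or None if nothing matches."""
    cleaned = reply.strip().strip(".").lower()
    for name, _ in themes:
        if name.lower() == cleaned:
            return name
    return None

# Hypothetical themes and facts, invented for this example.
themes = [
    ("Shoplifting", "Goods taken from a store during opening hours."),
    ("Pickpocketing", "Items taken from a person's pocket or bag."),
]
facts = "The defendant took a wallet from the victim's coat pocket on a tram."

prompt = build_prompt(themes, facts)
print(prompt)

# The reply would come from an LLM; a hard-coded string stands in for it here.
print(parse_theme(" Pickpocketing. ", themes))
```

Returning `None` for an unrecognized reply keeps out-of-list answers visible to the expert instead of silently forcing them into a theme.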

