Ashoka University
Abstract
Social science research increasingly demands data-driven insights, yet researchers often face barriers such as lack of technical expertise, inconsistent data formats, and limited access to reliable datasets. In this paper, we present a Datalake infrastructure tailored to the needs of interdisciplinary social science research. Our system supports ingestion and integration of diverse data types, automatic provenance and version tracking, role-based access control, and built-in tools for visualization and analysis. We demonstrate the utility of our Datalake using real-world use cases spanning governance, health, and education. A detailed walkthrough of one such use case -- analyzing the relationship between income, education, and infant mortality -- shows how our platform streamlines the research process while maintaining transparency and reproducibility. We argue that such infrastructure can democratize access to advanced data science practices, especially for NGOs, students, and grassroots organizations. The Datalake continues to evolve with plans to support ML pipelines, mobile access, and citizen data feedback mechanisms.
AI Summary
- The Datalake is a cloud-based platform that enables users to perform simple and complex analytical tasks on multiple datasets. [3]
- It provides an end-to-end solution for data management, including provenance and version tracking. [3]
- The ease of use of the Datalake is key in democratizing access to data and good data science practices. [3]
- Datalake: A cloud-based platform that enables users to perform simple and complex analytical tasks on multiple datasets. [3]
- Provenance: The origin or history of a dataset or its components. [3]
- Version tracking: The process of keeping track of changes made to a dataset over time. [3]
- Funding support by Mphasis AI Lab at Ashoka University. [3]
- Access to data and analysis tools are the most important factors in lowering barriers for NGOs, grassroots organizations, and students who may not be well-versed in using computer science tools for data processing. [2]
- Limited user engagement with social scientists. [1]
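The provenance and version-tracking definitions above can be made concrete with a minimal sketch. It is not the Datalake's actual implementation, which the abstract does not describe; the class names `DatasetVersion` and `TrackedDataset`, the `commit` method, the `source` labels, and the sample rows are all hypothetical and chosen only for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable snapshot of a dataset (hypothetical structure)."""
    version: int            # 1-based version number
    checksum: str           # SHA-256 of the rows in canonical JSON form
    source: str             # provenance: where this snapshot came from
    parent: Optional[int]   # provenance: version this snapshot supersedes
    created_at: str         # UTC timestamp in ISO 8601 format


@dataclass
class TrackedDataset:
    """Keeps an append-only history of snapshots for one named dataset."""
    name: str
    history: List[DatasetVersion] = field(default_factory=list)

    def commit(self, rows, source):
        """Record a new version only if the content actually changed."""
        payload = json.dumps(rows, sort_keys=True).encode("utf-8")
        checksum = hashlib.sha256(payload).hexdigest()
        if self.history and self.history[-1].checksum == checksum:
            return self.history[-1]  # identical content: no new version
        parent = self.history[-1].version if self.history else None
        snapshot = DatasetVersion(
            version=len(self.history) + 1,
            checksum=checksum,
            source=source,
            parent=parent,
            created_at=datetime.now(timezone.utc).isoformat(),
        )
        self.history.append(snapshot)
        return snapshot


# Placeholder rows, not real data.
ds = TrackedDataset("example_dataset")
v1 = ds.commit([{"region": "A", "value": 1}], source="initial upload")
v2 = ds.commit([{"region": "A", "value": 1}], source="re-upload")
v3 = ds.commit([{"region": "A", "value": 2}], source="manual correction")
```

In this sketch, re-uploading identical content returns the existing version rather than creating a new one, and each new version records the version it supersedes, so the chain of `parent` and `source` fields gives a simple history of how the dataset changed.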
Abertay
Abstract
This paper reflects on the literature that rejects the use of Large Language Models (LLMs) in qualitative data analysis. It illustrates, through empirical evidence as well as critical reflections, why the current critical debate is focusing on the wrong problems. The paper proposes that research on the use of LLMs for qualitative analysis should focus not on the method per se, but on the empirical investigation of an artificial system performing an analysis. The paper builds on the seminal work of Alan Turing and reads the current debate using key ideas from Turing's "Computing Machinery and Intelligence". It therefore reframes the debate on qualitative analysis with LLMs and argues that, rather than asking whether machines can perform qualitative analysis in principle, we should ask whether LLMs can produce analyses that are sufficiently comparable to those of human analysts. In the final part, the contrary views on performing qualitative analysis with LLMs are analysed using the same writing and rhetorical style that Turing used in his seminal work to discuss the contrary views to the main question.