Towards transparency and knowledge exchange in AI-assisted data analysis code generation

Generative artificial intelligence (AI), and large language models (LLMs) in particular, are changing the way we do data science. Most prominently, scientists use the technology for interacting with scientific data [1], answering data analysis questions [2,3], generating data analysis code [4,5,6], and (re-)writing scientific manuscripts [7]. Unfortunately, the prompts sent to LLMs are commonly not preserved, so at the time of publication it can be hard to distinguish the human-made from the AI-generated parts of a scientific work. Contemporary scientific culture has no established professional peer-review system for documenting how LLM-generated code was prompted and which human reviewed it. Such systems do exist, however, for collaborative code editing among multiple humans. For example, the source code hosting platforms GitHub and GitLab are well established in the open-source software community for discussing issues and potential solutions, building code together, and peer-reviewing content. Since LLMs have been shown to solve real-world GitHub issues [8], developing an AI assistant that interacts with humans directly within the GitHub platform is an obvious next step.

Here, I present git-bob, a GitHub/GitLab integration of an LLM-based AI assistant that can respond to GitHub issues, discuss potential solutions with humans iteratively, write code for them, and submit it as a pull request to be reviewed by humans. It is technically similar to various online services for data analysis, such as the OpenAI ChatGPT Data Analyst or GitHub Copilot workflows, with three major differences. First, multiple humans can interact with git-bob in one communication thread. This brings domain specialists, such as life scientists, together with data analysts and the AI assistant in one discussion, stimulating knowledge exchange on how to interact properly with the AI assistant. Second, discussions with git-bob and the resulting code modifications are preserved on an online platform that others can read and follow, making the interaction with the AI assistant fully transparent. Third, git-bob is completely open-source and extensible. Other developers can read its built-in system prompts and modify them to suit their needs. Developers can also implement custom connectors to other LLM service providers and write plugins for their custom AI agents, which may handle GitHub issues differently.
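
The extensibility described above can be illustrated with a minimal sketch. It is not git-bob's actual code or API: the names `Comment`, `CONNECTORS`, `register_connector`, `build_prompt`, and `respond` are hypothetical and chosen only for illustration, and the "echo" connector is a stand-in for a call to a real LLM service provider. The sketch assumes that a multi-participant issue thread is flattened into one prompt, prefixed by an editable system prompt, and handed to whichever connector is selected.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Comment:
    """One message in an issue thread (hypothetical data model)."""
    author: str
    body: str


# Hypothetical registry: connector name -> function mapping a prompt to a reply.
CONNECTORS: Dict[str, Callable[[str], str]] = {}


def register_connector(name: str):
    """Decorator that adds a connector function to the registry."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        CONNECTORS[name] = func
        return func
    return decorator


@register_connector("echo")
def echo_connector(prompt: str) -> str:
    """Stand-in for a real LLM call: echoes the last line of the prompt."""
    return "You said: " + prompt.splitlines()[-1]


def build_prompt(system_prompt: str, thread: List[Comment]) -> str:
    """Flatten the system prompt and all comments into a single prompt."""
    lines = [system_prompt]
    for comment in thread:
        lines.append(f"{comment.author}: {comment.body}")
    return "\n".join(lines)


def respond(thread: List[Comment],
            connector: str = "echo",
            system_prompt: str = "You are a helpful data analysis assistant.") -> str:
    """Answer a thread using the selected connector."""
    prompt = build_prompt(system_prompt, thread)
    return CONNECTORS[connector](prompt)


# Two humans taking part in the same discussion thread.
thread = [
    Comment("alice", "How do I segment nuclei?"),
    Comment("bob", "Please suggest code."),
]
reply = respond(thread)
```

In this sketch, supporting another provider would only require registering one more function under a new name, and changing the assistant's behaviour would only require passing a different `system_prompt`; how git-bob itself organises connectors and plugins may differ.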
