1. We introduce TopicGPT, a prompt-based framework that uses large language models (LLMs) to uncover latent topics in a text collection. TopicGPT generates interpretable topics, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions.
2. TopicGPT works in three main stages: topic generation, topic refinement, and topic assignment.
3. TopicGPT produces topics that align better with human categorizations compared to competing methods: it achieves a harmonic mean purity of 0.74 against human-annotated Wikipedia topics compared to 0.64 for the strongest baseline.
You can download the raw datasets used in the paper (Bills and Wiki) from the following link: Dataset Link.
Otherwise, prepare your `.jsonl` input data file in the following format:

```json
{
    "id": "ID (optional)",
    "text": "Document",
    "label": "Ground-truth label (optional)"
}
```
Check out demo.ipynb for a complete pipeline and more detailed instructions. We advise trying a subset with more affordable (or open-source) models before scaling to the full dataset.
Metric calculation functions are available in `topicgpt_python.metrics` to evaluate topic alignment with ground-truth labels (Adjusted Rand Index, Harmonic Purity, Normalized Mutual Information).
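As a rough sketch of what harmonic mean purity measures (the exact function names inside `topicgpt_python.metrics` are not assumed here; ARI and NMI have standard implementations in scikit-learn as `adjusted_rand_score` and `normalized_mutual_info_score`):

```python
from collections import Counter

def purity(gold, pred):
    """Fraction of documents whose predicted cluster's majority gold label covers them."""
    clusters = {}
    for g, p in zip(gold, pred):
        clusters.setdefault(p, []).append(g)
    return sum(Counter(members).most_common(1)[0][1]
               for members in clusters.values()) / len(gold)

gold = ["politics", "politics", "science", "science"]
pred = ["topic_a", "topic_a", "topic_b", "topic_a"]

p = purity(gold, pred)    # purity of predicted clusters w.r.t. gold labels
ip = purity(pred, gold)   # inverse purity (swap the roles)
harmonic_purity = 2 * p * ip / (p + ip)
```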
Our package supports OpenAI API, Google Cloud Vertex AI API, Gemini API, Azure API, and vLLM inference. vLLM requires GPUs to run. Please refer to OpenAI API pricing or to Vertex API pricing for cost details.
1. Make a new Python 3.9+ environment using virtualenv or conda.
2. Install the required packages:

```shell
pip install --upgrade topicgpt_python
```
3. Set environment variables:

```shell
# Needed only for the OpenAI API deployment
export OPENAI_API_KEY={your_openai_api_key}

# Needed only for the Vertex AI deployment
export VERTEX_PROJECT={your_vertex_project}    # e.g. my-project
export VERTEX_LOCATION={your_vertex_location}  # e.g. us-central1

# Needed only for the Gemini deployment
export GEMINI_API_KEY={your_gemini_api_key}

# Needed only for the Azure API deployment
export AZURE_OPENAI_API_KEY={your_azure_api_key}
export AZURE_OPENAI_ENDPOINT={your_azure_endpoint}
```
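A small sanity check before running the pipeline can save a failed (and billed) run; this helper is illustrative, not part of the package:

```python
import os

REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "vertex": ["VERTEX_PROJECT", "VERTEX_LOCATION"],
    "gemini": ["GEMINI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
    "vllm": [],  # local inference, no API credentials needed
}

def missing_env_vars(api: str) -> list:
    """Return the environment variables still unset for the chosen backend."""
    return [v for v in REQUIRED_VARS[api] if v not in os.environ]
```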
4. (Optional) Define I/O paths in `config.yml`.
5. (Optional) Run the following code snippet to load the configuration file:

```python
from topicgpt_python import *
import yaml

with open("config.yml", "r") as f:
    config = yaml.safe_load(f)
```
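Once loaded, `config` is a plain Python dict keyed by whatever sections your `config.yml` defines. A minimal sketch with hypothetical keys (the package does not prescribe these names):

```python
import yaml

# Hypothetical config.yml contents -- key names are illustrative only
example = """
data_sample: data/input/sample.jsonl
verbose: true
"""
config = yaml.safe_load(example)
print(config["data_sample"])  # -> data/input/sample.jsonl
```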
Generate high-level topics.

```python
generate_topic_lvl1(
    api, model, data, prompt_file, seed_file, out_file, topic_file, verbose
)
```
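As a concrete (hypothetical) invocation, with the model name and file paths chosen purely for illustration rather than taken from the package defaults:

```python
# Illustrative arguments for generate_topic_lvl1 -- adjust every path to your setup
args = dict(
    api="openai",
    model="gpt-4o",
    data="data/input/sample.jsonl",
    prompt_file="prompt/generation_1.txt",
    seed_file="prompt/seed_1.md",
    out_file="data/output/generation_1.jsonl",
    topic_file="data/output/generation_1.md",
    verbose=True,
)
# generate_topic_lvl1(**args)  # uncomment once topicgpt_python is installed
```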
Generate subtopics for each top-level topic.

```python
generate_topic_lvl2(
    api, model, seed_file, data, prompt_file, out_file, topic_file, verbose
)
```
Refine topics by merging and updating them based on the API response.

```python
refine_topics(
    api, model, prompt_file, generation_file, topic_file, out_file, updated_file, verbose, remove, mapping_file
)
```
Assign topics to a list of documents.

```python
assign_topics(
    api, model, data, prompt_file, out_file, topic_file, verbose
)
```
Correct hallucinated topic assignments or errors. Rerun the `correct_topics` module until there are no more hallucinations.

```python
correct_topics(
    api, model, data_path, prompt_path, topic_path, output_path, verbose
)
```
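The correction loop can be sketched as follows; `count_hallucinations` is a hypothetical helper (e.g. one that compares assigned labels against the final topic file), not part of the package:

```python
def run_until_clean(correct_fn, count_hallucinations, max_rounds=5):
    """Rerun the correction step until no hallucinated assignments remain."""
    for _ in range(max_rounds):
        correct_fn()  # e.g. a wrapper around correct_topics(...) with your paths
        if count_hallucinations() == 0:
            return True  # every assignment now matches a real topic
    return False  # gave up after max_rounds; inspect the output manually
```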
```bibtex
@misc{pham2024topicgptpromptbasedtopicmodeling,
  title={TopicGPT: A Prompt-based Topic Modeling Framework},
  author={Chau Minh Pham and Alexander Hoyle and Simeng Sun and Philip Resnik and Mohit Iyyer},
  year={2024},
  eprint={2311.01449},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2311.01449},
}
```