1. We introduce TopicGPT, a prompt-based framework that uses large language models (LLMs) to uncover latent topics in a text collection. TopicGPT generates interpretable topics, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions.
2. TopicGPT works in three main stages: topic generation, topic refinement, and topic assignment.
3. TopicGPT produces topics that align better with human categorizations compared to competing methods: it achieves a harmonic mean purity of 0.74 against human-annotated Wikipedia topics compared to 0.64 for the strongest baseline.
You can download the raw datasets used in the paper (Bills and Wiki) from the following link: Dataset Link.
Otherwise, prepare your `.jsonl` input data file in the following format:

```json
{
    "id": "ID (optional)",
    "text": "Document",
    "label": "Ground-truth label (optional)"
}
```
Check out demo.ipynb for a complete pipeline and more detailed instructions. We advise trying a subset with more affordable (or open-source) models before scaling to the full dataset.
Metric calculation functions are available in `topicgpt_python.metrics` to evaluate topic alignment with ground-truth labels (Adjusted Rand Index, Harmonic Purity, Normalized Mutual Information).
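As a rough sketch of what harmonic mean purity measures (the exact function names inside `topicgpt_python.metrics` are not assumed here; ARI and NMI have standard implementations in scikit-learn as `adjusted_rand_score` and `normalized_mutual_info_score`):

```python
from collections import Counter

def purity(gold, pred):
    """Fraction of documents whose predicted cluster's majority gold label covers them."""
    clusters = {}
    for g, p in zip(gold, pred):
        clusters.setdefault(p, []).append(g)
    return sum(Counter(members).most_common(1)[0][1]
               for members in clusters.values()) / len(gold)

gold = ["politics", "politics", "science", "science"]
pred = ["topic_a", "topic_a", "topic_b", "topic_a"]

p = purity(gold, pred)    # purity of predicted clusters w.r.t. gold labels
ip = purity(pred, gold)   # inverse purity (swap the roles)
harmonic_purity = 2 * p * ip / (p + ip)
```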
Our package supports OpenAI API, Google Cloud Vertex AI API, Gemini API, Azure API, and vLLM inference. vLLM requires GPUs to run. Please refer to OpenAI API pricing or to Vertex API pricing for cost details.
1. Make a new Python 3.9+ environment using virtualenv or conda.
2. Install the required packages:

```shell
pip install --upgrade topicgpt_python
```
3. Set environment variables:

```shell
# Needed only for the OpenAI API deployment
export OPENAI_API_KEY={your_openai_api_key}

# Needed only for the Vertex AI deployment
export VERTEX_PROJECT={your_vertex_project}    # e.g. my-project
export VERTEX_LOCATION={your_vertex_location}  # e.g. us-central1

# Needed only for the Gemini deployment
export GEMINI_API_KEY={your_gemini_api_key}

# Needed only for the Azure API deployment
export AZURE_OPENAI_API_KEY={your_azure_api_key}
export AZURE_OPENAI_ENDPOINT={your_azure_endpoint}
```
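A small sanity check before running the pipeline can save a failed (and billed) run; this helper is illustrative, not part of the package:

```python
import os

REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "vertex": ["VERTEX_PROJECT", "VERTEX_LOCATION"],
    "gemini": ["GEMINI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
    "vllm": [],  # local inference, no API credentials needed
}

def missing_env_vars(api: str) -> list:
    """Return the environment variables still unset for the chosen backend."""
    return [v for v in REQUIRED_VARS[api] if v not in os.environ]
```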
4. (Optional) Define I/O paths in `config.yml`.
5. (Optional) Run the following code snippet to load the configuration file:

```python
from topicgpt_python import *
import yaml

with open("config.yml", "r") as f:
    config = yaml.safe_load(f)
```
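Once loaded, `config` is a plain Python dict keyed by whatever sections your `config.yml` defines. A minimal sketch with hypothetical keys (the package does not prescribe these names):

```python
import yaml

# Hypothetical config.yml contents -- key names are illustrative only
example = """
data_sample: data/input/sample.jsonl
verbose: true
"""
config = yaml.safe_load(example)
print(config["data_sample"])  # -> data/input/sample.jsonl
```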
Generate high-level topics.

```python
generate_topic_lvl1(
    api, model, data, prompt_file, seed_file, out_file, topic_file, verbose
)
```
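As a concrete (hypothetical) invocation, with the model name and file paths chosen purely for illustration rather than taken from the package defaults:

```python
# Illustrative arguments for generate_topic_lvl1 -- adjust every path to your setup
args = dict(
    api="openai",
    model="gpt-4o",
    data="data/input/sample.jsonl",
    prompt_file="prompt/generation_1.txt",
    seed_file="prompt/seed_1.md",
    out_file="data/output/generation_1.jsonl",
    topic_file="data/output/generation_1.md",
    verbose=True,
)
# generate_topic_lvl1(**args)  # uncomment once topicgpt_python is installed
```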
Generate subtopics for each top-level topic.

```python
generate_topic_lvl2(
    api, model, seed_file, data, prompt_file, out_file, topic_file, verbose
)
```
Refine topics by merging and updating them based on the API response.

```python
refine_topics(
    api, model, prompt_file, generation_file, topic_file, out_file, updated_file, verbose, remove, mapping_file
)
```
Assign topics to a list of documents.

```python
assign_topics(
    api, model, data, prompt_file, out_file, topic_file, verbose
)
```
Correct hallucinated topic assignments or errors. Rerun the `correct_topics` module until there are no more hallucinations.

```python
correct_topics(
    api, model, data_path, prompt_path, topic_path, output_path, verbose
)
```
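The correction loop can be sketched as follows; `count_hallucinations` is a hypothetical helper (e.g. one that compares assigned labels against the final topic file), not part of the package:

```python
def run_until_clean(correct_fn, count_hallucinations, max_rounds=5):
    """Rerun the correction step until no hallucinated assignments remain."""
    for _ in range(max_rounds):
        correct_fn()  # e.g. a wrapper around correct_topics(...) with your paths
        if count_hallucinations() == 0:
            return True  # every assignment now matches a real topic
    return False  # gave up after max_rounds; inspect the output manually
```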
```bibtex
@misc{pham2024topicgptpromptbasedtopicmodeling,
  title={TopicGPT: A Prompt-based Topic Modeling Framework},
  author={Chau Minh Pham and Alexander Hoyle and Simeng Sun and Philip Resnik and Mohit Iyyer},
  year={2024},
  eprint={2311.01449},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2311.01449},
}
```