TopicGPT: A Prompt-based Topic Modeling Framework

Chau Minh Pham🔍, Alexander Hoyle🔦, Simeng Sun🔍,
Philip Resnik🔦, Mohit Iyyer🔍
🔍University of Massachusetts Amherst
🔦University of Maryland College Park
[Paper] [Code]


TL;DR


1. We introduce TopicGPT, a prompt-based framework that uses large language models (LLMs) to uncover latent topics in a text collection. TopicGPT generates interpretable topics, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions.

2. TopicGPT works in three main stages: topic generation, topic refinement, and topic assignment (each corresponding to a module below).

3. TopicGPT produces topics that align better with human categorizations than competing methods: it achieves a harmonic mean purity of 0.74 against human-annotated Wikipedia topics, versus 0.64 for the strongest baseline.


Data Preparation


You can download the raw datasets used in the paper (Bills and Wiki) from the following link: Dataset Link.

Otherwise, prepare your .jsonl input data file, where each line is a JSON object with the following format:


  {
      "id": "ID (optional)",
      "text": "Document",
      "label": "Ground-truth label (optional)"
  }
      
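For example, here is a minimal sketch of writing such a file from an in-memory list of documents (the documents and output path below are hypothetical):

  import json

  # Hypothetical documents; replace with your own collection.
  documents = [
      {"id": "doc-1", "text": "The committee advanced a new infrastructure bill...", "label": "Transportation"},
      {"id": "doc-2", "text": "The national team won the championship final...", "label": "Sports"},
  ]

  # Write one JSON object per line (.jsonl).
  with open("data/input/sample.jsonl", "w") as f:
      for doc in documents:
          f.write(json.dumps(doc) + "\n")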


Setting up


Check out demo.ipynb for a complete pipeline walkthrough and more detailed instructions. We advise trying the pipeline on a subset of your data with more affordable (or open-source) models before scaling up to the full dataset.

Metric calculation functions are available in topicgpt_python.metrics to evaluate topic alignment with ground-truth labels (Adjusted Rand Index, Harmonic Purity, Normalized Mutual Information).
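
As an illustration of what these metrics capture, here is a standalone sketch using scikit-learn (this is not the package's own metrics API, and the label lists are hypothetical):

  from collections import Counter

  from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

  def harmonic_purity(true_labels, pred_labels):
      """Harmonic mean of purity and inverse purity between two labelings."""
      def purity(clusters, classes):
          # For each cluster, count its overlap with the best-matching class,
          # then normalize by the total number of documents.
          total = 0
          for cluster in set(clusters):
              members = [classes[i] for i, c in enumerate(clusters) if c == cluster]
              total += Counter(members).most_common(1)[0][1]
          return total / len(clusters)

      p = purity(pred_labels, true_labels)   # purity of assigned topics w.r.t. gold labels
      ip = purity(true_labels, pred_labels)  # inverse purity
      return 2 * p * ip / (p + ip)

  # Hypothetical gold labels and assigned topic labels for five documents.
  true_labels = ["Health", "Health", "Education", "Education", "Defense"]
  pred_labels = ["Medicine", "Medicine", "Schools", "Medicine", "Military"]

  print("ARI:", adjusted_rand_score(true_labels, pred_labels))
  print("NMI:", normalized_mutual_info_score(true_labels, pred_labels))
  print("Harmonic purity:", harmonic_purity(true_labels, pred_labels))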

Our package supports the OpenAI API, Google Cloud Vertex AI API, Gemini API, Azure OpenAI API, and local vLLM inference. vLLM requires GPUs to run. Please refer to OpenAI API pricing or Vertex AI pricing for cost details.

1. Make a new Python 3.9+ environment using virtualenv or conda.

2. Install the required packages: pip install --upgrade topicgpt_python

3. Set environment variables:


  # Needed only for the OpenAI API deployment
  export OPENAI_API_KEY={your_openai_api_key}
  
  # Needed only for the Vertex AI deployment
  export VERTEX_PROJECT={your_vertex_project}   # e.g. my-project
  export VERTEX_LOCATION={your_vertex_location} # e.g. us-central1
  
  # Needed only for the Gemini API deployment
  export GEMINI_API_KEY={your_gemini_api_key}
  
  # Needed only for the Azure API deployment
  export AZURE_OPENAI_API_KEY={your_azure_api_key}
  export AZURE_OPENAI_ENDPOINT={your_azure_endpoint}
      

4. (Optional) Define I/O paths in config.yml.

5. (Optional) Run the following code snippet to load the configuration file:


  from topicgpt_python import *
  import yaml
  
  with open("config.yml", "r") as f:
      config = yaml.safe_load(f)
      

Generating Topics


Function: generate_topic_lvl1

Generate high-level topics.


  generate_topic_lvl1(
    api, model, data, prompt_file, seed_file, out_file, topic_file, verbose
  )                
      
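A hypothetical call, using keyword arguments for clarity (the api string, model name, and file paths are placeholders, not documented values; adjust them to your setup):

  from topicgpt_python import generate_topic_lvl1

  generate_topic_lvl1(
      api="openai",                               # placeholder backend choice
      model="gpt-4o",                             # placeholder model name
      data="data/input/sample.jsonl",             # input documents (see Data Preparation)
      prompt_file="prompt/generation_1.txt",      # hypothetical generation prompt path
      seed_file="prompt/seed_1.md",               # hypothetical seed-topic file path
      out_file="data/output/generation_1.jsonl",  # per-document generation output
      topic_file="data/output/generation_1.md",   # generated topic list
      verbose=True,
  )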

Function: generate_topic_lvl2

Generate subtopics for each top-level topic.


  generate_topic_lvl2(
    api, model, seed_file, data, prompt_file, out_file, topic_file, verbose
  )
      
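A hypothetical call (paths and model name are placeholders; the level-1 topic list is assumed to serve as the seed, and the level-1 output as the data):

  from topicgpt_python import generate_topic_lvl2

  generate_topic_lvl2(
      api="openai",
      model="gpt-4o",
      seed_file="data/output/generation_1.md",    # assumption: level-1 topics as the seed
      data="data/output/generation_1.jsonl",      # assumption: level-1 generation output
      prompt_file="prompt/generation_2.txt",      # hypothetical subtopic prompt path
      out_file="data/output/generation_2.jsonl",
      topic_file="data/output/generation_2.md",
      verbose=True,
  )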

Refining Topics


If topics are generated by a weaker model, the output may contain irrelevant or redundant topics. This module refines the generated topic list by merging redundant topics and removing infrequent ones.
Function: refine_topics

Refine the generated topics by merging redundant ones and updating the topic list based on the model's responses.


  refine_topics(
    api, model, prompt_file, generation_file, topic_file, out_file, updated_file, verbose, remove, mapping_file
  )
      
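A hypothetical call (paths are placeholders; the roles of remove and mapping_file are assumptions, so check the package documentation):

  from topicgpt_python import refine_topics

  refine_topics(
      api="openai",
      model="gpt-4o",
      prompt_file="prompt/refinement.txt",               # hypothetical refinement prompt path
      generation_file="data/output/generation_1.jsonl",  # output of topic generation
      topic_file="data/output/generation_1.md",          # topic list to refine
      out_file="data/output/refinement.md",              # refined topic list
      updated_file="data/output/refinement.jsonl",       # relabeled documents (assumption)
      verbose=True,
      remove=False,                                      # assumption: whether to drop infrequent topics
      mapping_file="data/output/refinement_mapping.txt", # assumption: original-to-refined topic mapping
  )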

Assigning Topics


Function: assign_topics

Assign topics to a list of documents.


  assign_topics(
    api, model, data, prompt_file, out_file, topic_file, verbose
  )
      
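A hypothetical call (paths and model name are placeholders):

  from topicgpt_python import assign_topics

  assign_topics(
      api="openai",
      model="gpt-4o-mini",                       # placeholder; a cheaper model than the generator
      data="data/input/sample.jsonl",            # documents to label
      prompt_file="prompt/assignment.txt",       # hypothetical assignment prompt path
      out_file="data/output/assignment.jsonl",   # per-document topic assignments
      topic_file="data/output/refinement.md",    # (refined) topic list to assign from
      verbose=True,
  )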

Function: correct_topics

Correct hallucinated or otherwise invalid topic assignments.


  correct_topics(
    api, model, data_path, prompt_path, topic_path, output_path, verbose
  ) 
      
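A hypothetical call (paths are placeholders; note that the parameter names differ slightly from the other functions):

  from topicgpt_python import correct_topics

  correct_topics(
      api="openai",
      model="gpt-4o-mini",
      data_path="data/output/assignment.jsonl",              # assignments produced by assign_topics
      prompt_path="prompt/correction.txt",                    # hypothetical correction prompt path
      topic_path="data/output/refinement.md",                 # valid topic list to check against
      output_path="data/output/assignment_corrected.jsonl",   # corrected assignments
      verbose=True,
  )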

Citation


      @misc{pham2024topicgptpromptbasedtopicmodeling,
        title={TopicGPT: A Prompt-based Topic Modeling Framework}, 
        author={Chau Minh Pham and Alexander Hoyle and Simeng Sun and Philip Resnik and Mohit Iyyer},
        year={2024},
        eprint={2311.01449},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2311.01449}, 
      }