Initial Commit

This commit is contained in:
Ronnie 2025-02-20 21:09:33 -05:00
commit 27de462bc3
7 changed files with 392 additions and 0 deletions

.env.sample Normal file

@@ -0,0 +1,3 @@
CHATGPT_TOKEN=https://platform.openai.com/api-keys
PAPERLESS_TOKEN=Login->Click User (Top Right)->Profile->API Auth Token
PAPERLESS_BASE_URL=https://paperless.domain.com/api

.gitignore vendored Normal file

@@ -0,0 +1,6 @@
.env
content/
cleaned-content/
.venv/
error.log
.idea/

README.md Normal file

@@ -0,0 +1,93 @@
# Paperless-ngx ChatGPT Python Script
This Python script helps you manage and organize your documents in the [Paperless-ngx](https://github.com/paperless-ngx/paperless-ngx) document management system, with the assistance of the OpenAI ChatGPT models. It can automatically rename documents based on their content and create cleaned copies of documents in a specific directory. This was a helpful script, but I no longer use it; I wanted to add local AI support via Ollama. Maybe one day! I made it easy for anyone to update.
## Prerequisites
Before using this script, make sure you have the following:
- OpenAI API Token (for ChatGPT)
- Paperless API Token (for Paperless-ngx)
- Python 3.x
## Setup
1. Clone or download this repository.
2. Install the required Python libraries using pip:
```
pip install openai requests
```
3. Set up your environment variables:
- `CHATGPT_TOKEN`: Your OpenAI API Token.
- `PAPERLESS_TOKEN`: Your Paperless API Token.
- `PAPERLESS_BASE_URL`: The base URL of your Paperless-ngx instance. For example, `https://paperless.domain.com/api`.
You can set these environment variables in your system or create a `.env` file in the root directory of the project with the following content:
```
CHATGPT_TOKEN=your_chatgpt_api_token
PAPERLESS_TOKEN=your_paperless_api_token
PAPERLESS_BASE_URL=https://paperless.domain.com/api
```
4. Modify the `search_params` variable in the script to specify the patterns for filtering documents in Paperless-ngx. By default, it is set to `["*"]`, which matches all documents. You can customize this to match specific document titles.
## Usage
### Main Script (main.py)
The `main.py` script performs the following tasks:
- Retrieves all documents from Paperless-ngx.
- Filters documents based on the specified search parameters.
- Uses ChatGPT to suggest a new title for each document based on its content.
- Renames the documents with the suggested title (with retries in case of failure).
- Logs failed renaming attempts in the `error.log` file.
To run the main script, execute the following command:
```
python main.py
```
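The filtering step above matches document titles against shell-style wildcards via Python's `fnmatch` module. A quick sketch of how the patterns behave (the titles here are made-up examples):

```python
import fnmatch

# Hypothetical document titles as they might appear in Paperless-ngx
titles = ["Scan 2024-01-03", "Invoice 42", "PDF upload"]

# Patterns in the same style as the script's search_params variable
search_params = ["Scan*", "PDF*"]

matched = [t for t in titles if any(fnmatch.fnmatch(t, p) for p in search_params)]
print(matched)  # -> ['Scan 2024-01-03', 'PDF upload']
```

The default `["*"]` matches every title, which is why the script processes all documents out of the box.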
### Test Script (test-chatgpt.py)
The `test-chatgpt.py` script is designed to generate new names for documents based on their content using ChatGPT. It reads text files from the `content/` directory, suggests new names, and copies the renamed files to the `cleaned-content/` directory.
To use this script, follow these steps:
1. Place the text files you want to rename in the `content/` directory.
2. Run the test script using the following command:
```
python test-chatgpt.py
```
This will generate new names for the files and save them in the `cleaned-content/` directory.
### Paperless Document Retrieval (test-paperless.py)
The `test-paperless.py` script is used to retrieve documents from Paperless-ngx based on search parameters and save their content as text files in the `content/` directory.
To use this script, run it using the following command:
```
python test-paperless.py
```
The script will retrieve and save documents from Paperless-ngx to the `content/` directory based on the specified search parameters.
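Both Paperless scripts page through the API the same way: fetch a page, collect its `results`, and follow the `next` link until it is null. That pattern can be sketched without a live server (`collect_paginated` and the fake pages are illustrative only, not part of this repo):

```python
def collect_paginated(fetch_page, first_url):
    """Follow 'next' links, collecting all 'results'. fetch_page(url) is
    assumed to return a dict shaped like a Paperless API response:
    {'results': [...], 'next': url_or_None}."""
    results, url = [], first_url
    while url:
        page = fetch_page(url)
        results.extend(page.get("results", []))
        url = page.get("next")
    return results

# Demo with fake pages standing in for a live Paperless instance
pages = {
    "p1": {"results": [{"id": 1}], "next": "p2"},
    "p2": {"results": [{"id": 2}], "next": None},
}
print(collect_paginated(pages.get, "p1"))  # -> [{'id': 1}, {'id': 2}]
```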
## Important Notes
- The main script (`main.py`) and the test script (`test-chatgpt.py`) use the OpenAI ChatGPT API to suggest new names for documents. Make sure you have an active API subscription and the necessary API key.
- The scripts assume that you have set up the Paperless-ngx document management system and provided the correct API token.
- Customization: You can customize the search parameters, retry count, and other settings in the scripts to suit your specific requirements.
Feel free to use and modify these scripts to automate your document management workflow with Paperless-ngx and ChatGPT.

main.py Normal file

@@ -0,0 +1,134 @@
import os
import re
import fnmatch

import requests
from openai import OpenAI

# Load environment variables
CHATGPT_TOKEN = os.getenv("CHATGPT_TOKEN")
PAPERLESS_TOKEN = os.getenv("PAPERLESS_TOKEN")
PAPERLESS_BASE_URL = os.getenv("PAPERLESS_BASE_URL")

client = OpenAI(api_key=CHATGPT_TOKEN)

# Ensure that tokens are available
if not CHATGPT_TOKEN or not PAPERLESS_TOKEN:
    raise ValueError("Environment variables for tokens are not set")

# Set search parameters to get all documents
search_params = ["*"]

# Maximum number of retries for renaming a failed document
MAX_RETRIES = 3


def get_all_documents():
    headers = {"Authorization": f"Token {PAPERLESS_TOKEN}"}
    url = f"{PAPERLESS_BASE_URL}/documents/"
    documents = []
    while url:
        response = requests.get(url, headers=headers)
        data = response.json()
        for doc in data.get("results", []):
            documents.append(
                {
                    "id": doc["id"],
                    "title": doc["title"],
                    "content": doc.get("content", ""),
                }
            )
        url = data.get("next")
    return documents


def filter_documents(documents, search_params):
    filtered_docs = []
    for doc in documents:
        for pattern in search_params:
            if fnmatch.fnmatch(doc["title"], pattern):
                filtered_docs.append(doc)
                break
    return filtered_docs


def sanitize_filename(name):
    # Remove invalid characters and replace spaces with underscores
    return re.sub(r'[\\/*?:"<>|]', "", name)


def generate_pdf_name(ocr_content):
    formatted_content = ocr_content.replace("\n", " ")
    try:
        # Prompt for generating a PDF title based on OCR content
        prompt = f"""
        Please suggest a descriptive and unique document title.
        Spaces are preferred over underscores.
        First, correct any spelling mistakes in the following content, then suggest the title: {formatted_content}
        """
        response = client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": ""},
            ],
            temperature=1,
            max_tokens=256,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
        )
        if response and response.choices:
            suggested_name = response.choices[0].message.content.strip()
            if "unable to suggest" not in suggested_name:
                return sanitize_filename(suggested_name)
            else:
                return "Unable_To_Suggest_Title"
        else:
            return "No_Response"
    except Exception as e:
        print(f"An error occurred: {e}")
        return "Error_Generated_Title"


def rename_pdf(document_id, new_name):
    headers = {"Authorization": f"Token {PAPERLESS_TOKEN}"}
    data = {"title": new_name}
    response = requests.patch(
        f"{PAPERLESS_BASE_URL}/documents/{document_id}/", headers=headers, data=data
    )
    return response.ok


def main():
    documents = get_all_documents()
    filtered_documents = filter_documents(documents, search_params)

    # Track the number of retries for each document
    retry_counts = {doc["id"]: 0 for doc in filtered_documents}

    for doc in filtered_documents:
        # Retry renaming a failed document up to MAX_RETRIES times
        while retry_counts[doc["id"]] < MAX_RETRIES:
            ocr_content = doc["content"]
            new_name = generate_pdf_name(ocr_content)
            if new_name and rename_pdf(doc["id"], new_name):
                print(f"Renamed document {doc['id']} to {new_name}")
                break  # Rename successful, move to the next document
            else:
                retry_counts[doc["id"]] += 1
                print(f"Retry {retry_counts[doc['id']]} for document {doc['id']}")

        # Log documents that exhausted their retries
        if retry_counts[doc["id"]] == MAX_RETRIES:
            with open("error.log", "a") as error_log:
                error_log.write(
                    f"Failed to rename document {doc['id']} after {MAX_RETRIES} retries\n"
                )


if __name__ == "__main__":
    main()

requirements.txt Normal file

@@ -0,0 +1,2 @@
requests
openai

test-chatgpt.py Normal file

@@ -0,0 +1,89 @@
import os
import re
import shutil

import openai
from openai import OpenAI

client = OpenAI(api_key=os.getenv("CHATGPT_TOKEN"))


def sanitize_filename(name):
    # Remove invalid characters from the file name
    return re.sub(r'[\\/*?:"<>|]', "", name)


def generate_pdf_name(file_content):
    formatted_content = file_content.replace("\n", " ")
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {
                    "role": "system",
                    "content": "Suggest a new name for a document with the following content: "
                    + formatted_content[:500],
                },
                {"role": "user", "content": ""},
                {
                    "role": "assistant",
                    "content": '"Energetic Greetings: An Expressive Salutation"',
                },
            ],
            temperature=1,
            max_tokens=256,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
        )
        print("Response Object:", response)
        if response and response.choices:
            suggested_name = response.choices[0].message.content.strip()
            if "unable to suggest" not in suggested_name:
                # Sanitize the suggested file name
                return sanitize_filename(suggested_name) + ".txt"
            else:
                return "Unable_To_Suggest_Name.txt"
        else:
            return "No_Response.txt"
    # Specific errors must come before the generic Exception handler,
    # otherwise these branches are unreachable. Each branch returns a
    # fallback name so the caller never receives None.
    except openai.RateLimitError:
        print("A 429 status code was received; we should back off a bit.")
        return "Rate_Limited.txt"
    except openai.APIStatusError as e:
        print("Another non-200-range status code was received")
        print(e.status_code)
        print(e.response)
        return "API_Error.txt"
    except Exception as e:
        print("The server could not be reached")
        print(e.__cause__)  # an underlying exception, likely raised within httpx
        return "Connection_Error.txt"


def main():
    content_dir = "content/"
    cleaned_content_dir = "cleaned-content/"

    # Create the cleaned content directory if it doesn't exist
    if not os.path.exists(cleaned_content_dir):
        os.makedirs(cleaned_content_dir)

    # Process each file in the content directory
    for filename in os.listdir(content_dir):
        if filename.endswith(".txt"):
            file_path = os.path.join(content_dir, filename)

            # Read the content of the file
            with open(file_path, "r") as file:
                file_content = file.read()

            # Generate a new name for the document
            new_name = generate_pdf_name(file_content)

            # Copy the file to the cleaned-content directory
            new_file_path = os.path.join(cleaned_content_dir, new_name)
            shutil.copy(file_path, new_file_path)
            print(f"Copied and renamed '{filename}' to '{new_name}'")


if __name__ == "__main__":
    main()

test-paperless.py Normal file

@@ -0,0 +1,65 @@
import os
import fnmatch

import requests

# Load environment variables
PAPERLESS_TOKEN = os.getenv("PAPERLESS_TOKEN")
PAPERLESS_BASE_URL = os.getenv("PAPERLESS_BASE_URL")

# Ensure that the token and base URL are available
if not PAPERLESS_TOKEN:
    raise ValueError("Paperless token is not set")
if not PAPERLESS_BASE_URL:
    raise ValueError("Paperless base URL is not set")

# Set search parameters
search_params = ["Scan*", "PDF*"]


# Function to get all documents from Paperless with pagination
def get_all_documents():
    headers = {"Authorization": f"Token {PAPERLESS_TOKEN}"}
    url = f"{PAPERLESS_BASE_URL}/documents/"
    documents = []
    while url:
        response = requests.get(url, headers=headers)
        data = response.json()
        documents.extend(data.get("results", []))
        url = data.get("next")
    return documents


# Function to filter documents based on search parameters
def filter_documents(documents, search_params):
    filtered_docs = []
    for doc in documents:
        for pattern in search_params:
            if fnmatch.fnmatch(doc["title"], pattern):
                filtered_docs.append(doc)
                break
    return filtered_docs


def save_document_content(file_name, content):
    os.makedirs("content", exist_ok=True)  # Ensure the content directory exists
    with open(f"content/{file_name}.txt", "w", encoding="utf-8") as file:
        file.write(content)


def main():
    all_documents = get_all_documents()
    filtered_documents = filter_documents(all_documents, search_params)
    for doc in filtered_documents:
        # Use the 'content' field directly
        doc_content = doc.get("content", "")
        save_document_content(doc["title"], doc_content)
        print(
            f"Saved content for Document ID: {doc['id']} to content/{doc['title']}.txt"
        )


if __name__ == "__main__":
    main()