Initial Commit

This commit is contained in:
Ronnie 2025-02-20 21:09:33 -05:00
commit 27de462bc3
7 changed files with 392 additions and 0 deletions

.env.sample Normal file

@@ -0,0 +1,3 @@
CHATGPT_TOKEN=https://platform.openai.com/api-keys
PAPERLESS_TOKEN=Login->Click User (Top Right)->Profile->API Auth Token
PAPERLESS_BASE_URL=https://paperless.domain.com/api

.gitignore vendored Normal file

@@ -0,0 +1,6 @@
.env
content/
cleaned-content/
.venv/
error.log
.idea/

README.md Normal file

@@ -0,0 +1,93 @@
# Paperless-ngx ChatGPT Python Script
This Python script helps you manage and organize your documents in the [Paperless-ngx](https://github.com/paperless-ngx/paperless-ngx) document management system, with the assistance of the OpenAI ChatGPT models. It can automatically rename documents based on their content and create cleaned copies of documents in a specific directory. This was a helpful script, but I no longer use it; I wanted to add local AI support via Ollama. Maybe one day! I made it easy for anyone to update.
## Prerequisites
Before using this script, make sure you have the following:
- OpenAI API Token (for ChatGPT)
- Paperless API Token (for Paperless-ngx)
- Python 3.x
## Setup
1. Clone or download this repository.
2. Install the required Python libraries using pip:
```
pip install openai requests
```
3. Set up your environment variables:
- `CHATGPT_TOKEN`: Your OpenAI API Token.
- `PAPERLESS_TOKEN`: Your Paperless API Token.
- `PAPERLESS_BASE_URL`: The base URL of your Paperless-ngx instance. For example, `https://paperless.domain.com/api`.
You can set these environment variables in your system or create a `.env` file in the root directory of the project with the following content:
```
CHATGPT_TOKEN=your_chatgpt_api_token
PAPERLESS_TOKEN=your_paperless_api_token
PAPERLESS_BASE_URL=https://paperless.domain.com/api
```
4. Modify the `search_params` variable in the script to specify the patterns for filtering documents in Paperless-ngx. By default, it is set to `["*"]`, which matches all documents. You can customize this to match specific document titles.
## Usage
### Main Script (main.py)
The `main.py` script performs the following tasks:
- Retrieves all documents from Paperless-ngx.
- Filters documents based on the specified search parameters.
- Uses ChatGPT to suggest a new title for each document based on its content.
- Renames the documents with the suggested title (with retries in case of failure).
- Logs failed renaming attempts in the `error.log` file.
To run the main script, execute the following command:
```
python main.py
```
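The filtering step above matches document titles against shell-style wildcards via Python's `fnmatch` module. A quick sketch of how the patterns behave (the titles here are made-up examples):

```python
import fnmatch

# Hypothetical document titles as they might appear in Paperless-ngx
titles = ["Scan 2024-01-03", "Invoice 42", "PDF upload"]

# Patterns in the same style as the script's search_params variable
search_params = ["Scan*", "PDF*"]

matched = [t for t in titles if any(fnmatch.fnmatch(t, p) for p in search_params)]
print(matched)  # -> ['Scan 2024-01-03', 'PDF upload']
```

The default `["*"]` matches every title, which is why the script processes all documents out of the box.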
### Test Script (test-chatgpt.py)
The `test-chatgpt.py` script is designed to generate new names for documents based on their content using ChatGPT. It reads text files from the `content/` directory, suggests new names, and copies the renamed files to the `cleaned-content/` directory.
To use this script, follow these steps:
1. Place the text files you want to rename in the `content/` directory.
2. Run the test script using the following command:
```
python test-chatgpt.py
```
This will generate new names for the files and save them in the `cleaned-content/` directory.
### Paperless Document Retrieval (test-paperless.py)
The `test-paperless.py` script is used to retrieve documents from Paperless-ngx based on search parameters and save their content as text files in the `content/` directory.
To use this script, run it using the following command:
```
python test-paperless.py
```
The script will retrieve and save documents from Paperless-ngx to the `content/` directory based on the specified search parameters.
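Both Paperless scripts page through the API the same way: fetch a page, collect its `results`, and follow the `next` link until it is null. That pattern can be sketched without a live server (`collect_paginated` and the fake pages are illustrative only, not part of this repo):

```python
def collect_paginated(fetch_page, first_url):
    """Follow 'next' links, collecting all 'results'. fetch_page(url) is
    assumed to return a dict shaped like a Paperless API response:
    {'results': [...], 'next': url_or_None}."""
    results, url = [], first_url
    while url:
        page = fetch_page(url)
        results.extend(page.get("results", []))
        url = page.get("next")
    return results

# Demo with fake pages standing in for a live Paperless instance
pages = {
    "p1": {"results": [{"id": 1}], "next": "p2"},
    "p2": {"results": [{"id": 2}], "next": None},
}
print(collect_paginated(pages.get, "p1"))  # -> [{'id': 1}, {'id': 2}]
```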
## Important Notes
- The main script (`main.py`) and the test script (`test-chatgpt.py`) use the OpenAI ChatGPT API to suggest new names for documents. Make sure you have an active API subscription and the necessary API key.
- The scripts assume that you have set up the Paperless-ngx document management system and provided the correct API token.
- Customization: You can customize the search parameters, retry count, and other settings in the scripts to suit your specific requirements.
Feel free to use and modify these scripts to automate your document management workflow with Paperless-ngx and ChatGPT.

main.py Normal file

@@ -0,0 +1,134 @@
import os
import re
import fnmatch

import requests
from openai import OpenAI

# Load environment variables
CHATGPT_TOKEN = os.getenv("CHATGPT_TOKEN")
PAPERLESS_TOKEN = os.getenv("PAPERLESS_TOKEN")
PAPERLESS_BASE_URL = os.getenv("PAPERLESS_BASE_URL")

client = OpenAI(api_key=CHATGPT_TOKEN)

# Ensure that tokens are available
if not CHATGPT_TOKEN or not PAPERLESS_TOKEN:
    raise ValueError("Environment variables for tokens are not set")

# Set search parameters to get all documents
search_params = ["*"]

# Maximum number of retries for renaming a failed document
MAX_RETRIES = 3


def get_all_documents():
    headers = {"Authorization": f"Token {PAPERLESS_TOKEN}"}
    url = f"{PAPERLESS_BASE_URL}/documents/"
    documents = []
    while url:
        response = requests.get(url, headers=headers)
        data = response.json()
        for doc in data.get("results", []):
            documents.append(
                {
                    "id": doc["id"],
                    "title": doc["title"],
                    "content": doc.get("content", ""),
                }
            )
        url = data.get("next")
    return documents


def filter_documents(documents, search_params):
    filtered_docs = []
    for doc in documents:
        for pattern in search_params:
            if fnmatch.fnmatch(doc["title"], pattern):
                filtered_docs.append(doc)
                break
    return filtered_docs


def sanitize_filename(name):
    # Remove invalid characters and replace spaces with underscores
    return re.sub(r'[\\/*?:"<>|]', "", name)


def generate_pdf_name(ocr_content):
    formatted_content = ocr_content.replace("\n", " ")
    try:
        # Prompt for generating a PDF title based on OCR content
        prompt = f"""
        Please suggest a descriptive and unique document title.
        Spaces are preferred over underscores.
        First, correct any spelling mistakes in the following content, then suggest the title: {formatted_content}
        """
        response = client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": ""},
            ],
            temperature=1,
            max_tokens=256,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
        )
        if response and response.choices:
            suggested_name = response.choices[0].message.content.strip()
            if "unable to suggest" not in suggested_name:
                return sanitize_filename(suggested_name)
            else:
                return "Unable_To_Suggest_Title"
        else:
            return "No_Response"
    except Exception as e:
        print(f"An error occurred: {e}")
        return "Error_Generated_Title"


def rename_pdf(document_id, new_name):
    headers = {"Authorization": f"Token {PAPERLESS_TOKEN}"}
    data = {"title": new_name}
    response = requests.patch(
        f"{PAPERLESS_BASE_URL}/documents/{document_id}/", headers=headers, data=data
    )
    return response.ok


def main():
    documents = get_all_documents()
    filtered_documents = filter_documents(documents, search_params)

    # Track the number of retries for each document
    retry_counts = {doc["id"]: 0 for doc in filtered_documents}

    for doc in filtered_documents:
        # Retry renaming a failed document up to MAX_RETRIES times
        while retry_counts[doc["id"]] < MAX_RETRIES:
            ocr_content = doc["content"]
            new_name = generate_pdf_name(ocr_content)
            if new_name and rename_pdf(doc["id"], new_name):
                print(f"Renamed document {doc['id']} to {new_name}")
                break  # Rename successful, move to the next document
            else:
                retry_counts[doc["id"]] += 1
                print(f"Retry {retry_counts[doc['id']]} for document {doc['id']}")

        # Log documents that exhausted their retries
        if retry_counts[doc["id"]] == MAX_RETRIES:
            with open("error.log", "a") as error_log:
                error_log.write(
                    f"Failed to rename document {doc['id']} after {MAX_RETRIES} retries\n"
                )


if __name__ == "__main__":
    main()

requirements.txt Normal file

@@ -0,0 +1,2 @@
requests
openai

test-chatgpt.py Normal file

@@ -0,0 +1,89 @@
import os
import re
import shutil

import openai
from openai import OpenAI

client = OpenAI(api_key=os.getenv("CHATGPT_TOKEN"))


def sanitize_filename(name):
    # Remove invalid characters from the file name
    return re.sub(r'[\\/*?:"<>|]', "", name)


def generate_pdf_name(file_content):
    formatted_content = file_content.replace("\n", " ")
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {
                    "role": "system",
                    "content": "Suggest a new name for a document with the following content: "
                    + formatted_content[:500],
                },
                {"role": "user", "content": ""},
                {
                    "role": "assistant",
                    "content": '"Energetic Greetings: An Expressive Salutation"',
                },
            ],
            temperature=1,
            max_tokens=256,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
        )
        print("Response Object:", response)
        if response and response.choices:
            suggested_name = response.choices[0].message.content.strip()
            if "unable to suggest" not in suggested_name:
                # Sanitize the suggested file name
                return sanitize_filename(suggested_name) + ".txt"
            else:
                return "Unable_To_Suggest_Name.txt"
        else:
            return "No_Response.txt"
    # Specific errors must come before the generic Exception handler,
    # otherwise these branches are unreachable. Each branch returns a
    # fallback name so the caller never receives None.
    except openai.RateLimitError:
        print("A 429 status code was received; we should back off a bit.")
        return "Rate_Limited.txt"
    except openai.APIStatusError as e:
        print("Another non-200-range status code was received")
        print(e.status_code)
        print(e.response)
        return "API_Error.txt"
    except Exception as e:
        print("The server could not be reached")
        print(e.__cause__)  # an underlying exception, likely raised within httpx
        return "Connection_Error.txt"


def main():
    content_dir = "content/"
    cleaned_content_dir = "cleaned-content/"

    # Create the cleaned content directory if it doesn't exist
    if not os.path.exists(cleaned_content_dir):
        os.makedirs(cleaned_content_dir)

    # Process each file in the content directory
    for filename in os.listdir(content_dir):
        if filename.endswith(".txt"):
            file_path = os.path.join(content_dir, filename)

            # Read the content of the file
            with open(file_path, "r") as file:
                file_content = file.read()

            # Generate a new name for the document
            new_name = generate_pdf_name(file_content)

            # Copy the file to the cleaned-content directory
            new_file_path = os.path.join(cleaned_content_dir, new_name)
            shutil.copy(file_path, new_file_path)
            print(f"Copied and renamed '{filename}' to '{new_name}'")


if __name__ == "__main__":
    main()

test-paperless.py Normal file

@@ -0,0 +1,65 @@
import os
import fnmatch

import requests

# Load environment variables
PAPERLESS_TOKEN = os.getenv("PAPERLESS_TOKEN")
PAPERLESS_BASE_URL = os.getenv("PAPERLESS_BASE_URL")

# Ensure that the token and base URL are available
if not PAPERLESS_TOKEN:
    raise ValueError("Paperless token is not set")
if not PAPERLESS_BASE_URL:
    raise ValueError("Paperless base URL is not set")

# Set search parameters
search_params = ["Scan*", "PDF*"]


# Function to get all documents from Paperless with pagination
def get_all_documents():
    headers = {"Authorization": f"Token {PAPERLESS_TOKEN}"}
    url = f"{PAPERLESS_BASE_URL}/documents/"
    documents = []
    while url:
        response = requests.get(url, headers=headers)
        data = response.json()
        documents.extend(data.get("results", []))
        url = data.get("next")
    return documents


# Function to filter documents based on search parameters
def filter_documents(documents, search_params):
    filtered_docs = []
    for doc in documents:
        for pattern in search_params:
            if fnmatch.fnmatch(doc["title"], pattern):
                filtered_docs.append(doc)
                break
    return filtered_docs


def save_document_content(file_name, content):
    os.makedirs("content", exist_ok=True)  # Ensure the content directory exists
    with open(f"content/{file_name}.txt", "w", encoding="utf-8") as file:
        file.write(content)


def main():
    all_documents = get_all_documents()
    filtered_documents = filter_documents(all_documents, search_params)
    for doc in filtered_documents:
        # Use the 'content' field directly
        doc_content = doc.get("content", "")
        save_document_content(doc["title"], doc_content)
        print(
            f"Saved content for Document ID: {doc['id']} to content/{doc['title']}.txt"
        )


if __name__ == "__main__":
    main()