Initial Commit
commit 27de462bc3
7 changed files with 392 additions and 0 deletions
.env.sample (new file, +3)
@@ -0,0 +1,3 @@
CHATGPT_TOKEN=https://platform.openai.com/api-keys
PAPERLESS_TOKEN=Login->Click User (Top Right)->Profile->API Auth Token
PAPERLESS_BASE_URL=https://paperless.domain.com/api
.gitignore (vendored, new file, +6)
@@ -0,0 +1,6 @@
.env
content/
cleaned-content/
.venv/
error.log
.idea/
README.md (new file, +93)
@@ -0,0 +1,93 @@
# Paperless-ngx ChatGPT Python Script

This Python script helps you manage and organize your documents in the [Paperless-ngx](https://github.com/paperless-ngx/paperless-ngx) document management system, with the assistance of OpenAI's ChatGPT models. It can automatically rename documents based on their content and create cleaned copies of documents in a dedicated directory. This was a helpful script, but I no longer use it; I wanted to add local AI support via Ollama, and maybe one day I will. The code is kept simple so that anyone can update it.

## Prerequisites

Before using this script, make sure you have the following:

- An OpenAI API token (for ChatGPT)
- A Paperless API token (for Paperless-ngx)
- Python 3.x

## Setup

1. Clone or download this repository.

2. Install the required Python libraries using pip:

```
pip install openai requests
```

3. Set up your environment variables:

- `CHATGPT_TOKEN`: Your OpenAI API token.
- `PAPERLESS_TOKEN`: Your Paperless API token.
- `PAPERLESS_BASE_URL`: The base URL of your Paperless-ngx API, for example `https://paperless.domain.com/api`.

You can set these variables in your system or create a `.env` file in the root directory of the project with the following content:

```
CHATGPT_TOKEN=your_chatgpt_api_token
PAPERLESS_TOKEN=your_paperless_api_token
PAPERLESS_BASE_URL=https://paperless.domain.com/api
```
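
Note that the scripts read these variables with `os.getenv()` and do not load the `.env` file themselves, so either export the variables in your shell or load the file before running. A minimal sketch using the `python-dotenv` package (an extra dependency, not part of this repository):

```
# Assumes `pip install python-dotenv`; the committed scripts do not do this themselves.
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory into os.environ
```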

4. Modify the `search_params` variable in the script to specify the patterns used to filter documents in Paperless-ngx. By default it is set to `["*"]`, which matches all documents; you can customize it to match specific document titles, as in the example below.
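
The patterns are shell-style globs matched against document titles with Python's `fnmatch` (see `filter_documents` in the scripts). For example:

```
import fnmatch

fnmatch.fnmatch("Scan 2024-01-15", "Scan*")   # True
fnmatch.fnmatch("Invoice 0042", "Scan*")      # False
```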

## Usage

### Main Script (main.py)

The `main.py` script performs the following tasks:

- Retrieves all documents from Paperless-ngx.
- Filters documents based on the specified search parameters.
- Uses ChatGPT to suggest a new title for each document based on its content.
- Renames the documents with the suggested title, retrying on failure.
- Logs failed renaming attempts in the `error.log` file.

To run the main script, execute the following command:

```
python main.py
```
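
If a document still cannot be renamed after `MAX_RETRIES` attempts, the script appends a line to `error.log` in this format (taken from the logging call in `main.py`; the document ID here is made up):

```
Failed to rename document 123 after 3 retries
```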

### Test Script (test-chatgpt.py)

The `test-chatgpt.py` script generates new names for documents based on their content using ChatGPT. It reads text files from the `content/` directory, suggests new names, and copies the renamed files to the `cleaned-content/` directory.

To use this script, follow these steps:

1. Place the text files you want to rename in the `content/` directory.

2. Run the test script with the following command:

```
python test-chatgpt.py
```

This will generate new names for the files and save the renamed copies in the `cleaned-content/` directory.

### Paperless Document Retrieval (test-paperless.py)

The `test-paperless.py` script retrieves documents from Paperless-ngx based on the search parameters and saves their content as text files in the `content/` directory.

Run it with the following command:

```
python test-paperless.py
```

The script will retrieve documents matching the search parameters and save their content to the `content/` directory.

## Important Notes

- The main script (`main.py`) and the test script (`test-chatgpt.py`) use the OpenAI ChatGPT API to suggest new names for documents. Make sure you have an active API account and a valid API key.

- The scripts assume that you have a running Paperless-ngx instance and have provided the correct API token.

- Customization: you can adjust the search parameters, retry count, and other settings in the scripts to suit your specific requirements.

Feel free to use and modify these scripts to automate your document management workflow with Paperless-ngx and ChatGPT.
main.py (new file, +134)
@@ -0,0 +1,134 @@
import os
import requests
import re
import fnmatch
from openai import OpenAI

# Load environment variables
CHATGPT_TOKEN = os.getenv("CHATGPT_TOKEN")
PAPERLESS_TOKEN = os.getenv("PAPERLESS_TOKEN")
PAPERLESS_BASE_URL = os.getenv("PAPERLESS_BASE_URL")

# Ensure that the required environment variables are set before creating the client
if not CHATGPT_TOKEN or not PAPERLESS_TOKEN or not PAPERLESS_BASE_URL:
    raise ValueError("CHATGPT_TOKEN, PAPERLESS_TOKEN, and PAPERLESS_BASE_URL must be set")

client = OpenAI(api_key=CHATGPT_TOKEN)

# Set search parameters to get all documents
search_params = ["*"]

# Maximum number of retries for renaming a failed document
MAX_RETRIES = 3


def get_all_documents():
    # Walk the paginated /documents/ endpoint, following the "next" link
    headers = {"Authorization": f"Token {PAPERLESS_TOKEN}"}
    url = f"{PAPERLESS_BASE_URL}/documents/"
    documents = []
    while url:
        response = requests.get(url, headers=headers)
        data = response.json()
        for doc in data.get("results", []):
            documents.append(
                {
                    "id": doc["id"],
                    "title": doc["title"],
                    "content": doc.get("content", ""),
                }
            )
        url = data.get("next")
    return documents


def filter_documents(documents, search_params):
    # Keep documents whose title matches any of the glob patterns
    filtered_docs = []
    for doc in documents:
        for pattern in search_params:
            if fnmatch.fnmatch(doc["title"], pattern):
                filtered_docs.append(doc)
                break
    return filtered_docs

def sanitize_filename(name):
    # Remove characters that are not valid in file names (spaces are kept)
    return re.sub(r'[\\/*?:"<>|]', "", name)

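
For reference, `sanitize_filename` strips only the characters in the regex class (backslash, slash, and the Windows-reserved punctuation) and keeps spaces:

```
>>> sanitize_filename('Invoice: "ACME" 2024?')
'Invoice ACME 2024'
```
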
def generate_pdf_name(ocr_content):
    formatted_content = ocr_content.replace("\n", " ")
    try:
        # Prompt for generating a PDF title based on OCR content
        prompt = f"""
        Please suggest a descriptive and unique document title.
        Spaces are preferred over underscores.
        First, correct any spelling mistakes in the following content, then suggest the title: {formatted_content}
        """

        response = client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {
                    "role": "system",
                    "content": prompt,
                },
                {"role": "user", "content": ""},
            ],
            temperature=1,
            max_tokens=256,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
        )
        if response and response.choices:
            suggested_name = response.choices[0].message.content.strip()
            if "unable to suggest" not in suggested_name:
                return sanitize_filename(suggested_name)
            else:
                return "Unable_To_Suggest_Title"
        else:
            return "No_Response"
    except Exception as e:
        print(f"An error occurred: {e}")
        return "Error_Generated_Title"

def rename_pdf(document_id, new_name):
    headers = {"Authorization": f"Token {PAPERLESS_TOKEN}"}
    data = {"title": new_name}
    response = requests.patch(
        f"{PAPERLESS_BASE_URL}/documents/{document_id}/", headers=headers, data=data
    )
    return response.ok

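
One thing to watch: `requests.patch(..., data=data)` sends the payload form-encoded. If your Paperless-ngx version rejects that, passing `json=data` instead sends the same payload as JSON; this is an alternative worth trying, not a confirmed requirement of the API.
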
def main():
    documents = get_all_documents()
    filtered_documents = filter_documents(documents, search_params)

    # Create a dictionary to store the number of retries for each document
    retry_counts = {doc["id"]: 0 for doc in filtered_documents}

    for doc in filtered_documents:
        # Retry renaming a failed document up to MAX_RETRIES times
        while retry_counts[doc["id"]] < MAX_RETRIES:
            ocr_content = doc["content"]
            new_name = generate_pdf_name(ocr_content)

            if new_name and rename_pdf(doc["id"], new_name):
                print(f"Renamed document {doc['id']} to {new_name}")
                break  # Rename successful, move to the next document
            else:
                retry_counts[doc["id"]] += 1
                print(f"Retry {retry_counts[doc['id']]} for document {doc['id']}")

        # Log failed documents
        if retry_counts[doc["id"]] == MAX_RETRIES:
            with open("error.log", "a") as error_log:
                error_log.write(
                    f"Failed to rename document {doc['id']} after {MAX_RETRIES} retries\n"
                )


if __name__ == "__main__":
    main()
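
If every rename fails, it is worth confirming the token and base URL outside the script. A quick check with curl, using the same header the script sends (your own instance URL assumed):

```
curl -H "Authorization: Token $PAPERLESS_TOKEN" "$PAPERLESS_BASE_URL/documents/"
```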
requirements.txt (new file, +2)
@@ -0,0 +1,2 @@
requests
openai
test-chatgpt.py (new file, +89)
@@ -0,0 +1,89 @@
import os
import shutil
import re

import openai
from openai import OpenAI

client = OpenAI(api_key=os.getenv("CHATGPT_TOKEN"))


def sanitize_filename(name):
    # Remove invalid characters from the file name
    return re.sub(r'[\\/*?:"<>|]', "", name)

def generate_pdf_name(file_content):
    formatted_content = file_content.replace("\n", " ")
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {
                    "role": "system",
                    "content": "Suggest a new name for a document with the following content: "
                    + formatted_content[:500],
                },
                {"role": "user", "content": ""},
                {
                    "role": "assistant",
                    "content": '"Energetic Greetings: An Expressive Salutation"',
                },
            ],
            temperature=1,
            max_tokens=256,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
        )
        print("Response Object:", response)
        if response and response.choices:
            suggested_name = response.choices[0].message.content.strip()
            if "unable to suggest" not in suggested_name:
                # Sanitize the suggested file name
                sanitized_name = sanitize_filename(suggested_name)
                return sanitized_name + ".txt"
            else:
                return "Unable_To_Suggest_Name.txt"
        else:
            return "No_Response.txt"
    # The specific handlers must come before the generic one, or they are unreachable;
    # each returns a fallback name so main() never receives None.
    except openai.APIConnectionError as e:
        print("The server could not be reached")
        print(e.__cause__)  # an underlying Exception, likely raised within httpx.
        return "Connection_Error.txt"
    except openai.RateLimitError:
        print("A 429 status code was received; we should back off a bit.")
        return "Rate_Limited.txt"
    except openai.APIStatusError as e:
        print("Another non-200-range status code was received")
        print(e.status_code)
        print(e.response)
        return "API_Error.txt"
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return "Error_Generated_Name.txt"

def main():
    content_dir = "content/"
    cleaned_content_dir = "cleaned-content/"

    # Create the cleaned content directory if it doesn't exist
    if not os.path.exists(cleaned_content_dir):
        os.makedirs(cleaned_content_dir)

    # Process each file in the content directory
    for filename in os.listdir(content_dir):
        if filename.endswith(".txt"):
            file_path = os.path.join(content_dir, filename)

            # Read the content of the file
            with open(file_path, "r") as file:
                file_content = file.read()

            # Generate a new name for the document
            new_name = generate_pdf_name(file_content)

            # Copy the file to the cleaned-content directory
            new_file_path = os.path.join(cleaned_content_dir, new_name)
            shutil.copy(file_path, new_file_path)

            print(f"Copied and renamed '{filename}' to '{new_name}'")


if __name__ == "__main__":
    main()
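
The `RateLimitError` handler above only prints a message. If you hit 429s regularly, wrapping the API call in a simple retry with exponential backoff is one option; a minimal sketch (the helper and delay values are suggestions, not part of the original script):

```
import time

def call_with_backoff(request_fn, max_attempts=3):
    # Hypothetical helper: retry an OpenAI call on RateLimitError with growing delays
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s
    raise RuntimeError("Still rate limited after retries")
```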
test-paperless.py (new file, +65)
@@ -0,0 +1,65 @@
import os
import requests
import fnmatch

# Load environment variables
PAPERLESS_TOKEN = os.getenv("PAPERLESS_TOKEN")
PAPERLESS_BASE_URL = os.getenv("PAPERLESS_BASE_URL")

# Ensure that the token is available
if not PAPERLESS_TOKEN:
    raise ValueError("Paperless token is not set")

if not PAPERLESS_BASE_URL:
    raise ValueError("Paperless base URL is not set")

# Set search parameters
search_params = ["Scan*", "PDF*"]


# Function to get all documents from Paperless with pagination
def get_all_documents():
    headers = {"Authorization": f"Token {PAPERLESS_TOKEN}"}
    url = f"{PAPERLESS_BASE_URL}/documents/"

    documents = []
    while url:
        response = requests.get(url, headers=headers)
        data = response.json()
        documents.extend(data.get("results", []))
        url = data.get("next")
    return documents


# Function to filter documents based on search parameters
def filter_documents(documents, search_params):
    filtered_docs = []
    for doc in documents:
        for pattern in search_params:
            if fnmatch.fnmatch(doc["title"], pattern):
                filtered_docs.append(doc)
                break
    return filtered_docs


def save_document_content(file_name, content):
    os.makedirs("content", exist_ok=True)  # Ensure the content directory exists
    with open(f"content/{file_name}.txt", "w", encoding="utf-8") as file:
        file.write(content)


def main():
    all_documents = get_all_documents()
    filtered_documents = filter_documents(all_documents, search_params)

    for doc in filtered_documents:
        # Use the 'content' field directly
        doc_content = doc.get("content", "")
        save_document_content(doc["title"], doc_content)
        print(
            f"Saved content for Document ID: {doc['id']} to content/{doc['title']}.txt"
        )


if __name__ == "__main__":
    main()
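
One caveat: unlike `main.py` and `test-chatgpt.py`, this script writes `doc["title"]` into the file path verbatim, so a title containing a path separator will make `open()` fail. A minimal guard is to reuse the same regex the other scripts already use (a suggested change, not part of the commit):

```
import re

def sanitize_filename(name):
    # Same character class used in main.py and test-chatgpt.py
    return re.sub(r'[\\/*?:"<>|]', "", name)

# ...then in main():
save_document_content(sanitize_filename(doc["title"]), doc_content)
```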