OpenAI Quietly Lowers Codex Usage Limit – and How to Run It Locally

Tech‑Scroll 115 – “When the river is dammed overnight, the wise redirect the stream.”

Update: head over to our latest article. Access is being restricted even more than it was at first, and we can act to prevent it from getting worse. If you live in the UK and we all act, OpenAI will be legally forced to rethink its stance: https://articles.akadata.ltd/article-when-a-digital-service-changes-without-notice-understanding-your-rights-under-uk-law-2/

A shock for developers: overnight, OpenAI appears to have lowered the usage cap for gpt‑5‑codex on the command line. Many of us suddenly encountered the familiar warning:

"You've hit your usage limit. Upgrade to Pro or try again in 1 hour 50 minutes."

It got worse a few hours later: the second message we received quoted a wait of 5 days, 1 hour and 28 minutes. It would appear this timer gets longer every time the limit is hit.

The change is unexpected and unwelcome, almost as if a dam had been built overnight. What had been available without restriction can now lock you out for nearly two hours, interrupting production work mid‑flow. It leaves many of us asking: who gives a dam about the users anyway? They only pay for the service, right?

This is a reminder, however, that cloud quotas can change without notice. If you rely on Codex or any other hosted LLM for development, it’s wise to have a plan for running the model locally so you’re never blocked by a sudden limit.

Below is a complete guide, based on our own setup at Breath Technology, for running a Codex‑compatible workflow entirely on your own hardware.


1. Why Run Codex Locally?

  • Zero quota surprises: No unexpected hourly or daily limits.
  • Data sovereignty: Your code and prompts never leave your network.
  • Cost control: Fixed hardware costs instead of unpredictable API bills.

2. Pick a Local Model Backend

Several open‑source runtimes can host Codex‑style LLMs:

  • llama.cpp / llama-server: lightweight C++ binary that runs GGUF models efficiently on CPU or GPU.
  • Ollama: simple installer for macOS/Linux; automatically exposes an OpenAI‑compatible endpoint.
  • vLLM: high‑performance Python engine with a built‑in OpenAI API server.

Choose a model such as the 1.58‑bit BitNet or any GPT‑class open model and export it to GGUF or another quantized format that fits your hardware.


3. Expose an OpenAI‑Compatible API

Codex expects the OpenAI REST API. Each backend can mimic it:

  • llama-server: exposes an OpenAI‑compatible REST API out of the box; bind it with --host 0.0.0.0 --port 8080 and load your model with --model.
  • Ollama: by default exposes http://localhost:11434/v1 matching OpenAI’s endpoints.
  • vLLM: python -m vllm.entrypoints.openai.api_server spins up /v1 routes.

Verify with:

curl http://localhost:11434/v1/models

You should see your model listed.
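
If you would rather check from code than curl, here is a minimal TypeScript sketch using Node’s built‑in fetch. It assumes the Ollama‑style endpoint on port 11434; adjust BASE_URL for llama-server (e.g. http://localhost:8080/v1) or vLLM.

// check-models.ts – list the models exposed by a local OpenAI-compatible server.
const BASE_URL = process.env.OPENAI_API_BASE ?? "http://localhost:11434/v1";

async function listModels(): Promise<void> {
  // Most local servers ignore the API key, but sending one keeps clients happy.
  const res = await fetch(`${BASE_URL}/models`, {
    headers: { Authorization: "Bearer dummy-key" },
  });
  if (!res.ok) {
    throw new Error(`Server replied ${res.status} ${res.statusText}`);
  }
  // OpenAI-compatible shape: { object: "list", data: [{ id: "..." }, ...] }
  const body = await res.json();
  for (const model of body.data ?? []) {
    console.log(model.id);
  }
}

listModels().catch((err) => {
  console.error("Local endpoint not reachable:", err);
  process.exit(1);
});

Run it with npx tsx check-models.ts (or compile with tsc) while the local server is up; the output should match the curl listing.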


4. Point Codex to Your Local Endpoint

Codex uses the OpenAI client libraries. Override the base URL:

Environment variables (newer OpenAI SDKs read OPENAI_BASE_URL, while some older clients read OPENAI_API_BASE, so setting both is harmless)

export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=dummy-key

Node / TypeScript example

import OpenAI from "openai";
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "http://localhost:11434/v1"
});

The API key can be any string if your local server does not require authentication.
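
As a quick end‑to‑end test, a sketch along these lines sends one chat completion through the overridden base URL. The model name "codex-local" is a placeholder; use whichever id your /v1/models listing returned.

import OpenAI from "openai";

// Reuses the same client configuration as above.
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY ?? "dummy-key",
  baseURL: process.env.OPENAI_API_BASE ?? "http://localhost:11434/v1",
});

async function main(): Promise<void> {
  const completion = await client.chat.completions.create({
    model: "codex-local", // placeholder: match the id exposed by your backend
    messages: [
      { role: "system", content: "You are a concise coding assistant." },
      { role: "user", content: "Write a TypeScript function that reverses a string." },
    ],
  });
  console.log(completion.choices[0]?.message?.content);
}

main().catch(console.error);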


5. Tune Your Model Parameters

  • Match the model name in your client code to the exact name exposed by the local backend so requests are routed to the correct engine.
  • Choose max_tokens high enough to cover the longest expected completion but low enough to prevent runaway responses. For large models start around 1024–2048 and adjust after testing; if you need to summarise long RAG documents, push this higher but monitor memory usage.
  • Adjust temperature for creativity: 0.0–0.3 for deterministic code generation, 0.7–1.0 for brainstorming; tune this if you want more exploratory answers when combined with a local RAG knowledge base.
  • Ensure the context window matches your coding tasks; if you need multi‑file awareness, use a model or build variant that supports extended context (e.g., 8k–32k tokens or more). For local RAG, size the context so retrieved documents fit comfortably.
  • Consider tuning top_p and frequency_penalty/presence_penalty to influence diversity and repetition, especially in long conversations. When adding sentiment analysis or RAG, slightly increasing frequency_penalty can help avoid repetitive retrieved text.
  • Review and set stop sequences to terminate output cleanly when generating structured code or documentation; adapt these when you want the model to stop after a RAG‑provided answer.
  • For interactive coding agents, set stream to true so tokens are returned as they are generated, improving perceived responsiveness. If you integrate with OpenWebUI or a similar front‑end, streaming provides real‑time display of both chat and RAG content (see the first sketch after this list).
  • To add local RAG, run a vector database such as Milvus or Weaviate, or use the built‑in retrieval plugin of OpenWebUI. Index your documents and, on each user query, retrieve the relevant chunks and prepend them to the prompt before sending it to the local model (a minimal sketch of this pattern follows after this list).
  • With OpenWebUI you can host your model locally and expose a friendly web interface; combine it with your RAG pipeline by configuring a retrieval plugin or custom prompt template so that Codex can answer using both its training data and your private knowledge.
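
Putting those knobs together, here is a hedged sketch using the same Node client as in section 4; the model id, stop sequence, and every numeric value are illustrative starting points rather than recommendations for your specific backend.

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "dummy-key",
  baseURL: "http://localhost:11434/v1", // adjust for llama-server or vLLM
});

// Streamed, mostly deterministic code generation using the parameters above.
async function generate(prompt: string): Promise<string> {
  const stream = await client.chat.completions.create({
    model: "codex-local",       // must match the id your backend exposes
    messages: [{ role: "user", content: prompt }],
    temperature: 0.2,           // low for deterministic code generation
    top_p: 0.9,
    max_tokens: 1024,           // raise for long completions or RAG summaries
    frequency_penalty: 0.2,     // mild penalty to curb repetition
    stop: ["// END"],           // example stop sequence for structured output
    stream: true,               // tokens arrive as they are generated
  });

  let output = "";
  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content ?? "";
    process.stdout.write(token); // real-time display, as a front-end would show it
    output += token;
  }
  return output;
}

generate("Write a TypeScript function that parses a CSV line.").catch(console.error);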
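
And here is a minimal sketch of the prepend‑retrieved‑chunks pattern described above, with the retrieval step stubbed out. In a real setup retrieveChunks would run a similarity search against Milvus, Weaviate, or OpenWebUI’s retrieval plugin; here it is a hypothetical placeholder.

import OpenAI from "openai";

const client = new OpenAI({ apiKey: "dummy-key", baseURL: "http://localhost:11434/v1" });

// Hypothetical retrieval step: swap in a real vector-database query here.
async function retrieveChunks(query: string): Promise<string[]> {
  return [
    `Internal doc excerpt matching "${query}" (placeholder)`,
    `Second relevant excerpt matching "${query}" (placeholder)`,
  ];
}

// Prepend the retrieved context to the prompt before calling the local model.
async function answerWithRag(question: string): Promise<string | null> {
  const chunks = await retrieveChunks(question);
  const context = chunks.map((c, i) => `[${i + 1}] ${c}`).join("\n");

  const completion = await client.chat.completions.create({
    model: "codex-local", // placeholder model id
    messages: [
      { role: "system", content: "Answer using the provided context where possible." },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
    temperature: 0.3,
  });
  return completion.choices[0]?.message?.content ?? null;
}

answerWithRag("How do we name feature branches?").then(console.log).catch(console.error);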

6. Production Tips & Setup Tutorials

Below are step‑by‑step mini‑tutorials for each recommended production technique:

  • Run under systemd: create a full service file at /etc/systemd/system/codex-local.service such as:
[Unit]
Description=Codex Local Llama Server
After=network.target

[Service]
Type=simple
User=codex
Group=codex
WorkingDirectory=/usr/local/share/codex
ExecStart=/usr/local/bin/llama-server --host 0.0.0.0 --port 8080 --model /usr/local/share/codex/model.gguf
Restart=always
RestartSec=5s
LimitNOFILE=65535

[Install]
WantedBy=multi-user.target

Then run systemctl daemon-reload && systemctl enable --now codex-local to start on boot and ensure it restarts on failure.

  • Run in Docker: build an image that compiles your chosen backend and exposes port 8080, for example with the following Dockerfile (built here with OpenBLAS support):
FROM debian:stable-slim
RUN apt-get update && apt-get install -y build-essential cmake git libopenblas-dev && rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN git clone https://github.com/ggerganov/llama.cpp.git && cd llama.cpp && mkdir build && cd build \
 && cmake -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS .. && make -j$(nproc)
COPY model.gguf /app/llama.cpp/
EXPOSE 8080
CMD ["/app/llama.cpp/build/bin/llama-server","--host","0.0.0.0","--port","8080","--model","/app/llama.cpp/model.gguf"]

Then build and run it:

docker build -t codex-local .
docker run -d -p 8080:8080 codex-local

This provides an isolated container hosting the Codex-compatible API.

  • Enable GPU acceleration: For NVIDIA cards install CUDA toolkit and run:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && mkdir build && cd build
cmake -DLLAMA_CUBLAS=ON ..
make -j$(nproc)

This compiles with cuBLAS support.
For AMD GPUs install ROCm and use -DLLAMA_HIPBLAS=ON; for Apple Silicon use -DLLAMA_METAL=ON. Note that recent llama.cpp releases have renamed these build options to GGML_* equivalents (e.g. GGML_CUDA, GGML_BLAS, GGML_METAL), so check the repository’s build documentation if CMake rejects the LLAMA_* flags.
If you prefer prebuilt Docker images, add the --gpus all flag when running docker run and ensure the image was built with the correct GPU backend.
Finally, verify GPU usage with nvidia-smi or your vendor’s equivalent while generating tokens to confirm that token latency is reduced; the measurement sketch at the end of this section shows one way to check this from code.

  • Build with BLAS for CPU speed: If you do not have a GPU, you can still get good performance by compiling with BLAS (Basic Linear Algebra Subprograms) support.

Step‑by‑step:

  1. Install a BLAS library (for example on Debian/Ubuntu: sudo apt install libopenblas-dev, on Arch: sudo pacman -S openblas).
  2. Clone the llama.cpp source and create a build directory:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && mkdir build && cd build

  3. Run CMake with BLAS enabled, then compile:

cmake -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS ..
make -j$(nproc)

  4. After compilation you will have an optimised llama-server binary that uses OpenBLAS (or the BLAS library you installed) for faster matrix operations.

This is recommended for any CPU‑only server because it can roughly double or triple throughput compared with a plain build, and requires no changes to your model files or API usage.

  • Monitor logs: for systemd use journalctl -u codex-local -f, or for Docker docker logs -f <container>, to confirm that Codex requests are reaching your local server; the probe sketch below gives a more active check.
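
For a more active check than log‑watching, the following sketch exercises the endpoint and reports time to first token plus a rough chunks‑per‑second figure, which is handy for comparing the BLAS and GPU builds above. The base URL and model id are placeholders for your own setup.

import OpenAI from "openai";

// llama-server example port; use 11434 for Ollama.
const client = new OpenAI({ apiKey: "dummy-key", baseURL: "http://localhost:8080/v1" });

// Rough latency probe: one streamed completion, timing first token and total run.
async function probe(): Promise<void> {
  const start = Date.now();
  let firstTokenAt: number | null = null;
  let chunkCount = 0;

  const stream = await client.chat.completions.create({
    model: "codex-local", // placeholder: use the id from /v1/models
    messages: [{ role: "user", content: "Count from 1 to 50, one number per line." }],
    max_tokens: 256,
    stream: true,
  });

  for await (const chunk of stream) {
    if (chunk.choices[0]?.delta?.content) {
      if (firstTokenAt === null) firstTokenAt = Date.now();
      chunkCount += 1; // streamed chunks are a reasonable proxy for tokens
    }
  }

  const totalSec = (Date.now() - start) / 1000;
  console.log(`time to first token: ${(firstTokenAt ?? Date.now()) - start} ms`);
  console.log(`~${(chunkCount / totalSec).toFixed(1)} chunks/sec over ${totalSec.toFixed(1)} s`);
}

probe().catch((err) => {
  console.error("Probe failed – is the local server running?", err);
  process.exit(1);
});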

Bottom Line

OpenAI’s sudden overnight change is frustrating news for developers: many invested weeks of production time only to find that what was once unlimited now demands a costly Pro subscription. This serves as a stark reminder that relying entirely on a third‑party service leaves you vulnerable to abrupt policy shifts. By following the steps above you can:

  • keep Codex‑style functionality available even when OpenAI enforces quotas, giving your team freedom and continuity;
  • truly own your deployment and data so future policy changes can’t disrupt you;
  • develop without interruptions, and with the confidence that your tools will always be there when you need them.

Run Codex locally, and usage limits become a thing of the past.