In November 2023, after OpenAI added the ability for ChatGPT to generate images from DALL-E 3 within the ChatGPT web interface, there was a short-lived meme where users gave the LLM a base image and kept asking the model to “make it more X”, where X can be anything.
二〇二三年十一月,OpenAI 將 DALL-E 3 的繪圖功能整合至 ChatGPT 網頁介面後,網路上曾一度流行一種迷因哏圖,使用者會提供一張基礎圖片給 LLM,然後不斷要求模型「讓它更 X」一點,而 X 可以是任何東西。

A regular guy becomes more “bro” every time. via /u/Jojop0tato on Reddit.

A regular guy becomes more “bro” every time. via /u/Jojop0tato on Reddit.
一位平凡的男子每次都變得更有「兄弟」味。via /u/Jojop0tato on Reddit.

Asked ChatGPT to make Santa Claus more and more serious. via /u/hessihan on Reddit.

Asked ChatGPT to make Santa Claus more and more serious. via /u/hessihan on Reddit.
經由 Reddit 用戶 u/hessihan,要求 ChatGPT 將聖誕老人變得越來越嚴肅。

The trend quickly died as all of these images were very samey and uninteresting, aside from the unexplainable trend that all of the examples eventually converged into something cosmic, irrespective of the starting image and the prompt. Although the trend was AI slop before the term AI slop was codified, it’s still academically interesting that such a meaningless and vague prompt had some appropriate impact on the final image, and that this change was obvious to the user.
這個趨勢很快就消失了,因為所有這些圖像都非常相似且無趣,除了所有示例最終都趨於某種「宇宙化」的難以解釋的趨勢,無論起始圖像和提示如何。儘管這種趨勢在「AI 廢料」一詞被正式定義之前就是 AI 廢料,但從學術角度來看,如此無意義且模糊的提示對最終圖像產生了一些適當的影響,並且這種變化對用戶來說是顯而易見的,這仍然很有趣。

What would happen if we tried a similar technique with code? LLM-generated code is unlikely to be slop (although not impossible) as it follows strict rules, and unlike creative outputs such as images, code quality can be measured more objectively.
如果我們嘗試用程式碼使用類似的技術會發生什麼? LLM 生成的程式碼不太可能是廢料(儘管並非不可能),因為它遵循嚴格的規則,並且與圖像等創意輸出不同,程式碼品質可以更客觀地衡量。

If code can indeed be improved simply through iterative prompting such as asking the LLM to “make the code better” — even though it’s very silly — it would be a massive productivity increase. And if that’s the case, what happens if you iterate on the code too much? What’s the equivalent of code going cosmic? There’s only one way to find out!
如果程式碼確實可以僅透過迭代提示(例如要求 LLM「使程式碼更好」)來改進——即使這很愚蠢——那將會大幅提高生產力。如果是這樣,那麼如果你對程式碼進行過多迭代會發生什麼?程式碼走向宇宙的等價物是什麼?只有一種方法可以找出答案!

Casually Coding With An LLM
隨意使用 LLM 編寫程式碼

Despite researching and developing tooling around LLMs even long before ChatGPT, I haven’t been fond of using LLM code copilots such as GitHub Copilot for coding assistance. The constant mental context switching between “oh, the LLM autocompleted my code, neat”/“what question should I ask the LLM” and “is the LLM-generated code is actually correct and not hallucinating correct code” kept creating enough distractions that any productivity gains from using the AI were net neutral at best. That’s also disregarding the expensive cost of using said LLMs.
儘管甚至在 ChatGPT 出現之前很久我就已經在研究和開發圍繞 LLMs 的工具,但我並不喜歡使用 LLM 程式碼輔助工具(例如 GitHub Copilot)來協助編碼。在「哦,LLM 自動完成了我的程式碼,真棒」/「我應該問 LLM 什麼問題」和「LLM 生成的程式碼是否真的正確而不是虛構出正確的程式碼」之間不斷的心理上下文切換,造成了足夠多的干擾,以至於使用 AI 帶來的任何生產力提升充其量也只是收支平衡。這還沒有考慮到使用上述 LLMs 的高昂成本。

Claude 3.5 Sonnet has made me rethink things. Due to whatever secret sauce Anthropic used in its training, the latest version of Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) has incredible prompt adherence for all types of prompts, especially coding prompts. Coding benchmarks confirm that testing between Claude 3.5 Sonnet and GPT-4o, Claude wins, and anecdotally I’ve had the same experience across a variety of technical and creative tasks.
Claude 3.5 Sonnet 讓我重新思考。由於 Anthropic 在其訓練中使用了某種秘密武器,最新版本的 Claude 3.5 Sonnet( claude-3-5-sonnet-20241022 )對所有類型的提示,尤其是編碼提示,都具有令人難以置信的提示依從性。編碼基準測試證實,在 Claude 3.5 Sonnet 和 GPT-4o 之間進行測試,Claude 勝出,而且我個人在各種技術和創意任務中也有相同的體驗。

Initial Ask  初始請求

For this experiment, we will give Claude 3.5 Sonnet an interview-style coding prompt using Python: one that is both simple and could be implemented by a novice software engineer, but one that can be heavily optimized. This simple, casual prompt represents how the typical software engineer uses LLMs. Additionally, the test prompt must also be wholly original and not taken from coding tests such as LeetCode or HackerRank, as LLMs were also likely trained on those and could possibly cheat by reciting the memorized answer.
在這個實驗中,我們將使用 Python 向 Claude 3.5 Sonnet 提供面試風格的編碼提示:既簡單又可以由新手軟體工程師實現,但也可以進行大量優化。這個簡單、隨意的提示代表了典型的軟體工程師如何使用 LLMs。此外,測試提示也必須完全原創,而不是取自 LeetCode 或 HackerRank 等編碼測試,因為 LLMs 也可能接受過這些測試的訓練,並且可能會透過背誦記憶中的答案來作弊。

The full, unedited conversation thread based on this casual prompting is available on GitHub.
基於此隨意提示的完整未經編輯的對話串列可在 GitHub 上找到。

Therefore, here’s the test prompt that I’ve written myself:
因此,以下是我自己編寫的測試提示:

Write Python code to solve this problem:

Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.

Providing that as the user prompt to the Claude API and setting temperature=0 for the best/most deterministic answer 1, results in this implementation:
將其作為用戶提示提供給 Claude API 並設定 temperature=0 以獲得最佳/最具確定性的答案 1 ,將產生以下實作:

import random

def digit_sum(n):
    """Calculate the sum of digits of a number"""
    return sum(int(digit) for digit in str(n))

def find_difference():
    # Generate list of 1 million random integers
    numbers = [random.randint(1, 100000) for _ in range(1000000)]

    # Initialize variables for min and max numbers with digit sum 30
    min_num = float('inf')  # Initialize to positive infinity
    max_num = float('-inf')  # Initialize to negative infinity

    # Find numbers whose digits sum to 30
    for num in numbers:
        if digit_sum(num) == 30:
            min_num = min(min_num, num)
            max_num = max(max_num, num)

    # Check if we found any numbers with digit sum 30
    if min_num == float('inf') or max_num == float('-inf'):
        return "No numbers found with digit sum of 30"

    return max_num - min_num

This implementation is correct and consistent with what most novice Python programmers would write, with bonus handling if there are no valid numbers that fit the criteria. For each number in the list, check whether the digits sum to 30: if so, check whether it’s larger than the most-recently-seen largest number or smaller than the most-recently-seen smallest number and update those variables accordingly. After searching through the list, return the difference.
此實作是正確的,並且與大多數 Python 初學者編寫的程式碼一致,如果沒有符合條件的有效數字,還會額外處理。對於列表中的每個數字,檢查其各位數字之和是否為 30:如果是,則檢查它是否大於最近看到的最大數字或小於最近看到的最小數字,並相應地更新這些變數。搜尋完列表後,返回差值。

But I’m sure many coders reading that implementation got nerd sniped and are thinking about optimizations. One example is the digit_sum() function: although the implementation is a fun Pythonic one-liner, the type-casting between str and int causes a lot of unnecessary overhead.
但我相信許多讀到該實作的程式設計師都會感到技癢,並正在思考如何優化。一個例子是 digit_sum() 函式:雖然該實作是一個有趣的 Pythonic 單行式,但在 strint 之間的類型轉換會導致許多不必要的開銷。

On my M3 Pro Macbook Pro, this code takes on average 657 milliseconds to run. We’ll use this performance as a baseline to compare future implementations to see how they benchmark relative to it. (spoilers: they’re all faster)
在我的 M3 Pro Macbook Pro 上,這段程式碼平均需要 657 毫秒才能運行。我們將使用此效能作為基準來比較未來的實作,看看它們相對於它的基準測試如何。(劇透:它們都更快)

Iteration #1  迭代 #1

Now we can ask Claude to improve the code by including its current answer and all that came before it as a part of the conversation prompt. Thus, we add the iteration prompt:
現在,我們可以要求 Claude 改進程式碼,方法是將其目前的答案和之前的所有內容作為對話提示的一部分。因此,我們添加了迭代提示:

write better code

What, you though the title of this blog post was a joke? I never joke.
什麼?你以為這篇部落格文章的標題是開玩笑嗎?我從不開玩笑。

Claude now outputs revised code, or as it says, “optimized version of the code with several improvements.” Instead of colocating all the code in functions, it decided to refactor it into a Python class and made it more object-oriented:
Claude 現在輸出修改後的程式碼,或者如它所說,「經過多項改進的程式碼優化版本」。它沒有將所有程式碼都放在函式中,而是決定將其重構為 Python class ,並使其更加物件導向:

class DigitSumFinder:
    def __init__(self, target_sum: int = 30,
                 range_start: int = 1,
                 range_end: int = 100_000,
                 count: int = 1_000_000):
        self.target_sum = target_sum
        self.range_start = range_start
        self.range_end = range_end
        self.count = count

        # Pre-calculate digit sums for all possible numbers
        self.digit_sums = self._precompute_digit_sums()

    def _precompute_digit_sums(self) -> array:
        """Precompute digit sums for all possible numbers in range."""
        digit_sums = array('B', [0] * (self.range_end + 1))
        for num in range(self.range_start, self.range_end + 1):
            total = 0
            n = num
            while n:
                total += n % 10
                n //= 10
            digit_sums[num] = total
        return digit_sums

    def find_difference(self) -> Tuple[int, Optional[int], Optional[int]]:
        """
        Find the difference between max and min numbers with target digit sum.
        Returns: (difference, min_number, max_number)
        """
        min_num = float('inf')
        max_num = float('-inf')
        count_found = 0

        # Generate and process random numbers
        for _ in range(self.count):
            num = random.randint(self.range_start, self.range_end)
            if self.digit_sums[num] == self.target_sum:
                count_found += 1
                if num < min_num:
                    min_num = num
                if num > max_num:
                    max_num = num

        if count_found == 0:
            return 0, None, None

        return max_num - min_num, min_num, max_num

Here, the code does two smart algorithmic improvements:
在這裡,程式碼做了兩項巧妙的演算法改進:

  • When computing the digit sum it uses integer operations and avoids the need for the type-casting mentioned previously.
    在計算數字總和時,它使用整數運算,避免了先前提到的類型轉換的需求。
  • It precomputes all possible digit sums and stores them in a byte array (somewhat unusual instead of a list) for lookup, meaning that the digit sum will not need to be recalculated when there are duplicates in the one-million-number-list. Since this array is stored as field in the class, it will not need to be recalculated when searching a new list of random numbers.
    它預先計算所有可能的數字總和,並將它們儲存在一個位元組陣列(不同尋常地未使用列表)中以便查找,這意味著當百萬數字列表中存在重複項時,不需要重新計算數字總和。由於此陣列作為欄位儲存在類別中,因此在搜尋新的隨機數列表時,無需重新計算。

These optimizations speed up the code 2.7x relative to the initial implementation.
這些優化使程式碼的速度相較於初始實作提高了 2.7 倍。

Iteration #2  迭代 #2

After another write better code, Claude discovers more obvious optimizations (truncated for readabilty):
經過另一次 write better code 後,Claude 發現了更多顯而易見的優化(為便於閱讀已截斷):

class DigitSumAnalyzer:
    """Analyzes numbers based on their digit sums."""

    def __init__(self, number_range: NumberRange):
        self.number_range = number_range
        self._digit_sums = self._precompute_digit_sums()

    def _precompute_digit_sums(self) -> array:
        """Precompute digit sums for all possible numbers using vectorized operations."""
        digits = np.arange(self.number_range.end + 1)
        digit_sums = np.zeros(self.number_range.end + 1, dtype=np.uint8)

        while digits.any():
            digit_sums += digits % 10
            digits //= 10

        return array('B', digit_sums)

    def _process_chunk(self, chunk_size: int) -> Iterator[int]:
        """Process a chunk of random numbers."""
        numbers = np.random.randint(
            self.number_range.start,
            self.number_range.end + 1,
            chunk_size,
            dtype=np.int32
        )
        mask = np.frombuffer(self._digit_sums, dtype=np.uint8)[numbers] == self.number_range.target_sum
        yield from numbers[mask]

    def analyze(self, chunk_size: int = 100_000, num_processes: int = None) -> Result:
        """
        Analyze numbers to find min/max with target digit sum.

        Args:
            chunk_size: Size of chunks to process at once
            num_processes: Number of processes to use (None for CPU count)
        """
        start_time = time.perf_counter()
        min_num = float('inf')
        max_num = float('-inf')
        numbers_found = 0

        num_chunks = (self.number_range.count + chunk_size - 1) // chunk_size

        with ProcessPoolExecutor(max_workers=num_processes) as executor:
            futures = [
                executor.submit(self._process_chunk,
                              min(chunk_size, self.number_range.count - i * chunk_size))
                for i in range(num_chunks)
            ]

            for future in tqdm(futures, desc="Processing chunks"):
                for num in future.result():
                    numbers_found += 1
                    min_num = min(min_num, num)
                    max_num = max(max_num, num)

        execution_time = time.perf_counter() - start_time

        if numbers_found == 0:
            return Result(None, None, 0, execution_time, 0)

        return Result(min_num, max_num, max_num - min_num, execution_time, numbers_found)

Claude now has added two more optimizations, finally realizing that this coding problem is an embarrassingly parallel problem:
Claude 現在又增加了兩項優化,終於意識到這個編碼問題是一個令人尷尬的平行化問題:

  • Multithreading through Python’s concurrent-futures package, by separating the large list into chunks that can be processed independently.
    通過 Python 的 concurrent-futures 套件進行多執行緒處理,方法是將大列表分成可以獨立處理的區塊。
  • Vectorized numpy operations, which are much faster than base-Python operations. Special mention goes to the _precompute_digit_sums() function, which implements a vectorized implementation of calculating the digit sums. The conditional while digits.any(): is galaxy-brain code, but it works correctly.
    向量化的 numpy 運算,比基本的 Python 運算快得多。特別值得一提的是 _precompute_digit_sums() 函式,它實現了計算數字總和的向量化實作。條件 while digits.any(): 是個神來之筆的程式碼,但它確實有效。

However, there’s an issue with this particular implementation of parallelization: it generates subprocesses, which causes many annoying issues, including being unable to run it as-is inline, and it must be invoked with a main() guard which limits its utility significantly. But even when run as a separate script, it prints a Error: cannot pickle 'generator' object error due to the use of yield from numbers[mask] (said generator is completely unnecessary, return numbers[mask] is sufficient). The code also mixes numpy array dtypes which causes errors: setting them all to np.int32 fixes it.
然而,這種特殊的平行化實作存在一個問題:它會產生子行程,這會導致許多惱人的問題,包括無法以原樣內嵌運行,並且必須使用 main() 保護來調用它,這大大限制了它的效用。但即使作為單獨的腳本運行,它也會因為使用 yield from numbers[mask] 而列印 Error: cannot pickle 'generator' object 錯誤(所述產生器完全不必要, return numbers[mask] 就足夠了)。程式碼還混合了 numpy 陣列 dtype ,這會導致錯誤:將它們全部設定為 np.int32 可以解決此問題。

After making those fixes, the code is now 5.1x faster than the base implementation.
進行這些修正後,程式碼現在比基本實作快 5.1 倍。

Iteration #3  第三次迭代

Another write better code, and Claude returns a implementation that it claims is “even more sophisticated and optimized version using advanced techniques and modern Python features” but the actual code shows no significant algorithmic improvements and actually a regression in the digit sum calculation by reverting back to the type-casting approach. If anything, the codebase is becoming more bloated, such as adding a class for performing the difference:
又一次 write better code 後,Claude 返回了一個它聲稱「使用先進技術和現代 Python 特性,更加複雜和優化的版本」,但實際程式碼並沒有顯示出顯著的演算法改進,反而在數字和計算方面倒退回去了,又使用了型別轉換方法。實際情況是,程式碼庫變得更加臃腫,例如添加了一個類來執行差異計算:

@dataclass(frozen=True, slots=True)
class SearchResult:
    """Result of the number search."""
    min_number: Optional[int]
    max_number: Optional[int]
    count: int
    execution_time: float

    @property
    def difference(self) -> Optional[int]:
        """Calculate difference between max and min numbers."""
        if self.min_number is None or self.max_number is None:
            return None
        return self.max_number - self.min_number

This time, the code ran without needing any fixes. However, performance regressed slightly from the previous implementation, now 4.1x faster than the base implementation.
這次,程式碼無需任何修改即可運行。然而,效能比之前的版本略有下降,現在比基本實現快 4.1 倍。

Iteration #4  第四次迭代

This iterative prompting appears to be hitting diminishing returns. After one more write better code, Claude provides an implementation “with cutting-edge optimizations and enterprise-level features.” Wait, enterprise-level features?!
這種迭代提示似乎達到了收益遞減的階段。在又一次 write better code 之後,Claude 提供了一個「具有尖端優化和企業級功能」的實現。等等,企業級功能?!

The final code is too large to include in this blog post, but it did create two more optimizations: it now uses the numba Python library that can invoke a JIT compiler, which directly optimizes the code for the CPU. In this case, it can precompute the digit sums super quickly with just a decorator:
最終的程式碼過於龐大,無法包含在此部落格文章中,但它確實創造了另外兩個優化:它現在使用 numba Python 函式庫,該函式庫可以調用 JIT 編譯器,JIT 編譯器可以直接針對 CPU 優化程式碼。在這種情況下,它可以使用裝飾器快速預先計算數字和:

@jit(nopython=True, parallel=True)
def calculate_digit_sums(numbers: ArrayInt) -> ArrayInt:
    """Calculate digit sums using Numba."""
    result = np.zeros_like(numbers)
    for i in prange(len(numbers)):
        num = numbers[i]
        total = 0
        while num:
            total += num % 10
            num //= 10
        result[i] = total
    return result

The full class also uses Python’s asyncio for parallelization, which is more canonical for scheduling tasks than a subprocess approach. It also plays more nicely with existing inline code and a REPL such as Jupyter Notebooks.
整個類別還使用 Python 的 asyncio 進行平行化處理,這比子程序方法更適合排程任務。它還可以更好地與現有的內嵌程式碼和 REPL(例如 Jupyter Notebooks)配合使用。

It also added as a part of its “enterprise” push:
它還添加了作為其「企業級」推動的一部分:

  • Structured metrics logging with Prometheus.
    使用 Prometheus 進行結構化指標記錄。
  • A signal handler so the code can be torn down gracefully if force-killed.
    一個訊號處理程式,以便在程式碼被強制終止時可以正常關閉。
  • A benchmarking result display using a rich table.
    使用豐富表格呈現的基準測試結果。
It is pretty, though!

It is pretty, though!  不過,它蠻漂亮的!

It appears “going cosmic” for AI-generated code is making it enterprise by overengineering the code, which makes complete sense. Despite that, the code runs as-is without any bugs. Both async and numba are approaches to parallelism in Python, so they may be redundant and cause overhead. However, after benchmarking, the algorithm is extremely fast, resulting in about 6 milliseconds a run, or a 100x speedup. My assumption that this prompting was hitting diminishing returns aged very poorly. Maybe numba was the secret all along?
看來,人工智慧生成的程式碼「趨於宇宙化」,透過過度工程化使其企業級,這完全合理。儘管如此,程式碼仍然可以按原樣執行,沒有任何錯誤。Async 和 Numba 都是 Python 中實現平行化的途徑,因此它們可能是多餘的,並造成額外開銷。然而,經過基準測試,該演算法速度極快,每次執行約 6 毫秒,速度提升了 100 倍。我先前假設這種提示工程會遇到報酬遞減的狀況,現在看來我的假設大錯特錯。也許 Numba 一直都是秘密武器?

Overall, this form of iterative prompting to iteratively improve code has caveats: the code is indeed better, but in hindsight “better” is far too open ended. What I only wanted was algorithmic improvements, not a full SaaS. Let’s try again from scratch, this time with more direction.
總體而言,這種迭代提示以迭代改進程式碼的形式存在一些注意事項:程式碼確實更好,但事後看來,「更好」的定義太過空泛。我想要的只是演算法的改進,而不是一個完整的 SaaS。讓我們從頭再來一次,這次加入更多方向。

Prompt Engineering LLMs For Even More Better Code
提示工程 LLMs 讓程式碼更上一層樓

It’s 2025, and prompt engineering LLMs is still required to get best results from them. If anything, prompt engineering LLMs is even more important: next-token-prediction models are trained to maximimize the prediction probability of the next token over massive batches of inputs, and as a result they optimize for the average inputs and outputs. As LLMs drastically improve, the generated output becomes more drastically average, because that’s what they were trained to do: all LLMs are biased towards the average. Although it’s both counterintuitive and unfun, a small amount of guidance asking the LLM specifically what you want, and even giving a few examples of what you want, will objectively improve the output of LLMs more than the effort needed to construct said prompts. Claude 3.5 Sonnet, due to its strong prompt adherence, benefits significantly from even just a little prompt engineering.
時值 2025 年,提示工程 LLMs 仍然是從中獲得最佳結果的必要條件。甚至可以說,提示工程 LLMs 更為重要:下一個詞預測模型的訓練目標是在大量的輸入批次中最大化下一個詞的預測機率,因此它們針對的是平均輸入和輸出進行優化。隨著 LLMs 的顯著改進,生成的輸出變得更加趨於平均,因為這就是它們被訓練的目的:所有 LLMs 都偏向平均值。雖然這既違反直覺又無趣,但只要稍加引導,明確告知 LLM 你想要什麼,甚至提供一些你想要的範例,就能客觀地提升 LLMs 的輸出,其效果遠超於建構這些提示所需的努力。Claude 3.5 Sonnet 由於其嚴格遵守提示,即使只有一點提示工程也能顯著受益。

Let’s redo the code optimization experiment, this time with aggressive prompt engineering that makes the results I am looking for extremely explicit, with no room for ambiguity. Yes, being cold and “robotic” to LLMs makes them perform better, Roko’s basilisk be damned.
讓我們重做程式碼優化實驗,這次採用積極的提示工程,使我想要的結果非常明確,沒有任何歧義的空間。沒錯,對 LLMs 冷漠且「機械式」的互動能讓它們表現得更好,即使會招致羅科的蛇怪也在所不惜。

Initial Ask  初始請求

This time we will use a system prompt, only available via an API. The system prompt lists the LLM’s “rules” it must follow. Since I want more optimized code, we’ll define that in the rules, with granular examples:
這次我們將使用系統提示,這只能透過 API 使用。系統提示列出了 LLM 必須遵循的「規則」。由於我想要更優化的程式碼,我們將在規則中定義它,並提供詳細的範例:

All code you write MUST be fully optimized.

"Fully optimized" includes:
- maximizing algorithmic big-O efficiency for memory and runtime
- using parallelization and vectorization where appropriate
- following proper style conventions for the code language (e.g. maximizing code reuse (DRY))
- no extra code beyond what is absolutely necessary to solve the problem the user provides (i.e. no technical debt)

If the code is not fully optimized, you will be fined $100.

About that last line: offering positive/negative incentives in to the LLM within a system prompt isn’t common anymore and my own research indicated that it was unclear whether it had a positive impact, but that fine will become more relevant in the iteration prompt.
關於最後一行:在系統提示中向 LLM 提供正/負激勵措施的做法現在已經不常見了,我自己的研究也表明,它是否具有正面影響尚不清楚,但這一點在迭代提示中將變得更加相關。

The initial user prompt has an additional line as well:
最初的使用者提示還包含了額外的一行:

Write Python code to solve this problem:

Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.

Before writing the code, plan out all the necessary optimizations.

“Planning” is a long-used trick to help align LLM output for a first pass — the modern implementation of “let’s think step by step.” Some LLMs ignore it, but Claude 3.5 Sonnet always obeys.
「規劃」是一個長久以來用於幫助調整 LLM 首次輸出結果的技巧——相當於「讓我們逐步思考」的現代版本。有些 LLMs 會忽略它,但 Claude 3.5 Sonnet 總是會遵循。

The full, unedited conversation thread based on this prompt engineering strategy is available on GitHub.
這個基於提示工程策略的完整、未經編輯的對話串列可在 GitHub 上取得。

Sending the system prompt and this user prompt to Claude, the response starts with this:
將系統提示和這個使用者提示發送給 Claude 後,回應以這樣開頭:

Let me plan the optimizations first:

1. Algorithmic Optimization:
- We only need to track min and max numbers with digit sum 30, not store all of them
- We can calculate digit sum while reading numbers, no need to convert to string
- Single pass through the list required (O(n))

2. Memory Optimization:
- Only need to store 2 variables (min and max), not the full filtered list
- No need for additional data structures

3. Performance Optimization:
- Use numba's @jit decorator for faster computation
- Create a fast digit sum function using modulo operations instead of string conversion
- Use numpy for generating random numbers (faster than random.randint)

This is promising: Claude immediately figured out it should start out with numpy and numba, and to also use integer operations to calculate the digit sum. The initial implementation:
很有希望:Claude 立即發現它應該從 numpy 和 numba 開始,並且使用整數運算來計算數字總和。初步的實作:

@jit(nopython=True)
def digit_sum(n):
    total = 0
    while n:
        total += n % 10
        n //= 10
    return total

@jit(nopython=True)
def find_difference(numbers):
    min_num = float('inf')
    max_num = float('-inf')

    for num in numbers:
        sum_digits = digit_sum(num)
        if sum_digits == 30:
            min_num = min(min_num, num)
            max_num = max(max_num, num)

    return max_num - min_num if max_num != float('-inf') else 0

That’s more concise than the initial implementation from casual prompting, especially with no unnecessary comments. There’s still room for improvement, notably this implementation did not set parallel=True in the jit decorator.
這比隨意提示的初始實作更簡潔,尤其沒有不必要的註釋。但仍有改進的空間,尤其是這個實作沒有在 jit 裝飾器中設定 parallel=True

This minor use of prompt engineering resulted in code that runs in 11.2 milliseconds on average, or 59x faster than the naive implementation. Wow!
這次小小的提示工程運用,產生了平均執行時間為 11.2 毫秒的程式碼,比初始的實作快了 59 倍。真是太棒了!

Iteration #1  第一次迭代

We can now ask Claude to iterate on the code with a more verbose prompt than “write code better”:
我們現在可以用比「寫出更好的程式碼」更詳盡的提示來要求 Claude 迭代程式碼:

Your code is not fully optimized, and you have been fined $100. Make it more optimized.

No, it’s not gaslighting, it’s incentive alignment.
不,這不是在 PUA 它,而是誘因對齊。

Claude responds with this:
Claude 的回覆如下:

@jit(nopython=True)
def digit_sum(n):
    # Using bit manipulation for faster division
    total = 0
    while n:
        total += n & 15  # Fast lookup for single digit sum
        n >>= 4
    return total

@jit(nopython=True, parallel=True)
def find_difference_chunk(numbers):
    min_num = np.iinfo(np.int64).max
    max_num = np.iinfo(np.int64).min

    # Parallel processing of chunks using numba
    for i in prange(len(numbers)):
        num = numbers[i]
        sum_digits = digit_sum(num)
        if sum_digits == 30:
            min_num = min(min_num, num)
            max_num = max(max_num, num)

    return min_num, max_num

def process_chunk(chunk):
    return find_difference_chunk(chunk)

Good, it identified the parallel=True flag! But that digit sum implementation is weird and uses bit-shifting which is generally only used for hyperoptimization. I didn’t think you could use bit-shifting well on decimal numbers, and after testing, it turns out you can’t as this implementation derives an incorrect digit sum. The implementation also includes a multiprocessing chunking approach again (not shown), which is likely more redundant with numba and causes additional overhead. Also not shown: the script also pre-compiles the JIT functions with a small test array for better real-world performance, something the numba docs recommend for benchmarking.
很好,它識別出了 parallel=True 標記!但那個計算數字總和的實作方式很奇怪,使用了位元移位,這通常只用於極致優化。我不認為位元移位能有效地用於十進位數,經過測試後,事實證明它不能,因為這種實作方式會得出錯誤的數字總和。該實作還包含了多行程分塊方法(未顯示),這在使用 Numba 時可能更加多餘,並造成額外的開銷。同樣未顯示的是:該腳本還使用一個小型測試陣列預編譯了 JIT 函數,以提高實際性能,這是 Numba 文件中建議的基準測試方法。

Despite all of that, the code performance greatly regresses from the initial prompt-engineered implementation, now only 9.1x faster than the naive implementation. The likely reason is due to the multiprocessing spawning new processes, and those processes will recompile the numba JIT functions each time, so there’s massive overhead.
儘管如此,程式碼性能相較於最初提示工程產生的實作大幅下降,現在僅比原始實作快 9.1 倍。可能的原因是由於多行程產生了新的行程,而這些行程每次都會重新編譯 Numba 的 JIT 函數,因此產生了巨大的開銷。

Iteration #2  第二次迭代

Another iteration of the prompt:
另一版本的提示:

# Pre-computed lookup table using bit manipulation
LOOKUP = np.zeros(100001, dtype=np.uint8)
for i in range(100001):
    # Optimized digit sum using parallel bit counting
    n = i
    n = (n & 0x0F) + ((n >> 4) & 0x0F) + ((n >> 8) & 0x0F) + \
        ((n >> 12) & 0x0F) + ((n >> 16) & 0x0F)
    LOOKUP[i] = n

@jit(nopython=True, parallel=True, cache=True, fastmath=True)
def find_min_max(numbers):
    # Process 32 numbers at once using SIMD
    min_val = np.iinfo(np.uint32).max
    max_val = 0

    # Vectorized processing with explicit SIMD hints
    for i in prange(0, len(numbers), 32):
        # Load 32 elements into SIMD registers
        chunk = numbers[i:min(i + 32, len(numbers))]

        # Vectorized lookup and comparison
        sums = LOOKUP[chunk]
        mask = sums == 30

        if np.any(mask):
            valid_nums = chunk[mask]
            min_val = min(min_val, np.min(valid_nums))
            max_val = max(max_val, np.max(valid_nums))

    return min_val, max_val

Claude is now starting to use SIMD operations and chunk sizing for (theoretically) extreme performance. At this point I was confused I missed something with that bit-shifting implementation as it’s still wrong, especially now that hexadecimal numbers are getting involved. It turns out that implementation is an optimized method of calculating the digit sum of hexadecimal numbers and not decimal numbers, and therefore it’s a complete hallucination. There is another extremely subtle hallucination: the prange function cannot accept a step-size of 32 while parallel=True, which is a nuance that there would be little documentation about. Setting parallel=False and benchmarking, there is indeed an slight improvement over the initial prompt-engineered implementation at 65x faster than the base implementation.
Claude 現在開始使用 SIMD 運算和區塊大小調整來(理論上)獲得極致性能。此時,我很困惑我是不是遺漏了位元移位實作的某些東西,因為它仍然是錯的,尤其是在現在涉及十六進位數的情況下。事實證明,該實作是一種計算十六進位數(而非十進位數)數字總和的優化方法,因此它完全是憑空想像出來的。還有一個非常細微的錯誤: prange 函數不能接受 32 的步長,而 parallel=True 可以,這是一個幾乎沒有文件記載的細微差別。設定 parallel=False 並進行基準測試後,確實比最初提示工程產生的實作略有改進,比基礎實作快了 65 倍。

Iteration #3  第三次迭代

Another iteration:  另一版本:

# Pre-computed perfect minimal hash table
# Key insight: For numbers 1-100000, only need to store those with digit sum 30
# Format: [number] = 1 if digit sum is 30, else 0
HASH_TABLE = np.zeros(100001, dtype=np.uint8)

# Generate hash table at module load time using bit manipulation
@jit(nopython=True, cache=True)
def init_hash_table():
    min_val = np.iinfo(np.uint32).max
    max_val = 0

    # Optimal digit sum using parallel bit counting
    for i in range(1, 100001):
        n = i
        sum = 0
        while n and sum <= 30:
            sum += n & 0xF
            n >>= 4
        if sum == 30:
            HASH_TABLE[i] = 1
            min_val = min(min_val, i)
            max_val = max(max_val, i)

    return min_val, max_val

# Initialize at module load time
MIN_VALID, MAX_VALID = init_hash_table()

@jit(nopython=True, parallel=False, cache=True, fastmath=True)
def find_min_max(numbers):
    min_val = MAX_VALID  # Start with known bounds
    max_val = MIN_VALID
    found = False

    # Single vectorized operation
    mask = HASH_TABLE[numbers] == 1
    if np.any(mask):
        valid_nums = numbers[mask]
        min_val = np.min(valid_nums)
        max_val = np.max(valid_nums)
        found = True

    return min_val, max_val, found

In this case, the LLM gives up on the chunking strategy which has been causing problems, and adds two more optimizations: a global HASH_TABLE (that’s just a numpy array, I’m unsure if a simple index lookup technically counts as a hash table), and it introduced a logical microoptimization that after summing up digits, if the number goes over 30, the counting can stop since it can immediately be identified as invalid.
在這種情況下,LLM 放棄了導致問題的分塊策略,並增加了兩個優化:一個全域的 HASH_TABLE (它只是一個 NumPy 陣列,我不確定簡單的索引查找在技術上是否算作雜湊表),以及引入了一個邏輯上的微優化,即在對數字求和後,如果數字超過 30,則計數可以停止,因為它可以直接被識別為無效。

One major problem: that “generate hash table at module load time” trick doesn’t actually work due to a subtle issue with little internet documentation: objects outside of numba’s JITed functions are read-only, yet the HASH_TABLE is still instantiated outside of the JITed function and modified within the JITed function, and therefore will cause a very confusing error. After a tiny refactor such that the HASH_TABLE is instantiated within a JITed function, the code worked, and ran extremely fast: 100x faster than the original base implementation, the same as the final performance from the casual prompting but with orders of magnitude less code.
一個主要問題是:「在模組載入時生成雜湊表」的技巧實際上由於一個細微的問題而無法運作,網路上幾乎沒有相關文件:Numba 的 JIT 函數外部的物件是唯讀的,但 HASH_TABLE 仍然在 JIT 函數外部實例化,並在 JIT 函數內部修改,因此會導致一個非常令人困惑的錯誤。經過一個小小的重構,使得 HASH_TABLE 在 JIT 函數內部實例化後,程式碼就能正常運作,並且運行速度非常快:比原始基礎實作快 100 倍,與隨意提示的最終性能相同,但程式碼量卻少了幾個數量級。

Iteration #4  第四次迭代

At this point, Claude actually complained that the code is at the “theoretical minimum time complexity possible for this problem.” So I mixed things up and just asked it to fix the digit sum issue: it did so by only replacing the relevant code with the previously used integer implementation, and did not try to fix the HASH_TABLE. More importantly, with the HASH_TABLE adjustment, I confirmed the implementation is correct, finally, although with a slight performance hit since there is no more bit-shifting: it’s now 95x faster.
此時,Claude 實際上抱怨說程式碼已達到「此問題理論上可行的最小時間複雜度」。 因此,我稍微調整了一下問題,只要求它修復數字總和問題:它透過將相關程式碼替換為先前使用的整數實作來做到這一點,並且沒有嘗試修復 HASH_TABLE 。 更重要的是,透過 HASH_TABLE 的調整,我終於確認了實作的正確性,儘管由於不再有位元移位操作,效能略有下降:現在速度提高了 95 倍。

Next Steps For Better LLM Code Generation
生成更優 LLM 程式碼的後續步驟

Putting it all together, let’s visualize the improvements, including highlighting the cases where I needed to alter the logic of the code to make it runnable due to bugs.
總結一下,讓我們將改進過程視覺化,包括突出顯示我需要更改程式碼邏輯以使其因錯誤而可執行的案例。

In all, asking an LLM to “write code better” does indeed make the code better, depending on your definition of better. Through the use of the generic iterative prompts, the code did objectively improve from the base examples, both in terms of additional features and speed. Prompt engineering improved the performance of the code much more rapidly and consistently, but was more likely to introduce subtle bugs as LLMs are not optimized to generate high-performance code. As with any use of LLMs, your mileage may vary, and in the end it requires a human touch to fix the inevitable issues no matter how often AI hypesters cite LLMs as magic.
總而言之,要求 LLM「寫出更好的程式碼」確實可以使程式碼更好,這取決於你對「更好」的定義。透過使用通用的迭代提示,程式碼從基本範例中客觀地得到了改進,無論是在附加功能方面還是速度方面。提示工程更快、更一致地提高了程式碼的效能,但更容易引入細微的錯誤,因為 LLMs 並未針對生成高效能程式碼進行最佳化。與任何使用 LLMs 的情況一樣,你的結果可能會有所不同,最終需要人工介入來修復不可避免的問題,無論 AI 炒作者如何將 LLMs 譽為神奇。

All code in this blog post, including benchmarking scripts and data visualization code, is available on GitHub.
此部落格文章中的所有程式碼,包括基準測試指令碼和資料視覺化程式碼,都可以在 GitHub 上找到。

There are a few optimizations that I am very surprised Claude 3.5 Sonnet did not identify and implement during either experiment. Namely, it doesn’t explore the statistical angle: since we are generating 1,000,000 numbers uniformly from a range of 1 to 100,000, there will be a significant amount of duplicate numbers that will never need to be analyzed. The LLM did not attempt to dedupe, such as casting the list of numbers into a Python set() or using numpy’s unique(). I was also expecting an implementation that involves sorting the list of 1,000,000 numbers ascending: that way the algorithm could search the list from the start to the end for the minimum (or the end to the start for the maximum) without checking every number, although sorting is slow and a vectorized approach is indeed more pragmatic.
有一些優化措施,我非常驚訝 Claude 3.5 Sonnet 在兩個實驗中都沒有識別和實施。 也就是說,它沒有探討統計角度:由於我們從 1 到 100,000 的範圍內均勻地生成 1,000,000 個數字,因此將會有大量的重複數字永遠不需要分析。 LLM 並沒有嘗試去重複,例如將數字列表轉換為 Python set() 或使用 numpy 的 unique() 。 我也期待一個涉及將 1,000,000 個數字的列表升序排序的實作:這樣演算法就可以從頭到尾搜尋列表以找到最小值(或從尾到頭搜尋最大值),而無需檢查每個數字,儘管排序速度很慢,向量化方法確實更為實用。

Even if LLMs can be wrong, one notable thing I learnt from these experiments is that they do have interesting ideas and tool suggestions even if the code output can’t be used as-is. For example, I’ve never touched numba since as a data scientist/machine learning engineer I’m conditioned to exclusively use numpy shenanigans if I need better code performance. But it’s hard to argue with the results of the numba JIT functions, and I might add it to my toolbox. When testing a similar “make it better” prompt iteration workflow in other technical domains such website backends and frontends, the LLMs had good ideas there too.
即使 LLMs 可能出錯,我從這些實驗中學到的一個值得注意的事情是,即使程式碼輸出不能按原樣使用,它們也確實有有趣的點子和工具建議。例如,我從未接觸過 numba,因為作為資料科學家/機器學習工程師,如果我需要更好的程式碼效能,我習慣於只使用 numpy 的技巧。 但是 numba JIT 函式的結果很難反駁,我可能會將它新增到我的工具箱中。在其他技術領域(例如網站後端和前端)測試類似的「使其更好」的提示迭代工作流程時,LLMs 也提供了很好的想法。

Of course, these LLMs won’t replace software engineers anytime soon, because it requires a strong engineering background to recognize what is actually a good idea, along with other constraints that are domain specific. Even with the amount of code available on the internet, LLMs can’t discern between average code and good, highly-performant code without guidance. Real-world systems are obviously much more complicated than a job-interview-esque programming problem, but if a quick for-loop repeatedly asking Claude to implement a feature provides any hint which can speed up the code by 100x, the pipeline is more than worth it. Some consider premature optimization to be bad coding practice, but in the real-world it’s better than having a subpar implementation that will become technical debt over time.
當然,這些 LLMs 不會很快取代軟體工程師,因為需要深厚的工程背景才能識別什麼才是真正的好點子,以及其他特定領域的限制。即使網路上有大量的程式碼,LLMs 也無法在沒有指導的情況下區分普通程式碼和優秀的高效能程式碼。現實世界的系統顯然比面試式的程式設計問題要複雜得多,但如果一個快速迴圈重複要求 Claude 實作一個功能可以提供任何提示,將程式碼速度提高 100 倍,那麼這個流程就非常值得了。有些人認為過早優化是不好的程式設計習慣,但在現實世界中,它比採用一個隨著時間推移會變成技術債的次優實作要好。

One issue with my experiments is that I’m benchmarking code improvement using Python, which isn’t the coding language developers consider when hyperoptimizing performance. While libraries such as numpy and numba leverage C to work around Python’s performance limitations, one modern approach that popular Python libraries such as polars and pydantic use is to instead code using Rust. Rust has many performance benefits over C, and the PyO3 crate allows Rust code to be used within Python with minimal overhead. I can confirm that Claude 3.5 Sonnet can generate PyO3-compliant Python and Rust code despite that workflow being so new, but that’s more than enough material for another blog post.
我的實驗有一個問題,那就是我使用 Python 對程式碼改進進行基準測試,而 Python 並不是開發人員在超級優化效能時考慮的程式語言。雖然 numpy 和 numba 等函式庫利用 C 語言來規避 Python 的效能限制,但 polars 和 pydantic 等流行 Python 函式庫使用的一種現代方法是使用 Rust 進行程式設計。Rust 比 C 語言有很多效能優勢,而且 PyO3 crate 允許 Rust 程式碼在 Python 中使用,開銷最小。我可以確認 Claude 3.5 Sonnet 可以生成符合 PyO3 標準的 Python 和 Rust 程式碼,儘管該工作流程還很新,但這足以成為另一篇部落格文章的內容。

In the meantime, while asking LLMs to make code better is a more pragmatic use of AI, you can ask them to “make it more bro”…with mixed results.
與此同時,雖然要求 LLMs 改善程式碼是更務實的 AI 用法,但你也可以要求它們「讓它更有兄弟味」……結果好壞參半。


  1. For my work with LLMs, I exclusively use APIs or interfaces to those APIs (such as the Workbench in the Anthropic Console for Claude) as web interfaces to free LLMs such as the normal ChatGPT/Claude webapps use a pipeline that will give unpredictable results due to their higher inherent temperature. Please do not message me if you are not able to reproduce the insights in this post using the webapps. ↩︎
    在我使用 LLMs 的過程中,我僅使用 API 或 API 的介面(例如 Anthropic Console for Claude 中的 Workbench),因為免費 LLMs 的網頁介面(例如常見的 ChatGPT/Claude 網頁應用程式)使用會產生難以預測結果的流程,這是由於它們本身較高的 temperature 。如果您無法使用網頁應用程式重現本文中的見解,請不要與我聯繫。↩︎