Lately, I’ve been working on codifying a personal ethics statement about my stances on generative AI as I have been very critical about several aspects of modern GenAI, and yet I participate in it. While working on that statement, I’ve been introspecting on how I myself have been utilizing large language models for both my professional work as a Senior Data Scientist at BuzzFeed and for my personal work blogging and writing open-source software. For about a decade, I’ve been researching and developing tooling around text generation, from char-rnns, to the ability to fine-tune GPT-2, to experiments with GPT-3, and even more experiments with ChatGPT and other LLM APIs. Although I don’t claim to be the best user of modern LLMs out there, I’ve had plenty of experience working against the cons of next-token predictor models and have become very good at finding the pros.
It turns out, to my surprise, that I don’t use them nearly as often as people think engineers do, but that doesn’t mean LLMs are useless for me. It’s a discussion that requires case-by-case nuance.
How I Interface With LLMs
Over the years I’ve utilized all the tricks to get the best results out of LLMs. The most famous trick is prompt engineering, or the art of phrasing the prompt in a specific manner to coach the model to generate a specific constrained output. Additions to prompts such as offering financial incentives to the LLM or simply telling the LLM to make their output better do indeed have a quantifiable positive impact on both improving adherence to the original prompt and the output text quality. Whenever my coworkers ask me why their LLM output is not what they expected, I suggest that they apply more prompt engineering and it almost always fixes their issues.
No one in the AI field is happy about prompt engineering, especially myself. Attempts to remove the need for prompt engineering with more robust RLHF paradigms have only made it even more rewarding by allowing LLM developers to make use of better prompt adherence. True, “Prompt Engineer” as a job title turned out to be a meme but that’s mostly because prompt engineering is now an expected skill for anyone seriously using LLMs. Prompt engineering works, and part of being a professional is using what works even if it’s silly.
To that end, I never use ChatGPT.com or other normal-person frontends for accessing LLMs because they are harder to control. Instead, I typically access the backend UIs provided by each LLM service, which serve as light wrappers over the API functionality and also make it easy to port a workflow to code if necessary. Accessing LLM APIs like the ChatGPT API directly allows you to set system prompts that control the “rules” for the generation, and those rules can be very nuanced. Specifying specific constraints for the generated text such as “keep it to no more than 30 words” or “never use the word ‘delve’” tends to be more effective in the system prompt than putting them in the user prompt as you would with ChatGPT.com. Any modern LLM interface that does not let you explicitly set a system prompt is most likely using its own system prompt which you can’t control: for example, when ChatGPT.com had an issue where it was too sycophantic to its users, OpenAI changed the system prompt to command ChatGPT to “avoid ungrounded or sycophantic flattery.” I tend to use Anthropic Claude’s API — Claude Sonnet in particular — more than any ChatGPT variant because Claude anecdotally is less “robotic” and also handles coding questions much more accurately.
Additionally with the APIs, you can control the “temperature” of the generation, which at a high level controls the creativity of the generation. LLMs by default do not select the next token with the highest probability, in order to allow them to give different outputs for each generation, so I prefer to set the temperature to 0.0 so that the output is mostly deterministic, or 0.2 - 0.3 if some light variance is required. Modern LLMs now use a default temperature of 1.0, and I theorize that this higher value accentuates LLM hallucination issues where the text outputs are internally consistent but factually wrong.
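As a quick illustration of both ideas, here’s a minimal sketch using the OpenAI Python client; the model name and the example constraints are assumptions for demonstration rather than a recommendation, and the same pattern applies to other providers’ APIs.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: any current chat model
    temperature=0.0,      # mostly deterministic output
    messages=[
        # The system prompt carries the generation "rules"...
        {
            "role": "system",
            "content": "Keep it to no more than 30 words. Never use the word 'delve'.",
        },
        # ...and the user prompt carries the actual request.
        {"role": "user", "content": "Explain what a system prompt is."},
    ],
)
print(response.choices[0].message.content)
```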
LLMs for Professional Problem Solving!
With that context established, I can now talk about how I have used generative LLMs over the past couple of years at BuzzFeed. Here are outlines of some (out of many) projects I’ve worked on using LLMs to successfully solve problems quickly:
- BuzzFeed site curators developed a new hierarchical taxonomy to organize thousands of articles into a specified category and subcategory. Since we had no existing labeled articles to train a traditional multiclass classification model to predict these new labels, I wrote a script to hit the Claude Sonnet API with a system prompt saying “The following is a taxonomy: return the category and subcategory that best matches the article the user provides.” plus the JSON-formatted hierarchical taxonomy, then I provided the article metadata as the user prompt, all with a temperature of 0.0 for the most precise results. Running this in a loop for all the articles resulted in appropriate labels (a rough sketch of this pattern appears after this list).
- After identifying hundreds of distinct semantic clusters of BuzzFeed articles using data science shenanigans, it became clear that there wasn’t an easy way to give each one a unique label. I wrote another script to hit the Claude Sonnet API with a system prompt saying “Return a JSON-formatted title and description that applies to all the articles the user provides.” with the user prompt containing five articles from that cluster: again, running the script in a loop for all clusters provided excellent results.
- One BuzzFeed writer asked if there was a way to use a LLM to sanity-check grammar questions such as “should I use an em dash here?” against the BuzzFeed style guide. Once again I hit the Claude Sonnet API, this time copy/pasting the full style guide in the system prompt plus a command to “Reference the provided style guide to answer the user’s question, and cite the exact rules used to answer the question.” In testing, the citations were accurate and present in the source input, and the reasonings were consistent.
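As referenced in the first bullet, a rough sketch of that labeling loop might look like the following, assuming the anthropic Python client; the taxonomy, model alias, and article metadata here are placeholders rather than the actual BuzzFeed data or code.

```python
import json
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

# Placeholder taxonomy; the real one was a much larger JSON hierarchy.
taxonomy = {"Food": ["Recipes", "Restaurants"], "Tech": ["AI", "Gadgets"]}

system_prompt = (
    "The following is a taxonomy: return the category and subcategory that best "
    "matches the article the user provides.\n" + json.dumps(taxonomy)
)

def label_article(article_metadata: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder Sonnet alias
        max_tokens=100,
        temperature=0.0,  # most deterministic, most precise labels
        system=system_prompt,
        messages=[{"role": "user", "content": article_metadata}],
    )
    return response.content[0].text

# Run in a loop over all the article metadata records.
articles = ["Title: 27 Dinners You Can Make In Under An Hour | dek: ..."]
labels = [label_article(meta) for meta in articles]
```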
Each of these projects was an off-hand idea pitched in a morning standup or a Slack DM, and yet each project only took an hour or two to complete a proof of concept (including testing) and hand off to the relevant stakeholders for evaluation. For projects such as the hierarchical labeling, without LLMs I would have needed to do more sophisticated R&D and likely would have taken days, including building training datasets through manual labeling, which is not intellectually gratifying. Here, LLMs did indeed follow the Pareto principle and got me 80% of the way to a working solution, but the remaining 20% of the work (iterating, testing, and gathering feedback) took longer. Even after the model outputs became more reliable, LLM hallucination was still a concern, which is why I also advocate to my coworkers to use caution and double-check with a human if the LLM output is peculiar.
There’s also one use case of LLMs that doesn’t involve text generation that’s just as useful in my professional work: text embeddings. Modern text embedding models technically are LLMs, except instead of having a head which outputs the logits for the next token, they output a vector of numbers that uniquely identifies the input text in a higher-dimensional space. All the improvements to LLMs that the ChatGPT revolution inspired, such as longer context windows and better quality training regimens, also apply to these text embedding models and have caused them to improve drastically over time with models such as nomic-embed-text and gte-modernbert-base. Text embeddings have done a lot at BuzzFeed, from identifying similar articles to building recommendation models, but this blog post is about generative LLMs so I’ll save those use cases for another time.
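Those BuzzFeed use cases are out of scope here, but a tiny sketch shows what “outputs a vector of numbers” means in practice; this assumes the sentence-transformers library and the gte-modernbert-base checkpoint, and the exact model ID and any prefix conventions should be checked against the model card.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed Hugging Face model ID for gte-modernbert-base; verify before use.
model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")

headlines = [
    "31 Recipes For A Cozy Fall Weekend",
    "27 Dinners You Can Make In Under An Hour",
    "The Best Tech Deals This Week",
]
embeddings = model.encode(headlines)                 # one fixed-length vector per text
similarities = util.cos_sim(embeddings, embeddings)  # pairwise cosine similarity
print(similarities)  # the two food headlines should score closest to each other
```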
LLMs for Writing?
No, I don’t use LLMs for writing the text on this very blog, which I suspect has now become a default assumption for people reading an article written by an experienced LLM user. My blog is far too weird for an LLM to properly emulate. My writing style is blunt, irreverent, and occasionally cringe: even with prompt engineering plus few-shot prompting by giving it examples of my existing blog posts and telling the model to follow the same literary style precisely, LLMs output something closer to Marvel movie dialogue. But even if LLMs could write articles in my voice, I still wouldn’t use them because of the ethics of misrepresenting authorship by having the majority of the work not be my own words. Additionally, I tend to write about very recent events in the tech/coding world that would not be strongly represented in the training data of a LLM, if at all, which increases the likelihood of hallucination.
There is one silly technique I discovered to allow a LLM to improve my writing without having it do my writing: feed it the text of my mostly-complete blog post, and ask the LLM to pretend to be a cynical Hacker News commenter and write five distinct comments based on the blog post. This not only identifies weaker arguments for potential criticism, but it also doesn’t tell me what I should write in the post to preemptively address that negative feedback so I have to solve it organically. When running a rough draft of this very blog post and the Hacker News system prompt through the Claude API (chat log), it noted that my examples of LLM use at BuzzFeed are too simple and not anything more innovative than traditional natural language processing techniques, so I made edits elaborating how NLP would not be as efficient or effective.
LLMs for Companionship?
No, I don’t use LLMs as friendly chatbots either. The runaway success of LLM personal companion startups such as character.ai and Replika is alone enough evidence that LLMs have a use, even if the use is just entertainment/therapy and not more utilitarian.
I admit that I am an outlier, since treating LLMs as a friend is the most common use case. My being an introvert aside, it’s hard to be friends with an entity that is trained to be as friendly as possible but also habitually lies due to hallucination. I could prompt engineer an LLM to call me out on my bullshit instead of just giving me positive affirmations, but there’s no fix for the lying.
LLMs for Coding???
Yes, I use LLMs for coding, but only when I am reasonably confident that they’ll increase my productivity. Ever since the dawn of the original ChatGPT, I’ve asked LLMs to help me write regular expressions since that alone saves me hours, embarrassing to admit. However, the role of LLMs in coding has expanded far beyond that nowadays, and coding is even more nuanced and more controversial on how you can best utilize LLM assistance.
Like most coders, I Googled coding questions and clicked on the first Stack Overflow result that seemed relevant, until I decided to start asking Claude Sonnet the same coding questions and getting much more detailed and bespoke results. This was more pronounced for questions which required specific functional constraints and software frameworks, the combinations of which would likely not be present in a Stack Overflow answer. One paraphrased example I recently asked Claude Sonnet while writing another blog post is “Write Python code using the Pillow library to composite five images into a single image: the left half consists of one image, the right half consists of the remaining four images.” (chat log). Compositing multiple images with Pillow isn’t too difficult and there’s enough questions/solutions about it on Stack Overflow, but the specific way it’s composited is unique and requires some positioning shenanigans that I would likely mess up on the first try. But Claude Sonnet’s code got it mostly correct and it was easy to test, which saved me time doing unfun debugging.
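For context, the layout being asked for looks roughly like this simplified reconstruction (my own sketch, not Claude’s actual output; the sizes and function name are arbitrary):

```python
from PIL import Image

def composite_five(images: list[Image.Image], cell: int = 400) -> Image.Image:
    """Left half: the first image. Right half: the remaining four in a 2x2 grid."""
    canvas = Image.new("RGB", (cell * 2, cell * 2), "white")
    # The first image fills the entire left half.
    canvas.paste(images[0].resize((cell, cell * 2)), (0, 0))
    # The remaining four images fill the right half as a 2x2 grid.
    for i, img in enumerate(images[1:5]):
        col, row = i % 2, i // 2
        canvas.paste(img.resize((cell // 2, cell)), (cell + col * (cell // 2), row * cell))
    return canvas
```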
However, for more complex code questions, particularly around less popular libraries which have fewer code examples scraped from Stack Overflow and GitHub, I am more cautious about the LLM’s outputs. One real-world issue I’ve had is that I need a way to log detailed metrics to a database while training models (for which I use the Trainer class in Hugging Face transformers) so that I can visualize and analyze them later. I asked Claude Sonnet to “Write a Callback class in Python for the Trainer class in the Hugging Face transformers Python library such that it logs model training metadata for each step to a local SQLite database, such as current epoch, time for step, step loss, etc.” (chat log). This one I was less optimistic about since there isn’t much code about creating custom callbacks, but the Claude-generated code implemented some helpful ideas that weren’t top-of-mind when I asked, such as a buffer to limit blocking I/O, SQLite config speedups, batch inserts, and connection handling. Asking Claude to “make the code better” twice (why not?) resulted in a few more unexpected ideas, such as SQLite connection caching and using a single column with the JSON column type to store an arbitrary number of metrics, in addition to making the code much more Pythonic. It is still a lot of code, such that it’s unlikely to work out-of-the-box without testing in the full context of an actual training loop. However, even if the code has flaws, the ideas themselves are extremely useful, and in this case it would be much faster, and likely produce higher quality code overall, to hack on this generated code instead of writing my own SQLite logger from scratch.
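To ground what such a callback involves, here is a stripped-down sketch of the idea; it is not Claude’s generated code and omits the buffering, batch inserts, and config speedups mentioned above, and the class name and table schema are my own placeholders.

```python
import json
import sqlite3
import time

from transformers import TrainerCallback

class SQLiteLoggerCallback(TrainerCallback):
    """Logs whatever metrics the Trainer reports to a local SQLite database."""

    def __init__(self, db_path: str = "training_log.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS steps "
            "(step INTEGER, epoch REAL, elapsed REAL, metrics TEXT)"
        )
        self.start_time = time.time()

    def on_log(self, args, state, control, logs=None, **kwargs):
        # `logs` holds the metrics reported at this logging step (loss, lr, etc.);
        # a single JSON column keeps the schema flexible for arbitrary metrics.
        if logs:
            self.conn.execute(
                "INSERT INTO steps VALUES (?, ?, ?, ?)",
                (state.global_step, state.epoch, time.time() - self.start_time, json.dumps(logs)),
            )
            self.conn.commit()

    def on_train_end(self, args, state, control, **kwargs):
        self.conn.close()

# Usage: Trainer(model=..., args=..., callbacks=[SQLiteLoggerCallback()])
```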
然而,对于涉及较冷门库的更复杂代码问题——这些库从 Stack Overflow 和 GitHub 抓取的代码示例较少——我对 LLM 的输出持更谨慎态度。一个实际案例是,我需要在模型训练时记录详细指标到数据库(使用 Hugging Face transformers 的 Trainer 类)以便后续可视化分析。我让 Claude Sonnet Write a Callback class in Python for the Trainer class in the Hugging Face transformers Python library such that it logs model training metadata for each step to a local SQLite database, such as current epoch, time for step, step loss, etc.
(聊天记录)。由于自定义回调的公开代码很少,我原本预期不高,但 Claude 生成的代码提出了些我提问时未想到的实用方案:比如限制阻塞 I/O 的缓冲区、SQLite 配置加速、批量插入和连接处理。两次要求 Claude"优化代码"(何乐不为?)又带来了意外收获——SQLite 连接缓存和使用 JSON 列类型存储任意指标,同时让代码更符合 Python 风格。不过这些代码仍需在实际训练循环中测试,不太可能直接开箱即用。 然而,即使代码存在缺陷,这些想法本身极具价值,在此情况下,基于生成的代码进行修改,而非从头编写自己的 SQLite 日志器,整体上会更快且可能产生更高质量的代码。
For the actual data science that takes up most of my day-to-day work, I’ve found that code generation from LLMs is less useful. LLMs cannot output the text result of mathematical operations reliably; some APIs work around that by allowing a code interpreter to perform data ETL and analysis, but given the scale of data I typically work with, it’s not cost-feasible to do that type of workflow. Although pandas is the standard for manipulating tabular data in Python and has been around since 2008, I’ve been using the relatively new polars library exclusively, and I’ve noticed that LLMs tend to hallucinate polars functions as if they were pandas functions, which requires annoying documentation deep dives to confirm. For data visualization, where I don’t use Python at all and instead use R and ggplot2, I really haven’t had a temptation to consult a LLM, in addition to my skepticism that LLMs would know both of those frameworks as well. The techniques I use for data visualization have been unchanged since 2017, and the most time-consuming issue I have when making a chart is determining whether the data points are too big or too small for humans to read easily, which is not something a LLM can help with.
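To make the polars point concrete, pandas-style muscle memory like df.groupby("category")["views"].mean() does not exist in current polars; the idiomatic version in recent polars releases looks like this sketch (toy data, obviously):

```python
import polars as pl

# Idiomatic polars aggregation: group_by plus expression-based agg, rather than
# the pandas-style chained indexing an LLM will often hallucinate.
df = pl.DataFrame({"category": ["food", "food", "tech"], "views": [100, 200, 300]})
summary = df.group_by("category").agg(pl.col("views").mean().alias("avg_views"))
print(summary)
```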
Asking LLMs coding questions is only one aspect of coding assistance. One of the other major ones is using a coding assistant with in-line code suggestions such as GitHub Copilot. Despite my success in using LLMs for one-off coding questions, I actually dislike using coding assistants for an unexpected reason: it’s distracting. Whenever I see a code suggestion from Copilot pop up, I have to mentally context switch from writing code to reviewing code and then back again, which destroys my focus. Overall, it was a net-neutral productivity gain but a net-negative cost, as Copilot is much more expensive than just asking a LLM ad hoc questions through a web UI.
Now we can talk about the elephants in the room — agents, MCP, and vibe coding — and my takes are spicy. Agents and MCP, at a high level, are a rebranding of the Tools paradigm popularized by the ReAct paper in 2022, where LLMs can decide whether a tool is necessary to answer the user input, extract relevant metadata to pass to the tool to run, then return the results. The rapid LLM advancements in context window size and prompt adherence since then have made Agent workflows more reliable, and the standardization of MCP is an objective improvement over normal Tools that I encourage. However, they don’t open any new use cases that weren’t already available when LangChain first hit the scene a couple years ago, and now simple implementations of MCP workflows are even more complicated and confusing than they were back then. I personally have not been able to find any novel use case for Agents, not then and not now.
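For readers who haven’t seen the underlying Tools flow that’s being rebranded, here is a minimal sketch using the anthropic client; the get_weather tool, its schema, and the model alias are illustrative assumptions, not anything from this post.

```python
import anthropic

client = anthropic.Anthropic()

# A hypothetical tool definition the model can choose to call.
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model alias
    max_tokens=512,
    tools=tools,
    messages=[{"role": "user", "content": "Do I need an umbrella in Seattle today?"}],
)

# If the model decides the tool is necessary, it returns the extracted arguments
# instead of a final answer; the caller runs the tool and sends the result back
# in a follow-up turn.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. get_weather {'city': 'Seattle'}
```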
Vibe coding with coding agents like Claude Code or Cursor is something I have little desire to even experiment with. On paper, coding agents should be able to address my complaints with LLM-generated code reliability since an agent inherently double-checks itself and is able to incorporate the context of an entire code project. However, I have also heard the horror stories of people spending hundreds of dollars by accident and not getting anything that solves their coding problems. There’s a fine line between experimenting with code generation and gambling with code generation. Vibe coding can get me 80% of the way there, and I agree there’s value in that for building quick personal apps that either aren’t ever released publicly, or are released with disclaimers about their “this is released as-is” nature. But it’s unprofessional to use vibe coding as a defense to ship knowingly substandard code for serious projects, and the only code I can stand by is code whose implementation I am fully confident in.
Of course, the coding landscape is always changing, and everything I’ve said above is how I use LLMs for now. It’s entirely possible that I’ll see a post on Hacker News that completely changes my views on vibe coding or other AI coding workflows, but I’m happy with my coding productivity as it is currently, and I am able to complete all my coding tasks quickly and correctly.
What’s Next for LLM Users?
Discourse about LLMs and their role in society has become bifurcated enough that making the extremely neutral statement that LLMs have some uses is enough to invite a barrage of harassment. I strongly disagree with AI critic Ed Zitron’s assertion that the LLM industry is doomed because OpenAI and other LLM providers can’t earn enough revenue to offset their massive costs, as LLMs have no real-world use. Two things can be true simultaneously: (a) LLM provider cost economics are too negative to return positive ROI to investors, and (b) LLMs are useful for solving problems that are meaningful and high impact, albeit not to the level of the AGI hype that would justify point (a). This particular combination creates a frustrating gray area that requires a nuance that an ideologically split social media can no longer support gracefully. Hypothetically, if OpenAI and every other LLM provider suddenly collapsed and no better LLM models would ever be trained and released, open-source and permissively licensed models such as Qwen3 and DeepSeek R1 that perform comparably to ChatGPT are valid substitute goods, and they can be hosted on dedicated LLM hosting providers like Cerebras and Groq, who can actually make money on each user inference query. OpenAI collapsing would not cause the end of LLMs, because LLMs are useful today and there will always be a nonzero market demand for them: it’s a bell that can’t be unrung.
As a software engineer — and especially as a data scientist — one thing I’ve learnt over the years is that it’s always best to use the right tool when appropriate, and LLMs are just another tool in that toolbox. LLMs can be both productive and counterproductive depending on where and when you use them, but they are most definitely not useless. LLMs are more akin to forcing a square peg into a round hole (at the risk of damaging either the peg or hole in the process) while doing things without LLM assistance is the equivalent of carefully defining a round peg to pass through the round hole without incident. But for some round holes, sometimes shoving the square peg through and asking questions later makes sense when you need to iterate quickly, while sometimes you have to be more precise with both the peg and the hole to ensure neither becomes damaged, because then you have to spend extra time and money fixing the peg and/or hole.
…maybe it’s okay if I ask an LLM to help me write my metaphors going forward.