What the New GPT-4 AI Can Do


Tech research company OpenAI has just released an updated version of its text-generating artificial intelligence program, called GPT-4, and demonstrated some of the language model’s new abilities. Not only can GPT-4 produce more natural-sounding text and solve problems more accurately than its predecessor; it can also process images in addition to text. But the AI is still vulnerable to some of the same problems that plagued earlier GPT models: displaying bias, overstepping the guardrails intended to prevent it from saying offensive or dangerous things, and “hallucinating,” or confidently making up falsehoods not found in its training data.

On Twitter, OpenAI CEO Sam Altman described the model as the company’s “most capable and aligned” to date. (“Aligned” means it is designed to follow human ethics.) But “it is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it,” he wrote in the tweet. 

Perhaps the most significant change is that GPT-4 is “multimodal,” meaning it works with both text and images. Although it cannot output pictures (as do generative AI models such as DALL-E and Stable Diffusion), it can process and respond to the visual inputs it receives. Annette Vee, an associate professor of English at the University of Pittsburgh who studies the intersection of computation and writing, watched a demonstration in which the new model was told to identify what was funny about a humorous image. Being able to do so means “understanding context in the image. It’s understanding how an image is composed and why and connecting it to social understandings of language,” she says. “ChatGPT wasn’t able to do that.”
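For readers curious what an image-plus-text prompt looks like in practice, here is a minimal sketch using OpenAI’s Python client. The model identifier and the availability of image input through the public API are assumptions for illustration; at the time of GPT-4’s release, OpenAI limited visual input to select partners.

```python
# Minimal sketch of a multimodal (image + text) request. Assumes image input
# is enabled for the account and that the model name below is available.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed identifier for a GPT-4 model with image support
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is funny about this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/humorous-image.jpg"},  # placeholder URL
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)  # the model's explanation of the joke
```

The request is the same chat-style exchange used for text, with the image passed as an additional piece of message content, which is what makes the model “multimodal” rather than a separate image tool.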

A device with the ability to analyze and then describe images could be enormously valuable for people who are visually impaired or blind. For instance, a mobile app called Be My Eyes can describe the objects around a user, helping those with low or no vision interpret their surroundings. The app recently incorporated GPT-4 into a “virtual volunteer” that, according to a statement on OpenAI’s website, “can generate the same level of context and understanding as a human volunteer.”

But GPT-4’s image analysis goes beyond describing the picture. In the same demonstration Vee watched, an OpenAI representative sketched an image of a simple website and fed the drawing to GPT-4. Next the model was asked to write the code required to produce such a website—and it did. “It looked basically like what the image is. It was very, very simple, but it worked pretty well,” says Jonathan May, a research associate professor at the University of Southern California. “So that was cool.”

Even without its multimodal capability, the new program outperforms its predecessors at tasks that require reasoning and problem-solving. OpenAI says it has run both GPT-3.5 and GPT-4 through a variety of tests designed for humans, including a simulation of a lawyer’s bar exam, the SAT and Advanced Placement tests for high schoolers, the GRE for college graduates and even a couple of sommelier exams. GPT-4 achieved human-level scores on many of these benchmarks and consistently outperformed its predecessor, although it did not ace everything: it performed poorly on English language and literature exams, for example. Still, its extensive problem-solving ability could be applied to any number of real-world applications—such as managing a complex schedule, finding errors in a block of code, explaining grammatical nuances to foreign-language learners or identifying security vulnerabilities.

Additionally, OpenAI claims the new model can interpret and output longer blocks of text: more than 25,000 words at once. Although previous models were also used for long-form applications, they often lost track of what they were talking about. And the company touts the new model’s “creativity,” described as its ability to produce different kinds of artistic content in specific styles. In a demonstration comparing how GPT-3.5 and GPT-4 imitated the style of Argentine author Jorge Luis Borges in English translation, Vee noted that the more recent model produced a more accurate attempt. “You have to know enough about the context in order to judge it,” she says. “An undergraduate may not understand why it’s better, but I’m an English professor.... If you understand it from your own knowledge domain, and it’s impressive in your own knowledge domain, then that’s impressive.”

May has also tested the model’s creativity himself. He tried the playful task of ordering it to create a “backronym” (an acronym reached by starting with the abbreviated version and working backward). In this case, May asked for a cute name for his lab that would spell out “CUTE LAB NAME” and that would also accurately describe his field of research. GPT-3.5 failed to generate a relevant label, but GPT-4 succeeded. “It came up with ‘Computational Understanding and Transformation of Expressive Language Analysis, Bridging NLP, Artificial intelligence And Machine Education,’” he says. “‘Machine Education’ is not great; the ‘intelligence’ part means there’s an extra letter in there. But honestly, I’ve seen way worse.” (For context, his lab’s actual name is CUTE LAB NAME, or the Center for Useful Techniques Enhancing Language Applications Based on Natural And Meaningful Evidence). In another test, the model showed the limits of its creativity. When May asked it to write a specific kind of sonnet—he requested a form used by Italian poet Petrarch—the model, unfamiliar with that poetic setup, defaulted to the sonnet form preferred by Shakespeare.

Of course, fixing this particular issue would be relatively simple. GPT-4 merely needs to learn an additional poetic form. In fact, when humans goad the model into failing in this way, this helps the program develop: it can learn from everything that unofficial testers enter into the system. Like its less fluent predecessors, GPT-4 was originally trained on large swaths of data, and this training was then refined by human testers. (GPT stands for generative pretrained transformer.) But OpenAI has been secretive about just how it made GPT-4 better than GPT-3.5, the model that powers the company’s popular ChatGPT chatbot. According to the paper published alongside the release of the new model, “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” OpenAI’s lack of transparency reflects this newly competitive generative AI environment, where GPT-4 must vie with programs such as Google’s Bard and Meta’s LLaMA. The paper does go on to suggest, however, that the company plans to eventually share such details with third parties “who can advise us on how to weigh the competitive and safety considerations ... against the scientific value of further transparency.”

Those safety considerations are important because smarter chatbots have the ability to cause harm: without guardrails, they might provide a terrorist with instructions on how to build a bomb, churn out threatening messages for a harassment campaign or supply misinformation to a foreign agent attempting to sway an election. Although OpenAI has placed limits on what its GPT models are allowed to say in order to avoid such scenarios, determined testers have found ways around them. “These things are like bulls in a china shop—they’re powerful, but they’re reckless,” scientist and author Gary Marcus told Scientific American shortly before GPT-4’s release. “I don’t think [version] four is going to change that.”

And the more humanlike these bots become, the better they are at fooling people into thinking there is a sentient agent behind the computer screen. “Because it mimics [human reasoning] so well, through language, we believe that—but underneath the hood, it’s not reasoning in any way similar to the way that humans do,” Vee cautions. If this illusion fools people into believing an AI agent is performing humanlike reasoning, they may trust its answers more readily. This is a significant problem because there is still no guarantee that those responses are accurate. “Just because these models say anything, that doesn’t mean that what they’re saying is [true],” May says. “There isn’t a database of answers that these models are pulling from.” Instead, systems like GPT-4 generate an answer one word at a time, with the most plausible next word informed by their training data—and that training data can become outdated. “I believe GPT-4 doesn’t even know that it’s GPT-4,” he says. “I asked it, and it said, ‘No, no, there’s no such thing as GPT-4. I’m GPT-3.’”
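May’s point about how these systems produce text can be illustrated with a toy sketch of autoregressive generation: at each step the model assigns probabilities to possible next words given everything written so far, samples one, and repeats. The vocabulary and probabilities below are invented purely for illustration and are not drawn from any real model.

```python
import random

# Toy illustration of next-word generation. A real model derives these
# distributions from billions of parameters fit to its training data;
# here they are hand-written to show the mechanism only.
def toy_next_word_distribution(context: list[str]) -> dict[str, float]:
    if context[-1] == "GPT-4":
        return {"is": 0.6, "can": 0.3, "hallucinates": 0.1}
    if context[-1] == "is":
        return {"multimodal.": 0.5, "impressive.": 0.4, "GPT-3.": 0.1}
    return {"GPT-4": 1.0}

def generate(prompt: list[str], max_words: int = 3) -> list[str]:
    words = list(prompt)
    for _ in range(max_words):
        dist = toy_next_word_distribution(words)
        # Sample the next word in proportion to its assigned probability.
        next_word = random.choices(list(dist), weights=list(dist.values()))[0]
        words.append(next_word)
        if next_word.endswith("."):
            break
    return words

print(" ".join(generate(["GPT-4"])))  # e.g. "GPT-4 is multimodal."
```

Nothing in this procedure checks whether the sampled continuation is true, only whether it is statistically plausible, which is why a fluent answer can still be a confident fabrication.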

Now that the model has been released, many researchers and AI enthusiasts have an opportunity to probe GPT-4’s strengths and weaknesses. Developers who want to use it in other applications can apply for access, and anyone who wants to “talk” with the program will have to subscribe to ChatGPT Plus. For $20 per month, this paid program lets users choose between talking with a chatbot that runs on GPT-3.5 and one that runs on GPT-4.
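For developers who do receive API access, switching between the two models is a matter of changing the model parameter in the request. The sketch below assumes the standard OpenAI Python client, an API key in the environment, and an account that has been granted GPT-4 access.

```python
# Sketch of calling either model through the API for the same prompt.
from openai import OpenAI

client = OpenAI()
question = "Explain the difference between 'affect' and 'effect' in one sentence."

for model in ("gpt-3.5-turbo", "gpt-4"):
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    print(f"{model}: {reply.choices[0].message.content}")
```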

Such explorations will undoubtedly uncover more potential applications—and flaws—in GPT-4. “The real question should be ‘How are people going to feel about it two months from now, after the initial shock?’” Marcus says. “Part of my advice is: let’s temper our initial enthusiasm by realizing we have seen this movie before. It’s always easy to make a demo of something; making it into a real product is hard. And if it still has these problems—around hallucination, not really understanding the physical world, the medical world, etcetera—that’s still going to limit its utility somewhat. And it’s still going to mean you have to pay careful attention to how it’s used and what it’s used for.”
