ByteDance Open-Sources a “Free Operator”: Outperforms GPT-4o and Claude

ByteDance, in collaboration with Tsinghua University, has quietly released UI-TARS, an intelligent agent that can automatically carry out complex cross-application operations, getting to launch ahead of OpenAI. The model is reported to outperform GPT-4o and Claude 3.5, and notably, it is free for commercial use under the Apache 2.0 license.

Revolutionary AI in Action

UI-TARS demonstrates its prowess by performing tasks such as retrieving weather information in a Mac browser, posting tweets on a Windows system, and even manipulating mobile and web interfaces to search for songs on an Android music player.

On GitHub, UI-TARS has garnered over 900 stars. Users have praised its performance as superior to OpenAI’s leaked Operator, and point out that Operator requires a costly membership, whereas UI-TARS is completely free.

Demo Highlights

Here are three impressive demos of UI-TARS:

  • Autonomously searching for flights from Seattle to New York.
  • Editing a PowerPoint presentation so that the background color of the second slide matches the first.
  • Installing a plugin for VS Code.

Performance Metrics

Perception: UI-TARS leads the pack in perception benchmarks like VisualWebBench, WebSRC, and ScreenQA-short, with UI-TARS-72B surpassing GPT-4o and Claude 3.5 Sonnet.

Localization: It also excels in element localization on ScreenSpot Pro, ScreenSpot, and ScreenSpot v2 benchmarks.
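
To make “element localization” concrete: grounding benchmarks of this kind are commonly scored by checking whether the model’s predicted click point lands inside the target element’s bounding box. The sketch below illustrates that idea only. The function names and the sample coordinates are hypothetical, and this is not the official scorer of ScreenSpot or any part of the UI-TARS codebase.

```python
from typing import List, Tuple

Point = Tuple[float, float]              # (x, y) of a predicted click
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) of the target element


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: List[Point], targets: List[Box]) -> float:
    """Fraction of predicted click points that fall inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(predictions)


# Hypothetical sample: the first click hits its target, the second misses.
sample_predictions = [(5.0, 5.0), (50.0, 50.0)]
sample_targets = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
accuracy = grounding_accuracy(sample_predictions, sample_targets)
```

A point-in-box check is a deliberately simple criterion; real benchmarks may differ in how they normalize coordinates or treat edge cases.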

Execution: UI-TARS achieves state-of-the-art (SOTA) performance in both static and dynamic environments across key metrics.

Behind the Scenes

UI-TARS is built on Alibaba’s open-source multimodal model Qwen2-VL and was further trained on 50B tokens. A three-stage training procedure overcame the bottleneck of limited data scale.

Project Insights

The UI-TARS project is a joint effort by ByteDance’s Seed team and Tsinghua University. It incorporates approaches such as online learning, which generates new interaction trajectory data, and reflective tuning, which helps the agent recover from errors. During inference, it employs techniques such as chain-of-thought reasoning and System 2 thinking.
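
As a rough illustration of what “thinking before acting” and error recovery can look like in a GUI agent, here is a minimal sketch of a loop in which every action is preceded by an explicit thought, and a failed action prompts a corrective step. Everything in it (`Step`, `Trajectory`, `run_agent`, the demo policy, and the action strings) is hypothetical and stands in for a real model and a real screen. It is not the UI-TARS implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    thought: str  # reasoning written out before acting
    action: str   # the GUI action chosen after that reasoning


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)


def run_agent(
    task: str,
    policy: Callable[[str, str, List[Step]], Step],
    execute: Callable[[str], str],
    max_steps: int = 5,
) -> Trajectory:
    """Alternate between thinking/acting and observing until 'finish' or max_steps."""
    trajectory = Trajectory(task)
    observation = "start"
    for _ in range(max_steps):
        step = policy(task, observation, trajectory.steps)
        trajectory.steps.append(step)
        if step.action == "finish":
            break
        observation = execute(step.action)
    return trajectory


def demo_policy(task: str, observation: str, history: List[Step]) -> Step:
    """Stand-in for a model: reacts to an error observation with a corrective step."""
    if observation == "start":
        return Step("I need to open the search box first.", "click search")
    if observation.startswith("error"):
        return Step("The last action failed, so I should try a different target.",
                    "click search icon")
    return Step("The search box is open, so the task is done.", "finish")


def demo_execute(action: str) -> str:
    """Stand-in for the environment: the first click target does not exist."""
    return "error: element not found" if action == "click search" else "ok"


trajectory = run_agent("open search", demo_policy, demo_execute)
for step in trajectory.steps:
    print(f"{step.thought} -> {step.action}")
```

The recorded list of thought/action pairs is also the kind of interaction trajectory that could be collected as new training data, though how UI-TARS actually gathers and filters such data is not described here.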

Core Summary: The launch of UI-TARS marks a significant leap for ByteDance in the AI domain. It is a testament to the power of collaboration and innovation, offering a freely available, commercial-grade AI with strong perception, localization, and execution capabilities.

Built on the Qwen2-VL model and trained on massive datasets, UI-TARS stands as a beacon of progress in AI technology, blending scientific rigor with a touch of humanistic endeavor.