Browser Use
In my article “RPA RIP”, I outlined my concerns regarding the overuse of traditional Robotic Process Automation (RPA). As an alternative, I positioned API-centric automation, combined with Generative AI, to create “intelligent agents” (also known as Agentic AI).
As documented over a series of articles (the latest being “Agentic AI”), my team has developed and scaled a multi-model, multi-modal Generative AI framework for business, which primarily aims to accelerate productivity by streamlining common tasks.
Last year, we launched a new feature known as “Workflows” which allows a user to declaratively (no code or specialist knowledge required) create chained events using pre-approved components. These events can be triggered in real time, promoting experimentation and continuous improvement through iteration.
These Workflows focus on API-centric interactions, either as an input or an output. For example, taking an input event from a user or signal to trigger an interaction with a system of record, such as SAP or Workday.
In recent months, we have enhanced this capability with “Deep Research”, leveraging reasoning models (e.g., OpenAI o3) to construct more complicated workflows that are dynamically compiled (not pre-determined scripts).
We believe these Agentic AI capabilities will trigger the next wave of innovation regarding the use of Generative AI, further increasing the overall return on investment.
However, not every system, especially in an enterprise business, is API-centric. Interacting with and/or automating these legacy systems can be a challenge. This is where traditional RPA has historically proven valuable, as it can simulate user interactions via the user interface itself.
Unfortunately, this approach is very brittle, as it relies upon pre-determined scripts that cannot adapt to change.
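To make the brittleness concrete, here is a minimal sketch. It models a page as a plain dictionary of element IDs and labels, and replays a fixed script against it. Every name in it (`PAGE_V1`, `PAGE_V2`, `SCRIPT`, `run_script`, `script_survives`, and the element IDs) is hypothetical and invented for illustration. It does not represent any particular RPA product.

```python
# Hypothetical model: a page is a dict of element id -> visible label.
PAGE_V1 = {"btn-approve": "Approve", "btn-reject": "Reject"}
# After a redesign the same buttons exist, but their ids have changed.
PAGE_V2 = {"action-approve": "Approve", "action-reject": "Reject"}

# A pre-determined script: a fixed list of element ids to click, in order.
SCRIPT = ["btn-approve"]


def run_script(page: dict[str, str], script: list[str]) -> list[str]:
    """Click each scripted element id; fail as soon as one is missing."""
    clicked = []
    for element_id in script:
        if element_id not in page:
            raise LookupError(f"element {element_id!r} not found")
        clicked.append(page[element_id])
    return clicked


def script_survives(page: dict[str, str]) -> bool:
    """Return True if the script runs to completion against the page."""
    try:
        run_script(page, SCRIPT)
        return True
    except LookupError:
        return False
```

The script works against `PAGE_V1` but fails against `PAGE_V2`, even though a human would still find the “Approve” button without difficulty.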
Therefore, we have been testing “Browser Use”, which is an open-source project, backed by Y Combinator, that connects AI Agents to the browser. This opens the door to browser automation (no direct API required), without the need for pre-determined scripts.
In short, Browser Use accepts input from the user as a standard prompt and then leverages a language model to plan and execute the request in real time.
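The observe, plan, act loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not Browser Use's implementation or API: `FakeBrowser` stands in for a real browser, `keyword_planner` stands in for the language model, and `SITE`, `TASK`, and `run_agent` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FakeBrowser:
    """Toy stand-in for a browser: tracks the current page and a click log."""
    pages: dict[str, list[str]]  # page name -> labels visible on that page
    current: str = "home"
    log: list[str] = field(default_factory=list)

    def observe(self) -> list[str]:
        return self.pages[self.current]

    def click(self, label: str) -> None:
        if label not in self.observe():
            raise ValueError(f"{label!r} is not visible on {self.current!r}")
        self.log.append(label)
        self.current = label  # in this toy site, clicking a label opens that page


def keyword_planner(task: str, visible: list[str]) -> Optional[str]:
    """Stub for the language model: pick the first visible label named in the task."""
    for label in visible:
        if label.lower() in task.lower():
            return label
    return None  # nothing relevant left to click, so the task is treated as done


def run_agent(
    task: str,
    browser: FakeBrowser,
    planner: Callable[[str, list[str]], Optional[str]],
    max_steps: int = 10,
) -> list[str]:
    """Observe the page, ask the planner for the next action, act, and repeat."""
    for _ in range(max_steps):
        next_label = planner(task, browser.observe())
        if next_label is None:
            break
        browser.click(next_label)
    return browser.log


# Hypothetical site and task, loosely modelled on the pull request example.
SITE = {
    "home": ["Pull requests", "Issues"],
    "Pull requests": ["Fix typo", "Add tests"],
    "Fix typo": ["Approve", "Comment"],
    "Approve": [],
}
TASK = "Open Pull requests, find 'Fix typo' and Approve it"
```

Calling `run_agent(TASK, FakeBrowser(pages=SITE), keyword_planner)` walks the toy site one decision at a time. No step is scripted in advance: each action is chosen from whatever is visible on the current page.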
The video below is a short demonstration of Browser Use finding and approving a pull request within GitHub.
As part of this demonstration, Browser Use was not given any context regarding how to complete the task or insight into the GitHub website. It executed the user request by navigating the user interface organically, comparable to a human interaction.
The results of our testing have been surprisingly good, especially considering the relative immaturity of the project. However, it should be noted that the complexity of the task being requested and the specific language model being used can have a significant impact on the success of the process.
In addition, as the process is not pre-determined, the same input can result in different (inconsistent) results. Therefore, using these capabilities at scale in production would likely require robust grounding and thorough testing.
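One simple way to quantify this inconsistency before going to production is to run the same request many times and measure how often the outcomes agree. The sketch below assumes each run can be reduced to a short outcome string. `consistency_report` and `make_flaky_task` are hypothetical helpers, and the flaky task is a seeded random stub, not a real agent run.

```python
import random
from collections import Counter
from typing import Callable


def consistency_report(run_task: Callable[[], str], runs: int = 20) -> dict:
    """Run the same task repeatedly and summarise how often the outcomes agree."""
    outcomes = Counter(run_task() for _ in range(runs))
    most_common_outcome, count = outcomes.most_common(1)[0]
    return {
        "runs": runs,
        "distinct_outcomes": len(outcomes),
        "most_common": most_common_outcome,
        "agreement": count / runs,  # share of runs that produced the most common outcome
    }


def make_flaky_task(seed: int, success_rate: float) -> Callable[[], str]:
    """Hypothetical stand-in for an agent run that sometimes ends on the wrong page."""
    rng = random.Random(seed)
    return lambda: "approved" if rng.random() < success_rate else "wrong_page"
```

In practice, `run_task` would wrap a real agent invocation, and the agreement figure could be tracked per language model and per task to decide which combinations are stable enough to rely on.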
With that said, Browser Use does boast high levels of accuracy compared to other options, including “Operator” from OpenAI, which currently costs £200 per month to use.
Browser Use highlight the following features on their website. However, I would recommend reviewing their documentation for a more detailed description of the capabilities.
Vision + HTML Extraction
- Combines visual understanding with HTML structure extraction for comprehensive web interaction.
Multi-tab Management
- Automatically handles multiple browser tabs for complex workflows and parallel processing.
Element Tracking
- Extracts the XPaths of clicked elements and repeats the exact LLM actions for consistent automation.
Custom Actions
- Add your own actions like saving files, database operations, notifications, or human input handling.
Self-correcting
- Intelligent error handling and automatic recovery for robust automation workflows.
Any LLM Support
- Compatible with all LangChain LLMs including GPT-4, Claude 3, and Llama 2.
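To illustrate the idea behind “Custom Actions”, here is a minimal sketch of how user-defined functions might be registered and exposed to an agent. This is a generic registry pattern written with the standard library only. It is not Browser Use's actual API, and `ACTIONS`, `action`, `ask_human`, `save_note`, `describe_actions`, and `dispatch` are all hypothetical names. The project's documentation describes the real mechanism.

```python
from typing import Callable

# Hypothetical registry: action name -> (description, function).
ACTIONS: dict[str, tuple[str, Callable[..., str]]] = {}


def action(description: str):
    """Decorator that registers a function as an action the agent may call."""
    def register(func: Callable[..., str]) -> Callable[..., str]:
        ACTIONS[func.__name__] = (description, func)
        return func
    return register


@action("Ask a human to confirm before a sensitive step")
def ask_human(question: str) -> str:
    # A real implementation might prompt a person; this stub only echoes the question.
    return f"human asked: {question}"


@action("Record a result for later review")
def save_note(text: str) -> str:
    # A real implementation might write to a file or database.
    return f"saved: {text}"


def describe_actions() -> str:
    """Build text that could be added to the model prompt to list available actions."""
    return "\n".join(f"{name}: {desc}" for name, (desc, _) in sorted(ACTIONS.items()))


def dispatch(name: str, **kwargs) -> str:
    """Execute a registered action by name, rejecting anything unregistered."""
    if name not in ACTIONS:
        raise KeyError(f"unknown action {name!r}")
    return ACTIONS[name][1](**kwargs)
```

The design choice worth noting is the allow-list: the model can only request actions that have been registered explicitly, which mirrors the “pre-approved components” approach described earlier for Workflows.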
As Agentic AI gains momentum, I expect to see a lot of browser automation capabilities appear on the market. However, I believe Browser Use has emerged as an early favourite, achieving 40k stars on GitHub in three months. They also have a strong community (open source) and great backing from established investors (Y Combinator).
Overall, the best form of automation is still programmatically defined, using standard APIs. However, knowing that this is not viable for all scenarios, especially in enterprise businesses, I am excited to see modern alternatives becoming available.