Add documentation for VAB-WebArena-Lite
This commit is contained in:
parent 9b7a654aaa
commit 0c6f549214

README.md (50 lines changed)

@@ -26,32 +26,30 @@ Compared to its predecessor [AgentBench](https://github.com/THUDM/AgentBench), VAB...
## Table of Contents

- [Quick Start](#quick-start)
- [Dataset Summary](#dataset-summary)
- [Leaderboard](#leaderboard)
- [Acknowledgement](#acknowledgement)
- [Citation](#citation)
## Quick Start

This section will first give you an overview of the use and architecture of VAB.
Next, it will guide you through using `gpt-4o-2024-05-13` as an exemplar agent to launch 4 concurrent `VAB-Minecraft` tasks.

### Overview of the VAB Framework

To allow fast evaluation over agent tasks, we leverage AgentBench's framework as the backbone (currently for VAB-OmniGibson, VAB-Minecraft, and VAB-CSS).
If you are interested in its detailed implementation, please refer to AgentBench's [Framework Introduction](https://github.com/THUDM/AgentBench/blob/main/docs/Introduction_en.md) (optional reading).
In short, the framework calls all LLMs/LMMs in API format via the `Agent-Controller`, and accesses environments via the `Task-Controller`.
The `Assigner` automatically assigns evaluation tasks by pairing an `Agent-Controller` with a `Task-Controller` to optimize overall evaluation speed; a conceptual sketch follows below.
For more detailed configuration and launch methods, please check the [Configuration Guide](docs/Config_en.md) and [Program Entrance Guide](docs/Entrance_en.md).

![](./assets/framework.png)
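To make the pairing concrete, here is a minimal conceptual sketch of the assignment loop. The names (`run_episode`, the worker labels) are illustrative only; the real `Agent-Controller` and `Task-Controller` run as networked services, not in-process objects.

```python
# Conceptual sketch of the Assigner loop; names are illustrative, not VAB's API.
from queue import Queue

def run_episode(agent: str, task: int) -> None:
    # In VAB, this is where the Agent-Controller queries the LMM and the
    # Task-Controller steps the environment until the task terminates.
    print(f"{agent} -> task {task}")

def assign(agents: list[str], tasks: list[int]) -> None:
    free_agents: "Queue[str]" = Queue()
    for a in agents:
        free_agents.put(a)
    for t in tasks:
        agent = free_agents.get()   # wait for a free agent worker
        run_episode(agent, t)
        free_agents.put(agent)      # return the worker to the pool

assign(["gpt-4o-worker-0", "gpt-4o-worker-1"], list(range(4)))
```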
### Step 1. Prerequisites for All Environments

Clone this repo and install the dependencies.
@@ -68,7 +66,14 @@ Ensure that [Docker](https://www.docker.com/) is properly installed.

```bash
docker ps
```

For specific environments, please refer to their additional prerequisites respectively.
For VAB-WebArena-Lite, which is based on [WebArena](https://github.com/web-arena-x/webarena) with some modifications, please read its individual setup carefully.

* [VAB-OmniGibson Setup](docs/detailed_setups/VAB-OmniGibson.md)
* [VAB-Minecraft Setup](docs/detailed_setups/VAB-Minecraft.md)
* VAB-Mobile: Ongoing
* [VAB-WebArena-Lite Setup](VAB-WebArena-Lite/README.md) (separate installation and evaluation method)
* VAB-CSS: Ongoing
### Step 2. Configure the Agent

@@ -117,6 +122,19 @@ python -m src.assigner --auto-retry --config configs/assignments/omnigibson.yaml

You can modify the config files to launch other tasks or change task concurrency.
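As a hedged illustration of editing an assignment config programmatically, the snippet below rewrites one value in a YAML file. The `concurrency` key is an assumption, not a documented field; check the Configuration Guide for the actual schema under `configs/assignments/`.

```python
# Hypothetical config tweak; the `concurrency` key is an assumed YAML field.
import yaml

def set_concurrency(path: str, value: int) -> None:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    cfg["concurrency"] = value  # assumption: a top-level concurrency field
    with open(path, "w") as f:
        yaml.safe_dump(cfg, f)

set_concurrency("configs/assignments/omnigibson.yaml", 4)
```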
## Dataset Summary

We offer two splits for each dataset: Testing and Training. Unlike its predecessor [AgentBench](https://github.com/THUDM/AgentBench), VAB is accompanied by a trajectory training set for behavior cloning (BC) training, which allows the development of more potent visual foundation agents with emerging open LMMs.

![](./assets/summary.png)
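For readers unfamiliar with BC: it treats each recorded (observation, action) pair from the trajectory set as supervised data. A generic sketch of the objective follows; this is only an illustration, not VAB's actual multitask finetuning code.

```python
# Generic behavior-cloning objective, shown to illustrate the idea only.
import torch
import torch.nn.functional as F

def bc_loss(policy: torch.nn.Module,
            observations: torch.Tensor,
            expert_actions: torch.Tensor) -> torch.Tensor:
    logits = policy(observations)                   # (batch, action_vocab)
    return F.cross_entropy(logits, expert_actions)  # imitate recorded actions
```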
## Leaderboard

Here are the test set results of VAB. All metrics are task Success Rate (SR). Note that proprietary LMMs are tested with mere **Prompting**, while open LMMs are tested after **Multitask Finetuning** on the VAB training set, as they usually fail to follow complicated agent task instructions.

![](./assets/leaderboard.png)

## Acknowledgement

This project is heavily built upon the following repositories (to be updated):
VAB-WebArena-Lite/README.md (new file, 218 lines)

@@ -0,0 +1,218 @@
# Setup for VAB-WebArena-Lite

## Brief Introduction

VAB-WebArena-Lite is a 165-task refined subset of <a href="https://webarena.dev/" target="_blank">WebArena</a>.
The purpose of building this subset is to manually ensure task correctness & feasibility, and to speed up testing (the original 812-task WebArena usually takes more than 6 hours to run through, while VAB-WebArena-Lite takes around 40 minutes in practice).
The modified version of the test cases can be found in `config_files/wa/test_webarena_lite.raw.json`.
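A quick sanity check after cloning (a sketch, assuming the raw file is a top-level JSON list with one object per task):

```python
# Sketch: confirm the refined subset size; assumes a top-level JSON list.
import json

with open("config_files/wa/test_webarena_lite.raw.json") as f:
    tasks = json.load(f)
print(len(tasks))  # expected to be 165 per the description above
```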
## Install

First, clone the official repository of <a href="https://github.com/web-arena-x/visualwebarena">VisualWebArena</a> into this directory:

```bash
# Assume you have cloned VAB and are now in the `VAB-WebArena-Lite` directory
git clone https://github.com/web-arena-x/visualwebarena.git visualwebarena
cd visualwebarena
git reset --hard ad57aae4dad71531504726900b80db02e0526158
cd ..
```

Then, substitute the modified files with the command below:

```bash
bash replace.sh
```
After that, install the dependencies for VAB-WebArena-Lite (we recommend using a virtual environment, venv or conda, independent from VAB's):

```bash
# Python 3.10 (or 3.11, but not 3.12, which removed the distutils module needed here)
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
playwright install
pip install -e .
```

You can also run the unit tests to ensure that WebArena-Lite is installed correctly:

```bash
pytest -x
```
## Setup WebArena-Lite Environments

1. Set up the standalone environments.
Please check out [this page](https://github.com/web-arena-x/webarena/tree/main/environment_docker) for details.

2. Configure the URLs for each website.
First, export the `DATASET` to be `webarena`:

```bash
export DATASET=webarena
```

Then, set the URLs for the websites
(🚨 Notice: check whether the default ports of the websites below correspond to those you set up in the first step):

```bash
# The CLASSIFIEDS environment is not included in the WebArena-Lite evaluation; we keep its environment variables here just for consistency.
export CLASSIFIEDS="<your_classifieds_domain>:9980"
export CLASSIFIEDS_RESET_TOKEN="4b61655535e7ed388f0d40a93600254c"

# Below are the variables you should set for the evaluation.
export SHOPPING="<your_shopping_site_domain>:7770"
export REDDIT="<your_reddit_domain>:9999"
export SHOPPING_ADMIN="<your_e_commerce_cms_domain>:7780/admin"
export GITLAB="<your_gitlab_domain>:8023"
export MAP="<your_map_domain>:3000"
export WIKIPEDIA="<your_wikipedia_domain>:8888"
export HOMEPAGE="<your_homepage_domain>:4399"
```
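Optionally, you can sanity-check that the exported sites respond before running anything. A small sketch follows; since the variables above are exported without an `http://` scheme, one is prepended here.

```python
# Sketch: ping each configured website before evaluation.
import os
import requests

SITES = ["SHOPPING", "REDDIT", "SHOPPING_ADMIN", "GITLAB", "MAP", "WIKIPEDIA", "HOMEPAGE"]
for var in SITES:
    url = os.environ.get(var, "")
    if url and not url.startswith("http"):
        url = "http://" + url
    try:
        code = requests.get(url, timeout=10).status_code
        print(f"{var}: HTTP {code}")
    except requests.RequestException as e:
        print(f"{var}: unreachable ({e})")
```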
3. Generate config files for each test example:

```bash
python scripts/generate_test_data.py
```

You will see `*.json` files generated in the [config_files](./config_files) folder. Each file contains the configuration for one test example.
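To spot-check one generated config (a sketch; `intent` is the task-instruction field the evaluators read, while `task_id` is an assumed WebArena-style key):

```python
# Sketch: inspect the first generated per-task config.
import glob
import json

paths = sorted(glob.glob("config_files/wa/test_webarena_lite/*.json"))
with open(paths[0]) as f:
    cfg = json.load(f)
print(cfg["intent"])       # the task instruction (field used by the evaluators)
print(cfg.get("task_id"))  # assumption: WebArena-style task id field
```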
4. Obtain and save the auto-login cookies for all websites:

```bash
bash prepare.sh
```
5. Set up API keys.

```bash
export OPENAI_API_KEY=your_key

# Optional: if you use a different OpenAI model source
export OPENAI_API_URL=your_url

# Optional: set the following variables to evaluate the preset models in llms/providers/api_utils.py
export GEMINI_API_KEY=your_key
export QWEN_API_KEY=your_key
export CLAUDE_API_KEY=your_key

# Optional: if you have trained your own model, we recommend deploying it as an API service and setting FINETUNED_URL to evaluate it
export FINETUNED_URL=your_url
```
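If you go the `FINETUNED_URL` route, the evaluator's `FinetuneModel` (see `new/api_utils.py` in this commit) POSTs an `image` file plus a `prompt` form field and expects a JSON body with a `response` string. A minimal sketch of a matching service follows; `my_model_generate` is a hypothetical placeholder for your model's inference call.

```python
# Sketch of a service matching FinetuneModel's request/response contract.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.post("/")
def infer():
    image = request.files["image"]   # screenshot uploaded by the evaluator
    prompt = request.form["prompt"]  # dialog in <|system|>/<|user|>/<|assistant|> format
    text = my_model_generate(image, prompt)  # hypothetical: your model's generate call
    return jsonify({"response": text + "<|end_of_text|>"})
```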
If using Gemini, first install the [gcloud CLI](https://cloud.google.com/sdk/docs/install). Configure the API key by authenticating with Google Cloud:

```bash
gcloud auth login
gcloud config set project <your_project_name>
```
## 🖼️ Evaluating in the VAB Standard Setting with SoM (Set-of-Marks) Visual Agents

### 👎 Run a Single Agent for Evaluation (Slow, but please read this to understand the meaning of the arguments)

To run your own model as a SoM visual agent, you can run the evaluation with the following flags:

```bash
python run.py \
  --instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \
  --test_start_idx 0 \
  --test_end_idx 1 \
  --result_dir <your_result_dir> \
  --test_config_base_dir config_files/wa/test_webarena_lite \
  --provider api \
  --model openai_gpt-4-vision-preview \
  --action_set_tag som --observation_type image_som
```

Besides the original model providers (OpenAI, Google), you can also add your own models in `llms/providers/api_utils.py`. Remember to set `--provider` to:

- `api`: keeps the same input style as WebArena, suitable for regular API calls
- `finetune`: required for models trained with the data we provide

For the `--model` flag, we use the format `<source>_<model-name>`.

- If there are no further model options under a source, you can set it to just `source`.
- Remember that the source name must be added in the `__init__` function of `APIModel` in `llms/providers/api_utils.py`.
- For example, if you want to use the OpenAI model "gpt-4o", you can set the flag like this: `--model openai_gpt-4o`.
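Concretely, the flag is split on the first underscore into a provider and a model name; this mirrors `APIModel.inference` in `llms/providers/api_utils.py` (included in this commit):

```python
# How `--model` values are interpreted, mirroring APIModel.inference.
model_id = "openai_gpt-4o"
provider, name = model_id.split("_", 1) if "_" in model_id else (model_id, model_id)
print(provider, name)  # -> openai gpt-4o
```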
Finally, run `score.py` to get the pass rate:

```bash
python score.py <your_result_dir>
```
### 👍 Run Parallel Agents for Evaluation (Recommended)

To run the tests in parallel, you can first configure `wa_parallel_run.sh`, then run it. By default, we split the test set into 5 parallel groups for evaluation in VAB; a sketch of the split follows below.

```bash
# Remember to first launch a tmux session
tmux
bash wa_parallel_run.sh
```
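The 5-way split mentioned above just partitions the 165 task indices into contiguous ranges, expressed through `run.py`'s `--test_start_idx` / `--test_end_idx` flags; the actual bounds are set inside `wa_parallel_run.sh`, so the arithmetic below is only illustrative.

```python
# Sketch: contiguous 5-way split of the 165 tasks.
TOTAL, GROUPS = 165, 5
size = TOTAL // GROUPS
for g in range(GROUPS):
    start = g * size
    end = TOTAL if g == GROUPS - 1 else start + size
    print(f"group {g}: --test_start_idx {start} --test_end_idx {end}")
```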
The script supports auto-resuming if it is interrupted or meets an unexpected error. Feel free to rerun the above command until all tasks finish.

After all parallel groups finish, run `score.py` to get the pass rate:

```bash
python score.py <your_result_dir>
```
### 🚨 Important: Refresh all websites before re-running another round of testing!

Since tasks in WebArena may change the status and databases of the websites (e.g., posting comments on Reddit), the results would be problematic if the websites are not all refreshed before another round of evaluation.

Please remember to run the following command (assuming you are hosting the WebArena websites on your own) to restart and refresh all website Dockers and avoid potential contamination.
The process usually takes 3-5 minutes.

```bash
# Make sure the script is executed on the machine where you run those website Dockers
bash refresh_website_docker.sh
```

You may need to change some contents in the script (e.g. the configured ports of the websites, the names of the Dockers, etc.).
## Run Visualized Demonstration

The original WebArena also provides a demo for running the agents on your own task on an arbitrary webpage; for example, the agent can be tasked to find the best Thai restaurant in Pittsburgh.

After following the setup instructions above and setting the OpenAI API key (the other environment variables for website URLs aren't really used, so you should be able to set them to some dummy values), you can run the GPT-4V + SoM agent with the following command:

```bash
python run_demo.py \
  --instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \
  --start_url "https://www.amazon.com" \
  --image "https://media.npr.org/assets/img/2023/01/14/this-is-fine_wide-0077dc0607062e15b476fb7f3bd99c5f340af356-s1400-c100.jpg" \
  --intent "Help me navigate to a shirt that has this on it." \
  --result_dir demo_test_amazon \
  --model gpt-4-vision-preview \
  --action_set_tag som --observation_type image_som \
  --render
```

This tasks the agent with finding a shirt that looks like the provided image (the "This is fine" dog) on Amazon. Have fun!
## Acknowledgements

Our code is heavily based on the <a href="https://github.com/web-arena-x/webarena">WebArena codebase</a> and the <a href="https://github.com/web-arena-x/visualwebarena">VisualWebArena codebase</a>.

If you find this environment useful, please consider citing <a href="https://jykoh.com/vwa" target="_blank">VisualWebArena</a> as well as <a href="https://webarena.dev/" target="_blank">WebArena</a>:

```bibtex
@article{koh2024visualwebarena,
  title={VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks},
  author={Koh, Jing Yu and Lo, Robert and Jang, Lawrence and Duvvur, Vikram and Lim, Ming Chong and Huang, Po-Yu and Neubig, Graham and Zhou, Shuyan and Salakhutdinov, Ruslan and Fried, Daniel},
  journal={arXiv preprint arXiv:2401.13649},
  year={2024}
}

@article{zhou2024webarena,
  title={WebArena: A Realistic Web Environment for Building Autonomous Agents},
  author={Zhou, Shuyan and Xu, Frank F and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Bisk, Yonatan and Fried, Daniel and Alon, Uri and others},
  journal={ICLR},
  year={2024}
}
```
VAB-WebArena-Lite/new/agent.py (new file, 227 lines)

@@ -0,0 +1,227 @@
import argparse
import json
from typing import Any, Optional

import tiktoken
from beartype import beartype
from PIL import Image

from agent.prompts import *
from browser_env import Trajectory
from browser_env.actions import (
    Action,
    ActionParsingError,
    create_id_based_action,
    create_none_action,
    create_playwright_action,
)
from browser_env.utils import Observation, StateInfo
from llms import (
    call_llm,
    generate_from_huggingface_completion,
    generate_from_openai_chat_completion,
    generate_from_openai_completion,
    lm_config,
)
from llms.tokenizers import Tokenizer


class Agent:
    """Base class for the agent"""

    def __init__(self, *args: Any) -> None:
        pass

    def next_action(
        self, trajectory: Trajectory, intent: str, meta_data: Any
    ) -> Action:
        """Predict the next action given the observation"""
        raise NotImplementedError

    def reset(
        self,
        test_config_file: str,
    ) -> None:
        raise NotImplementedError


class TeacherForcingAgent(Agent):
    """Agent that follows a pre-defined action sequence"""

    def __init__(self) -> None:
        super().__init__()

    def set_action_set_tag(self, tag: str) -> None:
        self.action_set_tag = tag

    def set_actions(self, action_seq: str | list[str]) -> None:
        if isinstance(action_seq, str):
            action_strs = action_seq.strip().split("\n")
        else:
            action_strs = action_seq
        action_strs = [a.strip() for a in action_strs]

        actions = []
        for a_str in action_strs:
            try:
                if self.action_set_tag == "playwright":
                    cur_action = create_playwright_action(a_str)
                elif self.action_set_tag == "id_accessibility_tree":
                    cur_action = create_id_based_action(a_str)
                else:
                    raise ValueError(
                        f"Unknown action type {self.action_set_tag}"
                    )
            except ActionParsingError as e:
                cur_action = create_none_action()

            cur_action["raw_prediction"] = a_str
            actions.append(cur_action)

        self.actions: list[Action] = actions

    def next_action(
        self, trajectory: Trajectory, intent: str, meta_data: Any
    ) -> Action:
        """Predict the next action given the observation"""
        return self.actions.pop(0)

    def reset(
        self,
        test_config_file: str,
    ) -> None:
        with open(test_config_file) as f:
            ref_actions = json.load(f)["reference_action_sequence"]
            tag = ref_actions["action_set_tag"]
            action_seq = ref_actions["action_sequence"]
            self.set_action_set_tag(tag)
            self.set_actions(action_seq)


class PromptAgent(Agent):
    """prompt-based agent that emits action given the history"""

    @beartype
    def __init__(
        self,
        action_set_tag: str,
        lm_config: lm_config.LMConfig,
        prompt_constructor: PromptConstructor,
        captioning_fn=None,
    ) -> None:
        super().__init__()
        self.lm_config = lm_config
        self.prompt_constructor = prompt_constructor
        self.action_set_tag = action_set_tag
        self.captioning_fn = captioning_fn

        # Check if the model is multimodal.
        if ("gemini" in lm_config.model or "gpt-4" in lm_config.model and "vision" in lm_config.model or lm_config.provider in ["api", "finetune"]) and type(prompt_constructor) == MultimodalCoTPromptConstructor:
            self.multimodal_inputs = True
        else:
            self.multimodal_inputs = False

    def set_action_set_tag(self, tag: str) -> None:
        self.action_set_tag = tag

    @beartype
    def next_action(
        self, trajectory: Trajectory, intent: str, meta_data: dict[str, Any], images: Optional[list[Image.Image]] = None,
        output_response: bool = False
    ) -> Action:
        # Create page screenshot image for multimodal models.
        if self.multimodal_inputs:
            page_screenshot_arr = trajectory[-1]["observation"]["image"]
            page_screenshot_img = Image.fromarray(
                page_screenshot_arr
            )  # size = (viewport_width, viewport_width)

        # Caption the input image, if provided.
        if images is not None and len(images) > 0:
            if self.captioning_fn is not None:
                image_input_caption = ""
                for image_i, image in enumerate(images):
                    if image_i == 0:
                        image_input_caption += f'Input image {image_i+1}: "{self.captioning_fn([image])[0]}"'
                    else:
                        image_input_caption += f'input image {image_i+1}: "{self.captioning_fn([image])[0]}"'
                    if len(images) > 1:
                        image_input_caption += ", "
                # Update intent to include captions of input images.
                intent = f"{image_input_caption}\nIntent: {intent}"
            elif not self.multimodal_inputs:
                print(
                    "WARNING: Input image provided but no image captioner available."
                )

        if self.multimodal_inputs:
            prompt = self.prompt_constructor.construct(
                trajectory, intent, page_screenshot_img, images, meta_data
            )
        else:
            prompt = self.prompt_constructor.construct(
                trajectory, intent, meta_data
            )
        lm_config = self.lm_config
        n = 0
        while True:
            response = call_llm(lm_config, prompt)
            force_prefix = self.prompt_constructor.instruction[
                "meta_data"
            ].get("force_prefix", "")
            response = f"{force_prefix}{response}"
            if output_response:
                print(f'Agent: {response}', flush=True)
            n += 1
            try:
                parsed_response = self.prompt_constructor.extract_action(
                    response
                )
                if self.action_set_tag == "id_accessibility_tree":
                    action = create_id_based_action(parsed_response)
                elif self.action_set_tag == "playwright":
                    action = create_playwright_action(parsed_response)
                elif self.action_set_tag == "som":
                    action = create_id_based_action(parsed_response)
                else:
                    raise ValueError(
                        f"Unknown action type {self.action_set_tag}"
                    )
                action["raw_prediction"] = response
                break
            except ActionParsingError as e:
                if n >= lm_config.gen_config["max_retry"]:
                    action = create_none_action()
                    action["raw_prediction"] = response
                    break

        return action

    def reset(self, test_config_file: str) -> None:
        pass


def construct_agent(args: argparse.Namespace, captioning_fn=None) -> Agent:
    llm_config = lm_config.construct_llm_config(args)

    agent: Agent
    if args.agent_type == "teacher_forcing":
        agent = TeacherForcingAgent()
    elif args.agent_type == "prompt":
        with open(args.instruction_path) as f:
            constructor_type = json.load(f)["meta_data"]["prompt_constructor"]
        tokenizer = Tokenizer(args.provider, args.model)
        prompt_constructor = eval(constructor_type)(
            args.instruction_path, lm_config=llm_config, tokenizer=tokenizer
        )
        agent = PromptAgent(
            action_set_tag=args.action_set_tag,
            lm_config=llm_config,
            prompt_constructor=prompt_constructor,
            captioning_fn=captioning_fn
        )
    else:
        raise NotImplementedError(
            f"agent type {args.agent_type} not implemented"
        )
    return agent
VAB-WebArena-Lite/new/api_utils.py (new file, 682 lines)

@@ -0,0 +1,682 @@
import os
import copy
import json
import time
import base64
import shutil
import requests

import dashscope
import http.client
import anthropic

import google.auth
from google.oauth2 import service_account
from google.auth.transport.requests import Request

from openai import OpenAI
from typing import List, Tuple, Dict
from http import HTTPStatus
from PIL import Image
from io import BytesIO

PROXIES = {  # gemini
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890"
}

SEED = int(os.environ.get("SEED", 42))

GCLOUD_KEY_FILE_PATH = ""  # path to the google cloud project json file
GCLOUD_REGIONAL_CODE = "asia-east1"
OPENAI_API_URL = os.environ.get("OPENAI_API_URL")
FINETUNED_URL = os.environ.get("FINETUNED_URL")  # finetuned model url

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # you should always set up an openai api key for evaluation
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "")  # no need when using google cloud
QWEN_API_KEY = os.environ.get("QWEN_API_KEY", "")
CLAUDE_API_KEY = os.environ.get("CLAUDE_API_KEY", "")


class BasicModel(object):
    def __init__(self):
        super().__init__()
        # make temp dir here
        file_path = os.path.dirname(__file__)
        self.base_dir = os.path.join(file_path, "temp", f"{int(time.time())}")
        os.makedirs(self.base_dir, exist_ok=True)

    def __del__(self):
        # remove temp dir
        shutil.rmtree(self.base_dir, ignore_errors=True)

    def prompt_construct(self, messages: List[Dict]) -> List[Dict]:
        return messages

    @staticmethod
    def process_system_prompt(messages: List[Dict]) -> List[Dict]:
        if messages[0]["role"] != "system":
            return messages

        new_messages = copy.deepcopy(messages[1:])
        system_prompt = messages[0]["content"]

        # Search for first user message and add system prompt to it
        for item in new_messages:
            if item.get("role") != "user":
                continue

            for ct in item["content"]:
                # Case 1: directly appended to the text
                if ct["type"] == "text":
                    ct["text"] = system_prompt + "\n" + ct["text"]
                    return new_messages

            # Case 2: create a new text item
            item["content"].insert(0, {
                "type": "text",
                "text": system_prompt
            })
            return new_messages

        # Case 3: no user message found, add a new user message
        new_messages.insert(0, {
            "role": "user",
            "content": [{
                "type": "text",
                "text": system_prompt
            }]
        })
        return new_messages

    @staticmethod
    def pil_to_b64(img: Image.Image) -> str:
        with BytesIO() as image_buffer:
            img.save(image_buffer, format="PNG")
            byte_data = image_buffer.getvalue()
            img_b64 = base64.b64encode(byte_data).decode("utf-8")
            img_b64 = "data:image/png;base64," + img_b64
        return img_b64

    # save base64 image and return filename
    def b64_to_image(self, base64_data: str) -> str:
        base64_data = base64_data.split(",")[1]
        image_data = base64.b64decode(base64_data)

        filename = os.path.join(self.base_dir, f"{int(time.time())}.png")
        with open(filename, "wb") as f:
            f.write(image_data)

        return filename

    def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]:
        raise NotImplementedError("Subclasses must implement this method")


class OpenAIModel(BasicModel):
    def __init__(self):
        super().__init__()
        self.client = OpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_API_URL)

    def prompt_construct(self, messages: List[Dict]) -> List[Dict]:
        return messages

    def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]:
        if "OPENAI_API_KEY" not in os.environ:
            raise ValueError(
                "OPENAI_API_KEY environment variable must be set when using OpenAI API."
            )

        response = self.client.chat.completions.create(
            model=model_name,
            messages=messages,
            temperature=args.get("temperature", 0.0),
            max_tokens=args.get("max_tokens", 1024),
            top_p=args.get("top_p", 1.0),
        )

        try:
            answer: str = response.choices[0].message.content
            return True, answer
        except:
            return False, str(response.error)


class FinetuneModel(BasicModel):
    def __init__(self):
        super().__init__()
        self.url = FINETUNED_URL  # inference api

    def prompt_construct(self, messages: List[Dict]) -> List[Dict]:
        dialog, images = "", []
        for message in messages:
            if message["role"] == "system":
                dialog += f"<|system|>\n{message['content']}\n\n"
                continue
            elif message["role"] == "assistant":
                dialog += f"<|assistant|>\n{message['content']}\n\n"
                continue

            dialog += f"<|user|>\n"
            for content in message["content"]:
                if content["type"] == "text":
                    dialog += f"{content['text']}\n"
                else:
                    # TODO: we use filename as image url here
                    images.append(self.b64_to_image(content["image_url"]["url"]))
            dialog += "\n\n"

        dialog += "<|assistant|>\n"
        images = [open(image, "rb") for image in images]
        new_messages = [
            {"image": images[0]},
            {"prompt": dialog}
        ]

        return new_messages

    def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]:
        try:
            response = requests.post(self.url, files=messages[0], data=messages[1], timeout=40)
            response = response.json()
        except Exception as e:
            return False, str(e)

        if "error" in response:
            return False, response["error"]["message"]

        # TODO: you should change the response format here
        resp = f'```\n{response["response"].split("<|end_of_text|>")[0]}\n```'
        return True, resp


class QwenModel(BasicModel):
    def __init__(self):
        super().__init__()
        dashscope.api_key = QWEN_API_KEY
        self.seed = SEED

    def prompt_construct(self, messages: List[Dict]) -> List[Dict]:
        messages = self.process_system_prompt(messages)

        new_messages = []
        for message in messages:
            if message["role"] != "user":
                new_messages.append({
                    "role": "assistant",
                    "content": [{"text": message["content"]}]
                })
                continue

            new_content = []
            for content in message["content"]:
                if content["type"] == "text":
                    new_content.append({
                        "text": content["text"],
                    })
                else:
                    filename = self.b64_to_image(content["image_url"]["url"])
                    new_content.append({
                        "image": f"file://{filename}"
                    })
            new_messages.append({
                "role": "user",
                "content": new_content
            })

        return new_messages

    def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]:
        if "QWEN_API_KEY" not in os.environ:
            raise ValueError(
                "QWEN_API_KEY environment variable must be set when using Qwen API."
            )

        response = dashscope.MultiModalConversation.call(
            model=model_name,
            messages=messages,
            top_k=args.get("top_k"),
            seed=self.seed
        )

        if response.status_code == HTTPStatus.OK:
            return True, response.output.choices[0].message.content[0]['text']
        else:
            return False, response.message


class ClaudeModel(BasicModel):
    def __init__(self):
        super().__init__()
        self.client = anthropic.Anthropic(api_key=CLAUDE_API_KEY)

    def prompt_construct(self, messages: List[Dict]) -> List[Dict]:
        new_messages = []
        for message in messages:
            if message["role"] in ["system", "assistant"]:
                new_messages.append(message)
                continue

            new_content = []
            for content in message["content"]:
                if content["type"] == "text":
                    new_content.append(content)
                    continue

                hdr, idata = content["image_url"]["url"].split(";base64,")
                new_content.append({
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": hdr.split("data:")[1],
                        "data": idata
                    }
                })

            new_messages.append({
                "role": "user",
                "content": new_content
            })

        return new_messages

    def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]:
        try:
            if messages[0]["role"] == "system":
                system_prompt = messages[0]["content"]
                messages = messages[1:]
                response = self.client.messages.create(
                    model="claude-3-5-sonnet-20240620",
                    max_tokens=args.get("max_tokens"),
                    temperature=args.get("temperature"),
                    system=system_prompt,
                    messages=messages
                )
            else:
                response = self.client.messages.create(
                    model="claude-3-5-sonnet-20240620",
                    max_tokens=args.get("max_tokens"),
                    temperature=args.get("temperature"),
                    messages=messages
                )

            usage = response.usage
            prompt_tokens = usage.input_tokens
            completion_tokens = usage.output_tokens

            # print(response)
            print(response.content)
            print(f"Prompt Tokens: {prompt_tokens}\nCompletion Tokens: {completion_tokens}\n")
            return True, response.content

        except Exception as e:
            return False, str(e)

    def get_model_response_thirdapi(self, messages) -> Tuple[bool, str]:
        conn = http.client.HTTPSConnection("cn2us02.opapi.win", timeout=900)

        system_prompt = None
        if messages[0]["role"] == "system":
            system_prompt = messages[0]["content"]
            messages = messages[1:]
            payload = {
                "model": "claude-3-opus",
                "stream": False,
                "system": system_prompt,
                "messages": messages,
                "temperature": self.temperature,
                "max_tokens": self.max_tokens
            }
        else:
            payload = {
                "model": "claude-3-opus",
                "stream": False,
                "messages": messages,
                "temperature": self.temperature,
                "max_tokens": self.max_tokens
            }
        payload = json.dumps(payload)
        headers = {
            'Accept': 'application/json',
            'Authorization': f'Bearer {CLAUDE_API_KEY}',
            'User-Agent': 'Apifox/1.0.0 (https://apifox.com)',
            'Content-Type': 'application/json'
        }
        try:
            conn.request("POST", "/v1/messages", payload, headers)
            res = conn.getresponse()
            data = res.read()
            response = json.loads(data.decode("utf-8"))

        except Exception as e:
            return False, str(e)

        if "statusCode" in response and response["statusCode"] != 200:
            return False, response["message"]

        usage = response["usage"]
        prompt_tokens = usage["input_tokens"]
        completion_tokens = usage["output_tokens"]
        print(f"Prompt Tokens: {prompt_tokens}\nCompletion Tokens: {completion_tokens}\n")

        return True, response["content"][0]["text"]


class GeminiModel(BasicModel):
    def __init__(self):
        super().__init__()
        self.api_key = GEMINI_API_KEY

    def prompt_construct(self, messages: List[Dict]) -> List[Dict]:
        parts = []
        dialog = ""
        sep = "\n\n###\n\n"
        for message in messages:
            if message["role"] == "system":
                dialog += f"SYSTEM:\n{message['content']}{sep}"
            elif message["role"] == "assistant":
                dialog += f"ASSISTANT:\n{message['content']}{sep}"
            elif message["role"] == "user":
                dialog += "USER:\n"
                for content in message["content"]:
                    if content["type"] == "text":
                        dialog += content["text"] + "\n"
                        continue

                    assert content["type"] == "image_url"

                    # save text
                    parts.append({ "text": dialog })
                    dialog = ""

                    # new content type for image
                    hdr, idata = content["image_url"]["url"].split(";base64,")
                    parts.append({
                        "inline_data": {
                            "mime_type": hdr.split("data:")[1],
                            "data": idata
                        }
                    })
                dialog += sep
            else:
                raise ValueError(f"Invalid role: {message['role']}")

        parts.append({
            "text": dialog + "ASSISTANT:\n"
        })

        new_messages = [{
            "parts": parts,
            "role": "user"
        }]

        return new_messages

    def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]:
        headers = {
            "Content-Type": "application/json"
        }

        proxies = PROXIES
        url = f"https://generativelanguage.googleapis.com/v1beta/models/{model_name}-latest:generateContent?key={self.api_key}"

        generation_config = {
            "temperature": args.get('temperature'),
            "maxOutputTokens": args.get('max_tokens'),
            "stopSequences": ["\n\n###\n\n"]
        }

        safety_settings = [
            {
                "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
                "threshold": "BLOCK_ONLY_HIGH"
            },
            {
                "category": "HARM_CATEGORY_HARASSMENT",
                "threshold": "BLOCK_ONLY_HIGH"
            },
            {
                "category": "HARM_CATEGORY_HATE_SPEECH",
                "threshold": "BLOCK_ONLY_HIGH"
            },
            {
                "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
                "threshold": "BLOCK_ONLY_HIGH"
            }
        ]

        payload = {
            "contents": messages,
            "generationConfig": generation_config,
            "safetySettings": safety_settings
        }

        try:
            response = requests.post(url, headers=headers, json=payload, proxies=proxies, timeout=30)
            response = response.json()
        except Exception as e:
            return False, str(e)

        if "error" in response:
            return False, response["error"]["message"]
        if "content" not in response['candidates'][0]:
            generation_config['maxOutputTokens'] *= 2  # double the budget for a potential retry (local config only)
            return False, "No content generated."
        return True, response['candidates'][0]['content']['parts'][0]['text']


class VertexGeminiModel(BasicModel):
    def __init__(self):
        super().__init__()

    def prompt_construct(self, messages: List[Dict]) -> List[Dict]:
        parts = []
        dialog = ""
        sep = "\n\n###\n\n"
        for message in messages:
            if message["role"] == "system":
                dialog += f"SYSTEM:\n{message['content']}{sep}"
            elif message["role"] == "assistant":
                dialog += f"ASSISTANT:\n{message['content']}{sep}"
            elif message["role"] == "user":
                dialog += "USER:\n"
                for content in message["content"]:
                    if content["type"] == "text":
                        dialog += content["text"] + "\n"
                        continue

                    assert content["type"] == "image_url"
                    # save text
                    parts.append({ "text": dialog })
                    dialog = ""
                    # new content type for image
                    hdr, idata = content["image_url"]["url"].split(";base64,")
                    parts.append({
                        "inline_data": {
                            "mime_type": hdr.split("data:")[1],
                            "data": idata
                        }
                    })
                dialog += sep
            else:
                raise ValueError(f"Invalid role: {message['role']}")

        parts.append({
            "text": dialog + "ASSISTANT:\n"
        })

        new_messages = [{
            "parts": parts,
            "role": "user"
        }]

        return new_messages

    def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]:
        def get_gcloud_token():
            def get_token():
                try:
                    # Load the credentials from the key file
                    creds = service_account.Credentials.from_service_account_file(
                        GCLOUD_KEY_FILE_PATH,
                        # You can list multiple scopes if needed
                        scopes=['https://www.googleapis.com/auth/cloud-platform']
                    )
                    # Refresh the token (this is needed even for the first time)
                    creds.refresh(Request())
                    return creds.token

                except Exception as e:
                    print(f"An error occurred while trying to fetch the gcloud token: {str(e)}")
                    return None

            os.environ['HTTP_PROXY'] = PROXIES['http']
            os.environ['HTTPS_PROXY'] = PROXIES['https']

            api_key = get_token()  # first attempt before entering the retry loop
            fail_time = 0
            while not api_key and fail_time < 10:
                time.sleep(5)
                api_key = get_token()
                fail_time += 1

            return api_key

        def get_url(model_name: str) -> str:
            region_code = GCLOUD_REGIONAL_CODE
            model_id = f"{model_name}:generateContent"
            with open(GCLOUD_KEY_FILE_PATH, "r") as f:
                project_id = json.load(f)["project_id"]
            return f"https://{region_code}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{region_code}/publishers/google/models/{model_id}"

        url = get_url(model_name)
        api_key = get_gcloud_token()

        if not api_key:
            return False, "Failed to fetch gcloud token."

        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}"
        }

        proxies = PROXIES
        safety_settings = [
            {
                "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
                "threshold": "BLOCK_ONLY_HIGH"
            },
            {
                "category": "HARM_CATEGORY_HARASSMENT",
                "threshold": "BLOCK_ONLY_HIGH"
            },
            {
                "category": "HARM_CATEGORY_HATE_SPEECH",
                "threshold": "BLOCK_ONLY_HIGH"
            },
            {
                "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
                "threshold": "BLOCK_ONLY_HIGH"
            }
        ]

        generation_config = {
            "temperature": args.get('temperature'),
            "maxOutputTokens": args.get('max_tokens'),
            "stopSequences": ["\n\n###\n\n"]
        }

        payload = {
            "contents": messages,
            "generationConfig": generation_config,
            "safetySettings": safety_settings
        }

        try:
            response = requests.post(url, headers=headers, json=payload, proxies=proxies, timeout=120)
            response = response.json()
        except Exception as e:
            return False, str(e)

        if "error" in response:
            return False, response["error"]["message"]
        if "content" not in response['candidates'][0]:
            generation_config['maxOutputTokens'] *= 2  # double the budget for a potential retry (local config only)
            return False, "No content generated."

        return True, response['candidates'][0]['content']['parts'][0]['text']


class APIModel(object):
    def __init__(self):
        super().__init__()
        self.models = {
            "openai": OpenAIModel(),
            "gemini": VertexGeminiModel(),
            "qwen": QwenModel(),
            "finetuned": FinetuneModel(),
            "claude": ClaudeModel()
        }

    def inference(self, model_id: str, messages: List[Dict], args: Dict) -> Tuple[bool, str]:
        # split only on the first "_" so model names may themselves contain underscores, eg. "openai_gpt4o"
        model_provider, model_name = model_id.split("_", 1) if "_" in model_id else (model_id, model_id)
        if model_provider not in self.models:
            return False, f"Unsupported model: {model_provider} ({model_name})"

        model = self.models[model_provider]
        prompt = model.prompt_construct(messages)
        resp = model.get_model_response(prompt, model_name, **args)

        return resp


model = APIModel()

def generate_with_api(prompt: List[dict], model_id: str, args: Dict) -> str:
    success, response = model.inference(model_id, prompt, args)
    return response


if __name__ == "__main__":
    path_to_image = "../../coco_images/000000000285.jpg"

    from PIL import Image
    image = Image.open(path_to_image)
    img_str = BasicModel.pil_to_b64(image)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant. Please respond concisely."
        },
        {
            "role": "user",
            "content": [{
                "type": "text",
                "text": "what's annotated in this image? Image: Omitted."
            }]
        },
        {
            "role": "assistant",
            "content": "Only 5.cart is annotated in this image."
        },
        {
            "role": "user",
            "content": [{
                "type": "text",
                "text": "What can you see?"
            },{
                "type": "image_url",
                "image_url": {
                    "url": img_str,
                    "detail": "high"
                }
            }]
        }
    ]

    response = generate_with_api(messages, "openai", {
        "temperature": 0.5,
        "max_tokens": 1024,
        "top_p": 0.9,
        "n": 1,
    })
VAB-WebArena-Lite/new/evaluators.py (new file, 649 lines)

@@ -0,0 +1,649 @@
|
||||||
|
"""base class for evaluation"""
|
||||||
|
# answer string match
|
||||||
|
import importlib
|
||||||
|
import json
|
||||||
|
import re
|
||||||
|
import time
|
||||||
|
import urllib
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any, Optional, Tuple, Union
|
||||||
|
from urllib.parse import urljoin
|
||||||
|
|
||||||
|
import evaluate # type: ignore[import]
|
||||||
|
import requests
|
||||||
|
from beartype import beartype
|
||||||
|
from beartype.door import is_bearable
|
||||||
|
from nltk.tokenize import word_tokenize # type: ignore
|
||||||
|
from PIL import Image
|
||||||
|
from playwright.sync_api import CDPSession, Page
|
||||||
|
|
||||||
|
from browser_env.actions import Action
|
||||||
|
from browser_env.utils import StateInfo
|
||||||
|
from evaluation_harness import image_utils
|
||||||
|
from evaluation_harness.helper_functions import (
|
||||||
|
PseudoPage,
|
||||||
|
get_query_text,
|
||||||
|
get_query_text_lowercase,
|
||||||
|
gitlab_get_project_memeber_role,
|
||||||
|
llm_fuzzy_match,
|
||||||
|
llm_ua_match,
|
||||||
|
reddit_get_latest_comment_content_by_username,
|
||||||
|
reddit_get_latest_comment_obj_by_username,
|
||||||
|
reddit_get_parent_comment_username_of_latest_comment_by_username,
|
||||||
|
reddit_get_post_url,
|
||||||
|
shopping_get_latest_order_url,
|
||||||
|
shopping_get_num_reviews,
|
||||||
|
shopping_get_order_product_name_list,
|
||||||
|
shopping_get_order_product_option,
|
||||||
|
shopping_get_order_product_quantity,
|
||||||
|
shopping_get_product_attributes,
|
||||||
|
shopping_get_product_price,
|
||||||
|
shopping_get_rating_as_percentage,
|
||||||
|
shopping_get_sku_latest_review_author,
|
||||||
|
shopping_get_sku_latest_review_rating,
|
||||||
|
shopping_get_sku_latest_review_text,
|
||||||
|
)
|
||||||
|
|
||||||
|
Trajectory = list[Union[Action, StateInfo]]
|
||||||
|
|
||||||
|
|
||||||
|
@beartype
|
||||||
|
class Evaluator(object):
|
||||||
|
def __init__(self, eval_tag: str = "") -> None:
|
||||||
|
self.eval_tag = eval_tag
|
||||||
|
|
||||||
|
def __call__(
|
||||||
|
self,
|
||||||
|
trajectory: Trajectory,
|
||||||
|
config_file: Path | str,
|
||||||
|
page: Page | PseudoPage
|
||||||
|
) -> float:
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def get_last_action(trajectory: Trajectory) -> Action:
|
||||||
|
try:
|
||||||
|
is_bearable(trajectory[-1], Action)
|
||||||
|
last_action = trajectory[-1]
|
||||||
|
except Exception:
|
||||||
|
raise ValueError(
|
||||||
|
"The last element of trajectory should be an action, add a fake stop action if needed"
|
||||||
|
)
|
||||||
|
|
||||||
|
return last_action # type: ignore[return-value]
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def get_last_state(trajectory: Trajectory) -> StateInfo:
|
||||||
|
try:
|
||||||
|
is_bearable(trajectory[-2], StateInfo)
|
||||||
|
last_state = trajectory[-2]
|
||||||
|
except Exception:
|
||||||
|
raise ValueError(
|
||||||
|
"The second last element of trajectory should be a state, add a fake stop action if needed"
|
||||||
|
)
|
||||||
|
|
||||||
|
return last_state # type: ignore[return-value]
|
||||||
|
|
||||||
|
|
||||||
|
@beartype
|
||||||
|
class NumericEvaluator(Evaluator):
|
||||||
|
"""Check if the numerical relationship is correct"""
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
@beartype
|
||||||
|
def str_2_int(s: str) -> Optional[int]:
|
||||||
|
try:
|
||||||
|
s = s.strip()
|
||||||
|
if "," in s:
|
||||||
|
s = s.replace(",", "")
|
||||||
|
|
||||||
|
return int(s)
|
||||||
|
except ValueError:
|
||||||
|
# Return None if the string cannot be converted to int
|
||||||
|
print(f"[NumericEvaluator error]: Cannot convert {s} to int")
|
||||||
|
return None
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
@beartype
|
||||||
|
def compare_inequality(
|
||||||
|
value: Union[int, float], inequality: str, tol: float = 1e-8
|
||||||
|
) -> bool:
|
||||||
|
"""
|
||||||
|
Compare a value (int or float) against an inequality string.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
- value (int/float): The value to be compared.
|
||||||
|
- inequality (str): Inequality in the form of "< 700", ">= 300", etc.
|
||||||
|
- tol (float): Tolerance for floating point comparisons.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
- bool: True if the value satisfies the inequality, False otherwise.
|
||||||
|
"""
|
||||||
|
# Extract the operator and the number from the inequality string
|
||||||
|
ops = {
|
||||||
|
"<=": lambda x, y: x <= y + tol,
|
||||||
|
">=": lambda x, y: x >= y - tol,
|
||||||
|
"==": lambda x, y: abs(x - y) <= tol,
|
||||||
|
"<": lambda x, y: x < y + tol,
|
||||||
|
">": lambda x, y: x > y - tol,
|
||||||
|
}
|
||||||
|
|
||||||
|
for op, func in ops.items():
|
||||||
|
if op in inequality:
|
||||||
|
_, num = inequality.split(op)
|
||||||
|
return func(value, float(num.strip()))
|
||||||
|
|
||||||
|
raise ValueError(f"Invalid inequality string: {inequality}")
|
||||||
|
|
||||||
|
|
||||||
|
@beartype
class StringEvaluator(Evaluator):
    """Check whether the answer is correct with:
    exact match: the answer is exactly the same as the reference answer
    must include: each phrase in the reference answer must be included in the answer
    fuzzy match: the answer is similar to the reference answer, using LLM judge
    """

    @staticmethod
    @beartype
    def clean_answer(answer: str) -> str:
        if answer.startswith("'") and answer.endswith("'"):
            answer = answer[1:-1]
        elif answer.startswith('"') and answer.endswith('"'):
            answer = answer[1:-1]
        return answer.lower()

    @staticmethod
    @beartype
    def exact_match(ref: str, pred: Union[str, int]) -> float:
        if isinstance(pred, int):
            pred = str(pred)
        return float(
            StringEvaluator.clean_answer(pred)
            == StringEvaluator.clean_answer(ref)
        )

    @staticmethod
    @beartype
    def must_include(ref: str, pred: str) -> float:
        clean_ref = StringEvaluator.clean_answer(ref)
        clean_pred = StringEvaluator.clean_answer(pred)
        # tokenize the answer if the ref is a single word
        # to prevent false positives (e.g., "0")
        if len(word_tokenize(clean_ref)) == 1:
            tok_pred = word_tokenize(clean_pred)
            return float(clean_ref in tok_pred)
        else:
            return float(clean_ref in clean_pred)

    @staticmethod
    @beartype
    def must_exclude(ref: str, pred: str) -> float:
        """Returns 1 if the reference string is not included in pred, and 0 otherwise"""
        clean_ref = StringEvaluator.clean_answer(ref)
        clean_pred = StringEvaluator.clean_answer(pred)
        # tokenize the answer if the ref is a single word
        # to prevent false positives (e.g., "0")
        if len(word_tokenize(clean_ref)) == 1:
            tok_pred = word_tokenize(clean_pred)
            return float(clean_ref not in tok_pred)
        else:
            return float(clean_ref not in clean_pred)

    @staticmethod
    @beartype
    def fuzzy_match(ref: str, pred: str, intent: str) -> float:
        return llm_fuzzy_match(pred, ref, intent)

    @staticmethod
    @beartype
    def ua_match(ref: str, pred: str, intent: str) -> float:
        return llm_ua_match(pred, ref, intent)

    def __call__(
        self,
        trajectory: Trajectory,
        config_file: Path | str,
        page: Page | PseudoPage | None = None
    ) -> float:
        with open(config_file, "r") as f:
            configs = json.load(f)

        last_action = self.get_last_action(trajectory)
        pred = self.clean_answer(last_action["answer"])

        score = 1.0
        for approach, value in configs["eval"]["reference_answers"].items():
            match approach:
                case "exact_match":
                    score *= self.exact_match(ref=value, pred=pred)
                case "required_values":
                    required_values = value
                    assert isinstance(required_values, list)
                    pred = NumericEvaluator.str_2_int(pred)
                    if pred is None:
                        score = 0.0
                    else:
                        for v in required_values:
                            value_or = v.split(" |OR| ")
                            score *= any(
                                [
                                    NumericEvaluator.compare_inequality(
                                        pred, value
                                    )
                                    for value in value_or
                                ]
                            )
                case "must_include":
                    assert isinstance(value, list)
                    for must_value in value:
                        value_or = must_value.split(" |OR| ")
                        score *= any([self.must_include(ref=v, pred=pred) for v in value_or])
                case "must_exclude":
                    assert isinstance(value, list)
                    for must_excl_value in value:
                        score *= self.must_exclude(
                            ref=must_excl_value, pred=pred
                        )
                case "one_of":
                    assert isinstance(value, list)
                    found = False
                    for one_of_value in value:
                        one_of_value = self.clean_answer(one_of_value)
                        if one_of_value in pred:
                            found = True
                            break
                    score = score * found
                case "fuzzy_match":
                    intent = configs["intent"]
                    if value == "N/A":
                        # if the instruction only asks the model to generate N/A when encountering an unachievable task
                        # without more concrete reasons
                        score *= self.exact_match(ref=value, pred=pred)
                        # if the instruction also asks the model to generate the reason why the task is unachievable,
                        # this should be the default as it will prevent false positive N/A
                        if score != 1:
                            score = 1.0 * self.ua_match(
                                intent=configs["intent"],
                                ref=configs["eval"]["string_note"],
                                pred=pred,
                            )
                    else:
                        assert isinstance(value, list)
                        reference = ', '.join(value)
                        score *= self.fuzzy_match(
                            ref=reference, pred=pred, intent=intent
                        )
        return score


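# Illustrative sketch of a config consumed by StringEvaluator.__call__
# (hypothetical values, not from the benchmark):
#   "eval": {
#       "reference_answers": {
#           "must_include": ["42 |OR| forty-two"],
#           "must_exclude": ["N/A"]
#       }
#   }
# Here the answer scores 1.0 only if it mentions "42" or "forty-two" and does
# not contain "N/A"; each check multiplies into the running score.
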
@beartype
class StringSoftEvaluator(Evaluator):
    """Use text generation metrics such as BLEU, ROUGE, etc. to evaluate the answer"""

    def __call__(
        self,
        trajectory: Trajectory,
        config_file: Path | str,
        page: Page | PseudoPage | None = None
    ) -> float:
        with open(config_file, "r") as f:
            configs = json.load(f)

        last_action = self.get_last_action(trajectory)
        pred = last_action["answer"]
        ref = configs["eval"]["reference_answers"]
        # rouge
        m = evaluate.load("rouge")
        rouge = m.compute(predictions=[pred], references=[ref])
        return float(rouge["rouge1"])


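# Note (an assumption about the runtime environment): evaluate.load("rouge")
# pulls the Hugging Face `evaluate` metric, which additionally needs the
# `rouge_score` package installed; the returned "rouge1" value is a unigram
# F-score in [0, 1], so it composes with the other multiplicative scores.
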
@beartype
class URLExactEvaluator(Evaluator):
    """Check whether the URL is exactly the same as one of the reference URLs"""

    def __call__(
        self,
        trajectory: Trajectory,
        config_file: Path | str,
        page: Page | PseudoPage
    ) -> float:
        with open(config_file, "r") as f:
            configs = json.load(f)

        def clean_url(url: str) -> str:
            url = str(url)
            # Replace http://localhost with http://127.0.0.1 to keep things consistent across evals.
            url = url.replace("localhost", "127.0.0.1")
            if url.endswith("/"):
                url = url[:-1]
            return url

        pred = clean_url(page.url)
        ref_urls = configs["eval"]["reference_url"].split(" |OR| ")
        ref_urls = [clean_url(url) for url in ref_urls]
        matching_rule = configs["eval"].get("url_note", "EXACT")
        if matching_rule == "EXACT":
            if pred in ref_urls:
                return 1.0
            else:
                return 0.0
        elif matching_rule == "GOLD in PRED":
            if any([ref in pred for ref in ref_urls]):
                return 1.0
            else:
                return 0.0
        else:
            raise ValueError(f"Unknown matching rule: {matching_rule}")


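# Illustrative example (hypothetical values): with
#   "reference_url": "http://localhost:7770/sales/order/history |OR| http://127.0.0.1:7770/sales/order/history/"
# both alternatives normalize to http://127.0.0.1:7770/sales/order/history, so a
# page at either address scores 1.0 under the default "EXACT" rule; the
# "GOLD in PRED" rule instead accepts any current URL that merely contains one
# of the normalized references.
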
@beartype
class HTMLContentExactEvaluator(Evaluator):
    """Check whether the contents appear in the page"""

    @staticmethod
    @beartype
    def fuzzy_match(ref: str, pred: str, intent: str) -> float:
        return llm_fuzzy_match(pred, ref, intent)

    def __call__(
        self,
        trajectory: Trajectory,
        config_file: Path | str,
        page: Page | PseudoPage
    ) -> float:
        with open(config_file, "r") as f:
            configs = json.load(f)

        targets = configs["eval"]["program_html"]

        score = 1.0
        for target in targets:
            target_url: str = target["url"]  # which url to check
            if target_url.startswith("func"):
                func = target_url.split("func:")[1]
                func = func.replace("__last_url__", page.url)
                target_url = eval(func)

            locator: str = target["locator"]  # js element locator

            # navigate to that url
            if target_url != "last":
                page.goto(target_url)
                time.sleep(3)  # TODO [shuyanzh]: fix this hard-coded sleep

            # empty, use the full page
            if not locator.strip():
                selected_element = page.content()
            # use JS to select the element
            elif locator.startswith("document.") or locator.startswith(
                "[...document."
            ):
                if "prep_actions" in target:
                    try:
                        for prep_action in target["prep_actions"]:
                            page.evaluate(f"() => {prep_action}")
                    except Exception:
                        pass
                try:
                    selected_element = str(page.evaluate(f"() => {locator}"))
                    if not selected_element:
                        selected_element = ""
                except Exception:
                    # the page is wrong, return empty
                    selected_element = ""
            elif locator.startswith("lambda:"):
                try:
                    # removeprefix instead of lstrip: lstrip treats the argument
                    # as a character set and could eat the start of the lambda body
                    locator = locator.removeprefix("lambda:")
                    selected_element = page.evaluate(locator)
                    if not selected_element:
                        selected_element = None
                except Exception:
                    # the page is wrong, return empty
                    selected_element = None
            # run program to call API
            elif locator.startswith("func:"):  # a helper function
                func = locator.split("func:")[1]
                func = func.replace("__page__", "page")
                selected_element = eval(func)
            else:
                raise ValueError(f"Unknown locator: {locator}")

            # If the selected element is None, then the page is wrong
            if selected_element is None:
                score = 0.0
                break

            if "exact_match" in target["required_contents"]:
                required_contents = target["required_contents"]["exact_match"]
                score *= StringEvaluator.exact_match(
                    ref=required_contents, pred=selected_element
                )
            elif "must_include" in target["required_contents"]:
                required_contents = target["required_contents"]["must_include"]
                assert isinstance(required_contents, list)
                for content in required_contents:
                    content_or = content.split(" |OR| ")
                    score *= any(
                        [
                            StringEvaluator.must_include(
                                ref=content, pred=selected_element
                            )
                            for content in content_or
                        ]
                    )
            elif "must_exclude" in target["required_contents"]:
                required_contents = target["required_contents"]["must_exclude"]
                assert isinstance(required_contents, list)
                for content in required_contents:
                    assert " |OR| " not in content
                    score *= StringEvaluator.must_exclude(
                        content, pred=selected_element
                    )
            elif "required_values" in target["required_contents"]:
                required_values = target["required_contents"][
                    "required_values"
                ]
                assert isinstance(required_values, list)
                if isinstance(selected_element, str):
                    selected_element = NumericEvaluator.str_2_int(
                        selected_element
                    )
                if selected_element is None:
                    score = 0.0
                else:
                    for value in required_values:
                        value_or = value.split(" |OR| ")
                        score *= any(
                            [
                                NumericEvaluator.compare_inequality(
                                    selected_element, value
                                )
                                for value in value_or
                            ]
                        )
            elif "fuzzy_match" in target["required_contents"]:
                required_contents = target["required_contents"]["fuzzy_match"]
                intent = configs["intent"]

                assert isinstance(required_contents, list)
                reference = ', '.join(required_contents)
                score *= self.fuzzy_match(
                    ref=reference, pred=selected_element, intent=intent
                )
            else:
                raise ValueError(
                    f"Unknown required_contents: {target['required_contents'].keys()}"
                )

        return score


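# Illustrative sketch of one program_html target (hypothetical values):
#   {"url": "last",
#    "locator": "document.querySelector('.order-status').outerText",
#    "required_contents": {"must_include": ["complete |OR| processing"]}}
# The evaluator stays on the current page ("last"), runs the JS locator, and
# requires the selected text to mention either status.
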
@beartype
class PageImageEvaluator(Evaluator):
    """Check whether the answer is correct by querying a vision model."""

    def __init__(self, captioning_fn):
        self.captioning_fn = captioning_fn
        # Default to 0.8 as the threshold for similarity to account for compression, resizing, etc.
        # This might be too generous but we bias towards minimizing false negatives.
        self.ssim_threshold = 0.8

    def __call__(
        self,
        trajectory: Trajectory,
        config_file: Path | str,
        page: Page | PseudoPage | None = None
    ) -> float:
        with open(config_file, "r") as f:
            configs = json.load(f)

        for query in configs["eval"]["page_image_query"]:
            locator: str = query["eval_image_class"]
            target_url: str = query["eval_image_url"]
            if target_url.startswith("func"):
                func = target_url.split("func:")[1]
                func = func.replace("__last_url__", page.url)
                target_url = eval(func)

            # navigate to that url
            if target_url != "last":
                page.goto(target_url)
                time.sleep(3)  # TODO(jykoh): fix this hard-coded sleep

            # empty, use the full page
            if not locator.strip():
                images = page.get_by_role("img").all()
            # use JS to select the element
            elif locator.startswith("."):
                # Get all img children under the locator
                elements = page.query_selector_all(locator)
                images = []
                for element in elements:
                    is_img = element.evaluate(
                        'element => element.tagName === "IMG"'
                    )
                    if is_img:
                        images.append(element)
                    else:
                        images.extend(element.query_selector_all("img"))
            else:
                raise ValueError(f"Unknown locator: {locator}")

            if images == []:
                return 0.0

            all_image_pixels = []
            for image in images:
                try:
                    # Get image from URL.
                    image_url = image.get_attribute("src")
                    if not image_url.startswith(
                        ("http://", "https://", "www.")
                    ):
                        image_url = urljoin(page.url, image_url)
                    image = Image.open(
                        requests.get(image_url, stream=True).raw
                    )
                    all_image_pixels.append(image)
                except Exception as e:
                    print("[WARNING]: ", e)

            score = 1.0
            if all_image_pixels == []:
                return 0.0
            else:
                # Run the VQA eval on the image elements.
                eval_vqas = query.get("eval_vqa", [])
                assert (
                    len(eval_vqas) > 0 or "eval_fuzzy_image_match" in query
                ), "eval_vqa must have at least 1 question or eval_fuzzy_image_match must be provided"
                for qa in eval_vqas:
                    question, answer = qa["question"], qa["answer"]
                    prompt = f"Q: {question} A:"
                    pred_ans = self.captioning_fn(
                        all_image_pixels, [prompt] * len(all_image_pixels)
                    )
                    score *= float(
                        any(
                            [answer.lower() in ans.lower() for ans in pred_ans]
                        )
                    )

                if "eval_fuzzy_image_match" in query:
                    ssim_threshold = query.get(
                        "ssim_threshold", self.ssim_threshold
                    )
                    exact_match_imgs = query["eval_fuzzy_image_match"].split(
                        " |OR| "
                    )
                    all_exact_match_pixels = []

                    for exact_match_img in exact_match_imgs:
                        if exact_match_img.startswith("http"):
                            exact_match_pixels = Image.open(
                                requests.get(exact_match_img, stream=True).raw
                            )
                        else:
                            exact_match_pixels = Image.open(exact_match_img)
                        all_exact_match_pixels.append(exact_match_pixels)

                    # Check if any of the images on the page match
                    found_exact_match = False
                    for exact_match_pixels in all_exact_match_pixels:
                        for image_pixels in all_image_pixels:
                            ssim = image_utils.get_image_ssim(
                                image_pixels, exact_match_pixels
                            )
                            if ssim > ssim_threshold:
                                found_exact_match = True
                                break
                    score *= float(found_exact_match)

        return score


class EvaluatorComb:
    def __init__(self, evaluators: list[Evaluator]) -> None:
        self.evaluators = evaluators

    def __call__(
        self,
        trajectory: Trajectory,
        config_file: Path | str,
        page: Page | PseudoPage
    ) -> float:

        score = 1.0
        for evaluator in self.evaluators:
            cur_score = evaluator(trajectory, config_file, page)
            score *= cur_score

        return score


@beartype
def evaluator_router(
    config_file: Path | str, captioning_fn=None
) -> EvaluatorComb:
    """Router to get the evaluator class"""
    with open(config_file, "r") as f:
        configs = json.load(f)

    eval_types = configs["eval"]["eval_types"]
    evaluators: list[Evaluator | EvaluatorPartial] = []
    for eval_type in eval_types:
        match eval_type:
            case "string_match":
                evaluators.append(StringEvaluator())
            case "url_match":
                evaluators.append(URLExactEvaluator())
            case "program_html":
                evaluators.append(HTMLContentExactEvaluator())
            case "page_image_query":
                evaluators.append(PageImageEvaluator(captioning_fn))
            case _:
                raise ValueError(f"eval_type {eval_type} is not supported")

    return EvaluatorComb(evaluators)
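

# Illustrative end-to-end usage (a sketch; `trajectory`, `config_file`, and
# `page` are assumed to come from the harness): the router composes one
# evaluator per declared eval_type, and the combined score is their product.
#
#   evaluator = evaluator_router(config_file)
#   score = evaluator(trajectory, config_file, page)  # 1.0 only if every check passes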
63
VAB-WebArena-Lite/new/generate_test_data.py
Normal file

@@ -0,0 +1,63 @@
"""Replace the website placeholders with website domains from env_config
Generate the test data"""
import json
import os

from browser_env.env_config import *


def main() -> None:
    DATASET = os.environ["DATASET"]
    if DATASET == "webarena":
        print("DATASET: webarena")
        print(f"REDDIT: {REDDIT}")
        print(f"SHOPPING: {SHOPPING}")
        print(f"SHOPPING_ADMIN: {SHOPPING_ADMIN}")
        print(f"GITLAB: {GITLAB}")
        print(f"WIKIPEDIA: {WIKIPEDIA}")
        print(f"MAP: {MAP}")

        inp_paths = ["config_files/wa/test_webarena.raw.json", "config_files/wa/test_webarena_lite.raw.json"]
        replace_map = {
            "__REDDIT__": REDDIT,
            "__SHOPPING__": SHOPPING,
            "__SHOPPING_ADMIN__": SHOPPING_ADMIN,
            "__GITLAB__": GITLAB,
            "__WIKIPEDIA__": WIKIPEDIA,
            "__MAP__": MAP,
        }
    elif DATASET == "visualwebarena":
        print("DATASET: visualwebarena")
        print(f"CLASSIFIEDS: {CLASSIFIEDS}")
        print(f"REDDIT: {REDDIT}")
        print(f"SHOPPING: {SHOPPING}")
        inp_paths = [
            "config_files/vwa/test_classifieds.raw.json", "config_files/vwa/test_shopping.raw.json", "config_files/vwa/test_reddit.raw.json",
        ]
        replace_map = {
            "__REDDIT__": REDDIT,
            "__SHOPPING__": SHOPPING,
            "__WIKIPEDIA__": WIKIPEDIA,
            "__CLASSIFIEDS__": CLASSIFIEDS,
        }
    else:
        raise ValueError(f"Dataset not implemented: {DATASET}")

    for inp_path in inp_paths:
        output_dir = inp_path.replace('.raw.json', '')
        os.makedirs(output_dir, exist_ok=True)
        with open(inp_path, "r") as f:
            raw = f.read()
        for k, v in replace_map.items():
            raw = raw.replace(k, v)

        with open(inp_path.replace(".raw", ""), "w") as f:
            f.write(raw)
        data = json.loads(raw)
        for idx, item in enumerate(data):
            with open(os.path.join(output_dir, f"{idx}.json"), "w") as f:
                json.dump(item, f, indent=2)


if __name__ == "__main__":
    main()
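

# Illustrative invocation (a sketch; the DATASET value and the working
# directory are assumptions): run from the directory containing config_files/,
#   DATASET=webarena python generate_test_data.py
# which rewrites each *.raw.json into a resolved *.json plus one per-task
# config file under a directory of the same name.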
647
VAB-WebArena-Lite/new/helper_functions.py
Normal file

@@ -0,0 +1,647 @@
"""Implements helper functions to assist evaluation cases where other evaluators are not suitable."""
import json
from datetime import datetime, timezone
from typing import Any, Union
from urllib.parse import urlparse

import requests
from beartype import beartype
from beartype.typing import Dict, List
from playwright.sync_api import CDPSession, Page

from browser_env.env_config import (
    ACCOUNTS,
    REDDIT,
    SHOPPING,
    WIKIPEDIA,
)
from llms.providers.openai_utils import (
    generate_from_openai_chat_completion,
)

import logging
logger = logging.getLogger("logger")


class PseudoPage:
    def __init__(self, original_page: Page, url: str):
        self.url = url
        self.original_page = original_page

    def __getattr__(self, attr: str) -> Any:
        # Delegate attribute access to the original page object
        if attr not in ["url"]:
            return getattr(self.original_page, attr)
        else:
            return getattr(self, attr)


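# Illustrative note (a sketch; the URL is hypothetical): PseudoPage wraps a
# live Playwright page but pins the reported URL, so
#   pseudo = PseudoPage(page, "http://127.0.0.1:9999/f/food/123/")
# behaves like `page` for every method call, while evaluators that read `.url`
# see the overridden address.
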
@beartype
def shopping_get_auth_token() -> str:
    response = requests.post(
        url=f"{SHOPPING}/rest/default/V1/integration/admin/token",
        headers={"content-type": "application/json"},
        data=json.dumps(
            {
                "username": ACCOUNTS["shopping_site_admin"]["username"],
                "password": ACCOUNTS["shopping_site_admin"]["password"],
            }
        ),
    )
    token: str = response.json()
    return token


@beartype
def shopping_get_latest_order_url() -> str:
    """Get the latest order url from the shopping website."""

    header = {
        "Authorization": f"Bearer {shopping_get_auth_token()}",
        "Content-Type": "application/json",
    }

    params = {
        "searchCriteria[sortOrders][0][field]": "created_at",
        "searchCriteria[sortOrders][0][direction]": "DESC",
        "searchCriteria[pageSize]": "1",
    }

    response = requests.get(
        f"{SHOPPING}/rest/V1/orders", params=params, headers=header
    )
    assert response.status_code == 200
    response_obj = response.json()["items"][0]
    order_id = int(response_obj["increment_id"])
    order_url = f"{SHOPPING}/sales/order/view/order_id/{order_id}/"
    return order_url


@beartype
def shopping_get_sku_latest_review_author(sku: str) -> str:
    """Get the author of the latest review for shopping admin."""
    header = {
        "Authorization": f"Bearer {shopping_get_auth_token()}",
        "Content-Type": "application/json",
    }
    response = requests.get(
        f"{SHOPPING}/rest/V1/products/{sku}/reviews", headers=header
    )
    assert response.status_code == 200
    response_obj = response.json()
    if len(response_obj) == 0:
        return ""
    author: str = response_obj[-1]["nickname"]
    return author


@beartype
def shopping_get_sku_latest_review_rating(sku: str) -> str:
    """Get the rating of the latest review for shopping admin."""
    header = {
        "Authorization": f"Bearer {shopping_get_auth_token()}",
        "Content-Type": "application/json",
    }
    response = requests.get(
        f"{SHOPPING}/rest/V1/products/{sku}/reviews", headers=header
    )
    assert response.status_code == 200
    response_obj = response.json()
    if len(response_obj) == 0:
        return ""
    assert response_obj[0]["ratings"][0]["rating_name"] == "Rating"
    rating: str = str(response_obj[-1]["ratings"][0]["percent"])
    return rating


@beartype
def shopping_get_sku_latest_review_text(sku: str) -> str:
    """Get the latest review text for shopping admin."""
    header = {
        "Authorization": f"Bearer {shopping_get_auth_token()}",
        "Content-Type": "application/json",
    }
    response = requests.get(
        f"{SHOPPING}/rest/V1/products/{sku}/reviews", headers=header
    )
    assert response.status_code == 200
    response_obj = response.json()
    if len(response_obj) == 0:
        return ""
    text: str = response_obj[-1]["detail"]
    return text


@beartype
def shopping_get_sku_latest_review_title(sku: str) -> str:
    """Get the latest review title for shopping admin."""
    header = {
        "Authorization": f"Bearer {shopping_get_auth_token()}",
        "Content-Type": "application/json",
    }
    response = requests.get(
        f"{SHOPPING}/rest/V1/products/{sku}/reviews", headers=header
    )
    assert response.status_code == 200
    response_obj = response.json()
    if len(response_obj) == 0:
        return ""
    title: str = response_obj[-1]["title"]
    return title


@beartype
def shopping_get_sku_product_page_url(sku: str) -> str:
    """Get product page url from sku"""
    header = {
        "Authorization": f"Bearer {shopping_get_auth_token()}",
        "Content-Type": "application/json",
    }
    response = requests.get(
        f"{SHOPPING}/rest/V1/products/{sku}", headers=header
    )
    assert response.status_code == 200
    response_obj = response.json()
    if len(response_obj) == 0:
        return ""
    for custom_attributes in response_obj["custom_attributes"]:
        if custom_attributes["attribute_code"] == "url_key":
            return f"{SHOPPING}/{custom_attributes['value']}.html"
    return ""


@beartype
def shopping_get_all_product_order(
    page: Page | PseudoPage,
) -> List[Dict[str, str]]:
    """
    Get info of all product in a given order page.

    Example output:
    [
        {
            "name": "Kellogg's Special K Protein Bars, Meal Replacement, Protein Snacks, Value Size, Strawberry, 19oz Box (12 Bars)\nSize\n12 Count (Pack of 1)",
            "options": {
                "Size": "12 Count (Pack of 1)"
            },
            "sku": "B00MXUFL0E",
            "price": "$24.50",
            "qty": "Ordered2",
            "subtotal": "$49.00"
        },
        {
            "name": "Kellogg's Special K Protein Bars, Meal Replacement, Protein Snacks, Value Size, Chocolatey Chip Cookie Dough, 19oz Box (12 Bars)",
            "sku": "B07ZD2PB9F",
            "price": "$42.30",
            "qty": "Ordered2",
            "subtotal": "$84.60"
        }
    ]
    """
    try:
        result = page.evaluate(
            f"""
(() => {{
    try {{
        const products = [...document.querySelector("#my-orders-table").getElementsByTagName('tbody')].map(
            (x) => {{
                return [...x.getElementsByTagName('td')].reduce(function(obj, y) {{
                    const key = y.className.split(' ')[1];
                    obj[key] = y.outerText;
                    // check if options exist
                    if (key === 'name' && y.querySelector('dl')) {{
                        var option_dict = {{}}
                        const options = [...y.querySelector('dl').children];
                        for (let i = 0; i < options.length; i += 2) {{
                            option_dict[options[i].outerText] = options[i+1].outerText;
                        }}
                        obj['options'] = option_dict;
                    }}
                    return obj;
                }}, {{}})
            }}
        );
        return products;
    }} catch (e) {{
        // If any errors are caught, return an empty list
        return [];
    }}
}})();
"""
        )
        return result
    except Exception as e:
        result = []

    return result


@beartype
def shopping_get_order_product_name_list(page: Page | PseudoPage) -> str:
    try:
        products = shopping_get_all_product_order(page)

        return " |OR| ".join([p["name"] for p in products])
    except Exception:
        return ""


@beartype
def shopping_get_order_product_quantity(
    page: Page | PseudoPage, sku: str
) -> int:
    try:
        if "|OR|" in sku:
            skus = sku.split(" |OR| ")
        else:
            skus = [sku]

        products = shopping_get_all_product_order(page)
        for product in products:
            if product["sku"].strip() in skus:
                # the qty cell reads "Ordered{qty}", so strip the "Ordered" prefix
                return int(product["qty"][7:])
        return 0
    except Exception:
        return 0


@beartype
def shopping_get_order_product_option(
    page: Page | PseudoPage, sku: str, option_name: str
) -> str:
    try:
        products = shopping_get_all_product_order(page)
        for product in products:
            if product["sku"].strip() == sku:
                return product["options"][option_name]
        return ""
    except Exception as e:
        return ""


@beartype
def shopping_get_product_attributes(
    page: Page | PseudoPage, attribute: str
) -> str:
    # Get the values of all cells in the table for the given attribute
    try:
        result = page.evaluate(
            f"""
(() => {{
    try {{
        // Create an array of search terms, splitting the string by ' |OR| '
        const searchTerms = '{attribute}'.toLowerCase().split(' |or| ');
        // Convert the children of the tbody inside the element with the given ID into an array
        return Array.from(
            document.querySelector('#productDetails_detailBullets_sections1 > tbody').children
        )
        // Filter the array to only include elements where the first child's text includes any of the search terms
        .filter(x =>
            searchTerms.some(term => x.children[0].outerText.toLowerCase().includes(term))
        )
        // Map over the filtered elements to get the outerText of their second child
        .map(x => x.children[1].outerText)
        // Join all the resulting strings with a comma and a space
        .join(', ')
    }} catch (e) {{
        // If any errors are caught, return an empty string
        return ''
    }}
}})();
"""
        )
    except Exception:
        result = ""

    return result


@beartype
def shopping_get_product_price(page: Page | PseudoPage) -> Union[float, int]:
    """Get the price of the product on the shopping website."""
    try:
        result = page.evaluate(
            """
(() => {
    res = parseFloat(document.querySelector("#maincontent > div.columns > div > div.product-info-main > div.product-info-price > div.price-box.price-final_price > span > span")
        .outerText.substr(1));
    return res ? res : 0;
})();
"""
        )
    except Exception:
        result = 0

    return result


@beartype
def shopping_get_num_reviews(page: Page | PseudoPage) -> int:
    """Get the number of reviews of the product on the shopping website."""
    try:
        result = page.evaluate(
            """
(() => {
    res = parseInt(document.querySelector("#tab-label-reviews-title")
        .outerText.split(' ')[1]);
    return res ? res : 0;
})();
"""
        )
    except Exception:
        result = 0

    return result


@beartype
def shopping_get_rating_as_percentage(page: Page | PseudoPage) -> int:
    """Get the rating of the product on the shopping website as a percentage out of 100."""
    try:
        rating = page.evaluate(
            """
(() => {
    ratingPercentage = parseFloat(document.querySelector('.rating-result').title.replace('%', ''));
    return ratingPercentage ? ratingPercentage : 0;
})();
"""
        )
    except Exception:
        rating = 0

    return rating


@beartype
def get_query_text(page: Page | PseudoPage, selector: str) -> str:
    """Get the text content of the element matching the given selector.

    Note that this function DOES NOT perform downcasing.
    """
    try:
        result = page.evaluate(
            f"""
(() => {{
    try {{
        return document.querySelector('{selector}').textContent;
    }} catch (e) {{
        return '';
    }}
}})();
"""
        )
    except Exception:
        result = ""

    return result


@beartype
def get_query_text_lowercase(page: Page | PseudoPage, selector: str) -> str:
    """Get the lowercase text content of the element matching the given selector."""
    return get_query_text(page, selector).lower()


@beartype
def reddit_get_post_url(url: str) -> str:
    """Get the post url"""
    # Url is http://domain/f/subreddit/post_id/...
    # get domain, subreddit, post_id
    domain = urlparse(url).netloc
    tok_url = urlparse(url).path.split("/")
    # not a valid post/comment url, return the url as is
    if len(tok_url) < 4:
        return url
    if tok_url[1] != "f":
        return url
    subreddit = urlparse(url).path.split("/")[2]
    post_id = urlparse(url).path.split("/")[3]
    scheme = urlparse(url).scheme
    post_url = f"{scheme}://{domain}/f/{subreddit}/{post_id}/"
    return post_url


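# Illustrative example (hypothetical URL): a comment permalink such as
#   http://127.0.0.1:9999/f/food/123/some-title/comment/456
# is truncated to its canonical post URL http://127.0.0.1:9999/f/food/123/,
# while non-post URLs (fewer than four path segments, or no "/f/" prefix)
# are returned unchanged.
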
@beartype
def reddit_get_post_comment_tree(page: Page | PseudoPage) -> Dict[str, Any]:
    try:
        comment_tree = page.evaluate(
            f"""(function buildCommentTree(node, data_level) {{
    let tree = {{
        "username": node.querySelector(".fg-inherit").outerText,
        "net_score": parseInt(node.querySelector(".vote__net-score").outerText),
        "content": node.querySelector(".comment__content").outerText,
        "time": new Date(node.querySelector('.comment__main > header > h1 > span > time').dateTime),
        "children": []
    }};
    node.querySelectorAll(".comment").forEach((child) => {{
        if (parseInt(child.getAttribute('data-level')) === data_level+1) {{
            tree['children'].push(buildCommentTree(child, data_level+1));
        }}
    }})

    return tree;
}})(document.querySelector("#main"), 0)"""
        )
    except Exception:
        comment_tree = {}

    return comment_tree


@beartype
def reddit_get_latest_comment_obj_by_username(
    page: Page | PseudoPage, username: str
) -> Dict[str, Any]:
    try:
        comment_tree = reddit_get_post_comment_tree(page)
        latest_time = datetime.min.replace(tzinfo=timezone.utc)
        comment = {}

        def dfs(node):
            nonlocal latest_time
            nonlocal comment
            if node["username"] == username:
                if node["time"] > latest_time:
                    comment = {
                        "username": node["username"],
                        "net_score": node["net_score"],
                        "content": node["content"],
                        "time": node["time"],
                    }
                    latest_time = node["time"]

            for child in node["children"]:
                dfs(child)

        dfs(comment_tree)

    except Exception as e:
        comment = {}
    return comment


@beartype
def reddit_get_latest_comment_content_by_username(
    page: Page | PseudoPage, username: str
) -> str:
    try:
        comment = reddit_get_latest_comment_obj_by_username(page, username)
        content = comment["content"]

    except Exception:
        content = ""

    return content


@beartype
def reddit_get_parent_comment_obj_of_latest_comment_by_username(
    page: Page | PseudoPage, username: str
) -> Dict[str, Any]:
    try:
        comment_tree = reddit_get_post_comment_tree(page)
        latest_time = datetime.min.replace(tzinfo=timezone.utc)
        comment = {}

        def dfs(node):
            nonlocal latest_time
            nonlocal comment
            for child in node["children"]:
                if child["username"] == username:
                    if child["time"] > latest_time:
                        comment = {
                            "username": node["username"],
                            "net_score": node["net_score"],
                            "content": node["content"],
                            "time": node["time"],
                        }
                        latest_time = child["time"]
                else:
                    dfs(child)

        dfs(comment_tree)

    except Exception:
        comment = {}
    return comment


@beartype
def reddit_get_parent_comment_username_of_latest_comment_by_username(
    page: Page | PseudoPage, username: str
) -> str:
    try:
        comment = reddit_get_parent_comment_obj_of_latest_comment_by_username(
            page, username
        )
        username = comment["username"]

    except Exception:
        username = ""

    return username


@beartype
def gitlab_get_project_memeber_role(
    page: Page | PseudoPage, account_name: str
) -> str:
    # get the account index
    try:
        account_idx = page.evaluate(
            f"""(() => {{
    const elements = document.querySelectorAll("td[data-label='Account'] span.gl-avatar-labeled-sublabel");
    let index = -1;  // Default value if not found

    for(let i = 0; i < elements.length; i++) {{
        if(elements[i].outerText === '@{account_name}') {{
            index = i;
            break;
        }}
    }}

    return index;
}})()"""
        )

        # get the role
        role: str = page.evaluate(
            f"""(() => {{
    return document.querySelectorAll("td.col-max-role span")[{account_idx}].outerText;
}})()"""
        )
    except Exception:
        role = ""

    return role


@beartype
def llm_fuzzy_match(pred: str, reference: str, question: str) -> float:
    """Check whether the prediction matches the reference with GPT-4-turbo"""
    messages: list[dict[str, Any]] = []
    # construct the question to ask
    message = "Help a teacher to grade the answer of a student given a question. Keep in mind that the student may use different phrasing or wording to answer the question. The goal is to evaluate whether the answer is semantically equivalent to the reference answer.\n"
    message += f"question: {question}\n"
    message += f"reference answer: {reference}\n"
    message += "all the string 'N/A' that you see is a special sequence that means 'not achievable'\n"
    message += f"student answer: {pred}\n"
    message += "Conclude the judgement by 'correct', 'incorrect', or 'partially correct'. Only output one of these options, and nothing else."
    messages = [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": message},
    ]

    logger.info(f'[R] {reference}')
    logger.info(f'[P] {pred}')

    response = generate_from_openai_chat_completion(
        model="gpt-4-1106-preview",
        messages=messages,
        temperature=0,
        max_tokens=768,
        top_p=1.0,
        context_length=0,
    ).lower()
    if "partially correct" in response or "incorrect" in response:
        return 0.0
    else:
        assert "correct" in response, response
        return 1.0


def llm_ua_match(pred: str, reference: str, question: str) -> float:
    """Check whether the prediction matches the reference with GPT-4-turbo"""
    messages: list[dict[str, Any]] = []
    # construct the question to ask
    message = ""
    message += f"task: {question}\n"
    message += f"actual unachievable reason: {reference}\n"
    message += f"reported unachievable reason: {pred}\n"
    message += (
        "The task described above is inherently unachievable due to the reason specified under 'actual unachievable reason'. "
        "An individual previously attempted this task and was unable to complete it. They provided a reason for their failure, "
        "which is listed under 'reported unachievable reason'. Your role is to review both the actual and reported reasons. "
        "Determine if the reported reason aligns with the actual reason, even if implicitly. "
        "If the stated reason is in line with the actual reason, respond with 'same'. Otherwise, respond with 'different'."
    )
    messages = [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": message},
    ]

    response = generate_from_openai_chat_completion(
        model="gpt-4-1106-preview",
        messages=messages,
        temperature=0,
        max_tokens=768,
        top_p=1.0,
        context_length=0,
    ).lower()
    if "different" in response:
        return 0.0
    else:
        assert "same" in response
        return 1.0
23
VAB-WebArena-Lite/new/llms_init.py
Normal file

@@ -0,0 +1,23 @@
"""This module is adapted from https://github.com/zeno-ml/zeno-build"""
try:
    from .providers.gemini_utils import generate_from_gemini_completion
except Exception:  # keep the Gemini provider optional; a bare except would also swallow SystemExit
    print('Google Cloud not set up, skipping import of providers.gemini_utils.generate_from_gemini_completion')

from .providers.hf_utils import generate_from_huggingface_completion
from .providers.openai_utils import (
    generate_from_openai_chat_completion,
    generate_from_openai_completion,
)
from .providers.api_utils import (
    generate_with_api,
)
from .utils import call_llm

__all__ = [
    "generate_from_openai_completion",
    "generate_from_openai_chat_completion",
    "generate_from_huggingface_completion",
    "generate_from_gemini_completion",
    "call_llm",
]
57
VAB-WebArena-Lite/new/lm_config.py
Normal file

@@ -0,0 +1,57 @@
"""Config for language models."""

from __future__ import annotations

import argparse
import dataclasses
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class LMConfig:
    """A config for a language model.

    Attributes:
        provider: The name of the API provider.
        model: The name of the model.
        model_cls: The Python class corresponding to the model, mostly for
            Hugging Face transformers.
        tokenizer_cls: The Python class corresponding to the tokenizer, mostly
            for Hugging Face transformers.
        mode: The mode of the API calls, e.g., "chat" or "generation".
    """

    provider: str
    model: str
    model_cls: type | None = None
    tokenizer_cls: type | None = None
    mode: str | None = None
    gen_config: dict[str, Any] = dataclasses.field(default_factory=dict)


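# Illustrative construction (a sketch; the model name is an example): the
# dataclass is frozen, but gen_config itself is a mutable dict, which is how
# construct_llm_config below attaches sampling parameters.
#
#   cfg = LMConfig(provider="openai", model="gpt-4o-2024-05-13", mode="chat")
#   cfg.gen_config["temperature"] = 0.0  # item assignment on the dict is allowed
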
def construct_llm_config(args: argparse.Namespace) -> LMConfig:
    llm_config = LMConfig(
        provider=args.provider, model=args.model, mode=args.mode
    )
    if args.provider in ["openai", "google", "api", "finetune"]:
        llm_config.gen_config["temperature"] = args.temperature
        llm_config.gen_config["top_p"] = args.top_p
        llm_config.gen_config["context_length"] = args.context_length
        llm_config.gen_config["max_tokens"] = args.max_tokens
        llm_config.gen_config["stop_token"] = args.stop_token
        llm_config.gen_config["max_obs_length"] = args.max_obs_length
        llm_config.gen_config["max_retry"] = args.max_retry
    elif args.provider == "huggingface":
        llm_config.gen_config["temperature"] = args.temperature
        llm_config.gen_config["top_p"] = args.top_p
        llm_config.gen_config["max_new_tokens"] = args.max_tokens
        llm_config.gen_config["stop_sequences"] = (
            [args.stop_token] if args.stop_token else None
        )
        llm_config.gen_config["max_obs_length"] = args.max_obs_length
        llm_config.gen_config["model_endpoint"] = args.model_endpoint
        llm_config.gen_config["max_retry"] = args.max_retry
    else:
        raise NotImplementedError(f"provider {args.provider} not implemented")
    return llm_config
286
VAB-WebArena-Lite/new/openai_utils.py
Normal file

@@ -0,0 +1,286 @@
"""Tools to generate from OpenAI prompts.
Adapted from https://github.com/zeno-ml/zeno-build/"""

import asyncio
import logging
import os
import random
import time
from typing import Any

import aiolimiter
import openai
from openai import AsyncOpenAI, OpenAI
from tqdm.asyncio import tqdm_asyncio

base_url = os.environ.get("OPENAI_API_URL")
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"], base_url=base_url)
aclient = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"], base_url=base_url)


def retry_with_exponential_backoff(  # type: ignore
    func,
    initial_delay: float = 1,
    exponential_base: float = 2,
    jitter: bool = True,
    max_retries: int = 3,
    errors: tuple[Any] = (
        openai.RateLimitError,
        openai.BadRequestError,
        openai.InternalServerError,
    ),
):
    """Retry a function with exponential backoff."""

    def wrapper(*args, **kwargs):  # type: ignore
        # Initialize variables
        num_retries = 0
        delay = initial_delay

        # Loop until a successful response or max_retries is hit or an exception is raised
        while True:
            try:
                return func(*args, **kwargs)

            # Retry on specified errors
            except errors as e:
                # Increment retries
                num_retries += 1

                # Check if max retries has been reached
                if num_retries > max_retries:
                    raise Exception(
                        f"Maximum number of retries ({max_retries}) exceeded."
                    )

                # Increment the delay
                delay *= exponential_base * (1 + jitter * random.random())

                # Sleep for the delay
                time.sleep(delay)

            # Raise exceptions for any errors not specified
            except Exception as e:
                raise e

    return wrapper


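# Illustrative note on the decorator above (a sketch of the default schedule):
# delay starts at 1s and each retry multiplies it by 2 * (1 + jitter * U[0, 1)),
# so successive sleeps fall roughly in [2, 4), [4, 16), and [8, 64) seconds
# before the max_retries=3 budget is exhausted and an exception is raised.
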
async def _throttled_openai_completion_acreate(
    engine: str,
    prompt: str,
    temperature: float,
    max_tokens: int,
    top_p: float,
    limiter: aiolimiter.AsyncLimiter,
) -> dict[str, Any]:
    async with limiter:
        for _ in range(3):
            try:
                # The v1 client takes `model` (not `engine`) and returns a
                # pydantic object; dump it to a dict so callers can index it
                # the same way as the error fallback below.
                response = await aclient.completions.create(
                    model=engine,
                    prompt=prompt,
                    temperature=temperature,
                    max_tokens=max_tokens,
                    top_p=top_p,
                )
                return response.model_dump()
            except openai.RateLimitError:
                logging.warning(
                    "OpenAI API rate limit exceeded. Sleeping for 10 seconds."
                )
                await asyncio.sleep(10)
            except openai.APIError as e:
                logging.warning(f"OpenAI API error: {e}")
                break
        return {"choices": [{"text": ""}]}


async def agenerate_from_openai_completion(
    prompts: list[str],
    engine: str,
    temperature: float,
    max_tokens: int,
    top_p: float,
    context_length: int,
    requests_per_minute: int = 300,
) -> list[str]:
    """Generate from OpenAI Completion API.

    Args:
        prompts: list of prompts
        temperature: Temperature to use.
        max_tokens: Maximum number of tokens to generate.
        top_p: Top p to use.
        context_length: Length of context to use.
        requests_per_minute: Number of requests per minute to allow.

    Returns:
        List of generated responses.
    """
    if "OPENAI_API_KEY" not in os.environ:
        raise ValueError(
            "OPENAI_API_KEY environment variable must be set when using OpenAI API."
        )

    limiter = aiolimiter.AsyncLimiter(requests_per_minute)
    async_responses = [
        _throttled_openai_completion_acreate(
            engine=engine,
            prompt=prompt,
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=top_p,
            limiter=limiter,
        )
        for prompt in prompts
    ]
    responses = await tqdm_asyncio.gather(*async_responses)
    return [x["choices"][0]["text"] for x in responses]


@retry_with_exponential_backoff
def generate_from_openai_completion(
    prompt: str,
    engine: str,
    temperature: float,
    max_tokens: int,
    top_p: float,
    context_length: int,
    stop_token: str | None = None,
) -> str:
    if "OPENAI_API_KEY" not in os.environ:
        raise ValueError(
            "OPENAI_API_KEY environment variable must be set when using OpenAI API."
        )

    response = client.completions.create(
        prompt=prompt,
        model=engine,  # the v1 client expects `model` rather than `engine`
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p,
        stop=[stop_token] if stop_token else None,
    )
    answer: str = response.choices[0].text
    return answer


async def _throttled_openai_chat_completion_acreate(
    model: str,
    messages: list[dict[str, str]],
    temperature: float,
    max_tokens: int,
    top_p: float,
    limiter: aiolimiter.AsyncLimiter,
) -> dict[str, Any]:
    async with limiter:
        for _ in range(3):
            try:
                # Dump the pydantic response object to a dict so the success
                # path and the error fallback below share the same shape.
                response = await aclient.chat.completions.create(
                    model=model,
                    messages=messages,
                    temperature=temperature,
                    max_tokens=max_tokens,
                    top_p=top_p,
                )
                return response.model_dump()
            except openai.RateLimitError:
                logging.warning(
                    "OpenAI API rate limit exceeded. Sleeping for 10 seconds."
                )
                await asyncio.sleep(10)
            except asyncio.exceptions.TimeoutError:
                logging.warning("OpenAI API timeout. Sleeping for 10 seconds.")
                await asyncio.sleep(10)
            except openai.APIError as e:
                logging.warning(f"OpenAI API error: {e}")
                break
        return {"choices": [{"message": {"content": ""}}]}


async def agenerate_from_openai_chat_completion(
    messages_list: list[list[dict[str, str]]],
    engine: str,
    temperature: float,
    max_tokens: int,
    top_p: float,
    context_length: int,
    requests_per_minute: int = 300,
) -> list[str]:
    """Generate from OpenAI Chat Completion API.

    Args:
        messages_list: list of message list
        temperature: Temperature to use.
        max_tokens: Maximum number of tokens to generate.
        top_p: Top p to use.
        context_length: Length of context to use.
        requests_per_minute: Number of requests per minute to allow.

    Returns:
        List of generated responses.
    """
    if "OPENAI_API_KEY" not in os.environ:
        raise ValueError(
            "OPENAI_API_KEY environment variable must be set when using OpenAI API."
        )

    limiter = aiolimiter.AsyncLimiter(requests_per_minute)
    async_responses = [
        _throttled_openai_chat_completion_acreate(
            model=engine,
            messages=message,
            temperature=temperature,
            max_tokens=max_tokens,
            top_p=top_p,
            limiter=limiter,
        )
        for message in messages_list
    ]
    responses = await tqdm_asyncio.gather(*async_responses)
    return [x["choices"][0]["message"]["content"] for x in responses]


@retry_with_exponential_backoff
def generate_from_openai_chat_completion(
    messages: list[dict[str, str]],
    model: str,
    temperature: float,
    max_tokens: int,
    top_p: float,
    context_length: int,
    stop_token: str | None = None,
) -> str:
    if "OPENAI_API_KEY" not in os.environ:
        raise ValueError(
            "OPENAI_API_KEY environment variable must be set when using OpenAI API."
        )
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p,
    )
    answer: str = response.choices[0].message.content
    return answer


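# Illustrative call (a sketch; the model name and prompt are examples):
#   answer = generate_from_openai_chat_completion(
#       messages=[{"role": "user", "content": "Say hi"}],
#       model="gpt-4o-2024-05-13",
#       temperature=0.0, max_tokens=32, top_p=1.0, context_length=0,
#   )
# On transient RateLimit/BadRequest/InternalServer errors the decorator
# retries with exponential backoff before giving up.
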
# debug only
@retry_with_exponential_backoff
def fake_generate_from_openai_chat_completion(
    messages: list[dict[str, str]],
    model: str,
    temperature: float,
    max_tokens: int,
    top_p: float,
    context_length: int,
    stop_token: str | None = None,
) -> str:
    if "OPENAI_API_KEY" not in os.environ:
        raise ValueError(
            "OPENAI_API_KEY environment variable must be set when using OpenAI API."
        )

    answer = "Let's think step-by-step. This page shows a list of links and buttons. There is a search box with the label 'Search query'. I will click on the search box to type the query. So the action I will perform is \"click [60]\"."
    return answer
501
VAB-WebArena-Lite/new/prompt_constructor.py
Normal file

@@ -0,0 +1,501 @@
import json
import re
from pathlib import Path
from typing import Any, TypedDict

from PIL import Image

from browser_env import Action, ActionParsingError, Trajectory
from browser_env.env_config import URL_MAPPINGS
from browser_env.utils import StateInfo, pil_to_b64, pil_to_vertex
from llms import lm_config
from llms.tokenizers import Tokenizer
from llms.utils import APIInput


class Instruction(TypedDict):
    """Instruction for constructing prompt"""

    intro: str
    examples: list[tuple[str, str]]
    template: str
    meta_data: dict[str, Any]


class PromptConstructor(object):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
instruction_path: str | Path,
|
||||||
|
lm_config: lm_config.LMConfig,
|
||||||
|
tokenizer: Tokenizer,
|
||||||
|
):
|
||||||
|
self.instruction_path = Path(instruction_path)
|
||||||
|
self.obs_modality = "text"
|
||||||
|
self.lm_config = lm_config
|
||||||
|
instruction = json.load(open(self.instruction_path))
|
||||||
|
instruction["examples"] = [tuple(e) for e in instruction["examples"]]
|
||||||
|
self.instruction: Instruction = instruction
|
||||||
|
self.tokenizer = tokenizer
|
||||||
|
|
||||||
|
def get_lm_api_input(
|
||||||
|
self, intro: str, examples: list[tuple[str, str]], current: str
|
||||||
|
) -> APIInput:
|
||||||
|
|
||||||
|
"""Return the require format for an API"""
|
||||||
|
message: list[dict[str, str]] | str
|
||||||
|
if "openai" in self.lm_config.provider:
|
||||||
|
if self.lm_config.mode == "chat":
|
||||||
|
message = [{"role": "system", "content": intro}]
|
||||||
|
for (x, y) in examples:
|
||||||
|
message.append(
|
||||||
|
{
|
||||||
|
"role": "system",
|
||||||
|
"name": "example_user",
|
||||||
|
"content": x,
|
||||||
|
}
|
||||||
|
)
|
||||||
|
message.append(
|
||||||
|
{
|
||||||
|
"role": "system",
|
||||||
|
"name": "example_assistant",
|
||||||
|
"content": y,
|
||||||
|
}
|
||||||
|
)
|
||||||
|
message.append({"role": "user", "content": current})
|
||||||
|
return message
|
||||||
|
elif self.lm_config.mode == "completion":
|
||||||
|
message = f"{intro}\n\n"
|
||||||
|
message += "Here are a few examples:\n"
|
||||||
|
for example in examples:
|
||||||
|
message += f"Observation\n:{example[0]}\n\n"
|
||||||
|
message += f"Action: {example[1]}\n\n"
|
||||||
|
message += "Now make prediction given the observation\n\n"
|
||||||
|
message += f"Observation\n:{current}\n\n"
|
||||||
|
message += "Action:"
|
||||||
|
return message
|
||||||
|
else:
|
||||||
|
raise ValueError(
|
||||||
|
f"OpenAI models do not support mode {self.lm_config.mode}"
|
||||||
|
)
|
||||||
|
elif "huggingface" in self.lm_config.provider:
|
||||||
|
# https://huggingface.co/blog/llama2#how-to-prompt-llama-2
|
||||||
|
# https://github.com/facebookresearch/llama/blob/main/llama/generation.py#L320
|
||||||
|
if "Llama-2" in self.lm_config.model:
|
||||||
|
if self.lm_config.mode == "chat":
|
||||||
|
B_INST, E_INST = "[INST]", "[/INST]"
|
||||||
|
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
|
||||||
|
BOS, EOS = "<s>", "</s>"
|
||||||
|
# adding the system message to be the starting of the first example
|
||||||
|
examples = [
|
||||||
|
(
|
||||||
|
B_SYS + intro + E_SYS + examples[0][0],
|
||||||
|
examples[0][1],
|
||||||
|
)
|
||||||
|
] + examples[1:]
|
||||||
|
message = "".join(
|
||||||
|
[
|
||||||
|
f"{BOS}{B_INST} {x.strip()} {E_INST} {y.strip()} {EOS}"
|
||||||
|
for (x, y) in examples
|
||||||
|
]
|
||||||
|
)
|
||||||
|
# add the current observation
|
||||||
|
message += f"{BOS}{B_INST} {current.strip()} {E_INST} {self.instruction['meta_data'].get('force_prefix', '')}"
|
||||||
|
|
||||||
|
return message
|
||||||
|
else:
|
||||||
|
raise ValueError("Only chat mode is supported for Llama-2")
|
||||||
|
else:
|
||||||
|
raise ValueError(
|
||||||
|
f"Huggingface models do not support model_tag {self.lm_config.gen_config['model_tag']}"
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
raise NotImplementedError(
|
||||||
|
f"Provider {self.lm_config.provider} not implemented"
|
||||||
|
)
|
||||||
|
|
||||||
|
def construct(
|
||||||
|
self,
|
||||||
|
trajectory: Trajectory,
|
||||||
|
intent: str,
|
||||||
|
meta_data: dict[str, Any] = {},
|
||||||
|
) -> APIInput:
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def map_url_to_real(self, url: str) -> str:
|
||||||
|
"""Map the urls to their real world counterparts"""
|
||||||
|
for i, j in URL_MAPPINGS.items():
|
||||||
|
if i in url:
|
||||||
|
url = url.replace(i, j)
|
||||||
|
return url
|
||||||
|
|
||||||
|
def map_url_to_local(self, url: str) -> str:
|
||||||
|
"""Map the urls to their local counterparts"""
|
||||||
|
for i, j in URL_MAPPINGS.items():
|
||||||
|
if j in url:
|
||||||
|
url = url.replace(j, i)
|
||||||
|
# https
|
||||||
|
if j.replace("http", "https") in url:
|
||||||
|
url = url.replace(j.replace("http", "https"), i)
|
||||||
|
return url
|
||||||
|
|
||||||
|
def _extract_action(self, response: str) -> str:
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def extract_action(self, response: str) -> str:
|
||||||
|
response = self._extract_action(response)
|
||||||
|
response = self.map_url_to_local(response)
|
||||||
|
return response
|
||||||
|
|
||||||
|
|
||||||
|
class DirectPromptConstructor(PromptConstructor):
|
||||||
|
"""The agent will direct predict the action"""
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
instruction_path: str | Path,
|
||||||
|
lm_config: lm_config.LMConfig,
|
||||||
|
tokenizer: Tokenizer,
|
||||||
|
):
|
||||||
|
super().__init__(instruction_path, lm_config, tokenizer)
|
||||||
|
|
||||||
|
def construct(
|
||||||
|
self,
|
||||||
|
trajectory: Trajectory,
|
||||||
|
intent: str,
|
||||||
|
meta_data: dict[str, Any] = {},
|
||||||
|
) -> APIInput:
|
||||||
|
"""Construct prompt given the trajectory"""
|
||||||
|
intro = self.instruction["intro"]
|
||||||
|
examples = self.instruction["examples"]
|
||||||
|
template = self.instruction["template"]
|
||||||
|
keywords = self.instruction["meta_data"]["keywords"]
|
||||||
|
state_info: StateInfo = trajectory[-1] # type: ignore[assignment]
|
||||||
|
|
||||||
|
obs = state_info["observation"][self.obs_modality]
|
||||||
|
max_obs_length = self.lm_config.gen_config["max_obs_length"]
|
||||||
|
if max_obs_length:
|
||||||
|
if self.lm_config.provider == "google":
|
||||||
|
print("NOTE: This is a Gemini model, so we use characters instead of tokens for max_obs_length.")
|
||||||
|
obs = obs[:max_obs_length]
|
||||||
|
else:
|
||||||
|
obs = self.tokenizer.decode(self.tokenizer.encode(obs)[:max_obs_length]) # type: ignore[arg-type]
|
||||||
|
|
||||||
|
page = state_info["info"]["page"]
|
||||||
|
url = page.url
|
||||||
|
previous_action_str = meta_data["action_history"][-1]
|
||||||
|
|
||||||
|
# input x
|
||||||
|
current = template.format(
|
||||||
|
objective=intent,
|
||||||
|
url=self.map_url_to_real(url),
|
||||||
|
observation=obs,
|
||||||
|
previous_action=previous_action_str,
|
||||||
|
)
|
||||||
|
|
||||||
|
# make sure all keywords are replaced
|
||||||
|
assert all([f"{{k}}" not in current for k in keywords])
|
||||||
|
prompt = self.get_lm_api_input(intro, examples, current)
|
||||||
|
return prompt
|
||||||
|
|
||||||
|
def _extract_action(self, response: str) -> str:
|
||||||
|
action_splitter = self.instruction["meta_data"]["action_splitter"]
|
||||||
|
pattern = rf"{action_splitter}((.|\n)*?){action_splitter}"
|
||||||
|
match = re.search(pattern, response)
|
||||||
|
if match:
|
||||||
|
return match.group(1).strip()
|
||||||
|
else:
|
||||||
|
raise ActionParsingError(
|
||||||
|
f"Cannot parse action from response {response}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class CoTPromptConstructor(PromptConstructor):
|
||||||
|
"""The agent will perform step-by-step reasoning before the answer"""
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
instruction_path: str | Path,
|
||||||
|
lm_config: lm_config.LMConfig,
|
||||||
|
tokenizer: Tokenizer,
|
||||||
|
):
|
||||||
|
super().__init__(instruction_path, lm_config, tokenizer)
|
||||||
|
self.answer_phrase = self.instruction["meta_data"]["answer_phrase"]
|
||||||
|
|
||||||
|
def construct(
|
||||||
|
self,
|
||||||
|
trajectory: Trajectory,
|
||||||
|
intent: str,
|
||||||
|
meta_data: dict[str, Any] = {},
|
||||||
|
) -> APIInput:
|
||||||
|
intro = self.instruction["intro"]
|
||||||
|
examples = self.instruction["examples"]
|
||||||
|
template = self.instruction["template"]
|
||||||
|
keywords = self.instruction["meta_data"]["keywords"]
|
||||||
|
state_info: StateInfo = trajectory[-1] # type: ignore[assignment]
|
||||||
|
|
||||||
|
obs = state_info["observation"][self.obs_modality]
|
||||||
|
max_obs_length = self.lm_config.gen_config["max_obs_length"]
|
||||||
|
if max_obs_length:
|
||||||
|
if self.lm_config.provider == "google":
|
||||||
|
print("NOTE: This is a Gemini model, so we use characters instead of tokens for max_obs_length.")
|
||||||
|
obs = obs[:max_obs_length]
|
||||||
|
else:
|
||||||
|
obs = self.tokenizer.decode(self.tokenizer.encode(obs)[:max_obs_length]) # type: ignore[arg-type]
|
||||||
|
|
||||||
|
page = state_info["info"]["page"]
|
||||||
|
url = page.url
|
||||||
|
previous_action_str = meta_data["action_history"][-1]
|
||||||
|
current = template.format(
|
||||||
|
objective=intent,
|
||||||
|
url=self.map_url_to_real(url),
|
||||||
|
observation=obs,
|
||||||
|
previous_action=previous_action_str,
|
||||||
|
)
|
||||||
|
|
||||||
|
assert all([f"{{k}}" not in current for k in keywords])
|
||||||
|
|
||||||
|
prompt = self.get_lm_api_input(intro, examples, current)
|
||||||
|
return prompt
|
||||||
|
|
||||||
|
def _extract_action(self, response: str) -> str:
|
||||||
|
# find the first occurence of action
|
||||||
|
action_splitter = self.instruction["meta_data"]["action_splitter"]
|
||||||
|
pattern = rf"{action_splitter}((.|\n)*?){action_splitter}"
|
||||||
|
match = re.search(pattern, response)
|
||||||
|
if match:
|
||||||
|
return match.group(1).strip()
|
||||||
|
else:
|
||||||
|
raise ActionParsingError(
|
||||||
|
f'Cannot find the answer phrase "{self.answer_phrase}" in "{response}"'
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class MultimodalCoTPromptConstructor(CoTPromptConstructor):
|
||||||
|
"""The agent will perform step-by-step reasoning before the answer"""
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
instruction_path: str | Path,
|
||||||
|
lm_config: lm_config.LMConfig,
|
||||||
|
tokenizer: Tokenizer,
|
||||||
|
):
|
||||||
|
super().__init__(instruction_path, lm_config, tokenizer)
|
||||||
|
self.answer_phrase = self.instruction["meta_data"]["answer_phrase"]
|
||||||
|
|
||||||
|
def construct(
|
||||||
|
self,
|
||||||
|
trajectory: Trajectory,
|
||||||
|
intent: str,
|
||||||
|
page_screenshot_img: Image.Image,
|
||||||
|
images: list[Image.Image],
|
||||||
|
meta_data: dict[str, Any] = {},
|
||||||
|
) -> APIInput:
|
||||||
|
intro = self.instruction["intro"]
|
||||||
|
examples = self.instruction["examples"]
|
||||||
|
template = self.instruction["template"]
|
||||||
|
keywords = self.instruction["meta_data"]["keywords"]
|
||||||
|
state_info: StateInfo = trajectory[-1] # type: ignore[assignment]
|
||||||
|
|
||||||
|
obs = state_info["observation"][self.obs_modality]
|
||||||
|
max_obs_length = self.lm_config.gen_config["max_obs_length"]
|
||||||
|
if max_obs_length:
|
||||||
|
if self.lm_config.provider in ["google", "api", "finetune"]:
|
||||||
|
print("NOTE: This is a Gemini / API model, so we use characters instead of tokens for max_obs_length.")
|
||||||
|
obs = obs[:max_obs_length]
|
||||||
|
else:
|
||||||
|
obs = self.tokenizer.decode(self.tokenizer.encode(obs)[:max_obs_length]) # type: ignore[arg-type]
|
||||||
|
|
||||||
|
page = state_info["info"]["page"]
|
||||||
|
url = page.url
|
||||||
|
previous_action_str = meta_data["action_history"][-1]
|
||||||
|
current = template.format(
|
||||||
|
objective=intent,
|
||||||
|
url=self.map_url_to_real(url),
|
||||||
|
observation=obs,
|
||||||
|
previous_action=previous_action_str,
|
||||||
|
)
|
||||||
|
|
||||||
|
assert all([f"{{k}}" not in current for k in keywords])
|
||||||
|
|
||||||
|
# TODO: for your finetune model, you can config you prompt here
|
||||||
|
if self.lm_config.provider == "finetune":
|
||||||
|
current = ""
|
||||||
|
traj = trajectory[1::2]
|
||||||
|
for rnd, tra in enumerate(traj):
|
||||||
|
tar = '** screenshot **' if rnd > 0 else intent
|
||||||
|
raw = tra["raw_prediction"]
|
||||||
|
current += f"Round {rnd}\n\n<|user|>\n\n** node_info **\n\n{tar}\n\n<|assistant|>\n{raw}\n\n"""
|
||||||
|
|
||||||
|
current += f"Round {len(traj)}\n\n<|user|>\n\n{obs}\n\n{'** screenshot **' if len(traj) > 0 else intent}\n"
|
||||||
|
|
||||||
|
prompt = self.get_lm_api_input(
|
||||||
|
intro, examples, current, page_screenshot_img, images
|
||||||
|
)
|
||||||
|
return prompt
|
||||||
|
|
||||||
|
def get_lm_api_input(
|
||||||
|
self,
|
||||||
|
intro: str,
|
||||||
|
examples: list[tuple[str, str, str]],
|
||||||
|
current: str,
|
||||||
|
page_screenshot_img: Image.Image,
|
||||||
|
images: list[Image.Image],
|
||||||
|
) -> APIInput:
|
||||||
|
"""Return the require format for an API"""
|
||||||
|
message: list[dict[str, str]] | str | list[str | Image.Image]
|
||||||
|
if "openai" in self.lm_config.provider:
|
||||||
|
if self.lm_config.mode == "chat":
|
||||||
|
message = [
|
||||||
|
{
|
||||||
|
"role": "system",
|
||||||
|
"content": [{"type": "text", "text": intro}],
|
||||||
|
}
|
||||||
|
]
|
||||||
|
for (x, y, z) in examples:
|
||||||
|
example_img = Image.open(z)
|
||||||
|
message.append(
|
||||||
|
{
|
||||||
|
"role": "system",
|
||||||
|
"name": "example_user",
|
||||||
|
"content": [
|
||||||
|
{"type": "text", "text": x},
|
||||||
|
{
|
||||||
|
"type": "text",
|
||||||
|
"text": "IMAGES: (1) current page screenshot",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"type": "image_url",
|
||||||
|
"image_url": {
|
||||||
|
"url": pil_to_b64(example_img)
|
||||||
|
},
|
||||||
|
},
|
||||||
|
],
|
||||||
|
}
|
||||||
|
)
|
||||||
|
message.append(
|
||||||
|
{
|
||||||
|
"role": "system",
|
||||||
|
"name": "example_assistant",
|
||||||
|
"content": [{"type": "text", "text": y}],
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# Encode images and page_screenshot_img as base64 strings.
|
||||||
|
current_prompt = current
|
||||||
|
content = [
|
||||||
|
{
|
||||||
|
"type": "text",
|
||||||
|
"text": "IMAGES: (1) current page screenshot",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"type": "image_url",
|
||||||
|
"image_url": {"url": pil_to_b64(page_screenshot_img)},
|
||||||
|
},
|
||||||
|
]
|
||||||
|
for image_i, image in enumerate(images):
|
||||||
|
content.extend(
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"type": "text",
|
||||||
|
"text": f"({image_i+2}) input image {image_i+1}",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"type": "image_url",
|
||||||
|
"image_url": {"url": pil_to_b64(image)},
|
||||||
|
},
|
||||||
|
]
|
||||||
|
)
|
||||||
|
content = [{"type": "text", "text": current_prompt}] + content
|
||||||
|
|
||||||
|
message.append({"role": "user", "content": content})
|
||||||
|
return message
|
||||||
|
else:
|
||||||
|
raise ValueError(
|
||||||
|
f"GPT-4V models do not support mode {self.lm_config.mode}"
|
||||||
|
)
|
||||||
|
elif "google" in self.lm_config.provider:
|
||||||
|
if self.lm_config.mode == "completion":
|
||||||
|
message = [
|
||||||
|
intro,
|
||||||
|
"Here are a few examples:",
|
||||||
|
]
|
||||||
|
for (x, y, z) in examples:
|
||||||
|
example_img = Image.open(z)
|
||||||
|
message.append(f"Observation\n:{x}\n")
|
||||||
|
message.extend(
|
||||||
|
[
|
||||||
|
"IMAGES:",
|
||||||
|
"(1) current page screenshot:",
|
||||||
|
pil_to_vertex(example_img),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
message.append(f"Action: {y}")
|
||||||
|
message.append("Now make prediction given the observation")
|
||||||
|
message.append(f"Observation\n:{current}\n")
|
||||||
|
message.extend(
|
||||||
|
[
|
||||||
|
"IMAGES:",
|
||||||
|
"(1) current page screenshot:",
|
||||||
|
pil_to_vertex(page_screenshot_img),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
for image_i, image in enumerate(images):
|
||||||
|
message.extend(
|
||||||
|
[
|
||||||
|
f"({image_i+2}) input image {image_i+1}",
|
||||||
|
pil_to_vertex(image),
|
||||||
|
]
|
||||||
|
)
|
||||||
|
message.append("Action:")
|
||||||
|
return message
|
||||||
|
else:
|
||||||
|
raise ValueError(
|
||||||
|
f"Gemini models do not support mode {self.lm_config.mode}"
|
||||||
|
)
|
||||||
|
elif self.lm_config.provider in ["api", "finetune"]:
|
||||||
|
message = [
|
||||||
|
{
|
||||||
|
"role": "system",
|
||||||
|
"content": intro,
|
||||||
|
}
|
||||||
|
]
|
||||||
|
|
||||||
|
# we keep few-shot here, but remove the image corresponding to the current page.
|
||||||
|
for (x, y, _) in examples:
|
||||||
|
message.append({
|
||||||
|
"role": "user",
|
||||||
|
"content": [
|
||||||
|
{ "type": "text", "text": x },
|
||||||
|
{ "type": "text", "text": "IMAGES: (1) current page screenshot\n\n** Screenshot **\n" },
|
||||||
|
],
|
||||||
|
})
|
||||||
|
message.append({
|
||||||
|
"role": "assistant",
|
||||||
|
"content": y,
|
||||||
|
})
|
||||||
|
|
||||||
|
|
||||||
|
# TODO: Encode images and page_screenshot_img as base64 strings, we only keep screenshot of current page.
|
||||||
|
current_prompt = current
|
||||||
|
content = []
|
||||||
|
|
||||||
|
if self.lm_config.provider != "finetune":
|
||||||
|
content.append({
|
||||||
|
"type": "text",
|
||||||
|
"text": "IMAGES: (1) current page screenshot",
|
||||||
|
})
|
||||||
|
|
||||||
|
if "text" not in self.lm_config.model:
|
||||||
|
content.append({
|
||||||
|
"type": "image_url",
|
||||||
|
"image_url": {"url": pil_to_b64(page_screenshot_img)},
|
||||||
|
})
|
||||||
|
|
||||||
|
content = [{"type": "text", "text": current_prompt}] + content
|
||||||
|
|
||||||
|
message.append({"role": "user", "content": content})
|
||||||
|
return message
|
||||||
|
|
||||||
|
else:
|
||||||
|
raise NotImplementedError(
|
||||||
|
f"Provider {self.lm_config.provider} not implemented"
|
||||||
|
)
|
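A usage sketch tying the pieces together. The `LMConfig` construction below assumes the dataclass fields from the accompanying `lm_config.py` (`provider`, `model`, `mode`, plus a `gen_config` dict); the instruction JSON path is the example given in `wa_parallel_run.sh`:

```python
# Sketch: building a chain-of-thought prompt constructor (LMConfig field
# names are assumed from the accompanying lm_config.py).
from llms import lm_config
from llms.tokenizers import Tokenizer
from agent.prompts.prompt_constructor import CoTPromptConstructor

config = lm_config.LMConfig(provider="openai", model="gpt-3.5-turbo", mode="chat")
config.gen_config["max_obs_length"] = 3840

tokenizer = Tokenizer("openai", "gpt-3.5-turbo")
pc = CoTPromptConstructor(
    "agent/prompts/jsons/p_cot_id_actree_2s.json",  # example path from wa_parallel_run.sh
    config,
    tokenizer,
)
# Given a trajectory whose last element is a StateInfo dict, construct()
# returns the chat-format message list ready for call_llm:
# prompt = pc.construct(trajectory, intent="Find the cheapest laptop",
#                       meta_data={"action_history": ["None"]})
```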
562
VAB-WebArena-Lite/new/run.py
Normal file
@@ -0,0 +1,562 @@
"""Script to run end-to-end evaluation on the benchmark.

Modified from https://github.com/web-arena-x/webarena/blob/main/run.py.
"""
import argparse
import glob
import json
import logging
import os
import random
import subprocess
import tempfile
import time
from pathlib import Path
from typing import List

import openai
import requests
import torch
from PIL import Image

from agent import (
    PromptAgent,
    construct_agent,
)
from agent.prompts import *
from browser_env import (
    Action,
    ActionTypes,
    ScriptBrowserEnv,
    StateInfo,
    Trajectory,
    create_stop_action,
)
from browser_env.actions import is_equivalent
from browser_env.auto_login import get_site_comb_from_filepath
from browser_env.helper_functions import (
    RenderHelper,
    get_action_description,
)
from evaluation_harness import evaluator_router, image_utils

DATASET = os.environ["DATASET"]

LOG_FOLDER = "log_files"
Path(LOG_FOLDER).mkdir(parents=True, exist_ok=True)
LOG_FILE_NAME = f"{LOG_FOLDER}/log_{time.strftime('%Y%m%d%H%M%S', time.localtime())}_{random.randint(0, 10000)}.log"

logger = logging.getLogger("logger")
logger.setLevel(logging.INFO)

console_handler = logging.StreamHandler()
console_handler.setLevel(logging.DEBUG)
logger.addHandler(console_handler)

file_handler = logging.FileHandler(LOG_FILE_NAME)
file_handler.setLevel(logging.DEBUG)
logger.addHandler(file_handler)

# Set the log format
formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
console_handler.setFormatter(formatter)
file_handler.setFormatter(formatter)


def config() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run end-to-end evaluation on the benchmark"
    )
    parser.add_argument(
        "--render", action="store_true", help="Render the browser"
    )

    parser.add_argument(
        "--slow_mo",
        type=int,
        default=0,
        help="Slow down the browser by the specified amount",
    )
    parser.add_argument(
        "--action_set_tag", default="id_accessibility_tree", help="Action type"
    )
    parser.add_argument(
        "--observation_type",
        choices=[
            "accessibility_tree",
            "accessibility_tree_with_captioner",
            "html",
            "image",
            "image_som",
        ],
        default="accessibility_tree",
        help="Observation type",
    )
    parser.add_argument(
        "--current_viewport_only",
        action="store_true",
        help="Only use the current viewport for the observation",
    )
    parser.add_argument("--viewport_width", type=int, default=1280)
    parser.add_argument("--viewport_height", type=int, default=2048)
    parser.add_argument("--save_trace_enabled", action="store_true")
    parser.add_argument("--sleep_after_execution", type=float, default=0.0)

    parser.add_argument("--max_steps", type=int, default=30)

    # agent config
    parser.add_argument("--agent_type", type=str, default="prompt")
    parser.add_argument(
        "--instruction_path",
        type=str,
        default="agents/prompts/state_action_agent.json",
    )
    parser.add_argument(
        "--parsing_failure_th",
        help="When consecutive parsing failures exceed this threshold, the agent will terminate early.",
        type=int,
        default=3,
    )
    parser.add_argument(
        "--repeating_action_failure_th",
        help="When consecutive repeated actions exceed this threshold, the agent will terminate early.",
        type=int,
        default=5,
    )

    parser.add_argument("--test_config_base_dir", type=str)

    parser.add_argument(
        "--eval_captioning_model_device",
        type=str,
        default="cpu",
        choices=["cpu", "cuda"],
        help="Device to run eval captioning model on. By default, runs it on CPU.",
    )
    parser.add_argument(
        "--eval_captioning_model",
        type=str,
        default="Salesforce/blip2-flan-t5-xl",
        choices=["Salesforce/blip2-flan-t5-xl"],
        help="Captioning backbone for VQA-type evals.",
    )
    parser.add_argument(
        "--captioning_model",
        type=str,
        default="Salesforce/blip2-flan-t5-xl",
        choices=["Salesforce/blip2-flan-t5-xl", "llava-hf/llava-1.5-7b-hf"],
        help="Captioning backbone for accessibility tree alt text.",
    )

    # lm config
    parser.add_argument("--provider", type=str, default="openai")
    parser.add_argument("--model", type=str, default="gpt-3.5-turbo-0613")
    parser.add_argument("--mode", type=str, default="chat")
    parser.add_argument("--temperature", type=float, default=1.0)
    parser.add_argument("--top_p", type=float, default=0.9)
    parser.add_argument("--context_length", type=int, default=0)
    parser.add_argument("--max_tokens", type=int, default=384)
    parser.add_argument("--stop_token", type=str, default=None)
    parser.add_argument(
        "--max_retry",
        type=int,
        help="max retry times to perform generations when parsing fails",
        default=1,
    )
    parser.add_argument(
        "--max_obs_length",
        type=int,
        help="when not zero, will truncate the observation to this length before feeding to the model",
        default=3840,
    )

    # example config
    parser.add_argument("--test_start_idx", type=int, default=0)
    parser.add_argument("--test_end_idx", type=int, default=910)

    # logging related
    parser.add_argument("--result_dir", type=str, default="")
    args = parser.parse_args()

    # check whether the action space is compatible with the observation space
    if (
        args.action_set_tag == "id_accessibility_tree"
        and args.observation_type
        not in [
            "accessibility_tree",
            "accessibility_tree_with_captioner",
            "image_som",
        ]
    ):
        raise ValueError(
            f"Action type {args.action_set_tag} is incompatible with the observation type {args.observation_type}"
        )

    return args


def early_stop(
    trajectory: Trajectory, max_steps: int, thresholds: dict[str, int]
) -> tuple[bool, str]:
    """Check whether we need to stop early"""

    # reach the max step
    num_steps = (len(trajectory) - 1) / 2
    if num_steps >= max_steps:
        return True, f"Reach max steps {max_steps}"

    last_k_actions: list[Action]
    action_seq: list[Action]

    # Case: parsing failure for k times
    k = thresholds["parsing_failure"]
    last_k_actions = trajectory[1::2][-k:]  # type: ignore[assignment]
    if len(last_k_actions) >= k:
        if all(
            [
                action["action_type"] == ActionTypes.NONE
                for action in last_k_actions
            ]
        ):
            return True, f"Failed to parse actions for {k} times"

    # Case: same action for k times
    k = thresholds["repeating_action"]
    last_k_actions = trajectory[1::2][-k:]  # type: ignore[assignment]
    action_seq = trajectory[1::2]  # type: ignore[assignment]

    if len(action_seq) == 0:
        return False, ""

    last_action: Action = action_seq[-1]

    if last_action["action_type"] != ActionTypes.TYPE:
        if len(last_k_actions) >= k:
            if all(
                [
                    is_equivalent(action, last_action)
                    for action in last_k_actions
                ]
            ):
                return True, f"Same action for {k} times"

    else:
        # check the action sequence
        if (
            sum([is_equivalent(action, last_action) for action in action_seq])
            >= k
        ):
            return True, f"Same typing action for {k} times"

    return False, ""


def update_action_history(path: str, task_id: int, actions: List[str], score: float = -0.1):
    obj = {
        "task_id": task_id,
        "score": score,
        "actions": actions
    }
    json.dump(obj, open(path, "w"), indent=4)


def test(
    args: argparse.Namespace,
    config_file_list: list[str]
) -> None:
    scores = []
    max_steps = args.max_steps

    early_stop_thresholds = {
        "parsing_failure": args.parsing_failure_th,
        "repeating_action": args.repeating_action_failure_th,
    }

    if args.observation_type in [
        "accessibility_tree_with_captioner",
        # "image_som",
    ]:
        device = torch.device("cuda") if torch.cuda.is_available() else "cpu"
        dtype = torch.float16 if torch.cuda.is_available() else torch.float32
        caption_image_fn = image_utils.get_captioning_fn(
            device, dtype, args.captioning_model
        )
    else:
        caption_image_fn = None

    # Load a (possibly different) captioning model for running VQA evals.
    if DATASET == 'visualwebarena':
        if (
            caption_image_fn
            and args.eval_captioning_model == args.captioning_model
        ):
            eval_caption_image_fn = caption_image_fn
        else:
            eval_caption_image_fn = image_utils.get_captioning_fn(
                args.eval_captioning_model_device,
                torch.float16
                if (
                    torch.cuda.is_available()
                    and args.eval_captioning_model_device == "cuda"
                )
                else torch.float32,
                args.eval_captioning_model,
            )
    else:
        caption_image_fn = None
        eval_caption_image_fn = None

    agent = construct_agent(
        args,
        captioning_fn=caption_image_fn
        if args.observation_type == "accessibility_tree_with_captioner"
        else None,
    )  # NOTE: captioning_fn here is used for captioning input images.

    env = ScriptBrowserEnv(
        headless=not args.render,
        slow_mo=args.slow_mo,
        observation_type=args.observation_type,
        current_viewport_only=args.current_viewport_only,
        viewport_size={
            "width": args.viewport_width,
            "height": args.viewport_height,
        },
        save_trace_enabled=args.save_trace_enabled,
        sleep_after_execution=args.sleep_after_execution,
        # NOTE: captioning_fn here is used for LLM + captioning baselines.
        # This can be different from the captioning model used for evals.
        captioning_fn=caption_image_fn,
    )

    for config_file in config_file_list:
        try:
            render_helper = RenderHelper(
                config_file, args.result_dir, args.action_set_tag
            )

            # Load task.
            with open(config_file) as f:
                _c = json.load(f)
                intent = _c["intent"]
                task_id = _c["task_id"]
                image_paths = _c.get("image", None)
                images = []

                # automatically login
                if _c["storage_state"]:
                    cookie_file_name = os.path.basename(_c["storage_state"])
                    comb = get_site_comb_from_filepath(cookie_file_name)
                    temp_dir = tempfile.mkdtemp()
                    # subprocess to renew the cookie
                    subprocess.run(
                        [
                            "python",
                            "browser_env/auto_login.py",
                            "--auth_folder",
                            temp_dir,
                            "--site_list",
                            *comb,
                        ]
                    )
                    _c["storage_state"] = f"{temp_dir}/{cookie_file_name}"
                    assert os.path.exists(_c["storage_state"])
                    # update the config file
                    config_file = f"{temp_dir}/{os.path.basename(config_file)}"
                    with open(config_file, "w") as f:
                        json.dump(_c, f)

                # Load input images for the task, if any.
                if image_paths is not None:
                    if isinstance(image_paths, str):
                        image_paths = [image_paths]
                    for image_path in image_paths:
                        # Load image either from the web or from a local path.
                        if image_path.startswith("http"):
                            input_image = Image.open(requests.get(image_path, stream=True).raw)
                        else:
                            input_image = Image.open(image_path)

                        images.append(input_image)

            logger.info(f"[Config file]: {config_file}")
            logger.info(f"[Intent]: {intent}")

            agent.reset(config_file)
            trajectory: Trajectory = []
            obs, info = env.reset(options={"config_file": config_file})
            state_info: StateInfo = {"observation": obs, "info": info}
            trajectory.append(state_info)

            meta_data = {"action_history": ["None"]}
            out_path = os.path.join(args.result_dir, "actions", f"{task_id}.json")
            actions = []

            while True:
                update_action_history(out_path, task_id, actions=actions)

                early_stop_flag, stop_info = early_stop(
                    trajectory, max_steps, early_stop_thresholds
                )

                if early_stop_flag:
                    action = create_stop_action(f"Early stop: {stop_info}")
                else:
                    try:
                        action = agent.next_action(
                            trajectory,
                            intent,
                            images=images,
                            meta_data=meta_data,
                        )
                    except ValueError as e:
                        # get the error message
                        action = create_stop_action(f"ERROR: {str(e)}")

                trajectory.append(action)

                action_str = get_action_description(
                    action,
                    state_info["info"]["observation_metadata"],
                    action_set_tag=args.action_set_tag,
                    prompt_constructor=agent.prompt_constructor
                    if isinstance(agent, PromptAgent)
                    else None,
                )
                render_helper.render(
                    action, state_info, meta_data, args.render_screenshot
                )
                meta_data["action_history"].append(action_str)
                actions.append(action_str)
                print(action_str)

                if action["action_type"] == ActionTypes.STOP:
                    break

                obs, _, terminated, _, info = env.step(action)
                state_info = {"observation": obs, "info": info}
                trajectory.append(state_info)

                if terminated:
                    # add an action placeholder
                    trajectory.append(create_stop_action(""))
                    break

            # NOTE: eval_caption_image_fn is used for running eval_vqa functions.
            evaluator = evaluator_router(
                config_file, captioning_fn=eval_caption_image_fn
            )
            score = evaluator(
                trajectory=trajectory,
                config_file=config_file,
                page=env.page
            )

            update_action_history(out_path, task_id, actions=actions, score=score)
            scores.append(score)

            if score == 1:
                logger.info(f"[Result] (PASS) {config_file}")
            else:
                logger.info(f"[Result] (FAIL) {config_file}")

            if args.save_trace_enabled:
                env.save_trace(
                    Path(args.result_dir) / "traces" / f"{task_id}.zip"
                )
        except openai.OpenAIError as e:
            logger.info(f"[OpenAI Error] {repr(e)}")
        except Exception as e:
            logger.info(f"[Unhandled Error] {repr(e)}")
            import traceback

            # write to error file
            with open(Path(args.result_dir) / "error.txt", "a") as f:
                f.write(f"[Config file]: {config_file}\n")
                f.write(f"[Unhandled Error] {repr(e)}\n")
                f.write(traceback.format_exc())  # write stack trace to file

        render_helper.close()

    env.close()
    if len(scores):
        logger.info(f"Average score: {sum(scores) / len(scores)}")


def prepare(args: argparse.Namespace) -> None:
    # convert prompt python files to json
    from agent.prompts import to_json

    to_json.run()

    # prepare result dir
    result_dir = args.result_dir
    if not result_dir:
        result_dir = (
            f"cache/results_{time.strftime('%Y%m%d%H%M%S', time.localtime())}"
        )
    if not Path(result_dir).exists():
        Path(result_dir).mkdir(parents=True, exist_ok=True)
        args.result_dir = result_dir
        logger.info(f"Create result dir: {result_dir}")

    if not (Path(result_dir) / "traces").exists():
        (Path(result_dir) / "traces").mkdir(parents=True)

    os.makedirs(os.path.join(result_dir, "actions"), exist_ok=True)

    # log the log file
    with open(os.path.join(result_dir, "log_files.txt"), "a+") as f:
        f.write(f"{LOG_FILE_NAME}\n")


def get_unfinished(config_files: list[str], result_dir: str) -> list[str]:
    result_files = glob.glob(f"{result_dir}/*.html")
    task_ids = [
        os.path.basename(f).split(".")[0].split("_")[1] for f in result_files
    ]
    unfinished_configs = []
    for config_file in config_files:
        task_id = os.path.basename(config_file).split(".")[0]
        try:
            with open(f"{result_dir}/actions/{task_id}.json", "r") as f:
                jd = json.load(f)
        except Exception:
            jd = {}
        if task_id not in task_ids or jd.get('score', -1) < 0:
            unfinished_configs.append(config_file)
    return unfinished_configs


def dump_config(args: argparse.Namespace) -> None:
    config_file = Path(args.result_dir) / "config.json"
    if not config_file.exists():
        with open(config_file, "w") as f:
            json.dump(vars(args), f, indent=4)
            logger.info(f"Dump config to {config_file}")


if __name__ == "__main__":
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    args = config()
    args.sleep_after_execution = 2.5
    prepare(args)

    test_config_base_dir = args.test_config_base_dir

    test_file_list = []
    st_idx = args.test_start_idx
    ed_idx = args.test_end_idx
    for i in range(st_idx, ed_idx):
        test_file_list.append(os.path.join(test_config_base_dir, f"{i}.json"))
    test_file_list = get_unfinished(test_file_list, args.result_dir)
    print(f"Total {len(test_file_list)} tasks left")
    args.render = False
    args.render_screenshot = True
    args.save_trace_enabled = True

    args.current_viewport_only = True
    dump_config(args)

    test(args, test_file_list)
5838
VAB-WebArena-Lite/new/test_webarena_lite.raw.json
Normal file
File diff suppressed because it is too large
29
VAB-WebArena-Lite/new/tokenizers.py
Normal file
@@ -0,0 +1,29 @@
from typing import Any

import tiktoken
from transformers import LlamaTokenizer  # type: ignore


class Tokenizer(object):
    def __init__(self, provider: str, model_name: str) -> None:
        if provider == "openai":
            self.tokenizer = tiktoken.encoding_for_model(model_name)
        elif provider == "huggingface":
            self.tokenizer = LlamaTokenizer.from_pretrained(model_name)
            # turn off adding special tokens automatically
            self.tokenizer.add_special_tokens = False  # type: ignore[attr-defined]
            self.tokenizer.add_bos_token = False  # type: ignore[attr-defined]
            self.tokenizer.add_eos_token = False  # type: ignore[attr-defined]
        elif provider in ["google", "api", "finetune"]:
            self.tokenizer = None  # Not used for input length computation, as Gemini is based on characters
        else:
            raise NotImplementedError

    def encode(self, text: str) -> list[int]:
        return self.tokenizer.encode(text)

    def decode(self, ids: list[int]) -> str:
        return self.tokenizer.decode(ids)

    def __call__(self, text: str) -> list[int]:
        return self.tokenizer.encode(text)
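As a quick illustration of the class above, mirroring how the prompt constructors truncate observations to `max_obs_length` tokens (the model name and sample text are placeholders):

```python
# Sketch: token-level truncation as used by the prompt constructors.
from llms.tokenizers import Tokenizer

tok = Tokenizer("openai", "gpt-3.5-turbo")
ids = tok.encode("click [60] then type [61] [query]")
truncated = tok.decode(ids[:5])  # keep only the first 5 tokens
print(len(ids), truncated)
```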
87
VAB-WebArena-Lite/new/utils.py
Normal file
@@ -0,0 +1,87 @@
import argparse
from typing import Any

try:
    from vertexai.preview.generative_models import Image
    from llms import generate_from_gemini_completion
except Exception:
    print('Google Cloud not set up, skipping import of vertexai.preview.generative_models.Image and llms.generate_from_gemini_completion')

from llms import (
    generate_from_huggingface_completion,
    generate_from_openai_chat_completion,
    generate_from_openai_completion,
    generate_with_api,
    lm_config,
)

APIInput = str | list[Any] | dict[str, Any]


def call_llm(
    lm_config: lm_config.LMConfig,
    prompt: APIInput,
) -> str:
    response: str
    if lm_config.provider == "openai":
        if lm_config.mode == "chat":
            assert isinstance(prompt, list)
            response = generate_from_openai_chat_completion(
                messages=prompt,
                model=lm_config.model,
                temperature=lm_config.gen_config["temperature"],
                top_p=lm_config.gen_config["top_p"],
                context_length=lm_config.gen_config["context_length"],
                max_tokens=lm_config.gen_config["max_tokens"],
                stop_token=None,
            )
        elif lm_config.mode == "completion":
            assert isinstance(prompt, str)
            response = generate_from_openai_completion(
                prompt=prompt,
                engine=lm_config.model,
                temperature=lm_config.gen_config["temperature"],
                max_tokens=lm_config.gen_config["max_tokens"],
                top_p=lm_config.gen_config["top_p"],
                stop_token=lm_config.gen_config["stop_token"],
            )
        else:
            raise ValueError(
                f"OpenAI models do not support mode {lm_config.mode}"
            )
    elif lm_config.provider == "huggingface":
        assert isinstance(prompt, str)
        response = generate_from_huggingface_completion(
            prompt=prompt,
            model_endpoint=lm_config.gen_config["model_endpoint"],
            temperature=lm_config.gen_config["temperature"],
            top_p=lm_config.gen_config["top_p"],
            stop_sequences=lm_config.gen_config["stop_sequences"],
            max_new_tokens=lm_config.gen_config["max_new_tokens"],
        )
    elif lm_config.provider == "google":
        assert isinstance(prompt, list)
        assert all(
            [isinstance(p, str) or isinstance(p, Image) for p in prompt]
        )
        response = generate_from_gemini_completion(
            prompt=prompt,
            engine=lm_config.model,
            temperature=lm_config.gen_config["temperature"],
            max_tokens=lm_config.gen_config["max_tokens"],
            top_p=lm_config.gen_config["top_p"],
        )
    elif lm_config.provider in ["api", "finetune"]:
        args = {
            "temperature": lm_config.gen_config["temperature"],  # openai, gemini, claude
            "max_tokens": lm_config.gen_config["max_tokens"],  # openai, gemini, claude
            "top_k": lm_config.gen_config["top_p"],  # qwen
        }
        response = generate_with_api(prompt, lm_config.model, args)

    else:
        raise NotImplementedError(
            f"Provider {lm_config.provider} not implemented"
        )

    return response
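A sketch of driving `call_llm` directly. In the harness this call is made for you inside the agent, and the `LMConfig` field names are again assumed from the accompanying `lm_config.py`:

```python
# Sketch: calling the dispatcher directly (LMConfig fields assumed from the
# accompanying lm_config.py; requires OPENAI_API_KEY for this provider).
from llms import lm_config
from llms.utils import call_llm

config = lm_config.LMConfig(provider="openai", model="gpt-3.5-turbo", mode="chat")
config.gen_config.update(
    {"temperature": 1.0, "top_p": 0.9, "context_length": 0, "max_tokens": 384}
)

prompt = [{"role": "user", "content": "Say 'hello webarena'."}]
print(call_llm(config, prompt))
```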
78
VAB-WebArena-Lite/new/wa_parallel_run.sh
Normal file
@@ -0,0 +1,78 @@
#!/bin/bash
DATASET='webarena' # webarena, visualwebarena
result_dir=''
provider=''
model=''
instruction_path='agent/prompts/jsons/p_som_cot_id_actree_3s.json' # e.g., agent/prompts/jsons/p_cot_id_actree_2s.json
test_config_base_dir='config_files/wa/test_webarena_lite' # e.g., config_files/wa/test_webarena_lite
temperature=0.0

SERVER='' # your server address
MAP_SERVER='' # the same as SERVER
OPENAI_API_KEY=''
OPENAI_ORGANIZATION=''
CONDA_ENV_NAME='' # the name of your conda environment

ENV_VARIABLES="export DATASET=${DATASET}; export SHOPPING='http://${SERVER}:7770';export SHOPPING_ADMIN='http://${SERVER}:7780/admin';export REDDIT='http://${SERVER}:9999';export GITLAB='http://${SERVER}:8023';export MAP='http://${MAP_SERVER}:3000';export WIKIPEDIA='http://${SERVER}:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing';export HOMEPAGE='http://${SERVER}:4399';export OPENAI_API_KEY=${OPENAI_API_KEY};export OPENAI_ORGANIZATION=${OPENAI_ORGANIZATION}"

# get the number of tmux panes
num_panes=$(tmux list-panes | wc -l)

# calculate how many panes need to be created
let "panes_to_create = 8 - num_panes"

# array of tmux commands to create each pane
tmux_commands=(
    'tmux split-window -h'
    'tmux split-window -v'
    'tmux select-pane -t 0; tmux split-window -v'
    'tmux select-pane -t 0; tmux split-window -v'
    'tmux select-pane -t 2; tmux split-window -v'
    'tmux select-pane -t 4; tmux split-window -v'
    'tmux select-pane -t 6; tmux split-window -v'
)

# create panes up to 8
for ((i=0; i<$panes_to_create; i++)); do
    eval ${tmux_commands[$i]}
done

# Function to run a job
run_job() {
    tmux select-pane -t $1
    tmux send-keys "tmux set mouse on; conda activate ${CONDA_ENV_NAME}; ${ENV_VARIABLES}; until python run.py --viewport_width 1280 --viewport_height 720 --test_start_idx $2 --test_end_idx $3 --provider ${provider} --model ${model} --instruction_path ${instruction_path} --temperature ${temperature} --test_config_base_dir ${test_config_base_dir} --result_dir ${result_dir}; do echo 'crashed' >&2; sleep 1; done" C-m
    sleep 3
}

TOLERANCE=2
run_batch() {
    args=("$@") # save all arguments in an array
    num_jobs=${#args[@]} # get number of arguments

    for ((i=1; i<$num_jobs; i++)); do
        run_job $i ${args[i-1]} ${args[i]}
    done

    # Wait for all jobs to finish
    while tmux list-panes -F "#{pane_pid} #{pane_current_command}" | grep -q python; do
        sleep 100 # wait for 100 seconds before checking again
    done

    # Run checker
    while ! python scripts/check_error_runs.py ${result_dir} --delete_errors --tolerance ${TOLERANCE}; do
        echo "Check failed, rerunning jobs..."
        for ((i=1; i<$num_jobs; i++)); do
            run_job $i ${args[i-1]} ${args[i]}
        done

        # Wait for all jobs to finish
        while tmux list-panes -F "#{pane_pid} #{pane_current_command}" | grep -q python; do
            sleep 100 # wait for 100 seconds before checking again
        done
    done

}

run_batch 0 24 48 72 96 120 143 165
71
VAB-WebArena-Lite/refresh_website_dockers.sh
Normal file
@@ -0,0 +1,71 @@
#!/bin/bash

docker stop shopping
docker stop shopping_admin
docker stop forum
docker stop gitlab

docker rm shopping
docker rm shopping_admin
docker rm forum
docker rm gitlab

# One Stop Shop
# docker load --input shopping_final_0712.tar
docker run --name shopping -p 7770:80 -d shopping_final_0712

# CMS
# docker load --input shopping_admin_final_0719.tar
docker run --name shopping_admin -p 7780:80 -d shopping_admin_final_0719

# Reddit
# docker load --input postmill-populated-exposed-withimg.tar
docker run --name forum -p 9999:80 -d postmill-populated-exposed-withimg

# GitLab
# docker load --input gitlab-populated-final-port8023.tar
docker run --name gitlab --shm-size="10g" -d -p 8023:8023 gitlab-populated-final-port8023 /opt/gitlab/embedded/bin/runsvdir-start

sleep 60

# Define your actual server hostname
YOUR_ACTUAL_HOSTNAME="http://localhost"
# Remove trailing / if it exists
YOUR_ACTUAL_HOSTNAME=${YOUR_ACTUAL_HOSTNAME%/}

# OSS
docker exec shopping /var/www/magento2/bin/magento setup:store-config:set --base-url="${YOUR_ACTUAL_HOSTNAME}:7770"
docker exec shopping mysql -u magentouser -pMyPassword magentodb -e "UPDATE core_config_data SET value='${YOUR_ACTUAL_HOSTNAME}:7770/' WHERE path = 'web/secure/base_url';"
docker exec shopping /var/www/magento2/bin/magento cache:flush

# Disable re-indexing of products
docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalogrule_product
docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalogrule_rule
docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalogsearch_fulltext
docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalog_category_product
docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule customer_grid
docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule design_config_grid
docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule inventory
docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalog_product_category
docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalog_product_attribute
docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalog_product_price
docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule cataloginventory_stock

# CMS
docker exec shopping_admin /var/www/magento2/bin/magento setup:store-config:set --base-url="${YOUR_ACTUAL_HOSTNAME}:7780"
docker exec shopping_admin mysql -u magentouser -pMyPassword magentodb -e "UPDATE core_config_data SET value='${YOUR_ACTUAL_HOSTNAME}:7780/' WHERE path = 'web/secure/base_url';"
docker exec shopping_admin /var/www/magento2/bin/magento cache:flush

# Forum
docker exec forum sed -i '/@RateLimit/,/)/d' /var/www/html/src/DataObject/CommentData.php
docker exec forum sed -i '/@RateLimit/,/)/d' /var/www/html/src/DataObject/SubmissionData.php
docker exec forum sed -i '/@RateLimit/,/)/d' /var/www/html/src/DataObject/UserData.php
docker exec forum bin/console cache:clear --env=prod

sleep 60

# Gitlab
docker exec gitlab sed -i "s|^external_url.*|external_url '${YOUR_ACTUAL_HOSTNAME}:8023'|" /etc/gitlab/gitlab.rb
docker exec gitlab sed -i "s/.*postgresql\['max_connections'.*/postgresql\['max_connections'\] = 2000/g" /etc/gitlab/gitlab.rb
docker exec gitlab gitlab-ctl reconfigure
docker exec gitlab gitlab-ctl restart
37
VAB-WebArena-Lite/replace.sh
Normal file
@@ -0,0 +1,37 @@
#!/bin/bash

# 1. reset the original environment
cd visualwebarena
rm -rf .git
cd ..

# 2. replace files
# webarena-lite
cp -f new/test_webarena_lite.raw.json visualwebarena/config_files/wa/test_webarena_lite.raw.json
cp -f new/generate_test_data.py visualwebarena/scripts/generate_test_data.py

# agent
cp -f new/run.py visualwebarena/run.py
cp -f new/agent.py visualwebarena/agent/agent.py
cp -f new/prompt_constructor.py visualwebarena/agent/prompts/prompt_constructor.py

# llms
cp -f new/utils.py visualwebarena/llms/utils.py
cp -f new/llms_init.py visualwebarena/llms/__init__.py
cp -f new/lm_config.py visualwebarena/llms/lm_config.py
cp -f new/tokenizers.py visualwebarena/llms/tokenizers.py
cp -f new/api_utils.py visualwebarena/llms/providers/api_utils.py
cp -f new/openai_utils.py visualwebarena/llms/providers/openai_utils.py

# eval
cp -f new/evaluators.py visualwebarena/evaluation_harness/evaluators.py
cp -f new/helper_functions.py visualwebarena/evaluation_harness/helper_functions.py

# misc
cp -f README.md visualwebarena/README.md
cp -f new/wa_parallel_run.sh visualwebarena/wa_parallel_run.sh

# 3. move files into place and remove temporary directories
mv visualwebarena/* .
rm -rf new
rm -rf visualwebarena
BIN
assets/framework.png
Normal file
Binary file not shown. Size: 407 KiB
49
docs/detailed_setups/VAB-Minecraft.md
Normal file
@@ -0,0 +1,49 @@
# Setup for VAB-Minecraft

## Installation

1. We have tested on Ubuntu. VAB-Minecraft requires an NVIDIA GPU with at least 4 GB of memory and NVIDIA GPU driver version >= 530.30.02.

2. Besides [Docker](https://www.docker.com/), install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) on your machine.

3. Get the pre-built docker image.

    - If you have access to Docker Hub:

        ```bash
        docker pull tianjiezhang/vab_minecraft:latest
        ```

    - Or you can download it from ModelScope.

        1. Make sure `git-lfs` is installed.

        2. Download from ModelScope:

            ```bash
            git lfs install

            git clone https://www.modelscope.cn/datasets/VisualAgentBench/VAB-Minecraft.git
            ```

        3. Load the docker image from the ModelScope dataset:

            ```bash
            docker load -i VAB-Minecraft/vab_minecraft.tar
            ```

4. Download the weights of Steve-1 to `data/minecraft`. Please make sure you have access to Google Drive.

    ```bash
    python scripts/minecraft_download.py
    ```

## Get Started

According to your hardware, fill in `available_ports` and `available_devices` in the task configuration file `configs/tasks/minecraft.yaml`; a sketch follows below.

- `available_ports`: Please fill in available ports on your machine. Each concurrent docker container requires 1 port for communication with the task server. Ensure that you provide enough ports to accommodate the expected concurrency.

- `available_devices`: Please fill in GPU IDs and their corresponding concurrency capacity. Each concurrent docker container occupies about **3.3 GB** of GPU memory. Ensure that you provide enough GPU memory to accommodate the expected concurrency.

**Note: If you manually shut down the task server and assigner, please ensure you also stop the Minecraft containers to free up the ports!**
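The exact schema of `configs/tasks/minecraft.yaml` is not reproduced in this diff, so the sketch below is illustrative: only the two field names discussed above come from the text, and the port numbers, GPU IDs, and per-GPU counts are made-up placeholders.

```yaml
# Illustrative sketch of the two fields to edit in configs/tasks/minecraft.yaml.
# All values are placeholders; the surrounding schema is defined by the repo's task configs.
available_ports:      # one free port per concurrent container
  - 12000
  - 12001
  - 12002
available_devices:    # GPU ID -> number of concurrent containers it can host
  0: 2                # e.g., ~3.3 GB per container, so 2 fit on an 8 GB GPU
  1: 2
```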
@@ -57,57 +57,3 @@
2. Load the new value: `sudo sysctl -p`.

**Note: If you manually shut down the task server and assigner, please ensure you also stop the OmniGibson containers to free up the ports!**
# Setup for VAB-CSS

TODO