diff --git a/README.md b/README.md index cf75e79..1d0782c 100644 --- a/README.md +++ b/README.md @@ -26,32 +26,30 @@ Compared to its predecessor [AgentBench](https://github.com/THUDM/AgentBench), V ## Table of Contents +- [Quick Start](#quick-start) - [Dataset Summary](#dataset-summary) - [Leaderboard](#leaderboard) - [Quick Start](#quick-start) - [Acknowledgement](#acknowledgement) - [Citation](#citation) -## Dataset Summary - -We offer two splits for each dataset: Testing and Training. Different from its predecessor [AgentBench](https://github.com/THUDM/AgentBench), VAB is accompanied with a trajectory training set for behavior cloning (BC) training, which allows development of more potent visual foundation agents with emerging open LMMs. - -![](./assets/statistics.png) - -## Leaderboard - -Here is the scores on test set results of VAB. All metrics are task Success Rate (SR). Noted that proprietary LMMs are tested with mere **Prompting**, and open LMMs are tested after **Multitask Finetuning** on VAB training set, as they usually fail to follow complicated agent task instructions. - -![](./assets/leaderboard.png) - ## Quick Start -This section will guide you on how to use `gpt-4o-2024-05-13` as an agent to launch 4 concurrent `VAB-Minecraft` tasks. -For the specific framework structure, please refer to AgentBench's [Framework Introduction](https://github.com/THUDM/AgentBench/blob/main/docs/Introduction_en.md). +This section will first give you an overview to the use and architecture of VAB. +Next, it will guide you on how to use `gpt-4o-2024-05-13` as an exemplar agent to launch 4 concurrent `VAB-Minecraft` tasks. + +### Overview on VAB Framework + +To allow fast evaluation over agent tasks, we leverage AgentBench's framework as the backbone (currently for VAB-OmniGibson, VAB-Minecraft, and VAB-CSS). 
+If you are interested in its detailed implementation, please refer to AgentBench's [Framework Introduction](https://github.com/THUDM/AgentBench/blob/main/docs/Introduction_en.md) (which may not be necessary). +Basically, the framework calls all LLM/LMM in API formats via `Agent-Controller`, and accesses to environments via `Task-Controller`. +The `Assigner` will automatically assign evaluation tasks by pairing `Agent-Controller` and `Task-Controller` to optimize the overall evaluation speed. For more detailed configuration and launch methods, please check [Configuration Guide](docs/Config_en.md) and [Program Entrance Guide](docs/Entrance_en.md). -### Step 1. Prerequisites +![](./assets/framework.png) + +### Step 1. Prerequisites for All Environments Clone this repo and install the dependencies. @@ -68,7 +66,14 @@ Ensure that [Docker](https://www.docker.com/) is properly installed. docker ps ``` -For specific environments, please refer to their respective prerequisites: [VAB-OmniGibson](docs/README_setup.md#Setup-for-VAB-OmniGibson), [VAB-Minecraft](docs/README_setup.md#Setup-for-VAB-Minecraft), [VAB-CSS](docs/README_setup.md#Setup-for-VAB-CSS). +For specific environments, please refer to their additional prerequisites respectively. +For VAB-WebArena-Lite, it is based on [WebArena](https://github.com/webarena-x/webarena) with some modifications, so please read its individual setup carefully. + +* [VAB-OmniGibson Setup](docs/detailed_setups/VAB-OmniGibson.md) +* [VAB-Minecraft Setup](docs/detailed_setups/VAB-Minecraft.md) +* VAB-Mobile: Ongoing +* [VAB-WebArena-Lite Setup](VAB-WebArena-Lite/README.md) (Separate installation and evaluation method) +* VAB-CSS: Ongoing ### Step 2. Configure the Agent @@ -117,6 +122,19 @@ python -m src.assigner --auto-retry --config configs/assignments/omnigibson.yaml You can modify the config files to launch other tasks or change task concurrency. +## Dataset Summary + +We offer two splits for each dataset: Testing and Training. 
Different from its predecessor [AgentBench](https://github.com/THUDM/AgentBench), VAB is accompanied with a trajectory training set for behavior cloning (BC) training, which allows development of more potent visual foundation agents with emerging open LMMs. + +![](./assets/statistics.png) + +## Leaderboard + +Here is the scores on test set results of VAB. All metrics are task Success Rate (SR). Noted that proprietary LMMs are tested with mere **Prompting**, and open LMMs are tested after **Multitask Finetuning** on VAB training set, as they usually fail to follow complicated agent task instructions. + +![](./assets/leaderboard.png) + + ## Acknowledgement This project is heavily built upon the following repositories (to be updated): diff --git a/VAB-WebArena-Lite/README.md b/VAB-WebArena-Lite/README.md new file mode 100644 index 0000000..15a2c34 --- /dev/null +++ b/VAB-WebArena-Lite/README.md @@ -0,0 +1,218 @@ +# Setup for VAB-WebArena-Lite + +## Brief Introduction + +VAB-WebArena-Lite is a 165-task refined subset from WebArena. +The purpose of building this subset is to manually ensure task correctness & feasibility, and speed up testing (original 812-task WebArena usually takes more than 6h to run through, while VAB-WebArena-Lite takes around 40m in practice). +The modified version of the test cases can be found in `config_files/wa/test_webarena_lite.raw.json`. + + +## Install + +First, you should clone the official repository of VisualWebArena to this directory + +```bash +# Assume you have cloned VAB and is now in the `VAB-WebArena-Lite` directory +git clone https://github.com/web-arena-x/visualwebarena.git visualwebarena +cd visualwebarena +git reset --hard ad57aae4dad71531504726900b80db02e0526158 +cd .. 
+``` + +Then, substitute the modified files using the command below: + +```bash +bash replace.sh +``` + +After that, install the dependencies for VAB-WebArena-Lite (we recommend using a conda environment independent of the VAB one): + +```bash +# Python 3.10 (or 3.11, but not 3.12, because 3.12 removed distutils, which is needed here) +python -m venv venv +source venv/bin/activate +pip install -r requirements.txt +playwright install +pip install -e . +``` + +You can also run the unit tests to ensure that WebArena-Lite is installed correctly: + +```bash +pytest -x +``` + +## Setup WebArena-Lite Environments + +1. Set up the standalone environments. +Please check out [this page](https://github.com/web-arena-x/webarena/tree/main/environment_docker) for details. + +2. Configure the URLs for each website. +First, export the `DATASET` to be `webarena`: + +```bash +export DATASET=webarena +``` + +Then, set the URLs for the websites +(🚨 Notice: check that the default ports of the websites below correspond to those you set up in the first step) + +```bash +# The CLASSIFIEDS environment is not included in the WebArena-Lite evaluation; we keep the environment variables here just for consistency. +export CLASSIFIEDS=":9980" +export CLASSIFIEDS_RESET_TOKEN="4b61655535e7ed388f0d40a93600254c" + +# Below are the variables you should set for the evaluation. +export SHOPPING=":7770" +export REDDIT=":9999" +export SHOPPING_ADMIN=":7780/admin" +export GITLAB=":8023" +export MAP=":3000" +export WIKIPEDIA=":8888" +export HOMEPAGE=":4399" +``` + +3. Generate config files for each test example: + +```bash +python scripts/generate_test_data.py +``` + +You will see `*.json` files generated in the [config_files](./config_files) folder. Each file contains the configuration for one test example. + +4. Obtain and save the auto-login cookies for all websites: + +```bash +bash prepare.sh +``` + +5. Set up API keys. 
+ +```bash +export OPENAI_API_KEY=your_key + +# Optional: if you use a different OpenAI model source +export OPENAI_API_URL=your_url + +# Optional: you can set the following variables to evaluate the preset model in llms/providers/api_utils.py +export GEMENI_API_KEY=your_key +export QWEN_API_KEY=your_key +export CLAUDE_API_KEY=your_key + +# Optional: if you have trained your model, we recommend deploying it as an API service, where you can set a FINETUNED_URL to evaluate it. +export FINETUNED_URL=your_url + +``` + +If using Gemini, first install the [gcloud CLI](https://cloud.google.com/sdk/docs/install). Configure the API key by authenticating with Google Cloud: + +```bash +gcloud auth login +gcloud config set project +``` + +## 🖼️ Evaluating in VAB Standard Setting with SoM (Set-of-Marks) Visual Agents + +### 👎 Run Single Agent For Evaluation (Slow, but please read to understand the meaning of the arguments) + +To run your own model with the SoM visual agent, run the evaluation with the following flags: + +```bash +python run.py \ + --instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \ + --test_start_idx 0 \ + --test_end_idx 1 \ + --result_dir \ + --test_config_base_dir config_files/wa/test_webarena_lite \ + --provider api \ + --model openai_gpt-4-vision-preview \ + --action_set_tag som --observation_type image_som +``` + +Besides the original model providers (OpenAI, Google), you can also add your models in `llms/providers/api_utils.py`. Remember to set `--provider` to one of: + +- `api`: Keeps the same input style as WebArena, suitable for regular API calls. +- `finetune`: Required for models trained with the data we provide. + +For the `--model` variable, we use the format `<source>_<model-name>`. + +- If a source offers no further model options, you can set it to just `source`. +- Remember that the source name here should be added in the init function of `APIModel` in `llms/providers/api_utils.py`. 
+- For example, if you want to use the OpenAI model "gpt-4o", you can set the flag like this: `--model openai_gpt-4o`. + +Finally, run `score.py` to get the pass rate: +```bash +python score.py +``` + +### 👍 Run Parallel Agent For Evaluation (Recommended) + +To run the tests in parallel, first configure `wa_parallel_run.sh`, then run it. By default, we split the test set into 5 parallel groups for evaluation in VAB. + +```bash +# Remember to first launch a tmux session +tmux +bash wa_parallel_run.sh +``` + +The script resumes automatically if it is interrupted or hits an unexpected error. Please feel free to rerun the above command until all tasks finish. + +After all parallel groups finish, run `score.py` to get the pass rate: +```bash +python score.py +``` + +### 🚨 Important: Refresh all websites before re-running another round of testing! +Since tasks in WebArena may change the state and databases of the websites (e.g., posting comments on Reddit), the results will be problematic if the websites are not all refreshed before another round of evaluation. + +Please remember to run the following command (assuming you host the WebArena websites yourself) to restart and refresh all website Docker containers and avoid potential contamination. +The process usually takes 3-5 minutes. + +```bash +# Make sure the script is executed on the machine where those website Docker containers run +bash refresh_website_docker.sh +``` + +You may need to change some contents of the script (e.g., the configured ports of the websites, the names of the Docker containers, etc.). + +## Run Visualized Demonstration +The original WebArena has also prepared a demo for you to run the agents on your own task on an arbitrary webpage. In one example, the agent is tasked with finding the best Thai restaurant in Pittsburgh. 
+ +After following the setup instructions above and setting the OpenAI API key (the other environment variables for website URLs aren't really used, so you should be able to set them to some dummy variable), you can run the GPT-4V + SoM agent with the following command: + +```bash +python run_demo.py \ + --instruction_path agent/prompts/jsons/p_som_cot_id_actree_3s.json \ + --start_url "https://www.amazon.com" \ + --image "https://media.npr.org/assets/img/2023/01/14/this-is-fine_wide-0077dc0607062e15b476fb7f3bd99c5f340af356-s1400-c100.jpg" \ + --intent "Help me navigate to a shirt that has this on it." \ + --result_dir demo_test_amazon \ + --model gpt-4-vision-preview \ + --action_set_tag som --observation_type image_som \ + --render +``` + +This tasks the agent to find a shirt that looks like the provided image (the "This is fine" dog) from Amazon. Have fun! + +## Acknowledgements + +Our code is heavily based off the WebArena codebase and VisualWebArena codebase. + +If you find this environment useful, please consider citing VisualWebArena as well as WebArena: + +```bibtex +@article{koh2024visualwebarena, + title={VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks}, + author={Koh, Jing Yu and Lo, Robert and Jang, Lawrence and Duvvur, Vikram and Lim, Ming Chong and Huang, Po-Yu and Neubig, Graham and Zhou, Shuyan and Salakhutdinov, Ruslan and Fried, Daniel}, + journal={arXiv preprint arXiv:2401.13649}, + year={2024} +} + +@article{zhou2024webarena, + title={WebArena: A Realistic Web Environment for Building Autonomous Agents}, + author={Zhou, Shuyan and Xu, Frank F and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Bisk, Yonatan and Fried, Daniel and Alon, Uri and others}, + journal={ICLR}, + year={2024} +} +``` + diff --git a/VAB-WebArena-Lite/new/agent.py b/VAB-WebArena-Lite/new/agent.py new file mode 100644 index 0000000..461ba7e --- /dev/null +++ b/VAB-WebArena-Lite/new/agent.py @@ -0,0 +1,227 @@ 
+import argparse +import json +from typing import Any, Optional + +import tiktoken +from beartype import beartype +from PIL import Image + +from agent.prompts import * +from browser_env import Trajectory +from browser_env.actions import ( + Action, + ActionParsingError, + create_id_based_action, + create_none_action, + create_playwright_action, +) +from browser_env.utils import Observation, StateInfo +from llms import ( + call_llm, + generate_from_huggingface_completion, + generate_from_openai_chat_completion, + generate_from_openai_completion, + lm_config, +) +from llms.tokenizers import Tokenizer + + +class Agent: + """Base class for the agent""" + + def __init__(self, *args: Any) -> None: + pass + + def next_action( + self, trajectory: Trajectory, intent: str, meta_data: Any + ) -> Action: + """Predict the next action given the observation""" + raise NotImplementedError + + def reset( + self, + test_config_file: str, + ) -> None: + raise NotImplementedError + + +class TeacherForcingAgent(Agent): + """Agent that follows a pre-defined action sequence""" + + def __init__(self) -> None: + super().__init__() + + def set_action_set_tag(self, tag: str) -> None: + self.action_set_tag = tag + + def set_actions(self, action_seq: str | list[str]) -> None: + if isinstance(action_seq, str): + action_strs = action_seq.strip().split("\n") + else: + action_strs = action_seq + action_strs = [a.strip() for a in action_strs] + + actions = [] + for a_str in action_strs: + try: + if self.action_set_tag == "playwright": + cur_action = create_playwright_action(a_str) + elif self.action_set_tag == "id_accessibility_tree": + cur_action = create_id_based_action(a_str) + else: + raise ValueError( + f"Unknown action type {self.action_set_tag}" + ) + except ActionParsingError as e: + cur_action = create_none_action() + + cur_action["raw_prediction"] = a_str + actions.append(cur_action) + + self.actions: list[Action] = actions + + def next_action( + self, trajectory: Trajectory, intent: str, 
meta_data: Any + ) -> Action: + """Predict the next action given the observation""" + return self.actions.pop(0) + + def reset( + self, + test_config_file: str, + ) -> None: + with open(test_config_file) as f: + ref_actions = json.load(f)["reference_action_sequence"] + tag = ref_actions["action_set_tag"] + action_seq = ref_actions["action_sequence"] + self.set_action_set_tag(tag) + self.set_actions(action_seq) + + +class PromptAgent(Agent): + """prompt-based agent that emits action given the history""" + + @beartype + def __init__( + self, + action_set_tag: str, + lm_config: lm_config.LMConfig, + prompt_constructor: PromptConstructor, + captioning_fn = None, + ) -> None: + super().__init__() + self.lm_config = lm_config + self.prompt_constructor = prompt_constructor + self.action_set_tag = action_set_tag + self.captioning_fn = captioning_fn + + # Check if the model is multimodal. + if ("gemini" in lm_config.model or "gpt-4" in lm_config.model and "vision" in lm_config.model or lm_config.provider in ["api", "finetune"]) and type(prompt_constructor) == MultimodalCoTPromptConstructor: + self.multimodal_inputs = True + else: + self.multimodal_inputs = False + + def set_action_set_tag(self, tag: str) -> None: + self.action_set_tag = tag + + @beartype + def next_action( + self, trajectory: Trajectory, intent: str, meta_data: dict[str, Any], images: Optional[list[Image.Image]] = None, + output_response: bool = False + ) -> Action: + # Create page screenshot image for multimodal models. + if self.multimodal_inputs: + page_screenshot_arr = trajectory[-1]["observation"]["image"] + page_screenshot_img = Image.fromarray( + page_screenshot_arr + ) # size = (viewport_width, viewport_width) + + # Caption the input image, if provided. 
+ if images is not None and len(images) > 0: + if self.captioning_fn is not None: + image_input_caption = "" + for image_i, image in enumerate(images): + if image_i == 0: + image_input_caption += f'Input image {image_i+1}: "{self.captioning_fn([image])[0]}"' + else: + image_input_caption += f'input image {image_i+1}: "{self.captioning_fn([image])[0]}"' + if len(images) > 1: + image_input_caption += ", " + # Update intent to include captions of input images. + intent = f"{image_input_caption}\nIntent: {intent}" + elif not self.multimodal_inputs: + print( + "WARNING: Input image provided but no image captioner available." + ) + + if self.multimodal_inputs: + prompt = self.prompt_constructor.construct( + trajectory, intent, page_screenshot_img, images, meta_data + ) + else: + prompt = self.prompt_constructor.construct( + trajectory, intent, meta_data + ) + lm_config = self.lm_config + n = 0 + while True: + response = call_llm(lm_config, prompt) + force_prefix = self.prompt_constructor.instruction[ + "meta_data" + ].get("force_prefix", "") + response = f"{force_prefix}{response}" + if output_response: + print(f'Agent: {response}', flush=True) + n += 1 + try: + parsed_response = self.prompt_constructor.extract_action( + response + ) + if self.action_set_tag == "id_accessibility_tree": + action = create_id_based_action(parsed_response) + elif self.action_set_tag == "playwright": + action = create_playwright_action(parsed_response) + elif self.action_set_tag == "som": + action = create_id_based_action(parsed_response) + else: + raise ValueError( + f"Unknown action type {self.action_set_tag}" + ) + action["raw_prediction"] = response + break + except ActionParsingError as e: + if n >= lm_config.gen_config["max_retry"]: + action = create_none_action() + action["raw_prediction"] = response + break + + return action + + def reset(self, test_config_file: str) -> None: + pass + + +def construct_agent(args: argparse.Namespace, captioning_fn=None) -> Agent: + llm_config = 
lm_config.construct_llm_config(args) + + agent: Agent + if args.agent_type == "teacher_forcing": + agent = TeacherForcingAgent() + elif args.agent_type == "prompt": + with open(args.instruction_path) as f: + constructor_type = json.load(f)["meta_data"]["prompt_constructor"] + tokenizer = Tokenizer(args.provider, args.model) + prompt_constructor = eval(constructor_type)( + args.instruction_path, lm_config=llm_config, tokenizer=tokenizer + ) + agent = PromptAgent( + action_set_tag=args.action_set_tag, + lm_config=llm_config, + prompt_constructor=prompt_constructor, + captioning_fn=captioning_fn + ) + else: + raise NotImplementedError( + f"agent type {args.agent_type} not implemented" + ) + return agent diff --git a/VAB-WebArena-Lite/new/api_utils.py b/VAB-WebArena-Lite/new/api_utils.py new file mode 100644 index 0000000..54cef30 --- /dev/null +++ b/VAB-WebArena-Lite/new/api_utils.py @@ -0,0 +1,682 @@ +import os +import copy +import json +import time +import base64 +import shutil +import requests + +import dashscope +import http.client +import anthropic + +import google.auth +from google.oauth2 import service_account +from google.auth.transport.requests import Request + +from openai import OpenAI +from typing import List, Tuple, Dict +from http import HTTPStatus +from PIL import Image +from io import BytesIO + +PROXIES = { # gemini + "http": "http://127.0.0.1:7890", + "https": "http://127.0.0.1:7890" +} + +SEED = int(os.environ.get("SEED", 42)) + +GCLOUD_KEY_FILE_PATH = "" # path to the google cloud project json file +GCLOUD_REGIONAL_CODE = "asia-east1" +OPENAI_API_URL = os.environ.get("OPENAI_API_URL") +FINETUNED_URL = os.environ.get("FINETUNED_URL") # finetuned model url + +OPENAI_API_KEY = os.environ["OPENAI_API_KEY"] # you should alway setup openai api key for evaluation +GEMINI_API_KEY = os.environ.get("GEMENI_API_KEY", "") # no need when using google cloud +QWEN_API_KEY = os.environ.get("QWEN_API_KEY" , "") +CLAUDE_API_KEY = os.environ.get("CLAUDE_API_KEY", "") 
+ +class BasicModel(object): + def __init__(self): + super().__init__() + # make temp dir here + file_path = os.path.dirname(__file__) + self.base_dir = os.path.join(file_path, "temp", f"{int(time.time())}") + os.makedirs(self.base_dir, exist_ok=True) + + def __del__(self): + # remove temp dir + shutil.rmtree(self.base_dir, ignore_errors=True) + + def prompt_construct(self, messages: List[Dict]) -> List[Dict]: + return messages + + @staticmethod + def process_system_prompt(messages: List[Dict]) -> List[Dict]: + if messages[0]["role"] != "system": + return messages + + new_messages = copy.deepcopy(messages[1:]) + system_prompt = messages[0]["content"] + + # Search for first user message and add system prompt to it + for item in new_messages: + if item.get("role") != "user": + continue + + for ct in item["content"]: + # Case 1: directly appended to the text + if ct["type"] == "text": + ct["text"] = system_prompt + "\n" + ct["text"] + return new_messages + + # Case 2: create a new text item + item["content"].insert(0, { + "type": "text", + "text": system_prompt + }) + return new_messages + + # Case 3: no user message found, add a new user message + new_messages.insert(0, { + "role": "user", + "content": [{ + "type": "text", + "text": system_prompt + }] + }) + return new_messages + + @staticmethod + def pil_to_b64(img: Image.Image) -> str: + with BytesIO() as image_buffer: + img.save(image_buffer, format="PNG") + byte_data = image_buffer.getvalue() + img_b64 = base64.b64encode(byte_data).decode("utf-8") + img_b64 = "data:image/png;base64," + img_b64 + return img_b64 + + # save base64 image and return filename + def b64_to_image(self, base64_data: str) -> str: + base64_data = base64_data.split(",")[1] + image_data = base64.b64decode(base64_data) + + filename = os.path.join(self.base_dir, f"{int(time.time())}.png") + with open(filename, "wb") as f: + f.write(image_data) + + return filename + + def get_model_response(self, messages: List[Dict], model_name: str, **args) -> 
Tuple[bool, str]: + raise NotImplementedError("Subclasses must implement this method") + + +class OpenAIModel(BasicModel): + def __init__(self): + super().__init__() + self.client = OpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_API_URL) + + def prompt_construct(self, messages: List[Dict]) -> List[Dict]: + return messages + + def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]: + if "OPENAI_API_KEY" not in os.environ: + raise ValueError( + "OPENAI_API_KEY environment variable must be set when using OpenAI API." + ) + + response = self.client.chat.completions.create( + model=model_name, + messages=messages, + temperature=args.get("temperature", 0.0), + max_tokens=args.get("max_tokens", 1024), + top_p=args.get("top_p", 1.0), + ) + + try: + answer: str = response.choices[0].message.content + return True, answer + except: + return False, str(response.error) + + +class FinetuneModel(BasicModel): + def __init__(self): + super().__init__() + self.url = FINETUNED_URL # inference api + + def prompt_construct(self, messages: List[Dict]) -> List[Dict]: + dialog, images = "", [] + for message in messages: + if message["role"] == "system": + dialog += f"<|system|>\n{message['content']}\n\n" + continue + elif message["role"] == "assistant": + dialog += f"<|assistant|>\n{message['content']}\n\n" + continue + + dialog += f"<|user|>\n" + for content in message["content"]: + if content["type"] == "text": + dialog += f"{content['text']}\n" + else: + # TODO: we use filename as image url here + images.append(self.b64_to_image(content["image_url"]["url"])) + dialog += "\n\n" + + dialog += "<|assistant|>\n" + images = [open(image, "rb") for image in images] + new_messages = [ + {"image": images[0]}, + {"prompt": dialog} + ] + + return new_messages + + def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]: + try: + response = requests.post(self.url, files=messages[0], data=messages[1], timeout=40) + 
response = response.json() + except Exception as e: + return False, str(e) + + if "error" in response: + return False, response["error"]["message"] + + # TODO: you should change the response format here + resp = f'```\n{response["response"].split("<|end_of_text|>")[0]}\n```' + return True, resp + + +class QwenModel(BasicModel): + def __init__(self): + super().__init__() + dashscope.api_key = QWEN_API_KEY + self.seed = SEED + + def prompt_construct(self, messages: List[Dict]) -> List[Dict]: + messages = self.process_system_prompt(messages) + + new_messages = [] + for message in messages: + if message["role"] != "user": + new_messages.append({ + "role": "assistant", + "content": [{"text": message["content"]}] + }) + continue + + new_content = [] + for content in message["content"]: + if content["type"] == "text": + new_content.append({ + "text": content["text"], + }) + else: + filename = self.b64_to_image(content["image_url"]["url"]) + new_content.append({ + "image": f"file://{filename}" + }) + new_messages.append({ + "role": "user", + "content": new_content + }) + + return new_messages + + def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]: + if "QWEN_API_KEY" not in os.environ: + raise ValueError( + "QWEN_API_KEY environment variable must be set when using Qwen API." 
+ ) + + response = dashscope.MultiModalConversation.call( + model=model_name, + messages=messages, + top_k=args.get("top_k"), + seed=self.seed + ) + + if response.status_code == HTTPStatus.OK: + return True, response.output.choices[0].message.content[0]['text'] + else: + return False, response.message + + +class ClaudeModel(BasicModel): + def __init__(self): + super().__init__() + self.client = anthropic.Anthropic(api_key=CLAUDE_API_KEY) + + def prompt_construct(self, messages: List[Dict]) -> List[Dict]: + new_messages = [] + for message in messages: + if message["role"] in ["system", "assistant"]: + new_messages.append(message) + continue + + new_content = [] + for content in message["content"]: + if content["type"] == "text": + new_content.append(content) + continue + + hdr, idata = content["image_url"]["url"].split(";base64,") + new_content.append({ + "type": "image", + "source": { + "type": "base64", + "media_type": hdr.split("data:")[1], + "data": idata + } + }) + + new_messages.append({ + "role": "user", + "content": new_content + }) + + return new_messages + + def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]: + try: + if messages[0]["role"] == "system": + system_prompt = messages[0]["content"] + messages = messages[1:] + response = self.client.messages.create( + model="claude-3-5-sonnet-20240620", + max_tokens=args.get("max_tokens"), + temperature=args.get("temperature"), + system=system_prompt, + messages=messages + ) + + else: + response = self.client.messages.create( + model="claude-3-5-sonnet-20240620", + max_tokens=args.get("max_tokens"), + temperature=args.get("temperature"), + messages=messages + ) + + usage = response.usage + prompt_tokens = usage.input_tokens + completion_tokens = usage.output_tokens + + # print(response) + print(response.content) + print(f"Prompt Tokens: {prompt_tokens}\nCompletion Tokens: {completion_tokens}\n") + return True, response.content + + except Exception as e: + return 
False, str(e) + + def get_model_response_thirdapi(self, messages) -> Tuple[bool, str]: + + conn = http.client.HTTPSConnection("cn2us02.opapi.win", timeout=900) + + system_prompt = None + if messages[0]["role"] == "system": + system_prompt = messages[0]["content"] + messages = messages[1:] + payload = { + "model": "claude-3-opus", + "stream": False, + "system": system_prompt, + "messages": messages, + "temperature": self.temperature, + "max_tokens": self.max_tokens + } + else: + payload = { + "model": "claude-3-opus", + "stream": False, + "messages": messages, + "temperature": self.temperature, + "max_tokens": self.max_tokens + } + payload = json.dumps(payload) + headers = { + 'Accept': 'application/json', + 'Authorization': f'Bearer {CLAUDE_API_KEY}', + 'User-Agent': 'Apifox/1.0.0 (https://apifox.com)', + 'Content-Type': 'application/json' + } + try: + conn.request("POST", "/v1/messages", payload, headers) + res = conn.getresponse() + data = res.read() + response = json.loads(data.decode("utf-8")) + + except Exception as e: + return False, str(e) + + if "statusCode" in response and response["statusCode"] != 200: + return False, response["message"] + + usage = response["usage"] + prompt_tokens = usage["input_tokens"] + completion_tokens = usage["output_tokens"] + print(f"Prompt Tokens: {prompt_tokens}\nCompletion Tokens: {completion_tokens}\n") + + return True, response["content"][0]["text"] + + +class GeminiModel(BasicModel): + def __init__(self): + super().__init__() + self.api_key = GEMINI_API_KEY + + def prompt_construct(self, messages: List[Dict]) -> List[Dict]: + parts = [] + dialog = "" + sep = "\n\n###\n\n" + for message in messages: + if message["role"] == "system": + dialog += f"SYSTEM:\n{message['content']}{sep}" + elif message["role"] == "assistant": + dialog += f"ASSISTANT:\n{message['content']}{sep}" + elif message["role"] == "user": + dialog += "USER:\n" + for content in message["content"]: + if content["type"] == "text": + dialog += content["text"] + 
"\n" + continue + + assert content["type"] == "image_url" + + # save text + parts.append({ "text": dialog }) + dialog = "" + + # new content type for image + hdr, idata = content["image_url"]["url"].split(";base64,") + parts.append({ + "inline_data": { + "mime_type": hdr.split("data:")[1], + "data": idata + } + }) + dialog += sep + else: + raise ValueError(f"Invalid role: {message['role']}") + + parts.append({ + "text": dialog + "ASSISTANT:\n" + }) + + new_messages = [{ + "parts": parts, + "role": "user" + }] + + return new_messages + + def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]: + + headers = { + "Content-Type": "application/json" + } + + proxies = PROXIES + url = f"https://generativelanguage.googleapis.com/v1beta/models/{model_name}-latest:generateContent?key={self.api_key}" + + generation_config = { + "temperature": args.get('temperature'), + "maxOutputTokens": args.get('max_tokens'), + "stopSequences": ["\n\n###\n\n"] + } + + safety_settings = [ + { + "category": "HARM_CATEGORY_DANGEROUS_CONTENT", + "threshold": "BLOCK_ONLY_HIGH" + }, + { + "category": "HARM_CATEGORY_HARASSMENT", + "threshold": "BLOCK_ONLY_HIGH" + }, + { + "category": "HARM_CATEGORY_HATE_SPEECH", + "threshold": "BLOCK_ONLY_HIGH" + }, + { + "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", + "threshold": "BLOCK_ONLY_HIGH" + } + ] + + payload = { + "contents": messages, + "generationConfig": generation_config, + "safetySettings": safety_settings + } + + try: + response = requests.post(url, headers=headers, json=payload, proxies=proxies, timeout=30) + response = response.json() + except Exception as e: + return False, str(e) + + if "error" in response: + return False, response["error"]["message"] + if "content" not in response['candidates'][0]: + self.generation_config['maxOutputTokens'] *= 2 + return False, "No content generated." 
+ return True, response['candidates'][0]['content']['parts'][0]['text'] + + +class VertexGeminiModel(BasicModel): + def __init__(self): + super().__init__() + + def prompt_construct(self, messages: List[Dict]) -> List[Dict]: + parts = [] + dialog = "" + sep = "\n\n###\n\n" + for message in messages: + if message["role"] == "system": + dialog += f"SYSTEM:\n{message['content']}{sep}" + elif message["role"] == "assistant": + dialog += f"ASSISTANT:\n{message['content']}{sep}" + elif message["role"] == "user": + dialog += "USER:\n" + for content in message["content"]: + if content["type"] == "text": + dialog += content["text"] + "\n" + continue + + assert content["type"] == "image_url" + # save text + parts.append({ "text": dialog }) + dialog = "" + # new content type for image + hdr, idata = content["image_url"]["url"].split(";base64,") + parts.append({ + "inline_data": { + "mime_type": hdr.split("data:")[1], + "data": idata + } + }) + dialog += sep + else: + raise ValueError(f"Invalid role: {message['role']}") + + parts.append({ + "text": dialog + "ASSISTANT:\n" + }) + + new_messages = [{ + "parts": parts, + "role": "user" + }] + + return new_messages + + def get_model_response(self, messages: List[Dict], model_name: str, **args) -> Tuple[bool, str]: + def get_gcloud_token(): + def get_token(): + try: + # Load the credentials from the key file + creds = service_account.Credentials.from_service_account_file( + GCLOUD_KEY_FILE_PATH, + # You can list multiple scopes if needed + scopes=['https://www.googleapis.com/auth/cloud-platform'] + ) + # Refresh the token (this is needed even for the first time) + creds.refresh(Request()) + return creds.token + + except Exception as e: + print(f"An error occurred while trying to fetch the gcloud token: {str(e)}") + return None + + os.environ['HTTP_PROXY'] = PROXIES['http'] + os.environ['HTTPS_PROXY'] = PROXIES['https'] + + api_key, fail_time = None, 0 + while not api_key and fail_time < 10: + time.sleep(5) + api_key = get_token() + fail_time += 1 
+ + return api_key + + def get_url(model_name: str) -> str: + region_code = GCLOUD_REGIONAL_CODE + model_id = f"{model_name}:generateContent" + with open(GCLOUD_KEY_FILE_PATH, "r") as f: + project_id = json.load(f)["project_id"] + return f"https://{region_code}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{region_code}/publishers/google/models/{model_id}" + + url = get_url(model_name) + api_key = get_gcloud_token() + + if not api_key: + return False, "Failed to fetch gcloud token." + + headers = { + "Content-Type": "application/json", + "Authorization": f"Bearer {api_key}" + } + + proxies = PROXIES + safety_settings = [ + { + "category": "HARM_CATEGORY_DANGEROUS_CONTENT", + "threshold": "BLOCK_ONLY_HIGH" + }, + { + "category": "HARM_CATEGORY_HARASSMENT", + "threshold": "BLOCK_ONLY_HIGH" + }, + { + "category": "HARM_CATEGORY_HATE_SPEECH", + "threshold": "BLOCK_ONLY_HIGH" + }, + { + "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", + "threshold": "BLOCK_ONLY_HIGH" + } + ] + + generation_config = { + "temperature": args.get('temperature'), + "maxOutputTokens": args.get('max_tokens'), + "stopSequences": ["\n\n###\n\n"] + } + + payload = { + "contents": messages, + "generationConfig": generation_config, + "safetySettings": safety_settings + } + + try: + response = requests.post(url, headers=headers, json=payload, proxies=proxies, timeout=120) + response = response.json() + except Exception as e: + return False, str(e) + + if "error" in response: + return False, response["error"]["message"] + if "content" not in response['candidates'][0]: + self.generation_config['maxOutputTokens'] *= 2 + return False, "No content generated." 
+ + return True, response['candidates'][0]['content']['parts'][0]['text'] + + +class APIModel(object): + def __init__(self): + super().__init__() + self.models = { + "openai": OpenAIModel(), + "gemini": VertexGeminiModel(), + "qwen": QwenModel(), + "finetuned": FinetuneModel(), + "claude": ClaudeModel() + } + + def inference(self, model_id: str, messages: List[Dict], args: Dict) -> Tuple[bool, str]: + model_provider, model_name = model_id.split("_")[:2] if "_" in model_id else (model_id, model_id) # eg. "openai_gpt4o" + if model_provider not in self.models: + return False, f"Unsupported model: {model_provider} ({model_name})" + + model = self.models[model_provider] + prompt = model.prompt_construct(messages) + resp = model.get_model_response(prompt, model_name, **args) + + return resp + +model = APIModel() + +def generate_with_api(prompt: List[dict], model_id: str, args: Dict) -> str: + success, response = model.inference(model_id, prompt, args) + return response + +if __name__ == "__main__": + path_to_image = "../../coco_images/000000000285.jpg" + + from PIL import Image + image = Image.open(path_to_image) + img_str = BasicModel.pil_to_b64(image) + + messages = [ + { + "role": "system", + "content": "You are a helpful assistant. Please response concisely." + }, + { + "role": "user", + "content": [{ + "type": "text", + "text": "what's annotated in this image? Image: Omitted." + }] + }, + { + "role": "assistant", + "content": "Only 5.cart is annotated in this image." + }, + { + "role": "user", + "content": [{ + "type": "text", + "text": "What can you see?" 
+ },{ + "type": "image_url", + "image_url": { + "url": img_str, + "detail": "high" + } + }] + } + ] + + response = generate_with_api(messages, "openai", { + "temperature": 0.5, + "max_tokens": 1024, + "top_p": 0.9, + "n": 1, + }) \ No newline at end of file diff --git a/VAB-WebArena-Lite/new/evaluators.py b/VAB-WebArena-Lite/new/evaluators.py new file mode 100644 index 0000000..85b5363 --- /dev/null +++ b/VAB-WebArena-Lite/new/evaluators.py @@ -0,0 +1,649 @@ +"""base class for evaluation""" +# answer string match +import importlib +import json +import re +import time +import urllib +from pathlib import Path +from typing import Any, Optional, Tuple, Union +from urllib.parse import urljoin + +import evaluate # type: ignore[import] +import requests +from beartype import beartype +from beartype.door import is_bearable +from nltk.tokenize import word_tokenize # type: ignore +from PIL import Image +from playwright.sync_api import CDPSession, Page + +from browser_env.actions import Action +from browser_env.utils import StateInfo +from evaluation_harness import image_utils +from evaluation_harness.helper_functions import ( + PseudoPage, + get_query_text, + get_query_text_lowercase, + gitlab_get_project_memeber_role, + llm_fuzzy_match, + llm_ua_match, + reddit_get_latest_comment_content_by_username, + reddit_get_latest_comment_obj_by_username, + reddit_get_parent_comment_username_of_latest_comment_by_username, + reddit_get_post_url, + shopping_get_latest_order_url, + shopping_get_num_reviews, + shopping_get_order_product_name_list, + shopping_get_order_product_option, + shopping_get_order_product_quantity, + shopping_get_product_attributes, + shopping_get_product_price, + shopping_get_rating_as_percentage, + shopping_get_sku_latest_review_author, + shopping_get_sku_latest_review_rating, + shopping_get_sku_latest_review_text, +) + +Trajectory = list[Union[Action, StateInfo]] + + +@beartype +class Evaluator(object): + def __init__(self, eval_tag: str = "") -> None: + 
self.eval_tag = eval_tag + + def __call__( + self, + trajectory: Trajectory, + config_file: Path | str, + page: Page | PseudoPage + ) -> float: + raise NotImplementedError + + @staticmethod + def get_last_action(trajectory: Trajectory) -> Action: + try: + is_bearable(trajectory[-1], Action) + last_action = trajectory[-1] + except Exception: + raise ValueError( + "The last element of trajectory should be an action, add a fake stop action if needed" + ) + + return last_action # type: ignore[return-value] + + @staticmethod + def get_last_state(trajectory: Trajectory) -> StateInfo: + try: + is_bearable(trajectory[-2], StateInfo) + last_state = trajectory[-2] + except Exception: + raise ValueError( + "The second last element of trajectory should be a state, add a fake stop action if needed" + ) + + return last_state # type: ignore[return-value] + + +@beartype +class NumericEvaluator(Evaluator): + """Check if the numerical relationship is correct""" + + @staticmethod + @beartype + def str_2_int(s: str) -> Optional[int]: + try: + s = s.strip() + if "," in s: + s = s.replace(",", "") + + return int(s) + except ValueError: + # Return None if the string cannot be converted to int + print(f"[NumericEvaluator error]: Cannot convert {s} to int") + return None + + @staticmethod + @beartype + def compare_inequality( + value: Union[int, float], inequality: str, tol: float = 1e-8 + ) -> bool: + """ + Compare a value (int or float) against an inequality string. + + Args: + - value (int/float): The value to be compared. + - inequality (str): Inequality in the form of "< 700", ">= 300", etc. + - tol (float): Tolerance for floating point comparisons. + + Returns: + - bool: True if the value satisfies the inequality, False otherwise. 
+ """ + # Extract the operator and the number from the inequality string + ops = { + "<=": lambda x, y: x <= y + tol, + ">=": lambda x, y: x >= y - tol, + "==": lambda x, y: abs(x - y) <= tol, + "<": lambda x, y: x < y + tol, + ">": lambda x, y: x > y - tol, + } + + for op, func in ops.items(): + if op in inequality: + _, num = inequality.split(op) + return func(value, float(num.strip())) + + raise ValueError(f"Invalid inequality string: {inequality}") + + +@beartype +class StringEvaluator(Evaluator): + """Check whether the answer is correct with: + exact match: the answer is exactly the same as the reference answer + must include: each phrase in the reference answer must be included in the answer + fuzzy match: the answer is similar to the reference answer, using LLM judge + """ + + @staticmethod + @beartype + def clean_answer(answer: str) -> str: + if answer.startswith("'") and answer.endswith("'"): + answer = answer[1:-1] + elif answer.startswith('"') and answer.endswith('"'): + answer = answer[1:-1] + return answer.lower() + + @staticmethod + @beartype + def exact_match(ref: str, pred: Union[str, int]) -> float: + if isinstance(pred, int): + pred = str(pred) + return float( + StringEvaluator.clean_answer(pred) + == StringEvaluator.clean_answer(ref) + ) + + @staticmethod + @beartype + def must_include(ref: str, pred: str) -> float: + clean_ref = StringEvaluator.clean_answer(ref) + clean_pred = StringEvaluator.clean_answer(pred) + # tokenize the answer if the ref is a single word + # prevent false positive (e.g, 0) + if len(word_tokenize(clean_ref)) == 1: + tok_pred = word_tokenize(clean_pred) + return float(clean_ref in tok_pred) + else: + return float(clean_ref in clean_pred) + + @staticmethod + @beartype + def must_exclude(ref: str, pred: str) -> float: + """Returns 1 if pred is not in ref, and 0 otherwise""" + clean_ref = StringEvaluator.clean_answer(ref) + clean_pred = StringEvaluator.clean_answer(pred) + # tokenize the answer if the ref is a single word + # 
prevent false positive (e.g, 0) + if len(word_tokenize(clean_ref)) == 1: + tok_pred = word_tokenize(clean_pred) + return float(clean_ref not in tok_pred) + else: + return float(clean_ref not in clean_pred) + + @staticmethod + @beartype + def fuzzy_match(ref: str, pred: str, intent: str) -> float: + return llm_fuzzy_match(pred, ref, intent) + + @staticmethod + @beartype + def ua_match(ref: str, pred: str, intent: str) -> float: + return llm_ua_match(pred, ref, intent) + + def __call__( + self, + trajectory: Trajectory, + config_file: Path | str, + page: Page | PseudoPage | None = None + ) -> float: + with open(config_file, "r") as f: + configs = json.load(f) + + last_action = self.get_last_action(trajectory) + pred = self.clean_answer(last_action["answer"]) + + score = 1.0 + for approach, value in configs["eval"]["reference_answers"].items(): + match approach: + case "exact_match": + score *= self.exact_match(ref=value, pred=pred) + case "required_values": + required_values = value + assert isinstance(required_values, list) + pred = NumericEvaluator.str_2_int(pred) + if pred is None: + score = 0.0 + else: + for v in required_values: + value_or = v.split(" |OR| ") + score *= any( + [ + NumericEvaluator.compare_inequality( + pred, value + ) + for value in value_or + ] + ) + case "must_include": + assert isinstance(value, list) + for must_value in value: + value_or = must_value.split(" |OR| ") + score *= any([self.must_include(ref=v, pred=pred) for v in value_or]) + case "must_exclude": + assert isinstance(value, list) + for must_excl_value in value: + score *= self.must_exclude( + ref=must_excl_value, pred=pred + ) + case "one_of": + assert isinstance(value, list) + found = False + for one_of_value in value: + one_of_value = self.clean_answer(one_of_value) + if one_of_value in pred: + found = True + break + score = score * found + case "fuzzy_match": + intent = configs["intent"] + if value == "N/A": + # if the instruction only asks the model to generate N/A when 
encountering an unachievable task + # without more concrete reasons + score *= self.exact_match(ref=value, pred=pred) + # if the instruction also asks the model to generate the reason why the task is unachievable + # this should be the default as it will prevent false positive N/A` + if score != 1: + score = 1.0 * self.ua_match( + intent=configs["intent"], + ref=configs["eval"]["string_note"], + pred=pred, + ) + else: + assert isinstance(value, list) + reference = ', '.join(value) + score *= self.fuzzy_match( + ref=reference, pred=pred, intent=intent + ) + return score + + +@beartype +class StringSoftEvaluator(Evaluator): + """Use text generation metrics such as BLEU, ROUGE, etc. to evaluate the answer""" + + def __call__( + self, + trajectory: Trajectory, + config_file: Path | str, + page: Page | PseudoPage | None = None + ) -> float: + with open(config_file, "r") as f: + configs = json.load(f) + + last_action = self.get_last_action(trajectory) + pred = last_action["answer"] + ref = configs["eval"]["reference_answers"] + # rouge + m = evaluate.load("rouge") + rouge = m.compute(predictions=[pred], references=[ref]) + return float(rouge["rouge1"]) + + +@beartype +class URLExactEvaluator(Evaluator): + """Check whether the URL is exactly the same as of the reference URLs""" + + def __call__( + self, + trajectory: Trajectory, + config_file: Path | str, + page: Page | PseudoPage + ) -> float: + with open(config_file, "r") as f: + configs = json.load(f) + + def clean_url(url: str) -> str: + url = str(url) + # Replace http://localhost with http://127.0.0.1 to keep things consistent across evals. 
+ url = url.replace("localhost", "127.0.0.1") + if url.endswith("/"): + url = url[:-1] + return url + + pred = clean_url(page.url) + ref_urls = configs["eval"]["reference_url"].split(" |OR| ") + ref_urls = [clean_url(url) for url in ref_urls] + matching_rule = configs["eval"].get("url_note", "EXACT") + if matching_rule == "EXACT": + if pred in ref_urls: + return 1.0 + else: + return 0.0 + elif matching_rule == "GOLD in PRED": + if any([ref in pred for ref in ref_urls]): + return 1.0 + else: + return 0.0 + else: + raise ValueError(f"Unknown matching rule: {matching_rule}") + + +@beartype +class HTMLContentExactEvaluator(Evaluator): + """Check whether the contents appear in the page""" + + @staticmethod + @beartype + def fuzzy_match(ref: str, pred: str, intent: str) -> float: + return llm_fuzzy_match(pred, ref, intent) + + def __call__( + self, + trajectory: Trajectory, + config_file: Path | str, + page: Page | PseudoPage + ) -> float: + with open(config_file, "r") as f: + configs = json.load(f) + + targets = configs["eval"]["program_html"] + + score = 1.0 + for target in targets: + target_url: str = target["url"] # which url to check + if target_url.startswith("func"): + func = target_url.split("func:")[1] + func = func.replace("__last_url__", page.url) + target_url = eval(func) + + locator: str = target["locator"] # js element locator + + # navigate to that url + if target_url != "last": + page.goto(target_url) + time.sleep(3) # TODO [shuyanzh]: fix this hard-coded sleep + + # empty, use the full page + if not locator.strip(): + selected_element = page.content() + # use JS to select the element + elif locator.startswith("document.") or locator.startswith( + "[...document." 
+ ): + if "prep_actions" in target: + try: + for prep_action in target["prep_actions"]: + page.evaluate(f"() => {prep_action}") + except Exception: + pass + try: + selected_element = str(page.evaluate(f"() => {locator}")) + if not selected_element: + selected_element = "" + except Exception: + # the page is wrong, return empty + selected_element = "" + elif locator.startswith("lambda:"): + try: + locator = locator.removeprefix("lambda:") + selected_element = page.evaluate(locator) + if not selected_element: + selected_element = None + except Exception: + # the page is wrong, return empty + selected_element = None + # run program to call API + elif locator.startswith("func:"): # a helper function + func = locator.split("func:")[1] + func = func.replace("__page__", "page") + selected_element = eval(func) + else: + raise ValueError(f"Unknown locator: {locator}") + + # If the selected element is None, then the page is wrong + if selected_element is None: + score = 0.0 + break + + if "exact_match" in target["required_contents"]: + required_contents = target["required_contents"]["exact_match"] + score *= StringEvaluator.exact_match( + ref=required_contents, pred=selected_element + ) + elif "must_include" in target["required_contents"]: + required_contents = target["required_contents"]["must_include"] + assert isinstance(required_contents, list) + for content in required_contents: + content_or = content.split(" |OR| ") + score *= any( + [ + StringEvaluator.must_include( + ref=content, pred=selected_element + ) + for content in content_or + ] + ) + elif "must_exclude" in target["required_contents"]: + required_contents = target["required_contents"]["must_exclude"] + assert isinstance(required_contents, list) + for content in required_contents: + assert " |OR| " not in content + score *= StringEvaluator.must_exclude( + content, pred=selected_element + ) + elif "required_values" in target["required_contents"]: + required_values = target["required_contents"][ + "required_values" + 
] + assert isinstance(required_values, list) + if isinstance(selected_element, str): + selected_element = NumericEvaluator.str_2_int( + selected_element + ) + if selected_element is None: + score = 0.0 + else: + for value in required_values: + value_or = value.split(" |OR| ") + score *= any( + [ + NumericEvaluator.compare_inequality( + selected_element, value + ) + for value in value_or + ] + ) + elif "fuzzy_match" in target["required_contents"]: + required_contents = target["required_contents"]["fuzzy_match"] + intent = configs["intent"] + + assert isinstance(required_contents, list) + reference = ', '.join(required_contents) + score *= self.fuzzy_match( + ref=reference, pred=selected_element, intent=intent + ) + else: + raise ValueError( + f"Unknown required_contents: {target['required_contents'].keys()}" + ) + + return score + + +@beartype +class PageImageEvaluator(Evaluator): + """Check whether the answer is correct by querying a vision model.""" + + def __init__(self, captioning_fn): + self.captioning_fn = captioning_fn + # Default to 0.8 as the threshold for similarity to account for compression, resizing, etc + # This might be too generous but we bias towards minimizing false negatives. 
+ self.ssim_threshold = 0.8 + + def __call__( + self, + trajectory: Trajectory, + config_file: Path | str, + page: Page | PseudoPage | None = None + ) -> float: + with open(config_file, "r") as f: + configs = json.load(f) + + for query in configs["eval"]["page_image_query"]: + locator: str = query["eval_image_class"] + target_url: str = query["eval_image_url"] + if target_url.startswith("func"): + func = target_url.split("func:")[1] + func = func.replace("__last_url__", page.url) + target_url = eval(func) + + # navigate to that url + if target_url != "last": + page.goto(target_url) + time.sleep(3) # TODO(jykoh): fix this hard-coded sleep + + # empty, use the full page + if not locator.strip(): + images = page.get_by_role("img").all() + # use JS to select the element + elif locator.startswith("."): + # Get all img children under the locator + elements = page.query_selector_all(locator) + images = [] + for element in elements: + is_img = element.evaluate( + 'element => element.tagName === "IMG"' + ) + if is_img: + images.append(element) + else: + images.extend(element.query_selector_all("img")) + else: + raise ValueError(f"Unknown locator: {locator}") + + if images == []: + return 0.0 + + all_image_pixels = [] + for image in images: + try: + # Get image from URL. + image_url = image.get_attribute("src") + if not image_url.startswith( + ("http://", "https://", "www.") + ): + image_url = urljoin(page.url, image_url) + image = Image.open( + requests.get(image_url, stream=True).raw + ) + all_image_pixels.append(image) + except Exception as e: + print("[WARNING]: ", e) + + score = 1.0 + if all_image_pixels == []: + return 0.0 + else: + # Run the VQA eval on the image elements. 
+ eval_vqas = query.get("eval_vqa", []) + assert ( + len(eval_vqas) > 0 or "eval_fuzzy_image_match" in query + ), "eval_vqa must have at least 1 question or eval_fuzzy_image_match must be set" + for qa in eval_vqas: + question, answer = qa["question"], qa["answer"] + prompt = f"Q: {question} A:" + pred_ans = self.captioning_fn( + all_image_pixels, [prompt] * len(all_image_pixels) + ) + score *= float( + any( + [answer.lower() in ans.lower() for ans in pred_ans] + ) + ) + + if "eval_fuzzy_image_match" in query: + ssim_threshold = query.get( + "ssim_threshold", self.ssim_threshold + ) + exact_match_imgs = query["eval_fuzzy_image_match"].split( + " |OR| " + ) + all_exact_match_pixels = [] + + for exact_match_img in exact_match_imgs: + if exact_match_img.startswith("http"): + exact_match_pixels = Image.open( + requests.get(exact_match_img, stream=True).raw + ) + else: + exact_match_pixels = Image.open(exact_match_img) + all_exact_match_pixels.append(exact_match_pixels) + + # Check if any of the images on the page match + found_exact_match = False + for exact_match_pixels in all_exact_match_pixels: + for image_pixels in all_image_pixels: + ssim = image_utils.get_image_ssim( + image_pixels, exact_match_pixels + ) + if ssim > ssim_threshold: + found_exact_match = True + break + score *= float(found_exact_match) + + return score + + +class EvaluatorComb: + def __init__(self, evaluators: list[Evaluator]) -> None: + self.evaluators = evaluators + + def __call__( + self, + trajectory: Trajectory, + config_file: Path | str, + page: Page | PseudoPage + ) -> float: + + score = 1.0 + for evaluator in self.evaluators: + cur_score = evaluator(trajectory, config_file, page) + score *= cur_score + + return score + + +@beartype +def evaluator_router( + config_file: Path | str, captioning_fn=None +) -> EvaluatorComb: + """Router to get the evaluator class""" + with open(config_file, "r") as f: + configs = json.load(f) + + eval_types = configs["eval"]["eval_types"] + evaluators: 
list[Evaluator | EvaluatorPartial] = [] + for eval_type in eval_types: + match eval_type: + case "string_match": + evaluators.append(StringEvaluator()) + case "url_match": + evaluators.append(URLExactEvaluator()) + case "program_html": + evaluators.append(HTMLContentExactEvaluator()) + case "page_image_query": + evaluators.append(PageImageEvaluator(captioning_fn)) + case _: + raise ValueError(f"eval_type {eval_type} is not supported") + + return EvaluatorComb(evaluators) diff --git a/VAB-WebArena-Lite/new/generate_test_data.py b/VAB-WebArena-Lite/new/generate_test_data.py new file mode 100644 index 0000000..0938d29 --- /dev/null +++ b/VAB-WebArena-Lite/new/generate_test_data.py @@ -0,0 +1,63 @@ +"""Replace the website placeholders with website domains from env_config +Generate the test data""" +import json +import os + +from browser_env.env_config import * + + +def main() -> None: + DATASET = os.environ["DATASET"] + if DATASET == "webarena": + print("DATASET: webarena") + print(f"REDDIT: {REDDIT}") + print(f"SHOPPING: {SHOPPING}") + print(f"SHOPPING_ADMIN: {SHOPPING_ADMIN}") + print(f"GITLAB: {GITLAB}") + print(f"WIKIPEDIA: {WIKIPEDIA}") + print(f"MAP: {MAP}") + + inp_paths = ["config_files/wa/test_webarena.raw.json", "config_files/wa/test_webarena_lite.raw.json"] + replace_map = { + "__REDDIT__": REDDIT, + "__SHOPPING__": SHOPPING, + "__SHOPPING_ADMIN__": SHOPPING_ADMIN, + "__GITLAB__": GITLAB, + "__WIKIPEDIA__": WIKIPEDIA, + "__MAP__": MAP, + } + elif DATASET == "visualwebarena": + print("DATASET: visualwebarena") + print(f"CLASSIFIEDS: {CLASSIFIEDS}") + print(f"REDDIT: {REDDIT}") + print(f"SHOPPING: {SHOPPING}") + inp_paths = [ + "config_files/vwa/test_classifieds.raw.json", "config_files/vwa/test_shopping.raw.json", "config_files/vwa/test_reddit.raw.json", + ] + replace_map = { + "__REDDIT__": REDDIT, + "__SHOPPING__": SHOPPING, + "__WIKIPEDIA__": WIKIPEDIA, + "__CLASSIFIEDS__": CLASSIFIEDS, + } + else: + raise ValueError(f"Dataset not implemented: {DATASET}") 
+ + for inp_path in inp_paths: + output_dir = inp_path.replace('.raw.json', '') + os.makedirs(output_dir, exist_ok=True) + with open(inp_path, "r") as f: + raw = f.read() + for k, v in replace_map.items(): + raw = raw.replace(k, v) + + with open(inp_path.replace(".raw", ""), "w") as f: + f.write(raw) + data = json.loads(raw) + for idx, item in enumerate(data): + with open(os.path.join(output_dir, f"{idx}.json"), "w") as f: + json.dump(item, f, indent=2) + + +if __name__ == "__main__": + main() diff --git a/VAB-WebArena-Lite/new/helper_functions.py b/VAB-WebArena-Lite/new/helper_functions.py new file mode 100644 index 0000000..f98b24d --- /dev/null +++ b/VAB-WebArena-Lite/new/helper_functions.py @@ -0,0 +1,647 @@ +"""Implements helper functions to assist evaluation cases where other evaluators are not suitable.""" +import json +from datetime import datetime, timezone +from typing import Any, Union +from urllib.parse import urlparse + +import requests +from beartype import beartype +from beartype.typing import Dict, List +from playwright.sync_api import CDPSession, Page + +from browser_env.env_config import ( + ACCOUNTS, + REDDIT, + SHOPPING, + WIKIPEDIA, +) +from llms.providers.openai_utils import ( + generate_from_openai_chat_completion, +) + +import logging +logger = logging.getLogger("logger") + +class PseudoPage: + def __init__(self, original_page: Page, url: str): + self.url = url + self.original_page = original_page + + def __getattr__(self, attr: str) -> Any: + # Delegate attribute access to the original page object + if attr not in ["url"]: + return getattr(self.original_page, attr) + else: + return getattr(self, attr) + + +@beartype +def shopping_get_auth_token() -> str: + response = requests.post( + url=f"{SHOPPING}/rest/default/V1/integration/admin/token", + headers={"content-type": "application/json"}, + data=json.dumps( + { + "username": ACCOUNTS["shopping_site_admin"]["username"], + "password": ACCOUNTS["shopping_site_admin"]["password"], + } + ), + ) 
+ token: str = response.json() + return token + + +@beartype +def shopping_get_latest_order_url() -> str: + """Get the latest order url from the shopping website.""" + + header = { + "Authorization": f"Bearer {shopping_get_auth_token()}", + "Content-Type": "application/json", + } + + params = { + "searchCriteria[sortOrders][0][field]": "created_at", + "searchCriteria[sortOrders][0][direction]": "DESC", + "searchCriteria[pageSize]": "1", + } + + response = requests.get( + f"{SHOPPING}/rest/V1/orders", params=params, headers=header + ) + assert response.status_code == 200 + response_obj = response.json()["items"][0] + order_id = int(response_obj["increment_id"]) + order_url = f"{SHOPPING}/sales/order/view/order_id/{order_id}/" + return order_url + + +@beartype +def shopping_get_sku_latest_review_author(sku: str) -> str: + """Get the latest review author for shopping admin.""" + header = { + "Authorization": f"Bearer {shopping_get_auth_token()}", + "Content-Type": "application/json", + } + response = requests.get( + f"{SHOPPING}/rest/V1/products/{sku}/reviews", headers=header + ) + assert response.status_code == 200 + response_obj = response.json() + if len(response_obj) == 0: + return "" + author: str = response_obj[-1]["nickname"] + return author + + +@beartype +def shopping_get_sku_latest_review_rating(sku: str) -> str: + """Get the latest review rating for shopping admin.""" + header = { + "Authorization": f"Bearer {shopping_get_auth_token()}", + "Content-Type": "application/json", + } + response = requests.get( + f"{SHOPPING}/rest/V1/products/{sku}/reviews", headers=header + ) + assert response.status_code == 200 + response_obj = response.json() + if len(response_obj) == 0: + return "" + assert response_obj[0]["ratings"][0]["rating_name"] == "Rating" + rating: str = str(response_obj[-1]["ratings"][0]["percent"]) + return rating + + +@beartype +def shopping_get_sku_latest_review_text(sku: str) -> str: + """Get the latest review text for shopping admin.""" + header = { + 
"Authorization": f"Bearer {shopping_get_auth_token()}", + "Content-Type": "application/json", + } + response = requests.get( + f"{SHOPPING}/rest/V1/products/{sku}/reviews", headers=header + ) + assert response.status_code == 200 + response_obj = response.json() + if len(response_obj) == 0: + return "" + text: str = response_obj[-1]["detail"] + return text + + +@beartype +def shopping_get_sku_latest_review_title(sku: str) -> str: + """Get the latest review title for shopping admin.""" + header = { + "Authorization": f"Bearer {shopping_get_auth_token()}", + "Content-Type": "application/json", + } + response = requests.get( + f"{SHOPPING}/rest/V1/products/{sku}/reviews", headers=header + ) + assert response.status_code == 200 + response_obj = response.json() + if len(response_obj) == 0: + return "" + title: str = response_obj[-1]["title"] + return title + + +@beartype +def shopping_get_sku_product_page_url(sku: str) -> str: + """Get product page url from sku""" + header = { + "Authorization": f"Bearer {shopping_get_auth_token()}", + "Content-Type": "application/json", + } + response = requests.get( + f"{SHOPPING}/rest/V1/products/{sku}", headers=header + ) + assert response.status_code == 200 + response_obj = response.json() + if len(response_obj) == 0: + return "" + for custom_attributes in response_obj["custom_attributes"]: + if custom_attributes["attribute_code"] == "url_key": + return f"{SHOPPING}/{custom_attributes['value']}.html" + return "" + + +@beartype +def shopping_get_all_product_order( + page: Page | PseudoPage, +) -> List[Dict[str, str]]: + """ + Get info of all product in a given order page. 
+ + Example output: + [ + { + "name": "Kellogg's Special K Protein Bars, Meal Replacement, Protein Snacks, Value Size, Strawberry, 19oz Box (12 Bars)\nSize\n12 Count (Pack of 1)", + "options": { + "Size": "12 Count (Pack of 1)" + }, + "sku": "B00MXUFL0E", + "price": "$24.50", + "qty": "Ordered2", + "subtotal": "$49.00" + }, + { + "name": "Kellogg's Special K Protein Bars, Meal Replacement, Protein Snacks, Value Size, Chocolatey Chip Cookie Dough, 19oz Box (12 Bars)", + "sku": "B07ZD2PB9F", + "price": "$42.30", + "qty": "Ordered2", + "subtotal": "$84.60" + } + ] + """ + try: + result = page.evaluate( + f""" +(() => {{ + try {{ + const products = [...document.querySelector("#my-orders-table").getElementsByTagName('tbody')].map( + (x) => {{ + return [...x.getElementsByTagName('td')].reduce(function(obj, y) {{ + const key = y.className.split(' ')[1]; + obj[key] = y.outerText; + // check if options exist + if (key === 'name' && y.querySelector('dl')) {{ + var option_dict = {{}} + const options = [...y.querySelector('dl').children]; + for (let i = 0; i < options.length; i += 2) {{ + option_dict[options[i].outerText] = options[i+1].outerText; + }} + obj['options'] = option_dict; + }} + return obj; + }}, {{}}) + }} + ); + return products; + }} catch (e) {{ + // If any errors are caught, return an empty string + return e; + return []; + }} +}})(); + """ + ) + return result + except Exception as e: + result = [] + + return result + + +@beartype +def shopping_get_order_product_name_list(page: Page | PseudoPage) -> str: + try: + products = shopping_get_all_product_order(page) + + return " |OR| ".join([p["name"] for p in products]) + except Exception: + return "" + + +@beartype +def shopping_get_order_product_quantity( + page: Page | PseudoPage, sku: str +) -> int: + try: + if "|OR|" in sku: + skus = sku.split(" |OR| ") + else: + skus = [sku] + + products = shopping_get_all_product_order(page) + for product in products: + if product["sku"].strip() in skus: + # Ordered{qty} + 
return int(product["qty"][7:]) + return 0 + except Exception: + return 0 + + +@beartype +def shopping_get_order_product_option( + page: Page | PseudoPage, sku: str, option_name: str +) -> str: + try: + products = shopping_get_all_product_order(page) + for product in products: + if product["sku"].strip() == sku: + # Ordered{qty} + return product["options"][option_name] + return "" + except Exception as e: + return "" + + +@beartype +def shopping_get_product_attributes( + page: Page | PseudoPage, attribute: str +) -> str: + # Get the values of all cells in the table for the given attribute + try: + result = page.evaluate( + f""" + (() => {{ + try {{ + // Create an array of search terms, splitting the string by ' |OR| ' + const searchTerms = '{attribute}'.toLowerCase().split(' |or| '); + // Convert the children of the tbody inside the element with the given ID into an array + return Array.from( + document.querySelector('#productDetails_detailBullets_sections1 > tbody').children + ) + // Filter the array to only include elements where the first child's text includes any of the search terms + .filter(x => + searchTerms.some(term => x.children[0].outerText.toLowerCase().includes(term)) + ) + // Map over the filtered elements to get the outerText of their second child + .map(x => x.children[1].outerText) + // Join all the resulting strings with a comma and a space + .join(', ') + }} catch (e) {{ + // If any errors are caught, return an empty string + return '' + }} + }})(); + """ + ) + except Exception: + result = "" + + return result + + +@beartype +def shopping_get_product_price(page: Page | PseudoPage) -> Union[float, int]: + """Get the price of the product on the shopping website.""" + try: + result = page.evaluate( + """ + (() => {{ + res = parseFloat(document.querySelector(\"#maincontent > div.columns > div > div.product-info-main > div.product-info-price > div.price-box.price-final_price > span > span\") + .outerText.substr(1)); + return res ? 
res : 0; + }})(); + """ + ) + except Exception: + result = 0 + + return result + + +@beartype +def shopping_get_num_reviews(page: Page | PseudoPage) -> int: + """Get the number of reviews of the product on the shopping website.""" + try: + result = page.evaluate( + """ + (() => {{ + res = parseInt(document.querySelector(\"#tab-label-reviews-title\") + .outerText.split(' ')[1]); + return res ? res : 0; }} + )(); + """ + ) + except Exception: + result = 0 + + return result + + +@beartype +def shopping_get_rating_as_percentage(page: Page | PseudoPage) -> int: + """Get the rating of the product on the shopping website as a percentage out of 100.""" + try: + rating = page.evaluate( + """ + (() => {{ + ratingPercentage = parseFloat(document.querySelector('.rating-result').title.replace('%', '')); + return ratingPercentage ? ratingPercentage : 0; + }})(); + """ + ) + except Exception: + rating = 0 + + return rating + + +@beartype +def get_query_text(page: Page | PseudoPage, selector: str) -> str: + """Get the text content of the element matching the given selector. + + Note that this function DOES NOT perform downcasing. + """ + try: + result = page.evaluate( + f""" + (() => {{ + try {{ + return document.querySelector('{selector}').textContent; + }} catch (e) {{ + return ''; + }} + }})(); + """ + ) + except Exception: + result = "" + + return result + + +@beartype +def get_query_text_lowercase(page: Page | PseudoPage, selector: str) -> str: + """Get the lowercase text content of the element matching the given selector.""" + return get_query_text(page, selector).lower() + + +@beartype +def reddit_get_post_url(url: str) -> str: + """Get the post url""" + # Url is http://domain/f/subreddit/post_id/...
+ # get domain, subreddit, post_id + domain = urlparse(url).netloc + tok_url = urlparse(url).path.split("/") + # not a valid post/comment url, return the url as is + if len(tok_url) < 4: + return url + if tok_url[1] != "f": + return url + subreddit = urlparse(url).path.split("/")[2] + post_id = urlparse(url).path.split("/")[3] + scheme = urlparse(url).scheme + post_url = f"{scheme}://{domain}/f/{subreddit}/{post_id}/" + return post_url + + +@beartype +def reddit_get_post_comment_tree(page: Page | PseudoPage) -> Dict[str, Any]: + try: + comment_tree = page.evaluate( + f"""(function buildCommentTree(node, data_level) {{ + let tree = {{ + "username": node.querySelector(".fg-inherit").outerText, + "net_score": parseInt(node.querySelector(".vote__net-score").outerText), + "content": node.querySelector(".comment__content").outerText, + "time": new Date(node.querySelector('.comment__main > header > h1 > span > time').dateTime), + "children": [] + }}; + node.querySelectorAll(".comment").forEach((child) => {{ + if (parseInt(child.getAttribute('data-level')) === data_level+1) {{ + tree['children'].push(buildCommentTree(child, data_level+1)); + }} + }}) + + return tree; +}})(document.querySelector("#main"), 0)""" + ) + except Exception: + comment_tree = {} + + return comment_tree + + +@beartype +def reddit_get_latest_comment_obj_by_username( + page: Page | PseudoPage, username: str +) -> Dict[str, Any]: + try: + comment_tree = reddit_get_post_comment_tree(page) + latest_time = datetime.min.replace(tzinfo=timezone.utc) + comment = {} + + def dfs(node): + nonlocal latest_time + nonlocal comment + if node["username"] == username: + if node["time"] > latest_time: + comment = { + "username": node["username"], + "net_score": node["net_score"], + "content": node["content"], + "time": node["time"], + } + latest_time = node["time"] + + for child in node["children"]: + dfs(child) + + dfs(comment_tree) + + except Exception as e: + comment = {} + return comment + + +@beartype +def 
reddit_get_latest_comment_content_by_username( + page: Page | PseudoPage, username: str +) -> str: + try: + comment = reddit_get_latest_comment_obj_by_username(page, username) + content = comment["content"] + + except Exception: + content = "" + + return content + + +@beartype +def reddit_get_parent_comment_obj_of_latest_comment_by_username( + page: Page | PseudoPage, username: str +) -> Dict[str, Any]: + try: + comment_tree = reddit_get_post_comment_tree(page) + latest_time = datetime.min.replace(tzinfo=timezone.utc) + comment = {} + + def dfs(node): + nonlocal latest_time + nonlocal comment + for child in node["children"]: + if child["username"] == username: + if child["time"] > latest_time: + comment = { + "username": node["username"], + "net_score": node["net_score"], + "content": node["content"], + "time": node["time"], + } + latest_time = child["time"] + else: + dfs(child) + + dfs(comment_tree) + + except Exception: + comment = {} + return comment + + +@beartype +def reddit_get_parent_comment_username_of_latest_comment_by_username( + page: Page | PseudoPage, username: str +) -> str: + try: + comment = reddit_get_parent_comment_obj_of_latest_comment_by_username( + page, username + ) + username = comment["username"] + + except Exception: + username = "" + + return username + + +@beartype +def gitlab_get_project_memeber_role( + page: Page | PseudoPage, account_name: str +) -> str: + # get the account index + try: + account_idx = page.evaluate( + f"""(() => {{ + const elements = document.querySelectorAll("td[data-label='Account'] span.gl-avatar-labeled-sublabel"); + let index = -1; // Default value if not found + + for(let i = 0; i < elements.length; i++) {{ + if(elements[i].outerText === '@{account_name}') {{ + index = i; + break; + }} + }} + + return index; + }})()""" + ) + + # get the role + role: str = page.evaluate( + f"""(() => {{ + return document.querySelectorAll("td.col-max-role span")[{account_idx}].outerText; + }})()""" + ) + except Exception: + role = 
"" + + return role + + +@beartype +def llm_fuzzy_match(pred: str, reference: str, question: str) -> float: + """Check whether the prediction matches the reference with GPT-4-turbo""" + messages: list[dict[str, Any]] = [] + # construct the question to ask + message = "Help a teacher to grade the answer of a student given a question. Keep in mind that the student may use different phrasing or wording to answer the question. The goal is to evaluate whether the answer is semantically equivalent to the reference answer.\n" + message += f"question: {question}\n" + message += f"reference answer: {reference}\n" + message += "all the string 'N/A' that you see is a special sequence that means 'not achievable'\n" + message += f"student answer: {pred}\n" + message += "Conclude the judgement by 'correct', 'incorrect', or 'partially correct'. Only output one of these options, and nothing else." + messages = [ + {"role": "system", "content": "You are a helpful assistant"}, + {"role": "user", "content": message}, + ] + + logger.info(f'[R] {reference}') + logger.info(f'[P] {pred}') + + response = generate_from_openai_chat_completion( + model="gpt-4-1106-preview", + messages=messages, + temperature=0, + max_tokens=768, + top_p=1.0, + context_length=0, + ).lower() + if "partially correct" in response or "incorrect" in response: + return 0.0 + else: + assert "correct" in response, response + return 1.0 + + +def llm_ua_match(pred: str, reference: str, question: str) -> float: + """Check whether the prediction matches the reference with GPT-4-turbo""" + messages: list[dict[str, Any]] = [] + # construct the question to ask + message = "" + message += f"task: {question}\n" + message += f"actual unachievable reason: {reference}\n" + message += f"reported unachievable reason: {pred}\n" + message += ( + "The task described above is inherently unachievable due to the reason specified under 'actual unachievable reason'. 
" + "An individual previously attempted this task and was unable to complete it. They provided a reason for their failure, " + "which is listed under 'reported unachievable reason'. Your role is to review both the actual and reported reasons. " + "Determine if the reported reason aligns with the actual reason, even if implicitly. " + "If the stated reason is in line with the actual reason, respond with 'same'. Otherwise, respond with 'different'." + ) + messages = [ + {"role": "system", "content": "You are a helpful assistant"}, + {"role": "user", "content": message}, + ] + + response = generate_from_openai_chat_completion( + model="gpt-4-1106-preview", + messages=messages, + temperature=0, + max_tokens=768, + top_p=1.0, + context_length=0, + ).lower() + if "different" in response: + return 0.0 + else: + assert "same" in response + return 1.0 diff --git a/VAB-WebArena-Lite/new/llms_init.py b/VAB-WebArena-Lite/new/llms_init.py new file mode 100644 index 0000000..09dba08 --- /dev/null +++ b/VAB-WebArena-Lite/new/llms_init.py @@ -0,0 +1,23 @@ +"""This module is adapt from https://github.com/zeno-ml/zeno-build""" +try: + from .providers.gemini_utils import generate_from_gemini_completion +except: + print('Google Cloud not set up, skipping import of providers.gemini_utils.generate_from_gemini_completion') + +from .providers.hf_utils import generate_from_huggingface_completion +from .providers.openai_utils import ( + generate_from_openai_chat_completion, + generate_from_openai_completion, +) +from .providers.api_utils import ( + generate_with_api, +) +from .utils import call_llm + +__all__ = [ + "generate_from_openai_completion", + "generate_from_openai_chat_completion", + "generate_from_huggingface_completion", + "generate_from_gemini_completion", + "call_llm", +] diff --git a/VAB-WebArena-Lite/new/lm_config.py b/VAB-WebArena-Lite/new/lm_config.py new file mode 100644 index 0000000..7d73a64 --- /dev/null +++ b/VAB-WebArena-Lite/new/lm_config.py @@ -0,0 +1,57 @@ 
+"""Config for language models.""" + +from __future__ import annotations + +import argparse +import dataclasses +from dataclasses import dataclass +from typing import Any + + +@dataclass(frozen=True) +class LMConfig: + """A config for a language model. + + Attributes: + provider: The name of the API provider. + model: The name of the model. + model_cls: The Python class corresponding to the model, mostly for + Hugging Face transformers. + tokenizer_cls: The Python class corresponding to the tokenizer, mostly + for Hugging Face transformers. + mode: The mode of the API calls, e.g., "chat" or "generation". + """ + + provider: str + model: str + model_cls: type | None = None + tokenizer_cls: type | None = None + mode: str | None = None + gen_config: dict[str, Any] = dataclasses.field(default_factory=dict) + + +def construct_llm_config(args: argparse.Namespace) -> LMConfig: + llm_config = LMConfig( + provider=args.provider, model=args.model, mode=args.mode + ) + if args.provider in ["openai", "google", "api", "finetune"]: + llm_config.gen_config["temperature"] = args.temperature + llm_config.gen_config["top_p"] = args.top_p + llm_config.gen_config["context_length"] = args.context_length + llm_config.gen_config["max_tokens"] = args.max_tokens + llm_config.gen_config["stop_token"] = args.stop_token + llm_config.gen_config["max_obs_length"] = args.max_obs_length + llm_config.gen_config["max_retry"] = args.max_retry + elif args.provider == "huggingface": + llm_config.gen_config["temperature"] = args.temperature + llm_config.gen_config["top_p"] = args.top_p + llm_config.gen_config["max_new_tokens"] = args.max_tokens + llm_config.gen_config["stop_sequences"] = ( + [args.stop_token] if args.stop_token else None + ) + llm_config.gen_config["max_obs_length"] = args.max_obs_length + llm_config.gen_config["model_endpoint"] = args.model_endpoint + llm_config.gen_config["max_retry"] = args.max_retry + else: + raise NotImplementedError(f"provider {args.provider} not implemented") + 
return llm_config diff --git a/VAB-WebArena-Lite/new/openai_utils.py b/VAB-WebArena-Lite/new/openai_utils.py new file mode 100644 index 0000000..8374c7a --- /dev/null +++ b/VAB-WebArena-Lite/new/openai_utils.py @@ -0,0 +1,286 @@ +"""Tools to generate from OpenAI prompts. +Adopted from https://github.com/zeno-ml/zeno-build/""" + +import asyncio +import logging +import os +import random +import time +from typing import Any + +import aiolimiter +import openai +from openai import AsyncOpenAI, OpenAI + +base_url = os.environ.get("OPENAI_API_URL") +client = OpenAI(api_key=os.environ["OPENAI_API_KEY"], base_url=base_url) +aclient = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"], base_url=base_url) +from tqdm.asyncio import tqdm_asyncio + + +def retry_with_exponential_backoff( # type: ignore + func, + initial_delay: float = 1, + exponential_base: float = 2, + jitter: bool = True, + max_retries: int = 3, + errors: tuple[Any] = ( + openai.RateLimitError, + openai.BadRequestError, + openai.InternalServerError, + ), +): + """Retry a function with exponential backoff.""" + + def wrapper(*args, **kwargs): # type: ignore + # Initialize variables + num_retries = 0 + delay = initial_delay + + # Loop until a successful response or max_retries is hit or an exception is raised + while True: + try: + + return func(*args, **kwargs) + + # Retry on specified errors + except errors as e: + # Increment retries + num_retries += 1 + + # Check if max retries has been reached + if num_retries > max_retries: + raise Exception( + f"Maximum number of retries ({max_retries}) exceeded." 
+ ) + + # Increment the delay + delay *= exponential_base * (1 + jitter * random.random()) + + # Sleep for the delay + time.sleep(delay) + + # Raise exceptions for any errors not specified + except Exception as e: + raise e + + return wrapper + + +async def _throttled_openai_completion_acreate( + engine: str, + prompt: str, + temperature: float, + max_tokens: int, + top_p: float, + limiter: aiolimiter.AsyncLimiter, +) -> dict[str, Any]: + async with limiter: + for _ in range(3): + try: + return await aclient.completions.create( + engine=engine, + prompt=prompt, + temperature=temperature, + max_tokens=max_tokens, + top_p=top_p, + ) + except openai.RateLimitError: + logging.warning( + "OpenAI API rate limit exceeded. Sleeping for 10 seconds." + ) + await asyncio.sleep(10) + except openai.APIError as e: + logging.warning(f"OpenAI API error: {e}") + break + return {"choices": [{"message": {"content": ""}}]} + + +async def agenerate_from_openai_completion( + prompts: list[str], + engine: str, + temperature: float, + max_tokens: int, + top_p: float, + context_length: int, + requests_per_minute: int = 300, +) -> list[str]: + """Generate from OpenAI Completion API. + + Args: + prompts: list of prompts + temperature: Temperature to use. + max_tokens: Maximum number of tokens to generate. + top_p: Top p to use. + context_length: Length of context to use. + requests_per_minute: Number of requests per minute to allow. + + Returns: + List of generated responses. + """ + if "OPENAI_API_KEY" not in os.environ: + raise ValueError( + "OPENAI_API_KEY environment variable must be set when using OpenAI API." 
+ ) + + limiter = aiolimiter.AsyncLimiter(requests_per_minute) + async_responses = [ + _throttled_openai_completion_acreate( + engine=engine, + prompt=prompt, + temperature=temperature, + max_tokens=max_tokens, + top_p=top_p, + limiter=limiter, + ) + for prompt in prompts + ] + responses = await tqdm_asyncio.gather(*async_responses) + return [x["choices"][0]["text"] for x in responses] + + +@retry_with_exponential_backoff +def generate_from_openai_completion( + prompt: str, + engine: str, + temperature: float, + max_tokens: int, + top_p: float, + context_length: int, + stop_token: str | None = None, +) -> str: + if "OPENAI_API_KEY" not in os.environ: + raise ValueError( + "OPENAI_API_KEY environment variable must be set when using OpenAI API." + ) + + response = client.completions.create( + prompt=prompt, + engine=engine, + temperature=temperature, + max_tokens=max_tokens, + top_p=top_p, + stop=[stop_token], + ) + answer: str = response["choices"][0]["text"] + return answer + + +async def _throttled_openai_chat_completion_acreate( + model: str, + messages: list[dict[str, str]], + temperature: float, + max_tokens: int, + top_p: float, + limiter: aiolimiter.AsyncLimiter, +) -> dict[str, Any]: + async with limiter: + for _ in range(3): + try: + return await aclient.chat.completions.create( + model=model, + messages=messages, + temperature=temperature, + max_tokens=max_tokens, + top_p=top_p, + ) + except openai.RateLimitError: + logging.warning( + "OpenAI API rate limit exceeded. Sleeping for 10 seconds." + ) + await asyncio.sleep(10) + except asyncio.exceptions.TimeoutError: + logging.warning("OpenAI API timeout. 
Sleeping for 10 seconds.") + await asyncio.sleep(10) + except openai.APIError as e: + logging.warning(f"OpenAI API error: {e}") + break + return {"choices": [{"message": {"content": ""}}]} + + +async def agenerate_from_openai_chat_completion( + messages_list: list[list[dict[str, str]]], + engine: str, + temperature: float, + max_tokens: int, + top_p: float, + context_length: int, + requests_per_minute: int = 300, +) -> list[str]: + """Generate from OpenAI Chat Completion API. + + Args: + messages_list: list of message list + temperature: Temperature to use. + max_tokens: Maximum number of tokens to generate. + top_p: Top p to use. + context_length: Length of context to use. + requests_per_minute: Number of requests per minute to allow. + + Returns: + List of generated responses. + """ + if "OPENAI_API_KEY" not in os.environ: + raise ValueError( + "OPENAI_API_KEY environment variable must be set when using OpenAI API." + ) + + limiter = aiolimiter.AsyncLimiter(requests_per_minute) + async_responses = [ + _throttled_openai_chat_completion_acreate( + model=engine, + messages=message, + temperature=temperature, + max_tokens=max_tokens, + top_p=top_p, + limiter=limiter, + ) + for message in messages_list + ] + responses = await tqdm_asyncio.gather(*async_responses) + return [x["choices"][0]["message"]["content"] for x in responses] + + +@retry_with_exponential_backoff +def generate_from_openai_chat_completion( + messages: list[dict[str, str]], + model: str, + temperature: float, + max_tokens: int, + top_p: float, + context_length: int, + stop_token: str | None = None, +) -> str: + if "OPENAI_API_KEY" not in os.environ: + raise ValueError( + "OPENAI_API_KEY environment variable must be set when using OpenAI API." 
+ ) + response = client.chat.completions.create( + model=model, + messages=messages, + temperature=temperature, + max_tokens=max_tokens, + top_p=top_p, + ) + answer: str = response.choices[0].message.content + return answer + + +@retry_with_exponential_backoff +# debug only +def fake_generate_from_openai_chat_completion( + messages: list[dict[str, str]], + model: str, + temperature: float, + max_tokens: int, + top_p: float, + context_length: int, + stop_token: str | None = None, +) -> str: + if "OPENAI_API_KEY" not in os.environ: + raise ValueError( + "OPENAI_API_KEY environment variable must be set when using OpenAI API." + ) + + answer = "Let's think step-by-step. This page shows a list of links and buttons. There is a search box with the label 'Search query'. I will click on the search box to type the query. So the action I will perform is \"click [60]\"." + return answer diff --git a/VAB-WebArena-Lite/new/prompt_constructor.py b/VAB-WebArena-Lite/new/prompt_constructor.py new file mode 100644 index 0000000..9cd5a2a --- /dev/null +++ b/VAB-WebArena-Lite/new/prompt_constructor.py @@ -0,0 +1,501 @@ +import json +import re +from pathlib import Path +from typing import Any, TypedDict +from PIL import Image + +from browser_env import Action, ActionParsingError, Trajectory +from browser_env.env_config import URL_MAPPINGS +from browser_env.utils import StateInfo, pil_to_b64, pil_to_vertex +from llms import lm_config +from llms.tokenizers import Tokenizer +from llms.utils import APIInput + + +class Instruction(TypedDict): + """Instruction for constructing prompt""" + + intro: str + examples: list[tuple[str, str]] + template: str + meta_data: dict[str, Any] + + +class PromptConstructor(object): + def __init__( + self, + instruction_path: str | Path, + lm_config: lm_config.LMConfig, + tokenizer: Tokenizer, + ): + self.instruction_path = Path(instruction_path) + self.obs_modality = "text" + self.lm_config = lm_config + instruction = json.load(open(self.instruction_path)) + 
instruction["examples"] = [tuple(e) for e in instruction["examples"]] + self.instruction: Instruction = instruction + self.tokenizer = tokenizer + + def get_lm_api_input( + self, intro: str, examples: list[tuple[str, str]], current: str + ) -> APIInput: + + """Return the required format for an API""" + message: list[dict[str, str]] | str + if "openai" in self.lm_config.provider: + if self.lm_config.mode == "chat": + message = [{"role": "system", "content": intro}] + for (x, y) in examples: + message.append( + { + "role": "system", + "name": "example_user", + "content": x, + } + ) + message.append( + { + "role": "system", + "name": "example_assistant", + "content": y, + } + ) + message.append({"role": "user", "content": current}) + return message + elif self.lm_config.mode == "completion": + message = f"{intro}\n\n" + message += "Here are a few examples:\n" + for example in examples: + message += f"Observation\n:{example[0]}\n\n" + message += f"Action: {example[1]}\n\n" + message += "Now make prediction given the observation\n\n" + message += f"Observation\n:{current}\n\n" + message += "Action:" + return message + else: + raise ValueError( + f"OpenAI models do not support mode {self.lm_config.mode}" + ) + elif "huggingface" in self.lm_config.provider: + # https://huggingface.co/blog/llama2#how-to-prompt-llama-2 + # https://github.com/facebookresearch/llama/blob/main/llama/generation.py#L320 + if "Llama-2" in self.lm_config.model: + if self.lm_config.mode == "chat": + B_INST, E_INST = "[INST]", "[/INST]" + B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n" + BOS, EOS = "<s>", "</s>" + # prepend the system message to the first example + examples = [ + ( + B_SYS + intro + E_SYS + examples[0][0], + examples[0][1], + ) + ] + examples[1:] + message = "".join( + [ + f"{BOS}{B_INST} {x.strip()} {E_INST} {y.strip()} {EOS}" + for (x, y) in examples + ] + ) + # add the current observation + message += f"{BOS}{B_INST} {current.strip()} {E_INST} 
{self.instruction['meta_data'].get('force_prefix', '')}" + + return message + else: + raise ValueError("Only chat mode is supported for Llama-2") + else: + raise ValueError( + f"Huggingface models do not support model_tag {self.lm_config.gen_config['model_tag']}" + ) + else: + raise NotImplementedError( + f"Provider {self.lm_config.provider} not implemented" + ) + + def construct( + self, + trajectory: Trajectory, + intent: str, + meta_data: dict[str, Any] = {}, + ) -> APIInput: + raise NotImplementedError + + def map_url_to_real(self, url: str) -> str: + """Map the urls to their real world counterparts""" + for i, j in URL_MAPPINGS.items(): + if i in url: + url = url.replace(i, j) + return url + + def map_url_to_local(self, url: str) -> str: + """Map the urls to their local counterparts""" + for i, j in URL_MAPPINGS.items(): + if j in url: + url = url.replace(j, i) + # https + if j.replace("http", "https") in url: + url = url.replace(j.replace("http", "https"), i) + return url + + def _extract_action(self, response: str) -> str: + raise NotImplementedError + + def extract_action(self, response: str) -> str: + response = self._extract_action(response) + response = self.map_url_to_local(response) + return response + + +class DirectPromptConstructor(PromptConstructor): + """The agent will direct predict the action""" + + def __init__( + self, + instruction_path: str | Path, + lm_config: lm_config.LMConfig, + tokenizer: Tokenizer, + ): + super().__init__(instruction_path, lm_config, tokenizer) + + def construct( + self, + trajectory: Trajectory, + intent: str, + meta_data: dict[str, Any] = {}, + ) -> APIInput: + """Construct prompt given the trajectory""" + intro = self.instruction["intro"] + examples = self.instruction["examples"] + template = self.instruction["template"] + keywords = self.instruction["meta_data"]["keywords"] + state_info: StateInfo = trajectory[-1] # type: ignore[assignment] + + obs = state_info["observation"][self.obs_modality] + max_obs_length = 
self.lm_config.gen_config["max_obs_length"] + if max_obs_length: + if self.lm_config.provider == "google": + print("NOTE: This is a Gemini model, so we use characters instead of tokens for max_obs_length.") + obs = obs[:max_obs_length] + else: + obs = self.tokenizer.decode(self.tokenizer.encode(obs)[:max_obs_length]) # type: ignore[arg-type] + + page = state_info["info"]["page"] + url = page.url + previous_action_str = meta_data["action_history"][-1] + + # input x + current = template.format( + objective=intent, + url=self.map_url_to_real(url), + observation=obs, + previous_action=previous_action_str, + ) + + # make sure all keywords are replaced + assert all([f"{{k}}" not in current for k in keywords]) + prompt = self.get_lm_api_input(intro, examples, current) + return prompt + + def _extract_action(self, response: str) -> str: + action_splitter = self.instruction["meta_data"]["action_splitter"] + pattern = rf"{action_splitter}((.|\n)*?){action_splitter}" + match = re.search(pattern, response) + if match: + return match.group(1).strip() + else: + raise ActionParsingError( + f"Cannot parse action from response {response}" + ) + + +class CoTPromptConstructor(PromptConstructor): + """The agent will perform step-by-step reasoning before the answer""" + + def __init__( + self, + instruction_path: str | Path, + lm_config: lm_config.LMConfig, + tokenizer: Tokenizer, + ): + super().__init__(instruction_path, lm_config, tokenizer) + self.answer_phrase = self.instruction["meta_data"]["answer_phrase"] + + def construct( + self, + trajectory: Trajectory, + intent: str, + meta_data: dict[str, Any] = {}, + ) -> APIInput: + intro = self.instruction["intro"] + examples = self.instruction["examples"] + template = self.instruction["template"] + keywords = self.instruction["meta_data"]["keywords"] + state_info: StateInfo = trajectory[-1] # type: ignore[assignment] + + obs = state_info["observation"][self.obs_modality] + max_obs_length = self.lm_config.gen_config["max_obs_length"] + 
if max_obs_length: + if self.lm_config.provider == "google": + print("NOTE: This is a Gemini model, so we use characters instead of tokens for max_obs_length.") + obs = obs[:max_obs_length] + else: + obs = self.tokenizer.decode(self.tokenizer.encode(obs)[:max_obs_length]) # type: ignore[arg-type] + + page = state_info["info"]["page"] + url = page.url + previous_action_str = meta_data["action_history"][-1] + current = template.format( + objective=intent, + url=self.map_url_to_real(url), + observation=obs, + previous_action=previous_action_str, + ) + + assert all([f"{{k}}" not in current for k in keywords]) + + prompt = self.get_lm_api_input(intro, examples, current) + return prompt + + def _extract_action(self, response: str) -> str: + # find the first occurrence of the action + action_splitter = self.instruction["meta_data"]["action_splitter"] + pattern = rf"{action_splitter}((.|\n)*?){action_splitter}" + match = re.search(pattern, response) + if match: + return match.group(1).strip() + else: + raise ActionParsingError( + f'Cannot find the answer phrase "{self.answer_phrase}" in "{response}"' + ) + + +class MultimodalCoTPromptConstructor(CoTPromptConstructor): + """The agent will perform step-by-step reasoning before the answer""" + + def __init__( + self, + instruction_path: str | Path, + lm_config: lm_config.LMConfig, + tokenizer: Tokenizer, + ): + super().__init__(instruction_path, lm_config, tokenizer) + self.answer_phrase = self.instruction["meta_data"]["answer_phrase"] + + def construct( + self, + trajectory: Trajectory, + intent: str, + page_screenshot_img: Image.Image, + images: list[Image.Image], + meta_data: dict[str, Any] = {}, + ) -> APIInput: + intro = self.instruction["intro"] + examples = self.instruction["examples"] + template = self.instruction["template"] + keywords = self.instruction["meta_data"]["keywords"] + state_info: StateInfo = trajectory[-1] # type: ignore[assignment] + + obs = state_info["observation"][self.obs_modality] + max_obs_length = 
self.lm_config.gen_config["max_obs_length"] + if max_obs_length: + if self.lm_config.provider in ["google", "api", "finetune"]: + print("NOTE: This is a Gemini / API model, so we use characters instead of tokens for max_obs_length.") + obs = obs[:max_obs_length] + else: + obs = self.tokenizer.decode(self.tokenizer.encode(obs)[:max_obs_length]) # type: ignore[arg-type] + + page = state_info["info"]["page"] + url = page.url + previous_action_str = meta_data["action_history"][-1] + current = template.format( + objective=intent, + url=self.map_url_to_real(url), + observation=obs, + previous_action=previous_action_str, + ) + + assert all([f"{{k}}" not in current for k in keywords]) + + # TODO: for your finetuned model, you can configure your prompt here + if self.lm_config.provider == "finetune": + current = "" + traj = trajectory[1::2] + for rnd, tra in enumerate(traj): + tar = '** screenshot **' if rnd > 0 else intent + raw = tra["raw_prediction"] + current += f"Round {rnd}\n\n<|user|>\n\n** node_info **\n\n{tar}\n\n<|assistant|>\n{raw}\n\n" + + current += f"Round {len(traj)}\n\n<|user|>\n\n{obs}\n\n{'** screenshot **' if len(traj) > 0 else intent}\n" + + prompt = self.get_lm_api_input( + intro, examples, current, page_screenshot_img, images + ) + return prompt + + def get_lm_api_input( + self, + intro: str, + examples: list[tuple[str, str, str]], + current: str, + page_screenshot_img: Image.Image, + images: list[Image.Image], + ) -> APIInput: + """Return the required format for an API""" + message: list[dict[str, str]] | str | list[str | Image.Image] + if "openai" in self.lm_config.provider: + if self.lm_config.mode == "chat": + message = [ + { + "role": "system", + "content": [{"type": "text", "text": intro}], + } + ] + for (x, y, z) in examples: + example_img = Image.open(z) + message.append( + { + "role": "system", + "name": "example_user", + "content": [ + {"type": "text", "text": x}, + { + "type": "text", + "text": "IMAGES: (1) current page screenshot", + }, + { + 
"type": "image_url", + "image_url": { + "url": pil_to_b64(example_img) + }, + }, + ], + } + ) + message.append( + { + "role": "system", + "name": "example_assistant", + "content": [{"type": "text", "text": y}], + } + ) + + # Encode images and page_screenshot_img as base64 strings. + current_prompt = current + content = [ + { + "type": "text", + "text": "IMAGES: (1) current page screenshot", + }, + { + "type": "image_url", + "image_url": {"url": pil_to_b64(page_screenshot_img)}, + }, + ] + for image_i, image in enumerate(images): + content.extend( + [ + { + "type": "text", + "text": f"({image_i+2}) input image {image_i+1}", + }, + { + "type": "image_url", + "image_url": {"url": pil_to_b64(image)}, + }, + ] + ) + content = [{"type": "text", "text": current_prompt}] + content + + message.append({"role": "user", "content": content}) + return message + else: + raise ValueError( + f"GPT-4V models do not support mode {self.lm_config.mode}" + ) + elif "google" in self.lm_config.provider: + if self.lm_config.mode == "completion": + message = [ + intro, + "Here are a few examples:", + ] + for (x, y, z) in examples: + example_img = Image.open(z) + message.append(f"Observation\n:{x}\n") + message.extend( + [ + "IMAGES:", + "(1) current page screenshot:", + pil_to_vertex(example_img), + ] + ) + message.append(f"Action: {y}") + message.append("Now make prediction given the observation") + message.append(f"Observation\n:{current}\n") + message.extend( + [ + "IMAGES:", + "(1) current page screenshot:", + pil_to_vertex(page_screenshot_img), + ] + ) + for image_i, image in enumerate(images): + message.extend( + [ + f"({image_i+2}) input image {image_i+1}", + pil_to_vertex(image), + ] + ) + message.append("Action:") + return message + else: + raise ValueError( + f"Gemini models do not support mode {self.lm_config.mode}" + ) + elif self.lm_config.provider in ["api", "finetune"]: + message = [ + { + "role": "system", + "content": intro, + } + ] + + # we keep few-shot here, but remove 
the image corresponding to the current page. + for (x, y, _) in examples: + message.append({ + "role": "user", + "content": [ + { "type": "text", "text": x }, + { "type": "text", "text": "IMAGES: (1) current page screenshot\n\n** Screenshot **\n" }, + ], + }) + message.append({ + "role": "assistant", + "content": y, + }) + + + # TODO: Encode images and page_screenshot_img as base64 strings, we only keep screenshot of current page. + current_prompt = current + content = [] + + if self.lm_config.provider != "finetune": + content.append({ + "type": "text", + "text": "IMAGES: (1) current page screenshot", + }) + + if "text" not in self.lm_config.model: + content.append({ + "type": "image_url", + "image_url": {"url": pil_to_b64(page_screenshot_img)}, + }) + + content = [{"type": "text", "text": current_prompt}] + content + + message.append({"role": "user", "content": content}) + return message + + else: + raise NotImplementedError( + f"Provider {self.lm_config.provider} not implemented" + ) diff --git a/VAB-WebArena-Lite/new/run.py b/VAB-WebArena-Lite/new/run.py new file mode 100644 index 0000000..e406e2c --- /dev/null +++ b/VAB-WebArena-Lite/new/run.py @@ -0,0 +1,562 @@ +"""Script to run end-to-end evaluation on the benchmark. + +Modified from https://github.com/web-arena-x/webarena/blob/main/run.py. 
+""" +import argparse +import glob +import json +import logging +import os +import random +import subprocess +import tempfile +import time +from pathlib import Path +from typing import List + +import openai +import requests +import torch +from PIL import Image + +from agent import ( + PromptAgent, + construct_agent, +) +from agent.prompts import * +from browser_env import ( + Action, + ActionTypes, + ScriptBrowserEnv, + StateInfo, + Trajectory, + create_stop_action, +) +from browser_env.actions import is_equivalent +from browser_env.auto_login import get_site_comb_from_filepath +from browser_env.helper_functions import ( + RenderHelper, + get_action_description, +) +from evaluation_harness import evaluator_router, image_utils + +DATASET = os.environ["DATASET"] + +LOG_FOLDER = "log_files" +Path(LOG_FOLDER).mkdir(parents=True, exist_ok=True) +LOG_FILE_NAME = f"{LOG_FOLDER}/log_{time.strftime('%Y%m%d%H%M%S', time.localtime())}_{random.randint(0, 10000)}.log" + +logger = logging.getLogger("logger") +logger.setLevel(logging.INFO) + +console_handler = logging.StreamHandler() +console_handler.setLevel(logging.DEBUG) +logger.addHandler(console_handler) + +file_handler = logging.FileHandler(LOG_FILE_NAME) +file_handler.setLevel(logging.DEBUG) +logger.addHandler(file_handler) + +# Set the log format +formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s") +console_handler.setFormatter(formatter) +file_handler.setFormatter(formatter) + + +def config() -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Run end-to-end evaluation on the benchmark" + ) + parser.add_argument( + "--render", action="store_true", help="Render the browser" + ) + + parser.add_argument( + "--slow_mo", + type=int, + default=0, + help="Slow down the browser by the specified amount", + ) + parser.add_argument( + "--action_set_tag", default="id_accessibility_tree", help="Action type" + ) + parser.add_argument( + "--observation_type", + choices=[ + 
"accessibility_tree", + "accessibility_tree_with_captioner", + "html", + "image", + "image_som", + ], + default="accessibility_tree", + help="Observation type", + ) + parser.add_argument( + "--current_viewport_only", + action="store_true", + help="Only use the current viewport for the observation", + ) + parser.add_argument("--viewport_width", type=int, default=1280) + parser.add_argument("--viewport_height", type=int, default=2048) + parser.add_argument("--save_trace_enabled", action="store_true") + parser.add_argument("--sleep_after_execution", type=float, default=0.0) + + parser.add_argument("--max_steps", type=int, default=30) + + # agent config + parser.add_argument("--agent_type", type=str, default="prompt") + parser.add_argument( + "--instruction_path", + type=str, + default="agents/prompts/state_action_agent.json", + ) + parser.add_argument( + "--parsing_failure_th", + help="When consecutive parsing failures exceed this threshold, the agent will terminate early.", + type=int, + default=3, + ) + parser.add_argument( + "--repeating_action_failure_th", + help="When consecutive repeated actions exceed this threshold, the agent will terminate early.", + type=int, + default=5, + ) + + parser.add_argument("--test_config_base_dir", type=str) + + parser.add_argument( + "--eval_captioning_model_device", + type=str, + default="cpu", + choices=["cpu", "cuda"], + help="Device to run eval captioning model on. 
By default, runs it on CPU.", + ) + parser.add_argument( + "--eval_captioning_model", + type=str, + default="Salesforce/blip2-flan-t5-xl", + choices=["Salesforce/blip2-flan-t5-xl"], + help="Captioning backbone for VQA-type evals.", + ) + parser.add_argument( + "--captioning_model", + type=str, + default="Salesforce/blip2-flan-t5-xl", + choices=["Salesforce/blip2-flan-t5-xl", "llava-hf/llava-1.5-7b-hf"], + help="Captioning backbone for accessibility tree alt text.", + ) + + # lm config + parser.add_argument("--provider", type=str, default="openai") + parser.add_argument("--model", type=str, default="gpt-3.5-turbo-0613") + parser.add_argument("--mode", type=str, default="chat") + parser.add_argument("--temperature", type=float, default=1.0) + parser.add_argument("--top_p", type=float, default=0.9) + parser.add_argument("--context_length", type=int, default=0) + parser.add_argument("--max_tokens", type=int, default=384) + parser.add_argument("--stop_token", type=str, default=None) + parser.add_argument( + "--max_retry", + type=int, + help="max retry times to perform generations when parsing fails", + default=1, + ) + parser.add_argument( + "--max_obs_length", + type=int, + help="when not zero, will truncate the observation to this length before feeding to the model", + default=3840, + ) + + # example config + parser.add_argument("--test_start_idx", type=int, default=0) + parser.add_argument("--test_end_idx", type=int, default=910) + + # logging related + parser.add_argument("--result_dir", type=str, default="") + args = parser.parse_args() + + # check whether the action space is compatible with the observation space + if ( + args.action_set_tag == "id_accessibility_tree" + and args.observation_type + not in [ + "accessibility_tree", + "accessibility_tree_with_captioner", + "image_som", + ] + ): + raise ValueError( + f"Action type {args.action_set_tag} is incompatible with the observation type {args.observation_type}" + ) + + return args + + +def early_stop( +
trajectory: Trajectory, max_steps: int, thresholds: dict[str, int] +) -> tuple[bool, str]: + """Check whether the episode should stop early.""" + + # reached the maximum number of steps + num_steps = (len(trajectory) - 1) / 2 + if num_steps >= max_steps: + return True, f"Reach max steps {max_steps}" + + last_k_actions: list[Action] + action_seq: list[Action] + + # Case: k consecutive parsing failures + k = thresholds["parsing_failure"] + last_k_actions = trajectory[1::2][-k:] # type: ignore[assignment] + if len(last_k_actions) >= k: + if all( + [ + action["action_type"] == ActionTypes.NONE + for action in last_k_actions + ] + ): + return True, f"Failed to parse actions for {k} times" + + # Case: the same action repeated k times + k = thresholds["repeating_action"] + last_k_actions = trajectory[1::2][-k:] # type: ignore[assignment] + action_seq = trajectory[1::2] # type: ignore[assignment] + + if len(action_seq) == 0: + return False, "" + + last_action: Action = action_seq[-1] + + if last_action["action_type"] != ActionTypes.TYPE: + if len(last_k_actions) >= k: + if all( + [ + is_equivalent(action, last_action) + for action in last_k_actions + ] + ): + return True, f"Same action for {k} times" + + else: + # check the whole action sequence + if ( + sum([is_equivalent(action, last_action) for action in action_seq]) + >= k + ): + return True, f"Same typing action for {k} times" + + return False, "" + + +def update_action_history(path: str, task_id: int, actions: List[str], score: float=-0.1): + # A negative score marks the task as unfinished (see get_unfinished). + obj = { + "task_id": task_id, + "score": score, + "actions": actions + } + # Use a context manager so the file handle is closed after every write. 
+ with open(path, "w") as f: + json.dump(obj, f, indent=4) + + +def test( + args: argparse.Namespace, + config_file_list: list[str] +) -> None: + scores = [] + max_steps = args.max_steps + + early_stop_thresholds = { + "parsing_failure": args.parsing_failure_th, + "repeating_action": args.repeating_action_failure_th, + } + + if args.observation_type in [ + "accessibility_tree_with_captioner", + # "image_som", + ]: + device = torch.device("cuda") if
torch.cuda.is_available() else "cpu" + dtype = torch.float16 if torch.cuda.is_available() else torch.float32 + caption_image_fn = image_utils.get_captioning_fn( + device, dtype, args.captioning_model + ) + else: + caption_image_fn = None + + # Load a (possibly different) captioning model for running VQA evals. + if DATASET == 'visualwebarena': + if ( + caption_image_fn + and args.eval_captioning_model == args.captioning_model + ): + eval_caption_image_fn = caption_image_fn + else: + eval_caption_image_fn = image_utils.get_captioning_fn( + args.eval_captioning_model_device, + torch.float16 + if ( + torch.cuda.is_available() + and args.eval_captioning_model_device == "cuda" + ) + else torch.float32, + args.eval_captioning_model, + ) + else: + caption_image_fn = None + eval_caption_image_fn = None + + agent = construct_agent( + args, + captioning_fn=caption_image_fn + if args.observation_type == "accessibility_tree_with_captioner" + else None, + ) # NOTE: captioning_fn here is used for captioning input images. + + env = ScriptBrowserEnv( + headless=not args.render, + slow_mo=args.slow_mo, + observation_type=args.observation_type, + current_viewport_only=args.current_viewport_only, + viewport_size={ + "width": args.viewport_width, + "height": args.viewport_height, + }, + save_trace_enabled=args.save_trace_enabled, + sleep_after_execution=args.sleep_after_execution, + # NOTE: captioning_fn here is used for LLM + captioning baselines. + # This can be different from the captioning model used for evals. + captioning_fn=caption_image_fn, + ) + + for config_file in config_file_list: + try: + render_helper = RenderHelper( + config_file, args.result_dir, args.action_set_tag + ) + + # Load task. 
+ with open(config_file) as f: + _c = json.load(f) + intent = _c["intent"] + task_id = _c["task_id"] + image_paths = _c.get("image", None) + images = [] + + # automatically login + if _c["storage_state"]: + cookie_file_name = os.path.basename(_c["storage_state"]) + comb = get_site_comb_from_filepath(cookie_file_name) + temp_dir = tempfile.mkdtemp() + # subprocess to renew the cookie + subprocess.run( + [ + "python", + "browser_env/auto_login.py", + "--auth_folder", + temp_dir, + "--site_list", + *comb, + ] + ) + _c["storage_state"] = f"{temp_dir}/{cookie_file_name}" + assert os.path.exists(_c["storage_state"]) + # update the config file + config_file = f"{temp_dir}/{os.path.basename(config_file)}" + with open(config_file, "w") as f: + json.dump(_c, f) + + # Load input images for the task, if any. + if image_paths is not None: + if isinstance(image_paths, str): + image_paths = [image_paths] + for image_path in image_paths: + # Load image either from the web or from a local path. + if image_path.startswith("http"): + input_image = Image.open(requests.get(image_path, stream=True).raw) + else: + input_image = Image.open(image_path) + + images.append(input_image) + + logger.info(f"[Config file]: {config_file}") + logger.info(f"[Intent]: {intent}") + + agent.reset(config_file) + trajectory: Trajectory = [] + obs, info = env.reset(options={"config_file": config_file}) + state_info: StateInfo = {"observation": obs, "info": info} + trajectory.append(state_info) + + meta_data = {"action_history": ["None"]} + out_path = os.path.join(args.result_dir, "actions", f"{task_id}.json") + actions = [] + + while True: + update_action_history(out_path, task_id, actions=actions) + + early_stop_flag, stop_info = early_stop( + trajectory, max_steps, early_stop_thresholds + ) + + if early_stop_flag: + action = create_stop_action(f"Early stop: {stop_info}") + else: + try: + action = agent.next_action( + trajectory, + intent, + images=images, + meta_data=meta_data, + ) + except ValueError as 
e: + # get the error message + action = create_stop_action(f"ERROR: {str(e)}") + + trajectory.append(action) + + action_str = get_action_description( + action, + state_info["info"]["observation_metadata"], + action_set_tag=args.action_set_tag, + prompt_constructor=agent.prompt_constructor + if isinstance(agent, PromptAgent) + else None, + ) + render_helper.render( + action, state_info, meta_data, args.render_screenshot + ) + meta_data["action_history"].append(action_str) + actions.append(action_str) + print(action_str) + + if action["action_type"] == ActionTypes.STOP: + break + + obs, _, terminated, _, info = env.step(action) + state_info = {"observation": obs, "info": info} + trajectory.append(state_info) + + if terminated: + # add an action placeholder + trajectory.append(create_stop_action("")) + break + + # NOTE: eval_caption_image_fn is used for running eval_vqa functions. + evaluator = evaluator_router( + config_file, captioning_fn=eval_caption_image_fn + ) + score = evaluator( + trajectory=trajectory, + config_file=config_file, + page=env.page + ) + + update_action_history(out_path, task_id, actions=actions, score=score) + scores.append(score) + + if score == 1: + logger.info(f"[Result] (PASS) {config_file}") + else: + logger.info(f"[Result] (FAIL) {config_file}") + + if args.save_trace_enabled: + env.save_trace( + Path(args.result_dir) / "traces" / f"{task_id}.zip" + ) + except openai.OpenAIError as e: + logger.info(f"[OpenAI Error] {repr(e)}") + except Exception as e: + logger.info(f"[Unhandled Error] {repr(e)}") + import traceback + + # write to error file + with open(Path(args.result_dir) / "error.txt", "a") as f: + f.write(f"[Config file]: {config_file}\n") + f.write(f"[Unhandled Error] {repr(e)}\n") + f.write(traceback.format_exc()) # write stack trace to file + + render_helper.close() + + env.close() + if len(scores): + logger.info(f"Average score: {sum(scores) / len(scores)}") + + +def prepare(args: argparse.Namespace) -> None: + # convert prompt
python files to json + from agent.prompts import to_json + + to_json.run() + + # prepare result dir + result_dir = args.result_dir + if not result_dir: + result_dir = ( + f"cache/results_{time.strftime('%Y%m%d%H%M%S', time.localtime())}" + ) + if not Path(result_dir).exists(): + Path(result_dir).mkdir(parents=True, exist_ok=True) + args.result_dir = result_dir + logger.info(f"Create result dir: {result_dir}") + + if not (Path(result_dir) / "traces").exists(): + (Path(result_dir) / "traces").mkdir(parents=True) + + os.makedirs(os.path.join(result_dir, "actions"), exist_ok=True) + + # record the path of the current log file + with open(os.path.join(result_dir, "log_files.txt"), "a+") as f: + f.write(f"{LOG_FILE_NAME}\n") + + +def get_unfinished(config_files: list[str], result_dir: str) -> list[str]: + result_files = glob.glob(f"{result_dir}/*.html") + task_ids = [ + os.path.basename(f).split(".")[0].split("_")[1] for f in result_files + ] + unfinished_configs = [] + for config_file in config_files: + task_id = os.path.basename(config_file).split(".")[0] + try: + with open(f"{result_dir}/actions/{task_id}.json", "r") as f: + jd = json.load(f) + except (OSError, ValueError): + # a missing or malformed action file means the task has not finished + jd = {} + if task_id not in task_ids or jd.get('score', -1) < 0: + unfinished_configs.append(config_file) + return unfinished_configs + + +def dump_config(args: argparse.Namespace) -> None: + config_file = Path(args.result_dir) / "config.json" + if not config_file.exists(): + with open(config_file, "w") as f: + json.dump(vars(args), f, indent=4) + logger.info(f"Dump config to {config_file}") + + +if __name__ == "__main__": + os.environ["TOKENIZERS_PARALLELISM"] = "false" + + args = config() + args.sleep_after_execution = 2.5 + prepare(args) + + test_config_base_dir = args.test_config_base_dir + + test_file_list = [] + st_idx = args.test_start_idx + ed_idx = args.test_end_idx + for i in range(st_idx, ed_idx): + 
test_file_list.append(os.path.join(test_config_base_dir, f"{i}.json")) + test_file_list = get_unfinished(test_file_list,
args.result_dir) + print(f"Total {len(test_file_list)} tasks left") + args.render = False + args.render_screenshot = True + args.save_trace_enabled = True + + args.current_viewport_only = True + dump_config(args) + + test(args, test_file_list) diff --git a/VAB-WebArena-Lite/new/test_webarena_lite.raw.json b/VAB-WebArena-Lite/new/test_webarena_lite.raw.json new file mode 100644 index 0000000..c77225a --- /dev/null +++ b/VAB-WebArena-Lite/new/test_webarena_lite.raw.json @@ -0,0 +1,5838 @@ +[ + { + "sites": [ + "shopping_admin" + ], + "task_id": 0, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "What are the top-{{n}} best-selling product in {{period}}", + "instantiation_dict": { + "n": 3, + "period": "Jan 2023" + }, + "intent": "What are the top-3 best-selling product in Jan 2023", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "Impulse Duffle", + "Overnight Duffle", + "Hawkeye Yoga Short-32-Blue" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Impulse Duffle, Overnight Duffle, Hawkeye Yoga Short-32-Blue" + }, + "intent_template_id": 279, + "old_task_id": 4 + }, + { + "sites": [ + "map" + ], + "task_id": 1, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Tell me the full address of all {{airport_type}} that are within a driving distance of {{radius}} to {{start}}", + "instantiation_dict": { + "airport_type": "international airports", + "start": "Carnegie Mellon University", + "radius": "50 km" + }, + "intent": "Tell me the full address of all international airports that are within a driving distance of 50 km to Carnegie Mellon University", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + 
"fuzzy_match": [ + "Pittsburgh International Airport, Southern Beltway, Findlay Township, Allegheny County, 15231, United States" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Pittsburgh International Airport People Movers, Airport Boulevard, Findlay Township, Allegheny County, Pennsylvania, 15231, United States" + }, + "intent_template_id": 79, + "old_task_id": 7 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 2, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Tell me the the number of reviews that our store received by far that mention term \"{{term}}\"", + "instantiation_dict": { + "term": "best" + }, + "intent": "Tell me the the number of reviews that our store received by far that mention term \"best\"", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "2" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "2" + }, + "intent_template_id": 288, + "old_task_id": 15 + }, + { + "sites": [ + "map" + ], + "task_id": 3, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Compare the difference in time for walking and driving route from {{start}} to {{end}}", + "instantiation_dict": { + "start": "Randyland", + "end": "Carnegie Mellon University" + }, + "intent": "Compare the difference in time for walking and driving route from Randyland to Carnegie Mellon University", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "driving: 13min", + "walking: 1h 45min" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "driving: 13min, walking: 1h 45min." 
+ }, + "intent_template_id": 73, + "old_task_id": 20 + }, + { + "sites": [ + "shopping" + ], + "task_id": 4, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__/3-pack-samsung-galaxy-s6-screen-protector-nearpow-tempered-glass-screen-protector-with-9h-hardness-crystal-clear-easy-bubble-free-installation-scratch-resist.html", + "geolocation": null, + "intent_template": "List out reviewers, if exist, who mention about {{description}}", + "instantiation_dict": { + "description": "good fingerprint resistant" + }, + "intent": "List out reviewers, if exist, who mention about good fingerprint resistant", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "Rachel", + "T. Gannon" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Rachel, T. Gannon, " + }, + "intent_template_id": 222, + "old_task_id": 23 + }, + { + "sites": [ + "reddit" + ], + "task_id": 5, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Tell me the count of comments that have received more downvotes than upvotes for the user who made the latest post on the {{forum}} forum.", + "instantiation_dict": { + "forum": "Showerthoughts" + }, + "intent": "Tell me the count of comments that have received more downvotes than upvotes for the user who made the latest post on the Showerthoughts forum.", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "0" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "0" + }, + "intent_template_id": 33, + "old_task_id": 27 + }, + { + "sites": [ + "map" + ], + "task_id": 6, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": 
null, + "intent_template": "I will arrive {{place}} soon. Provide the name of a {{target1}} in the vicinity, if available. Then, tell me the {{information}} to {{target2}} from the hotel.", + "instantiation_dict": { + "place": "Pittsburgh Airport", + "target1": "Hilton hotel", + "information": "the shortest walking distance", + "target2": "a supermarket" + }, + "intent": "I will arrive Pittsburgh Airport soon. Provide the name of a Hilton hotel in the vicinity, if available. Then, tell me the the shortest walking distance to a supermarket from the hotel.", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "DoubleTree by Hilton Hotel Pittsburgh Airport", + "1.4km" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "DoubleTree by Hilton Hotel Pittsburgh Airport Distance: 1.4km" + }, + "intent_template_id": 78, + "old_task_id": 33 + }, + { + "sites": [ + "map" + ], + "task_id": 7, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Check if the {{place}} in pittsburgh can be reached in one hour by car from {{location}}", + "instantiation_dict": { + "place": "police station", + "location": "gates building at CMU" + }, + "intent": "Check if the police station in pittsburgh can be reached in one hour by car from gates building at CMU", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "Yes" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Yes" + }, + "intent_template_id": 77, + "old_task_id": 37 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 8, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "List the top {{n}} 
search terms in my store", + "instantiation_dict": { + "n": "3" + }, + "intent": "List the top 3 search terms in my store", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "hollister", + "Joust Bag", + "nike" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "hollister, Joust Bag, nike" + }, + "intent_template_id": 285, + "old_task_id": 43 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 9, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Check out my todos", + "instantiation_dict": {}, + "intent": "Check out my todos", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__GITLAB__/dashboard/todos", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 303, + "old_task_id": 44 + }, + { + "sites": [ + "shopping" + ], + "task_id": 10, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Today is 6/12/2023. Tell me how many fulfilled orders I have {{period}}, and the total amount of money I spent.", + "instantiation_dict": { + "period": "over the past three days" + }, + "intent": "Today is 6/12/2023. 
Tell me how many fulfilled orders I have over the past three days, and the total amount of money I spent.", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "0 order", + "$0 total spend" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "0 order, $0 total spend" + }, + "intent_template_id": 197, + "old_task_id": 48 + }, + { + "sites": [ + "map" + ], + "task_id": 11, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "How long does it take to walk from {{start}} to {{end}}?", + "instantiation_dict": { + "start": "Carnegie Museum of Art", + "end": "a library at CMU" + }, + "intent": "How long does it take to walk from Carnegie Museum of Art to a library at CMU?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "11 min" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "11 min" + }, + "intent_template_id": 68, + "old_task_id": 56 + }, + { + "sites": [ + "map" + ], + "task_id": 12, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Tell me the closest {{place1}}(s) to {{place2}}", + "instantiation_dict": { + "place1": "cafe", + "place2": "CMU Hunt library" + }, + "intent": "Tell me the closest cafe(s) to CMU Hunt library", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "De Fer Coffee & Tea" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "De Fer Coffee & Tea" + }, + "intent_template_id": 69, + "old_task_id": 58 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 13, + "require_login": true, + "storage_state": 
"./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Which customer has completed the {{quantifier}} number of orders in the entire history?", + "instantiation_dict": { + "quantifier": "fifth most" + }, + "intent": "Which customer has completed the fifth most number of orders in the entire history?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "Matt Baker" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Matt Baker" + }, + "intent_template_id": 276, + "old_task_id": 65 + }, + { + "sites": [ + "reddit" + ], + "task_id": 14, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Among the top {{number}} post in \"{{subreddit}}\" forum, {{description}}", + "instantiation_dict": { + "number": 10, + "subreddit": "books", + "description": "is there any post talks about supporting local book stores? If so, tell me the organizations involved" + }, + "intent": "Among the top 10 post in \"books\" forum, is there any post talks about supporting local book stores? 
If so, tell me the organizations involved", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "bookshop.org" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "bookshop.org" + }, + "intent_template_id": 17, + "old_task_id": 69 + }, + { + "sites": [ + "map" + ], + "task_id": 15, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "What is the zip code of {{place}}?", + "instantiation_dict": { + "place": "Chatham University" + }, + "intent": "What is the zip code of Chatham University?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "15232" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "15232" + }, + "intent_template_id": 70, + "old_task_id": 71 + }, + { + "sites": [ + "map" + ], + "task_id": 16, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Given the following locations, {{place_list}}, what would be the optimal route to travel through them all in order to minimize total travel time? Please note the journey begins at the first place listed.", + "instantiation_dict": { + "place_list": [ + "Massachusetts Institute of Technology", + "Harvard University", + "Boston Logan International Airport" + ] + }, + "intent": "Given the following locations, ['Massachusetts Institute of Technology', 'Harvard University', 'Boston Logan International Airport'], what would be the optimal route to travel through them all in order to minimize total travel time? 
Please note the journey begins at the first place listed.", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "The order is Massachusetts Institute of Technology, Harvard University, Boston Logan International Airport" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Massachusetts Institute of Technology, Harvard University, Boston Logan International Airport" + }, + "intent_template_id": 65, + "old_task_id": 75 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 17, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "What is the total count of {{status}} reviews amongst all the reviews?", + "instantiation_dict": { + "status": "Pending" + }, + "intent": "What is the total count of Pending reviews amongst all the reviews?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "5" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "5" + }, + "intent_template_id": 277, + "old_task_id": 77 + }, + { + "sites": [ + "map" + ], + "task_id": 18, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "What is the duration required to first walk from {{place_A}} to {{place_B}}, and then drive to {{place_C}}?", + "instantiation_dict": { + "place_A": "Massachusetts Institute of Technology", + "place_B": "Harvard University", + "place_C": "Boston Logan International Airport" + }, + "intent": "What is the duration required to first walk from Massachusetts Institute of Technology to Harvard University, and then drive to Boston Logan International Airport?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + 
"reference_answers": { + "fuzzy_match": [ + "64 min" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "63 min" + }, + "intent_template_id": 72, + "old_task_id": 82 + }, + { + "sites": [ + "map" + ], + "task_id": 19, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "From my stay at {{hotel}}, what's the estimated driving time to reach {{place}}?", + "instantiation_dict": { + "hotel": "Homewood Suites Southpointe", + "place": "PPG Paints Arena" + }, + "intent": "From my stay at Homewood Suites Southpointe, what's the estimated driving time to reach PPG Paints Arena?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "34 minutes" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "34 minutes" + }, + "intent_template_id": 64, + "old_task_id": 88 + }, + { + "sites": [ + "map" + ], + "task_id": 20, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Which US states border {{state}}?", + "instantiation_dict": { + "state": "New Hampshire" + }, + "intent": "Which US states border New Hampshire?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "Massachusetts", + "Vermont", + "Maine" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Massachusetts, Vermont, Maine" + }, + "intent_template_id": 67, + "old_task_id": 93 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 21, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Telll me the grand total of invoice {{id}}.", + "instantiation_dict": 
{ + "id": "000000002" + }, + "intent": "Telll me the grand total of invoice 000000002.", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "39.64" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "$39.64" + }, + "intent_template_id": 274, + "old_task_id": 95 + }, + { + "sites": [ + "shopping" + ], + "task_id": 22, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Tell me the status of my latest order and when will it arrive", + "instantiation_dict": {}, + "intent": "Tell me the status of my latest order and when will it arrive", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "The last order was canceled. It will never arrive." + ] + }, + "reference_url": "", + "program_html": [], + "reference_answer_raw_annotation": "The last order was canceled. 
It will never arrive.", + "string_note": "" + }, + "intent_template_id": 193, + "old_task_id": 96 + }, + { + "sites": [ + "map", + "wikipedia" + ], + "task_id": 23, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Tell me the distance to drive from Carnegie Mellon University to the top computer science school in massachusetts", + "instantiation_dict": {}, + "intent": "Tell me the distance to drive from Carnegie Mellon University to the top computer science school in massachusetts", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "914km" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "914 km" + }, + "intent_template_id": 120, + "old_task_id": 97 + }, + { + "sites": [ + "map" + ], + "task_id": 24, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Where is the nearest {{places}} to {{start}}, and what is the walking distance to it?", + "instantiation_dict": { + "places": "tea cafe", + "start": "University of Pittsburgh" + }, + "intent": "Where is the nearest tea cafe to University of Pittsburgh, and what is the walking distance to it?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "Fuku Tea", + "3716", + "Forbes Avenue", + "Central Oakland", + "Pittsburgh", + "653m" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Fuku Tea, 3716, Forbes Avenue, Oakland, Central Oakland, Pittsburgh, Allegheny County, Pennsylvania, 15213, United States\n653m" + }, + "intent_template_id": 66, + "old_task_id": 98 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 25, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + 
"geolocation": null, + "intent_template": "Display the list of issues in the {{repo}} repository that have labels related to {{label}}", + "instantiation_dict": { + "label": "questions", + "repo": "kkroening/ffmpeg-python" + }, + "intent": "Display the list of issues in the kkroening/ffmpeg-python repository that have labels related to questions", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__GITLAB__/kkroening/ffmpeg-python/-/issues/?sort=created_date&state=opened&label_name%5B%5D=question&first_page_size=20", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 349, + "old_task_id": 103 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 26, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Presents the monthly count of successful orders {{period}} in MM:COUNT format", + "instantiation_dict": { + "period": "from Jan to December 2022" + }, + "intent": "Presents the monthly count of successful orders from Jan to December 2022 in MM:COUNT format", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "01:11", + "02:16", + "03:14", + "04:8", + "05:8", + "06:13", + "07:9", + "08:8", + "09:10", + "10:4", + "11:5", + "12:10" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "01:11 02:16 03:14 04:8 05:8 06:13 07:9 08:8 09:10 10:4 11:5 12:10" + }, + "intent_template_id": 270, + "old_task_id": 109 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 27, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Show me the name of the customers who have expressed dissatisfaction with {{product}}", + 
"instantiation_dict": { + "product": "Chloe tank" + }, + "intent": "Show me the name of the customers who have expressed dissatisfaction with Chloe tank", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "exact_match": "N/A" + }, + "reference_url": "", + "program_html": [], + "string_note": "There is no negative review for Chloe tank", + "reference_answer_raw_annotation": "" + }, + "intent_template_id": 245, + "old_task_id": 115 + }, + { + "sites": [ + "shopping" + ], + "task_id": 28, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "What is the date when I made my first purchase on this site?", + "instantiation_dict": {}, + "intent": "What is the date when I made my first purchase on this site?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "3/2/22" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "3/2/22" + }, + "intent_template_id": 161, + "old_task_id": 117 + }, + { + "sites": [ + "shopping" + ], + "task_id": 29, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "I have jaw bruxism problem, show me something that could alleviate the problem.", + "instantiation_dict": {}, + "intent": "I have jaw bruxism problem, show me something that could alleviate the problem.", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": null, + "program_html": [ + { + "url": "last", + "locator": "", + "required_contents": { + "must_include": [ + "jaw bruxism", + "mouth guard" + ] + } + } + ] + }, + "intent_template_id": 151, + "old_task_id": 118 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 30, + 
"require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Tell me the reasons why customers like {{product}}", + "instantiation_dict": { + "product": "Circe's products" + }, + "intent": "Tell me the reasons why customers like Circe's products", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "Warm and comfortable. True to size." + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Warm and comfortable. True to size." + }, + "intent_template_id": 250, + "old_task_id": 123 + }, + { + "sites": [ + "shopping" + ], + "task_id": 31, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "What is the price range of {{product}} in the One Stop Market?", + "instantiation_dict": { + "product": "teeth grinding mouth guard" + }, + "intent": "What is the price range of teeth grinding mouth guard in the One Stop Market?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "1.46", + "85" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "$1.46 - $85" + }, + "intent_template_id": 159, + "old_task_id": 125 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 32, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "What brands appear most frequently among the top search terms?", + "instantiation_dict": {}, + "intent": "What brands appear most frequently among the top search terms?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + 
"Hollister" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Hollister" + }, + "intent_template_id": 1001, + "old_task_id": 127 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 33, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "What's the total number of items sold in the most recent {{k}} orders?", + "instantiation_dict": { + "k": "7" + }, + "intent": "What's the total number of items sold in the most recent 7 orders?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "25" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "25" + }, + "intent_template_id": 1002, + "old_task_id": 131 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 34, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "How many commits did {{user}} make to {{repo}} on {{date}}?", + "instantiation_dict": { + "user": "Eric and Kilian", + "repo": "a11yproject", + "date": "1/3/2023" + }, + "intent": "How many commits did Eric and Kilian make to a11yproject on 1/3/2023?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "0" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "0" + }, + "intent_template_id": 322, + "old_task_id": 135 + }, + { + "sites": [ + "map" + ], + "task_id": 35, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "What is the estimated driving time between {{city1}} and {{city2}}?", + "instantiation_dict": { + "city1": "the hometown of Joe Biden", + "city2": 
"Bridgeport" + }, + "intent": "What is the estimated driving time between the hometown of Joe Biden and Bridgeport?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "3h 20min" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "3h 20min" + }, + "intent_template_id": 51, + "old_task_id": 139 + }, + { + "sites": [ + "shopping" + ], + "task_id": 36, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "How much I spent on {{category}} shopping during {{time}}", + "instantiation_dict": { + "category": "food", + "time": "from mid Jan to the end Jan 2023" + }, + "intent": "How much I spent on food shopping during from mid Jan to the end Jan 2023", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "0" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "0" + }, + "intent_template_id": 162, + "old_task_id": 144 + }, + { + "sites": [ + "shopping" + ], + "task_id": 37, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "What is the {{option}} configuration of the {{product}} I bought {{time}}", + "instantiation_dict": { + "option": "color", + "product": "artifical plants", + "time": "Feb 2023" + }, + "intent": "What is the color configuration of the artifical plants I bought Feb 2023", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "Green-vines" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Green-vines" + }, + "intent_template_id": 155, + "old_task_id": 149 + }, + { + 
"sites": [ + "map" + ], + "task_id": 38, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "What is the minimum travel time by car from {{location1}} to {{location2}}?", + "instantiation_dict": { + "location1": "Animal Rescue League of Pittsburgh", + "location2": "Schenley park" + }, + "intent": "What is the minimum travel time by car from Animal Rescue League of Pittsburgh to Schenley park?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "9min" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "9min" + }, + "intent_template_id": 36, + "old_task_id": 155 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 39, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Checkout merge requests assigned to me", + "instantiation_dict": {}, + "intent": "Checkout merge requests assigned to me", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__GITLAB__/dashboard/merge_requests?assignee_username=byteblaze", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 290, + "old_task_id": 156 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 40, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Show all customers", + "instantiation_dict": {}, + "intent": "Show all customers", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__SHOPPING_ADMIN__/customer/index/", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 255, + "old_task_id": 157 + }, + { + "sites": [ + 
"shopping" + ], + "task_id": 41, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "I have a lot of Nintendo Switch game cards now, help me find the best storage option to fit all {{num}} cards", + "instantiation_dict": { + "num": 40 + }, + "intent": "I have a lot of Nintendo Switch game cards now, help me find the best storage option to fit all 40 cards", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__SHOPPING__/game-card-holder-storage-case-for-nintendo-switch-games-or-ps-vita-game-case-or-sd-memory-cards-black.html", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 171, + "old_task_id": 162 + }, + { + "sites": [ + "shopping" + ], + "task_id": 42, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__/photosmart-plus-b209-clr-inkjetfb-p-s-c-usb-wrls-1.html", + "geolocation": null, + "intent_template": "What are the main criticisms of this product? Please extract the relevant sentences.", + "instantiation_dict": {}, + "intent": "What are the main criticisms of this product? Please extract the relevant sentences.", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "The wireless connection works on a whim (about 40% of the time I've owned it)", + "It seems to constantly run out of ink", + "Cartridge prices are less than some printers I've had", + "This printer seems to have more reasons NOT to work (none that are findable or correctable) Ex: error boxes saying that it's out of paper when it automatically switches to photo printing for some reason", + "Scanner is as slow as my first scanner I ever owned in the mid-90's", + "For the $176 I paid, there isn't even a fax component on it. 
I guess the \"PLUS\" part of it's name is in reference to the migraines it causes when you can't figure out the new reason why it's not working for the 10th time in the past 2 months." + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "The wireless connection works on a whim (about 40% of the time I've owned it). It seems to constantly run out of ink. Cartridge prices are less than some printers I've had, but now I understand why. This printer seems to have more reasons NOT to work (none that are findable or correctable) Ex: error boxes saying that it's out of paper when it automatically switches to photo printing for some reason. Scanner is as slow as my first scanner I ever owned in the mid-90's. For the $176 I paid, there isn't even a fax component on it. I guess the \"PLUS\" part of it's name is in reference to the migraines it causes when you can't figure out the new reason why it's not working for the 10th time in the past 2 months." 
+ }, + "intent_template_id": 136, + "old_task_id": 167 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 43, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Tell me the full names of the repositories where I made contributions and they got {{description}} stars?", + "instantiation_dict": { + "description": "the most" + }, + "intent": "Tell me the full names of the repositories where I made contributions and they got the most stars?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "a11yproject.com", + "design" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "a11yproject.com, Primer/design" + }, + "intent_template_id": 289, + "old_task_id": 169 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 44, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Open my latest updated issue that has keyword \"{{keyword}}\" in its title to check if it is closed", + "instantiation_dict": { + "keyword": "better" + }, + "intent": "Open my latest updated issue that has keyword \"better\" in its title to check if it is closed", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match", + "url_match" + ], + "reference_answers": { + "fuzzy_match": [ + "No, it is open" + ] + }, + "reference_url": "__GITLAB__/byteblaze/empathy-prompts/-/issues/8", + "program_html": [], + "reference_answer_raw_annotation": "Not closed", + "string_note": "", + "url_note": "GOLD in PRED" + }, + "intent_template_id": 310, + "old_task_id": 173 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 45, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Open my latest created 
issue that has {{keyword}} in its title to check if it is closed", + "instantiation_dict": { + "keyword": "homepage content" + }, + "intent": "Open my latest created issue that has homepage content in its title to check if it is closed", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match", + "url_match" + ], + "reference_answers": { + "exact_match": "Yes" + }, + "reference_url": "__GITLAB__/a11yproject/a11yproject.com/-/issues/719", + "program_html": [], + "reference_answer_raw_annotation": "closed", + "string_note": "" + }, + "intent_template_id": 500, + "old_task_id": 182 + }, + { + "sites": [ + "shopping" + ], + "task_id": 46, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Tell me the total cost of my latest {{status}} order?", + "instantiation_dict": { + "status": "complete" + }, + "intent": "Tell me the total cost of my latest complete order?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "65.32" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "65.32" + }, + "intent_template_id": 214, + "old_task_id": 190 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 47, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Compare the payment difference of the last {{N}} {{status_1}} orders and {{status_2}} orders", + "instantiation_dict": { + "status_1": "cancelled", + "status_2": "completed", + "N": "4" + }, + "intent": "Compare the payment difference of the last 4 cancelled orders and completed orders", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "194.25" + ] + }, + "reference_url": "", + "program_html": [], + 
"string_note": "", + "reference_answer_raw_annotation": "194.25" + }, + "intent_template_id": 367, + "old_task_id": 196 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 48, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Get the {{attribute}} of the {{status}} order", + "instantiation_dict": { + "attribute": "date", + "status": "most recent canlled" + }, + "intent": "Get the date of the most recent canlled order", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "May 23 2023" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "May 23, 2023" + }, + "intent_template_id": 366, + "old_task_id": 202 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 49, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__/a11yproject/a11yproject.com", + "geolocation": null, + "intent_template": "How many commits did {{user}} make on {{date}}?", + "instantiation_dict": { + "user": "kilian", + "date": "3/5/2023" + }, + "intent": "How many commits did kilian make on 3/5/2023?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "1" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "1" + }, + "intent_template_id": 320, + "old_task_id": 205 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 50, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Find the customer name and email with phone number {{PhoneNum}}", + "instantiation_dict": { + "PhoneNum": "8015551212" + }, + "intent": "Find the customer name and email with phone number 
8015551212", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "Sean Miller", + "sean.miller@gmail.com" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Sean Miller, sean.miller@gmail.com" + }, + "intent_template_id": 364, + "old_task_id": 211 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 51, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "What are the key aspects that the customers don't like about {{product}}", + "instantiation_dict": { + "product": "Circe ice fleece" + }, + "intent": "What are the key aspects that the customers don't like about Circe ice fleece", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "exact_match": "N/A" + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "N/A" + }, + "intent_template_id": 249, + "old_task_id": 215 + }, + { + "sites": [ + "map" + ], + "task_id": 52, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Show me the walking distance from nearby hotels to {{location}} that take at most {{n}} minutes?", + "instantiation_dict": { + "location": "Gardner Steel Conference Center,", + "n": 5 + }, + "intent": "Show me the walking distance from nearby hotels to Gardner Steel Conference Center, that take at most 5 minutes?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "Wyndham Pittsburgh University Cente: 375m", + "The Oaklander Hotel: 338m" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Wyndham Pittsburgh University Cente: 375 m\nThe Oaklander 
Hotel: 338 m" + }, + "intent_template_id": 41, + "old_task_id": 220 + }, + { + "sites": [ + "map" + ], + "task_id": 53, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "I am at CMU Pittsburgh, how long it takes to the nearest {{location}} with different transportation methods?", + "instantiation_dict": { + "location": "USPS postal office" + }, + "intent": "I am at CMU Pittsburgh, how long it takes to the nearest USPS postal office with different transportation methods?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "Walk: 1 minute", + "Drive: less than 1 minute", + "Bike: less than 1 minute" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Walk: 1 minute to walk and\nDrive: less than 1 minute\nBike: less than 1 minute" + }, + "intent_template_id": 35, + "old_task_id": 221 + }, + { + "sites": [ + "shopping" + ], + "task_id": 54, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "What do customers say about {{product_type}} from {{manufature}}", + "instantiation_dict": { + "product_type": "brush", + "manufature": "sephora" + }, + "intent": "What do customers say about brush from sephora", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "No reviews available for the sephora brushes." 
+ ] + }, + "reference_url": "", + "program_html": [], + "string_note": "The sephora brushes don't have reviews", + "reference_answer_raw_annotation": "N/A" + }, + "intent_template_id": 135, + "old_task_id": 225 + }, + { + "sites": [ + "shopping" + ], + "task_id": 55, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "What is the price range for products from {{brand}}?", + "instantiation_dict": { + "brand": "EYZUTAK" + }, + "intent": "What is the price range for products from EYZUTAK?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "9.99" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "$9.99" + }, + "intent_template_id": 370, + "old_task_id": 227 + }, + { + "sites": [ + "shopping" + ], + "task_id": 56, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Get the order number of my most recent {{status}} order ", + "instantiation_dict": { + "status": "under delivery" + }, + "intent": "Get the order number of my most recent under delivery order ", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "N/A" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "There is no under delivery order" + }, + "intent_template_id": 213, + "old_task_id": 235 + }, + { + "sites": [ + "map" + ], + "task_id": 57, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Where is the nearest {{location}} from {{location2}} {{condition}}", + "instantiation_dict": { + "location": "pharmacy", + "location2": "Carnegie Mellon", + "condition": "I can walk within 
20mins" + }, + "intent": "Where is the nearest pharmacy from Carnegie Mellon I can walk within 20mins", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "Schiller's Pharmacy", + "811", + "South Aiken Avenue", + "Shadyside", + "Pittsburgh" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Schiller's Pharmacy, 811, South Aiken Avenue, Shadyside, Pittsburgh, Allegheny County, 15232, United States" + }, + "intent_template_id": 39, + "old_task_id": 236 + }, + { + "sites": [ + "shopping" + ], + "task_id": 58, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "I am doing a market survey for one stop market, show me the most expensive product from {{product_category}} category", + "instantiation_dict": { + "product_category": "competative swimwear" + }, + "intent": "I am doing a market survey for one stop market, show me the most expensive product from competative swimwear category", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__SHOPPING__/clothing-shoes-jewelry/sport-specific-clothing/competitive-swimwear.html?product_list_order=price&product_list_dir=desc", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 138, + "old_task_id": 240 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 59, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Show me the {{information}} of the customer who is the most unhappy with {{product}}", + "instantiation_dict": { + "information": "email address", + "product": "the style of Zoe products" + }, + "intent": "Show me the email address of the customer who is the most unhappy with the 
style of Zoe products", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "exact_match": "N/A" + }, + "reference_url": "Valorie doesn't have a email in the system", + "program_html": [], + "string_note": "There is no negative review for Zoe products", + "reference_answer_raw_annotation": "N/A" + }, + "intent_template_id": 244, + "old_task_id": 247 + }, + { + "sites": [ + "map" + ], + "task_id": 60, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Tell me the coordinates of {{location}} in DD format", + "instantiation_dict": { + "location": "Apple Store near Pitt" + }, + "intent": "Tell me the coordinates of Apple Store near Pitt in DD format", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "40.451", + "-79.933" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "40.4511693, -79.9334241" + }, + "intent_template_id": 46, + "old_task_id": 250 + }, + { + "sites": [ + "map" + ], + "task_id": 61, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "What is the {{information}} of {{location}}", + "instantiation_dict": { + "location": "Western Pennsylvania Hospital", + "information": "phone number" + }, + "intent": "What is the phone number of Western Pennsylvania Hospital", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "+1 412 578 5000" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "4125785000" + }, + "intent_template_id": 501, + "old_task_id": 254 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 62, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", 
+ "geolocation": null, + "intent_template": "See all public projects", + "instantiation_dict": {}, + "intent": "See all public projects", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__GITLAB__/explore", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 325, + "old_task_id": 258 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 63, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Get me my RSS feed token", + "instantiation_dict": {}, + "intent": "Get me my RSS feed token", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "TMN_bBn9Z48qVbUFZV45" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "TMN_bBn9Z48qVbUFZV45" + }, + "intent_template_id": 312, + "old_task_id": 259 + }, + { + "sites": [ + "wikipedia", + "map" + ], + "task_id": 64, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "What's the closest national park to {{city}}? How long does it take to bike there?", + "instantiation_dict": { + "city": "Vinalhaven, ME" + }, + "intent": "What's the closest national park to Vinalhaven, ME? 
How long does it take to bike there?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "Acadia National Park" + ], + "fuzzy_match": [ + "10h 33min" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Acadia National Park\n10h 33min" + }, + "intent_template_id": 85, + "old_task_id": 268 + }, + { + "sites": [ + "shopping" + ], + "task_id": 65, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Show me products under ${{price}} in \"{{product_category}}\" category", + "instantiation_dict": { + "price": "30", + "product_category": "men shoes" + }, + "intent": "Show me products under $30 in \"men shoes\" category", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__SHOPPING__/clothing-shoes-jewelry/men/shoes.html?price=0-30", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 139, + "old_task_id": 270 + }, + { + "sites": [ + "shopping" + ], + "task_id": 66, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Search for \"{{keyword}}\"", + "instantiation_dict": { + "keyword": "switch accessories" + }, + "intent": "Search for \"switch accessories\"", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__SHOPPING__/catalogsearch/result/?q=switch+accessories", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 212, + "old_task_id": 276 + }, + { + "sites": [ + "shopping" + ], + "task_id": 67, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": 
"Look up the most recent models of XBox controllers released between 2020-2021?", + "instantiation_dict": {}, + "intent": "Look up the most recent models of XBox controllers released between 2020-2021?", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__SHOPPING__/microsoft-xbox-controller-carbon-black-for-series-x-series-s-xbox-one-windows-10-android-ios-bundled-with-dual-port-charging-dock-xbox-controller-skin-voucher-premgear-cloth.html", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 210, + "old_task_id": 283 + }, + { + "sites": [ + "shopping" + ], + "task_id": 68, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Show the least expensive {{product}} with a minimum storage capacity of {{min_storage}}.", + "instantiation_dict": { + "product": "switch card holder", + "min_storage": "15 cards" + }, + "intent": "Show the least expensive switch card holder with a minimum storage capacity of 15 cards.", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__SHOPPING__/game-card-holder-storage-case-for-nintendo-switch-games-or-ps-vita-game-case-or-sd-memory-cards-black.html", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 207, + "old_task_id": 285 + }, + { + "sites": [ + "map" + ], + "task_id": 69, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "How much time does it take from Pittsburgh to Philadelphia by car?", + "instantiation_dict": {}, + "intent": "How much time does it take from Pittsburgh to Philadelphia by car?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "8h 33min" + ] + }, + "reference_url": "", + 
"program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "5h 47min" + }, + "intent_template_id": 47, + "old_task_id": 287 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 70, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Tell me the {{attribute}} of the customer who has the most cancellations in the history", + "instantiation_dict": { + "attribute": "name" + }, + "intent": "Tell me the name of the customer who has the most cancellations in the history", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "Samantha Jones" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Samantha Jones" + }, + "intent_template_id": 234, + "old_task_id": 288 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 71, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Show me the command to clone {{repo}} with SSH.", + "instantiation_dict": { + "repo": "the best GAN python implementation" + }, + "intent": "Show me the command to clone the best GAN python implementation with SSH.", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "exact_match": "ssh://git@localhost:2222/eriklindernoren/PyTorch-GAN.git" + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "ssh://git@localhost:2222/eriklindernoren/PyTorch-GAN.git" + }, + "intent_template_id": 329, + "old_task_id": 296 + }, + { + "sites": [ + "shopping" + ], + "task_id": 72, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Show the most recent {{status}} order", 
+ "instantiation_dict": { + "status": "pending" + }, + "intent": "Show the most recent pending order", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__SHOPPING__/sales/order/view/order_id/189/", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 180, + "old_task_id": 300 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 73, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Tell me who has made the most contributions, in terms of number of commits, to the {{repo}} project", + "instantiation_dict": { + "repo": "Pytorch GAN" + }, + "intent": "Tell me who has made the most contributions, in terms of number of commits, to the Pytorch GAN project", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "Erik Linder-Norén" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Erik Linder-Norén" + }, + "intent_template_id": 323, + "old_task_id": 311 + }, + { + "sites": [ + "shopping" + ], + "task_id": 74, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Which number to call for the customer service?", + "instantiation_dict": {}, + "intent": "Which number to call for the customer service?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "exact_match": "N/A" + }, + "reference_url": "", + "program_html": [], + "string_note": "There is no phone number in the website", + "reference_answer_raw_annotation": "N/A" + }, + "intent_template_id": 134, + "old_task_id": 313 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 75, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + 
"start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "List the {{attribute}} of the top 3 contributors to {{repo}} repo, ranked by the number of commits?", + "instantiation_dict": { + "repo": "2019-nCov", + "attribute": "last names" + }, + "intent": "List the last names of the top 3 contributors to 2019-nCov repo, ranked by the number of commits?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "Lo", + "Chen", + "Chu" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Lo, Chen, Chu" + }, + "intent_template_id": 324, + "old_task_id": 318 + }, + { + "sites": [ + "shopping" + ], + "task_id": 76, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "How much refund I should expect from my order canlled in {{time}}, including shipping fee", + "instantiation_dict": { + "time": "2022" + }, + "intent": "How much refund I should expect from my order canlled in 2022, including shipping fee", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "3053.97" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "3053.97" + }, + "intent_template_id": 160, + "old_task_id": 321 + }, + { + "sites": [ + "shopping" + ], + "task_id": 77, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Show me the \"{{product}}\" listings by {{sorting_order}}.", + "instantiation_dict": { + "product": "chairs", + "sorting_order": "ascending price" + }, + "intent": "Show me the \"chairs\" listings by ascending price.", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + 
"reference_url": "__SHOPPING__/catalogsearch/result/index/?product_list_order=price&q=chairs&product_list_dir=asc", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 208, + "old_task_id": 324 + }, + { + "sites": [ + "shopping" + ], + "task_id": 78, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "How much did I spend on shopping at One Stop Market {{time}}? They gave me a 20% discount on the total amount for orders exceeding $200 in cash", + "instantiation_dict": { + "time": "on November 2022" + }, + "intent": "How much did I spend on shopping at One Stop Market on November 2022? They gave me a 20% discount on the total amount for orders exceeding $200 in cash", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "359.546" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "359.546" + }, + "intent_template_id": 147, + "old_task_id": 333 + }, + { + "sites": [ + "shopping" + ], + "task_id": 79, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Tell me when I last ordered my {{description}}?", + "instantiation_dict": { + "description": "body butter" + }, + "intent": "Tell me when I last ordered my body butter?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "January 16th 2023" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "January 16th 2023" + }, + "intent_template_id": 169, + "old_task_id": 335 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 80, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": 
"__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "How many reviews our shop received {{time}}?", + "instantiation_dict": { + "time": "in May 2023" + }, + "intent": "How many reviews our shop received in May 2023?", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "0" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "0" + }, + "intent_template_id": 248, + "old_task_id": 348 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 81, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Who else have access to my repo {{repo}}, show me their usernames", + "instantiation_dict": { + "repo": "gimmiethat.space" + }, + "intent": "Who else have access to my repo gimmiethat.space, show me their usernames", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "exact_match": "yjlou" + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "yjlou" + }, + "intent_template_id": 298, + "old_task_id": 349 + }, + { + "sites": [ + "shopping" + ], + "task_id": 82, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "List products from {{product_category}} category by {{order}} price", + "instantiation_dict": { + "product_category": "living room furtniture", + "order": "descending" + }, + "intent": "List products from living room furtniture category by descending price", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__SHOPPING__/home-kitchen/furniture/living-room-furniture.html?product_list_order=price&product_list_dir=desc", + "program_html": [], + "url_note": "GOLD in 
PRED" + }, + "intent_template_id": 137, + "old_task_id": 354 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 83, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Checkout merge requests requiring my review", + "instantiation_dict": {}, + "intent": "Checkout merge requests requiring my review", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__GITLAB__/dashboard/merge_requests?reviewer_username=byteblaze", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 291, + "old_task_id": 357 + }, + { + "sites": [ + "shopping" + ], + "task_id": 84, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Show me the {{info}} for order number {{order_number}}.", + "instantiation_dict": { + "info": "order statuses", + "order_number": "170 and 189" + }, + "intent": "Show me the order statuses for order number 170 and 189.", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "170: cancelled", + "189: pending" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "170: cancelled, 189: pending" + }, + "intent_template_id": 206, + "old_task_id": 361 + }, + { + "sites": [ + "map" + ], + "task_id": 85, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Measure distance between {{location/address_1}} and {{location/address_2}} by walking", + "instantiation_dict": { + "location/address_1": "Carnegie Mellon University", + "location/address_2": "CVS (closet one)" + }, + "intent": "Measure distance between Carnegie Mellon University and CVS (closet one) by walking", + "require_reset": false, + 
"eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "1.4km" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "1.4km" + }, + "intent_template_id": 58, + "old_task_id": 367 + }, + { + "sites": [ + "shopping" + ], + "task_id": 86, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "find discounted items.", + "instantiation_dict": {}, + "intent": "find discounted items.", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "exact_match": "N/A" + }, + "reference_url": "", + "program_html": [], + "string_note": "There is no function to show only discount items", + "reference_answer_raw_annotation": "There is no function to show only discount items." + }, + "intent_template_id": 188, + "old_task_id": 368 + }, + { + "sites": [ + "map" + ], + "task_id": 87, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Pull up the description page of {{location}} on Map", + "instantiation_dict": { + "location": "Carnegie Music Hall" + }, + "intent": "Pull up the description page of Carnegie Music Hall on Map", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": null, + "program_html": [ + { + "url": "last", + "locator": "document.querySelector('[id=\"sidebar_content\"').outerText", + "required_contents": { + "must_include": [ + "Carnegie Music Hall" + ] + } + } + ] + }, + "intent_template_id": 52, + "old_task_id": 369 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 88, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Preview the {{name}} theme for my shop", + 
"instantiation_dict": { + "name": "Magento Blank" + }, + "intent": "Preview the Magento Blank theme for my shop", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__SHOPPING_ADMIN__/admin/system_design_theme/edit/id/1", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 266, + "old_task_id": 374 + }, + { + "sites": [ + "shopping" + ], + "task_id": 89, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Summarize customer reviews for {{product}}.", + "instantiation_dict": { + "product": "Amazon Echo Dot 3rd generation" + }, + "intent": "Summarize customer reviews for Amazon Echo Dot 3rd generation.", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "Serena: It was the exact same one that came with my echo show that my puppy", + "Amazon Customer: It doesn’t work .", + "BlessedGirl: Great replacement for Echo Dot.", + "Jeff Sederquist: Great product no issues. It's for a Echo show 5", + "melvin m.: Very good charger but I had to buy this one because when I bought an echo show 5, I came with an echo show 10 charger in the package with the device", + "Jack Smith: Looks like pictures", + "Ignacio: Muy bueno de buena calidad", + "BusyB: My kids bent the original cord for their Echo show 5 and this replacement cord is just like the one it came with. It works just fine. Been using it for 2 weeks now and no problems.", + "Kee Williams: Great value works better than original", + "Howard A. Sinclair: Great product!!! Works exactly as it is supposed to." 
+ ] + }, + "reference_url": "", + "program_html": [] + }, + "intent_template_id": 182, + "old_task_id": 376 + }, + { + "sites": [ + "map" + ], + "task_id": 90, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Find the {{space}} around {{location}}", + "instantiation_dict": { + "location": "Carnegie Music Hall", + "space": "hotel" + }, + "intent": "Find the hotel around Carnegie Music Hall", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__MAP__/search?query=hotels%20near%20Carnegie%20Music%20Hall", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 59, + "old_task_id": 381 + }, + { + "sites": [ + "map" + ], + "task_id": 91, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "I am arriving at Carnegie Mellon University. Find the nearby US Citizenship and Immigration Services and the walking distance to the nearest Social Security Administration from US Citizenship and Immigration Services", + "instantiation_dict": {}, + "intent": "I am arriving at Carnegie Mellon University. Find the nearby US Citizenship and Immigration Services and the walking distance to the nearest Social Security Administration from US Citizenship and Immigration Services", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "fuzzy_match": [ + "N/A" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "There is no USCIS nearby", + "reference_answer_raw_annotation": "N/A" + }, + "intent_template_id": 781, + "old_task_id": 382 + }, + { + "sites": [ + "map" + ], + "task_id": 92, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "I am arriving at Pittsburgh Airport. 
Show me the name of a Hyatt hotel if there is any nearby. Tell me the names of supermarkets that are within 15mins driving from the hotel", + "instantiation_dict": {}, + "intent": "I am arriving at Pittsburgh Airport. Show me the name of a Hyatt hotel if there is any nearby. Tell me the names of supermarkets that are within 15mins driving from the hotel", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "Hyatt Regency Pittsburgh International Airport", + "Giant Eagle", + "ALDI" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Hyatt Regency Pittsburgh International Airport Giant Eagle, ALDI" + }, + "intent_template_id": 782, + "old_task_id": 383 + }, + { + "sites": [ + "shopping" + ], + "task_id": 93, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "List the customer names who complain about the quality of EYZUTAK phone cases", + "instantiation_dict": {}, + "intent": "List the customer names who complain about the quality of EYZUTAK phone cases", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "Lisa Lee", + "Evelyn Kurver", + "Amanda", + "N Randall" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Lisa Lee, Evelyn Kurver, Amanda, N Randall" + }, + "intent_template_id": 666, + "old_task_id": 384 + }, + { + "sites": [ + "shopping" + ], + "task_id": 94, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "What is the rating of {{product}}", + "instantiation_dict": { + "product": "Ugreen lightning to 3.5mm cable" + }, + "intent": "What is the rating of Ugreen lightning to 3.5mm cable. 
Please round to the nearest whole number", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "65 |OR| 3" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "Around 3.1 stars" + }, + "intent_template_id": 1355, + "old_task_id": 386 + }, + { + "sites": [ + "shopping" + ], + "task_id": 95, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Who gave {{stars}} for phone cases from EYZUTAK", + "instantiation_dict": { + "stars": "4 or 5 stars" + }, + "intent": "Who gave 4 or 5 stars for phone cases from EYZUTAK", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "must_include": [ + "MH", + "Misba009", + "Amanda", + "Amazon Customer", + "Cally", + "Bethany Robertson" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "", + "reference_answer_raw_annotation": "MH, Misba009, Amanda, Amazon Customer, Cally, Bethany Robertson, Bethany Robertson" + }, + "intent_template_id": 1356, + "old_task_id": 387 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 96, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Post \"{{content}}\" for the merge request related to {{mr}} in {{repo}} project", + "instantiation_dict": { + "mr": "color ulitity", + "repo": "a11yproject.com", + "content": "Good idea" + }, + "intent": "Post \"Good idea\" for the merge request related to color ulitity in a11yproject.com project", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "__GITLAB__/a11yproject/a11yproject.com/-/merge_requests/1071", + "program_html": [ + { + "url": "last", + "locator": 
"document.querySelector('[id=\"notes-list\"').lastElementChild.querySelector('.timeline-discussion-body').outerText", + "required_contents": { + "exact_match": "Good idea" + } + } + ] + }, + "intent_template_id": 348, + "old_task_id": 392 + }, + { + "sites": [ + "reddit" + ], + "task_id": 97, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Change my reddit bio to \"{{content}}\"", + "instantiation_dict": { + "content": "Seeking SDE positions" + }, + "intent": "Change my reddit bio to \"Seeking SDE positions\"", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__REDDIT__/user/MarvelsGrantMan136", + "locator": "document.querySelector(\".user-bio__biography\").outerText", + "required_contents": { + "exact_match": "Seeking SDE positions" + } + } + ] + }, + "intent_template_id": 6, + "old_task_id": 401 + }, + { + "sites": [ + "reddit" + ], + "task_id": 98, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Upvote the newest post in {{subreddit}} subreddit", + "instantiation_dict": { + "subreddit": "books" + }, + "intent": "Upvote the newest post in books subreddit", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__REDDIT__/f/books/124260/adults-reading-to-each-other-out-loud", + "locator": "document.querySelector('div.submission__vote').querySelector('form').getAttribute('class')", + "required_contents": { + "must_include": [ + "vote vote--user-upvoted" + ] + } + } + ] + }, + "intent_template_id": 22, + "old_task_id": 404 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 99, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + 
"start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Go to the merge request on {{topic}} I have to review, find if the author of the merge request responded at the end, and reply \"Thank you\" if he did. Otherwise remind him with a simple @.", + "instantiation_dict": { + "topic": "verification functions" + }, + "intent": "Go to the merge request on verification functions I have to review, find if the author of the merge request responded at the end, and reply \"Thank you\" if he did. Otherwise remind him with a simple @.", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/byteblaze/a11y-webring.club/-/merge_requests/40", + "locator": "document.querySelector('[id=\"notes-list\"').lastElementChild.querySelector('.timeline-discussion-body').outerText", + "required_contents": { + "exact_match": "@davepgreene" + } + } + ] + }, + "intent_template_id": 360, + "old_task_id": 415 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 100, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Set my gitlab status as {{status}}.", + "instantiation_dict": { + "status": "Enjoying life" + }, + "intent": "Set my gitlab status as Enjoying life.", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/byteblaze", + "locator": "document.querySelector('.cover-status').lastChild.textContent", + "required_contents": { + "exact_match": "Enjoying life" + } + } + ] + }, + "intent_template_id": 361, + "old_task_id": 419 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 101, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": 
"Mark all {{brand}} shirts on sale", + "instantiation_dict": { + "brand": "Hollister" + }, + "intent": "Mark all Hollister shirts on sale", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING_ADMIN__/catalog/product/edit/id/126/", + "locator": "document.querySelector('input[name=\"product[sale]\"]').value", + "required_contents": { + "exact_match": "1" + } + } + ] + }, + "intent_template_id": 237, + "old_task_id": 423 + }, + { + "sites": [ + "wikipedia", + "map" + ], + "task_id": 102, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Find the page of {{description}} on the map.", + "instantiation_dict": { + "description": "the place in Pennsylvania where a plane crashed during the September 11th attacks" + }, + "intent": "Find the page of the place in Pennsylvania where a plane crashed during the September 11th attacks on the map.", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector('[id=\"sidebar_content\"').outerText", + "required_contents": { + "must_include": [ + "Somerset County" + ] + } + } + ] + }, + "intent_template_id": 371, + "old_task_id": 426 + }, + { + "sites": [ + "shopping" + ], + "task_id": 103, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__/tall-pink-taper-candles-4-piece-orange-colored-tapered-candles-gradient-candles-10-6-inches-tall-tie-dye-candle-set-large-dripless-long-burning-candlesticks-two-color-taper-candles-candlesticks.html |AND| __SHOPPING__/spaas-white-taper-candles-4-pack-10-inch-tall-candles-scent-free-premium-wax-candle-sticks-8-hour-long-burning-white-candlesticks-for-home-decoration-wedding-holiday-and-parties.html |AND| 
__SHOPPING__/white-starfish-wall-candle-sconces-set-of-2-beach-decor-ocean-themed-wall-mount-candleholders-nautical-style-beach-bathroom-decor-coastal-farmhouse-seashell-candle-holders.html", + "geolocation": null, + "intent_template": "Add the product with the lowest per unit price from my open tabs to the shopping cart", + "instantiation_dict": {}, + "intent": "Add the product with the lowest per unit price from my open tabs to the shopping cart", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING__/checkout/cart", + "locator": "", + "required_contents": { + "must_include": [ + "SPAAS White Taper Candles - 4 Pack |OR| 10 Inch Tall Candles, Scent-Free Premium Wax Candle Sticks |OR| 8 Hour Long Burning White Candlesticks for Home Decoration, Wedding, Holiday and Parties" + ] + } + } + ] + }, + "intent_template_id": 145, + "old_task_id": 431 + }, + { + "sites": [ + "shopping" + ], + "task_id": 104, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "I previously ordered some {{product}} {{time}} and later cancelled. Can you reorder it for me?", + "instantiation_dict": { + "product": "a make up removal kit", + "time": "during summer 2022" + }, + "intent": "I previously ordered some a make up removal kit during summer 2022 and later cancelled. 
Can you reorder it for me?", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "func:shopping_get_latest_order_url()", + "locator": "document.querySelector(\".order-details-items.ordered\").outerText", + "required_contents": { + "must_include": [ + "B0738JQG6Q" + ] + } + } + ] + }, + "intent_template_id": 156, + "old_task_id": 440 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 105, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "set the homepage URL on my GitLab profile to {{url}}", + "instantiation_dict": { + "url": "https://egg.tart.com" + }, + "intent": "set the homepage URL on my GitLab profile to https://egg.tart.com", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/byteblaze", + "locator": "document.querySelector('.profile-header [itemprop=\"url\"]').outerText", + "required_contents": { + "exact_match": "egg.tart.com" + } + } + ] + }, + "intent_template_id": 331, + "old_task_id": 448 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 106, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Disable {{product}} from the site, they are facing some quality issues.", + "instantiation_dict": { + "product": "Ryker Tee Crew Neck" + }, + "intent": "Disable Ryker Tee Crew Neck from the site, they are facing some quality issues.", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING_ADMIN__/catalog/product/edit/id/478/", + "locator": "document.querySelector('[name=\"product[status]\"').value", + 
"required_contents": { + "exact_match": "2" + } + } + ] + }, + "intent_template_id": 242, + "old_task_id": 454 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 107, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__/catalog/product/edit/id/1481/", + "geolocation": null, + "intent_template": "{{action}} the price of this product by {{amount}}", + "instantiation_dict": { + "amount": "$5", + "action": "Reduce" + }, + "intent": "Reduce the price of this product by $5", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING_ADMIN__/catalog/product/edit/id/1481/", + "locator": "document.querySelector('[name=\"product[price]\"').value", + "required_contents": { + "exact_match": "27.00" + } + } + ] + }, + "intent_template_id": 247, + "old_task_id": 458 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 108, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Update the product description of {{product}} to highlight the real user positive reviews by quoting the comments", + "instantiation_dict": { + "product": "Antonia Racer Tank" + }, + "intent": "Update the product description of Antonia Racer Tank to highlight the real user positive reviews by quoting the comments", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING_ADMIN__/../antonia-racer-tank.html", + "locator": "document.querySelector('.data.item.content').outerText + (document.querySelector('.product.attribute.overview [itemprop=\"description\"]')?.outerText || '')", + "required_contents": { + "must_include": [ + "This is in regular rotation at the gym", + "Its colorful and looks kinda 
cute under my exercise tanks", + "it's very stylish for yoga or something else low impact" + ] + } + } + ] + }, + "intent_template_id": 251, + "old_task_id": 464 + }, + { + "sites": [ + "shopping" + ], + "task_id": 109, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Add {{product}} to my wish list", + "instantiation_dict": { + "product": "2 Hawaiian Bamboo Orchid Roots #zc50 - by Discount Hawaiian Gifts" + }, + "intent": "Add 2 Hawaiian Bamboo Orchid Roots #zc50 - by Discount Hawaiian Gifts to my wish list", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING__/wishlist/", + "locator": "document.querySelector('.products-grid.wishlist').outerText", + "required_contents": { + "must_include": [ + "2 Hawaiian Bamboo Orchid Roots #zc50 - by Discount Hawaiian Gifts" + ] + } + } + ] + }, + "intent_template_id": 186, + "old_task_id": 466 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 110, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Cancel order {{id}}", + "instantiation_dict": { + "id": "302" + }, + "intent": "Cancel order 302", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING_ADMIN__/sales/order/view/order_id/302/", + "locator": "document.querySelector(\"#order_status\").outerText", + "required_contents": { + "exact_match": "Canceled" + } + } + ] + }, + "intent_template_id": 257, + "old_task_id": 470 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 111, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + 
"intent_template": "Set up a new, empty repository with the name {{project_name}}?", + "instantiation_dict": { + "project_name": "awesome_llm_reading" + }, + "intent": "Set up a new, empty repository with the name awesome_llm_reading?", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/byteblaze/awesome_llm_reading", + "locator": "", + "required_contents": { + "must_include": [ + "awesome_llm_reading" + ] + } + } + ] + }, + "intent_template_id": 292, + "old_task_id": 476 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 112, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "{{name}} wants to check my dotfile configurations. Please invite him to the repo as a guest.", + "instantiation_dict": { + "name": "Vinta" + }, + "intent": "Vinta wants to check my dotfile configurations. Please invite him to the repo as a guest.", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/byteblaze/dotfiles/-/project_members", + "locator": "func:gitlab_get_project_memeber_role(__page__, 'vinta')", + "required_contents": { + "must_include": [ + "Guest" + ] + } + } + ] + }, + "intent_template_id": 294, + "old_task_id": 485 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 113, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Change the page title of \"{{old-heading}}\" page on my site to \"{{heading}}\".", + "instantiation_dict": { + "old-heading": "Home Page", + "heading": "This is the home page!! Leave here!!" + }, + "intent": "Change the page title of \"Home Page\" page on my site to \"This is the home page!! 
Leave here!!\".", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING_ADMIN__/cms/page/edit/page_id/2/", + "locator": "document.querySelector('input[name=\"title\"]').value", + "required_contents": { + "exact_match": "This is the home page!! Leave here!!" + } + } + ] + }, + "intent_template_id": 275, + "old_task_id": 488 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 114, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Notify {{name}} in their most recent pending order with message \"{{message}}\"", + "instantiation_dict": { + "name": "Sarah Miller", + "message": "the order is ready to be shipped soon!" + }, + "intent": "Notify Sarah Miller in their most recent pending order with message \"the order is ready to be shipped soon!\"", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "exact_match": "N/A" + }, + "reference_url": "", + "program_html": [], + "string_note": "System message: We cannot add order history." 
+ }, + "intent_template_id": 280, + "old_task_id": 491 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 115, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Update order #{{order}} with the {{service}} tracking number {{tracking}}", + "instantiation_dict": { + "tracking": "24353446464", + "order": "307", + "service": "DHL" + }, + "intent": "Update order #307 with the DHL tracking number 24353446464", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING_ADMIN__/sales/order/commentsHistory/order_id/307/active_tab/order_shipments/", + "locator": "", + "required_contents": { + "must_include": [ + "Tracking number 24353446464 for DHL assigned" + ] + } + } + ] + }, + "intent_template_id": 284, + "old_task_id": 497 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 116, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Make all {{product}} as out of stock", + "instantiation_dict": { + "product": "Aeno capri" + }, + "intent": "Make all Aeno capri as out of stock", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING_ADMIN__/catalog/product/edit/id/1861/", + "locator": "document.querySelector('[name=\"product[quantity_and_stock_status][is_in_stock]\"').value", + "required_contents": { + "exact_match": "0" + } + } + ] + }, + "intent_template_id": 287, + "old_task_id": 505 + }, + { + "sites": [ + "shopping" + ], + "task_id": 117, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Buy the highest 
rated product from the {{product_category}} category within a budget {{dollar_value}}.", + "instantiation_dict": { + "product_category": "meat substitute", + "dollar_value": "between 100 and 200" + }, + "intent": "Buy the highest rated product from the meat substitute category within a budget between 100 and 200.", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "func:shopping_get_latest_order_url()", + "locator": "document.querySelector(\".order-details-items.ordered\").outerText", + "required_contents": { + "must_include": [ + "B01CTR3DLE" + ] + } + } + ] + }, + "intent_template_id": 172, + "old_task_id": 506 + }, + { + "sites": [ + "shopping" + ], + "task_id": 118, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Buy the best rating product from \"{{category}}\" category with at least 5 reviews and the product is least expensive", + "instantiation_dict": { + "category": "Men's shoe" + }, + "intent": "Buy the best rating product from \"Men's shoe\" category with at least 5 reviews and the product is least expensive", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "func:shopping_get_latest_order_url()", + "locator": "document.querySelector(\".order-details-items.ordered\").outerText", + "required_contents": { + "must_include": [ + "B01J4MM3KO" + ] + } + } + ] + }, + "intent_template_id": 216, + "old_task_id": 509 + }, + { + "sites": [ + "shopping" + ], + "task_id": 119, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Add a {{product}} to my wish list.", + "instantiation_dict": { + "product": "white desk" + }, + "intent": "Add a white 
desk to my wish list.", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING__/wishlist/", + "locator": "document.querySelector('.products-grid.wishlist').outerText", + "required_contents": { + "must_include": [ + "white", + "desk" + ] + } + } + ] + }, + "intent_template_id": 189, + "old_task_id": 514 + }, + { + "sites": [ + "shopping" + ], + "task_id": 120, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__/elmwood-inn-fine-teas-orange-vanilla-caffeine-free-fruit-infusion-16-ounce-pouch.html", + "geolocation": null, + "intent_template": "Add this product to my wishlist", + "instantiation_dict": {}, + "intent": "Add this product to my wishlist", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING__/wishlist/", + "locator": "document.querySelector('.products-grid.wishlist').outerText", + "required_contents": { + "must_include": [ + "Elmwood Inn Fine Teas, Orange Vanilla Caffeine-free Fruit Infusion, 16-Ounce Pouch" + ] + } + } + ] + }, + "intent_template_id": 196, + "old_task_id": 516 + }, + { + "sites": [ + "shopping" + ], + "task_id": 121, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Subscribe to the newsletter of OneStopMarket", + "instantiation_dict": {}, + "intent": "Subscribe to the newsletter of OneStopMarket", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING__/newsletter/manage/", + "locator": "document.querySelector('[title=\"General Subscription\"').checked.toString()", + "required_contents": { + "exact_match": "true" + } + } + ] + }, + 
"intent_template_id": 199, + "old_task_id": 521 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 122, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Star the top {{number}} most stared repos in Gitlab", + "instantiation_dict": { + "number": "eight" + }, + "intent": "Star the top eight most stared repos in Gitlab", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/users/byteblaze/starred", + "locator": "", + "required_contents": { + "must_include": [ + "AndroidSlidingUpPanel", + "create-react-app", + "ffmpeg-python", + "PHP_XLSXWriter", + "AndroidAsync", + "Pytorch-GAN", + "administrate", + "keycloak" + ] + } + } + ] + }, + "intent_template_id": 354, + "old_task_id": 524 + }, + { + "sites": [ + "shopping" + ], + "task_id": 123, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Draft a refund message via their \"contact us\" form for the {{product}} I bought {{time}}. It broke after three days of use. The shop requires the order id, the reason and the amount to refund in the message. Don't submit yet", + "instantiation_dict": { + "product": "phone screen protector", + "time": "March 2023" + }, + "intent": "Draft a refund message via their \"contact us\" form for the phone screen protector I bought March 2023. It broke after three days of use. The shop requires the order id, the reason and the amount to refund in the message. 
Don't submit yet", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector('[title=\"What’s on your mind?\"').value", + "required_contents": { + "fuzzy_match": [ + "refund", + "it broke after three days of use", + "000000180", + "12.99" + ] + } + } + ] + }, + "intent_template_id": 154, + "old_task_id": 528 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 124, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Follow {{account_list}} on Gitlab", + "instantiation_dict": { + "account_list": [ + "Jakub Klinkovský", + "Koushik", + "Vinta Chen" + ] + }, + "intent": "Follow ['Jakub Klinkovský', 'Koushik', 'Vinta Chen'] on Gitlab", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/users/byteblaze/following", + "locator": "document.querySelector('.user-profile').outerText", + "required_contents": { + "must_include": [ + "@lahwaacz", + "@koush", + "@vinta" + ] + } + } + ] + }, + "intent_template_id": 330, + "old_task_id": 534 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 125, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Modify the address of order #{{order_id}} to {{address}}", + "instantiation_dict": { + "order_id": "299", + "address": "456 Oak Avenue, New York, NY, 10001" + }, + "intent": "Modify the address of order #299 to 456 Oak Avenue, New York, NY, 10001", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING_ADMIN__/sales/order/view/order_id/299", + 
"locator": "", + "required_contents": { + "must_include": [ + "456 Oak Avenue", + "New York", + "10001" + ] + } + } + ] + }, + "intent_template_id": 240, + "old_task_id": 538 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 126, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Add a new {{option}} {{value}} to {{base_setting}} of {{product}}", + "instantiation_dict": { + "option": "color", + "value": "blue", + "base_setting": "size S and M", + "product": "Frankie Sweatshirt" + }, + "intent": "Add a new color blue to size S and M of Frankie Sweatshirt", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING_ADMIN__/catalog/product/edit/id/110/", + "locator": "document.querySelector('[data-index=\"configurable\"').outerText", + "required_contents": { + "must_include": [ + "Sweatshirt-M-Blue", + "Sweatshirt-S-Blue" + ] + } + } + ] + }, + "intent_template_id": 252, + "old_task_id": 548 + }, + { + "sites": [ + "gitlab", + "reddit" + ], + "task_id": 127, + "require_login": true, + "storage_state": "./.auth/gitlab.reddit_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "create a repository named {{name}} that includes a README file with the links to the most active {{num}} DIY ideas on DIY subreddit?", + "instantiation_dict": { + "name": "Do it myself", + "num": 8 + }, + "intent": "create a repository named Do it myself that includes a README file with the links to the most active 8 DIY ideas on DIY subreddit?", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/byteblaze/Do-it-myself/-/raw/main/README.md", + "locator": "", + "required_contents": { + "must_include": [ + 
"__REDDIT__/f/DIY/118903/separate-glued-plastic-parts", + "__REDDIT__/f/DIY/118923/how-would-you-fix-this-dryer-vent-mess", + "__REDDIT__/f/DIY/118935/basement-bulkhead-soffit-wall-framing", + "__REDDIT__/f/DIY/118904/ge-water-heater-pilot-light-won-t-stay-lit", + "__REDDIT__/f/DIY/118960/attempting-to-move-a-wall-outlet-in-my-basement-a-few-inches", + "__REDDIT__/f/DIY/118931/afci-outlet-question", + "__REDDIT__/f/DIY/118824/teflon-tape-to-attach-washing-machine-drain-hose-to-pipe", + "__REDDIT__/f/DIY/118866/paver-base-for-shed" + ] + } + } + ] + }, + "intent_template_id": 88, + "old_task_id": 566 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 128, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Invite {{collaborator_account_list}} as collaborator to {{repo}} repo", + "instantiation_dict": { + "collaborator_account_list": "Jakub Klinkovský and Benoît Blanchon", + "repo": "gimmiethat.space" + }, + "intent": "Invite Jakub Klinkovský and Benoît Blanchon as collaborator to gimmiethat.space repo", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/byteblaze/gimmiethat.space/-/project_members", + "locator": "", + "required_contents": { + "must_include": [ + "@lahwaacz", + "@bblanchon" + ] + } + } + ] + }, + "intent_template_id": 293, + "old_task_id": 567 + }, + { + "sites": [ + "shopping" + ], + "task_id": 129, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "I recently moved, my address is {{address}}, update my information on OneStopShopping accordingly", + "instantiation_dict": { + "address": "111 Magnolia Path, Atlanta, GA, 30303" + }, + "intent": "I recently moved, my address is 111 Magnolia Path, Atlanta, GA, 30303, update my information on 
OneStopShopping accordingly", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING__/customer/address", + "locator": "document.querySelector(\".box.box-address-shipping > .box-content\").outerText", + "required_contents": { + "must_include": [ + "111 Magnolia Path", + "Atlanta, Georgia, 30303" + ] + } + }, + { + "url": "__SHOPPING__/customer/address", + "locator": "document.querySelector(\".box.box-address-billing > .box-content\").outerText", + "required_contents": { + "must_include": [ + "111 Magnolia Path", + "Atlanta, Georgia, 30303" + ] + } + } + ] + }, + "intent_template_id": 165, + "old_task_id": 574 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 130, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Add the following users to my {{repo}} as {{role}}: {{user_list}}", + "instantiation_dict": { + "repo": "GitHub timeline item management extension", + "role": "maintainer", + "user_list": [ + "abisubramanya27", + "lahwaacz" + ] + }, + "intent": "Add the following users to my GitHub timeline item management extension as maintainer: ['abisubramanya27', 'lahwaacz']", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/byteblaze/remove-board-movement-events-from-the-github-issue-timeline/-/project_members", + "locator": "func:gitlab_get_project_memeber_role(__page__, 'abisubramanya27')", + "required_contents": { + "must_include": [ + "Maintainer" + ] + } + }, + { + "url": "__GITLAB__/byteblaze/remove-board-movement-events-from-the-github-issue-timeline/-/project_members", + "locator": "func:gitlab_get_project_memeber_role(__page__, 'lahwaacz')", + "required_contents": { + "must_include": [ + "Maintainer" + ] + } + } + ] + }, + 
"intent_template_id": 351, + "old_task_id": 577 + }, + { + "sites": [ + "reddit" + ], + "task_id": 131, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Create a new forum named {{name}}, with a description of {{description}}, and include {{sidebar_list}} in the sidebar?", + "instantiation_dict": { + "name": "Cyberpunk", + "description": "Welcome to the future", + "sidebar_list": [ + "Games", + "Books", + "Movies", + "Future" + ] + }, + "intent": "Create a new forum named Cyberpunk, with a description of Welcome to the future, and include ['Games', 'Books', 'Movies', 'Future'] in the sidebar?", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__REDDIT__/f/Cyberpunk/edit", + "locator": "document.querySelector(\"#forum_description\").value", + "required_contents": { + "must_include": [ + "Welcome to the future" + ] + } + }, + { + "url": "__REDDIT__/f/Cyberpunk/edit", + "locator": "document.querySelector(\"#forum_sidebar\").value", + "required_contents": { + "must_include": [ + "Games", + "Books", + "Movies", + "Future" + ] + } + } + ] + }, + "intent_template_id": 7, + "old_task_id": 582 + }, + { + "sites": [ + "reddit" + ], + "task_id": 132, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Open the thread of a trending post on the forum \"{{subreddit}}\" and subscribe.", + "instantiation_dict": { + "subreddit": "machine learning" + }, + "intent": "Open the thread of a trending post on the forum \"machine learning\" and subscribe.", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/MachineLearning", + "program_html": [ + { + "url": "__REDDIT__/", + "locator": 
"document.querySelector(\"#sidebar > section\").outerText", + "required_contents": { + "must_include": [ + "MachineLearning" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 4, + "old_task_id": 599 + }, + { + "sites": [ + "reddit" + ], + "task_id": 133, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Post my question, \"{{question}}\", in a subreddit where I'm likely to get an answer", + "instantiation_dict": { + "question": "is car necessary in NYC" + }, + "intent": "Post my question, \"is car necessary in NYC\", in a subreddit where I'm likely to get an answer", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/nyc", + "program_html": [ + { + "url": "func:reddit_get_post_url('__last_url__')", + "locator": "document.querySelector('.submission__inner').outerText", + "required_contents": { + "must_include": [ + "is car necessary in NYC" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 3765, + "old_task_id": 601 + }, + { + "sites": [ + "reddit" + ], + "task_id": 134, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Find a subreddit focused on topics related to {{topic}}, and post my question, \"{{question}}\" there", + "instantiation_dict": { + "topic": "gaming consoles", + "question": "what is the recommended console to buy these days" + }, + "intent": "Find a subreddit focused on topics related to gaming consoles, and post my question, \"what is the recommended console to buy these days\" there", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/consoles", + "program_html": [ + { + "url": 
"func:reddit_get_post_url('__last_url__')", + "locator": "document.querySelector('.submission__inner').outerText", + "required_contents": { + "must_include": [ + "what is the recommended console to buy these days" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 5, + "old_task_id": 605 + }, + { + "sites": [ + "reddit" + ], + "task_id": 135, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Post a review of my recent reading \"{{book}}\" in the r/books with my comment \"{{content}}\".", + "instantiation_dict": { + "book": "big little lies", + "content": "can't stop it" + }, + "intent": "Post a review of my recent reading \"big little lies\" in the r/books with my comment \"can't stop it\".", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/books", + "program_html": [ + { + "url": "func:reddit_get_post_url('__last_url__')", + "locator": "document.querySelector('.submission__inner').outerText", + "required_contents": { + "must_include": [ + "big little lies", + "can't stop it" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 9, + "old_task_id": 612 + }, + { + "sites": [ + "reddit" + ], + "task_id": 136, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__/f/pics", + "geolocation": null, + "intent_template": "Re-post the image of {{content}} in this page to {{subreddit}} subreddit and note \"from /f/pics\"", + "instantiation_dict": { + "content": "costume contest", + "subreddit": "funny" + }, + "intent": "Re-post the image of costume contest in this page to funny subreddit and note \"from /f/pics\"", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/funny", + 
"program_html": [ + { + "url": "func:reddit_get_post_url('__last_url__')", + "locator": "document.querySelector('.submission__inner').outerText", + "required_contents": { + "must_include": [ + "from /f/pics" + ] + } + }, + { + "url": "func:reddit_get_post_url('__last_url__')", + "locator": "[...document.querySelector('.submission__inner').querySelectorAll('[href],[src]')].map(elem => elem.getAttribute('href') || elem.getAttribute('src')).join(' ')", + "required_contents": { + "must_include": [ + "bd8bc5f4c846aac4df08626faa3a34a7d47c8f3bdd92bf615a54afd939f063a7.jpg" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 11, + "old_task_id": 619 + }, + { + "sites": [ + "reddit" + ], + "task_id": 137, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Create a discussion post about \"{{topic}}\" in a relevant subreddit and ask users for their opinions with the simple prompt, \"your opinion\"", + "instantiation_dict": { + "topic": "Iphone 14" + }, + "intent": "Create a discussion post about \"Iphone 14\" in a relevant subreddit and ask users for their opinions with the simple prompt, \"your opinion\"", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/iphone", + "program_html": [ + { + "url": "func:reddit_get_post_url('__last_url__')", + "locator": "document.querySelector('.submission__inner').outerText", + "required_contents": { + "must_include": [ + "your opinion", + "Iphone 14" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 13, + "old_task_id": 626 + }, + { + "sites": [ + "reddit" + ], + "task_id": 138, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Ask for product recommendations for {{category}} within a budget of 
{{price}} in {{subreddit}}", + "instantiation_dict": { + "category": "running shoes", + "price": "$100", + "subreddit": "r/sports" + }, + "intent": "Ask for product recommendations for running shoes within a budget of $100 in r/sports", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/sports", + "program_html": [ + { + "url": "func:reddit_get_post_url('__last_url__')", + "locator": "document.querySelector('.submission__inner').outerText", + "required_contents": { + "must_include": [ + "100", + "running", + "shoes" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 15, + "old_task_id": 631 + }, + { + "sites": [ + "reddit" + ], + "task_id": 139, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Post a notice on a virtual meetup for {{interest}} enthusiasts on {{date}} in the {{subreddit}} subreddit", + "instantiation_dict": { + "interest": "Harry Potter", + "date": "July 8th", + "subreddit": "books" + }, + "intent": "Post a notice on a virtual meetup for Harry Potter enthusiasts on July 8th in the books subreddit", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/books", + "program_html": [ + { + "url": "func:reddit_get_post_url('__last_url__')", + "locator": "document.querySelector('.submission__inner').outerText", + "required_contents": { + "fuzzy_match": [ + "Harry Potter", + "July 8th", + "virtual meetup" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 16, + "old_task_id": 641 + }, + { + "sites": [ + "reddit" + ], + "task_id": 140, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Post in {{subreddit}} subreddit 
about what could machine learning help the correpong field.", + "instantiation_dict": { + "subreddit": "books" + }, + "intent": "Post in books subreddit about what could machine learning help the correpong field.", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/books", + "program_html": [ + { + "url": "func:reddit_get_post_url('__last_url__')", + "locator": "document.querySelector('.submission__inner').outerText", + "required_contents": { + "fuzzy_match": [ + "machine learning", + "help" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 19, + "old_task_id": 645 + }, + { + "sites": [ + "reddit" + ], + "task_id": 141, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__/f/technology/134852/ai-experts-disown-musk-backed-campaign-citing-their-research", + "geolocation": null, + "intent_template": "Reply to {{position_description}} with my comment \"{{content_description}}\"", + "instantiation_dict": { + "position_description": "the post", + "content_description": "???" + }, + "intent": "Reply to the post with my comment \"???\"", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/technology/134852/ai-experts-disown-musk-backed-campaign-citing-their-research", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector('.comment__body').outerText", + "required_contents": { + "exact_match": "???" 
+ } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 23, + "old_task_id": 652 + }, + { + "sites": [ + "shopping" + ], + "task_id": 142, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Fill the \"contact us\" form in the site for a refund on the {{product}} I bought, stating that it broke after just three days of use. Also, ensure to include the order number #{{order_id}} and the product SKU. Don't submit yet, I will check.", + "instantiation_dict": { + "product": "speaker", + "order_id": "148" + }, + "intent": "Fill the \"contact us\" form in the site for a refund on the speaker I bought, stating that it broke after just three days of use. Also, ensure to include the order number #148 and the product SKU. Don't submit yet, I will check.", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__SHOPPING__/contact", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector('[title=\"What’s on your mind?\"').value", + "required_contents": { + "fuzzy_match": [ + "refund", + "broke after three days of use", + "148", + "B003FVW3VA" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 153, + "old_task_id": 657 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 143, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Submit a merge request for {{source_branch}} branch to be merged into {{target_branch}} branch, assign {{reviewer}} as the reviewer", + "instantiation_dict": { + "source_branch": "a11yproject.com/redesign", + "target_branch": "master", + "reviewer": "Roshan Jossy" + }, + "intent": "Submit a merge request for a11yproject.com/redesign branch to be merged into master branch, assign Roshan Jossy as the reviewer", + 
"require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__GITLAB__/a11yproject/a11yproject.com/-/merge_requests", + "program_html": [ + { + "url": "last", + "locator": "document.querySelectorAll(\".detail-page-description > a.gl-font-monospace\")[0].outerText", + "required_contents": { + "exact_match": "redesign" + } + }, + { + "url": "last", + "locator": "document.querySelectorAll(\".detail-page-description > a.gl-font-monospace\")[1].outerText", + "required_contents": { + "exact_match": "main" + } + }, + { + "url": "last", + "locator": "document.querySelector('.block.reviewer').outerText", + "required_contents": { + "must_include": [ + "Roshan Jossy" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 335, + "old_task_id": 668 + }, + { + "sites": [ + "shopping", + "reddit" + ], + "task_id": 144, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Gather the titles of {{product}} reviews with {{rating}} rating from OneStopShop, and post them in the games subreddit under the title \"real user feedback on {{product}}\"", + "instantiation_dict": { + "product": "Racing Wheel Overdrive for Xbox X", + "rating": "1 star" + }, + "intent": "Gather the titles of Racing Wheel Overdrive for Xbox X reviews with 1 star rating from OneStopShop, and post them in the games subreddit under the title \"real user feedback on Racing Wheel Overdrive for Xbox X\"", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/gaming", + "program_html": [ + { + "url": "func:reddit_get_post_url('__last_url__')", + "locator": "document.querySelector('.submission__title').outerText", + "required_contents": { + "exact_match": "real user feedback on Racing Wheel Overdrive for Xbox X" + } + }, + { + 
"url": "func:reddit_get_post_url('__last_url__')", + "locator": "document.querySelector('.submission__body').outerText", + "required_contents": { + "must_include": [ + "Unable to set neutral steering", + "Doesn’t work with PC", + "Crazy problems in automatic mode", + "pedals stopped working", + "Only works with certain games" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 101, + "old_task_id": 673 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 145, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Lookup orders that are {{status}}", + "instantiation_dict": { + "status": "canceled" + }, + "intent": "Lookup orders that are canceled", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__SHOPPING_ADMIN__/sales/order/", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector(\"div.admin__data-grid-filters-current\").outerText", + "required_contents": { + "must_include": [ + "Canceled" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 253, + "old_task_id": 678 + }, + { + "sites": [ + "reddit", + "gitlab" + ], + "task_id": 146, + "require_login": true, + "storage_state": "./.auth/gitlab.reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Find a GitLab repository related to {{topic}} and make a Reddit post linking to it in a relevant subreddit", + "instantiation_dict": { + "topic": "chatGPT" + }, + "intent": "Find a GitLab repository related to chatGPT and make a Reddit post linking to it in a relevant subreddit", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/machinelearning |OR| __REDDIT__/f/singularity |OR| __REDDIT__/f/deeplearning |OR| 
__REDDIT__/f/technology", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector('.submission__inner').outerText", + "required_contents": { + "must_include": [ + "chatgpt", + "__GITLAB__/convexegg/chatgpt" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 116, + "old_task_id": 682 + }, + { + "sites": [ + "reddit", + "gitlab" + ], + "task_id": 147, + "require_login": true, + "storage_state": "./.auth/gitlab.reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Promote {{repo}} to subreddit {{subreddit}} with the description from the repo itself.", + "instantiation_dict": { + "repo": "auth0/angular-storage", + "subreddit": "technology" + }, + "intent": "Promote auth0/angular-storage to subreddit technology with the description from the repo itself.", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__REDDIT__/f/technology", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector('.submission__inner').outerText + [...document.querySelector('.submission__inner').querySelectorAll('[href],[src]')].map(elem => elem.getAttribute('href') || elem.getAttribute('src')).join(' ')", + "required_contents": { + "must_include": [ + "A storage library for AngularJS done right", + "__GITLAB__/auth0/angular-storage" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 117, + "old_task_id": 686 + }, + { + "sites": [ + "shopping" + ], + "task_id": 148, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Draft an email to the shop owner via their contact us function for a coupon as {{reason}}", + "instantiation_dict": { + "reason": "my refund is suppoed to be replaced by a coupon" + }, + "intent": "Draft an email to the shop owner via their contact us function for a 
coupon as my refund is suppoed to be replaced by a coupon", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match" + ], + "reference_answers": null, + "reference_url": "__SHOPPING__/contact/index/", + "program_html": [], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 163, + "old_task_id": 693 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 149, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Today is 3/15/2023, generate a {{report}} {{time_span}}", + "instantiation_dict": { + "report": "sales order report", + "time_span": "for last month" + }, + "intent": "Today is 3/15/2023, generate a sales order report for last month", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__SHOPPING_ADMIN__/reports/report_sales/sales", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector('[id=\"sales_report_from\"').value", + "required_contents": { + "exact_match": "2/1/23" + } + }, + { + "url": "last", + "locator": "document.querySelector('[id=\"sales_report_to\"').value", + "required_contents": { + "exact_match": "2/28/23" + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 268, + "old_task_id": 704 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 150, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "Create a {{type}} report from {{start_date}} to {{end_date}}", + "instantiation_dict": { + "type": "shipping", + "start_date": "08/05/2022", + "end_date": "03/01/2023" + }, + "intent": "Create a shipping report from 08/05/2022 to 03/01/2023", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": 
"__SHOPPING_ADMIN__/reports/report_sales/shipping", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector('[id=\"sales_report_from\"').value", + "required_contents": { + "exact_match": "8/5/22" + } + }, + { + "url": "last", + "locator": "document.querySelector('[id=\"sales_report_to\"').value", + "required_contents": { + "exact_match": "3/1/23" + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 271, + "old_task_id": 710 + }, + { + "sites": [ + "reddit" + ], + "task_id": 151, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Thumbs down the top {{k}} post ever in {{subreddit}}.", + "instantiation_dict": { + "subreddit": "gadgets", + "k": "1" + }, + "intent": "Thumbs down the top 1 post ever in gadgets.", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__REDDIT__/f/gadgets/19459/a-custom-gaming-pc-built-inside-a-vintage-1940s-motorola", + "locator": "document.querySelector('div.submission__vote').querySelector('form').getAttribute('class')", + "required_contents": { + "must_include": [ + "vote vote--user-downvoted" + ] + } + } + ] + }, + "intent_template_id": 24, + "old_task_id": 714 + }, + { + "sites": [ + "reddit" + ], + "task_id": 152, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Like all submissions created by {{user}} in subreddit {{subreddit}}", + "instantiation_dict": { + "subreddit": "earthporn", + "user": "CameronKelsey" + }, + "intent": "Like all submissions created by CameronKelsey in subreddit earthporn", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": 
"__REDDIT__/f/EarthPorn/98332/my-favorite-place-on-the-planet-henry-s-fork-of-the-snake", + "locator": "document.querySelector('div.submission__vote').querySelector('form').getAttribute('class')", + "required_contents": { + "must_include": [ + "vote vote--user-upvoted" + ] + } + }, + { + "url": "__REDDIT__/f/EarthPorn/98297/2-years-later-this-is-still-one-of-the-most-incredible", + "locator": "document.querySelector('div.submission__vote').querySelector('form').getAttribute('class')", + "required_contents": { + "must_include": [ + "vote vote--user-upvoted" + ] + } + }, + { + "url": "__REDDIT__/f/EarthPorn/98256/i-can-t-wait-for-all-this-green-to-start-coming-back-little", + "locator": "document.querySelector('div.submission__vote').querySelector('form').getAttribute('class')", + "required_contents": { + "must_include": [ + "vote vote--user-upvoted" + ] + } + } + ] + }, + "intent_template_id": 25, + "old_task_id": 720 + }, + { + "sites": [ + "reddit" + ], + "task_id": 153, + "require_login": true, + "storage_state": "./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "DisLike all submissions created by {{user}} in subreddit {{subreddit}}", + "instantiation_dict": { + "user": "AdamCannon", + "subreddit": "UpliftingNews" + }, + "intent": "DisLike all submissions created by AdamCannon in subreddit UpliftingNews", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__REDDIT__/f/UpliftingNews/16087/same-sex-marriage-is-now-legal-in-all-of-mexico-s-states", + "locator": "document.querySelector('div.submission__vote').querySelector('form').getAttribute('class')", + "required_contents": { + "must_include": [ + "vote vote--user-downvoted" + ] + } + } + ] + }, + "intent_template_id": 1510, + "old_task_id": 729 + }, + { + "sites": [ + "reddit" + ], + "task_id": 154, + "require_login": true, + "storage_state": 
"./.auth/reddit_state.json", + "start_url": "__REDDIT__", + "geolocation": null, + "intent_template": "Edit my post on {{post}} by adding a line to the body that says \"{{content}}\"", + "instantiation_dict": { + "post": "Star Trek Starfleet Academy series", + "content": "Every watch makes me feel like a kid again" + }, + "intent": "Edit my post on Star Trek Starfleet Academy series by adding a line to the body that says \"Every watch makes me feel like a kid again\"", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__REDDIT__/f/television/135201/star-trek-starfleet-academy-series-from-alex-kurtzman-and", + "locator": "document.querySelector('.submission__body').outerText", + "required_contents": { + "exact_match": "Every watch makes me feel like a kid again" + } + } + ] + }, + "intent_template_id": 27, + "old_task_id": 733 + }, + { + "sites": [ + "wikipedia", + "map" + ], + "task_id": 155, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Show me the way from {{location}} to the home stadium of {{sport_team}} {{time}}", + "instantiation_dict": { + "location": "Carnegie Mellon University", + "sport_team": "Boston home NBA team", + "time": "" + }, + "intent": "Show me the way from Carnegie Mellon University to the home stadium of Boston home NBA team ", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector('[name=\"route_from\"').value", + "required_contents": { + "must_include": [ + "Carnegie Mellon University", + "Pittsburgh" + ] + } + }, + { + "url": "last", + "locator": "document.querySelector('[name=\"route_to\"').value", + "required_contents": { + "must_include": [ + "TD Garden", + "Boston", + "Massachusetts" + ] + } + }, + { 
+ "url": "last", + "locator": "document.querySelector(\"div#content select.routing_engines\").selectedIndex", + "required_contents": { + "exact_match": "1" + } + } + ] + }, + "intent_template_id": 94, + "old_task_id": 741 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 156, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Create a new {{scope}} project \"awesome-llms\" and add {{account_list}} as members", + "instantiation_dict": { + "scope": "public", + "account_list": "primer, convexegg, abishek" + }, + "intent": "Create a new public project \"awesome-llms\" and add primer, convexegg, abishek as members", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/byteblaze/awesome-llms", + "locator": "document.querySelector('.visibility-icon').getAttribute('title')", + "required_contents": { + "must_include": [ + "public" + ] + } + }, + { + "url": "__GITLAB__/byteblaze/awesome-llms/-/project_members", + "locator": "", + "required_contents": { + "must_include": [ + "@primer", + "@convexegg", + "@abisubramanya27" + ] + } + } + ] + }, + "intent_template_id": 332, + "old_task_id": 745 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 157, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Start a private project {{project_name}} with {{template}} template and add {{account_list}} as members", + "instantiation_dict": { + "project_name": "web_agent_android_xl", + "template": "Android", + "account_list": "primer, convexegg, abishek" + }, + "intent": "Start a private project web_agent_android_xl with Android template and add primer, convexegg, abishek as members", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": 
null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/byteblaze/web_agent_android_xl", + "locator": "document.querySelector('.visibility-icon').getAttribute('title')", + "required_contents": { + "must_include": [ + "Private" + ] + } + }, + { + "url": "__GITLAB__/byteblaze/web_agent_android_xl/-/commits", + "locator": "", + "required_contents": { + "must_include": [ + "Initialized from 'Android' project template" + ] + } + }, + { + "url": "__GITLAB__/byteblaze/web_agent_android_xl/-/project_members", + "locator": "", + "required_contents": { + "must_include": [ + "@primer", + "@convexegg", + "@abisubramanya27" + ] + } + } + ] + }, + "intent_template_id": 2100, + "old_task_id": 748 + }, + { + "sites": [ + "map", + "shopping_admin" + ], + "task_id": 158, + "require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Show me the route and driving time from {{city1}} to {{city2}}", + "instantiation_dict": { + "city1": "Allentown, PA", + "city2": "the city where my E-commerce customer Amanda Kim lives" + }, + "intent": "Show me the route and driving time from Allentown, PA to the city where my E-commerce customer Amanda Kim lives", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector(\"div#content select.routing_engines\").selectedIndex", + "required_contents": { + "exact_match": "1" + } + }, + { + "url": "last", + "locator": "document.querySelector('[name=\"route_from\"').value", + "required_contents": { + "must_include": [ + "Allentown" + ] + } + }, + { + "url": "last", + "locator": "document.querySelector('[name=\"route_to\"').value", + "required_contents": { + "must_include": [ + "Hoboken", + "New Jersey" + ] + } + } + ] + }, + "intent_template_id": 42, + "old_task_id": 760 + }, + { + "sites": [ + "map" + ], + "task_id": 159, + 
"require_login": true, + "storage_state": null, + "start_url": "__MAP__", + "geolocation": null, + "intent_template": "Get directions from {{location/address_1}} to {{location/address_2}} using {{transportation}} options.", + "instantiation_dict": { + "location/address_1": "Carnegie Music Hall in NYC", + "location/address_2": "Carnegie Mellon University", + "transportation": "driving" + }, + "intent": "Get directions from Carnegie Music Hall in NYC to Carnegie Mellon University using driving options.", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector(\"div#content select.routing_engines\").selectedIndex", + "required_contents": { + "exact_match": "1" + } + }, + { + "url": "last", + "locator": "document.querySelector('[name=\"route_from\"').value", + "required_contents": { + "must_include": [ + "Carnegie Hall", + "West 57th Street", + "Manhattan", + "New York" + ] + } + }, + { + "url": "last", + "locator": "document.querySelector('[name=\"route_to\"').value", + "required_contents": { + "must_include": [ + "Carnegie Mellon University", + "Pittsburgh" + ] + } + } + ] + }, + "intent_template_id": 54, + "old_task_id": 762 + }, + { + "sites": [ + "shopping_admin" + ], + "task_id": 160, + "require_login": true, + "storage_state": "./.auth/shopping_admin_state.json", + "start_url": "__SHOPPING_ADMIN__", + "geolocation": null, + "intent_template": "{{quantity}} {{product}} arrived, update the stock", + "instantiation_dict": { + "quantity": "5", + "product": "blue Cronus yoga pants with size 33" + }, + "intent": "5 blue Cronus yoga pants with size 33 arrived, update the stock", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__SHOPPING_ADMIN__/catalog/product/edit/id/872/", + "locator": 
"document.querySelector('[name=\"product[quantity_and_stock_status][qty]\"').value", + "required_contents": { + "exact_match": "5" + } + }, + { + "url": "__SHOPPING_ADMIN__/catalog/product/edit/id/872/", + "locator": "document.querySelector('[name=\"product[quantity_and_stock_status][is_in_stock]\"').value", + "required_contents": { + "exact_match": "1" + } + } + ] + }, + "intent_template_id": 241, + "old_task_id": 768 + }, + { + "sites": [ + "gitlab", + "reddit" + ], + "task_id": 161, + "require_login": true, + "storage_state": "./.auth/gitlab.reddit_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Make a folder named {{directory}} on the {{gitlab_repo}} repo and include a file called urls.txt that consists of the links to the 5 most recent posts from {{subreddit}}.", + "instantiation_dict": { + "directory": "car", + "subreddit": "cars", + "gitlab_repo": "gimmiethat.space" + }, + "intent": "Make a folder named car on the gimmiethat.space repo and include a file called urls.txt that consists of the links to the 5 most recent posts from cars.", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + "exact_match": "N/A" + }, + "reference_url": "", + "program_html": [], + "string_note": "the subreddit does not exist", + "reference_answer_raw_annotation": "N/A" + }, + "intent_template_id": 84, + "old_task_id": 791 + }, + { + "sites": [ + "shopping" + ], + "task_id": 162, + "require_login": true, + "storage_state": "./.auth/shopping_state.json", + "start_url": "__SHOPPING__", + "geolocation": null, + "intent_template": "Change the delivery address for my most recent order to {{address}}.", + "instantiation_dict": { + "address": "77 Massachusetts Ave, Cambridge, MA" + }, + "intent": "Change the delivery address for my most recent order to 77 Massachusetts Ave, Cambridge, MA.", + "require_reset": false, + "eval": { + "eval_types": [ + "string_match" + ], + "reference_answers": { + 
"must_include": [ + "N/A" + ] + }, + "reference_url": "", + "program_html": [], + "string_note": "The shipping address cannot be changed after an order is placed in the website.", + "reference_answer_raw_annotation": "N/A" + }, + "intent_template_id": 191, + "old_task_id": 798 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 163, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Create an issue in {{repo}} repo with title \"{{issue}}\". Assign the issue to {{account}}. Set due date to be {{due}}", + "instantiation_dict": { + "repo": "a11yproject", + "issue": "404 for many URLs", + "account": "myself", + "due": "2030-1-3" + }, + "intent": "Create an issue in a11yproject repo with title \"404 for many URLs\". Assign the issue to myself. Set due date to be 2030-1-3", + "require_reset": false, + "eval": { + "eval_types": [ + "url_match", + "program_html" + ], + "reference_answers": null, + "reference_url": "__GITLAB__/a11yproject/a11yproject.com/-/issues", + "program_html": [ + { + "url": "last", + "locator": "document.querySelector('[data-qa-selector=\"title_content\"]').outerText", + "required_contents": { + "exact_match": "404 for many URLs" + } + }, + { + "url": "last", + "locator": "document.querySelector('[data-testid=\"sidebar-due-date\"').outerText", + "required_contents": { + "must_include": [ + "Jan 3, 2030" + ] + } + }, + { + "url": "last", + "locator": "document.querySelector('.block.assignee').outerText", + "required_contents": { + "must_include": [ + "Byte Blaze" + ] + } + } + ], + "url_note": "GOLD in PRED" + }, + "intent_template_id": 327, + "old_task_id": 809 + }, + { + "sites": [ + "gitlab" + ], + "task_id": 164, + "require_login": true, + "storage_state": "./.auth/gitlab_state.json", + "start_url": "__GITLAB__", + "geolocation": null, + "intent_template": "Assign the issue regarding {{issue}} in {{repo}} to {{account}}.", + "instantiation_dict": { + 
"repo": "a11yproject", + "issue": 404, + "account": "myself" + }, + "intent": "Assign the issue regarding 404 in a11yproject to myself.", + "require_reset": false, + "eval": { + "eval_types": [ + "program_html" + ], + "reference_answers": null, + "reference_url": "", + "program_html": [ + { + "url": "__GITLAB__/dashboard/issues?scope=all&state=opened&assignee_username=byteblaze", + "locator": "", + "required_contents": { + "must_include": [ + "404s, bad host, timeouts, bad urls for URLs linked from website" + ] + } + } + ] + }, + "intent_template_id": 999, + "old_task_id": 811 + } +] \ No newline at end of file diff --git a/VAB-WebArena-Lite/new/tokenizers.py b/VAB-WebArena-Lite/new/tokenizers.py new file mode 100644 index 0000000..2234040 --- /dev/null +++ b/VAB-WebArena-Lite/new/tokenizers.py @@ -0,0 +1,29 @@ +from typing import Any + +import tiktoken +from transformers import LlamaTokenizer # type: ignore + + +class Tokenizer(object): + def __init__(self, provider: str, model_name: str) -> None: + if provider == "openai": + self.tokenizer = tiktoken.encoding_for_model(model_name) + elif provider == "huggingface": + self.tokenizer = LlamaTokenizer.from_pretrained(model_name) + # turn off adding special tokens automatically + self.tokenizer.add_special_tokens = False # type: ignore[attr-defined] + self.tokenizer.add_bos_token = False # type: ignore[attr-defined] + self.tokenizer.add_eos_token = False # type: ignore[attr-defined] + elif provider in ["google", "api", "finetune"]: + self.tokenizer = None # Not used for input length computation, as Gemini is based on characters + else: + raise NotImplementedError + + def encode(self, text: str) -> list[int]: + return self.tokenizer.encode(text) + + def decode(self, ids: list[int]) -> str: + return self.tokenizer.decode(ids) + + def __call__(self, text: str) -> list[int]: + return self.tokenizer.encode(text) diff --git a/VAB-WebArena-Lite/new/utils.py b/VAB-WebArena-Lite/new/utils.py new file mode 100644 index 
0000000..4558585 --- /dev/null +++ b/VAB-WebArena-Lite/new/utils.py @@ -0,0 +1,87 @@ +import argparse +from typing import Any + +try: + from vertexai.preview.generative_models import Image + from llms import generate_from_gemini_completion +except Exception: + print('Google Cloud not set up, skipping import of vertexai.preview.generative_models.Image and llms.generate_from_gemini_completion') + +from llms import ( + generate_from_huggingface_completion, + generate_from_openai_chat_completion, + generate_from_openai_completion, + generate_with_api, + lm_config, +) + +APIInput = str | list[Any] | dict[str, Any] + + +def call_llm( + lm_config: lm_config.LMConfig, + prompt: APIInput, +) -> str: + response: str + if lm_config.provider == "openai": + if lm_config.mode == "chat": + assert isinstance(prompt, list) + response = generate_from_openai_chat_completion( + messages=prompt, + model=lm_config.model, + temperature=lm_config.gen_config["temperature"], + top_p=lm_config.gen_config["top_p"], + context_length=lm_config.gen_config["context_length"], + max_tokens=lm_config.gen_config["max_tokens"], + stop_token=None, + ) + elif lm_config.mode == "completion": + assert isinstance(prompt, str) + response = generate_from_openai_completion( + prompt=prompt, + engine=lm_config.model, + temperature=lm_config.gen_config["temperature"], + max_tokens=lm_config.gen_config["max_tokens"], + top_p=lm_config.gen_config["top_p"], + stop_token=lm_config.gen_config["stop_token"], + ) + else: + raise ValueError( + f"OpenAI models do not support mode {lm_config.mode}" + ) + elif lm_config.provider == "huggingface": + assert isinstance(prompt, str) + response = generate_from_huggingface_completion( + prompt=prompt, + model_endpoint=lm_config.gen_config["model_endpoint"], + temperature=lm_config.gen_config["temperature"], + top_p=lm_config.gen_config["top_p"], + stop_sequences=lm_config.gen_config["stop_sequences"], + max_new_tokens=lm_config.gen_config["max_new_tokens"], + ) + elif lm_config.provider 
== "google": + assert isinstance(prompt, list) + assert all( + [isinstance(p, str) or isinstance(p, Image) for p in prompt] + ) + response = generate_from_gemini_completion( + prompt=prompt, + engine=lm_config.model, + temperature=lm_config.gen_config["temperature"], + max_tokens=lm_config.gen_config["max_tokens"], + top_p=lm_config.gen_config["top_p"], + ) + elif lm_config.provider in ["api", "finetune"]: + args = { + "temperature": lm_config.gen_config["temperature"], # openai, gemini, claude + "max_tokens": lm_config.gen_config["max_tokens"], # openai, gemini, claude + "top_k": lm_config.gen_config["top_p"], # qwen + } + response = generate_with_api(prompt, lm_config.model, args) + + else: + raise NotImplementedError( + f"Provider {lm_config.provider} not implemented" + ) + + return response diff --git a/VAB-WebArena-Lite/new/wa_parallel_run.sh b/VAB-WebArena-Lite/new/wa_parallel_run.sh new file mode 100644 index 0000000..2977e54 --- /dev/null +++ b/VAB-WebArena-Lite/new/wa_parallel_run.sh @@ -0,0 +1,78 @@ +#!/bin/bash +DATASET='webarena' # webarena, visualwebarena +result_dir='' +provider='' +model='' +instruction_path='agent/prompts/jsons/p_som_cot_id_actree_3s.json' # e.g., agent/prompts/jsons/p_cot_id_actree_2s.json +test_config_base_dir='config_files/wa/test_webarena_lite' # e.g., config_files/wa/test_webarena_lite +temperature=0.0 + +SERVER='' # your server address +MAP_SERVER='' # the same as SERVER +OPENAI_API_KEY='' +OPENAI_ORGANIZATION='' +CONDA_ENV_NAME='' # the name of your conda environment + +ENV_VARIABLES="export DATASET=${DATASET}; export SHOPPING='http://${SERVER}:7770';export SHOPPING_ADMIN='http://${SERVER}:7780/admin';export REDDIT='http://${SERVER}:9999';export GITLAB='http://${SERVER}:8023';export MAP='http://${MAP_SERVER}:3000';export WIKIPEDIA='http://${SERVER}:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing';export HOMEPAGE='http://${SERVER}:4399';export OPENAI_API_KEY=${OPENAI_API_KEY};export 
OPENAI_ORGANIZATION=${OPENAI_ORGANIZATION}" + +# get the number of tmux panes +num_panes=$(tmux list-panes | wc -l) + +# calculate how many panes need to be created +let "panes_to_create = 8 - num_panes" + +# array of tmux commands to create each pane +tmux_commands=( + 'tmux split-window -h' + 'tmux split-window -v' + 'tmux select-pane -t 0; tmux split-window -v' + 'tmux select-pane -t 0; tmux split-window -v' + 'tmux select-pane -t 2; tmux split-window -v' + 'tmux select-pane -t 4; tmux split-window -v' + 'tmux select-pane -t 6; tmux split-window -v' +) + +# create panes up to 8 +for ((i=0; i<$panes_to_create; i++)); do + eval ${tmux_commands[$i]} +done + +#!/bin/bash + +# Function to run a job +run_job() { + tmux select-pane -t $1 + tmux send-keys "tmux set mouse on; conda activate ${CONDA_ENV_NAME}; ${ENV_VARIABLES}; until python run.py --viewport_width 1280 --viewport_height 720 --test_start_idx $2 --test_end_idx $3 --provider ${provider} --model ${model} --instruction_path ${instruction_path} --temperature ${temperature} --test_config_base_dir ${test_config_base_dir} --result_dir ${result_dir}; do echo 'crashed' >&2; sleep 1; done" C-m + sleep 3 +} + +TOLERANCE=2 +run_batch() { + args=("$@") # save all arguments in an array + num_jobs=${#args[@]} # get number of arguments + + for ((i=1; i<$num_jobs; i++)); do + run_job $i ${args[i-1]} ${args[i]} + done + + # Wait for all jobs to finish + while tmux list-panes -F "#{pane_pid} #{pane_current_command}" | grep -q python; do + sleep 100 # wait 100 seconds before checking again + done + + # Run checker + while ! python scripts/check_error_runs.py ${result_dir} --delete_errors --tolerance ${TOLERANCE}; do + echo "Check failed, rerunning jobs..." 
+ for ((i=1; i<$num_jobs; i++)); do + run_job $i ${args[i-1]} ${args[i]} + done + + # Wait for all jobs to finish + while tmux list-panes -F "#{pane_pid} #{pane_current_command}" | grep -q python; do + sleep 100 # wait 100 seconds before checking again + done + done + +} + +run_batch 0 24 48 72 96 120 143 165 diff --git a/VAB-WebArena-Lite/refresh_website_dockers.sh b/VAB-WebArena-Lite/refresh_website_dockers.sh new file mode 100644 index 0000000..c6f8996 --- /dev/null +++ b/VAB-WebArena-Lite/refresh_website_dockers.sh @@ -0,0 +1,71 @@ +#!/bin/bash + +docker stop shopping +docker stop shopping_admin +docker stop forum +docker stop gitlab + +docker rm shopping +docker rm shopping_admin +docker rm forum +docker rm gitlab + +# One Stop Shop +# docker load --input shopping_final_0712.tar +docker run --name shopping -p 7770:80 -d shopping_final_0712 + +# CMS +# docker load --input shopping_admin_final_0719.tar +docker run --name shopping_admin -p 7780:80 -d shopping_admin_final_0719 + +# Reddit +# docker load --input postmill-populated-exposed-withimg.tar +docker run --name forum -p 9999:80 -d postmill-populated-exposed-withimg + +# GitLab +# docker load --input gitlab-populated-final-port8023.tar +docker run --name gitlab --shm-size="10g" -d -p 8023:8023 gitlab-populated-final-port8023 /opt/gitlab/embedded/bin/runsvdir-start + +sleep 60 + +# Define your actual server hostname +YOUR_ACTUAL_HOSTNAME="http://localhost" +# Remove trailing / if it exists +YOUR_ACTUAL_HOSTNAME=${YOUR_ACTUAL_HOSTNAME%/} + +# OSS +docker exec shopping /var/www/magento2/bin/magento setup:store-config:set --base-url="${YOUR_ACTUAL_HOSTNAME}:7770" +docker exec shopping mysql -u magentouser -pMyPassword magentodb -e "UPDATE core_config_data SET value='${YOUR_ACTUAL_HOSTNAME}:7770/' WHERE path = 'web/secure/base_url';" +docker exec shopping /var/www/magento2/bin/magento cache:flush + +# Disable re-indexing of products +docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule 
catalogrule_product +docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalogrule_rule +docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalogsearch_fulltext +docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalog_category_product +docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule customer_grid +docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule design_config_grid +docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule inventory +docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalog_product_category +docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalog_product_attribute +docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule catalog_product_price +docker exec shopping /var/www/magento2/bin/magento indexer:set-mode schedule cataloginventory_stock + +# CMS +docker exec shopping_admin /var/www/magento2/bin/magento setup:store-config:set --base-url="${YOUR_ACTUAL_HOSTNAME}:7780" +docker exec shopping_admin mysql -u magentouser -pMyPassword magentodb -e "UPDATE core_config_data SET value='${YOUR_ACTUAL_HOSTNAME}:7780/' WHERE path = 'web/secure/base_url';" +docker exec shopping_admin /var/www/magento2/bin/magento cache:flush + +# Forum +docker exec forum sed -i '/@RateLimit/,/)/d' /var/www/html/src/DataObject/CommentData.php +docker exec forum sed -i '/@RateLimit/,/)/d' /var/www/html/src/DataObject/SubmissionData.php +docker exec forum sed -i '/@RateLimit/,/)/d' /var/www/html/src/DataObject/UserData.php +docker exec forum bin/console cache:clear --env=prod + +sleep 60 + +# Gitlab +docker exec gitlab sed -i "s|^external_url.*|external_url '${YOUR_ACTUAL_HOSTNAME}:8023'|" /etc/gitlab/gitlab.rb +docker exec gitlab sed -i "s/.*postgresql\['max_connections'.*/postgresql\['max_connections'\] = 2000/g" /etc/gitlab/gitlab.rb +docker 
exec gitlab gitlab-ctl reconfigure +docker exec gitlab gitlab-ctl restart diff --git a/VAB-WebArena-Lite/replace.sh b/VAB-WebArena-Lite/replace.sh new file mode 100644 index 0000000..add244c --- /dev/null +++ b/VAB-WebArena-Lite/replace.sh @@ -0,0 +1,37 @@ +#!/bin/bash + +# 1. reset the original environment +cd visualwebarena +rm -rf .git +cd .. + +# 2. replace files +# webarena-lite +cp -f new/test_webarena_lite.raw.json visualwebarena/config_files/wa/test_webarena_lite.raw.json +cp -f new/generate_test_data.py visualwebarena/scripts/generate_test_data.py + +# agent +cp -f new/run.py visualwebarena/run.py +cp -f new/agent.py visualwebarena/agent/agent.py +cp -f new/prompt_constructor.py visualwebarena/agent/prompts/prompt_constructor.py + +# llms +cp -f new/utils.py visualwebarena/llms/utils.py +cp -f new/llms_init.py visualwebarena/llms/__init__.py +cp -f new/lm_config.py visualwebarena/llms/lm_config.py +cp -f new/tokenizers.py visualwebarena/llms/tokenizers.py +cp -f new/api_utils.py visualwebarena/llms/providers/api_utils.py +cp -f new/openai_utils.py visualwebarena/llms/providers/openai_utils.py + +# eval +cp -f new/evaluators.py visualwebarena/evaluation_harness/evaluators.py +cp -f new/helper_functions.py visualwebarena/evaluation_harness/helper_functions.py + +# misc +cp -f README.md visualwebarena/README.md +cp -f new/wa_parallel_run.sh visualwebarena/wa_parallel_run.sh + +# 3. remove temporary files +mv visualwebarena/* . +rm -rf new +rm -rf visualwebarena diff --git a/assets/framework.png b/assets/framework.png new file mode 100644 index 0000000..2e10872 Binary files /dev/null and b/assets/framework.png differ diff --git a/docs/detailed_setups/VAB-Minecraft.md b/docs/detailed_setups/VAB-Minecraft.md new file mode 100644 index 0000000..699b851 --- /dev/null +++ b/docs/detailed_setups/VAB-Minecraft.md @@ -0,0 +1,49 @@ +# Setup for VAB-Minecraft + +## Installation + +1. We have tested on Ubuntu. 
 VAB-Minecraft requires an NVIDIA GPU with at least 4 GB of memory and NVIDIA GPU driver version >= 530.30.02. + +2. Besides [Docker](https://www.docker.com/), install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) on your machine. + +3. Get the pre-built Docker image. + + - If you have access to Docker Hub: + + ```bash + docker pull tianjiezhang/vab_minecraft:latest + ``` + + - Or download it from ModelScope. + + 1. Make sure `git-lfs` is installed. + + 2. Download from ModelScope: + + ```bash + git lfs install + + git clone https://www.modelscope.cn/datasets/VisualAgentBench/VAB-Minecraft.git + ``` + + 3. Load the Docker image from the ModelScope dataset. + + ```bash + docker load -i VAB-Minecraft/vab_minecraft.tar + ``` + +4. Download the Steve-1 weights to `data/minecraft`. Please make sure you have access to Google Drive. + + ```bash + python scripts/minecraft_download.py + ``` + +## Get Started + +Based on your hardware, fill in `available_ports` and `available_devices` in the task configuration file `configs/tasks/minecraft.yaml`. + +- `available_ports`: List the available ports on your machine. Each concurrent Docker container requires 1 port for communication with the task server. Ensure that you provide enough ports to accommodate the expected concurrency. + +- `available_devices`: List the GPU IDs and the number of concurrent containers each GPU can host. Each concurrent Docker container occupies about **3.3 GB** of GPU memory. Ensure that you provide enough GPU memory to accommodate the expected concurrency. 
+ +**Note: If you manually shut down the task server and assigner, please ensure you also stop the Minecraft containers to free up the ports!** diff --git a/docs/README_setup.md b/docs/detailed_setups/VAB-OmniGibson.md similarity index 57% rename from docs/README_setup.md rename to docs/detailed_setups/VAB-OmniGibson.md index af93a5f..6586d21 100644 --- a/docs/README_setup.md +++ b/docs/detailed_setups/VAB-OmniGibson.md @@ -57,57 +57,3 @@ 2. Load the new value: `sudo sysctl -p`. **Note: If you manually shut down the task server and assigner, please ensure you also stop the OmniGibson containers to free up the ports!** - -# Setup for VAB-Minecraft - -## Installation - -1. We have tested on Ubuntu. VAB-Minecraft requires at least 4 GB NVIDIA GPU and NVIDIA GPU driver version >= 530.30.02. - -2. Besides [docker](https://www.docker.com/), install [NVIDIA container toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) on your machine. - -3. Get pre-built docker image. - - - If you have access to docker hub: - - ```bash - docker pull tianjiezhang/vab_minecraft:latest - ``` - - - Or you can download from ModelScope. - - 1. Make sure `git-lfs` is installed. - - 2. Download from ModelScope: - - ```bash - git lfs install - - git clone https://www.modelscope.cn/datasets/VisualAgentBench/VAB-Minecraft.git - ``` - - 3. Load the docker image from ModelScope dataset. - - ```bash - docker load -i VAB-Minecraft/vab_minecraft.tar - ``` - -4. Download weights of Steve-1 to `data/minecraft`. Please make sure you have access to google drive. - - ```bash - python scripts/minecraft_download.py - ``` - -## Get Started - -According to your hardware equipment, fill `available_ports` and `available_devices` in the task configuration file `configs/tasks/minecraft.yaml`. - -- `available_ports`: Please fill in available ports in your machine. Each concurrent docker container requires 1 port for communication with the task server. 
Ensure that you provide enough ports to accommodate the expected concurrency. - -- `available_devices`: Please fill in GPU IDs and their corresponding capability of concurrency. Each concurrent docker container occupies about **3.3 GB** memory. Ensure that you provide enough GPU memory to accommodate the expected concurrency. - -**Note: If you manually shut down the task server and assigner, please ensure you also stop the Minecraft containers to free up the ports!** - -# Setup for VAB-CSS - -TODO \ No newline at end of file
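
The VAB-Minecraft setup notes in this change say that each concurrent container needs one port and about 3.3 GB of GPU memory. The following is a minimal sketch of that sizing arithmetic, not part of the repository: the function names, the input shapes (a GPU-ID-to-free-memory mapping and a plain list of ports), and the example machine are all assumptions for illustration, and the sketch does not reproduce the actual schema of `configs/tasks/minecraft.yaml`.

```python
import math

# Figure taken from the setup notes ("about 3.3 GB" per container); treat it as approximate.
PER_CONTAINER_GB = 3.3


def max_containers(free_gpu_mem_gb: float, per_container_gb: float = PER_CONTAINER_GB) -> int:
    """Return how many containers fit in the given amount of free GPU memory."""
    if free_gpu_mem_gb <= 0:
        return 0
    return math.floor(free_gpu_mem_gb / per_container_gb)


def plan_concurrency(free_mem_by_gpu: dict[int, float], ports: list[int]) -> dict[int, int]:
    """Assign a container count to each GPU without exceeding the number of ports.

    Hypothetical helper: GPUs are filled in ascending ID order, and every
    container consumes exactly one port from the pool.
    """
    remaining_ports = len(ports)
    plan: dict[int, int] = {}
    for gpu_id, free_gb in sorted(free_mem_by_gpu.items()):
        count = min(max_containers(free_gb), remaining_ports)
        plan[gpu_id] = count
        remaining_ports -= count
    return plan


if __name__ == "__main__":
    # Hypothetical machine: two GPUs with 24 GB and 8 GB free, and eight free ports.
    print(plan_concurrency({0: 24.0, 1: 8.0}, ports=list(range(11000, 11008))))
```

In this sketch the total concurrency is bounded by whichever runs out first, GPU memory or ports, which is why both settings need to be sized together.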