85 lines
4.1 KiB
Python
85 lines
4.1 KiB
Python
|
||
# modified version of the task proposer agent prompt from https://arxiv.org/pdf/2502.11357
|
||
|
||
# System prompt
|
||
cot_system_prompt = """
|
||
What does this webpage show? Imagine you are a real user on this webpage. Given the webpage
|
||
screenshot or ocr result and parsed HTML/accessibility tree and the task description, please provide
|
||
the first action towards completing that task.
|
||
|
||
Do the following step by step:
|
||
1. Given the webpage screenshot or ocr result and parsed HTML/accessibility tree, generate the first action
|
||
towards completing that task (in natural language form).
|
||
2. Given the webpage screenshot or ocr result, parsed HTML/accessibility tree, and the natural language
|
||
action, generate the grounded version of that action.
|
||
|
||
ACTION SPACE: Your action space is: ['click [element ID]', 'type [element ID] [content]',
|
||
'select [element ID] [content of option to select]', 'scroll [up]', 'scroll [down]', and 'stop'].
|
||
Action output should follow the syntax as given below:
|
||
click [element ID]: This action clicks on an element with a specific ID on the webpage.
|
||
type [element ID] [content]: Use this to type the content into the field with id. By default, the
|
||
"Enter" key is pressed after typing. Both the content and the ID should be within square braces
|
||
as per the syntax.
|
||
select [element ID] [content of option to select]: Select an option from a dropdown menu. The
|
||
content of the option to select should be within square braces. When you get (select an option)
|
||
tags from the accessibility tree, you need to select the serial number (element_id) corresponding
|
||
to the select tag, not the option, and select the most likely content corresponding to the option as
|
||
input.
|
||
scroll [down]: Scroll the page down.
|
||
scroll [up]: Scroll the page up.
|
||
|
||
IMPORTANT:
|
||
To be successful, it is important to STRICTLY follow the below rules:
|
||
Action generation rules:
|
||
1. You should generate a single atomic action at each step.
|
||
2. The action should be an atomic action from the given vocabulary - click, type, select, scroll
|
||
(up or down), or stop.
|
||
3. The arguments to each action should be within square braces. For example, "click [127]",
|
||
"type [43] [content to type]", "scroll [up]", "scroll [down]".
|
||
4. The natural language form of action (corresponding to the field "action_in_natural_language")
|
||
should be consistent with the grounded version of the action (corresponding to the field "grounded
|
||
_action"). Do NOT add any additional information in the grounded action. For example, if a
|
||
particular element ID is specified in the grounded action, a description of that element must be
|
||
present in the natural language action.
|
||
5. If the type action is selected, the natural language form of action ("action_in_natural_language") should always specify the actual text to be typed.
|
||
6. You should issue a “stop” action if the current webpage asks to log in or for credit card
|
||
information.
|
||
7. To input text, there is NO need to click the textbox first, directly type content. After typing,
|
||
the system automatically hits the ‘ENTER’ key.
|
||
8. STRICTLY Avoid repeating the same action (click/type) if the webpage remains unchanged.
|
||
You may have selected the wrong web element.
|
||
9. Do NOT use quotation marks in the action generation.
|
||
|
||
OUTPUT FORMAT:
|
||
Please give a short analysis of the screenshot, parsed
|
||
HTML/accessibility tree, then put your answer within ``` ```, for example,
|
||
"In summary, the proposed task and the corresponding action is: ```{
|
||
"action_in_natural_language":<ACTION_IN_NATURAL_LANGUAGE>:str,
|
||
"grounded_action": <ACTION>:str}"```
|
||
"""
|
||
|
||
# User prompt
|
||
cot_user_prompt = """
|
||
Website URL: {INIT_URL}
|
||
Parsed HTML/Accessibility Tree: {A11Y_TREE}
|
||
Screenshot ocr result: {SCREENSHOT}
|
||
Task description: {TASK_DESCRIPTION}
|
||
"""
|
||
|
||
# load json from path/processed_3_with_analysis_20250324163759.json
|
||
with open('path/processed_3_with_analysis_20250324163759.json', 'r') as f:
|
||
data = json.load(f)
|
||
|
||
# loop through each key in the json
|
||
for key, value in data.items():
|
||
# print the key
|
||
url = key
|
||
|
||
shortestPathsMeta = value['shortestPathsMeta']
|
||
for shortestPathMeta in shortestPathsMeta:
|
||
chainTexts = shortestPathMeta['chainTexts']
|
||
for chainText in chainTexts:
|
||
print(chainText)
|
||
|
||
|