
Frontier-Health/OmniParser


OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent


📢 [Project Page] [V2 Blog Post] [Models V2] [Models V1.5] [HuggingFace Space Demo]

OmniParser is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.

Install

First clone the repo, then set up the environment:

cd OmniParser
uv init --python 3.12
uv add -r requirements.txt

Set up the Hugging Face CLI:

curl -LsSf https://hf.co/cli/install.sh | bash

Make sure the V2 weights are in the weights folder, with the caption weights folder named icon_caption_florence. If they are not, download them with:

   # download the model checkpoints to local directory OmniParser/weights/
   for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do hf download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
   mv weights/icon_caption weights/icon_caption_florence
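After the download and rename, it can help to confirm that every expected file is in place before starting anything that loads the models. The following is a minimal sketch, not part of the repository: `check_weights` is a hypothetical helper name, and the file list is taken directly from the download command above, with icon_caption already renamed to icon_caption_florence.

```shell
# check_weights is a hypothetical helper, not shipped with OmniParser.
# It checks the files named in the download command above and reports any that are missing.
check_weights() {
  dir="${1:-weights}"   # weights directory; defaults to ./weights
  missing=0
  for f in \
    icon_detect/train_args.yaml \
    icon_detect/model.pt \
    icon_detect/model.yaml \
    icon_caption_florence/config.json \
    icon_caption_florence/generation_config.json \
    icon_caption_florence/model.safetensors
  do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Returns 0 when all files exist, 1 otherwise.
  return "$missing"
}
```

Run `check_weights` from the OmniParser directory, or pass a different path as the first argument. A non-zero exit status means at least one file is absent and the download step should be repeated.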

Frontier Health model server (AWS deployment)

To create and run a new model server endpoint, first define an alias that loads the variables in .env and calls ec2_manage.sh:

alias omni='set -a && source .env && set +a && ec2_manage.sh'

Then create a new server with:

omni create

Deploy to an existing server with:

omni deploy

Stop the instance:

omni stop

Restart the instance:

omni start

Terminate the instance (permanently):

omni terminate

Check the server is running:

curl http://{SERVER_IP}:8000/health/
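A freshly created or restarted instance may take a while before the endpoint answers, so a single curl can fail even when the deployment is fine. The sketch below polls the same health URL until it responds or a retry limit is reached. `wait_for_health`, the retry count, and the delay are illustrative assumptions, not part of ec2_manage.sh; only the URL shape comes from the command above.

```shell
# wait_for_health is a hypothetical helper, not part of ec2_manage.sh.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
  url="$1"
  attempts="${2:-30}"   # assumed default: 30 tries
  delay="${3:-10}"      # assumed default: 10 seconds between tries
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP errors as failure, -s: silent,
    # --max-time: give up on a single request after 5 seconds
    if curl -fs --max-time 5 "$url" > /dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}
```

With SERVER_IP set to the instance address, call it as `wait_for_health "http://${SERVER_IP}:8000/health/"`. The function returns 0 as soon as one request succeeds and 1 if every attempt fails.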

Model Weights License

For the model checkpoints on the Hugging Face model hub, note that the icon_detect model is under the AGPL license, which it inherits from the original YOLO model. icon_caption_blip2 and icon_caption_florence are under the MIT license. Please refer to the LICENSE file in the folder of each model: https://huggingface.co/microsoft/OmniParser.

📚 Citation

Our technical report is available at https://arxiv.org/abs/2408.00203. If you find our work useful, please consider citing it:

@misc{lu2024omniparserpurevisionbased,
      title={OmniParser for Pure Vision Based GUI Agent}, 
      author={Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah},
      year={2024},
      eprint={2408.00203},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.00203}, 
}
