OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

📢 [Project Page] [V2 Blog Post] [Models V2] [Models V1.5] [HuggingFace Space Demo]

OmniParser is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.

Install

First clone the repo, then set up the environment:

cd OmniParser
uv init --python 3.12
uv add -r requirements.txt

Install the Hugging Face CLI:

curl -LsSf https://hf.co/cli/install.sh | bash

Make sure the V2 weights are in the weights folder, and that the caption weights folder is named icon_caption_florence. If they are missing, download them with:

   # download the model checkpoints to local directory OmniParser/weights/
   for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do hf download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
   mv weights/icon_caption weights/icon_caption_florence
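
To confirm the download and rename worked, a small check along these lines may help. It is a minimal sketch, not part of the repository: the function name `missing_weight_files` is hypothetical, and the file list is simply copied from the download command above, with `icon_caption` already renamed to `icon_caption_florence`.

```python
from pathlib import Path

# Files fetched by the download loop above, after the icon_caption folder
# has been renamed to icon_caption_florence.
EXPECTED_WEIGHT_FILES = [
    "icon_detect/train_args.yaml",
    "icon_detect/model.pt",
    "icon_detect/model.yaml",
    "icon_caption_florence/config.json",
    "icon_caption_florence/generation_config.json",
    "icon_caption_florence/model.safetensors",
]


def missing_weight_files(weights_dir="weights"):
    """Return the expected files that are absent under weights_dir."""
    root = Path(weights_dir)
    return [rel for rel in EXPECTED_WEIGHT_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weight_files()
    if missing:
        print("Missing weight files:")
        for rel in missing:
            print(f"  weights/{rel}")
    else:
        print("All expected V2 weight files are present.")
```

Run it from the OmniParser directory. An empty result means every file named in the download command is in place. The check only looks at file presence, not at file contents.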

Frontier Health model server (AWS deployment)

To create and run a new model server endpoint:

alias omni='set -a && source .env && set +a && ec2_manage.sh'

Then create a new server with:

omni create

Deploy to an existing server with:

omni deploy

Stop the instance:

omni stop

Restart the instance:

omni start

Terminate the instance (permanently):

omni terminate
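
The alias exports every variable in .env and then calls ec2_manage.sh with a subcommand. If you would rather drive this from Python, the same idea can be sketched as below. This is an illustration under assumptions, not the repository's tooling: `parse_env_lines`, `build_command`, and `run_omni` are hypothetical helpers, the parser handles only plain KEY=VALUE lines (no shell expansion, unlike `source .env`), and, as with the alias, `ec2_manage.sh` must be resolvable on your PATH.

```python
import os
import subprocess

# Subcommands listed in this README.
SUBCOMMANDS = ("create", "deploy", "stop", "start", "terminate")


def parse_env_lines(lines):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments.

    Simplification: unlike `source .env`, no shell expansion is performed.
    """
    env = {}
    for raw in lines:
        line = raw.strip()
        if line.startswith("export "):
            line = line[len("export "):]
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        env[key.strip()] = value
    return env


def build_command(subcommand, script="ec2_manage.sh"):
    """Return the argument list for one of the documented subcommands."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    return [script, subcommand]


def run_omni(subcommand, env_file=".env"):
    """Run the script with the variables from env_file added to the environment."""
    with open(env_file, encoding="utf-8") as handle:
        extra = parse_env_lines(handle)
    return subprocess.run(
        build_command(subcommand), env={**os.environ, **extra}, check=True
    )
```

Rejecting unknown subcommands up front guards against typos, which matters most for `terminate`, since that action is permanent.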

Check the server is running:

curl http://{SERVER_IP}:8000/health/
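
A freshly created or restarted instance may take a while before it answers, so polling the endpoint can be more convenient than repeating the curl command by hand. The sketch below uses only the Python standard library. It assumes that an HTTP 200 response from /health/ means the server is ready, which this README does not state. The helper names and the default retry values are illustrative.

```python
import time
import urllib.error
import urllib.request


def health_url(server_ip, port=8000):
    """Build the health-check URL used by the curl command above."""
    return f"http://{server_ip}:{port}/health/"


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, and so on


def wait_until_healthy(url, attempts=30, delay=10.0, fetch=http_status):
    """Poll url until fetch returns 200; give up after `attempts` tries."""
    for attempt in range(attempts):
        if fetch(url) == 200:  # assumption: 200 means the server is ready
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, `wait_until_healthy(health_url(server_ip))` returns True once the server responds, or False after the last attempt. The `fetch` parameter exists so the polling logic can be exercised without a network connection.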

Model Weights License

The model checkpoints on the Hugging Face model hub carry different licenses. The icon_detect model is under the AGPL license, which it inherits from the original YOLO model. The icon_caption_blip2 and icon_caption_florence models are under the MIT license. Please refer to the LICENSE file in each model's folder: https://huggingface.co/microsoft/OmniParser.

📚 Citation

Our technical report is available on arXiv at https://arxiv.org/abs/2408.00203. If you find our work useful, please consider citing it:

@misc{lu2024omniparserpurevisionbased,
      title={OmniParser for Pure Vision Based GUI Agent}, 
      author={Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah},
      year={2024},
      eprint={2408.00203},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.00203}, 
}
