This project is a web crawler application built using Flask for the backend and React with Vite for the frontend. The crawler fetches and parses web pages, extracts links, and recursively follows these links to a specified depth, creating a sitemap of the domain. This project was inspired by the web-crawler-with-depth repository by Hiurge.
- Recursive Web Crawling: Crawls a given URL to a specified depth, following links within the same domain.
- Flask Backend: Handles the crawling logic and provides an API endpoint for starting the crawl.
- React Frontend: Simple user interface for inputting the URL and depth, and displaying the resulting sitemap.
- CORS Enabled: Allows cross-origin requests between the frontend and backend.
- Error Handling: Gracefully handles invalid requests and network errors.
- Backend: Flask, BeautifulSoup, requests, tldextract
- Frontend: React, Vite
- Networking: Flask-CORS
- Python 3.x
- Node.js and npm
- Clone the Repository

  ```shell
  git clone https://github.com/AdhamAfis/web-crawler.git
  cd web-crawler
  ```

- Backend Setup

  ```shell
  cd backend
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  pip install -r requirements.txt
  ```
- Frontend Setup

  ```shell
  cd frontend
  npm install
  ```
- Start the Backend

  ```shell
  cd backend
  flask run
  ```

- Start the Frontend

  ```shell
  cd frontend
  npm run dev
  ```

- Access the Application: Open your browser and navigate to `http://localhost:5173`.
Starts the crawling process.

- URL: `/crawl`
- Method: `POST`
- Request Body:

  ```json
  {
    "url": "https://example.com",
    "depth": 2
  }
  ```

- Response:

  ```json
  {
    "domain": "https://example.com",
    "crawl_depth": 2,
    "sitemap": { ... }
  }
  ```
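As a sketch of how a client might call this endpoint, the following uses the `requests` library against the backend's assumed local address (Flask's default port 5000); the helper name `start_crawl` and the example URL and depth are illustrative:

```python
import requests

API_URL = "http://localhost:5000/crawl"  # assumes the Flask default port


def start_crawl(url: str, depth: int, timeout: int = 30) -> dict:
    """POST a crawl request to the backend and return the parsed JSON response."""
    response = requests.post(API_URL, json={"url": url, "depth": depth}, timeout=timeout)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    result = start_crawl("https://example.com", depth=2)
    print(result["domain"], result["crawl_depth"])
```

The same request can of course be made from any HTTP client, including the React frontend's `fetch` call.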
- backend: Contains the Flask application and crawling logic.
- frontend: Contains the React application for the user interface.
- Flask Setup: Initializes the Flask application and enables CORS.
- Crawling Functions: Includes `get_soup`, `get_raw_links`, `clean_links`, `crawl_page`, and `crawl_recursive` functions for fetching and parsing web pages.
- API Endpoint: Defines the `/crawl` endpoint for starting the crawl process.
- React Setup: Basic setup with state management using hooks.
- Form Handling: Handles URL and depth input, and submits the data to the backend.
- Sitemap Display: Displays the resulting sitemap in a formatted JSON view.
This project was inspired by the web-crawler-with-depth repository by Hiurge. Their implementation provided a solid foundation and ideas for developing this crawler.
This project is licensed under the MIT License. See the LICENSE file for details.