AdhamAfis/web-crawler
Web Crawler Project

Overview

This project is a web crawler application built using Flask for the backend and React with Vite for the frontend. The crawler fetches and parses web pages, extracts links, and recursively follows these links to a specified depth, creating a sitemap of the domain. This project was inspired by the web-crawler-with-depth repository by Hiurge.

Features

  • Recursive Web Crawling: Crawls a given URL to a specified depth, following links within the same domain.
  • Flask Backend: Handles the crawling logic and provides an API endpoint for starting the crawl.
  • React Frontend: A simple user interface for entering the URL and depth and displaying the resulting sitemap.
  • CORS Enabled: Allows cross-origin requests between the frontend and backend.
  • Error Handling: Gracefully handles invalid requests and network errors.

Technologies Used

  • Backend: Flask, BeautifulSoup, requests, tldextract
  • Frontend: React, Vite
  • Networking: Flask-CORS

Getting Started

Prerequisites

  • Python 3.x
  • Node.js and npm

Installation

  1. Clone the Repository

    git clone https://github.com/AdhamAfis/web-crawler.git
    cd web-crawler
  2. Backend Setup

    cd backend
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    pip install -r requirements.txt
  3. Frontend Setup

    cd frontend
    npm install

Running the Application

  1. Start the Backend

    cd backend
    flask run
  2. Start the Frontend

    cd frontend
    npm run dev
  3. Access the Application

    Open your browser and navigate to http://localhost:5173.

API Endpoint

POST /crawl

Starts the crawling process.

  • URL: /crawl
  • Method: POST
  • Request Body:
    {
      "url": "https://example.com",
      "depth": 2
    }
  • Response:
    {
      "domain": "https://example.com",
      "crawl_depth": 2,
      "sitemap": { ... }
    }
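The request above can be exercised from Python as a quick sanity check. This sketch assumes the backend is running locally on Flask's default port 5000 (the endpoint path and request body come from the README; the host and port are assumptions to verify against your setup):

```python
import json
import urllib.request

# Build a POST request matching the /crawl endpoint described above.
# The host/port (localhost:5000) is an assumption -- Flask's default.
payload = {"url": "https://example.com", "depth": 2}
req = urllib.request.Request(
    "http://localhost:5000/crawl",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# To actually send it, the backend must be running:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
#     print(result["sitemap"])
```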

Project Structure

  • backend: Contains the Flask application and crawling logic.
  • frontend: Contains the React application for the user interface.

Code Explanation

Backend (backend/app.py)

  • Flask Setup: Initializes the Flask application and enables CORS.
  • Crawling Functions: Includes get_soup, get_raw_links, clean_links, crawl_page, and crawl_recursive functions for fetching and parsing web pages.
  • API Endpoint: Defines the /crawl endpoint for starting the crawl process.
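The pipeline those functions implement — fetch a page, extract raw links, keep only same-domain ones, recurse to the requested depth — can be sketched with the standard library alone. Note the actual backend uses BeautifulSoup, requests, and tldextract; the function bodies below are an illustrative assumption, and the in-memory `pages` dict stands in for real HTTP fetching:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (stand-in for BeautifulSoup)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def clean_links(base_url, raw_links):
    """Resolve relative links and keep only same-domain URLs."""
    base_netloc = urlparse(base_url).netloc
    cleaned = set()
    for link in raw_links:
        absolute = urljoin(base_url, link)
        if urlparse(absolute).netloc == base_netloc:
            cleaned.add(absolute)
    return cleaned

def crawl_recursive(url, depth, fetch, visited=None):
    """Build a nested sitemap dict by following same-domain links to `depth`."""
    if visited is None:
        visited = set()
    if depth == 0 or url in visited:
        return {}
    visited.add(url)
    parser = LinkExtractor()
    parser.feed(fetch(url))  # fetch(url) returns the page's HTML as a string
    sitemap = {}
    for link in clean_links(url, parser.links):
        sitemap[link] = crawl_recursive(link, depth - 1, fetch, visited)
    return sitemap

# Demo against an in-memory "site" instead of live HTTP requests:
pages = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.org/">X</a>',
    "https://example.com/a": '<a href="/">home</a>',
}
result = crawl_recursive("https://example.com/", 2, lambda u: pages.get(u, ""))
```

The `visited` set prevents infinite loops on circular links, and the off-domain link to `other.org` is filtered out by `clean_links`, mirroring the same-domain rule described in the Features section.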

Frontend (frontend/src/App.jsx)

  • React Setup: Basic setup with state management using hooks.
  • Form Handling: Handles URL and depth input, and submits the data to the backend.
  • Sitemap Display: Displays the resulting sitemap in a formatted JSON view.

Inspiration

This project was inspired by the web-crawler-with-depth repository by Hiurge. Their implementation provided a solid foundation and ideas for developing this crawler.

License

This project is licensed under the MIT License. See the LICENSE file for details.
