AdhamAfis/web-crawler
Web Crawler Project

Overview

This project is a web crawler application built using Flask for the backend and React with Vite for the frontend. The crawler fetches and parses web pages, extracts links, and recursively follows these links to a specified depth, creating a sitemap of the domain. This project was inspired by the web-crawler-with-depth repository by Hiurge.

Features

  • Recursive Web Crawling: Crawls a given URL to a specified depth, following links within the same domain.
  • Flask Backend: Handles the crawling logic and provides an API endpoint for starting the crawl.
  • React Frontend: A simple user interface for entering the URL and depth and displaying the resulting sitemap.
  • CORS Enabled: Allows cross-origin requests between the frontend and backend.
  • Error Handling: Gracefully handles invalid requests and network errors.

Technologies Used

  • Backend: Flask, BeautifulSoup, requests, tldextract
  • Frontend: React, Vite
  • Networking: Flask-CORS

Getting Started

Prerequisites

  • Python 3.x
  • Node.js and npm

Installation

  1. Clone the Repository

    git clone https://github.com/AdhamAfis/web-crawler.git
    cd web-crawler
  2. Backend Setup

    cd backend
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    pip install -r requirements.txt
  3. Frontend Setup

    cd frontend
    npm install

Running the Application

  1. Start the Backend

    cd backend
    flask run
  2. Start the Frontend

    cd frontend
    npm run dev
  3. Access the Application

    Open your browser and navigate to http://localhost:5173.

API Endpoint

POST /crawl

Starts the crawling process.

  • URL: /crawl
  • Method: POST
  • Request Body:
    {
      "url": "https://example.com",
      "depth": 2
    }
  • Response:
    {
      "domain": "https://example.com",
      "crawl_depth": 2,
      "sitemap": { ... }
    }
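The request above can be exercised from Python as a quick sanity check. This sketch assumes the backend is running locally on Flask's default port 5000 (the endpoint path and request body come from the README; the host and port are assumptions to verify against your setup):

```python
import json
import urllib.request

# Build a POST request matching the /crawl endpoint described above.
# The host/port (localhost:5000) is an assumption -- Flask's default.
payload = {"url": "https://example.com", "depth": 2}
req = urllib.request.Request(
    "http://localhost:5000/crawl",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# To actually send it, the backend must be running:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
#     print(result["sitemap"])
```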

Project Structure

  • backend: Contains the Flask application and crawling logic.
  • frontend: Contains the React application for the user interface.

Code Explanation

Backend (backend/app.py)

  • Flask Setup: Initializes the Flask application and enables CORS.
  • Crawling Functions: Includes get_soup, get_raw_links, clean_links, crawl_page, and crawl_recursive functions for fetching and parsing web pages.
  • API Endpoint: Defines the /crawl endpoint for starting the crawl process.
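The pipeline those functions implement — fetch a page, extract raw links, keep only same-domain ones, recurse to the requested depth — can be sketched with the standard library alone. Note the actual backend uses BeautifulSoup, requests, and tldextract; the function bodies below are an illustrative assumption, and the in-memory `pages` dict stands in for real HTTP fetching:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (stand-in for BeautifulSoup)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def clean_links(base_url, raw_links):
    """Resolve relative links and keep only same-domain URLs."""
    base_netloc = urlparse(base_url).netloc
    cleaned = set()
    for link in raw_links:
        absolute = urljoin(base_url, link)
        if urlparse(absolute).netloc == base_netloc:
            cleaned.add(absolute)
    return cleaned

def crawl_recursive(url, depth, fetch, visited=None):
    """Build a nested sitemap dict by following same-domain links to `depth`."""
    if visited is None:
        visited = set()
    if depth == 0 or url in visited:
        return {}
    visited.add(url)
    parser = LinkExtractor()
    parser.feed(fetch(url))  # fetch(url) returns the page's HTML as a string
    sitemap = {}
    for link in clean_links(url, parser.links):
        sitemap[link] = crawl_recursive(link, depth - 1, fetch, visited)
    return sitemap

# Demo against an in-memory "site" instead of live HTTP requests:
pages = {
    "https://example.com/": '<a href="/a">A</a> <a href="https://other.org/">X</a>',
    "https://example.com/a": '<a href="/">home</a>',
}
result = crawl_recursive("https://example.com/", 2, lambda u: pages.get(u, ""))
```

The `visited` set prevents infinite loops on circular links, and the off-domain link to `other.org` is filtered out by `clean_links`, mirroring the same-domain rule described in the Features section.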

Frontend (frontend/src/App.jsx)

  • React Setup: Basic setup with state management using hooks.
  • Form Handling: Handles URL and depth input, and submits the data to the backend.
  • Sitemap Display: Displays the resulting sitemap in a formatted JSON view.

Inspiration

This project was inspired by the web-crawler-with-depth repository by Hiurge. Their implementation provided a solid foundation and ideas for developing this crawler.

License

This project is licensed under the MIT License. See the LICENSE file for details.
