Research Crawling Engineer

Job in London, Laurel County, Kentucky, 40741, USA
Listing for: Startup Talents
Full Time position
Listed on 2026-05-05
Job specializations:
  • Software Development
    Data Engineer, Software Engineer
Salary/Wage Range or Industry Benchmark: 150,000 - 225,000 USD yearly
Job Description & How to Apply Below

London, United States | Posted on 30/04/2026

They also operate a massive distributed crawler, giving them unique access to high‑quality public web data at global scale.

About the role

They are hiring a Research Crawling Engineer (fully remote, USA/EU, with a 6-hour overlap with EST). You will join a company at the forefront of developing a web‑scale crawler and knowledge graph that improves access to public web data and extends the value of AI to the people.

As a Research Crawling Engineer, you will design and operate large‑scale web data acquisition systems for research and model development. Your work will span distributed systems, scraping infrastructure, and data pipelines.

Key Responsibilities
  • Operate at the boundary of scale and reliability
  • Adapt to constantly changing web environments
  • Balance throughput, coverage, and data quality
  • Own end‑to‑end data acquisition pipelines
Missions
  • Design high‑throughput, fault‑tolerant systems for data collection (millions to billions of URLs/day)
  • Handle anti‑bot systems, rate limits, and dynamic/JS‑heavy sites
  • Develop pipelines for cleaning, deduplication, filtering, and normalisation
  • Construct and maintain datasets for research and model training
  • Monitor crawl performance, coverage, and data quality; iterate quickly
  • Collaborate with research teams to align data collection with modeling needs
  • Optimize infrastructure for cost, latency, and reliability
Example Projects
  • Build a distributed crawler for a continuously updated, high‑quality web project
  • Design a system to classify and filter billions of pages for pretraining
  • Extract structured data from dynamic, JS‑heavy sites at scale
  • Improve deduplication and quality scoring across multimodal datasets
Requirements
  • Strong programming experience in one or more of:
    Go, Rust, Python, Java, or C++
  • Experience working for reputable companies
  • Experience building and maintaining large‑scale web crawlers or data pipelines
  • Experience designing high‑throughput, fault‑tolerant systems for data collection (millions to billions of URLs/day)
  • Experience handling anti‑bot systems, rate limits, and dynamic/JS‑heavy sites
  • Experience constructing and maintaining datasets for research and model training
  • Familiarity with distributed systems and parallel processing
  • Experience working with large datasets (TB–PB scale preferred)
  • Ability to debug unstable or adversarial environments
Preferred / Bonus
  • Experience with NLP pipelines or dataset curation for ML
  • Familiarity with LLM pretraining data or retrieval systems
  • Knowledge of proxy systems, IP rotation, and large‑scale request orchestration
  • Background in data quality evaluation or benchmarking
  • Experience running workloads on cloud or bare‑metal infrastructure
Main Evaluation Criteria
  • Ability to design systems that scale without degrading quality
  • Practical problem‑solving under real‑world constraints
  • Speed of iteration and ownership
  • Measurable improvements in data coverage, quality, or efficiency
  • Contract: Permanent role (Full remote – USA or 6-hour overlap with EST)
  • Salary: $150k to $225k, based on experience and demonstrated ability to operate at scale, plus equity package/tokens