Research Crawling Engineer Job London Kentucky USA, Software Development

London, United States | Posted on 30/04/2026

The company operates a massive distributed crawler, giving it unique access to high‑quality public web data at global scale.

About the role

They are hiring a Research Crawling Engineer (fully remote, USA/EU, with a 6‑hour overlap with EST). You will join a company at the forefront of developing a web‑scale crawler and knowledge graph that improves access to public web data and extends the value of AI to the people.

As a Research Crawling Engineer, you will design and operate large‑scale web data acquisition systems for research and model development. Your work will span distributed systems, scraping infrastructure, and data pipelines.

Key Responsibilities

Operate at the boundary of scale and reliability
Adapt to constantly changing web environments
Balance throughput, coverage, and data quality
Own end‑to‑end data acquisition pipelines

Missions

Design high‑throughput, fault‑tolerant systems for data collection (millions to billions of URLs/day)
Handle anti‑bot systems, rate limits, and dynamic/JS‑heavy sites
Develop pipelines for cleaning, deduplication, filtering, and normalisation
Construct and maintain datasets for research and model training
Monitor crawl performance, coverage, and data quality; iterate quickly
Collaborate with research teams to align data collection with modeling needs
Optimize infrastructure for cost, latency, and reliability

Example Projects

Build a distributed crawler for a continuously updated, high‑quality web project
Design a system to classify and filter billions of pages for pretraining
Extract structured data from dynamic, JS‑heavy sites at scale
Improve deduplication and quality scoring across multimodal datasets

Requirements

Strong programming experience in one or more of: Go, Rust, Python, Java, or C++
Experience working for reputable companies
Experience building and maintaining large‑scale web crawlers or data pipelines
Experience designing high‑throughput, fault‑tolerant systems for data collection (millions to billions of URLs/day)
Experience handling anti‑bot systems, rate limits, and dynamic/JS‑heavy sites
Experience constructing and maintaining datasets for research and model training
Familiarity with distributed systems and parallel processing
Experience working with large datasets (TB–PB scale preferred)
Ability to debug unstable or adversarial environments

Preferred / Bonus

Experience with NLP pipelines or dataset curation for ML
Familiarity with LLM pretraining data or retrieval systems
Knowledge of proxy systems, IP rotation, and large‑scale request orchestration
Background in data quality evaluation or benchmarking
Experience running workloads on cloud or bare‑metal infrastructure

Main Evaluation Criteria

Ability to design systems that scale without degrading quality
Practical problem‑solving under real‑world constraints
Speed of iteration and ownership
Measurable improvements in data coverage, quality, or efficiency

Contract

Permanent role (Full remote – USA or 6‑hour overlap with EST)
Salary: $150k to $225k, based on experience and demonstrated ability to operate at scale, plus equity package/tokens
