Data Engineer

Indax Inc · San Francisco, CA

You own the most important technical bet we're making: a real-time knowledge graph fed by continuous, large-scale data ingestion across thousands of municipal and regulatory websites. Fresh, intentful, linked industrial data at scale: that is the core product, and you are the person who makes it real.

You'll work in person in the Bay Area with two founders.

We are looking for someone obsessed with data pipelines and scraping infrastructure. Someone who learns fast, ships fast, and doesn't wait to be told what to do next. Someone who has worked in a startup before and understands what it actually means to own something: not just write the code, but watch it in production, fix it when it breaks, and make it better without being asked.

We are looking for extreme agency, urgency, humility, and exceptional ability.

What you'll own

Real-time knowledge graph
Design and operate the property graph at the center of everything, linking facilities, companies, permits, equipment categories, operators, and regulatory events into a live, queryable intelligence layer. Build entity resolution pipelines that reconcile inconsistent identifiers across thousands of sources using deterministic and probabilistic matching. Enrich nodes from EPA ECHO, EIA, SIC/NAICS crosswalks, and county parcel registries. The graph needs to stay fresh, coherent, and fast as sources change beneath it.

Large-scale scraping across thousands of municipal websites
Build and operate the ingestion layer that continuously pulls from municipal permit portals, state environmental databases, GIS feeds, federal registries, and hundreds of other public sources, each with its own structure, authentication quirks, rate limits, and failure modes. Write scrapers, async API clients, and custom parsers for PDF, XML, GeoJSON, and shapefiles. Handle anti-bot systems, proxy rotation, session management, and graceful degradation. This is production scraping infrastructure at a scale most engineers never touch.

Cron jobs to guarantee data freshness
You will build reliable scraping architecture for thousands of municipal websites, and the scheduled triggers that extract, enrich, match, and deduplicate project data at scale.

Data lake and regional architecture
Design the storage layer partitioned and indexed by region: county, MSA, state. Structure the data so the knowledge graph and downstream systems can scope efficiently to any geographic footprint at volume.

Infrastructure and reliability
GCP primary. Prefect for orchestration. Sentry for observability. Docker and GitHub Actions for CI/CD.

Stack
Languages: Python, SQL
Databases: PostgreSQL with PostGIS, Supabase, Neo4j-compatible graph schema
Cloud: AWS (inference), GCP (primary)
Orchestration: Modal
Scraping: Browserbase, Playwright, Selenium, httpx, aiohttp
Transformation: Large-scale LLM batch jobs on Modal
Observability: Sentry
CI/CD: Docker, GitHub Actions

Requirements
2–4 years in data or software engineering with real production ownership
Web scraping at scale: multi-site, high-volume, production scraping across arbitrary page structures; anti-bot handling, proxy rotation, session management, rate limiting
Knowledge graphs in production: design, build, and query property graphs with millions of nodes and edges; entity resolution, graph traversal, and schema evolution under real load
PostgreSQL in production: schema design, query optimization, PostGIS
Scheduled automation at scale: reliable job scheduling, failure recovery, alerting, concurrent job execution across dozens of pipelines
AWS/GCP production experience
A track record of shipping against messy, undocumented, real-world data sources

Plus
Neo4j or property graph modeling
Geospatial work: PostGIS, shapefiles, GeoJSON, spatial indexing
Geographic data partitioning by county, MSA, or jurisdiction
U.S. public regulatory data: EPA, state environmental agencies, municipal permit systems
Industrial, energy, construction, or supply chain data background

If you want to build the data backbone of the American industrial economy and have your fingerprints on something that matters, this is the role.