This role sits at the intersection of distributed data engineering, entity matching, identity resolution, and large-scale healthcare data processing. You will lead a small team of engineers while remaining deeply hands-on technically, owning the systems and pipelines powering automatching, grouping logic, identity mapping, deduplication, and enrichment workflows processing tens of millions of records.
You will partner closely with Product, AI/ML, Analytics, and Engineering teams to improve platform accuracy, scalability, reliability, and operational efficiency across one of H1’s most critical data platforms.
You will:
- Lead the design, optimization, and scalability of distributed Spark/PySpark pipelines powering entity resolution and large-scale healthcare data processing.
- Own systems supporting automatching, identity mapping, grouping logic, deduplication, enrichment, and auto-approval workflows across healthcare provider and organization datasets.
- Build and maintain scalable processing frameworks for PubMed, clinical trial, ct.gov, conference, and other healthcare data sources.
- Drive infrastructure optimization initiatives focused on improving throughput, runtime, observability, and cloud compute cost efficiency.
- Partner closely with AI/ML teams to integrate matching and resolution models into EMERALD and improve matching precision and recall.
- Lead complex technical initiatives from architecture and design through deployment, monitoring, and long-term production support.
- Serve as a technical leader and mentor across the team through code reviews, technical guidance, and engineering best practices.
- Collaborate directly with Product and business stakeholders to align technical solutions with operational and customer needs.
- Support production operations, incident response, troubleshooting, and ongoing platform reliability.
You bring strong hands-on engineering expertise across distributed computing, large-scale data processing, and infrastructure optimization while also helping guide technical direction and mentor engineers across the organization.
- Deep expertise with distributed data processing frameworks such as Apache Spark and Hadoop, particularly within AWS environments.
- Strong proficiency in Python (PySpark), Scala, Java, or other modern programming languages used for large-scale distributed processing.
- Experience building scalable ETL/ELT frameworks across both batch and streaming architectures.
- Experience with entity resolution, identity mapping, automatching, deduplication, or large-scale matching systems is strongly preferred.
- Strong understanding of distributed file formats, including Apache Parquet and Apache Avro.
- Experience with streaming technologies such as Kafka, Spark Streaming, or KSQL.
- Strong grasp of software engineering fundamentals including distributed systems, data structures, concurrency, and system design.
- Experience performing root cause analysis across large-scale distributed systems and complex data pipelines.
- Ability to write clean, maintainable, modular, and production-grade code.
- Experience improving performance, scalability, observability, and infrastructure efficiency within distributed systems.
- Strong communication and collaboration skills across both technical and non-technical stakeholders.
- Familiarity with modern development and infrastructure tooling, including Git, CI/CD pipelines, Docker, Kubernetes, Terraform, Argo, Hudi, and Jira.
Anticipated role close date: 8/1/2026