Shape the Future with Dun & Bradstreet
At Dun & Bradstreet, we believe data has the power to create a better tomorrow. As a global leader in business decisioning data and analytics, we help companies worldwide grow, manage risk, and innovate. For over 180 years, businesses have trusted us to turn uncertainty into opportunity. We’re a diverse, global team that values creativity, collaboration, and bold ideas. Are you ready to make an impact and help shape what’s next? Join us! Explore opportunities at dnb.com/careers.
Job Summary:
We are looking for a skilled Data Engineer to join our Global Product Data (GPD) team in Hyderabad. You will play a critical role in building and maintaining automated web scraping pipelines that extract structured data from diverse online sources, transforming raw data into production-ready datasets for our Master Data Repository (MDR).
This role is part of a strategic initiative to bring web scraping and data acquisition capabilities in-house, replacing external vendor dependencies. You will work closely with the data engineering and product teams to ensure high-quality, reliable, and timely data delivery.
Key Responsibilities:
Design, develop, and maintain scalable web scraping solutions to extract data from a wide range of websites and online platforms
Build robust data pipelines and automation workflows for data collection, cleaning, validation, and transformation
Process and transform scraped data into MDR production-ready formats, meeting strict quality and timeline requirements
Monitor and troubleshoot scraping jobs, handling anti-bot mechanisms, CAPTCHAs, rate limiting, and site structure changes
Collaborate with cross-functional teams to understand data requirements, prioritize sources, and define scraping specifications
Document scraping processes, data schemas, and technical decisions for knowledge sharing and continuity
Identify opportunities for process improvement and automation to increase efficiency and reduce turnaround time
Support the transition of work from external vendors, ensuring seamless continuity of data deliveries
Key Skills:
8+ years of professional experience in web scraping, data extraction, or data engineering
Strong proficiency in Python, with hands-on experience using scraping libraries and frameworks (Scrapy, BeautifulSoup, Selenium, Playwright, or similar)
Experience building and scheduling automated data pipelines (cron, Airflow, or equivalent orchestration tools)
Solid understanding of HTML, CSS, DOM structure, and browser developer tools for inspecting and reverse-engineering web pages
Familiarity with REST APIs, JSON, and techniques for extracting data from API endpoints
Experience with relational databases (PostgreSQL, MySQL) and proficiency in SQL
Ability to handle anti-scraping measures: proxy rotation, headless browsers, CAPTCHA handling, and request throttling
Strong problem-solving skills and attention to data quality and accuracy
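As a small illustration of the HTML/DOM extraction skills listed above, the sketch below pulls structured fields out of a static HTML fragment using only Python's standard library. In practice the role would use frameworks such as Scrapy, BeautifulSoup, or Playwright; the page structure and class names here are hypothetical.

```python
from html.parser import HTMLParser

class CompanyCellParser(HTMLParser):
    """Collect text from <td class="company"> cells in an HTML table."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.companies = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "td" and ("class", "company") in attrs:
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.companies.append(data.strip())

html = """
<table>
  <tr><td class="company">Acme Corp</td><td>US</td></tr>
  <tr><td class="company">Globex Ltd</td><td>UK</td></tr>
</table>
"""

parser = CompanyCellParser()
parser.feed(html)
print(parser.companies)  # ['Acme Corp', 'Globex Ltd']
```

A dedicated scraping framework adds the pieces this sketch omits: request scheduling, retries, throttling, and resilience to site structure changes.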
Good-to-Have Skills:
Experience with cloud platforms (AWS, GCP, or Azure) for deploying and scaling scraping infrastructure
Familiarity with containerization (Docker) and CI/CD pipelines
Experience with data transformation tools or ETL frameworks
Knowledge of natural language processing (NLP) or AI-assisted data extraction techniques
Prior experience in education data, institutional data, or similar structured-data domains
Experience with NoSQL databases (MongoDB, Elasticsearch) for handling semi-structured data