INTL
Freelancer
Expert
Outsourced
Remote OK
Semi-Structured Data Transformation
Budget
$1,500~$12,500 INR
Estimated duration
1~3 months
Difficulty
Expert
Tech stack
Python
PySpark
SQL
Scala
Spark
Hadoop
BigQuery
Snowflake
Airflow
dbt
Databases
ETL/ELT
Java
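AI analysis summary
This project aims to build a robust ETL/ELT pipeline that converts semi-structured data (nested JSON/key-value) stored in a relational database into normalized, analysis-ready tables. Experience with large-scale data processing and with building pipelines on Spark, BigQuery, Snowflake, or optimized SQL/Python is essential.
Original project description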
I’m sitting on a large database that stores semi-structured records, and I need a robust transformation layer that turns this raw content into analysis-ready tables. The data is already captured and stored; the task begins once the records land in the database and ends when the transformed results are written back to a target schema (or files, if that proves more efficient).
Key points you should know
• Source: relational database containing nested JSON / key-value blobs.
• Goal: parse, normalize, and flatten these blobs into well-defined columns while preserving relationships and lineage.
• Scale: millions of rows, so solutions that leverage Spark, Hadoop, BigQuery, Snowflake, or well-tuned SQL/Python pipelines are welcome—as long as they remain maintainable.
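To make the parse/normalize/flatten goal concrete, here is a minimal Python sketch of one possible approach. The posting does not show the real blob layout, so the field names (`customer`, `items`), the `record_id` key, and the helper names `flatten` and `normalize` are all hypothetical. Nested objects become underscore-joined columns, and one nested array is split into child rows that carry the parent key, so the relationship back to the source record is preserved.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into a single level, joining keys with '_'.

    Non-dict values (including lists) are copied through unchanged.
    """
    flat = {}
    for key, value in obj.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}_"))
        else:
            flat[column] = value
    return flat


def normalize(record_id, blob, child_key):
    """Split one JSON blob into a parent row and a list of child rows.

    Assumes the value under child_key, if present, is a list of objects.
    Every child row keeps record_id and its array position as lineage.
    """
    doc = json.loads(blob)
    children = doc.pop(child_key, [])
    parent = {"record_id": record_id, **flatten(doc)}
    child_rows = [
        {"record_id": record_id, "position": i, **flatten(child)}
        for i, child in enumerate(children)
    ]
    return parent, child_rows
```

Calling `normalize(42, blob, "items")` on a blob with a nested `customer` object and an `items` array would yield one parent row plus one child row per array element. At the scale mentioned above, the same split would more likely be expressed in Spark or warehouse SQL; this sketch only illustrates the shape of the output.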
Deliverables
1. Transformation code (Python, PySpark, SQL, or Scala) with clear comments.
2. A runnable job definition or workflow file (Airflow DAG, Spark submit script, dbt model, etc.) that shows how to execute the pipeline end-to-end.
3. Simple README explaining prerequisites, run steps, and how new fields should be added in future.
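One way to keep "how new fields should be added" simple enough for a README is a declarative mapping from target column to JSON path. The sketch below is only an assumption about how that could look; `FIELD_MAP`, `extract`, `to_row`, and the example paths are hypothetical and not taken from the posting.

```python
# Target column name -> path of keys inside the parsed JSON document.
# Adding a new output field is then a one-line change to this mapping.
FIELD_MAP = {
    "customer_name": ("customer", "name"),
    "city": ("customer", "address", "city"),
}


def extract(doc, path, default=None):
    """Walk a key path through nested dicts; return default if any step is missing."""
    current = doc
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def to_row(doc, field_map=FIELD_MAP):
    """Build one output row with exactly the columns named in field_map."""
    return {column: extract(doc, path) for column, path in field_map.items()}
```

With this layout, a missing path produces `None` rather than an exception, so whether that counts as a malformed record is left to a separate validation step.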
Acceptance criteria
• Pipeline processes at least 10 GB of source data without errors.
• Output tables/files match the target schema I’ll provide and contain no missing or malformed records.
• Execution can be parameterized for date ranges or incremental loads.
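The date-range parameterization in the last criterion could be sketched as follows. This is a minimal illustration using only the Python standard library; the table name `raw_records`, the columns `id`, `payload`, and `updated_at`, and the `--start`/`--end` flags are assumptions, since the posting does not describe the source schema. The window is half-open (start inclusive, end exclusive) so consecutive runs do not overlap.

```python
import argparse
import sqlite3
from datetime import date, timedelta


def parse_args(argv=None):
    """Parse an ISO date window; --end defaults to one day after --start."""
    parser = argparse.ArgumentParser(description="Run the transform for a date window.")
    parser.add_argument("--start", type=date.fromisoformat, required=True)
    parser.add_argument("--end", type=date.fromisoformat, default=None)
    args = parser.parse_args(argv)
    if args.end is None:
        args.end = args.start + timedelta(days=1)
    if args.end <= args.start:
        parser.error("--end must be later than --start")
    return args


def fetch_window(conn, start, end):
    """Return source rows whose updated_at falls in [start, end).

    Assumes updated_at is stored as ISO-8601 text, so string comparison
    matches chronological order. Values are passed as bind parameters.
    """
    sql = (
        "SELECT id, payload FROM raw_records "
        "WHERE updated_at >= ? AND updated_at < ? ORDER BY id"
    )
    return conn.execute(sql, (start.isoformat(), end.isoformat())).fetchall()
```

A scheduler could then pass each run's window as `--start` and `--end`, which covers both backfills over a date range and daily incremental loads. `sqlite3` stands in here for whatever database driver the real source requires.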
If you’ve built similar ETL or ELT jobs against semi-structured data and can demonstrate performance at scale, I’d love to see your approach.