Data Pipeline tips and advice for getting started with your data
You are here because you have Data Pipeline, or you have the publicly available Data Pipeline data, and you are ready to get your hands dirty with some clean data!
Watch the Data Pipeline getting started video
Downloading Data Pipeline data using AWS CLI
Getting AWS CLI set up locally is very simple. Start here with these instructions from the AWS Documentation.
Setting up credentials (only necessary for accessing a private Data Pipeline S3 bucket, not the demo data public S3 bucket)
aws configure --profile parsely_dpl
AWS Access Key ID [None]: ENTER ACCESS ID
AWS Secret Access Key [None]: ENTER SECRET KEY
Default region name [None]: us-east-1
Default output format [None]: json
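To confirm the profile was saved correctly, one quick sanity check is to make an authenticated call with it (any AWS command that accepts the --profile flag will do):
aws sts get-caller-identity --profile parsely_dpl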
Download one file
Once AWS CLI is installed, it can be used to download a file from your Parse.ly Data Pipeline S3 bucket, or from the public Parse.ly S3 bucket with demo Data Pipeline data.
For the demo Parse.ly Data Pipeline data
aws --no-sign-request s3 cp s3://parsely-dw-parse-ly-demo/events/file_name.gz events/file_name.gz
For your private customer-specific S3 bucket
Make sure to use the --profile flag.
aws s3 cp s3://parsely-dw-bucket-name-here/events/file_name.gz events/file_name.gz --profile parsely_dpl
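Since file_name.gz is a placeholder, you may first want to list what is stored under the events/ prefix to find a real file name; the data is organized into year and month folders you can drill into. For example, for your private bucket:
aws s3 ls s3://parsely-dw-bucket-name-here/events/ --profile parsely_dpl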
Download all the files
If you want to download all of the files in an S3 bucket, follow the instructions below. Caution: this might be a large amount of data!
For the demo Parse.ly Data Pipeline data
aws --no-sign-request s3 cp s3://parsely-dw-parse-ly-demo . --recursive
For your private customer-specific S3 bucket
Make sure to use the --profile flag.
aws s3 cp s3://parsely-dw-bucket-name-here . --recursive --profile parsely_dpl
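If you plan to pull down new files on a schedule, aws s3 sync is an alternative to cp --recursive: it only downloads objects that are new or have changed since the last run. A minimal sketch, using the same placeholder bucket name:
aws s3 sync s3://parsely-dw-bucket-name-here . --profile parsely_dpl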
Copy the data to your own S3 bucket
There are a few ways to do this! Using AWS CLI is a simple way to perform this task locally.
For the demo Parse.ly Data Pipeline data
aws --no-sign-request s3 cp s3://parsely-dw-parse-ly-demo s3://your-bucket-here --recursive
For your private customer-specific S3 bucket
Make sure to use the --profile flag.
aws s3 cp s3://parsely-dw-bucket-name-here s3://your-bucket-here --recursive --profile parsely_dpl
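Because the files live under events/YYYY/MM prefixes (the same layout used for the Athena partitions below), you can also limit the copy to a single month instead of the whole bucket. A sketch with a hypothetical month:
aws s3 cp s3://parsely-dw-bucket-name-here/events/2023/01 s3://your-bucket-here/events/2023/01 --recursive --profile parsely_dpl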
Copy the Data Pipeline data to Redshift or Google BigQuery
Parse.ly has made a GitHub repository specific to these use cases!
The README.md in the repository linked above has the most detailed information on how to install and use it. Below are a few examples of handy tasks you can run once you have the parsely_raw_data repo installed.
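If you have not installed the repo yet, a typical approach (assuming it follows a standard Python package layout; the README.md is the authoritative source for the exact steps) looks like:
git clone <repository-url-from-the-link-above>
cd parsely_raw_data
pip install .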
Copy the S3 data to your Redshift database
This command will create an Amazon Redshift table using the specified Parse.ly schema and load the Data Pipeline data into the new table.
python -m parsely_raw_data.redshift
Copy the S3 data to your Google BigQuery project
This command will create a Google BigQuery table using the specified Parse.ly schema and load the Data Pipeline data into the new table.
python -m parsely_raw_data.bigquery
Query the Data Pipeline data using AWS Athena
Don’t want to move any data? We’ve got you covered! AWS Athena is a SQL interface that allows you to query S3 files directly.
- Create an Athena table using the Parse.ly Data Pipeline Athena schema
- Load the data into the recommended year-month partitions (a filled-in example follows this list):
ALTER TABLE table_name_here ADD PARTITION (year='YYYY', month='MM') location 's3://parsely-dw-bucket-name-here/events/YYYY/MM'
- Use Athena to query the Data Pipeline data!
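As a concrete illustration of the partition step above, loading a hypothetical January 2023 partition and then querying only that partition might look like the following (the table and bucket names are the same placeholders used elsewhere on this page):
ALTER TABLE parsely_data_pipeline_table_name ADD PARTITION (year='2023', month='01') location 's3://parsely-dw-bucket-name-here/events/2023/01'
select count(*) as pageviews from parsely_data_pipeline_table_name where year = '2023' and month = '01' and action = 'pageview'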
Getting started queries to answer common questions
These queries are formatted as if you are using Athena to query the Data Pipeline data.
Retrieve all records
This query retrieves all records in the Athena table that reads from the S3 files. Please note that it will only return records from partitions that have been loaded (see the section above). The more specific you are about the partition you are querying, the cheaper your Athena query will be.
select * from parsely_data_pipeline_table_name
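For example, restricting the same query to a single year-month partition keeps the scan, and therefore the cost, down; the year and month values below are placeholders:
select
*
from parsely_data_pipeline_table_name
where
year = 'yyyy' and --this makes the query cheaper!
month = 'mm' --this makes the query cheaper!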
Bot traffic investigation
Bot traffic is getting smarter! To investigate the user agents and IP addresses hitting a specific post on a certain day, you can use the following query as a template:
select
user_agent,
visitor_ip,
count(action) as pageviews
from parsely_data_pipeline_table_name
where
year = 'yyyy' and --this makes the query cheaper!
month = 'mm' and --this makes the query cheaper!
action = 'pageview' and
url like '%only-include-unique-url-path-here%' and
date(ts_action) = date 'yyyy-mm-dd'
group by 1,2
order by 3 desc
Engaged-time by referrer type
This is a template query to retrieve engaged time by referrer category.
select
channel,
ref_category,
sum(engaged_time_inc) as engaged_time_seconds,
sum(engaged_time_inc)/60 as engaged_time_minutes
from parsely_data_pipeline_table_name
where
year = 'yyyy' and
month = 'mm'
group by 1,2
order by 3 desc
View conversions
Your conversions data is included in your Data Pipeline data! Use the template below to get started with it.
select
*
from parsely_data_pipeline_table_name
where
year = 'yyyy' and --this makes the query cheaper!
month = 'mm' and --this makes the query cheaper!
action = 'conversion'
Use dbt and a pre-formatted star schema to organize the Data Pipeline data in Redshift
Coming soon! Contact Parse.ly Support to let us know you’re interested in receiving updates on this effort!
Last updated: January 26, 2023