Thursday, May 22, 2025

Data, GCP

  Data, Main Only

- serverless, microservices, event-driven

- Case 1

- Case 2

Tuesday, September 3, 2024

Data Architecture, On-premise

Main Only

- PB+ data

- 2016 version



Monday, September 2, 2024

Data Architecture, GCP, I

 Data, Main Only

- serverless, microservices, event-driven pipelines


BigQuery

- easy data for everyone

Wednesday, January 4, 2023

Tip 4 Others

Regexp

Match "an apple" or "an apple pie", extract without "an"

((?<=an )apple|(?<=an )apple pie)$

Find the shortest matches between two strings

(from ((?!from).)*? on)

Extract table names from SQL

(from|join)\s+(\S+)

Google Sheets

Add missing dates to the list of dates with values

=ARRAYFORMULA({TO_DATE(ROW( INDIRECT("A"&DATEVALUE(MIN(A:A))&":B"&DATEVALUE(MAX(A:A))))), IFERROR(VLOOKUP(TO_DATE(ROW( INDIRECT("A"&DATEVALUE(MIN(A:A))&":B"&DATEVALUE(MAX(A:A))))), A:B, 2, 0))})

Automatic timestamp when a cell on the same row gets updated

function onEdit(e) {

  var ss = SpreadsheetApp.getActiveSheet();

  var r = ss.getActiveCell();

  //1.Change 'Sheet1' to be matching your sheet name

  if (r.getColumn() > 1 && ss.getName()=='Sheet1') { // 2. If Edit is done in any column after Column (A)  And sheet name is Sheet1 then:

var celladdress ='A'+ r.getRowIndex() 

    ss.getRange(celladdress).setValue(new Date()).setNumberFormat("yyyy-MM-dd hh:mm:ss");

  }

};

Automatic sorting when a cell gets updated

function onEdit(event){

  var sheet = event.source.getActiveSheet();

  var editedCell = sheet.getActiveCell();


  var columnToSortBy = 4;

  var tableRange = "B3:E9";


  if(editedCell.getColumn() == columnToSortBy){   

    var range = sheet.getRange(tableRange);

    range.sort( { column : columnToSortBy } );

  }

}


Python

Connect to BigQuery without credential files like JSON

from pydata_google_auth import get_user_credentials
scopes = ["https://www.googleapis.com/auth/bigquery"]
credentials = get_user_credentials(scopes)
//  notebooks running on servers or cloud resources
credentials = get_user_credentials(scopes, use_local_webserver=False)

Increase the density of x-ticks

test = sns.distplot(df.BattlePower_diff, 100, kde=False)

plt.xticks(rotation=90)

test.xaxis.set_major_locator(plt.MaxNLocator(200))


MongoDB

filter between two dates with time zone

{

    $or:[

        {createdAt: {$gte: ISODate("2023-07-26T00:00:00.000+09:00"), $lt: ISODate("2023-07-27T00:00:00.000+09:00")}},

        {updatedAt: {$gte: ISODate("2023-07-26T00:00:00.000+09:00"), $lt: ISODate("2023-07-27T00:00:00.000+09:00")}},

    ]

}


{

    $or:[

        {createdAt: {$gte: ISODate("2023-07-25T15:00:00.000+00:00"), $lt: ISODate("2023-07-26T15:00:00.000+00:00")}},

        {updatedAt: {$gte: ISODate("2023-07-25T15:00:00.000+00:00"), $lt: ISODate("2023-07-26T15:00:00.000+00:00")}},

    ]

}

Monday, December 5, 2022

Dataproc Alert

 Monitoring, Alerting 2 Slack

  • EDIT NOTIFICATION CHANNELS
  • Slack > ADD NEW > ALLOW
  • Slack Channel Name > Cloud Alerting Display Name > SEND TEST NOTIFICATION
    • If a private channel is specified, Invite Google Cloud Monitoring to the channel
  • Save
  • Up to Alerting overview page
  • CREATE POLICY > SELECT A METRIC
  • Cloud Dataproc Cluster > N metrics > Cluster > N metrics > Failed jobs > Apply
  • ADD FILTER > error_type = DATAPROC_SERVER_ERROR > DONE
  • Rolling window function, delta > NEXT
  • Threshold value, 0 > NEXT
  • Notification Channels > Slack, the channel name > OK
  • Alert policy name > NEXT
  • CREATE POLICY

Eventarc, Workflows 2 Slack
  • Eventarc, google.cloud.dataproc.v1.ClusterController.DeleteCluster triggers a Workflow
  • Workflows, check job status > send a slack message if it equals ERROR
  • Sample
  • main:
        params: [input]
        steps:
        - assignArgs:
                assign:
                - projectID: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                - response: ${input}
                - jobId: ${"step-spark-postgresql-2-bigquery-" + text.substring(response.data.resource.labels.cluster_name, 36, 49)}
        - checkOperation:
                switch:
                    - condition: ${"first" in response.data.operation}
                      next: getJob
                next: returnOutput
        - getJob:
                call: http.get
                args:
                    auth:
                        type: OAuth2
                        scopes: https://www.googleapis.com/auth/cloud-platform
                    url: ${"https://dataproc.googleapis.com/v1/projects/" + projectID + "/regions/asia-northeast1/jobs/" + jobId}
                result: response
        - checkStatus:
                switch:
                    - condition: ${response.body.status.state == "ERROR"}
                      next: sendSlack
                next: returnOutput
        - sendSlack:
                call: http.post
                args:
                    headers: {Content-type: application/json, Authorization: Bearer xoxb-your-token}
                    url: https://slack.com/api/chat.postMessage
                    body: { "channel": "your-channel-id", "text": '${"https://console.cloud.google.com/dataproc/jobs/" + jobId + "?region=asia-northeast1&project=" + projectID}'}
        - returnOutput:
                return: ${response}