Sunday, November 27, 2022

How to use Dive to inspect Docker image contents

Install Dive 

brew install dive


Once installed, run it with the image ID as the argument:

dive 7d35abc40782

references:

https://github.com/wagoodman/dive


Docker steps involved in Building Image

If the Dockerfile content is like this 

#from base image

FROM ubuntu:14.04

#author name

MAINTAINER RAGHU

#commands to run in the container

RUN echo "hello Raghu"

RUN sleep 10

RUN echo "TASK COMPLETED"



Command used to build the image: docker build -t raghavendar/hands-on:2.0 .


Sending build context to Docker daemon 20.04 MB

Step 1 : FROM ubuntu:14.04

---> b1719e1db756

Step 2 : MAINTAINER RAGHU

---> Running in 532ed79e6d55

---> ea6184bb8ef5

Removing intermediate container 532ed79e6d55

Step 3 : RUN echo "hello Raghu"

---> Running in da327c9b871a

hello Raghu

---> f02ff92252e2

Removing intermediate container da327c9b871a

Step 4 : RUN sleep 10

---> Running in aa58dea59595

---> fe9e9648e969

Removing intermediate container aa58dea59595

Step 5 : RUN echo "TASK COMPLETED"

---> Running in 612adda45c52

TASK COMPLETED

---> 86c73954ea96

Removing intermediate container 612adda45c52

Successfully built 86c73954ea96



Some explanation of the build process is given below.

Yes, Docker images are layered. When you build a new image, Docker does this for each instruction (RUN, COPY etc.) in your Dockerfile:


create a temporary container from the previous image layer (or the base FROM image for the first instruction);

run the Dockerfile instruction in the temporary "intermediate" container;

save the temporary container as a new image layer.


The final image layer is tagged with whatever you name the image. This will be clear if you run docker history raghavendar/hands-on:2.0: you'll see each layer and an abbreviation of the instruction that created it.


Your specific queries:

1) 532 is a temporary container created from image ID b17, which is your FROM image, ubuntu:14.04.

2) ea6 is the image layer created as the output of the instruction, i.e. from saving intermediate container 532.

3) Yes. Docker calls this the union file system, and it's the main reason why images are so efficient.
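To poke at these layers programmatically, here is a minimal sketch using the third-party Docker SDK for Python (pip install docker). It assumes a local Docker daemon and that the raghavendar/hands-on:2.0 image built above exists locally; it mirrors what docker history prints.

import docker

client = docker.from_env()
image = client.images.get("raghavendar/hands-on:2.0")

# Each entry is one layer, newest first, including the instruction that created it
for layer in image.history():
    print(layer.get("Id"), layer.get("Size"), layer.get("CreatedBy"))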

references:

https://stackoverflow.com/questions/39705085/how-are-intermediate-containers-formed

Sunday, November 20, 2022

Javascript for vs foreach method

 


Javascript map vs foreach

map() actually creates and returns a new array, while forEach() does not; this is the main difference.

In one benchmark, forEach() was more than 70% slower than map() on some machines, though your browser may well behave differently. The full test results are in the reference below.

references:

https://codeburst.io/javascript-map-vs-foreach-f38111822c0f

Thursday, November 17, 2022

Useful npm registry commands

npm config get registry

npm config list

npm config edit

npm config get

npm config set <name> <url>

npm cache clean

npm config delete registry => removes the registry setting (reverting to the default)

Install from a specific registry (a scoped package resolves against the registry configured for its scope):

npm install @cisco-bpa-platform/ui-template-manager


Wednesday, November 16, 2022

Camunda Cron Job Expressions

Below are a few good resources for building cron job expressions. Note that, depending on the version of the Camunda engine, support for these cron expressions is limited.

It might throw errors like the ones below.

ENGINE-09026 Exception while parsing cron expression '0 0 * * * MON': Support for specifying both a day-of-week AND a day-of-month parameter is not implemented. [ deploy-error ]
ENGINE-09026 Exception while parsing cron expression '0 0 * ? ? MON': '?' can only be specfied for Day-of-Month or Day-of-Week. [ deploy-error ]
ENGINE-09026 Exception while parsing cron expression '* * * * * 1': Support for specifying both a day-of-week AND a day-of-month parameter is not implemented. [ deploy-error ]
ENGINE-09026 Exception while parsing cron expression '* * * * * MON': Support for specifying both a day-of-week AND a day-of-month parameter is not implemented. [ deploy-error ]


https://www.freeformatter.com/cron-expression-generator-quartz.html

https://crontab.guru/every-week


https://github.com/camunda/zeebe/issues/9673



Tuesday, November 15, 2022

AI/ML How to print Dataframes in style

There are a few approaches. If using IPython, use display():

from IPython.display import display

display(df)


It is also possible to apply styling via the DataFrame's style accessor:


df.style


def highlight_high(val):
    """
    Takes a scalar and returns a string with
    the CSS property 'color: blue' for values
    greater than 90, black otherwise.
    """
    color = 'blue' if val > 90 else 'black'
    return 'color: %s' % color

df.style.applymap(highlight_high)


If using print(), then tabulate is a good option:


from tabulate import tabulate


print (tabulate(df, headers = 'keys', tablefmt = 'psql'))

There are many other table formats available besides psql.


With psql, the output is rendered as an ASCII table with +----+ borders and | column separators.


references:

https://www.geeksforgeeks.org/display-the-pandas-dataframe-in-table-style/ 


AI/ML What is a Document Term Matrix?

The text data is represented in the form of a matrix. The rows of the matrix represent the sentences (documents) to be analyzed, and the columns represent the words. Each cell of the matrix holds the number of occurrences of the corresponding word. Let's understand it with an example.


import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Example sentences (placeholders; any three strings will do)
sentence1 = 'the cat sat on the mat'
sentence2 = 'the dog sat on the log'
sentence3 = 'cats and dogs make good pets'

docs = [sentence1, sentence2, sentence3]
print(docs)

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Now this can be converted to and printed as a data frame
# (use get_feature_names() instead on scikit-learn versions older than 1.0)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
df.head()




References:

https://analyticsindiamag.com/a-guide-to-term-document-matrix-with-its-implementation-in-r-and-python/

AI/ML Logistic regression - Accuracy Value

Higher accuracy is an indication of the model performing better.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

TP = true positives

TN = true negatives

FP = false positives

FN = false negatives


F1-score = 2 * (Precision * Recall) / (Precision + Recall), where

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)


The scikit-learn library provides ready-made functions for this:

from sklearn.metrics import accuracy_score

accuracy_score(y_true, y_pred)
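scikit-learn also exposes precision, recall, and F1 directly. A minimal sketch (the y_true and y_pred label lists below are made up purely for illustration):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted class
print('accuracy :', accuracy_score(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred))
print('recall   :', recall_score(y_true, y_pred))
print('f1       :', f1_score(y_true, y_pred))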


References:

https://stackoverflow.com/questions/47437893/how-to-calculate-logistic-regression-accuracy

AI/ML What are TF, IDF, and TF-IDF?

The TF-IDF of a term is calculated by multiplying its TF and IDF scores. The idea is that a term's importance is high when it occurs a lot in a given document and rarely in others. In short, commonality within a document, measured by TF, is balanced by rarity across documents, measured by IDF.


Term frequency–inverse document frequency (TF-IDF) is a statistical formula for converting text documents into vectors based on the relevancy of each word. It builds on the bag-of-words model to create a matrix capturing which words are more and which are less relevant in each document.


Term Frequency (TF)


It is the ratio of the number of occurrences of a word (w) in a document (d) to the total number of words in that document. With this simple formulation, we are measuring how frequent a word is within the document.

For example, if the sentence has 6 words and contains two “the”, the TF ratio of this word would be (2/6).


Inverse Document Frequency (IDF)

 

IDF calculates the importance of a word in a corpus D. The most frequently used words, like "of", "we", "are", have little to no significance. It is calculated by dividing the total number of documents in the corpus by the number of documents containing the word (typically, the logarithm of this ratio is taken).
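A minimal sketch using scikit-learn's TfidfVectorizer (the three documents below are made up purely for illustration):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'dogs and cats are great pets',
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# One row per document, one column per term; cells hold the TF-IDF weights
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df.round(2))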


References:

https://www.kdnuggets.com/2022/09/convert-text-documents-tfidf-matrix-tfidfvectorizer.html

AI/ML What is CountVectorizer and n-gram analysis?


CountVectorizer is part of the scikit-learn package. It is mainly used for analysing commonly occurring words or phrases in a given set of documents, such as web pages.


The usage is something similar below 


import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer


pd.set_option('display.max_columns', 10)

pd.set_option('display.max_rows', 10)


df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/gonutrition.csv')

df.head()


# This works like the usual machine learning fit mechanism: we fit the vectorizer to the data we want to analyse.


text = df['product_description']

model = CountVectorizer(ngram_range = (1, 1))

matrix = model.fit_transform(text).toarray()

df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())

df_output.T.tail(5)


df_output.shape


We set ngram_range to (1, 1) to return unigrams, or single words. Increasing the ngram_range expands the vocabulary from single words to short phrases of the desired lengths. For example, setting ngram_range to (2, 2) will return bigrams (2-grams), or two-word phrases.


text = df['product_description']

model = CountVectorizer(ngram_range = (2, 2), stop_words='english')

matrix = model.fit_transform(text).toarray()

df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())

df_output.T.tail(5)


References:

https://practicaldatascience.co.uk/machine-learning/how-to-use-count-vectorization-for-n-gram-analysis#:~:text=CountVectorizer%20will%20tokenize%20the%20data,such%20as%20%E2%80%9Cwhey%20protein%E2%80%9D.


Monday, November 14, 2022

AI/ML What is the swifter package?

The swifter package works with pandas data frames and applies functions to them efficiently, reducing computation time to a great extent. Its documentation claims it can apply functions up to 100 times faster than plain pandas.

Usually apply() is used like this:

%time df['square'] = df['num'].apply(lambda x: x * 2)

In one run, this took around 42 ms.

With swifter it is applied like this 

%time df['square'] = df['num'].swifter.apply(lambda x: x * 2)

To use the package, install it first and keep pandas up to date:

pip install swifter

pip install -U pandas

Then import it alongside pandas:

import pandas as pd

import swifter
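Putting it together, a minimal timing sketch (the 'num' column name and the DataFrame size below are made up for illustration):

import time
import pandas as pd
import swifter  # importing swifter registers the .swifter accessor on pandas objects

df = pd.DataFrame({'num': range(1_000_000)})

start = time.time()
df['square'] = df['num'].apply(lambda x: x * 2)
print('pandas apply :', round(time.time() - start, 3), 's')

start = time.time()
df['square'] = df['num'].swifter.apply(lambda x: x * 2)
print('swifter apply:', round(time.time() - start, 3), 's')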

References:

https://morioh.com/p/26c8b6f1a4a1

AI/ML What is POS Tagging?

This is a mechanism for marking up the words in a text with a particular part of speech, based on each word's definition and context.

Some example tags are:

JJS adjective, superlative (largest)

LS list item marker

MD modal (could, will)

NN noun, singular (cat, tree)

NNS noun plural (desks)

NNP proper noun, singular (sarah)

NNPS proper noun, plural (indians or americans)


To count the POS tags in a text:


from collections import Counter

import nltk

text = "Shiv is one of the best sites to learn WEB, SAP, Ethical Hacking and much more online."

lower_case = text.lower()

tokens = nltk.word_tokenize(lower_case)

tags = nltk.pos_tag(tokens)

counts = Counter( tag for word,  tag in tags)

print(counts)



references:

https://www.guru99.com/pos-tagging-chunking-nltk.html

AI/ML Google Colab error: Please use NLTK Downloader to obtain the resources

(Typical error: LookupError: Resource punkt not found. Please use the NLTK Downloader to obtain the resource.)


How to overcome this error?

This happens for most of the NLTK resources; just download the missing ones:

import nltk

nltk.download('punkt')

nltk.download('wordnet')

nltk.download('omw-1.4')

AI/ML perfplot for performance Plotting


perfplot extends Python's timeit by benchmarking snippets over a range of input parameters (e.g., the size of an array) and plotting the results. It also has an option to update the plot live in the UI.

The code looks like this:


import perfplot
import string
import numpy as np
import pandas as pd

np.random.seed(123)



def shape(df):

    return df[df.education == 'a'].shape[0]


def len_df(df):

    return len(df[df['education'] == 'a'])


def query_count(df):

    return df.query('education == "a"').education.count()


def sum_mask(df):

    return (df.education == 'a').sum()


def sum_mask_numpy(df):

    return (df.education.values == 'a').sum()


def make_df(n):

    L = list(string.ascii_letters)

    df = pd.DataFrame(np.random.choice(L, size=n), columns=['education'])

    return df


perfplot.show(

    setup=make_df,

    kernels=[shape, len_df, query_count, sum_mask, sum_mask_numpy],

    n_range=[2**k for k in range(2, 25)],

    logx=True,

    logy=True,

    equality_check=False, 

    xlabel='len(df)')



References:

https://stackoverflow.com/questions/35277075/python-pandas-counting-the-occurrences-of-a-specific-value

https://pypi.org/project/perfplot/

AI/ML DataFrame: how to replace one string with another in a column

import pandas as pd
import numpy as np

df2 = pd.DataFrame([

    [-0.532681, 'foo', 0],

    [1.490752, 'bar', 1],

    [-1.387326, 'foo', 2],

    [0.814772, 'baz', ' '],     

    [-0.222552, '   ', 4],

    [-1.176781,  'qux', '  '],         

],columns=['one', 'two', 'three'])

# df2

df2['two'] = df2['two'].replace(r'^\s*$', np.nan, regex=True)

df2


If we have to do the same across the entire dataframe, just do this:

df2 = df2.replace(r'^\s*$', np.nan, regex=True)



Sunday, November 13, 2022

docker-compose volume mount syntax

Using the host:guest format, you can do any of the following:


volumes:

  # Just specify a path and let the Engine create a volume

  - /var/lib/mysql


  # Specify an absolute path mapping

  - /opt/data:/var/lib/mysql


  # Path on the host, relative to the Compose file

  - ./cache:/tmp/cache


  # User-relative path

  - ~/configs:/etc/configs/:ro


  # Named volume

  - datavolume:/var/lib/mysql



Long Syntax


As of compose file format v3.2 you can use the long syntax, which allows the configuration of additional fields that can't be expressed in the short form, such as the mount type (volume, bind or tmpfs) and read_only.


version: "3.2"

services:

  web:

    image: nginx:alpine

    ports:

      - "80:80"

    volumes:

      - type: volume

        source: mydata

        target: /data

        volume:

          nocopy: true

      - type: bind

        source: ./static

        target: /opt/app/static


networks:

  webnet:


volumes:

  mydata:




references:

https://stackoverflow.com/questions/40905761/how-do-i-mount-a-host-directory-as-a-volume-in-docker-compose

AI/ML DataFrame: count specific values in a dataframe column

# Below gives value counts for the name column. If there are many empty values in the column, that count is included too.

print(data['name'].value_counts())

Below are some variants. 

data['marks'].value_counts(ascending=False)

data['age'].describe()

groupby gives the count of values per group in a column:

data.groupby('subjects').size()



groupby with count() works like this:
print(data.groupby('name').count())


If we want to put the values into bins:

data['age'].value_counts(bins=6)


references:

https://www.geeksforgeeks.org/how-to-count-occurrences-of-specific-value-in-pandas-column/#:~:text=We%20can%20count%20by%20using,values%20in%20a%20particular%20column.







Saturday, November 12, 2022

What does pd.json_normalize() do?

JSON files can sometimes be clumsy, with multiple levels of hierarchy. Pandas has a handy built-in function, json_normalize(), to flatten simple to moderately nested, semi-structured JSON into flat tables.

Below are different ways Pandas provides to convert JSON to a dataframe.


import json
import pandas as pd

# Use json_normalize() to convert JSON to a DataFrame
parsed = json.loads(data)
df = pd.json_normalize(parsed['technologies'])

# Convert JSON to a DataFrame using read_json()
df2 = pd.read_json(jsonStr, orient='index')

# Use pandas.DataFrame.from_dict() to convert JSON to a DataFrame
parsed = json.loads(data)
df2 = pd.DataFrame.from_dict(parsed, orient="index")



If the JSON is like the following (assigned to a variable test_dict):


{

  '_index': 'complaint-public-v2',

  '_type': 'complaint',

  '_id': '3230997',

  '_score': 0.0,

  '_source': {'tags': None,

   'zip_code': '49508',

   'complaint_id': '3230997',

   'issue': 'Managing an account',

   'date_received': '2019-05-01T12:00:00-05:00',

   'state': 'MI',

   'consumer_disputed': 'N/A',

   'product': 'Checking or savings account',

   'company_response': 'Closed with monetary relief',

   'company': 'JPMORGAN CHASE & CO.',

   'submitted_via': 'Referral',

   'date_sent_to_company': '2019-05-02T12:00:00-05:00',

   'company_public_response': None,

   'sub_product': 'Checking account',

   'timely': 'Yes',

   'complaint_what_happened': '',

   'sub_issue': 'Problem making or receiving payments',

   'consumer_consent_provided': 'N/A'}

}



df = pd.json_normalize(test_dict) 

df.columns


Index(['_index', '_type', '_id', '_score', '_source.tags', '_source.zip_code',

       '_source.complaint_id', '_source.issue', '_source.date_received',

       '_source.state', '_source.consumer_disputed', '_source.product',

       '_source.company_response', '_source.company', '_source.submitted_via',

       '_source.date_sent_to_company', '_source.company_public_response',

       '_source.sub_product', '_source.timely',

       '_source.complaint_what_happened', '_source.sub_issue',

       '_source.consumer_consent_provided'],

      dtype='object')


When loaded into a data frame using json_normalize, the nested keys are flattened into dot-separated column names, as the Index output above shows.




references:

https://www.geeksforgeeks.org/python-pandas-flatten-nested-json/

Sunday, November 6, 2022

Topic Modeling

Topic modelling is usually done through LDA (Latent Dirichlet Allocation). It identifies topics that describe a document or a set of documents.

The word latent is used because the topics only emerge during the modelling process; topic modelling is an unsupervised task.

It works mainly by identifying patterns of word clusters and word frequencies in the documents.

LDA short summary (Latent Dirichlet Allocation)

Dirichlet is a form of distribution that differs from the normal distribution. Many ML algorithms assume normally distributed data over the real numbers; in a Dirichlet distribution, the sampled values sum to 1. In other words, the Dirichlet distribution samples over a probability simplex instead of over the space of real numbers, as the normal distribution does.

LDA assigns words to topics, with the topic-word and document-topic distributions drawn from Dirichlet distributions.
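A minimal sketch of LDA topic modelling with scikit-learn's LatentDirichletAllocation (the tiny corpus and the choice of two topics below are illustrative only):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    'cats and dogs are popular pets',
    'dogs need daily walks and play',
    'stocks and bonds are common investments',
    'investors watch the stock market daily',
]

vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words for each discovered topic
words = vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-5:]]
    print('Topic', idx, ':', top)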

References:

https://www.analyticsvidhya.com/blog/2021/05/topic-modelling-in-natural-language-processing/

Stemming and Lemmatization

Stemming converts a word from a corpus to its base form, e.g. fix, fixing, fixed all give fix. Different types of stemmers are:

1. Porter Stemmer, 

2. Lancaster Stemmer,

3. Snowball Stemmer

Lemmatization reduces a word to its lemma, which is a much more meaningful form than what stemming produces. The output we get after lemmatization is called the 'lemma'.

For example, 'having' is cut down to 'hav' by stemming, while lemmatization converts it to 'have'.

Some lemmatizers are WordNet Lemmatizer, TextBlob, spaCy, TreeTagger, Pattern, Gensim, and Stanford CoreNLP lemmatization.
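A minimal NLTK sketch contrasting the two approaches (assumes nltk is installed and the wordnet/omw-1.4 resources have been downloaded via nltk.download; the word list is illustrative only):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['having', 'fixing', 'fixed', 'studies']
for w in words:
    # Lemmatize as a verb (pos='v') so 'having' maps to 'have'
    print(w, '->', stemmer.stem(w), '|', lemmatizer.lemmatize(w, pos='v'))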

references:

https://www.analyticsvidhya.com/blog/2021/05/topic-modelling-in-natural-language-processing/

Camunda Timer Event

Time duration

A duration is defined in the ISO 8601 durations format, which expresses the amount of intervening time in a time interval and is represented as P(n)Y(n)M(n)DT(n)H(n)M(n)S. Each n is replaced by the value for the date or time element that follows it.


The capital letters P, Y, M, W, D, T, H, M, and S are designators for each of the date and time elements and are not replaced, but can be omitted.


P is the duration designator (for period) placed at the start of the duration representation.

Y is the year designator that follows the value for the number of years.

M is the month designator that follows the value for the number of months.

W is the week designator that follows the value for the number of weeks.

D is the day designator that follows the value for the number of days.

T is the time designator that precedes the time components of the representation.

H is the hour designator that follows the value for the number of hours.

M is the minute designator that follows the value for the number of minutes.

S is the second designator that follows the value for the number of seconds.

Examples:


PT15S - 15 seconds

PT1H30M - 1 hour and 30 minutes

P14D - 14 days

P14DT1H30M - 14 days, 1 hour and 30 minutes

P3Y6M4DT12H30M5S - 3 years, 6 months, 4 days, 12 hours, 30 minutes and 5 seconds

If the duration is zero or negative, the timer fires immediately.
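To sanity-check a duration string before putting it in a model, here is a small sketch using the third-party Python isodate package (pip install isodate); the expressions are taken from the examples above:

import isodate

for expr in ['PT15S', 'PT1H30M', 'P14D', 'P14DT1H30M']:
    # parse_duration returns a datetime.timedelta for day/time components
    # (an isodate.Duration when years or months are involved)
    print(expr, '->', isodate.parse_duration(expr))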


references:

https://docs.camunda.io/docs/components/modeler/bpmn/timer-events/