Install Dive
brew install dive
Once installed, run it with the image ID as the argument:
dive 7d35abc40782
references:
https://github.com/wagoodman/dive
If the Dockerfile content is like this
#from base image
FROM ubuntu:14.04
#author name
MAINTAINER RAGHU
#commands to run in the container
RUN echo "hello Raghu"
RUN sleep 10
RUN echo "TASK COMPLETED"
Command used to build the image: docker build -t raghavendar/hands-on:2.0 .
Sending build context to Docker daemon 20.04 MB
Step 1 : FROM ubuntu:14.04
---> b1719e1db756
Step 2 : MAINTAINER RAGHU
---> Running in 532ed79e6d55
---> ea6184bb8ef5
Removing intermediate container 532ed79e6d55
Step 3 : RUN echo "hello Raghu"
---> Running in da327c9b871a
hello Raghu
---> f02ff92252e2
Removing intermediate container da327c9b871a
Step 4 : RUN sleep 10
---> Running in aa58dea59595
---> fe9e9648e969
Removing intermediate container aa58dea59595
Step 5 : RUN echo "TASK COMPLETED"
---> Running in 612adda45c52
TASK COMPLETED
---> 86c73954ea96
Removing intermediate container 612adda45c52
Successfully built 86c73954ea96
An explanation of the build process is below.
Yes, Docker images are layered. When you build a new image, Docker does this for each instruction (RUN, COPY etc.) in your Dockerfile:
create a temporary container from the previous image layer (or the base FROM image for the first instruction);
run the Dockerfile instruction in the temporary "intermediate" container;
save the temporary container as a new image layer.
The final image layer is tagged with whatever you name the image - this will be clear if you run docker history raghavendar/hands-on:2.0, you'll see each layer and an abbreviation of the instruction that created it.
Your specific queries:
1) 532 is a temporary container created from image ID b17, which is your FROM image, ubuntu:14.04.
2) ea6 is the image layer created as the output of the instruction, i.e. from saving intermediate container 532.
3) Yes. Docker calls this the Union File System, and it's the main reason why images are so efficient.
references:
https://stackoverflow.com/questions/39705085/how-are-intermediate-containers-formed
map() actually creates a new array; forEach() does not. This is the main difference.
On some machines, forEach() was more than 70% slower than map(). Your browser is probably different; the full test results are in the reference below.
references:
https://codeburst.io/javascript-map-vs-foreach-f38111822c0f
npm config get registry
npm config list
npm config edit
npm config get
npm config set <name> <url>
npm cache clean
npm config delete registry => deletes the registry setting
Install from a specific registry:
npm install @cisco-bpa-platform/ui-template-manager
Below are a few good resources for finding Cron job expressions. One problem is that, depending on the version of the Camunda engine, support for these cron expressions is limited.
It might throw errors like the one described in the Zeebe issue linked below.
https://www.freeformatter.com/cron-expression-generator-quartz.html
https://crontab.guru/every-week
https://github.com/camunda/zeebe/issues/9673
There are a few approaches. If using IPython, use display():
from IPython.display import display
display(df)
It is also possible to apply styling using df.style:
df.style
def color_high_blue(val):
    """
    Takes a scalar and returns a string with the
    CSS property 'color: blue' for values above 90,
    black otherwise.
    """
    color = 'blue' if val > 90 else 'black'
    return 'color: %s' % color

df.style.applymap(color_high_blue)
If using print(), then tabulate is a good option:
from tabulate import tabulate
print(tabulate(df, headers='keys', tablefmt='psql'))
There are many other table formats available besides psql.
With psql, the DataFrame is printed as a bordered, PostgreSQL-style ASCII table.
references:
https://www.geeksforgeeks.org/display-the-pandas-dataframe-in-table-style/
The text data is represented in the form of a matrix. The rows of the matrix represent the sentences from the data that need to be analyzed, and the columns of the matrix represent the words. The cells of the matrix contain the number of occurrences of each word. Let's understand it with an example.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# placeholder sentences for the example (not from the original notes)
sentence1 = "the cat sat on the mat"
sentence2 = "the dog sat on the log"
sentence3 = "cats and dogs can be friends"

docs = [sentence1, sentence2, sentence3]
print(docs)

vec = CountVectorizer()
X = vec.fit_transform(docs)

# now this can be converted to and printed using a data frame
# (use get_feature_names() on older scikit-learn versions)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
df.head()
References:
https://analyticsindiamag.com/a-guide-to-term-document-matrix-with-its-implementation-in-r-and-python/
Higher accuracy is an indication of the model performing better.
Accuracy = (TP+TN)/(TP+FP+FN+TN)
TP = True positives
TN = True negatives
FP = False positives
FN = False negatives
F1-score = 2*(Precision*Recall)/(Precision+Recall) where,
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
The scikit-learn library provides convenient functions for this:
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
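As a quick sketch, the other metrics above have direct scikit-learn counterparts as well; the tiny label lists below are made up purely for illustration.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# tiny made-up example: true labels and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+FP+FN+TN)
print(precision_score(y_true, y_pred))  # TP/(TP+FP)
print(recall_score(y_true, y_pred))     # TP/(TP+FN)
print(f1_score(y_true, y_pred))         # 2*(Precision*Recall)/(Precision+Recall)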
References:
https://stackoverflow.com/questions/47437893/how-to-calculate-logistic-regression-accuracy
The TF-IDF of a term is calculated by multiplying its TF and IDF scores. Basically, the importance of a term is high when it occurs a lot in a given document and rarely in others. In short, commonality within a document, measured by TF, is balanced by rarity between documents, measured by IDF.
Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical formula to convert text documents into vectors based on the relevancy of each word. It is based on the bag-of-words model and creates a matrix containing information about the less relevant and most relevant words in the documents.
Term Frequency (TF)
It is the ratio of the occurrences of the word (w) in document (d) to the total number of words in the document. With this simple formulation, we are measuring the frequency of a word in the document.
For example, if the sentence has 6 words and contains two “the”, the TF ratio of this word would be (2/6).
Inverse Document Frequency (IDF)
IDF calculates the importance of a word in a corpus D. The most frequently used words like "of", "we", "are" have little to no significance. It is calculated by dividing the total number of documents in the corpus by the number of documents containing the word (and usually taking the logarithm of that ratio).
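A minimal sketch of this with scikit-learn's TfidfVectorizer; the sample documents below are made up for illustration.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up documents for illustration
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs can be friends"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# each row is a document, each column a term, each cell the TF-IDF weight
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(df.round(2))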
References:
https://www.kdnuggets.com/2022/09/convert-text-documents-tfidf-matrix-tfidfvectorizer.html
It is a scikit-learn class, mainly used for analysing commonly occurring words or phrases in a given set of documents such as web pages.
The usage is something like below:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/gonutrition.csv')
df.head()
# this works similarly to the machine learning fit mechanism: we fit the vectoriser to the data we want to analyse
text = df['product_description']
model = CountVectorizer(ngram_range = (1, 1))
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data=matrix, columns=model.get_feature_names_out())
df_output.T.tail(5)
df_output.shape
We set ngram_range to (1, 1) so CountVectorizer returns unigrams, or single words. Increasing the ngram_range expands the vocabulary from single words to short phrases of the desired lengths. For example, setting ngram_range to (2, 2) will return bigrams (2-grams), or two-word phrases.
text = df['product_description']
model = CountVectorizer(ngram_range = (2, 2), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data=matrix, columns=model.get_feature_names_out())
df_output.T.tail(5)
References:
https://practicaldatascience.co.uk/machine-learning/how-to-use-count-vectorization-for-n-gram-analysis#:~:text=CountVectorizer%20will%20tokenize%20the%20data,such%20as%20%E2%80%9Cwhey%20protein%E2%80%9D.
Swifter is a package that works with pandas DataFrames and can be used to run apply() on a DataFrame in an efficient manner. This reduces the computation time to a great extent; its documentation claims it can apply functions up to 100 times faster than regular pandas.
Usually apply() is used like this:
%time df['square'] = df['num'].apply(lambda x: x * 2)
This takes around 42 ms
With swifter it is applied like this
%time df['square'] = df['num'].swifter.apply(lambda x: x * 2)
To install and use this package:
pip install swifter
import pandas as pd
import swifter
Updating pandas may also be required:
pip install -U pandas
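A self-contained sketch of the comparison above; the DataFrame here is made up, and the actual speed-up will vary by machine and workload.
import numpy as np
import pandas as pd
import swifter

# made-up numeric column to transform
df = pd.DataFrame({'num': np.random.randint(0, 100, 1_000_000)})

# plain pandas apply
df['square'] = df['num'].apply(lambda x: x * 2)

# swifter picks a faster execution path (e.g. vectorized or parallel) when it can
df['square'] = df['num'].swifter.apply(lambda x: x * 2)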
References:
https://morioh.com/p/26c8b6f1a4a1
Part-of-speech (POS) tagging is a mechanism for marking up the words in a text with a particular part of speech based on each word's definition and context.
Some of the examples are
JJS adjective, superlative (largest)
LS list item marker
MD modal (could, will)
NN noun, singular (cat, tree)
NNS noun plural (desks)
NNP proper noun, singular (sarah)
NNPS proper noun, plural (indians or americans)
To count the tags,
from collections import Counter
import nltk
text = "Shiv is one of the best sites to learn WEB, SAP, Ethical Hacking and much more online."
lower_case = text.lower()
tokens = nltk.word_tokenize(lower_case)
tags = nltk.pos_tag(tokens)
counts = Counter( tag for word, tag in tags)
print(counts)
references:
https://www.guru99.com/pos-tagging-chunking-nltk.html
How to overcome resource-not-found errors?
This happens for most of the NLTK resources. Just download the required data:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
AI/ML: perfplot for performance plotting
perfplot extends Python's timeit by testing snippets with input parameters (e.g., the size of an array) and plotting the results. It also has an option for live updates to the UI.
The code is like below
import string

import numpy as np
import pandas as pd
import perfplot

np.random.seed(123)

def shape(df):
    return df[df.education == 'a'].shape[0]

def len_df(df):
    return len(df[df['education'] == 'a'])

def query_count(df):
    return df.query('education == "a"').education.count()

def sum_mask(df):
    return (df.education == 'a').sum()

def sum_mask_numpy(df):
    return (df.education.values == 'a').sum()

def make_df(n):
    L = list(string.ascii_letters)
    df = pd.DataFrame(np.random.choice(L, size=n), columns=['education'])
    return df

perfplot.show(
    setup=make_df,
    kernels=[shape, len_df, query_count, sum_mask, sum_mask_numpy],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
References:
https://stackoverflow.com/questions/35277075/python-pandas-counting-the-occurrences-of-a-specific-value
https://pypi.org/project/perfplot/
To replace empty or whitespace-only strings with NaN in a specific column:
import numpy as np
import pandas as pd

df2 = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, ' ', 4],
    [-1.176781, 'qux', ' '],
], columns=['one', 'two', 'three'])
# df2
df2['two'] = df2['two'].replace(r'^\s*$', np.nan, regex=True)
df2
If we have to do the same for the entire DataFrame, just do this below:
df2 = df2.replace(r'^\s*$', np.nan, regex=True)
Using the host : guest format you can do any of the following:
volumes:
  # Just specify a path and let the Engine create a volume
  - /var/lib/mysql
  # Specify an absolute path mapping
  - /opt/data:/var/lib/mysql
  # Path on the host, relative to the Compose file
  - ./cache:/tmp/cache
  # User-relative path
  - ~/configs:/etc/configs/:ro
  # Named volume
  - datavolume:/var/lib/mysql
Long Syntax
As of docker-compose v3.2 you can use the long syntax, which allows the configuration of additional fields that can't be expressed in the short form, such as the mount type (volume, bind or tmpfs) and read_only.
version: "3.2"
services:
web:
image: nginx:alpine
ports:
- "80:80"
volumes:
- type: volume
source: mydata
target: /data
volume:
nocopy: true
- type: bind
source: ./static
target: /opt/app/static
networks:
webnet:
volumes:
mydata:
references:
https://stackoverflow.com/questions/40905761/how-do-i-mount-a-host-directory-as-a-volume-in-docker-compose
# below gives value counts for the name column. If there are many empty values in the column, it gives that count as well
print(data['name'].value_counts())
Below are some variants:
data['marks'].value_counts(ascending=False)
data['age'].describe()
groupby gives the count of values grouped by a column:
data.groupby('subjects').size()
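A self-contained sketch of these calls; the tiny DataFrame below is made up for illustration.
import pandas as pd

# made-up data with an empty name value
data = pd.DataFrame({
    'name': ['Anu', 'Ben', 'Ben', ''],
    'marks': [85, 92, 92, 70],
    'age': [21, 22, 22, 23],
    'subjects': ['math', 'math', 'physics', 'math'],
})

print(data['name'].value_counts())                # counts each name, including the empty string
print(data['marks'].value_counts(ascending=False))
print(data['age'].describe())                     # summary statistics for a numeric column
print(data.groupby('subjects').size())            # row count per subject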
JSON files can sometimes be clumsy, having different levels and hierarchy. Pandas has a nice inbuilt function called json_normalize() to flatten simple to moderately semi-structured nested JSON structures into flat tables.
Below are different ways Pandas provides to convert JSON to a DataFrame:
import json
import pandas as pd

# Use json_normalize() to convert JSON to a DataFrame
data_dict = json.loads(data)
df = pd.json_normalize(data_dict['technologies'])

# Convert JSON to a DataFrame using read_json()
df2 = pd.read_json(jsonStr, orient='index')

# Use pandas.DataFrame.from_dict() to convert JSON to a DataFrame
data_dict = json.loads(data)
df2 = pd.DataFrame.from_dict(data_dict, orient="index")
If the JSON is like below
test_dict = {
'_index': 'complaint-public-v2',
'_type': 'complaint',
'_id': '3230997',
'_score': 0.0,
'_source': {'tags': None,
'zip_code': '49508',
'complaint_id': '3230997',
'issue': 'Managing an account',
'date_received': '2019-05-01T12:00:00-05:00',
'state': 'MI',
'consumer_disputed': 'N/A',
'product': 'Checking or savings account',
'company_response': 'Closed with monetary relief',
'company': 'JPMORGAN CHASE & CO.',
'submitted_via': 'Referral',
'date_sent_to_company': '2019-05-02T12:00:00-05:00',
'company_public_response': None,
'sub_product': 'Checking account',
'timely': 'Yes',
'complaint_what_happened': '',
'sub_issue': 'Problem making or receiving payments',
'consumer_consent_provided': 'N/A'}
}
df = pd.json_normalize(test_dict)
df.columns
Index(['_index', '_type', '_id', '_score', '_source.tags', '_source.zip_code',
'_source.complaint_id', '_source.issue', '_source.date_received',
'_source.state', '_source.consumer_disputed', '_source.product',
'_source.company_response', '_source.company', '_source.submitted_via',
'_source.date_sent_to_company', '_source.company_public_response',
'_source.sub_product', '_source.timely',
'_source.complaint_what_happened', '_source.sub_issue',
'_source.consumer_consent_provided'],
dtype='object')
When the DataFrame is loaded using json_normalize, the nested '_source' fields are flattened into the prefixed columns shown above.
references:
https://www.geeksforgeeks.org/python-pandas-flatten-nested-json/
Topic modelling is usually done through LDA (Latent Dirichlet Allocation). It identifies topics that describe a document or set of documents.
The word latent is used because the topics only emerge during the modelling process. Topic modelling is an unsupervised task.
This is mainly done by identifying the patterns of word clusters and frequencies of words in the document.
LDA short summary (Latent Dirichlet Allocation)
Dirichlet is a form of distribution, different from the Normal distribution. Many ML algorithms assume normally distributed data and work with real numbers, whereas in a Dirichlet distribution the plotted values sum up to 1. Dirichlet is a probability distribution that samples over a probability simplex instead of sampling from the space of real numbers as the Normal distribution does.
LDA assigns words to topics, with the word distribution of each topic drawn from a Dirichlet distribution.
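A minimal sketch using scikit-learn's LatentDirichletAllocation on a made-up mini corpus (gensim is another common choice):
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# made-up mini corpus
docs = ["cats and dogs are popular pets",
        "dogs love playing in the park",
        "stocks and bonds are common investments",
        "the stock market fell sharply today"]

vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# show the top words for each discovered topic
words = vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-4:][::-1]]
    print(f"Topic {idx}: {top}")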
References:
https://www.analyticsvidhya.com/blog/2021/05/topic-modelling-in-natural-language-processing/
From a corpus of words, each word is converted to its base form, e.g. fix, fixing, fixed all give fix. Different types of stemmers are
1. Porter Stemmer,
2. Lancaster Stemmer,
3. Snowball Stemmer
Lemmatization reduces the word to its lemma, meaning it gets a much more meaningful form than what stemming does. The output we get after lemmatization is called a 'lemma'.
For e.g. 'having' may be cut down to 'hav' by a stemmer, while lemmatization converts it to 'have'.
Some of them are WordNet Lemmatizer, TextBlob, spaCy, TreeTagger, Pattern, Gensim, and Stanford CoreNLP lemmatization.
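A small sketch of the difference using NLTK (assuming the wordnet and omw-1.4 data mentioned earlier have been downloaded):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["fix", "fixing", "fixed"]:
    # the stemmer chops suffixes; the lemmatizer maps to a dictionary form (treating the word as a verb)
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos='v'))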
references:
https://www.analyticsvidhya.com/blog/2021/05/topic-modelling-in-natural-language-processing/
Time duration
A duration is defined in the ISO 8601 durations format, which specifies the amount of intervening time in a time interval and is represented as P(n)Y(n)M(n)DT(n)H(n)M(n)S. Note that n is replaced by the value for each of the date and time elements that follow it.
The capital letters P, Y, M, W, D, T, H, M, and S are designators for each of the date and time elements; they are not replaced, but can be omitted.
P is the duration designator (for period) placed at the start of the duration representation.
Y is the year designator that follows the value for the number of years.
M is the month designator that follows the value for the number of months.
W is the week designator that follows the value for the number of weeks.
D is the day designator that follows the value for the number of days.
T is the time designator that precedes the time components of the representation.
H is the hour designator that follows the value for the number of hours.
M is the minute designator that follows the value for the number of minutes.
S is the second designator that follows the value for the number of seconds.
Examples:
PT15S - 15 seconds
PT1H30M - 1 hour and 30 minutes
P14D - 14 days
P14DT1H30M - 14 days, 1 hour and 30 minutes
P3Y6M4DT12H30M5S - 3 years, 6 months, 4 days, 12 hours, 30 minutes and 5 seconds
If the duration is zero or negative, the timer fires immediately.
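Outside Camunda, such duration strings can also be parsed programmatically; a small sketch using the third-party Python package isodate (an assumption here, not something the Camunda docs mention):
import isodate  # pip install isodate

# parse an ISO 8601 duration string into a timedelta/Duration object
d = isodate.parse_duration("P14DT1H30M")
print(d)  # 14 days, 1:30:00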
references:
https://docs.camunda.io/docs/components/modeler/bpmn/timer-events/