NLP - Getting started [Part 6]

Information extraction

Information extraction in natural language processing (NLP) is the process of automatically extracting structured information from unstructured text data.

The goal of information extraction is to transform unstructured text data into structured data that can be easily analyzed, searched, and visualized.

Billions of data points are generated every day, ranging from social media posts to scientific literature, so information extraction has become an essential tool for processing this data.

As a result, many organizations rely on information extraction techniques, backed by robust NLP algorithms, to automate tasks that would otherwise be done manually.

Information extraction involves identifying specific entities, relationships, and events of interest in text data, such as named entities like people, organizations, dates and locations, and converting them into a structured format that can be analyzed and used for various applications.

Information extraction is a challenging task that requires the use of various techniques, including named entity recognition (NER), regular expressions, and text matching, among others.
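
For instance, regular expressions alone can already pull out simple, well-structured entities. The snippet below is a minimal sketch (the text and patterns are made up for illustration) that extracts an email address and a date with Python's re module:

import re

# A made-up sentence with an email address and an ISO-formatted date
text = "Contact jane.doe@example.com by 2024-05-01 regarding invoice #4521."

# Simple patterns for emails and YYYY-MM-DD dates
email_pattern = r"[\w.+-]+@[\w-]+\.[\w.]+"
date_pattern = r"\d{4}-\d{2}-\d{2}"

print(re.findall(email_pattern, text))  # ['jane.doe@example.com']
print(re.findall(date_pattern, text))   # ['2024-05-01']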

Its applications range from search engines and chatbots to recommendation systems and fraud detection, making information extraction a vital tool in the field of NLP.


How Information Extraction Works

There are many techniques involved in the process of information extraction.


  • Basic information extraction can be performed with named entity recognition (NER) using the spaCy library, as sketched below.
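
The snippet below is a minimal sketch of this idea; it assumes spaCy and the small English model en_core_web_sm are installed (python -m spacy download en_core_web_sm), and the sample sentence is made up for illustration:

# A minimal NER sketch using spaCy's pretrained pipeline
import spacy

nlp = spacy.load("en_core_web_sm")

# A made-up sentence containing people, an organization, a location, and a date
text = "Apple was founded by Steve Jobs and Steve Wozniak in Cupertino, California on April 1, 1976."

# Run the pipeline and print each named entity with its label
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)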

General Pipeline of the Information Extraction Process


Information Retrieval

Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need.

The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing.

In other words, IR is the process of accessing and retrieving the most appropriate information from text, based on a particular query given by the user, with the help of content-based indexing or metadata.

These days we frequently think first of web search, but there are many other cases:

  • E-mail search

  • Searching your laptop

  • Corporate knowledge bases

  • Legal information retrieval

Basic assumptions of Information Retrieval

Collection: A set of documents

  • Assume it is a static collection for the moment

Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task


Types of Information Retrieval Models

  • Classic IR Model

It is the most basic and straightforward IR model, founded on mathematical principles that are easy to recognize and understand. The three classic IR models are the Boolean, Vector, and Probabilistic models (a minimal Boolean sketch follows this list).

  • Non-Classic IR Model

It stands in contrast to the classic IR model. Rather than probability, similarity, and Boolean operations, such models are based on other ideas. Non-classical IR models include situation theory models, information logic models, and interaction models.

  • Alternative IR Model

It is an improvement on the classic IR model that makes use of specific techniques from other domains. Alternative IR models include fuzzy models, cluster models, and latent semantic indexing (LSI) models.

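To make the classic Boolean model concrete, here is a minimal sketch (the toy corpus and query are invented for illustration) that builds an inverted index and answers a query with set intersection:

# Toy corpus: document id -> text
docs = {
    1: "information retrieval with boolean queries",
    2: "vector space models rank documents by similarity",
    3: "boolean retrieval returns exact matches only",
}

# Build an inverted index: term -> set of ids of documents containing it
inverted_index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        inverted_index.setdefault(term, set()).add(doc_id)

# Evaluate the Boolean query "boolean AND retrieval" with set intersection
result = inverted_index.get("boolean", set()) & inverted_index.get("retrieval", set())
print(sorted(result))  # -> [1, 3]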

Information Retrieval applications

  • Search engines, searching for text documents, images, videos, and so on.

  • Question answering over a set of documents (e.g. with a chatbot or a smart speaker).

  • Recommender systems.

  • Summarization of a set of documents.

Note

Basic information retrieval using the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization method. This code demonstrates how to create a simple information retrieval system with a set of documents, and then how to search through them using a query.

# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Sample documents to form the corpus
documents = [
    "The quick brown fox jumped over the lazy dog.",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "Python is a high-level, interpreted, and general-purpose programming language.",
    "Artificial intelligence and machine learning are technologies that use algorithms to simulate human intelligence."
]
# Initialize a TF-IDF Vectorizer
# This vectorizer converts a collection of raw documents into a matrix of TF-IDF features.
tfidf_vectorizer = TfidfVectorizer()
# Fit the vectorizer on the documents and transform them into their TF-IDF vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
# Define a query to search within the documents
query = "programming and algorithms"
# Convert the query into the TF-IDF vector using the same vectorizer
query_tfidf = tfidf_vectorizer.transform([query])
# Compute cosine similarity between the query TF-IDF vector and the document TF-IDF vectors
cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix).flatten()
# Get the index of the most similar document in the corpus
most_similar_document_index = np.argmax(cosine_similarities)
# Print the most similar document and its similarity score
print("Most Similar Document:", documents[most_similar_document_index])
print("Similarity Score:", cosine_similarities[most_similar_document_index])

Summarization

Text summarization is the process of generating a short, fluent, and, most importantly, accurate summary of a longer text document.

The main idea behind automatic text summarization is to be able to find a short subset of the most essential information from the entire set and present it in a human-readable format.

As online textual data grows, automatic text summarization methods have the potential to be very helpful because more useful information can be read in a short time.


There are broadly two different approaches that are used for text summarization:


Extractive Summarization

We identify the important sentences or phrases in the original text and extract only those. The extracted sentences form the summary.


Abstractive Summarization

Here, we generate new sentences from the original text. This is in contrast to the extractive approach we saw earlier where we used only the sentences that were present. The sentences generated through abstractive summarization might not be present in the original text:


Example: TextRank Algorithm for Extractive Text Summarization

# Import the spaCy library to use for NLP tasks.
import spacy
# Import the pyTextRank library, which adds additional text processing capabilities on top of spaCy.
import pytextrank
# Load the large English language model in spaCy.
nlp = spacy.load("en_core_web_lg")
# Add the text ranking algorithm from pyTextRank into the spaCy pipeline for processing documents.
nlp.add_pipe("textrank")
# Define a string containing a lengthy and complex text about deep learning.
example_text = """Deep learning (also known as deep structured learning) is part of a
    broader family of machine learning methods based on artificial neural networks with
    representation learning. Learning can be supervised, semi-supervised or unsupervised.
    Deep-learning architectures such as deep neural networks, deep belief networks, deep reinforcement learning,
    recurrent neural networks and convolutional neural networks have been applied to
    fields including computer vision, speech recognition, natural language processing,
    machine translation, bioinformatics, drug design, medical image analysis, material
    inspection and board game programs, where they have produced results comparable to
    and in some cases surpassing human expert performance. Artificial neural networks
    (ANNs) were inspired by information processing and distributed communication nodes
    in biological systems. ANNs have various differences from biological brains. Specifically,
    neural networks tend to be static and symbolic, while the biological brain of most living organisms
    is dynamic (plastic) and analogue. The adjective "deep" in deep learning refers to the use of multiple
    layers in the network. Early work showed that a linear perceptron cannot be a universal classifier,
    but that a network with a nonpolynomial activation function with one hidden layer of unbounded width can.
    Deep learning is a modern variation which is concerned with an unbounded number of layers of bounded size,
    which permits practical application and optimized implementation, while retaining theoretical universality
    under mild conditions. In deep learning the layers are also permitted to be heterogeneous and to deviate widely
    from biologically informed connectionist models, for the sake of efficiency, trainability and understandability,
    whence the structured part."""
# Output the original size of the document.
print('Original Document Size:', len(example_text))
# Process the example_text through the spaCy pipeline which now includes text ranking.
doc = nlp(example_text)
# Iterate through the top-ranked sentences chosen by pyTextRank
# (limited here to the top 2 phrases and top 2 sentences),
# and print each sentence along with its length in characters.
for sent in doc._.textrank.summary(limit_phrases=2, limit_sentences=2):
    print(sent)
    print('Summary Length:', len(sent.text))

Output: the two top-ranked sentences selected by TextRank, each printed with its length.

Abstractive Summarization Example

# Import necessary libraries from transformers
from transformers import pipeline
# Create a summarization pipeline using the "facebook/bart-large-cnn" model.
# This model is pretrained for abstractive summarization tasks.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# This is a simple text about machine learning.
text = """
    Machine learning is a field of artificial intelligence that uses statistical
    techniques to give computer systems the ability to "learn" (i.e., progressively
    improve performance on a specific task) from data, without being explicitly
    programmed. The name Machine Learning was coined in 1959 by Arthur Samuel.
    Evolving from the study of pattern recognition and computational learning theory
    in artificial intelligence, machine learning explores the study and construction
    of algorithms that can learn from and make predictions on data. These algorithms
    operate by building a model from sample inputs and using that model to make
    predictions or decisions, rather than following strictly static program
    instructions.
"""
# Call the summarizer on the text.
# The summary will try to condense the information in the text into a shorter form, maintaining the essential details.
summary = summarizer(text, max_length=100, min_length=50, do_sample=False)
# Print the generated summary
print("Summary:", summary[0]['summary_text'])

Attention Mechanisms and Transformers

Attention Mechanisms

The concept of the attention mechanism was first introduced in a 2014 paper on neural machine translation.

Prior to this, RNN encoder-decoder frameworks encoded variable-length source sentences into fixed-length vectors that were then decoded into variable-length target sentences. This approach not only restricts the network’s ability to cope with long sentences but also causes performance to drop as input sentences grow longer.

Rather than trying to force-fit all the information from an input sentence into a fixed-length vector, the paper proposed the implementation of a mechanism of attention in the decoder.

In this approach, the information from an input sentence is encoded across a sequence of vectors, instead of a fixed-length vector, with the attention mechanism allowing the decoder to adaptively choose a subset of these vectors to decode the translation.
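
To illustrate the idea, here is a minimal sketch of scaled dot-product attention in NumPy, the form later popularized by the Transformer; the query, key, and value matrices are random placeholders standing in for encoder and decoder states:

import numpy as np

def softmax(x, axis=-1):
    # Subtract the per-row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of each query to each key
    weights = softmax(scores, axis=-1)    # each query's weights sum to 1
    return weights @ V, weights

# Toy example: 3 decoder queries attending over 4 encoder vectors of size 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (3, 8) (3, 4)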

Transformer

The Transformer was the first transduction model to implement self-attention as an alternative to recurrence and convolutions.

Attention allows models to dynamically focus on appropriate parts of the input data, akin to the way humans pay attention to certain aspects of a visual scene or conversation. This selective focus is particularly crucial in tasks where context is key, such as language understanding or image recognition.

In the context of transformers, attention mechanisms serve to weigh the influence of different input tokens when producing an output.

This is not purely a replication of human attention but an enhancement, enabling machines to surpass human performance in certain tasks.

Transformer Architecture

Transformer consists of two primary components: an encoder and a decoder. Each component is designed to perform distinct yet complementary functions in the processing of input and generation of output.


Inputs and Outputs:

  • Input Embedding: Converts input data into a fixed-size numerical format that the model can process.
  • Output Embedding: Similar to input embedding, but for the output data.

Positional Encoding: Adds information about the position of each element in the sequence to the embeddings, helping the model understand the order of the data.

Nx (Layers): This block of the architecture is repeated N times. Each layer consists of the following components:

  • Masked Multi-Head Attention: Focuses on different positions of the input to better understand relationships between elements.

  • Multi-Head Attention: Similar to the masked version, but used in different contexts within the model.

  • Add & Norm: A step that combines the outputs of attention with the original input (add) and normalizes the result (norm).

  • Feed Forward: A neural network that processes each position of the input independently.

  • Softmax and Linear:

    • Linear: A transformation that prepares the outputs for probability prediction.
    • Softmax: Converts the linear outputs into probabilities, indicating the likelihood of each possible output element.
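
To see how these pieces fit together, here is a minimal, hypothetical sketch in PyTorch (not the full original architecture): a token embedding, sinusoidal positional encoding, a single encoder layer (multi-head attention, add & norm, feed-forward), and a linear + softmax head. All sizes are arbitrary toy values; a real model stacks N such layers and adds a decoder with masked attention.

import math
import torch
import torch.nn as nn

class TinyTransformerEncoder(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, nhead=4, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)    # input embedding
        # Precomputed sinusoidal positional encodings (max_len x d_model)
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # One encoder block: multi-head attention + add & norm + feed-forward
        self.layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=128, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)         # linear projection

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) of integer token ids
        x = self.embed(token_ids) + self.pe[: token_ids.size(1)]
        x = self.layer(x)
        return torch.softmax(self.out(x), dim=-1)         # per-position probabilities

# Toy usage: a batch of 2 sequences, each 10 token ids long
model = TinyTransformerEncoder()
probs = model(torch.randint(0, 1000, (2, 10)))
print(probs.shape)  # torch.Size([2, 10, 1000])
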
This post is licensed under CC BY 4.0 by the author.