Use this component when you want to acquire data from external sources or extract structured data from text. Most tools in this component also include data-cleaning functionality, for example to detect and/or correct inconsistent data.
Featured Packages
-
FrameIt Semantic Role Labeling
https://github.com/biggorilla-gh/frameit
FrameIt is a system for creating custom frames for text corpora. FrameIt is built with Python 3 and spaCy 2.
Features:
– Intent detection for individual sentences using a CNN model
– Entity extraction paired with intents using either CNN or heuristic models
– SRL system allows for loading multiple Frames for intent detection simultaneously, allowing for the differentiation of similar domains
– Easy to train and customize using jupyter notebooks
– Evaluation scripts for convenient experimental design and iteration
– Functions for all languages supported by spaCy 2 models
Developed by Megagon Labs
-
Scrapy
https://scrapy.org/
Scrapy is a framework for extracting data from websites. Scrapy can be used to build a crawler or spider to crawl multiple websites and retrieve selected data.
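As a quick illustration, here is a minimal spider sketch written against a recent Scrapy release; the site (Scrapy's own tutorial sandbox) and the CSS selectors are illustrative:
```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Emit one record per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if any, and parse it the same way.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
Saved as quotes_spider.py, this can be run with `scrapy runspider quotes_spider.py -o quotes.json`.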
-
Usagi
https://github.com/biggorilla-gh/usagi
Usagi is an open source platform to build data discovery systems. Usagi crawls and extracts metadata about datasets and builds catalogs and indices to make datasets discoverable by search and browsing.
-
pandas
https://pandas.pydata.org/
pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python.
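A minimal sketch; the file name and its columns are illustrative:
```python
import pandas as pd

# Load tabular data into a DataFrame and run a simple aggregate analysis.
df = pd.read_csv("people.csv")
print(df.head())                          # inspect the first rows
print(df.groupby("city")["age"].mean())  # average age per city
```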
-
JSON
https://docs.python.org/3/library/json.html
The json module parses JSON strings into Python dictionaries and lists, and serializes those objects back into JSON.
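For example:
```python
import json

# Parse a JSON string into Python dictionaries and lists...
record = json.loads('{"name": "BigGorilla", "tags": ["data", "integration"]}')
print(record["tags"][0])

# ...and serialize Python objects back into a JSON string.
print(json.dumps(record, indent=2))
```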
-
CSV
https://docs.python.org/3/library/csv.html
The csv module implements classes to read and write tabular data in CSV format. It allows programmers to say, “write this data in the format preferred by Excel,” or “read data from this file which was generated by Excel,” without knowing the precise details of the CSV format used by Excel. Programmers can also describe the CSV formats understood by other applications or define their own special-purpose CSV formats.
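For example, writing and reading a file in the Excel dialect (Python 3 syntax):
```python
import csv

# Write rows in the format preferred by Excel.
with open("people.csv", "w", newline="") as f:
    writer = csv.writer(f, dialect="excel")
    writer.writerow(["name", "age"])
    writer.writerow(["Alice", 34])

# Read the file back, one list of strings per row.
with open("people.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)
```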
-
xlrd
https://pypi.python.org/pypi/xlrd
xlrd is a Python package that parses Excel data. It has accompanying packages for writing and formatting information in Excel format.
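A minimal sketch; "report.xls" is an illustrative file name:
```python
import xlrd

# Open a workbook and walk the rows of its first sheet.
book = xlrd.open_workbook("report.xls")
sheet = book.sheet_by_index(0)
for rownum in range(sheet.nrows):
    print(sheet.row_values(rownum))  # one list of cell values per row
```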
-
PDFtables
https://pypi.python.org/pypi/pdftables
PDFtables parses PDF files and extracts what it believes to be tables.
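A hedged sketch against the package's get_tables helper; "report.pdf" is an illustrative file name:
```python
from pdftables import get_tables

# get_tables takes an open file object and returns the tables it finds.
with open("report.pdf", "rb") as f:
    tables = get_tables(f)

# Each table is a list of rows; each row is a list of cell strings.
for table in tables:
    for row in table:
        print(row)
```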
-
Slate
https://pypi.python.org/pypi/slate
Slate is a Python package that simplifies the process of extracting text from PDF files. It depends on the PDFMiner package.
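A minimal sketch; "report.pdf" is illustrative, and PDFMiner must be installed alongside slate:
```python
import slate

# slate.PDF wraps PDFMiner and behaves like a list of per-page strings.
with open("report.pdf", "rb") as f:
    doc = slate.PDF(f)

print(doc[0])  # text of the first page
```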
-
PDFminer
https://pypi.python.org/pypi/pdfminer/
PDFMiner is a Python package for extracting text from PDF files.
PDFMiner also includes a tool that can convert PDF files into HTML in addition to plain text.
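A minimal text-extraction sketch written with the pdfminer.six fork in mind (the classic Python 2 PDFMiner API is very similar); "report.pdf" is illustrative:
```python
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def pdf_to_text(path):
    """Extract the text of every page of a PDF into one string."""
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    with open(path, "rb") as f:
        for page in PDFPage.get_pages(f):
            interpreter.process_page(page)
    converter.close()
    return output.getvalue()

print(pdf_to_text("report.pdf"))
```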
-
Stanford Open IE and the CoreNLP suite for named entity recognition, relation extraction, etc.
http://nlp.stanford.edu/software/openie.html
Stanford CoreNLP provides a set of human language technology tools. It can give the base forms of words and their parts of speech; recognize names of companies, people, etc.; normalize dates, times, and numeric quantities; mark up the structure of sentences in terms of phrases and syntactic dependencies; indicate which noun phrases refer to the same entities; indicate sentiment; extract particular or open-class relations between entity mentions; and identify quotations.
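CoreNLP itself is Java; one common way to call it from Python is the third-party pycorenlp wrapper, sketched below. This assumes a CoreNLP server is already running locally, e.g. started with `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000`:
```python
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP("http://localhost:9000")
text = "Barack Obama was born in Hawaii."
output = nlp.annotate(text, properties={
    "annotators": "tokenize,ssplit,pos,ner,openie",
    "outputFormat": "json",
})

# Each Open IE extraction is a (subject, relation, object) triple.
for sentence in output["sentences"]:
    for triple in sentence["openie"]:
        print(triple["subject"], triple["relation"], triple["object"])
```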
-
KOKO
https://github.com/biggorilla-gh/koko
Koko is an information extraction tool (developed in Python 3) that allows users to query a text corpus and extract the entities that are of interest to them.
-
spaCy
https://spacy.io/
spaCy is a library for advanced Natural Language Processing in Python and Cython.
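A minimal sketch, assuming the small English model has been downloaded (`python -m spacy download en_core_web_sm`):
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Print each named entity with its label.
for ent in doc.ents:
    print(ent.text, ent.label_)
```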
-
Google Cloud Natural Language API
https://cloud.google.com/natural-language/
Google Cloud Natural Language API provides developers with access to Google-powered, machine learning-based text analysis components such as sentiment analysis, entity recognition, and syntax analysis.
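A hedged sketch using the google-cloud-language client library; the API surface has shifted across versions of the package, and this follows the v1 style. It assumes application default credentials are configured:
```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="Megagon Labs is based in Mountain View.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

# Entity recognition; analyze_sentiment and analyze_syntax are analogous.
response = client.analyze_entities(request={"document": document})
for entity in response.entities:
    print(entity.name, entity.type_.name, entity.salience)
```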
-
NLTK
https://www.nltk.org/
NLTK is an open-source platform for building Python programs to process human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK also provides wrappers for industrial-strength NLP libraries.
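A minimal tokenization and tagging sketch; the two download calls fetch the required models once:
```python
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("NLTK makes text processing straightforward.")
print(nltk.pos_tag(tokens))  # part-of-speech tag for each token
```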
-
lxml
https://lxml.de/
lxml is a library for processing XML and HTML in the Python language.
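A minimal sketch; the HTML snippet is illustrative:
```python
from lxml import html

tree = html.fromstring("<html><body><a href='/a'>A</a><a href='/b'>B</a></body></html>")

# XPath queries select nodes and attributes from the parsed tree.
for href in tree.xpath("//a/@href"):
    print(href)
```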
-
Beautiful Soup
https://www.crummy.com/software/BeautifulSoup/
Beautiful Soup helps read and parse web pages easily. It is great for initial parsing and scraping.
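A minimal sketch; the HTML snippet is illustrative:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><a href='/a'>A</a><a href='/b'>B</a></body></html>",
    "html.parser",
)

# Find every link tag and read its href attribute and text.
for link in soup.find_all("a"):
    print(link.get("href"), link.text)
```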
-
Apache Nutch
https://nutch.apache.org/
Apache Nutch is an extensible and scalable open source web crawler written in Java.
-
DataSynthesizer
https://github.com/DataResponsibly/DataSynthesizer
DataSynthesizer can generate a synthetic dataset from a sensitive one for release to the public.
-
Tweepy
https://github.com/tweepy/tweepy
Tweepy is a Python library for accessing the Twitter API to extract tweets.
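A minimal sketch written against the Tweepy 3.x API (newer Tweepy versions rename api.search to api.search_tweets); the credentials are placeholders obtained from the Twitter developer portal:
```python
import tweepy

# Placeholder credentials.
CONSUMER_KEY, CONSUMER_SECRET = "...", "..."
ACCESS_TOKEN, ACCESS_TOKEN_SECRET = "...", "..."

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

# Fetch up to 10 recent tweets matching a search query.
for tweet in tweepy.Cursor(api.search, q="data integration").items(10):
    print(tweet.user.screen_name, tweet.text)
```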
-
urllib2
https://docs.python.org/2/library/urllib2.html
urllib2, together with urllib, is part of the Python 2 standard library for making simple HTTP requests to visit web pages and get their content.
-
urllib
https://docs.python.org/2/library/urllib.html
urllib is part of the Python 2 standard library for making simple HTTP requests to visit web pages and get their content; in Python 3, urllib and urllib2 were reorganized into urllib.request and related modules.
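A minimal sketch in Python 2 syntax, covering both modules' common use case:
```python
import urllib2  # Python 3: from urllib.request import urlopen

# Fetch a page and read its raw body.
response = urllib2.urlopen("https://www.example.com/")
html = response.read()
print(response.getcode(), len(html))
```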
-
Requests
http://docs.python-requests.org/
Requests is an HTTP library for Python that provides the APIs needed to scrape websites. Requests can make complex requests to visit a page and get its content, such as those requiring additional headers, complex POST data, or authentication credentials.
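A minimal sketch; the URLs, header value, and credentials are placeholders:
```python
import requests

# Simple GET with a custom User-Agent header.
r = requests.get("https://www.example.com/",
                 headers={"User-Agent": "my-crawler/0.1"})
print(r.status_code, len(r.text))

# POST with form data and HTTP basic authentication.
r = requests.post("https://www.example.com/login",
                  data={"user": "alice"},
                  auth=("alice", "secret"))
print(r.status_code)
```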
Registered Packages
-
Python Client for Google Maps Services
https://github.com/googlemaps/google-maps-services-python
This library brings the Google Maps API Web Services to your Python application. It is a Python client library for the following Google Maps APIs: Directions, Distance Matrix, Elevation, Geocoding, Geolocation, Time Zone, Roads, and Places.
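A minimal sketch, assuming a valid API key ("YOUR_API_KEY" is a placeholder):
```python
import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")

# Geocoding: turn an address into coordinates.
results = gmaps.geocode("1600 Amphitheatre Parkway, Mountain View, CA")
print(results[0]["geometry"]["location"])

# Directions between two places.
routes = gmaps.directions("Sydney Town Hall", "Parramatta, NSW", mode="transit")
print(len(routes))
```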