interpro-manager

Version: v0.0.0-...-35b420c
Published: May 22, 2026 License: BSD-2-Clause

A CLI tool for working with EMBL-EBI's InterPro protein database. It downloads protein records by taxonomy and submits sequences to InterProScan6 for analysis.

Prerequisites

  • Go 1.25+
  • A valid email address (required for the scan command; any valid address works)

Install

go install github.com/dictybase/interpro-manager/cmd/interpro-cli@latest

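Note that go install names the binary interpro-cli, after the last element of the package path, while the examples below call it interpro-manager. To get a binary with that name, build from source: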

git clone https://github.com/dictybase/interpro-manager.git
cd interpro-manager
go build -o interpro-manager ./cmd/interpro-cli

Commands

download

Fetch InterPro protein records for a taxonomy ID, keep only the records that have a gene symbol, and save them as TSV.

interpro-manager download [--taxon-id ID] [--output FILE] [--page-size N]

Flag         Default                Description
--taxon-id   44689                  NCBI taxonomy ID (default: Dictyostelium discoideum)
--output     interpro_proteins.tsv  Output TSV file path
--page-size  20                     Records per API page

Example:

# Download D. discoideum proteins (default)
interpro-manager download

# Download proteins for a different organism with custom output
interpro-manager download --taxon-id 9606 --output human_proteins.tsv --page-size 50

The TSV output contains the following columns:

Column           Source
Accession        Protein accession
Source Database  Source organism database
Gene             Gene symbol
Name             Protein name
Length           Sequence length
Source Organism  Organism name

scan

Submit protein sequences from FASTA files to the InterProScan6 job dispatcher, poll for completion, and download JSON results.

interpro-manager scan --fasta FILE --email ADDRESS [--output DIR] [--seq-type TYPE] [--poll-interval DURATION] [--timeout DURATION]

Flag             Default               Description
--fasta          (required)            Path to FASTA file with protein sequences
--email          env: EBI_EMAIL        Email address for job submission (any valid email)
--output         interproscan_results  Output directory for JSON result files
--seq-type       p                     Sequence type (p for protein, n for nucleotide)
--poll-interval  10s                   Interval between job status checks
--timeout        120s                  Maximum time to wait for job completion

Example:

# Set email and scan a FASTA file
export EBI_EMAIL=yourname@example.com
interpro-manager scan --fasta proteins.fa

# Custom output directory, shorter poll interval, and longer timeout
interpro-manager scan \
  --fasta proteins.fa \
  --email yourname@example.com \
  --output results/ \
  --poll-interval 5s \
  --timeout 300s

Results are saved as {sequence-id}_{job-id}.json in the output directory, one file per FASTA record.

Project Structure

.
├── cmd/
│   └── interpro-cli/
│       └── main.go              # CLI entry point, command registration
├── internal/
│   ├── interpro/
│   │   ├── command.go           # download subcommand
│   │   ├── client.go            # HTTP client, JSON deserialization
│   │   ├── extract.go           # TSV formatting, gene filter
│   │   ├── loop.go              # Pagination loop for download
│   │   ├── model.go             # API response types
│   │   ├── scan.go              # scan subcommand orchestrator
│   │   ├── scan_loop.go         # FASTA record streaming
│   │   ├── scan_model.go        # Scan request/job types
│   │   ├── scan_poll.go         # Job status polling
│   │   ├── scan_result.go       # Result download
│   │   ├── scan_submit.go       # Job submission
│   │   └── tsv.go               # File I/O utilities
│   └── seqio/
│       ├── fasta.go             # FASTA parser
│       └── fasta_test.go        # FASTA parser tests
└── docs/
    └── ...                      # Design docs and reference material

Packages

Package            Responsibility
internal/interpro  Core business logic: API clients, TSV generation, scan orchestration, job polling
internal/seqio     Pure functional FASTA parser using state-machine based iterators

Both packages are built with fp-go functional programming combinators and use urfave/cli for the CLI framework.

Development

# Run tests
gotestsum --format pkgname-and-test-fails --format-hide-empty-pkg -- ./...

# Run tests with verbose output
gotestsum --format testdox --format-hide-empty-pkg -- ./...

# Lint
golangci-lint run ./...

# Format
golangci-lint fmt

# Build
go build -o interpro-manager ./cmd/interpro-cli

Directories

Path               Synopsis
cmd/interpro-cli   command
internal/interpro  Package interpro provides the implementation of the `interpro download` command, which fetches InterPro protein records for a given taxonomy ID and writes them to an output file in TSV format.
internal/seqio     Package seqio is a generic namespace shared by all biological sequence input and output handlers.
