Classifying Noisy Protein Sequence Data: A Case Study of Immunoglobulin Light Chains

Abstract

Summary: The classification of protein sequences obtained from patients with various immunoglobulin-related conformational diseases may provide insight into structural correlates of pathogenicity. However, clinical data are very sparse and, in the case of antibody-related proteins, the collected sequences are highly variable, with only a small subset of variations relevant to protein pathogenicity (function). These sequences therefore serve as a model system for developing strategies to recognize the small subset of function-determining variations among the much larger number of primary structure diversifications introduced during evolution. Under such conditions, most protein classification algorithms have limited accuracy. To address this problem, we propose a support vector machine (SVM)-based classifier that combines sequence and 3D structural averaging information. Each amino acid in the sequence is represented by a set of six physicochemical properties: hydrophobicity, hydrophilicity, volume, surface area, bulkiness and refractivity. Each position in the sequence is described by the properties of the amino acid at that position and the properties of its neighbors in 3D space or in the sequence. A structure template is selected to determine the neighbors in 3D space, and a window size is used to determine the neighbors in the sequence. The test data consist of 209 human immunoglobulin light chain proteins, each represented by an aligned sequence of 120 amino acids. The methodology is applied to the classification of protein sequences collected from patients with and without amyloidosis. The results indicate that the proposed modified classifiers are more robust to sequence variability than standard SVM classifiers, improving classification error by between 5 and 25% and sensitivity by between 9 and 17%. The classification results might also suggest possible mechanisms for the propensity of immunoglobulin light chains to form amyloid.
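The neighbor-averaging idea in the summary can be illustrated with a minimal sketch. It is not the report's implementation: the function names, the `neighbor_map` argument, the treatment of alignment gaps, and the choice of the Kyte-Doolittle hydropathy scale as a stand-in for the hydrophobicity property are all assumptions made for illustration. The report does not state which property scales or averaging weights were used.

```python
from statistics import mean

# Kyte-Doolittle hydropathy values, used here only as an assumed stand-in
# for one of the six physicochemical properties named in the summary.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_neighbors(i, length, half_width):
    """Positions within half_width of position i along the sequence, including i."""
    return range(max(0, i - half_width), min(length, i + half_width + 1))


def averaged_features(seq, scale, half_width=1, neighbor_map=None):
    """Return one averaged property value per aligned position.

    neighbor_map (hypothetical): dict mapping a position to the positions that
    are close to it in 3D space according to a structure template. When it is
    given, it replaces the sequence window.
    """
    features = []
    for i in range(len(seq)):
        if neighbor_map is not None:
            positions = [i] + list(neighbor_map.get(i, []))
        else:
            positions = window_neighbors(i, len(seq), half_width)
        # Alignment gaps and unknown symbols are skipped (an assumption).
        values = [scale[seq[j]] for j in positions if seq[j] in scale]
        # A neighborhood with no residues gets 0.0 (also an assumption).
        features.append(mean(values) if values else 0.0)
    return features


if __name__ == "__main__":
    # Sequence-window averaging over a short toy fragment.
    print(averaged_features("AVG-L", KD_HYDROPATHY, half_width=1))
    # 3D-neighbor averaging with a made-up contact map.
    print(averaged_features("AVG-L", KD_HYDROPATHY, neighbor_map={0: [4], 4: [0]}))
```

Repeating this for each of the six properties and concatenating the results would give a fixed-length vector per aligned sequence, which could then be passed to any SVM implementation.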

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2005
Accession Number
ADA481112

Entities

People

  • Chenggang Yu
  • Fred J. Stevens
  • Jaques Reifman
  • Kelly Yackovich
  • Nela Zavaljevski

Organizations

  • Argonne National Laboratory

Tags

Communities of Interest

  • Engineered Resilient Systems

DTIC Thesaurus Topics

  • Algorithms
  • Amino Acids
  • Artificial Intelligence
  • Computational Biology
  • Computational Science
  • Computer Programs
  • Databases
  • Hydrophobic Properties
  • Identification
  • Information Processing
  • Information Science
  • Information Systems
  • Kernel Functions
  • Machine Learning
  • Molecular Dynamics
  • Neural Networks
  • Supervised Machine Learning

Fields of Study

  • Biology

Readers

  • Molecular Genetics
  • Neural Network Machine Learning

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • Space
  • Space - Space Objects